Count unique values in a dataframe that fall within given datetime intervals
Question:
I have a dataframe:
ID time_begin time_end
A1 2019-04-04 08:04:56 2019-04-04 08:15:26
A11 2019-04-04 08:14:22 2019-04-04 08:42:04
BB 2019-04-04 08:44:53 2019-04-04 09:02:11
C5 2019-04-04 09:01:12 2019-04-04 09:22:07
C1 2019-04-04 09:03:51 2019-04-04 09:34:20
DD 2019-04-04 09:22:42 2019-04-04 09:42:42
DD1 2019-04-04 09:25:42 2019-04-04 09:47:55
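For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is a minimal sketch that assumes the two time columns should be parsed as pandas datetimes; the variable name `df` matches the one used in the answer.

```python
import pandas as pd

# Sample data copied from the table above; both time columns are parsed to datetime64
df = pd.DataFrame({
    "ID": ["A1", "A11", "BB", "C5", "C1", "DD", "DD1"],
    "time_begin": pd.to_datetime([
        "2019-04-04 08:04:56", "2019-04-04 08:14:22", "2019-04-04 08:44:53",
        "2019-04-04 09:01:12", "2019-04-04 09:03:51", "2019-04-04 09:22:42",
        "2019-04-04 09:25:42",
    ]),
    "time_end": pd.to_datetime([
        "2019-04-04 08:15:26", "2019-04-04 08:42:04", "2019-04-04 09:02:11",
        "2019-04-04 09:22:07", "2019-04-04 09:34:20", "2019-04-04 09:42:42",
        "2019-04-04 09:47:55",
    ]),
})

print(df.dtypes)
```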
So, I want to get the number of IDs that cross the boundary timestamps (8:20, 8:40, 9:00, ...) when time is sampled in 20-minute windows.
- For the window from 8:00 to 8:20 there are 2 unique IDs that begin in it (A1 at 2019-04-04 08:04:56 and A11 at 2019-04-04 08:14:22). But only A11 crosses 8:20 (it lasts from 2019-04-04 08:14:22 to 2019-04-04 08:42:04), so we count only it.
- For the window from 8:20 to 8:40 we get 0 unique IDs.
- For the window from 8:40 to 9:00 we get 1 unique ID, BB. It crosses the 9:00 value, so we count it.
- For the window from 9:00 to 9:20 we get 2 IDs, C5 and C1.
- For the window from 9:20 to 9:40 we get 2 IDs, DD and DD1, because both cross 9:40.

The desired result is a dataframe:
time ID_num
2019-04-04 08:00:00 1
2019-04-04 08:20:00 0
2019-04-04 08:40:00 1
2019-04-04 09:00:00 2
2019-04-04 09:20:00 2
How could I do that? The tricky part is counting IDs only if they cross the values 8:00, 8:20, 8:40, ...
Answer:
This looks like it can be done with a cross merge and query:
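import pandas as pd

# make sure both time columns are datetimes
df['time_begin'] = pd.to_datetime(df['time_begin'])
df['time_end'] = pd.to_datetime(df['time_end'])

# set up the 20-minute intervals covering the data
intervals = pd.date_range(df.time_begin.min().floor('20min'),
                          df.time_end.max().ceil('20min'), freq='20min')
ref = pd.DataFrame({'dummy': 1, 'start': intervals[:-1], 'end': intervals[1:]})

# pair every row with every interval via a constant key
# (on pandas >= 1.2, merge(ref, how='cross') does this without the dummy column)
(df.assign(dummy=1)
   .merge(ref, on='dummy')
   .query('start < time_begin < end <= time_end')  # your logic is here
   .groupby('start')
   .size()
   .reindex(intervals[:-1], fill_value=0)
)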
Output:
2019-04-04 08:00:00 1
2019-04-04 08:20:00 0
2019-04-04 08:40:00 1
2019-04-04 09:00:00 2
2019-04-04 09:20:00 2
2019-04-04 09:40:00 0
Freq: 20T, dtype: int64
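The cross merge builds one row per (ID, interval) pair, which can get large on big frames. As a possible alternative, and not part of the answer above, the same rule can be applied per row by flooring `time_begin` to its 20-minute window. The sketch below assumes the sample data from the question; the names `window`, `start`, `end`, `crosses`, `all_starts` and `result` are illustrative only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": ["A1", "A11", "BB", "C5", "C1", "DD", "DD1"],
    "time_begin": pd.to_datetime([
        "2019-04-04 08:04:56", "2019-04-04 08:14:22", "2019-04-04 08:44:53",
        "2019-04-04 09:01:12", "2019-04-04 09:03:51", "2019-04-04 09:22:42",
        "2019-04-04 09:25:42",
    ]),
    "time_end": pd.to_datetime([
        "2019-04-04 08:15:26", "2019-04-04 08:42:04", "2019-04-04 09:02:11",
        "2019-04-04 09:22:07", "2019-04-04 09:34:20", "2019-04-04 09:42:42",
        "2019-04-04 09:47:55",
    ]),
})

window = pd.Timedelta("20min")

# Each row belongs to the window that contains its time_begin
start = df["time_begin"].dt.floor("20min")
end = start + window

# Same rule as the query string: start < time_begin < end <= time_end
# (time_begin < end holds automatically after flooring)
crosses = (df["time_begin"] > start) & (df["time_end"] >= end)

# All window starts covering the data, so empty windows show up as 0
all_starts = pd.date_range(start.min(),
                           df["time_end"].max().ceil("20min"),
                           freq="20min")[:-1]

counts = start[crosses].value_counts().reindex(all_starts, fill_value=0)
result = counts.rename_axis("time").reset_index(name="ID_num")
print(result)
```

On the sample data this gives the same counts as the output above, including the trailing 09:40 window with 0; drop that last row if only the windows listed in the question are wanted.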