Count unique values in a dataframe that fall within given datetime intervals

Problem description:

I have a dataframe:

ID           time_begin               time_end
A1      2019-04-04 08:04:56      2019-04-04 08:15:26
A11     2019-04-04 08:14:22      2019-04-04 08:42:04
BB      2019-04-04 08:44:53      2019-04-04 09:02:11
C5      2019-04-04 09:01:12      2019-04-04 09:22:07
C1      2019-04-04 09:03:51      2019-04-04 09:34:20
DD      2019-04-04 09:22:42      2019-04-04 09:42:42
DD1     2019-04-04 09:25:42      2019-04-04 09:47:55

So, I want to count the IDs that cross the boundary values of the timestamp (8:20, 8:40, ...), using time windows of 20 minutes.


  • For the window from 8:00 to 8:20, two IDs begin (A1 at 2019-04-04 08:04:56 and A11 at 2019-04-04 08:14:22). But only A11 crosses 8:20 (it lasts from 2019-04-04 08:14:22 to 2019-04-04 08:42:04), so we count only it.
  • For the window from 8:20 to 8:40, we get 0 IDs.
  • For the window from 8:40 to 9:00, we get 1 ID, BB. It crosses the 9:00 value, so we count it.
  • For the window from 9:00 to 9:20, we get 2 IDs, C5 and C1.
  • For the window from 9:20 to 9:40, we get 2 IDs, DD and DD1, because both cross 9:40.

The desired result is a dataframe like this:
time                    ID_num
2019-04-04 08:00:00        1
2019-04-04 08:20:00        0
2019-04-04 08:40:00        1
2019-04-04 09:00:00        2
2019-04-04 09:20:00        2

How could I do that? The tricky part is counting IDs only if they cross the values 8:00, 8:20, 8:40, and so on.

This looks like it can be done with a cross merge and query:

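import pandas as pd

# set up the time intervals (assumes time_begin and time_end are datetime columns)
intervals = pd.date_range(df.time_begin.min().floor('20min'),
                          df.time_end.max().ceil('20min'), freq='20min')

# one row per interval; the dummy column makes the merge below a cross merge
ref = pd.DataFrame({'dummy': 1, 'start': intervals[:-1], 'end': intervals[1:]})

(df.assign(dummy=1)
   .merge(ref, on='dummy')                           # pair every row with every interval
   .query('start < time_begin < end <= time_end')    # your logic is here
   .groupby('start')
   .size()                                           # number of IDs per interval
   .reindex(intervals[:-1], fill_value=0)            # intervals with no match get 0
)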

Output:

2019-04-04 08:00:00    1
2019-04-04 08:20:00    0
2019-04-04 08:40:00    1
2019-04-04 09:00:00    2
2019-04-04 09:20:00    2
2019-04-04 09:40:00    0
Freq: 20T, dtype: int64
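
For larger data, the cross merge pairs every row with every interval, which can get expensive. Below is a minimal sketch of an alternative that avoids the merge. It is not part of the answer above. It assumes each ID only needs to be checked against the boundary that closes the window it begins in, which is how the examples in the question read. The names `step`, `window_start`, `boundary`, `crossing`, `all_windows` and `result` are illustrative. Two differences from the query above: a row that begins exactly on a boundary is counted in the window starting there (the query excludes it), and `nunique` is used so a repeated ID counts once per window.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": ["A1", "A11", "BB", "C5", "C1", "DD", "DD1"],
    "time_begin": pd.to_datetime([
        "2019-04-04 08:04:56", "2019-04-04 08:14:22", "2019-04-04 08:44:53",
        "2019-04-04 09:01:12", "2019-04-04 09:03:51", "2019-04-04 09:22:42",
        "2019-04-04 09:25:42",
    ]),
    "time_end": pd.to_datetime([
        "2019-04-04 08:15:26", "2019-04-04 08:42:04", "2019-04-04 09:02:11",
        "2019-04-04 09:22:07", "2019-04-04 09:34:20", "2019-04-04 09:42:42",
        "2019-04-04 09:47:55",
    ]),
})

step = pd.Timedelta("20min")

# Window each row begins in, and the boundary that closes that window
window_start = df["time_begin"].dt.floor("20min")
boundary = window_start + step

# Keep rows that are still running when that boundary is reached
crossing = df[df["time_end"] >= boundary]

# All windows covering the data, so that empty windows show up as 0
all_windows = pd.date_range(
    df["time_begin"].min().floor("20min"),
    df["time_end"].max().ceil("20min"),
    freq="20min",
)[:-1]

result = (
    crossing.groupby(window_start.loc[crossing.index])["ID"]
    .nunique()                              # unique IDs per window
    .reindex(all_windows, fill_value=0)     # windows with no match get 0
    .rename_axis("time")
    .reset_index(name="ID_num")
)
print(result)
```

This returns a dataframe with the `time` and `ID_num` columns from the question, including a trailing 09:40 window with a count of 0, as in the output above.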