如何查找日期范围重叠的行?
我有一个包含如下数据的数据框(数据的微小子集):
I have a dataframe that contains data like below (tiny subset of data):
我试图找出一种方法,我可以创建一个新的数据框,其中包含具有相同值的所有行:carrier
, flightnumber
, departureAirport
和 arrivalAirport
但也有重叠的日期范围.
I'm trying to figure out a way where I can create a new dataframe that contains all rows that have the same values for : carrier
, flightnumber
, departureAirport
and arrivalAirport
but also have date ranges that overlap.
重叠我的意思是一行的 effectiveDate
介于 effectiveDate
和 discontinuedDate
之间,另一条记录具有相同的值我提到的专栏.
By overlap I mean the effectiveDate
for one row falls between the effectiveDate
and discontinuedDate
for another record that has the same values for the other columns I mentioned.
所以在我上面的例子中,前两行将被视为一个例子(并且都应该包含在新的数据框中),但第三行不是.
So in my above example, the first two rows would be considered an example of this (and should both be included in the new dataframe), but the third row is not.
我假设我想使用 groupby,但我并不完全清楚我将应用什么聚合函数.以下是我目前所拥有的:
I'm assuming I want to use groupby, but I'm not entirely clear on what aggregation function I would apply. Below is what I have so far:
df.groupby(['carrier','flightnumber','departureAirport','arrivalAirport'])['effectiveDate', 'discontinuedDate'].min()
但显然我需要应用一个确定重叠的函数而不是min()
.我将如何识别重叠而不是返回该组的最小值?
but obviously I need to apply a function that determines overlap instead of min()
. How would I go about identifying overlap instead of returning the minimum values for this group?
更新:
carrier flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
4U 9748 DUS GVA 2017-05-09 2017-07-12
4U 9748 DUS GVA 2017-05-14 2017-07-16
4U 9748 DUS GVA 2017-07-18 2017-08-27
AG 1234 SFO DFW 2017-03-09 2017-05-12
AG 1234 SFO DFW 2017-03-14 2017-05-16
更新 2:
就输出而言,我希望有任何重叠且具有相同值的 carrier
、flightnumber
、departureAirport
和 arrivalAirport
在新数据帧中返回.这些行不需要包含任何其他数据.所以对于上面的示例数据,像下面这样的数据帧将是我想要的输出:
As far as output goes I'd like to have any rows that overlap and have the same values for carrier
, flightnumber
, departureAirport
and arrivalAirport
returned in a new dataframe. There does not need to be any additional data included for these rows. So for the above example data, a dataframe like the one below would be my desired output:
carrier flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
4U 9748 DUS GVA 2017-05-09 2017-07-12
4U 9748 DUS GVA 2017-05-14 2017-07-16
AG 1234 SFO DFW 2017-03-09 2017-05-12
AG 1234 SFO DFW 2017-03-14 2017-05-16
请注意,仅排除了一条记录(9748
的第三条记录) - 这是因为它的日期范围与同一航班的其他记录不重叠.
Notice that only one record has been excluded (the third for 9748
) - this is because it's date range does not overlap with other records for the same flight.
高级概念
- 按所有日期排序,然后在完全重叠时优先
effectiveDate
. - 累计求和到交替的和在排序之前初始化的负值.关键是当累积总和高于
1
时会发生重叠.当总和下降到0
时,连续组结束. - 取消排序并确定出现零的位置......这些是重叠组的结尾.
- 在这些断点上拆分数据帧索引,只采用拆分大小大于
1
的拆分. - 连接传递的分割并使用
loc
获取切片数据帧.
- Sort by all dates and then by prioritizing
effectiveDate
if there is exact overlap. - Cumulatively sum to alternating ones and negative ones that were initialized prior to sorting. The point is that an overlap happens when the cumulative sum is above
1
. A contiguous group ends when the sum drops to0
. - Unsort the sorting and identify where zeros happen... these are the end of overlapping groups.
- Split the dataframe index on these break points and only take the splits where the size of the split is greater than
1
. - Concatenate the passing splits and use
loc
to get the sliced dataframe.
def overlaping_groups(df):
n = len(df)
cols = ['effectiveDate', 'discontinuedDate']
v = np.column_stack([df[c].values for c in cols]).ravel()
i = np.tile([1, -1], n)
a = np.lexsort([-i, v])
u = np.empty_like(a)
u[a] = np.arange(a.size)
e = np.flatnonzero(i[a].cumsum()[u][1::2] == 0)
d = np.diff(np.append(-1, e))
s = np.split(df.index.values, e[:-1] + 1)
return df.loc[np.concatenate([g for j, g in enumerate(s) if d[j] > 1])]
gcols = ['carrier', 'flightnumber', 'departureAirport', 'arrivalAirport']
df.groupby(gcols, group_keys=False).apply(overlaping_groups)
carrier flightnumber departureAirport arrivalAirport effectiveDate discontinuedDate
0 4U 9748 DUS GVA 2017-05-09 2017-07-12
1 4U 9748 DUS GVA 2017-05-14 2017-07-16
3 AG 1234 SFO DFW 2017-03-09 2017-05-12
4 AG 1234 SFO DFW 2017-03-14 2017-05-16