合并带有时间戳和间隔的数据帧
我有两个数据帧。
df1包含数字和时间戳。
df1contains number and timestamps. It is a very large set.
df1.head()
Out[292]:
2016-08-31 08:09:00 1.0
2016-08-31 08:11:00 7.0
2016-08-31 08:14:00 90.0
df2包含间隔
d2.head()
Out[292]:
start stop C
2016-08-31 08:09:00 2016-08-31 08:12:00 'a'
2016-08-31 08:13:00 2016-08-31 08:20:00 'b'
2016-08-31 08:20:00 2016-08-31 08:45:00 'c'
我想向其中添加新列 C
df1
,这样 C
的值对应于 df2 $ c中的值$ c>包含索引
df1
的时间间隔。
I would like to add a new column C
to df1
such that the value of C
is corresponding to the value in df2
for the interval which contained the index of df1
.
预期结果
df1.head()
Out[292]: C
2016-08-31 08:09:00 1.0 'a'
2016-08-31 08:11:00 7.0 'a'
2016-08-31 08:14:00 90.0 'b'
到目前为止,我已经尝试过:
So far, I have tried:
df1.loc[:,'C']=df1.index.map(lambda i:df2[np.logical_and(i>df2.starti<df2.stop)].C)
,但是效率极低并且崩溃在某些情况下,df2的间隔列表中缺少索引值。
but it is extremely inefficient and crashes for some cases where there are values of the index which are missing from the interval list in df2.
如何有效地做到这一点?
How to do this efficiently?
通过 IntervalIndex.from_arrays
首先:
Create IntervalIndex
by IntervalIndex.from_arrays
first:
s = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], 'both')
print (s)
IntervalIndex([[2016-08-31 08:09:00, 2016-08-31 08:12:00],
[2016-08-31 08:13:00, 2016-08-31 08:20:00],
[2016-08-31 08:20:00, 2016-08-31 08:45:00]],
closed='both',
dtype='interval[datetime64[ns]]')
然后 set_index
由新的 IntervalIndex
设置为由 值
:
Then set_index
by new IntervalIndex
set to new column by array created by values
:
df1['C'] = df2.set_index(s).loc[df1.index, 'C'].values
print (df1)
A C
2016-08-31 08:09:00 1.0 a
2016-08-31 08:11:00 7.0 a
2016-08-31 08:14:00 90.0 b
编辑:
s = pd.IntervalIndex.from_arrays(df2['start'].astype(np.int64),
df2['stop'].astype(np.int64), 'both')
print (s)
IntervalIndex([[1472630940000000000, 1472631120000000000],
[1472631180000000000, 1472631600000000000],
[1472631600000000000, 1472633100000000000]],
closed='both',
dtype='interval[int64]')
df1['C'] = df2.set_index(s).loc[df1.index.astype(np.int64), 'C'].values
print (df1)
A C
2016-08-31 08:09:00 1.0 a
2016-08-31 08:11:00 7.0 a
2016-08-31 08:14:00 90.0 b