pandas: drop consecutive duplicates
What's the most efficient way to drop only consecutive duplicates in pandas?
drop_duplicates gives this:
In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
In [4]: a.drop_duplicates()
Out[4]:
1 1
2 2
4 3
dtype: int64
But I want this:
In [4]: a.something()
Out[4]:
1 1
2 2
4 3
5 2
dtype: int64
Use shift:
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
So the above uses a boolean criterion: we compare the Series against itself shifted by -1 rows to create the mask.
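To make the mask concrete, here is a minimal sketch that splits the one-liner into its intermediate steps, using the same Series as in the question. The names shifted, mask and result are only illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift(-1) moves every value up one row, so each position holds the
# value of the following row; the last position becomes NaN.
shifted = a.shift(-1)

# Element-wise comparison: True where the next value differs from the
# current one. NaN never compares equal, so the last row is always True.
mask = shifted != a

# Boolean indexing keeps only the rows where the mask is True.
result = a.loc[mask]

print(mask.tolist())
print(result)
```

The printed result should match the Out[3] listing above.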
Another method is to use diff:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
But this is slower than the shift approach if you have a large number of rows.
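If you want to check the speed difference on your own machine, a rough timing sketch along these lines may help. The Series size, the random data and the repeat count are arbitrary assumptions, and the numbers will vary with hardware and pandas version. It uses shift() with the default period of 1, which selects the same rows as diff() != 0 on integer data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 200,000 small integers, so runs of equal
# values are common.
rng = np.random.default_rng(0)
big = pd.Series(rng.integers(0, 3, size=200_000))

def with_shift():
    return big.loc[big.shift() != big]

def with_diff():
    # diff() only works for data that supports subtraction.
    return big.loc[big.diff() != 0]

t_shift = timeit.timeit(with_shift, number=5)
t_diff = timeit.timeit(with_diff, number=5)
print(f"shift: {t_shift:.4f}s  diff: {t_diff:.4f}s")
```

Note that diff relies on subtraction, so it is not suitable for string data, whereas the shift comparison only needs inequality.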
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This returns the first value of each consecutive run:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in index values. Thanks, @BjarkeEbert!
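For reuse, the shift() version can be wrapped in a small helper. This is only a sketch; the function name drop_consecutive_duplicates is made up here and is not part of pandas.

```python
import pandas as pd

def drop_consecutive_duplicates(s: pd.Series) -> pd.Series:
    """Keep the first element of every run of equal consecutive values."""
    # shift() moves values down one row, so each position is compared
    # with the previous row; the first row is compared with NaN and kept.
    return s.loc[s.shift() != s]

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])
print(drop_consecutive_duplicates(a))

# The comparison only needs inequality, so it also works for strings.
letters = pd.Series(["a", "a", "b", "a"])
print(drop_consecutive_duplicates(letters))
```

One caveat: NaN does not compare equal to itself, so consecutive NaN values are all kept by this approach.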