pandas: drop consecutive duplicates

Question:

What's the most efficient way to drop only consecutive duplicates in pandas?

drop_duplicates gives this:

In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [4]: a.drop_duplicates()
Out[4]: 
1    1
2    2
4    3
dtype: int64

But I want this:

In [4]: a.something()
Out[4]: 
1    1
2    2
4    3
5    2
dtype: int64

Answer:

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

So the above uses a boolean criterion: we compare the Series against a copy of itself shifted by -1 rows to create the mask.
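To make the mask visible, here is a minimal sketch that builds it step by step for the same Series. The variable names `shifted` and `mask` are only for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift(-1) moves every value up one row, so each row holds the NEXT row's value;
# the last row has no successor and becomes NaN.
shifted = a.shift(-1)

# True wherever a value differs from the one that follows it.
# NaN != 2 evaluates to True, so the last row is always kept.
mask = shifted != a

print(mask.tolist())                 # [True, False, True, True, True]
print(a.loc[mask].index.tolist())    # [1, 3, 4, 5]
```

With shift(-1) the row that survives from each run is the last one, which is why index 3 appears here instead of index 2.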

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the shift approach above if you have a large number of rows.
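Apart from speed, diff relies on subtraction, so it is only suited to numeric data, whereas comparing against a shifted copy only needs inequality. A small sketch with string values, assuming an object-dtype Series (the data here is made up for illustration):

```python
import pandas as pd

# Hypothetical string data; dtype=object is set explicitly for this sketch.
s = pd.Series(["a", "b", "b", "c", "b"], dtype=object)

# Compare each value with the previous one; the first row is compared with NaN
# and is therefore always kept.
kept = s.loc[s.shift() != s]

print(kept.tolist())         # ['a', 'b', 'c', 'b']
print(kept.index.tolist())   # [0, 1, 3, 4]
```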

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This returns the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!
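If you need this in several places, the corrected expression can be wrapped in a small helper. This is only a sketch; the function name `drop_consecutive_duplicates` is hypothetical and not part of pandas.

```python
import pandas as pd


def drop_consecutive_duplicates(s: pd.Series) -> pd.Series:
    """Keep the first value of every run of equal consecutive values."""
    # s.shift() holds the previous row's value (NaN for the first row),
    # so the mask is True exactly where a new run starts.
    return s.loc[s.shift() != s]


a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])
result = drop_consecutive_duplicates(a)

print(result.index.tolist())   # [1, 2, 4, 5]
print(result.tolist())         # [1, 2, 3, 2]
```

One caveat: NaN never compares equal to NaN, so consecutive missing values are all kept by this mask.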
