Remove low-count values from a pandas DataFrame column based on a condition
I have the following pandas DataFrame:
import numpy as np
import pandas as pd
new = pd.Series(np.array([0, 1, 0, 0, 2, 2]))
df = pd.DataFrame(new, columns=['a'])
I output the number of occurrences of each value with:
print(df['a'].value_counts())
which gives:
0 3
2 2
1 1
dtype: int64
Now I want to remove the rows whose value in column 'a' occurs fewer than 2 times. I could iterate over each value in df['a'] and drop it if its count is less than 2, but that takes a long time on a large DataFrame with multiple columns. I can't figure out an efficient way to do this.
You can subset the result of value_counts with your condition and take the index of the resulting Series, which holds the values that occur often enough. Then use isin to check which entries of the original data are among those values, and pass the resulting boolean mask to the original DataFrame:
s = df['a'].value_counts()
df[df.isin(s.index[s >= 2]).values]
How it works:
In [133]: s.index[s >= 2]
Out[133]: Int64Index([0, 2], dtype='int64')
In [134]: df.isin(s.index[s >= 2]).values
Out[134]:
array([[ True],
       [False],
       [ True],
       [ True],
       [ True],
       [ True]], dtype=bool)
In [135]: df[df.isin(s.index[s >= 2]).values]
Out[135]:
   a
0  0
2  0
3  0
4  2
5  2
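The question mentions a DataFrame with multiple columns, while the expression above calls isin on the whole frame, which suits the single-column example. A minimal sketch of the same idea restricted to one column might look like the following. The extra column 'b' and the helper name drop_rare are made up for illustration and are not part of the original post.

```python
import pandas as pd

# Hypothetical two-column frame; column 'b' is invented for this sketch.
df = pd.DataFrame({
    'a': [0, 1, 0, 0, 2, 2],
    'b': list('uvwxyz'),
})


def drop_rare(frame, column, min_count=2):
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].value_counts()
    # Values that meet the threshold live in the index of the filtered counts.
    keep = counts[counts >= min_count].index
    # Build the mask from the single column so other columns are not tested.
    return frame[frame[column].isin(keep)]


filtered = drop_rare(df, 'a')

# Alternative with the same result: map each row's value to its count
# and compare the counts directly.
row_counts = df['a'].map(df['a'].value_counts())
filtered_alt = df[row_counts >= 2]

print(filtered)
```

Both variants stay vectorized, so no Python-level loop over the rows is needed. The map version can be handy when the per-row count is also wanted as a column.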