How do I delete columns in a pandas DataFrame based on a condition?
Question:
I have a pandas DataFrame with many NaN values in it.
How can I delete the columns where number_of_na_values > 2000?
I tried to do it like this:
toRemove = set()
naNumbersPerColumn = df.isnull().sum()
for i in naNumbersPerColumn.index:
    if naNumbersPerColumn[i] > 2000:
        toRemove.add(i)
for i in toRemove:
    df.drop(i, axis=1, inplace=True)
Is there a more elegant way to do it?
Answer:
Here's another alternative that keeps only the columns with at most the specified number of NaNs:
max_number_of_nas = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]
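As a side note (not part of the original answer), pandas also provides `DataFrame.dropna` with a `thresh` parameter, which keeps columns that have at least `thresh` non-NaN values. A minimal sketch showing it produces the same result as the boolean-mask approach above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
df[df < 0] = np.nan  # introduce NaNs in roughly half the cells

max_number_of_nas = 60

# dropna(thresh=N) keeps columns with at least N non-NaN values,
# i.e. columns with at most len(df) - N NaNs.
kept = df.dropna(axis=1, thresh=len(df) - max_number_of_nas)

# The boolean-mask formulation from the answer keeps the same columns.
mask_kept = df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nas]
assert list(kept.columns) == list(mask_kept.columns)
```

The `thresh` form avoids computing the NaN counts yourself, at the cost of stating the condition in terms of non-NaN values rather than NaN values.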
In my tests this seems to be slightly faster than the drop-columns method suggested by Jianxun Li, in the cases I tested:
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1000 loops, best of 3: 1.76 ms per loop
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 100 loops, best of 3: 2.04 ms per loop
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1000 loops, best of 3: 662 µs per loop
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1000 loops, best of 3: 1.08 ms per loop
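The `%timeit` magics above only run inside IPython. A self-contained sketch of the same comparison using the standard-library `timeit` module (absolute numbers will differ by machine; only the relative order matters):

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010

# Boolean-mask approach from this answer.
mask_way = lambda: df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nans]

# Drop-columns approach being compared against.
drop_way = lambda: df.drop(
    df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)],
    axis=1)

# Sanity check: both approaches keep exactly the same columns.
assert list(mask_way().columns) == list(drop_way().columns)

print('mask:', min(timeit.repeat(mask_way, number=100, repeat=3)))
print('drop:', min(timeit.repeat(drop_way, number=100, repeat=3)))
```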