How do I delete columns in a pandas DataFrame based on a condition?
Question:
I have a pandas DataFrame with many NaN values in it.
How can I delete the columns where number_of_na_values > 2000?
I tried to do it like this:
toRemove = set()
naNumbersPerColumn = df.isnull().sum()
for i in naNumbersPerColumn.index:
    if naNumbersPerColumn[i] > 2000:
        toRemove.add(i)
for i in toRemove:
    df.drop(i, axis=1, inplace=True)
Is there a more elegant way to do it?
Answer:
Here's another alternative that keeps only the columns with at most the specified number of NaNs:
max_number_of_nas = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]
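As a side note (not part of the original answer), pandas also provides `DataFrame.dropna` with a `thresh` parameter, which keeps columns that have at least `thresh` non-NaN values. A minimal sketch showing it produces the same result as the boolean-mask approach above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
df[df < 0] = np.nan  # introduce NaNs in roughly half the cells

max_number_of_nas = 60

# dropna(thresh=N) keeps columns with at least N non-NaN values,
# i.e. columns with at most len(df) - N NaNs.
kept = df.dropna(axis=1, thresh=len(df) - max_number_of_nas)

# The boolean-mask formulation from the answer keeps the same columns.
mask_kept = df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nas]
assert list(kept.columns) == list(mask_kept.columns)
```

The `thresh` form avoids computing the NaN counts yourself, at the cost of stating the condition in terms of non-NaN values rather than NaN values.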
In my tests this seems to be slightly faster than the drop-columns method suggested by Jianxun Li, in the cases I tested:
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1000 loops, best of 3: 1.76 ms per loop
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 100 loops, best of 3: 2.04 ms per loop
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1000 loops, best of 3: 662 µs per loop
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1000 loops, best of 3: 1.08 ms per loop
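The `%timeit` magics above only run inside IPython. A self-contained sketch of the same comparison using the standard-library `timeit` module (absolute numbers will differ by machine; only the relative order matters):

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010

# Boolean-mask approach from this answer.
mask_way = lambda: df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nans]

# Drop-columns approach being compared against.
drop_way = lambda: df.drop(
    df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)],
    axis=1)

# Sanity check: both approaches keep exactly the same columns.
assert list(mask_way().columns) == list(drop_way().columns)

print('mask:', min(timeit.repeat(mask_way, number=100, repeat=3)))
print('drop:', min(timeit.repeat(drop_way, number=100, repeat=3)))
```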