Python: new column based on NaN in other columns
I'm quite new to Python and this is my first ever question so please be gentle with me!
I have tried out answers to other similar questions but am still quite stuck.
I am using Pandas and I have a dataframe which is a merge from multiple different SQL tables and looks something like this:
Col_1 Col_2 Col_3 Col_4
1 NaN NaN NaN
2 Y NaN NaN
3 Z C S
4 NaN B W
I don't care about the values in Col_2, Col_3 and Col_4 (note that these can be strings, integers or objects, depending on the column).
I only care whether at least one of these columns is populated, so ideally I would like a fifth column like this:
Col_1 Col_2 Col_3 Col_4 Col_5
1 NaN NaN NaN 0
2 Y NaN NaN 1
3 Z C S 1
4 NaN B W 1
Then I want to drop the columns Col_2 to Col_4.
My initial thought was something like the function below, but this is reducing my dataframe from 50000 rows to 50. I don't want to delete any rows.
def function(row):
    if (isnull.row['col_2'] and isnull.row['col_3'] and isnull.row['col_3'] is None):
        return '0'
    else:
        return '1'

df['col_5'] = df.apply(lambda row: function (row),axis=1)
Any help would be much appreciated.
Use np.any and pass the parameter axis=1, which tests row-wise. This produces a boolean array which, when converted to int, turns all True values into 1 and all False values into 0. This will be much faster than calling apply, which iterates row by row and will be very slow:
In [30]:
import numpy as np
df['Col_5'] = np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
df
Out[30]:
Col_1 Col_2 Col_3 Col_4 Col_5
0 1 NaN NaN NaN 0
1 2 Y NaN NaN 1
2 3 Z C S 1
3 4 NaN B W 1
In [31]:
df = df[['Col_1', 'Col_5']]
df
Out[31]:
Col_1 Col_5
0 1 0
1 2 1
2 3 1
3 4 1
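For reference, the steps above can also be written as a self-contained script. This is only a sketch: the frame is rebuilt by hand from the question's table, the name check_cols is made up for illustration, it uses the DataFrame method any(axis=1) in place of np.any, and drop(columns=...) assumes a reasonably recent pandas (0.21 or later).

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table in the question (assumed values).
df = pd.DataFrame({
    "Col_1": [1, 2, 3, 4],
    "Col_2": [np.nan, "Y", "Z", np.nan],
    "Col_3": [np.nan, np.nan, "C", "B"],
    "Col_4": [np.nan, np.nan, "S", "W"],
})

# Hypothetical helper name: the columns whose contents we only test for presence.
check_cols = ["Col_2", "Col_3", "Col_4"]

# True where at least one checked column is populated, then cast to 0/1.
df["Col_5"] = df[check_cols].notnull().any(axis=1).astype(int)

# Drop the checked columns; this removes columns only, never rows.
result = df.drop(columns=check_cols)
print(result)
```

Naming the columns explicitly rather than slicing with df.columns[1:] keeps the result stable if the merge later adds or reorders columns.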
This is the output from np.any:
In [34]:
np.any(df[df.columns[1:]].notnull(), axis=1)
Out[34]:
array([False, True, True, True], dtype=bool)
Timings
In [35]:
%timeit df[df.columns[1:]].apply(lambda x: all(x.isnull()) , axis=1).astype(int)
%timeit np.any(df[df.columns[1:]].notnull(), axis=1).astype(int)
100 loops, best of 3: 2.46 ms per loop
1000 loops, best of 3: 1.4 ms per loop
So on your test data, for a df this size, my method is roughly 1.75x faster than the other answer (1.4 ms versus 2.46 ms).
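The four-row sample is too small to say much about a frame with 50000 rows. A rough way to repeat the comparison on a larger frame, outside IPython, is sketched below. The synthetic data, the row count and the names big, with_apply and with_any are all assumptions made for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: about half of the cells in Col_2..Col_4 are NaN.
rng = np.random.default_rng(0)
n_rows = 5_000  # raise this to approach the 50000 rows mentioned in the question
data = rng.random((n_rows, 3))
data[data < 0.5] = np.nan
big = pd.DataFrame(data, columns=["Col_2", "Col_3", "Col_4"])
big.insert(0, "Col_1", np.arange(n_rows))

cols = ["Col_2", "Col_3", "Col_4"]

def with_apply():
    # Row-wise: the lambda is called once per row.
    return big[cols].apply(lambda row: int(row.notnull().any()), axis=1)

def with_any():
    # Vectorised: one boolean reduction across the columns.
    return big[cols].notnull().any(axis=1).astype(int)

# Both approaches should agree before their speed is compared.
same = bool((with_apply() == with_any()).all())

print("apply:", timeit.timeit(with_apply, number=3))
print("any:  ", timeit.timeit(with_any, number=3))
```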
Update
As you are running pandas version 0.12.0, you need to call the top-level pd.notnull version, as that method is not available at the df level:
np.any(pd.notnull(df[df.columns[1:]]), axis=1).astype(int)
I suggest you upgrade as you'll get lots more features and bug fixes.
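To see which form applies on a given install, a small check like the following may help. It is a sketch on a hand-built copy of the sample frame: it prints the installed pandas version and compares the top-level pd.notnull form with the method form, which requires a pandas release that has DataFrame.notnull. The call to to_numpy() is an addition here (it assumes pandas 0.24 or later) so that np.any receives a plain NumPy array.

```python
import numpy as np
import pandas as pd

print(pd.__version__)  # the installed pandas release

# Sample frame rebuilt from the table in the question (assumed values).
df = pd.DataFrame({
    "Col_1": [1, 2, 3, 4],
    "Col_2": [np.nan, "Y", "Z", np.nan],
    "Col_3": [np.nan, np.nan, "C", "B"],
    "Col_4": [np.nan, np.nan, "S", "W"],
})
cols = df.columns[1:]

# Top-level function form, as suggested for older pandas versions.
top_level = np.any(pd.notnull(df[cols]).to_numpy(), axis=1).astype(int)

# Method form, available on newer pandas versions.
method = df[cols].notnull().any(axis=1).astype(int)

print(top_level.tolist(), method.tolist())
```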