Adding a new column to a pandas DataFrame results in NaN

Question:

I have a pandas DataFrame data with the following transaction data:

           A         date
0      M000833  2016-08-01
1      M000833  2016-08-01
2      M000833  2016-08-02
3      M000833  2016-08-02 
4      M000511  2016-08-05

I want a new column with the number of visits per consumer (multiple visits on the same day should count as 1).

So I tried this:

import pandas as pd
data['noofvisits'] = data.groupby(['A'])['date'].nunique()

When I run the right-hand side on its own, without assigning it to the DataFrame, I get a pandas Series with the desired output. However, the statement above results in:

           A         date       noofvisits
0      M000833  2016-08-01         NaN         
1      M000833  2016-08-01         NaN
2      M000833  2016-08-02         NaN
3      M000833  2016-08-02         NaN
4      M000511  2016-08-05         NaN

The expected output is:

           A         date       noofvisits
0      M000833  2016-08-01         2         
1      M000833  2016-08-01         2
2      M000833  2016-08-02         2
3      M000833  2016-08-02         2
4      M000511  2016-08-05         1

What is wrong with this approach? Why does the noofvisits column end up with NaN rather than the count values?

Answer:

Use transform to generate a Series with its index aligned to the original df:

In[32]:
df['noofvisits'] = df.groupby(['A'])['date'].transform('nunique')
df

Out[32]: 
             A        date  noofvisits
index                                 
0      M000833  2016-08-01           2
1      M000833  2016-08-01           2
2      M000833  2016-08-02           2
3      M000833  2016-08-02           2
4      M000511  2016-08-05           1

The problem with assigning directly is that you are grouping on column 'A', so the values of 'A' become the index of the groupby aggregation. When you then assign that result to your df, pandas aligns on index labels, and the two indices don't agree, hence the NaN column values.
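To see the mismatch concretely, here is a minimal sketch that rebuilds the sample frame from the question and compares the two indexes. The names counts and assigned are illustrative and do not appear in the original post.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "A": ["M000833", "M000833", "M000833", "M000833", "M000511"],
    "date": ["2016-08-01", "2016-08-01", "2016-08-02",
             "2016-08-02", "2016-08-05"],
})

# Aggregation: one row per group, indexed by the values of 'A'.
counts = df.groupby("A")["date"].nunique()
print(counts.index.tolist())  # group labels taken from column 'A'
print(df.index.tolist())      # default integer row labels

# Assignment aligns on index labels. None of the integer row labels
# appear in counts.index, so every row of the new column is NaN.
assigned = df.assign(noofvisits=counts)
print(assigned["noofvisits"].isna().all())
```

The last line prints True here because no label is shared between the two indexes.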

Also, even if the index values did agree, the shape is different anyway, since the aggregation returns one row per group rather than one row per original row:

In[33]:
df.groupby(['A'])['date'].nunique()

Out[33]: 
A
M000511    1
M000833    2
Name: date, dtype: int64
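If you already have the aggregated Series, another option (not part of the original answer, shown here only as a sketch) is to look each row's 'A' value up in it with Series.map, which should give the same column as transform. The sample frame is rebuilt from the question; counts and via_transform are illustrative names.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "A": ["M000833", "M000833", "M000833", "M000833", "M000511"],
    "date": ["2016-08-01", "2016-08-01", "2016-08-02",
             "2016-08-02", "2016-08-05"],
})

# One distinct-date count per consumer, indexed by the values of 'A'.
counts = df.groupby("A")["date"].nunique()

# Map each row's 'A' value to its count; the result keeps df's own index,
# so the assignment aligns row by row.
df["noofvisits"] = df["A"].map(counts)

# The transform-based approach from the answer, for comparison.
via_transform = df.groupby("A")["date"].transform("nunique")
print(df)
```

transform is usually the more direct choice; map is mainly useful when the per-group result is needed separately as well.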
