Python Pandas: Groupby and Apply with multi-column operations

Problem description:

df1 is a DataFrame with 4 columns.

I want to create a new DataFrame (df2) by grouping df1 by column 'A' and applying multi-column operations to columns 'C' and 'D':

Column 'AA' = mean(C) + mean(D)

Column 'BB' = std(D)

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
    'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
    'C' : np.random.randn(8),
    'D' : np.random.randn(8)})

   A      B         C         D
0  foo    one  1.652675 -1.983378
1  bar    one  0.926656 -0.598756
2  foo    two  0.131381  0.604803
3  bar  three -0.436376 -1.186363
4  foo    two  0.487161 -0.650876
5  bar    two  0.358007  0.249967
6  foo    one -1.150428  2.275528
7  foo  three  0.202677 -1.408699

def fun1(gg): # this does not work
    return pd.DataFrame({'AA':C.mean()+gg.C.std(), 'BB':gg.C.std()})


dg1 = df1.groupby('A')
df2 = dg1.apply(fun1)

This does not work. It seems that aggregate() only works on a single Series, so multi-column operations are not possible, while apply() only produces Series output from multi-column operations. Is there any other way to produce multi-column output (a DataFrame) from multi-column operations?

Do you have a typo in your fun1 function? Should AA be C.mean() + C.std() or C.mean() + D.mean()?

In the first case, AA = C.mean() + C.std():

In [91]: df = df1.groupby('A').agg({'C': lambda x: x.mean() + x.std(),
                                    'D': lambda x: x.std()})

In [92]: df
Out[92]: 
            C         D
A                      
bar  1.255506  0.588981
foo  1.775945  0.442724
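
The dict form above keeps the source column names 'C' and 'D' in the result. If the output columns should be called 'AA' and 'BB' directly, named aggregation (available in pandas 0.25 and later) is one option. The following is a minimal sketch on a small fixed sample; the sample values and the variable name `named` are illustrative assumptions, not the data from the question.

```python
import pandas as pd

# Small fixed sample (illustrative values, not the random data from the question)
df1 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [0.5, 1.5, 2.5, 5.5],
})

# Named aggregation: output_name=(source column, aggregation function).
# Each aggregation still sees only one column at a time.
named = df1.groupby('A').agg(
    AA=('C', lambda x: x.mean() + x.std()),
    BB=('D', 'std'),
)
print(named)
```

Like the dict form, this only covers aggregations that read a single column each, so it fits the first case but not C.mean() + D.mean().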

For the second one, C.mean() + D.mean(), things aren't quite as nice. When you pass a dict to the .agg function on a groupby object, I don't think there's a way to get values from two columns.

In [108]: g = df1.groupby('A')

In [109]: df = pd.DataFrame({"AA": g.mean()['C'] + g.mean()['D'], "BB": g.std()['D']})

In [110]: df
Out[110]: 
           AA        BB
A                      
bar  0.532263  0.721351
foo  0.427608  0.494980

You may want to assign g.mean() and g.std() to temporary variables to avoid calculating them twice.
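
As a rough sketch of that suggestion, the group means and standard deviations can be computed once and reused. The sample values and the names `means`, `stds`, `summarize`, and `df2_apply` are illustrative assumptions. The columns 'C' and 'D' are selected explicitly because recent pandas versions raise an error when asked for the mean of a non-numeric column such as 'B'. For comparison, the sketch also shows an apply()-based variant: when the applied function returns a pd.Series per group, the combined result is a DataFrame, which addresses the original question about multi-column output.

```python
import pandas as pd

# Small fixed sample (illustrative values, not the random data from the question)
df1 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [0.5, 1.5, 2.5, 5.5],
})

# Restrict the groupby to the numeric columns that are actually needed
g = df1.groupby('A')[['C', 'D']]

# Compute each aggregate once and reuse it
means = g.mean()
stds = g.std()
df2 = pd.DataFrame({'AA': means['C'] + means['D'], 'BB': stds['D']})
print(df2)

# Alternative: apply() with a function that returns one Series per group;
# pandas stacks the per-group Series into a DataFrame indexed by 'A'.
def summarize(group):
    return pd.Series({
        'AA': group['C'].mean() + group['D'].mean(),
        'BB': group['D'].std(),
    })

df2_apply = g.apply(summarize)
print(df2_apply)
```

The temporary-variable version relies on vectorized group aggregates, so it is likely the faster of the two on large data; the apply() version keeps all of the per-group logic in one function.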