Pandas DataFrame aggregate function using multiple columns

Problem description:

Is there a way to write an aggregation function, like those passed to the DataFrame.agg method, that has access to more than one column of the data being aggregated? Typical use cases would be weighted average and weighted standard deviation functions.

I would like to be able to write something like:

def wAvg(c, w):
    # Weighted average of c, using w as the weights
    return (c * w).sum() / w.sum()

df = DataFrame(....)  # df has columns c and w; I want the weighted average
                      # of c using w as weights.
df.aggregate({"c": wAvg})  # and somehow tell it to use the w column as weights ...

Yes; use the .apply(...) method of the grouped object. It calls your function on each sub-DataFrame, so the function can read any of the columns. For example:

grouped = df.groupby(keys)

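def wavg(group):
    # group is the sub-DataFrame for one key, so both columns are available
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)  # one weighted average per group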
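
To make this concrete, here is a minimal self-contained sketch of the same approach. The sample data and the column name 'key' are made up for illustration, and the wstd helper is an assumption of mine: it uses the biased (population) form of the weighted standard deviation, which is only one of several definitions in use. Selecting the two columns before calling apply is also my choice, to keep the grouping column out of the sub-DataFrames.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "data": [1.0, 3.0, 10.0, 20.0],
    "weights": [1.0, 3.0, 1.0, 1.0],
})

def wavg(group):
    # Weighted average of 'data' using 'weights' within one group.
    d = group["data"]
    w = group["weights"]
    return (d * w).sum() / w.sum()

def wstd(group):
    # Biased (population) weighted standard deviation; an assumed definition.
    d = group["data"]
    w = group["weights"]
    mean = (d * w).sum() / w.sum()
    return ((w * (d - mean) ** 2).sum() / w.sum()) ** 0.5

# Keep only the columns the functions need, then apply per group.
grouped = df.groupby("key")[["data", "weights"]]

wavg_result = grouped.apply(wavg)  # Series indexed by key
wstd_result = grouped.apply(wstd)  # Series indexed by key

print(wavg_result)
print(wstd_result)
```

With this data, group a has a weighted average of 2.5 and group b of 15.0. If you need a different normalization for the standard deviation (for example, a bias-corrected one), only the body of wstd has to change.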