获取多个 pandas 数据框的均值

问题描述:

我正在生成许多形状相同的数据框,并且我想将它们相互比较.我希望能够获得整个数据框的均值和中位数.

I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.

         Source.0  Source.1  Source.2  Source.3
cluster                                        
0        0.001182  0.184535  0.814230  0.000054
1        0.000001  0.160490  0.839508  0.000001
2        0.000001  0.173829  0.826114  0.000055
3        0.000432  0.180065  0.819502  0.000001
4        0.000152  0.157041  0.842694  0.000113
5        0.000183  0.174142  0.825674  0.000001
6        0.000001  0.151556  0.848405  0.000038
7        0.000771  0.177583  0.821645  0.000001
8        0.000001  0.202059  0.797939  0.000001
9        0.000025  0.189537  0.810410  0.000028
10       0.006142  0.003041  0.493912  0.496905
11       0.003739  0.002367  0.514216  0.479678
12       0.002334  0.001517  0.529041  0.467108
13       0.003458  0.000001  0.532265  0.464276
14       0.000405  0.005655  0.527576  0.466364
15       0.002557  0.003233  0.507954  0.486256
16       0.004161  0.000001  0.491271  0.504568
17       0.001364  0.001330  0.528311  0.468996
18       0.002886  0.000001  0.506392  0.490721
19       0.001823  0.002498  0.509620  0.486059

         Source.0  Source.1  Source.2  Source.3
cluster                                        
0        0.000001  0.197108  0.802495  0.000396
1        0.000001  0.157860  0.842076  0.000063
2        0.094956  0.203057  0.701662  0.000325
3        0.000001  0.181948  0.817841  0.000210
4        0.000003  0.169680  0.830316  0.000001
5        0.000362  0.177194  0.822443  0.000001
6        0.000001  0.146807  0.852924  0.000268
7        0.001087  0.178994  0.819564  0.000354
8        0.000001  0.202182  0.797333  0.000485
9        0.000348  0.181399  0.818252  0.000001
10       0.003050  0.000247  0.506777  0.489926
11       0.004420  0.000001  0.513927  0.481652
12       0.006488  0.001396  0.527197  0.464919
13       0.001510  0.000001  0.525987  0.472502
14       0.000001  0.000001  0.520737  0.479261
15       0.000001  0.001765  0.515658  0.482575
16       0.000001  0.000001  0.492550  0.507448
17       0.002855  0.000199  0.526535  0.470411
18       0.000001  0.001952  0.498303  0.499744
19       0.001232  0.000001  0.506612  0.492155

然后我想获取这两个数据帧的均值.

Then I want to get the mean of these two dataframes.

最简单的方法是什么?

请澄清一下,当所有数据框的索引和列完全相同时,我想获取每个特定单元格的均值.

Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.

因此,在我给出的示例中,[0,Source.0]的平均值为(0.001182 + 0.000001)/2 = 0.0005915.

So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.

假设两个数据框具有相同的列,则可以将它们串联起来并在串联的框架上计算汇总统计信息:

Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:

import numpy as np
import pandas as pd

# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))

# concatenate them
df_concat = pd.concat((df1, df2))

print df_concat.mean()
# x   -0.163044
# y    2.120000
# dtype: float64

print df_concat.median()
# x   -0.192037
# y    2.000000
# dtype: float64

更新

如果要计算两个数据集中具有相同索引的每组行的统计信息,则可以使用.groupby()按行索引对数据进行分组,然后应用均值,中位数等.

Update

If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:

by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()

print df_means.head()
#           x    y
# 0 -0.850794  1.5
# 1  0.159038  1.5
# 2  0.083278  1.0
# 3 -0.540336  0.5
# 4  0.390954  3.5

即使您的数据帧中的行数不相等,该方法也可以使用-如果两个数据帧之一中缺少特定的行索引,则均值/中位数将在现有的单个行上进行计算.

This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.