Python Pandas Dataframe按组中的最大值选择行

问题描述:

我有一个通过df.pivot创建的数据框:

I have a dataframe which was created via a df.pivot:

type                             start  end
F_Type         to_date                     
A              20150908143000    345    316
B              20150908140300    NaN    480
               20150908140600    NaN    120
               20150908143000  10743   8803
C              20150908140100    NaN   1715
               20150908140200    NaN   1062
               20150908141000    NaN    145
               20150908141500    418    NaN
               20150908141800    NaN    450
               20150908142900   1973   1499
               20150908143000  19522  16659
D              20150908143000    433     65
E              20150908143000   7290   7375
F              20150908143000      0      0
G              20150908143000   1796    340

我想为每个"F_TYPE"过滤并返回一行,只返回最大"to_date"的行.我想返回以下数据框:

I would like to filter and return a single row for each 'F_TYPE' only returning the row with the Maximum 'to_date'. I would like to return the following dataframe:

type                             start  end
F_Type         to_date                     
A              20150908143000    345    316
B              20150908143000  10743   8803
C              20150908143000  19522  16659
D              20150908143000    433     65
E              20150908143000   7290   7375
F              20150908143000      0      0
G              20150908143000   1796    340

谢谢..

一种标准方法是使用groupby(keys)[column].idxmax(). 但是,要使用 idxmax 您需要idxmax返回唯一索引值.获取唯一索引的一种方法是调用reset_index.

A standard approach is to use groupby(keys)[column].idxmax(). However, to select the desired rows using idxmax you need idxmax to return unique index values. One way to obtain a unique index is to call reset_index.

一旦从groupby(keys)[column].idxmax()获得索引值,就可以使用df.loc选择整个行:

Once you obtain the index values from groupby(keys)[column].idxmax() you can then select the entire row using df.loc:

In [20]: df.loc[df.reset_index().groupby(['F_Type'])['to_date'].idxmax()]
Out[20]: 
                       start    end
F_Type to_date                     
A      20150908143000    345    316
B      20150908143000  10743   8803
C      20150908143000  19522  16659
D      20150908143000    433     65
E      20150908143000   7290   7375
F      20150908143000      0      0
G      20150908143000   1796    340

注意:idxmax返回索引 labels ,不一定是普通索引.使用reset_index后,索引标签也恰好是普通标签,但由于idxmax返回标签(不是普通标签),最好始终始终将idxmaxdf.loc结合使用,而不是df.iloc(就像我最初在这篇文章中所做的那样.)

Note: idxmax returns index labels, not necessarily ordinals. After using reset_index the index labels happen to also be ordinals, but since idxmax is returning labels (not ordinals) it is better to always use idxmax in conjunction with df.loc, not df.iloc (as I originally did in this post.)