Subsetting a pandas data frame with datetime columns
Following up on this question, where a pandas data frame is subset by one string variable and one datetime variable using idxmin, how could we subset by two datetime variables? For the example data frame below, how would we select the rows with class == C that have the minimum base_date and the maximum date_2? [The answer would be row 3]:
print(example)
   slot_id class        day   base_date      date_2
0        1     A     Monday  2019-01-21  2019-01-24
1        2     B    Tuesday  2019-01-22  2019-01-23
2        3     C  Wednesday  2019-01-22  2019-01-24
3        4     C  Wednesday  2019-01-22  2019-01-26
4        5     C  Wednesday  2019-01-24  2019-01-25
5        6     C   Thursday  2019-01-24  2019-01-22
6        7     D    Tuesday  2019-01-23  2019-01-24
7        8     E   Thursday  2019-01-24  2019-01-30
8        9     F   Saturday  2019-01-26  2019-01-31
For just class == "C" with the minimum base_date we can use:
df.iloc[pd.to_datetime(df.loc[df['class'] == 'C', 'base_date']).idxmin()]
However, if we had 2 or more date variables with conditions such as max/min, would the index solution still be practical? Doesn't index subsetting with 2 or more variables imply nesting df.iloc? Is this the only way to subset on 2 or more datetime variables?
Data:
print(example.to_dict())
{'slot_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9}, 'class': {0: 'A', 1: 'B', 2: 'C', 3: 'C', 4: 'C', 5: 'C', 6: 'D', 7: 'E', 8: 'F'}, 'day': {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Wednesday', 4: 'Wednesday', 5: 'Thursday', 6: 'Tuesday', 7: 'Thursday', 8: 'Saturday'}, 'base_date': {0: datetime.date(2019, 1, 21), 1: datetime.date(2019, 1, 22), 2: datetime.date(2019, 1, 22), 3: datetime.date(2019, 1, 22), 4: datetime.date(2019, 1, 24), 5: datetime.date(2019, 1, 24), 6: datetime.date(2019, 1, 23), 7: datetime.date(2019, 1, 24), 8: datetime.date(2019, 1, 26)}, 'date_2': {0: datetime.date(2019, 1, 24), 1: datetime.date(2019, 1, 23), 2: datetime.date(2019, 1, 24), 3: datetime.date(2019, 1, 26), 4: datetime.date(2019, 1, 25), 5: datetime.date(2019, 1, 22), 6: datetime.date(2019, 1, 24), 7: datetime.date(2019, 1, 30), 8: datetime.date(2019, 1, 31)}}
Data preprocessing:
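import datetime
import pandas as pd
example = pd.DataFrame(example)  # `example` starts out as the dict printed above
# str(datetime.date) yields ISO 'YYYY-MM-DD', so the matching format is '%Y-%m-%d'
example['base_date'] = pd.to_datetime(example['base_date'].astype(str), format='%Y-%m-%d')
example['base_date'] = example['base_date'].dt.date
example['date_2'] = pd.to_datetime(example['date_2'].astype(str), format='%Y-%m-%d')
example['date_2'] = example['date_2'].dt.date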
You can use transform:
yourdf=example[example['base_date']==example.groupby('class')['base_date'].transform('min')]
If you only need class C:
yourdf.loc[yourdf['class']=='C',:]
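The snippet above only handles the minimum base_date. One way to also cover the maximum date_2 from the question might be to apply the same boolean-mask idea in two steps, without nesting any index lookups. This is a minimal sketch, assuming the frame is rebuilt from ISO date strings for brevity; the names `c_rows`, `earliest` and `result` are illustrative and not from the original post.

```python
import pandas as pd

# Rebuild the example frame; dates are parsed from ISO strings (an assumption
# made here for brevity, the question stores datetime.date objects).
example = pd.DataFrame({
    'slot_id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'class': ['A', 'B', 'C', 'C', 'C', 'C', 'D', 'E', 'F'],
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Wednesday', 'Wednesday',
            'Thursday', 'Tuesday', 'Thursday', 'Saturday'],
    'base_date': pd.to_datetime([
        '2019-01-21', '2019-01-22', '2019-01-22', '2019-01-22', '2019-01-24',
        '2019-01-24', '2019-01-23', '2019-01-24', '2019-01-26']),
    'date_2': pd.to_datetime([
        '2019-01-24', '2019-01-23', '2019-01-24', '2019-01-26', '2019-01-25',
        '2019-01-22', '2019-01-24', '2019-01-30', '2019-01-31']),
})

# Step 1: keep only class C.
c_rows = example[example['class'] == 'C']
# Step 2: among those, keep every row sharing the minimum base_date.
earliest = c_rows[c_rows['base_date'] == c_rows['base_date'].min()]
# Step 3: among those, keep every row sharing the maximum date_2.
result = earliest[earliest['date_2'] == earliest['date_2'].max()]

print(result)
```

Each step narrows the previous one, so further date conditions can be appended the same way, and ties are kept instead of being silently dropped.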
Also, idxmin and idxmax return only the first index that meets the min or max condition, so when several rows share the minimum or maximum value, only one index is returned.
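The difference on tied values can be seen in a small sketch. It uses a hypothetical three-row Series (`dates`) that mirrors the base_date values of rows 2 to 4 in the example, and compares idxmin with a boolean mask:

```python
import pandas as pd

# Hypothetical Series: rows 2 and 3 tie on the minimum date.
dates = pd.Series(
    pd.to_datetime(['2019-01-22', '2019-01-22', '2019-01-24']),
    index=[2, 3, 4],
)

# idxmin returns the label of the first minimum only.
first_label = dates.idxmin()

# A boolean mask keeps the label of every row equal to the minimum.
all_labels = dates[dates == dates.min()].index.tolist()

print(first_label, all_labels)
```

Note that idxmin returns an index label, not a position, so it pairs naturally with .loc; .iloc happens to give the same row only when the frame has the default 0..n-1 index.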