Python pandas COUNTIFS using multiple criteria and multiple data frames

Question:

I am trying to create, in Python using multiple data frames, the equivalent of an Excel COUNTIFS that spans multiple sheets.

I need a new column holding the count of records in another data frame, based on criteria from the current data frame.

See the Excel impression of what I want to do in Python, also here.

My goal?

  • Count exams on students data frame
  • by Student ID
  • with exam date >= enroll date
  • with exam date <= detail date
  • with exam grade >= 70

Basically the Excel equivalent would be...

=COUNTIFS(Summary!$B$1:$B$11, ">="&Detail!B2, Summary!$B$1:$B$11, "<="&Detail!C2, Summary!$C$1:$C$11, ">="&70, Summary!$A$1:$A$11, "="&Detail!A2)

...where Summary is the primary data frame and Detail is the secondary data frame where I want to count records.

I found these answers in my research:

  • sumifs function in python
  • What is a good way to do countif in Python
  • Python Pandas counting and summing specific conditions

They are not quite what I'm looking for, because they don't span multiple data frames. I was able to create a basic countifs for a single data frame:

sum(1 for x in students['Student ID'] if x == 1)
sum(1 for x in exams['Exam Grade'] if x >= 70)

Answer:

Basically, what you'll want to do is set up two data frames, say df1 for the per-student "exams passed" information and df2 for the marks on each exam.

To get yourself started, you can read in your Excel files like this:

import pandas as pd

df1 = pd.read_excel('filename1.xlsx')
df2 = pd.read_excel('filename2.xlsx')

Then, for each row in df1, you want to filter df2 down to the matching exams and take the length of the filtered data frame.

First, though, you might want to make a list of the relevant information for each row in df1, which could be done like this:

student_info = df1[['Student ID', 'Enrollment Date', 'Qualification Date']].values

Then you can iterate through the rows like this:

N_exams_passed = [] # Store counts for each student in a list

for s_id, s_enroll, s_qual in student_info:
    N_exams_passed.append(len(df2[(df2['Student ID']==s_id) &
                                  (df2['Exam Date']>=s_enroll) &
                                  (df2['Exam Date']<=s_qual) &
                                  (df2['Grade']>=70)])
                          )

Then add/replace the column in df1:

df1['Exams Passed'] = N_exams_passed
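
The loop above runs one filter over df2 per student, which can get slow on large data. As a possible alternative (not part of the original approach), here is a rough sketch that merges the two frames on Student ID, filters once, and counts per student. It assumes the same column names as above, that each Student ID appears only once in df1, and that the date columns are already datetimes; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the two Excel sheets.
df1 = pd.DataFrame({
    'Student ID': [1, 2, 3],
    'Enrollment Date': pd.to_datetime(['2016-01-01', '2016-02-01', '2016-03-01']),
    'Qualification Date': pd.to_datetime(['2016-06-30', '2016-07-31', '2016-08-31']),
})
df2 = pd.DataFrame({
    'Student ID': [1, 1, 1, 2, 2],
    'Exam Date': pd.to_datetime(['2016-03-15', '2016-08-01', '2016-05-01',
                                 '2016-02-20', '2016-03-01']),
    'Grade': [85, 90, 72, 65, 70],
})

# Attach each student's date window to every one of their exams.
merged = df2.merge(df1, on='Student ID', how='inner')

# Apply all COUNTIFS-style criteria at once.
passed = merged[(merged['Exam Date'] >= merged['Enrollment Date']) &
                (merged['Exam Date'] <= merged['Qualification Date']) &
                (merged['Grade'] >= 70)]

# Count qualifying exams per student; students with none get 0.
counts = passed.groupby('Student ID').size()
df1['Exams Passed'] = df1['Student ID'].map(counts).fillna(0).astype(int)
```

With this sample data, student 1 has two qualifying exams, student 2 has one, and student 3 has none. The merge creates one row per exam, so memory use grows with the size of df2 rather than with the number of student/exam combinations.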

In order to compare the dates properly, you will need to convert them to datetime objects in each pandas data frame; I will leave this up to you. Hint: you can use the pd.to_datetime() function.
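
As a minimal sketch of that hint, assuming the date columns came in as strings and use the column names from the code above (the sample values here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; real frames would come from pd.read_excel().
df1 = pd.DataFrame({
    'Student ID': [1, 2],
    'Enrollment Date': ['2016-01-01', '2016-02-01'],
    'Qualification Date': ['2016-06-30', '2016-07-31'],
})
df2 = pd.DataFrame({
    'Student ID': [1, 1, 2],
    'Exam Date': ['2016-03-15', '2016-08-01', '2016-02-20'],
    'Grade': [85, 90, 65],
})

# Convert the date columns so that >= and <= compare dates, not strings.
for col in ['Enrollment Date', 'Qualification Date']:
    df1[col] = pd.to_datetime(df1[col])
df2['Exam Date'] = pd.to_datetime(df2['Exam Date'])
```

If the dates in your files use an ambiguous layout such as day/month/year, it may be worth passing an explicit format= argument to pd.to_datetime() rather than relying on inference.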