Python: pandas merging multiple dataframes

Question:

I have different dataframes and need to merge them together based on the date column. If I only had two dataframes, I could use df1.merge(df2, on='date'). To do it with three dataframes, I use df1.merge(df2.merge(df3, on='date'), on='date'). However, it becomes really complex and unreadable with more dataframes.

All dataframes have one column in common, date, but they don't have the same number of rows or columns, and I only need the rows whose date is common to every dataframe.

So, I'm trying to write a recursive function that returns a dataframe with all the data, but it doesn't work. How should I merge multiple dataframes then?

I tried different ways and got errors like out of range, KeyError 0/1/2/3 and can not merge DataFrame with instance of type <class 'NoneType'>.

This is the script I wrote:

dfs = [df1, df2, df3] # list of dataframes

def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2): # it gets to the second to last and merges it with the last
        return

    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm

print(mergefiles(dfs, len(dfs)))

An example: df_1:

May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%

df_2:

May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%

df_3:

May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%

Expected merged result:

May 15, 2017;  1,901.00;0.1%;  2,902.00;1000000;0.2%;   3,903.00;2000000;0.3%   

Below is the cleanest, most comprehensible way of merging multiple dataframes if complex queries aren't involved.

Simply merge with DATE as the key, using the OUTER method (to get all the data).

import pandas as pd
from functools import reduce

df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
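The sample data in the question is semicolon-separated, has no header row, and uses commas as thousands separators, so the read call may need a few extra arguments. The following is only a sketch under those assumptions: the column names (DATE, VALUE1, CHANGE1) and the in-memory "file" are hypothetical, chosen for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the rows are copied from the df_1 sample in the question.
raw = """May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%
"""

# Column names are assumptions: the sample has no header row.
df_sample = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    header=None,
    names=['DATE', 'VALUE1', 'CHANGE1'],
    thousands=',',  # parse "1,200.00" as 1200.0
)

# Parse the dates explicitly so that merging compares real dates, not strings.
df_sample['DATE'] = pd.to_datetime(df_sample['DATE'], format='%b %d, %Y')
print(df_sample)
```

The percentage column is left as text here; whether to convert it depends on what you do with it later.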

So, basically, load all the files you have as dataframes. Then merge them using the merge function together with reduce.

# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]

You can add as many dataframes as you like to the list above. This is the good part about this method: no complex queries are involved.

To keep the values that belong to the same date, you need to merge on the DATE column.

df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['DATE'],
                                            how='outer'), data_frames)
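To see what reduce is doing here, the following sketch runs the same call on three toy dataframes and compares it with the hand-written nested merge. The values and the VALUE1/VALUE2/VALUE3 column names are made up for illustration.

```python
from functools import reduce
import pandas as pd

# Toy frames with made-up values; only the DATE column is shared.
df1 = pd.DataFrame({'DATE': ['2017-05-15', '2017-05-17'], 'VALUE1': [1901.0, 1000.0]})
df2 = pd.DataFrame({'DATE': ['2017-05-15', '2017-05-16'], 'VALUE2': [2902.0, 2000.0]})
df3 = pd.DataFrame({'DATE': ['2017-05-15', '2017-05-16'], 'VALUE3': [3903.0, 3000.0]})
data_frames = [df1, df2, df3]

# reduce folds the list from the left: merge df1 with df2, then the result with df3.
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'),
    data_frames,
)

# The equivalent nested call, written out by hand.
chained = pd.merge(
    pd.merge(df1, df2, on=['DATE'], how='outer'),
    df3, on=['DATE'], how='outer',
)

print(df_merged)
print(df_merged.equals(chained))
```

With how='outer', every date that appears in any frame gets a row, and columns from frames that lack that date hold NaN.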

# To fill values that are missing from rows of the merged dataframe, chain fillna() with the desired placeholder string:

df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['DATE'],
                                            how='outer'), data_frames).fillna('void')

  • Now, the values from the same date are on the same row.
  • You can use fillna() to fill in data that is missing from different frames for different columns.
  • Then write the merged data to a csv file if desired.

df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)

This should give you

DATE VALUE1 VALUE2 VALUE3 ....
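Note that how='outer' keeps every date found in any dataframe. The question asks only for dates present in every dataframe, and for that the same reduce pattern can be used with how='inner'. The sketch below uses the dates and first value column from the question's samples; the column names value_1, value_2 and value_3 are assumptions, and the other columns are omitted for brevity.

```python
from functools import reduce
import pandas as pd

# Dates and first value column from the question's samples (other columns omitted).
df_1 = pd.DataFrame({
    'date': ['May 19, 2017', 'May 18, 2017', 'May 17, 2017', 'May 15, 2017'],
    'value_1': [1200.0, 1100.0, 1000.0, 1901.0],
})
df_2 = pd.DataFrame({
    'date': ['May 20, 2017', 'May 18, 2017', 'May 16, 2017', 'May 15, 2017'],
    'value_2': [2200.0, 2100.0, 2000.0, 2902.0],
})
df_3 = pd.DataFrame({
    'date': ['May 21, 2017', 'May 17, 2017', 'May 16, 2017', 'May 15, 2017'],
    'value_3': [3200.0, 3100.0, 3000.0, 3903.0],
})

# An inner join keeps only the dates that appear in every dataframe.
common = reduce(
    lambda left, right: pd.merge(left, right, on='date', how='inner'),
    [df_1, df_2, df_3],
)
print(common)
```

Only May 15, 2017 appears in all three samples, so the result is the single row shown as the expected output in the question.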