How to "fuzzy" match strings when merging two dataframes in pandas

Problem description:

I have two dataframes, df1 and df2:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                    'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame({'Name': ['adam Smith', 'Annie Kim', 'John  Weber', 'Ian Ford'],
                    'gender': ['M', 'F', 'M', 'M']})

I need to join these two dataframes with pandas.merge on the column Name. However, as you can see, there are some slight differences between the Name columns of the two dataframes. Let's assume they refer to the same people. If I simply do:

pd.merge(df1, df2, how='inner', on='Name')

I only get back a dataframe with a single row, the one for 'Ian Ford'.

Does anyone know how to merge these two dataframes? I guess this is a pretty common situation when joining two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.

I would use fuzzywuzzy here:

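from fuzzywuzzy import process

# For each name in df2, take the closest-matching name from df1 as the join key
df2['key'] = df2.Name.apply(lambda x: process.extractOne(x, df1.Name)[0])

# Join on the matched key; Name_x is the df2 spelling, Name_y the df1 spelling
df2.merge(df1, left_on='key', right_on='Name')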
Out[1238]: 
        Name_x gender         key  Age      Name_y
0   adam Smith      M  Adam Smith   43  Adam Smith
1    Annie Kim      F    Anne Kim   21    Anne Kim
2  John  Weber      M  John Weber   55  John Weber
3     Ian Ford      M    Ian Ford   24    Ian Ford
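
Note that taking the single best match always returns something, even for a name that has no real counterpart in df1. As a rough sketch of how a minimum-similarity threshold could guard against that, here is a standard-library variant that uses difflib instead of fuzzywuzzy. The helper names normalize and closest_name and the cutoff of 0.8 are my own assumptions for illustration, and difflib's ratio is not on the same scale as fuzzywuzzy's scores, so the cutoff would need tuning on real data.

```python
from difflib import SequenceMatcher


def normalize(name):
    # Hypothetical helper: lowercase and collapse runs of whitespace
    return " ".join(name.lower().split())


def closest_name(name, candidates, cutoff=0.8):
    """Return the candidate most similar to name, or None if none reaches cutoff.

    cutoff=0.8 is an arbitrary assumption, not a recommended value.
    """
    target = normalize(name)
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, target, normalize(cand)).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= cutoff else None


# Same names as in the question, as plain lists
names_df1 = ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford']
names_df2 = ['adam Smith', 'Annie Kim', 'John  Weber', 'Ian Ford']

# One key per df2 name, or None when nothing is similar enough
keys = [closest_name(n, names_df1) for n in names_df2]
```

With the dataframes above, this could be plugged in as df2['key'] = df2.Name.apply(lambda x: closest_name(x, df1.Name.tolist())). Rows whose key is None then drop out of an inner merge rather than being paired with an unrelated name.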