Update null values in a data frame's country code column by matching substrings of country names in Python

Problem description:

I have two data frames, Disaster and CountryInfo. Disaster has a Country_code column that contains some null values, for example:

Disaster:

 Country                  - Country_code
 India                    - Null
 Afghanistan (the)        - AFD
 India                    - IND
 United States of America - Null

CountryInfo:

 CountryName   - ISO
 India         - IND
 Afganistan    - AFD
 United States - US

Expected result:

          Country Country_code
 0          India          IND
 1    Afghanistan          AFD
 2          India          IND
 3  United States           US

I need to fill in the missing country codes by matching on a substring of the country name. Can anyone suggest a solution for this?

This should do it. First, use rename so that both dataframes have the same column names. Then use the difflib module's get_close_matches function to fuzzy-match the country names and replace them. After that, it is a simple matter of merging the dataframes.

import pandas as pd
import numpy as np
import difflib

df1 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States of America'],
                    'Country_code' : ['Null', 'AFD', 'IND', 'Null']})
df1
                    Country Country_code
0                     India         Null
1               Afghanistan          AFD
2                     India          IND
3  United States of America         Null

df2 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States'],
                    'ISO' : ['IND', 'AFD', 'IND', 'USA']})
df2
          Country ISO
0          India  IND
1    Afghanistan  AFD
2          India  IND
3  United States  USA

df2.rename(columns={'ISO' : 'Country_code'}, inplace=True)
df2
         Country Country_code
0          India          IND
1    Afghanistan          AFD
2          India          IND
3  United States          USA

The following code replaces each name in the Country column of df1 with its closest match from the Country column of df2. This is a way of performing a kind of "fuzzy join" on the substrings.

df1['Country'] = df1.Country.map(lambda x: difflib.get_close_matches(x, df2.Country)[0])
df1
         Country Country_code
0          India         Null
1    Afghanistan          AFD
2          India          IND
3  United States         Null
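
Note that indexing the result with [0] raises an IndexError whenever get_close_matches finds nothing above its similarity cutoff (0.6 by default). A minimal sketch of a more defensive variant is shown here; the helper name closest_name and the choice to fall back to the original name are assumptions made for illustration, not part of the answer above.

```python
import difflib

def closest_name(name, candidates, cutoff=0.6):
    """Return the candidate most similar to `name`, or `name` itself if none qualifies.

    `closest_name` is a hypothetical helper; `cutoff` is passed straight to difflib.
    """
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    # Keep the original spelling when nothing is similar enough.
    return matches[0] if matches else name

reference = ['India', 'Afghanistan', 'United States']
cleaned = [closest_name(n, reference)
           for n in ['India', 'United States of America', 'Zzz']]
```

With the dataframes above, the same helper could be applied as df1.Country.map(lambda x: closest_name(x, df2.Country.tolist())). Raising cutoff makes the matching stricter, at the cost of leaving more names unchanged.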

Now you can simply merge the dataframes. Because this is a right merge on both Country and Country_code, the codes come from df2, so the rows whose Country_code was missing in df1 are filled in.

df1.merge(df2, how='right', on=['Country', 'Country_code'])

         Country Country_code
0    Afghanistan          AFD
1          India          IND
2          India          IND
3  United States          USA
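
In this example the merge keys on both columns, so the result is effectively the rows of df2, and its row order can differ from df1. As an alternative sketch, the missing codes can be filled in place from a name-to-code lookup. This assumes the country names have already been normalised by the fuzzy match and that missing codes are stored as the string 'Null', as in the sample data; the names lookup and missing are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                    'Country_code': ['Null', 'AFD', 'IND', 'Null']})
df2 = pd.DataFrame({'Country': ['India', 'Afghanistan', 'United States'],
                    'Country_code': ['IND', 'AFD', 'USA']})

# Build a name -> code lookup; drop duplicate names so the index is unique.
lookup = df2.drop_duplicates('Country').set_index('Country')['Country_code']

# Treat the placeholder string 'Null' and real NaN values as missing (assumption).
missing = df1['Country_code'].isna() | df1['Country_code'].eq('Null')

# Fill only the missing codes; existing codes and the row order stay untouched.
df1.loc[missing, 'Country_code'] = df1.loc[missing, 'Country'].map(lookup)
```

Unlike the merge, this keeps every row of df1 in its original position and never overwrites a code that was already present.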