Update null values in a DataFrame's country code column by matching substrings of country names in Python
I have two data frames, Disaster and CountryInfo. Disaster has a Country_code column that contains some null values, for example:
Disaster:
1.**Country** - **Country_code**
2.India - Null
3.Afghanistan (the) - AFD
4.India - IND
5.United States of America - Null
CountryInfo:
0.**CountryName** - **ISO**
1.India - IND
2.Afganistan - AFD
3.United States - US
Expected result:
Country Country_code
0 India IND
1 Afghanistan AFD
2 India IND
3 United States US
I need to fill in the missing country codes by matching on a substring of the country name. Can anyone suggest a solution for this?
This should do it. First, change the column names with rename so that both dataframes have the same column names. Then use the difflib module and its get_close_matches function to fuzzy-match and replace the Country names. After that, it is a simple matter of merging the dataframes.
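As a quick illustration of what get_close_matches returns before it is applied to a dataframe, here is a minimal sketch using only the standard library. The list iso_names is taken from the CountryInfo sample in the question, and "Peru" is a made-up input added here to show the no-match case.

```python
import difflib

# Names as they appear in the question's CountryInfo sample (including the typo).
iso_names = ["India", "Afganistan", "United States"]

# n=1 keeps only the best candidate; cutoff=0.6 is the library default threshold.
best = difflib.get_close_matches("United States of America", iso_names, n=1, cutoff=0.6)
typo = difflib.get_close_matches("Afghanistan (the)", iso_names, n=1, cutoff=0.6)

# "Peru" is a hypothetical input: nothing is similar enough, so the list is empty.
none = difflib.get_close_matches("Peru", iso_names, n=1, cutoff=0.6)

print(best, typo, none)
```

The function always returns a list, so indexing it with [0] only works when at least one candidate clears the cutoff.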
import pandas as pd
import numpy as np
import difflib
df1 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States of America'],
                    'Country_code' : ['Null', 'AFD', 'IND', 'Null']})
df1
Country Country_code
0 India Null
1 Afghanistan AFD
2 India IND
3 United States of America Null
df2 = pd.DataFrame({'Country' : ['India', 'Afghanistan', 'India', 'United States'],
                    'ISO' : ['IND', 'AFD', 'IND', 'USA']})
df2
Country ISO
0 India IND
1 Afghanistan AFD
2 India IND
3 United States USA
df2.rename(columns={'ISO' : 'Country_code'}, inplace=True)
df2
Country Country_code
0 India IND
1 Afghanistan AFD
2 India IND
3 United States USA
The following code replaces each name in the Country column of df1 with the name from the Country column of df2 that provides the closest match. This is a way of performing a kind of "fuzzy join" on the substrings.
df1['Country'] = df1.Country.map(lambda x: difflib.get_close_matches(x, df2.Country)[0])
df1
Country Country_code
0 India Null
1 Afghanistan AFD
2 India IND
3 United States Null
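Note that the [0] index in the line above raises an IndexError if some name has no close match. A minimal sketch of a more defensive variant follows; closest_or_original is a hypothetical helper name, not part of difflib, and it assumes that keeping the original name is acceptable when nothing matches.

```python
import difflib

def closest_or_original(name, candidates, cutoff=0.6):
    """Return the closest candidate, or the name unchanged if none is close enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Candidate names as in df2; "Peru" is an invented input with no counterpart.
candidates = ["India", "Afghanistan", "United States"]
names = ["India", "United States of America", "Peru"]

normalized = [closest_or_original(n, candidates) for n in names]
print(normalized)
```

The same helper could be passed to the map call, for example df1.Country.map(lambda x: closest_or_original(x, list(df2.Country))).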
Now you can simply merge the dataframes. With how='right' and both columns used as keys, the result keeps every Country/Country_code pair from df2, so the rows whose code was 'Null' in df1 come back with the code from df2.
df1.merge(df2, how='right', on=['Country', 'Country_code'])
Country Country_code
0 Afghanistan AFD
1 India IND
2 India IND
3 United States USA
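As an alternative to the merge, the missing codes can be filled in place so that the rows and order of the first dataframe are preserved. The following is only a sketch under some assumptions: the nulls are real missing values (None/NaN) rather than the string 'Null', the frames use the column names from the question, and lookup_iso and iso_by_name are hypothetical names introduced here.

```python
import difflib
import pandas as pd

# Sample data copied from the question, with None standing in for the nulls.
disaster = pd.DataFrame({
    "Country": ["India", "Afghanistan (the)", "India", "United States of America"],
    "Country_code": [None, "AFD", "IND", None],
})
country_info = pd.DataFrame({
    "CountryName": ["India", "Afganistan", "United States"],
    "ISO": ["IND", "AFD", "US"],
})

# Plain dict from reference name to ISO code.
iso_by_name = dict(zip(country_info["CountryName"], country_info["ISO"]))

def lookup_iso(name):
    """Return the ISO code of the closest reference name, or None if nothing is close."""
    matches = difflib.get_close_matches(name, list(iso_by_name), n=1)
    return iso_by_name[matches[0]] if matches else None

# Only touch rows whose code is missing; existing codes are left as they are.
missing = disaster["Country_code"].isna()
disaster.loc[missing, "Country_code"] = disaster.loc[missing, "Country"].map(lookup_iso)
print(disaster)
```

Unlike the right merge, this keeps the original country names untouched and leaves a row's code empty when no reference name is close enough.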