Pandas: Replace NaN with a random sample of column values
I have a DataFrame, df, containing several columns. Some of the values in df are NaN. I want to replace each NaN with a valid value, chosen by randomly sampling from the other values in the same column.
For example, if:
df['work'] = [4, 7, NaN, 4]
I'd like to replace df['work'][2] with 4 two-thirds of the time and with 7 one-third of the time.
Here is my attempt:
def resample_fillna(df):
    for col in df.columns:
        # get series consisting of non-NaN values
        valid_series = df[col].dropna()
        nan_indices = np.argwhere(np.isnan(df[col]))
        for nan_index in nan_indices:
            df[col][nan_index] = valid_series.sample(n=1)
I suspect there is a much better, more Pythonic way. Any thoughts?
Thanks!
Let's create some fake data and then fill the missing values with other random values from the same column.
import numpy as np
import pandas as pd

np.random.seed(123)
data = np.random.randint(0, 10, (10, 5))
df = pd.DataFrame(data, columns=list('abcde'))
# keep values greater than 2; everything else becomes NaN
df = df.where(df > 2)
df
     a    b    c    d    e
0  NaN  NaN  6.0  NaN  3.0
1  9.0  6.0  NaN  NaN  NaN
2  9.0  NaN  NaN  9.0  3.0
3  4.0  NaN  NaN  4.0  NaN
4  7.0  3.0  NaN  4.0  7.0
5  NaN  4.0  8.0  NaN  7.0
6  9.0  3.0  4.0  6.0  NaN
7  5.0  6.0  NaN  NaN  8.0
8  3.0  5.0  NaN  NaN  6.0
9  NaN  4.0  4.0  6.0  3.0
Now we can loop through each column with apply and sample with replacement from the non-missing values.
df.apply(lambda x: np.where(x.isnull(), x.dropna().sample(len(x), replace=True), x))
     a    b    c    d    e
0  5.0  3.0  6.0  6.0  3.0
1  9.0  6.0  4.0  9.0  7.0
2  9.0  5.0  8.0  9.0  3.0
3  4.0  3.0  8.0  4.0  6.0
4  7.0  3.0  4.0  4.0  7.0
5  9.0  4.0  8.0  6.0  7.0
6  9.0  3.0  4.0  6.0  3.0
7  5.0  6.0  4.0  4.0  8.0
8  3.0  5.0  4.0  4.0  6.0
9  9.0  4.0  4.0  6.0  3.0
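The one-liner above draws len(x) samples per column and discards most of them. If you would rather draw only as many values as there are gaps, and make the result reproducible, something like the following might work. This is a rough sketch, not part of the approach above: the function name fill_na_by_sampling, its seed parameter, and the example frame are all made up for illustration, and it assumes a NumPy version that provides np.random.default_rng.

```python
import numpy as np
import pandas as pd


def fill_na_by_sampling(df, seed=None):
    """Return a copy of df in which each NaN is replaced by a random draw
    (with replacement) from the non-missing values of the same column.

    Hypothetical helper for illustration; name and signature are assumptions.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to sample from.
        if not missing.any() or observed.size == 0:
            continue
        # Draw exactly one value per missing cell.
        out.loc[missing, col] = rng.choice(
            observed, size=int(missing.sum()), replace=True
        )
    return out


# Small made-up frame modelled on the example in the question.
example = pd.DataFrame(
    {"work": [4, 7, np.nan, 4], "play": [np.nan, 1.0, 2.0, np.nan]}
)
filled = fill_na_by_sampling(example, seed=0)
```

Because the draws come from the observed values themselves, a value that appears twice as often is twice as likely to be picked, which is the 2/3 versus 1/3 behaviour asked for. Columns that are entirely NaN are left untouched, since there is nothing to sample from.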