Pandas: Replace NaN with random samples of column values

Question:

I have a DataFrame, df, containing several columns. Some of the values in df are NaN. I want to replace each NaN with a valid value chosen by randomly sampling from the other (non-NaN) values in the same column.

For example, if:

df['work'] = [4, 7, np.nan, 4]

I'd like to replace df['work'][2] with 4 two-thirds of the time and with 7 one-third of the time.
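
To make those probabilities concrete, here is a minimal sketch (the names work, valid and probs are only illustrative). It assumes each non-NaN entry is equally likely to be drawn, so a value's chance of being picked equals its relative frequency among the valid entries:

```python
import numpy as np
import pandas as pd

# The example column from the question, with one missing entry.
work = pd.Series([4, 7, np.nan, 4], name="work")

# Keep only the non-NaN entries; these are the candidates for sampling.
valid = work.dropna()

# Drawing uniformly from the valid entries picks each distinct value
# with probability equal to its relative frequency among them.
probs = valid.value_counts(normalize=True)
print(probs)
```

Here 4 accounts for two of the three valid entries and 7 for one, which is where the two-thirds and one-third come from.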

Here's my attempt:

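import numpy as np
import pandas as pd

def resample_fillna(df):
    # fills the NaNs of df in place, one cell at a time
    for col in df.columns:
        # get series consisting of non-NaN values
        valid_series = df[col].dropna()
        # index labels of the rows where this column is NaN
        nan_indices = df.index[df[col].isna()]
        for nan_index in nan_indices:
            # .iloc[0] pulls the scalar out of the one-element sample;
            # df.loc avoids chained assignment
            df.loc[nan_index, col] = valid_series.sample(n=1).iloc[0]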

I'm thinking there's a much better, more Pythonic way. Any thoughts?

Thanks!

Let's create some fake data, then fill the missing values with values drawn at random from the same column.

import numpy as np
import pandas as pd

np.random.seed(123)
data = np.random.randint(0, 10, (10,5))
df = pd.DataFrame(data, columns=list('abcde'))
# keep values greater than 2; everything else becomes NaN
df = df.where(df > 2)
df

     a    b    c    d    e
0  NaN  NaN  6.0  NaN  3.0
1  9.0  6.0  NaN  NaN  NaN
2  9.0  NaN  NaN  9.0  3.0
3  4.0  NaN  NaN  4.0  NaN
4  7.0  3.0  NaN  4.0  7.0
5  NaN  4.0  8.0  NaN  7.0
6  9.0  3.0  4.0  6.0  NaN
7  5.0  6.0  NaN  NaN  8.0
8  3.0  5.0  NaN  NaN  6.0
9  NaN  4.0  4.0  6.0  3.0

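Now we can use apply to process each column, sampling with replacement from the non-missing values. The sample has the same length as the column, so np.where can pair each NaN position with a drawn value and leave the existing values untouched.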

df.apply(lambda x: np.where(x.isnull(), x.dropna().sample(len(x), replace=True), x))

     a    b    c    d    e
0  5.0  3.0  6.0  6.0  3.0
1  9.0  6.0  4.0  9.0  7.0
2  9.0  5.0  8.0  9.0  3.0
3  4.0  3.0  8.0  4.0  6.0
4  7.0  3.0  4.0  4.0  7.0
5  9.0  4.0  8.0  6.0  7.0
6  9.0  3.0  4.0  6.0  3.0
7  5.0  6.0  4.0  4.0  8.0
8  3.0  5.0  4.0  4.0  6.0
9  9.0  4.0  4.0  6.0  3.0
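
As a variation on the one-liner above, the sketch below draws only as many values as there are NaNs in each column and takes an optional seed so the result is reproducible. The function name fill_nan_by_sampling and its seed parameter are hypothetical, not part of pandas. It also assumes that a column with no valid values should be left unchanged, a case the one-liner does not handle.

```python
import numpy as np
import pandas as pd


def fill_nan_by_sampling(df, seed=None):
    """Return a copy of df in which each NaN is replaced by a random draw
    (with replacement) from the non-missing values of the same column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        valid = out.loc[~mask, col].to_numpy()
        # Nothing to fill, or nothing to sample from: leave the column as is.
        if not mask.any() or valid.size == 0:
            continue
        # Draw exactly one value per missing entry.
        out.loc[mask, col] = rng.choice(valid, size=int(mask.sum()), replace=True)
    return out


# Small usage example built on the column from the question,
# plus an all-NaN column to show the assumed fallback.
example = pd.DataFrame({"work": [4, 7, np.nan, 4], "empty": [np.nan] * 4})
filled = fill_nan_by_sampling(example, seed=0)
print(filled)
```

Because the function works on a copy, the original DataFrame keeps its NaNs. Calling it again with the same seed should reproduce the same fill.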