Median imputation using sapply

Question:

I want to replace missing values in the columns of a data frame. I have written the following code:

MedianImpute <- function(data=data)
     {
      for(i in 1:ncol(data))
        {        
        if(class(data[,i]) %in% c("numeric","integer"))
          {
          if(sum(is.na(data[,i])))
            {
            data[is.na(data[,i]),i] <- 
                          median(data[,i],na.rm = TRUE)
            }
          }
        }
      return(data)
      }

This returns the data frame with the NAs replaced by the column median. I do not want to use a for loop. How can I get the same result using any of the apply functions in R?

This is actually a subtle problem, so it is worth a bit of discussion (IMO). You have a data frame and want to impute medians for the numeric columns only, and the result should, of course, still be a data frame.

The apply(...) function coerces its argument to a matrix first. Since all elements of a matrix must by definition have the same data type, if the original df contains any character or factor columns, the whole matrix is coerced to character when it is passed to apply(...).

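# 1st column of df is a factor
# (stringsAsFactors=TRUE is needed for this in R >= 4.0, where the default is FALSE)
df <- data.frame(a=letters[1:5],x=sample(1:5,5),y=runif(5),stringsAsFactors=TRUE)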
df[3,]$x <- NA
df[5,]$y <- NA
df
#   a  x         y
# 1 a  5 0.5235779
# 2 b  3 0.2142011
# 3 c NA 0.8886608
# 4 d  4 0.4952574
# 5 e  1        NA

# every column arrives as character, so is.numeric(x) is FALSE and nothing is imputed
apply(df,2,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x    y          
# [1,] "a" " 5" "0.5235779"
# [2,] "b" " 3" "0.2142011"
# [3,] "c" NA   "0.8886608"
# [4,] "d" " 4" "0.4952574"
# [5,] "e" " 1" NA         
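To see why nothing was imputed in the apply(...) call, it can help to look at the coercion directly. This is only a minimal sketch: the data frame `d` below is made up for illustration and is not the `df` used above.

```r
# Hypothetical example data: one character column, one numeric column with an NA
d <- data.frame(a = c("p", "q", "r"), x = c(1, NA, 3))

m <- as.matrix(d)      # the same coercion apply() performs on a data frame
is.character(m)        # every cell is now a string
is.numeric(m[, "x"])   # so the numeric column no longer tests as numeric

# Inside apply(), each column is therefore seen as character
res <- apply(d, 2, function(col) is.numeric(col))
res
```

Because the is.numeric() test fails for every column, the median branch is never reached and the NAs pass through untouched.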

sapply(df,FUN=f) passes the columns of df individually to a function f(...), but the result is simplified to a matrix. So, for example, any factors in df will be coerced to integer.

sapply(df,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x         y
# [1,] 1 5.0 0.5235779
# [2,] 2 3.0 0.2142011
# [3,] 3 3.5 0.8886608
# [4,] 4 4.0 0.4952574
# [5,] 5 1.0 0.5094176

So here, df$x and df$y are correct, but look at what happened to df$a: the factor was coerced to numeric by returning its underlying integer codes rather than its labels, which is not what you want!
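The difference between a factor's labels and its integer codes is easy to check in isolation. A small sketch, using a made-up factor `fac` rather than the column from the example:

```r
# Hypothetical factor; levels are sorted alphabetically by default
fac <- factor(c("low", "high", "low"))

levels(fac)         # the labels: "high" "low"
as.integer(fac)     # the internal codes: 2 1 2
as.character(fac)   # the original values: "low" "high" "low"
```

When a factor ends up in a numeric matrix, it is the codes that survive, not the labels.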

lapply(df,FUN=f) returns a list, which can then be converted back to a data frame. This approach gives you the desired result:

data.frame(lapply(df,function(x) {
    if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x}))
# (output from a separate run of the df code above, so the random values differ)
#   a   x         y
# 1 a 1.0 0.3093707
# 2 b 3.0 0.3486391
# 3 c 3.5 0.8292446
# 4 d 5.0 0.7882574
# 5 e 4.0 0.5684483
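For completeness, the lapply idea can be folded back into a function with the same name as the one in the question. This is a sketch under assumptions, not the only way to do it: the helper name `impute_col` and the test data `d` are invented here, and `data[] <- lapply(...)` is used in place of `data.frame(lapply(...))` so that the existing data frame is filled in place.

```r
# Loop-free sketch of MedianImpute; impute_col is a hypothetical helper name
MedianImpute <- function(data) {
  impute_col <- function(col) {
    if (is.numeric(col)) {
      # replace only the missing entries with the column median
      col[is.na(col)] <- median(col, na.rm = TRUE)
    }
    col  # non-numeric columns are returned unchanged
  }
  data[] <- lapply(data, impute_col)  # data[] keeps the data.frame structure
  data
}

# Made-up data for illustration
d <- data.frame(a = c("p", "q", "r", "s"),
                x = c(1L, NA, 3L, 4L),
                y = c(0.5, 1.5, NA, 2.5))
out <- MedianImpute(d)
out
```

Assigning into `data[]` keeps the row names and column types of the original object, which avoids re-running whatever conversions data.frame() would apply to the list.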

I suppose it's debatable whether this is any better than using a loop...