How to replace outliers with NA in R using specific value ranges?

Question:

I have climate data and I'm trying to replace outliers with NA. I'm not using boxplot(x)$out because I have specific ranges of valid values that define what counts as an outlier.

temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

My data frame looks like this:

[Image: df with outliers]

(I highlighted the values that should be replaced with NA according to the ranges.)

So outliers in temp1 and temp2 must be replaced with NA according to temp_range, outliers in wind according to wind_range, and outliers in humidity according to humidity_range.

This is what I have so far:

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))

#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

#Function to detect outlier
in_interval <- function(x, interval){
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}


#Replace outliers according to temp_range
cols <- c('temp1', 'temp2')
df[, cols] <- lapply(df[, cols], function(x) {

  x[in_interval(x, temp_range)==FALSE] <- NA
  x
})

I'm repeating the last part of the code (the replacement) for every range. Is there a way to simplify it so I can avoid all that repetition?

One last thing: if I set cols <- c('wind'), the replacement throws a warning and replaces the whole wind column with a constant.

Warning message:
In `[<-.data.frame`(`*tmp*`, , cols, value = list(23.88, 23.93,  :
  provided 10 variables to replace 1 variables

Any suggestions?

Answer:

To do it more dynamically, use a dictionary: a data frame that associates a lower and an upper outlier bound with each variable.

Here I create it in R, but it would be more practical to keep it in a CSV file so you can edit it easily.

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))


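# Dictionary of valid bounds: one row per variable
df_dict <- data.frame(variable = c("temp1", "temp2", "wind", "humidity"),
                      out_low = c(-15, -15, 0, 0),
                      out_high = c(45, 45, 15, 100),
                      stringsAsFactors = FALSE)

for (var in df_dict$variable) {
  # Look up the bounds for the current variable
  bounds <- df_dict[df_dict$variable == var, ]
  # Values below out_low or above out_high become NA
  df[[var]][df[[var]] < bounds$out_low | df[[var]] > bounds$out_high] <- NA
}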
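
The loop above does not cover the warning from the question. That warning appears because df[, cols] with a single column name drops to a plain vector, so lapply() iterates over individual values instead of over columns. List-style indexing, df[cols], always returns a data frame and avoids this. Below is a minimal sketch of that idea, reusing in_interval from the question. The toy data and the named list `ranges` are made up for illustration and are not part of the original data set.

```r
# Toy data, made up for illustration (not the asker's data set)
df <- data.frame(
  temp1    = c(20.5, 60.0, -20.0),
  wind     = c(3.2, 18.0, 7.5),
  humidity = c(55, 101, 80)
)

# Hypothetical named list mapping each column to its valid range
ranges <- list(
  temp1    = c(-15, 45),
  wind     = c(0, 15),
  humidity = c(0, 100)
)

# Same helper as in the question
in_interval <- function(x, interval) {
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}

# df[names(ranges)] is always a data frame, so Map() walks over columns
# and pairs each column with its own range.
df[names(ranges)] <- Map(function(x, rng) {
  x[!in_interval(x, rng)] <- NA
  x
}, df[names(ranges)], ranges)

# Single-column case: df[cols] keeps the data frame structure,
# unlike df[, cols], which would drop to a vector.
cols <- "wind"
df[cols] <- lapply(df[cols], function(x) {
  x[!in_interval(x, ranges$wind)] <- NA
  x
})
```

An alternative that keeps the original df[, cols] syntax is to write df[, cols, drop = FALSE] on the right-hand side.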
