How to replace outliers with NA using specific value ranges in R?
I have climate data and I'm trying to replace outliers with NA. I'm not using boxplot(x)$out because I have a specific range of values that determines what counts as an outlier.
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)
My dataframe looks like this (I highlighted the values that should be replaced with NA according to the ranges).
So outliers in temp1 and temp2 must be replaced with NA according to temp_range, outliers in wind according to wind_range, and finally outliers in humidity according to humidity_range.
This is what I have so far:
df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)
df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))
#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)
# Function to detect outlier
in_interval <- function(x, interval) {
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}
# Replace outliers according to temp_range
cols <- c('temp1', 'temp2')
df[, cols] <- lapply(df[, cols], function(x) {
  x[in_interval(x, temp_range) == FALSE] <- NA
  x
})
I'm repeating the last part of the code (the replacement) for every range. Is there a way to simplify it so I can avoid all that repetition?
One last thing: if I set cols <- c('wind'), the replacement throws a warning and fills the whole wind column with a constant.
Warning message:
In `[<-.data.frame`(`*tmp*`, , cols, value = list(23.88, 23.93, :
provided 10 variables to replace 1 variables
Any suggestions?
To do it more dynamically, use a dictionary: a data frame that associates each variable with its outlier bounds.
Here I create it in R, but it would be more practical to keep it in a CSV file so you can edit it easily.
df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)
df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))
df_dict <- data.frame(variable = c("temp1", "temp2", "wind", "humidity"),
                      out_low  = c(-15, -15, 0, 0),
                      out_high = c(45, 45, 15, 100),
                      stringsAsFactors = FALSE)
for (var in df_dict$variable) {
  # Look up the bounds for this variable once
  lo <- df_dict$out_low[df_dict$variable == var]
  hi <- df_dict$out_high[df_dict$variable == var]
  # Values outside [lo, hi] become NA
  df[[var]][df[[var]] < lo | df[[var]] > hi] <- NA
}
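As for the warning in the question: with a single column name, df[, cols] most likely drops to a plain numeric vector, so lapply runs over each element and returns one list entry per row, which R then tries to assign as that many variables. Indexing with df[cols] keeps a data frame regardless of how many columns are selected. The sketch below reuses the question's in_interval helper and applies one range per column with Map. The toy data frame and the ranges list are assumptions made up for illustration; they are not the data from the pastebin link.

```r
# Toy data standing in for the climate data (values invented for illustration)
df <- data.frame(
  temp1    = c(20.5, 60.0, -20.0),
  temp2    = c(21.0, 19.5, 44.0),
  wind     = c(3.2, 18.0, -1.0),
  humidity = c(55, 101, 80)
)

# Hypothetical lookup: one c(lower, upper) pair per column name
ranges <- list(
  temp1    = c(-15, 45),
  temp2    = c(-15, 45),
  wind     = c(0, 15),
  humidity = c(0, 100)
)

# Helper from the question: TRUE where x lies inside the closed interval
in_interval <- function(x, interval) {
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}

# df[names(ranges)] stays a data frame even when only one column is selected,
# so Map always receives whole columns rather than single values
df[names(ranges)] <- Map(function(x, r) {
  x[!in_interval(x, r)] <- NA
  x
}, df[names(ranges)], ranges)

# The difference behind the warning:
#   df[, "wind"]  is a plain numeric vector
#   df["wind"]    is a one-column data frame
```

Adding or removing a variable then only means editing the ranges list, and the same call works for a single column such as ranges["wind"].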