R: Replace multiple values in multiple columns of a data frame with NA

Question:

I am trying to achieve something similar to this question, but with multiple values that must be replaced by NA, and on a large dataset.

df <- data.frame(name = rep(letters[1:3], each = 3), foo = 1:9, var1 = 1:9, var2 = rep(3:5, each = 3))

This produces the following data frame:

df
  name foo var1 var2
1    a   1    1    3
2    a   2    2    3
3    a   3    3    3
4    b   4    4    4
5    b   5    5    4
6    b   6    6    4
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

I would like to replace all occurrences of, say, 3 and 4 with NA, but only in the columns whose names start with "var".

I know that I can combine [] operators to get the result I want:

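# [[:alnum:]] is the POSIX class; a bare [:alnum:] only matches the literal characters ":", "a", "l", "n", "u", "m"
df[, grep("^var[[:alnum:]]?", colnames(df))][
        df[, grep("^var[[:alnum:]]?", colnames(df))] == 3 |
        df[, grep("^var[[:alnum:]]?", colnames(df))] == 4
   ] <- NA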

df
  name foo var1 var2
1    a   1    1   NA
2    a   2    2   NA
3    a   3   NA   NA
4    b   4   NA   NA
5    b   5    5   NA
6    b   6    6   NA
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

My questions are:


  1. Is there an efficient way to do this, given that my actual dataset has about 100,000 rows and 400 of its 500 variables start with "var"? The double-bracket technique seems (subjectively) slow on my computer.
  2. How would I approach the problem if, instead of 2 values (3 and 4), I had a long list of, say, 100 different values to replace with NA? Is there a way to specify multiple values without having to write a clumsy series of conditions separated by the | operator?


You can also use replace:

sel <- grepl("^var", names(df))  # TRUE for columns whose names start with "var"
df[sel] <- lapply(df[sel], function(x) replace(x, x %in% 3:4, NA))
df

#  name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5
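
The same pattern should extend to a long list of values, since %in% accepts a vector of any length and so avoids chaining conditions with |. Here is a minimal sketch of that idea; the helper name replace_with_na and the bad_values vector are made up for illustration and are not part of the original answer:

```r
# Hypothetical helper: set every value found in `values` to NA, but only
# in columns whose names match `pattern`.
replace_with_na <- function(data, values, pattern = "^var") {
  sel <- grepl(pattern, names(data))
  data[sel] <- lapply(data[sel], function(x) replace(x, x %in% values, NA))
  data
}

# Same example data as in the question
df <- data.frame(name = rep(letters[1:3], each = 3), foo = 1:9,
                 var1 = 1:9, var2 = rep(3:5, each = 3))

bad_values <- c(3, 4, 8)  # illustrative; could just as well hold 100 values
out <- replace_with_na(df, bad_values)
out
```

Because the values live in an ordinary vector, they could also be read from a file or built programmatically rather than typed out.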

Some quick benchmarking on a million-row sample of data suggests this is quicker than the other answers.
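
For anyone who wants to check this on their own machine, a rough timing comparison could look like the sketch below. The simulated data, its dimensions (scaled down from the question's 100,000 x 500 so it runs quickly), the value range 1 to 9, and the seed are all assumptions for illustration; timings will vary by machine and R version.

```r
# Simulated data: 50 "var" columns of integers plus one non-"var" column.
set.seed(1)
n_rows <- 10000
n_vars <- 50
big <- as.data.frame(matrix(sample(1:9, n_rows * n_vars, replace = TRUE),
                            nrow = n_rows))
names(big) <- paste0("var", seq_len(n_vars))
big$name <- sample(letters, n_rows, replace = TRUE)

sel <- grepl("^var", names(big))

# Approach from the question: index with a logical matrix built from == and |
by_index <- big
t_index <- system.time(
  by_index[sel][by_index[sel] == 3 | by_index[sel] == 4] <- NA
)

# Approach from the answer: column-wise replace() with %in%
by_lapply <- big
t_lapply <- system.time(
  by_lapply[sel] <- lapply(by_lapply[sel], function(x) replace(x, x %in% 3:4, NA))
)

# Elapsed seconds for each approach
print(c(index = t_index[["elapsed"]], lapply = t_lapply[["elapsed"]]))

# Both approaches should produce the same data frame
stopifnot(isTRUE(all.equal(by_index, by_lapply)))
```

A single system.time call is noisy, so repeating each measurement several times would give a more reliable comparison.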