How to vectorize R strsplit?

Question:

When writing functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is because strsplit returns a list. Is there a way to vectorize the process, so that the function produces the correct result for each element of the input?

For example, to count the lengths of words in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, I would like something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant part of the input vector.

Answer:

In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead:

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
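
As a hedged aside, base R offers two other ways to get one length per list element. The sketch below is not part of the original answer; it assumes R 3.2.0 or later (for lengths()), and the variable names are only illustrative.

```r
words <- c("a", "quick", "brown", "fox")

# strsplit() returns a list with one character vector per input string
chars <- strsplit(words, "")

# vapply() iterates like lapply() but enforces an integer(1) result per element
len_vapply <- vapply(chars, length, integer(1))

# lengths() returns the length of each list element directly (R >= 3.2.0)
len_lengths <- lengths(chars)

print(len_vapply)
print(len_lengths)
```

Both calls return a plain integer vector, so no as.numeric() conversion is needed afterwards.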

Or else use an l*ply family function from plyr (load the package first with library(plyr)). For instance:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that I have all the words, we can do our counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03

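The vectorized function and lapply are considerably faster than the original sapply version: about 0.04 s elapsed for nchar and 0.8 s for lapply, against 2.73 s for sapply. The plyr versions were the slowest in this run (17.30 s for laply and 8.03 s for ldply). All solutions return the same answer (as seen by the summary output).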
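
For readers who want to repeat the comparison without downloading the text, here is a minimal sketch that uses randomly generated words instead. The fake_words vector and its size are assumptions made for illustration, so the timings will differ from those reported above and will vary by machine.

```r
# Synthetic stand-in for the word vector (hypothetical data, not the Ulysses text)
set.seed(1)
fake_words <- vapply(
  sample(1:10, 20000, replace = TRUE),
  function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
  character(1)
)

# Each expression computes the same per-word character counts
t_sapply <- system.time(
  r_sapply <- sapply(fake_words, function(x) length(strsplit(x, "")[[1]]),
                     USE.NAMES = FALSE)
)
t_lapply <- system.time(
  r_lapply <- as.numeric(lapply(strsplit(fake_words, ""), length))
)
t_nchar <- system.time(r_nchar <- nchar(fake_words))

# Confirm the approaches agree before comparing timings
stopifnot(all(r_sapply == r_lapply), all(r_lapply == r_nchar))

# One row per approach: user, system, and elapsed seconds
print(rbind(sapply = t_sapply, lapply = t_lapply, nchar = t_nchar)[, 1:3])
```

Checking that the results agree before reading the timings guards against comparing functions that do not compute the same thing.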

Apparently the latest version of plyr is faster (these timings were taken with a slightly older version).
