How to vectorize R strsplit?
When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is because strsplit returns a list. Is there a way to vectorize the process, so that the function produces the correct result for each element of the input?
For example, to count the lengths of words in a character vector:
words <- c("a","quick","brown","fox")
> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)
> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only
> sapply(words,function (x) length(strsplit(x,"")[[1]]))
    a quick brown   fox
    1     5     5     3
# Success, but potentially very slow
Ideally, something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant part of the input vector.
In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead:
> nchar(words)
[1] 1 5 5 3
More generally, take advantage of the fact that strsplit returns a list and use lapply:
> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
Or else use one of the l*ply family of functions from plyr. For instance:
> laply(strsplit(words,""), length)
[1] 1 5 5 3
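The "split, then measure each list element" pattern above can also be written with base R helpers alone. The following is a minimal sketch, not part of the original answer; it assumes R 3.2.0 or later (for lengths()) and reuses the words vector from the question.

```r
words <- c("a", "quick", "brown", "fox")

# strsplit() returns a list with one character vector per input element
pieces <- strsplit(words, "")

# lengths() gives the length of every list element as an integer vector
n_lengths <- lengths(pieces)

# vapply() iterates like sapply() but enforces that each result is one integer
n_vapply <- vapply(pieces, length, integer(1))

print(n_lengths)  # [1] 1 5 5 3
print(n_vapply)   # [1] 1 5 5 3
```

Both calls return a plain integer vector, so no as.numeric() conversion is needed afterwards.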
In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))
Now that I have all the words, we can do our counts:
> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 4.000 4.666 6.000 69.000
user system elapsed
2.65 0.03 2.73
> # vectorized function
> system.time(print(summary(nchar(joyce))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 4.000 4.666 6.000 69.000
user system elapsed
0.05 0.00 0.04
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 4.000 4.666 6.000 69.000
user system elapsed
0.8 0.0 0.8
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 4.000 4.666 6.000 69.000
user system elapsed
17.20 0.05 17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
V1
Min. : 0.000
1st Qu.: 3.000
Median : 4.000
Mean : 4.666
3rd Qu.: 6.000
Max. :69.000
user system elapsed
7.97 0.00 8.03
The vectorized function (0.04 s elapsed) and lapply (0.8 s) are considerably faster than the original sapply version (2.73 s), while both plyr functions were slower in this run (17.30 s for laply, 8.03 s for ldply). All solutions return the same answer (as seen by the summary output).
Apparently the latest version of plyr is faster (these timings were taken with a slightly older version).
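To repeat this kind of comparison without downloading the book, something like the sketch below could be used. It is an illustration only: the synthetic fake_words vector, its size, and the seed are assumptions made here, not data from the original test, and the timings will differ by machine and R version.

```r
set.seed(1)

# Synthetic stand-in for the book text: random "words" of 1 to 10 letters
n_words <- 20000
word_len <- sample(1:10, n_words, replace = TRUE)
fake_words <- vapply(
  word_len,
  function(k) paste(sample(letters, k, replace = TRUE), collapse = ""),
  character(1)
)

# Time the original per-element approach, nchar(), and strsplit() + lengths()
t_sapply <- system.time(
  r_sapply <- sapply(fake_words, function(x) length(strsplit(x, "")[[1]]))
)
t_nchar <- system.time(r_nchar <- nchar(fake_words))
t_lengths <- system.time(r_lengths <- lengths(strsplit(fake_words, "")))

# All three approaches should agree on the counts
stopifnot(all(r_sapply == r_nchar), all(r_lengths == r_nchar))

# Show user, system, and elapsed time for each approach
print(rbind(sapply = t_sapply, nchar = t_nchar, lengths = t_lengths)[, 1:3])
```

Checking that the results agree before comparing timings guards against benchmarking approaches that silently compute different things.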