Creating running frequency counts from a vector in R

Problem description:

Suppose there is a numeric vector that may contain duplicated values:

x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)

I want to create another vector of counts with the following properties:

  1. It has the same length as x.
  2. For each unique value in x, the first occurrence is numbered 1, the second occurrence 2, and so on.

The new vector I want is:

1, 1, 1, 1, 1, 2, 2, 3, 2

I need a fast way of doing this, since x can be very long.

Use ave with seq_along:

> x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
> ave(x, x, FUN = seq_along)
[1] 1 1 1 1 1 2 2 3 2
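
To make it clearer what the ave call is doing, here is a rough base R sketch of the same idea written out by hand: group the positions of x by value, then number the positions within each group. The names groups and counts are just illustrative, and this is not how ave is implemented internally, only an equivalent computation for this example.

```r
x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)

# Positions of each distinct value, e.g. value 2 sits at positions 2, 7, 8
groups <- split(seq_along(x), x)

# Fill in 1, 2, 3, ... at the positions belonging to each group
counts <- integer(length(x))
for (idx in groups) {
  counts[idx] <- seq_along(idx)
}

counts
# [1] 1 1 1 1 1 2 2 3 2
```

Note that this sketch returns an integer vector, whereas ave(x, x, FUN = seq_along) keeps the type of x (double here); the values are the same.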

Another option to consider is data.table. Although it takes a little more work, it might pay off on very long vectors.

Here it is on your example, where it definitely seems like overkill:

library(data.table)

x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)
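# Store the values alongside a row id; keying on id keeps the original order
DT <- data.table(id = sequence(length(x)), x, key = "id")
# Within each group of equal x, number the rows 1..N, then extract that column
DT[, y := sequence(.N), by = x][, y]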
# [1] 1 1 1 1 1 2 2 3 2

But how about a vector 10,000,000 items long?

set.seed(1)
x2 <- sample(100, 1e7, replace = TRUE)

funAve <- function() {
  ave(x2, x2, FUN = seq_along)
}

funDT <- function() {
  DT <- data.table(id = sequence(length(x2)), x2, key = "id")
  DT[, y := sequence(.N), by = x2][, y]
}

identical(funAve(), funDT())
# [1] TRUE

library(microbenchmark)
microbenchmark(funAve(), funDT(), times = 20)
# Unit: seconds
#      expr      min       lq   median       uq      max neval
#  funAve() 6.727557 6.792743 6.827117 6.992609 7.352666    20
#   funDT() 1.967795 2.029697 2.053886 2.070462 2.123531    20
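
As a side note on the data.table approach: more recent versions of data.table also provide rowid(), which computes this kind of within-group running count directly. The sketch below assumes data.table 1.9.8 or later is installed; I have not benchmarked it against the two functions above, so treat it as a convenience rather than a speed claim.

```r
library(data.table)  # assumption: version 1.9.8 or later, which provides rowid()

x <- c(1, 2, 3, 4, 5, 1, 2, 2, 3)

# rowid() numbers each element by its order of appearance among equal values
y <- rowid(x)
y
# [1] 1 1 1 1 1 2 2 3 2
```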