Compute sample statistics for a data vector with ties that is stored as a frequency table

Problem description:

I am trying to get some summary statistics (mean, variance and quantiles) from a data vector with tied values. The data are stored as a frequency distribution table: the unique data values var and the number of ties frequency.

I know I could use the rep function to first expand the vector to its full form:

xx <- rep(mydata$var, mydata$frequency)

and then apply the standard functions:

mean(xx)
var(xx)
quantile(xx)

But the frequencies are very large and there are many unique values, which makes this approach really slow. Is there a way to compute these statistics directly from var and frequency?

set.seed(0)
x <- runif(10)                ## unique data values
k <- sample.int(5, 10, TRUE)  ## frequency

n <- sum(k)
xx <- rep.int(x, k)           ## "expanded" data

#################
## sample mean ##
#################

mean(xx)  ## using `xx`
#[1] 0.6339458

mu <- c(crossprod(x, k)) / n  ## using `x` and `k`
#[1] 0.6339458

#####################
## sample variance ##
#####################

var(xx) * (n - 1) / n  ## using `xx`; rescaled to the divisor-n (biased) variance
#[1] 0.06862544

v <- c(crossprod(x ^ 2, k)) / n - mu * mu  ## using `x` and `k`; multiply by n / (n - 1) to match `var(xx)`
#[1] 0.06862544
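
The two lines above use the divisor n. As a minimal sketch (the names v_biased and v_sample are illustrative and not part of the original code), the same quantity can be written as a weighted sum of squared deviations, which also gives the divisor n - 1 variance that var() returns:

```r
## same example data as above
set.seed(0)
x <- runif(10)                ## unique data values
k <- sample.int(5, 10, TRUE)  ## frequency
n <- sum(k)

mu <- sum(x * k) / n          ## weighted mean

## weighted sum of squared deviations from the mean
ss <- sum(k * (x - mu) ^ 2)

v_biased <- ss / n            ## divisor n, matches `v` above
v_sample <- ss / (n - 1)      ## divisor n - 1, matches `var(rep.int(x, k))`
```

Working with deviations from mu, rather than subtracting mu * mu from the mean of squares, tends to lose less precision when the variance is small relative to the mean.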

Computing quantiles is much more involved, but doable. We first need to understand how quantiles are computed in the standard way.

xx <- sort(xx)
pp <- seq(0, 1, length.out = n)
plot(pp, xx); abline(v = pp, col = 8, lty = 2)

The standard quantile computation is a linear interpolation problem. However, when data have ties, we can clearly see that there are "runs" (of the same value) and "jumps" (between two values) in the plot. Linear interpolation is only needed on "jumps", while on "runs" the quantiles are just the run values.
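
To make "runs" and "jumps" concrete, here is a small sketch with made-up values (three unique values with counts 2, 1 and 3; these numbers are illustrative only). It places the n sorted data points on an equally spaced grid over [0, 1] and records where each run starts and ends:

```r
x <- c(1, 2, 3)      ## toy unique values, already sorted
k <- c(2, 1, 3)      ## toy frequencies

n <- sum(k)          ## 6 data points in total
d <- 1 / (n - 1)     ## grid spacing on [0, 1]

r <- (cumsum(k) - 1) * d   ## right end of each run
l <- r - (k - 1) * d       ## left end of each run

data.frame(x, k, l, r)
```

A run with k = 1 collapses to a single point (l equals r), and each gap between the r of one row and the l of the next row is a jump of width d.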

The following function finds quantiles using only x and k. For demonstration purposes there is an argument verbose. If TRUE, it will produce a plot and print a data frame containing information on the "runs" (and "jumps").

find_quantile <- function (x, k, prob = seq(0, 1, length.out = 5), verbose = FALSE) {

  if (is.unsorted(x)) {
    ind <- order(x); x <- x[ind]; k <- k[ind]
    }

  m <- length(x)     ## number of unique values
  n <- sum(k)        ## number of data
  d <- 1 / (n - 1)   ## break [0, 1] into (n - 1) intervals

  ## the right and left end of each run
  r <- (cumsum(k) - 1) * d
  l <- r - (k - 1) * d

  if (verbose) {

    breaks <- seq(0, 1, d)
    plot(r, x, "n", xlab = "prob (p)", ylab = "quantile (xq)", xlim = c(0, 1))
    abline(v = breaks, col = 8, lty = 2)

    ## sketch each run
    segments(l, x, r, x, lwd = 3)

    ## sketch each jump
    segments(r[-m], x[-m], l[-1], x[-1], lwd = 3, col = 2)

    ## sketch `prob`
    abline(v = prob, col = 3)

    print( data.frame(x, k, l, r) )
    }

  ## initialize the vector of quantiles 
  xq <- numeric(length(prob))

  run <- rbind(l, r)
  i <- findInterval(prob, run, rightmost.closed = TRUE)

  ## odd integers in `i` means that `prob` lies on runs
  ## quantiles on runs are just run values
  on_run <- (i %% 2) != 0
  run_id <- (i[on_run] + 1) / 2
  xq[on_run] <- x[run_id]

  ## even integers in `i` means that `prob` lies on jumps
  ## quantiles on jumps are linear interpolations
  on_jump <- !on_run
  jump_id <- i[on_jump] / 2
  xl <- x[jump_id]      ## x-value to the left of the jump
  xr <- x[jump_id + 1]  ## x-value to the right of the jump
  pl <- r[jump_id]      ## percentile to the left of the jump
  pr <- l[jump_id + 1]  ## percentile to the right of the jump
  p  <- prob[on_jump]   ## probability on the jump
  ## evaluate the line `(pl, xl) -- (pr, xr)` at `p`
  xq[on_jump] <- (xr - xl) / (pr - pl) * (p - pl) + xl

  xq
  }

Applying the function to the example data above with verbose = TRUE gives:

result <- find_quantile(x, k, prob = seq(0, 1, length.out = 5), verbose = TRUE)

#           x k         l         r
#1  0.2016819 4 0.0000000 0.1111111
#2  0.2655087 2 0.1481481 0.1851852
#3  0.3721239 1 0.2222222 0.2222222
#4  0.5728534 4 0.2592593 0.3703704
#5  0.6291140 2 0.4074074 0.4444444
#6  0.6607978 5 0.4814815 0.6296296
#7  0.8966972 1 0.6666667 0.6666667
#8  0.8983897 3 0.7037037 0.7777778
#9  0.9082078 2 0.8148148 0.8518519
#10 0.9446753 4 0.8888889 1.0000000

Each row of the data frame is a "run". x gives the run value, k is the run length, and l and r are the left and right percentiles of the run. In the figure, "runs" are drawn as black horizontal lines.

Information of "jumps" is implied by the r, x values of a row and the l, x values of the next row. In the figure, "jumps" are drawn in red lines.
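
As a worked example, take prob = 0.25. In the printed table it falls between the r of row 3 (0.2222222) and the l of row 4 (0.2592593), so it lies on the jump between those two rows. The sketch below redoes the interpolation by hand with values copied from the table; the name xq_025 is illustrative, and the fractions 6 / 27 and 7 / 27 rely on n = 28, which is the sum of the k column:

```r
## values copied from rows 3 and 4 of the printed data frame
xl <- 0.3721239   ## x-value to the left of the jump (row 3)
xr <- 0.5728534   ## x-value to the right of the jump (row 4)
pl <- 6 / 27      ## r of row 3, since d = 1 / (28 - 1)
pr <- 7 / 27      ## l of row 4
p  <- 0.25        ## the probability of interest

## evaluate the line `(pl, xl) -- (pr, xr)` at `p`
xq_025 <- (xr - xl) / (pr - pl) * (p - pl) + xl
xq_025
```

The point sits three quarters of the way across the jump, so the result is xl plus 0.75 times the gap xr - xl, which is about 0.52267.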

The vertical green lines mark the prob values we supplied.

The computed quantiles are:

result
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753

quantile(xx, names = FALSE)
#[1] 0.2016819 0.5226710 0.6607978 0.8983897 0.9446753
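
As an independent cross-check, the same type-7 quantiles (the default of quantile()) can also be obtained by looking up order statistics through the cumulative counts. This is a different technique from the run/jump interpolation above, and freq_quantile is a hypothetical helper that is not part of the original answer. It is a sketch that assumes non-negative integer counts with a positive total:

```r
## Hypothetical helper: type-7 quantiles from unique values `x` and
## counts `k`, without expanding the data.
freq_quantile <- function (x, k, prob = seq(0, 1, length.out = 5)) {
  ord <- order(x)
  x <- x[ord]; k <- k[ord]
  cs <- cumsum(k)            ## cs[i] = number of data points <= x[i]
  n <- cs[length(cs)]        ## total number of data points
  h <- (n - 1) * prob + 1    ## fractional 1-based position in the sorted data
  lo <- floor(h)
  hi <- pmin(lo + 1, n)      ## clamp so that prob = 1 stays in range
  ## the j-th order statistic is x[i], where i is the first index with cs[i] >= j
  x_lo <- x[findInterval(lo - 1, cs) + 1]
  x_hi <- x[findInterval(hi - 1, cs) + 1]
  ## linear interpolation between the two neighbouring order statistics
  x_lo + (h - lo) * (x_hi - x_lo)
}

## compare against the expanded data on the example from above
set.seed(0)
x <- runif(10)
k <- sample.int(5, 10, TRUE)
all.equal(freq_quantile(x, k), quantile(rep.int(x, k), names = FALSE))
```

Both approaches cost roughly a sort of the unique values plus a lookup per requested probability, so neither depends on the size of the expanded vector.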