Why is the built-in lm function so slow in R?
I always thought that the lm function was extremely fast in R, but as this example would suggest, the closed-form solution computed with solve is way faster.
data <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
X <- cbind(1, data$x1, data$x2)   # design matrix with an explicit intercept column

library(microbenchmark)
microbenchmark(
  solve(t(X) %*% X, t(X) %*% data$y),  # closed-form solution via the normal equations
  lm(y ~ ., data = data)               # full lm() call, formula and all
)
Can someone explain to me whether this toy example is simply a bad example, or whether lm is actually this slow?
EDIT: As suggested by Dirk Eddelbuettel, the comparison is unfair because lm needs to resolve the formula, so it is better to benchmark against lm.fit, which doesn't need to resolve a formula:
microbenchmark(
  solve(t(X) %*% X, t(X) %*% data$y),  # coefficients only
  lm.fit(X, data$y)                    # QR-based fit on the prebuilt design matrix
)
Unit: microseconds
expr min lq mean median uq max neval cld
solve(t(X) %*% X, t(X) %*% data$y) 99.083 108.754 125.1398 118.0305 131.2545 236.060 100 a
lm.fit(X, y) 125.136 136.978 151.4656 143.4915 156.7155 262.114 100 b
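(For what it's worth, both calls should agree numerically; a quick sanity check, not part of the original benchmark and reusing the X and data objects defined above, could look like this:)

# Sketch of a sanity check: both approaches should recover the same
# coefficients up to floating-point error.
b_solve <- solve(t(X) %*% X, t(X) %*% data$y)
b_lmfit <- lm.fit(X, data$y)$coefficients
all.equal(as.numeric(b_solve), as.numeric(b_lmfit))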
You are overlooking that
- solve() only returns your parameters
- lm() returns you a (very rich) object with many components for subsequent analysis, inference, plots, ...
- the main cost of your lm() call is not the projection but the resolution of the formula y ~ . from which the model matrix needs to be built (as sketched below).
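To see that concretely, here is a minimal sketch (not from the original answer) that times the formula/model-matrix step separately from the fit itself, reusing the data and X objects defined in the question:

library(microbenchmark)
# Timing the pieces separately: building the model matrix from the formula
# accounts for a large share of what lm() spends its time on here.
microbenchmark(
  model_matrix = model.matrix(y ~ ., data = data),  # formula resolution + design matrix
  fit_only     = lm.fit(X, data$y),                 # QR fit on a prebuilt matrix
  full_lm      = lm(y ~ ., data = data)             # both steps, plus a rich result object
)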
To illustrate Rcpp we wrote a few variants of a function fastLm() doing more of what lm() does (i.e. a bit more than lm.fit() from base R) and measured it. See e.g. this benchmark script, which clearly shows that the dominant cost for smaller data sets is in parsing the formula and building the model matrix.
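If RcppArmadillo is installed, its fastLm() can be used to see the same effect from the user side; the sketch below assumes the package exposes both a matrix interface, fastLm(X, y), and a formula interface, fastLm(formula, data), and reuses the objects defined above:

library(RcppArmadillo)
library(microbenchmark)
# Hedged sketch: the matrix interface skips formula parsing, while the
# formula interface pays for it, which is where most of lm()'s extra time goes.
microbenchmark(
  fastLm(X, data$y),            # matrix interface: no formula to resolve
  fastLm(y ~ ., data = data),   # formula interface: parses the formula first
  lm(y ~ ., data = data)
)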
In short, you are doing the Right Thing by benchmarking, but you are not doing it all that correctly: you are trying to compare what is mostly incomparable, namely a subset with a much larger task.