如何在不损失信息的情况下将因子转换为整数\数字?
将因子转换为数字或整数时,我得到的是底层的层级代码,而不是数字形式的值.
When I convert a factor to a numeric or integer, I get the underlying level codes, not the values as numbers.
f <- factor(sample(runif(5), 20, replace = TRUE))
## [1] 0.0248644019011408 0.0248644019011408 0.179684827337041
## [4] 0.0284090070053935 0.363644931698218 0.363644931698218
## [7] 0.179684827337041 0.249704354675487 0.249704354675487
## [10] 0.0248644019011408 0.249704354675487 0.0284090070053935
## [13] 0.179684827337041 0.0248644019011408 0.179684827337041
## [16] 0.363644931698218 0.249704354675487 0.363644931698218
## [19] 0.179684827337041 0.0284090070053935
## 5 Levels: 0.0248644019011408 0.0284090070053935 ... 0.363644931698218
as.numeric(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
as.integer(f)
## [1] 1 1 3 2 5 5 3 4 4 1 4 2 3 1 3 5 4 5 3 2
我必须求助于paste
以获得真实值:
I have to resort to paste
to get the real values:
as.numeric(paste(f))
## [1] 0.02486440 0.02486440 0.17968483 0.02840901 0.36364493 0.36364493
## [7] 0.17968483 0.24970435 0.24970435 0.02486440 0.24970435 0.02840901
## [13] 0.17968483 0.02486440 0.17968483 0.36364493 0.24970435 0.36364493
## [19] 0.17968483 0.02840901
是否有更好的方法将因子转换为数字?
Is there a better way to convert a factor to numeric?
请参阅 ?factor
:
尤其是
as.numeric
适用于 一个因素是没有意义的,并且可能 通过隐性强制发生.到 将因子f
转换为 近似其原始数值 值,as.numeric(levels(f))[f]
是 推荐和更多 比as.numeric(as.character(f))
.
In particular,
as.numeric
applied to a factor is meaningless, and may happen by implicit coercion. To transform a factorf
to approximately its original numeric values,as.numeric(levels(f))[f]
is recommended and slightly more efficient thanas.numeric(as.character(f))
.
The FAQ on R has similar advice.
为什么as.numeric(levels(f))[f]
的效率比as.numeric(as.character(f))
高?
Why is as.numeric(levels(f))[f]
more efficent than as.numeric(as.character(f))
?
as.numeric(as.character(f))
实际上是as.numeric(levels(f)[f])
,因此您正在将length(x)
值而不是nlevels(x)
值转换为数字.对于水平少的长向量,速度差异最为明显.如果这些值大部分是唯一的,则速度不会有太大差异.无论您进行转换,此操作都不太可能成为代码中的瓶颈,因此不必担心太多.
as.numeric(as.character(f))
is effectively as.numeric(levels(f)[f])
, so you are performing the conversion to numeric on length(x)
values, rather than on nlevels(x)
values. The speed difference will be most apparent for long vectors with few levels. If the values are mostly unique, there won't be much difference in speed. However you do the conversion, this operation is unlikely to be the bottleneck in your code, so don't worry too much about it.
某些时间
library(microbenchmark)
microbenchmark(
as.numeric(levels(f))[f],
as.numeric(levels(f)[f]),
as.numeric(as.character(f)),
paste0(x),
paste(x),
times = 1e5
)
## Unit: microseconds
## expr min lq mean median uq max neval
## as.numeric(levels(f))[f] 3.982 5.120 6.088624 5.405 5.974 1981.418 1e+05
## as.numeric(levels(f)[f]) 5.973 7.111 8.352032 7.396 8.250 4256.380 1e+05
## as.numeric(as.character(f)) 6.827 8.249 9.628264 8.534 9.671 1983.694 1e+05
## paste0(x) 7.964 9.387 11.026351 9.956 10.810 2911.257 1e+05
## paste(x) 7.965 9.387 11.127308 9.956 11.093 2419.458 1e+05