Sampling a small data frame from a large data frame
I am trying to sample a data frame from a given data frame such that there are enough samples from each level of a variable.
This can be achieved by splitting the data frame by level and sampling from each piece.
I thought ddply (data frame to data frame) would do it for me.
Taking a minimal example:
set.seed(1)
data1 <-data.frame(a=sample(c('B0','B1','B2'),100,replace=TRUE),b=rnorm(100),c=runif(100))
> summary(data1$a)
B0 B1 B2
30 32 38
The following commands perform the sampling...
When I type...
data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))
I get the following error:
Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace, :
cannot take a sample larger than the population when 'replace = FALSE'
This error occurs because x inside the ddply function is not a vector but a data frame, so sample(x, 20) tries to draw 20 of its 3 columns rather than 20 of its rows.
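To make the cause of the error concrete, here is a small sketch of how sample() treats a data frame. The data frame demo is a made-up example, not part of the original data; the point is only that length() of a data frame counts columns, so sample() picks columns.

```r
# Hypothetical 5-row, 3-column data frame used only for illustration
demo <- data.frame(a = 1:5, b = letters[1:5], c = runif(5))

n_cols <- length(demo)        # length() of a data frame is its number of columns
picked <- sample(demo, 2)     # draws 2 columns; every row is kept

# Asking for more columns than exist without replacement fails,
# which is the same situation as sample(x, 20) inside ddply
err <- tryCatch(sample(demo, 20, replace = FALSE),
                error = function(e) conditionMessage(e))
print(dim(picked))
print(err)
```

Sampling rows therefore means sampling row indices (for example from seq_len(nrow(demo))) and using them to subset the data frame.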
Does anyone have any idea how to achieve this sampling?
I know one way is to skip ddply and do it in three steps: (1) segregation, (2) sampling, and (3) collation. But I suspect there must be some way to do it with base or plyr functions...
Thanks for your help...
I think what you want is to subset the data frame passed in x using sample:
ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])
But, of course, you still need to take care that the sample size for each piece (in this case 20) is no larger than the smallest subset of your data based on the levels of a.
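If some level might have fewer rows than the requested sample size, one option is to cap the size per piece. The sketch below does this with base R only, following the split, sample, collate steps mentioned in the question. The helper name sample_by_group and its arguments are hypothetical, introduced just for this illustration.

```r
# Hypothetical helper: sample up to n rows from each level of a grouping column
sample_by_group <- function(df, group, n) {
  pieces <- split(df, df[[group]])                # (1) one data frame per level
  sampled <- lapply(pieces, function(x) {
    k <- min(n, nrow(x))                          # never request more rows than exist
    x[sample.int(nrow(x), k), , drop = FALSE]     # (2) sample row indices, keep all columns
  })
  out <- do.call(rbind, sampled)                  # (3) collate the pieces
  rownames(out) <- NULL
  out
}

set.seed(1)
data1 <- data.frame(a = sample(c('B0', 'B1', 'B2'), 100, replace = TRUE),
                    b = rnorm(100), c = runif(100))
data2 <- sample_by_group(data1, "a", 20)
print(table(data2$a))
```

The same min(n, nrow(x)) guard could be used inside the ddply call instead. Whether silently returning fewer rows for a small level is acceptable depends on what the sample is for.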