Spark RDD's sample method does not work as expected

Question:

I am trying out the "sample" method of RDD on Spark 1.6.1:

scala> val nu = sc.parallelize(1 to 10)
scala> val sp = nu.sample(true, 0.2)
scala> sp.collect.foreach(println(_))
3
8

scala> val sp2 = nu.sample(true, 0.2)
scala> sp2.collect.foreach(println(_))
2
4
7
8
10

I cannot understand why sp2 contains 2, 4, 7, 8, 10. I expected only two numbers to be printed. Is something wrong?

Elaborating on the previous answer: the RDD documentation for sample describes the fraction parameter as follows:

fraction: expected size of the sample as a fraction of this RDD's size
    without replacement: probability that each element is chosen; fraction must be [0, 1]
    with replacement: expected number of times each element is chosen; fraction must be >= 0

'Expected' can have several meanings depending on the context, but one meaning it certainly does not have is 'exact'. The fraction only fixes the average sample size, which is why the actual number of sampled elements varies from call to call.
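To make the difference between "expected" and "exact" concrete, here is a minimal sketch that repeats the sampling with many different seeds and looks at the resulting sizes. It assumes a spark-shell session where `sc` is already defined; the seed range and the names `sizes` and `meanSize` are illustrative choices, not part of the original question.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is already defined.
val nu = sc.parallelize(1 to 10)

// Draw 200 samples, each with a different (arbitrary) seed, and record each sample's size.
val sizes = (1L to 200L).map(seed => nu.sample(true, 0.2, seed).count())

// Individual sizes differ from seed to seed; only their average
// should land near fraction * RDD size = 0.2 * 10 = 2.
val meanSize = sizes.sum.toDouble / sizes.length
println(s"min = ${sizes.min}, max = ${sizes.max}, mean = $meanSize")
```

A single call such as sp2 returning five elements is therefore unusual but perfectly legitimate.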

If you want an absolutely fixed sample size, you can use the takeSample method. The downside is that it returns an array (i.e. not an RDD), which must fit in the driver's main memory:

scala> val nu = sc.parallelize(1 to 10)
scala> // set seed for reproducibility
scala> val sp1 = nu.takeSample(true, 2, 182453)
sp1: Array[Int] = Array(7, 2)

scala> val sp2 = nu.takeSample(true, 2)
sp2: Array[Int] = Array(2, 10)

scala> val sp3 = nu.takeSample(true, 2)
sp3: Array[Int] = Array(4, 6)
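If you still need an RDD afterwards, one possible workaround (a sketch, not something the original answer proposes) is to hand the small array back to Spark with sc.parallelize. It assumes a spark-shell session where `sc` is defined; the seed 42L and the names `exact` and `exactRdd` are arbitrary. It only makes sense while the sample is small enough to pass through the driver.

```scala
// Assumes a spark-shell session, where `sc` (the SparkContext) is already defined.
val nu = sc.parallelize(1 to 10)

// takeSample returns exactly 2 elements to the driver as an Array[Int];
// withReplacement = false avoids drawing the same element twice.
val exact: Array[Int] = nu.takeSample(false, 2, 42L)

// Redistribute the small array as an RDD to continue with RDD operations.
val exactRdd = sc.parallelize(exact)
println(exactRdd.count())
```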