Performing independent operations on the same DataFrame in Spark in parallel
Let's say I have a Spark DataFrame with the following schema:
root
 |-- prob: Double
 |-- word: String
I'd like to randomly select two different words from this DataFrame, but I'd like to perform this action X times, so at the end I'll have X tuples of words selected at random, and of course every selection is independent of the others. How do I accomplish this?
Example:
Let's say this is my dataset:
[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]
where the first number is prob and the second is the word. For X=5 the output could be:
1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red
As they are independent actions, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.
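To make the intended semantics concrete, here is a plain-Scala sketch of the task being asked about (it ignores the prob column and treats all words as equally likely; drawPair and drawPairs are hypothetical helper names, not part of any answer below):

```scala
import scala.util.Random

// Hypothetical helper: draw one pair of two DISTINCT words
// uniformly at random from the vocabulary.
def drawPair(words: Seq[String], rng: Random): (String, String) = {
  val shuffled = rng.shuffle(words)
  (shuffled(0), shuffled(1))
}

// X independent draws: pairs may repeat across draws,
// but the two words inside one pair are always different.
def drawPairs(words: Seq[String], x: Int, rng: Random): Seq[(String, String)] =
  Seq.fill(x)(drawPair(words, rng))

val words = Seq("blue", "yellow", "red", "green")
val pairs = drawPairs(words, 5, new Random(42))
```

Since each draw reshuffles independently, duplicate pairs across draws (like 2 and 3 in the example above) are possible by design.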
1) You can use one of these DataFrame methods:
- randomSplit(weights: Array[Double], seed: Long)
- randomSplitAsList(weights: Array[Double], seed: Long)
- or sample(withReplacement: Boolean, fraction: Double)
and then take the first two rows.
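Roughly, sample(withReplacement = false, fraction) keeps each row independently with probability fraction, so this approach can be sketched in plain Scala like this (rows, fraction, and pair are illustrative names, not from the original answer; note the sample may occasionally contain fewer than two rows):

```scala
import scala.util.Random

val rng = new Random(7)
val rows = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))

// Bernoulli-style sampling, like sample(withReplacement = false, fraction):
// keep each row independently with probability `fraction`.
val fraction = 0.5
val sampled = rows.filter(_ => rng.nextDouble() < fraction)

// Take the first two sampled words (may be fewer than two if the sample was small).
val pair = sampled.take(2).map(_._2)
```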
2) Shuffle the rows and take the first two of them.
import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n) // here n = 2 for one pair of words
3) Or you can use the takeSample method of the RDD and then convert the result to a DataFrame:
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]
For example:
dataframe.rdd.takeSample(true, 1000).toDF() // pass withReplacement = false to avoid duplicate rows
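For intuition, takeSample with withReplacement = false behaves like a shuffle followed by take, which can be sketched without Spark (takeSampleNoReplacement is a hypothetical name used only for this illustration):

```scala
import scala.util.Random

// Roughly what takeSample(withReplacement = false, num) does:
// an exact-size uniform sample without replacement.
def takeSampleNoReplacement[T](xs: Seq[T], num: Int, rng: Random): Seq[T] =
  rng.shuffle(xs).take(num)

val words = Seq("blue", "yellow", "red", "green")
val sample = takeSampleNoReplacement(words, 2, new Random(1))
```

Sampling without replacement is what guarantees the two words in each tuple are different, matching the constraint in the question.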