Performing independent operations on the same DataFrame in Spark in parallel
Let's say I have a Spark DataFrame with the following schema:
root
 |-- prob: Double
 |-- word: String
I'd like to randomly select two different words from this DataFrame, but I'd like to perform this action X times, so at the end I'll have X tuples of words selected at random, and of course every selection is independent of the others. How do I accomplish this?
Example:
Let's say this is my dataset:
[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]
where the first number is prob and the second is the word. For X=5 the output could be:
1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red
As they are independent actions, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.
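To make the intended semantics concrete, here is a plain-Scala sketch of the task being asked about (it ignores the prob column and treats all words as equally likely; drawPair and drawPairs are hypothetical helper names, not part of any answer below):

```scala
import scala.util.Random

// Hypothetical helper: draw one pair of two DISTINCT words
// uniformly at random from the vocabulary.
def drawPair(words: Seq[String], rng: Random): (String, String) = {
  val shuffled = rng.shuffle(words)
  (shuffled(0), shuffled(1))
}

// X independent draws: pairs may repeat across draws,
// but the two words inside one pair are always different.
def drawPairs(words: Seq[String], x: Int, rng: Random): Seq[(String, String)] =
  Seq.fill(x)(drawPair(words, rng))

val words = Seq("blue", "yellow", "red", "green")
val pairs = drawPairs(words, 5, new Random(42))
```

Since each draw reshuffles independently, duplicate pairs across draws (like 2 and 3 in the example above) are possible by design.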
1) You can use one of these DataFrame methods:
- randomSplit(weights: Array[Double], seed: Long)
- randomSplitAsList(weights: Array[Double], seed: Long)
- or sample(withReplacement: Boolean, fraction: Double)
and then take the first two rows.
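Roughly, sample(withReplacement = false, fraction) keeps each row independently with probability fraction, so this approach can be sketched in plain Scala like this (rows, fraction, and pair are illustrative names, not from the original answer; note the sample may occasionally contain fewer than two rows):

```scala
import scala.util.Random

val rng = new Random(7)
val rows = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))

// Bernoulli-style sampling, like sample(withReplacement = false, fraction):
// keep each row independently with probability `fraction`.
val fraction = 0.5
val sampled = rows.filter(_ => rng.nextDouble() < fraction)

// Take the first two sampled words (may be fewer than two if the sample was small).
val pair = sampled.take(2).map(_._2)
```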
2) Shuffle the rows and take the first two of them.
import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n) // here n = 2 for one pair of words
3) Or you can use the takeSample method of the RDD and then convert the result to a DataFrame:
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]
For example:
dataframe.rdd.takeSample(true, 1000).toDF() // pass withReplacement = false to avoid duplicate rows
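For intuition, takeSample with withReplacement = false behaves like a shuffle followed by take, which can be sketched without Spark (takeSampleNoReplacement is a hypothetical name used only for this illustration):

```scala
import scala.util.Random

// Roughly what takeSample(withReplacement = false, num) does:
// an exact-size uniform sample without replacement.
def takeSampleNoReplacement[T](xs: Seq[T], num: Int, rng: Random): Seq[T] =
  rng.shuffle(xs).take(num)

val words = Seq("blue", "yellow", "red", "green")
val sample = takeSampleNoReplacement(words, 2, new Random(1))
```

Sampling without replacement is what guarantees the two words in each tuple are different, matching the constraint in the question.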