PySpark: Randomize rows in a DataFrame

Problem description:

I have a DataFrame and I want to randomize its rows. I tried sampling the data with a fraction of 1, but that didn't work (interestingly, this does work in Pandas).

It works in Pandas because sampling on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes which rows end up in the sample, not their order.
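The difference can be sketched in plain Python (these helper names are hypothetical, chosen only to illustrate the two strategies): a linear-scan sampler keeps each row independently with probability `fraction`, so even `fraction=1.0` preserves the original order, while a shuffle-based sampler draws a random permutation first.

```python
import random

def linear_scan_sample(rows, fraction, seed=None):
    # Spark-style sampling: a single pass over the data; each row is kept
    # independently with probability `fraction`. Original order is preserved.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def shuffle_sample(rows, fraction, seed=None):
    # Pandas-style sampling: draw rows in random order, so fraction=1.0
    # returns every row but as a random permutation.
    rng = random.Random(seed)
    return rng.sample(rows, k=int(round(fraction * len(rows))))

rows = list(range(20))
# fraction=1.0: the linear scan returns all rows, still in order...
print(linear_scan_sample(rows, 1.0, seed=7))
# ...while the shuffle-based version returns all rows, permuted.
print(shuffle_sample(rows, 1.0, seed=7))
```

This is why `df.sample(fraction=1.0)` in Spark hands back the rows in their original order.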

You can order the DataFrame by a column of random numbers:

from pyspark.sql.functions import rand

# Build a one-column DataFrame, then sort it by a fresh column of random numbers
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)

## +---+
## |  x|
## +---+
## |  2|
## |  7|
## | 14|
## +---+
## only showing top 3 rows

But this is:

  • expensive - because it requires a full shuffle, which is something you typically want to avoid.
  • suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, it is relatively useless without collecting.