Spark error: 'save' does not support bucketing right now

Problem description:

I have a DataFrame which I am trying to partitionBy a column, sort by that column, and save in Parquet format using the following command:

df.write().format("parquet")
  .partitionBy("dynamic_col")
  .sortBy("dynamic_col")
  .save("test.parquet");

I get the following error:

reason: User class threw exception: org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now;

Is save(...) not allowed? Is only saveAsTable(...) allowed, which saves the data to Hive?

Any suggestions would be helpful.

The problem is that sortBy is currently (Spark 2.3.1) supported only together with bucketing, bucketing needs to be used in combination with saveAsTable, and the bucket sorting column should not be part of the partition columns.

So you have two options:

  1. Don't use sortBy (if you still need the rows inside each output file sorted, see the sortWithinPartitions sketch after this list):

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .option("path", output_path)
    .save()

  2. Use sortBy with bucketing and save it through the metastore using saveAsTable:

    df.write
    .format("parquet")
    .partitionBy("dynamic_col")
    .bucketBy(n, bucket_col)
    .sortBy(bucket_col)
    .option("path", output_path)
    .saveAsTable(table_name)
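
Once the table is registered through the metastore, it can be read back with its bucketing metadata intact, which lets the optimizer avoid shuffles on the bucket column. A minimal usage sketch, assuming an active SparkSession named spark and the same table_name and bucket_col as above:

    val bucketed = spark.table(table_name)

    // Aggregations (and joins) on the bucket column can skip the shuffle,
    // because the bucket layout is known from the metastore.
    bucketed.groupBy(bucket_col).count().show()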
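
If you go with option 1 but still need the rows inside each output file sorted, a commonly suggested workaround (not part of the original answer) is sortWithinPartitions, which sorts rows within each task's partition before writing. A minimal sketch, assuming a hypothetical secondary sort column named sort_col; sorting by dynamic_col alone would be pointless, since every file under a partitionBy directory contains a single value of it:

    import org.apache.spark.sql.functions.col

    df.repartition(col("dynamic_col"))                 // collect rows of each partition value together
      .sortWithinPartitions("dynamic_col", "sort_col") // partition column first, then the hypothetical sort_col
      .write
      .format("parquet")
      .partitionBy("dynamic_col")
      .option("path", output_path)
      .save()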