PySpark: why can't I refer to a column as an attribute in some cases?

Problem description:

Suppose I have the following code:

df = df \
    .withColumn('this_month_sales', df.units * df.rate) \
    .withColumn('this_year_sales_v1', df.this_month_sales + df.sales_till_last_month) \
    .withColumn('this_year_sales_v2', F.col('this_month_sales') + df.sales_till_last_month)

In this code,

  • the formula for this_year_sales_v1 will cause a failure saying the this_month_sales column doesn't exist or is not an attribute, or something similar.
  • the formula for this_year_sales_v2 will work.
Why though? Aren't they essentially doing the same thing?

That's because in the third line, the this_month_sales column does not exist in the original df. It is only created in the second line, but the variable df still refers to the original DataFrame, whose attributes have not been updated.
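The mechanics can be illustrated with a minimal pure-Python sketch. This is not PySpark itself: `Frame` and `with_column` are hypothetical stand-ins for an immutable DataFrame, showing that each `with_column` call returns a new object while attribute access still resolves against the old object the `df` variable names.

```python
class Frame:
    """Hypothetical stand-in for an immutable DataFrame."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def with_column(self, name, value):
        # Returns a NEW Frame; never mutates self.
        return Frame({**self._columns, name: value})

    def __getattr__(self, name):
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(f"'Frame' object has no attribute {name!r}")


df = Frame({"units": 10, "rate": 3})

# Chained call: 'df' still names the ORIGINAL frame, which has no
# 'this_month_sales' column, so the attribute lookup fails before
# the second with_column is even called.
try:
    df.with_column("this_month_sales", df.units * df.rate) \
      .with_column("this_year_sales", df.this_month_sales + 1)
except AttributeError as e:
    print(e)  # 'Frame' object has no attribute 'this_month_sales'

# Rebinding between steps works: after the first assignment,
# 'df' names the new frame that does carry the column.
df = df.with_column("this_month_sales", df.units * df.rate)
df = df.with_column("this_year_sales", df.this_month_sales + 1)
print(df.this_year_sales)  # 31
```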

If you do something like this instead:

df = df \
    .withColumn('this_month_sales', df.units * df.rate)

df = df \
    .withColumn('this_year_sales_v1', df.this_month_sales + df.sales_till_last_month)

Then it should work, because the this_month_sales column is an attribute of df by the time the second statement runs.

In general, I prefer using F.col to prevent this kind of problem.
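Why a name-based reference sidesteps the problem can be sketched in pure Python as well. Again, this is not PySpark: `Frame`, `Expr`, and `col` are hypothetical stand-ins illustrating that a deferred reference like F.col is resolved only when the expression is evaluated, against the frame being built at that step of the chain, not against whatever object the `df` variable currently names.

```python
class Expr:
    """A deferred expression, evaluated against a frame at use time."""

    def __init__(self, fn):
        self.fn = fn

    def __add__(self, other):
        return Expr(lambda frame: self.fn(frame) + other.fn(frame))

    def __mul__(self, other):
        return Expr(lambda frame: self.fn(frame) * other.fn(frame))


def col(name):
    # Name-based reference, looked up only when with_column evaluates
    # it (the analogue of F.col in this sketch).
    return Expr(lambda frame: frame._columns[name])


class Frame:
    def __init__(self, columns):
        self._columns = dict(columns)

    def with_column(self, name, expr):
        # The expression is evaluated against THIS frame, i.e. the
        # output of the previous step in a chain, so columns added
        # earlier in the same chain are visible.
        return Frame({**self._columns, name: expr.fn(self)})


df = Frame({"units": 10, "rate": 3, "sales_till_last_month": 5})

# The whole chain works because col('this_month_sales') is resolved
# against the intermediate frame that already carries the column.
result = df \
    .with_column("this_month_sales", col("units") * col("rate")) \
    .with_column("this_year_sales",
                 col("this_month_sales") + col("sales_till_last_month"))

print(result._columns["this_year_sales"])  # 35
```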