PySpark: why can't I refer to a column as an attribute in some cases?
Problem description:
Suppose I have the following code:
df = df \
.withColumn('this_month_sales', df.units * df.rate) \
.withColumn('this_year_sales_v1', df.this_month_sales + df.sales_till_last_month) \
.withColumn('this_year_sales_v2', F.col('this_month_sales') + df.sales_till_last_month)
In this code,
- the formula for this_year_sales_v1 will cause a failure saying the this_month_sales column doesn't exist or is not an attribute, or something similar.
- the formula for this_year_sales_v2 will work.
Why though? Aren't they essentially doing the same thing?
Answer:
That's because in the third line, the this_month_sales column does not exist in the original df. It was only created in the second line, but df still refers to the original DataFrame, whose attributes have not been updated.
If you do something like
df = df \
.withColumn('this_month_sales', df.units * df.rate)
df = df \
.withColumn('this_year_sales_v1', df.this_month_sales + df.sales_till_last_month)
then it should work, because by the time the second statement runs, the this_month_sales column is an attribute of df.
In general, I prefer using F.col to prevent this kind of problem.