pyspark: drop columns that have the same value in all rows

Problem description:

Related question: How to drop columns which have same values in all rows via pandas or spark dataframe?

So I have a pyspark dataframe, and I want to drop the columns where all values are the same in all rows, while keeping the other columns intact.

However, the answers in the above question are only for pandas. Is there a solution for a pyspark dataframe?
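
For reference, a minimal pandas sketch of the idea from that question (the frame pdf and its columns are made up for illustration):

import pandas as pd

# hypothetical sample frame: column 'b' holds the same value in every row
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": ["x", "y", "x"]})

# keep only the columns with more than one distinct value
pdf = pdf.loc[:, pdf.nunique() > 1]
print(pdf.columns.tolist())  # ['a', 'c']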

Thanks

You can apply the countDistinct() aggregation function to each column to get the count of distinct values per column. A column with count=1 holds only one value across all rows, so it can be dropped.

from pyspark.sql.functions import col, countDistinct

# apply countDistinct to every column in a single aggregation
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# collect the columns whose distinct count is 1
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

# drop those columns and show the result
df.drop(*cols_to_drop).show()
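
For completeness, here is a self-contained sketch of the whole approach, assuming a local SparkSession; the sample data and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

spark = SparkSession.builder.master("local[*]").appName("drop-constant-cols").getOrCreate()

# hypothetical sample data: 'country' is constant across all rows
df = spark.createDataFrame(
    [(1, "US", 10.0), (2, "US", 20.0), (3, "US", 10.0)],
    ["id", "country", "amount"],
)

col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()
df.drop(*[c for c in df.columns if col_counts[c] == 1]).show()  # 'country' is dropped

One caveat: countDistinct ignores NULLs, so a column that mixes a single value with NULLs also counts as 1 distinct value (and is dropped), while a column that is entirely NULL counts as 0 and would be kept; use col_counts[c] <= 1 if such columns should be dropped as well.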