Python pandas pairwise frequency table with many columns
Beginner Pandas Question here:
How do I create a cross-frequency count table for all columns? I want to use the output to make a seaborn heatmap showing the counts between each pair of columns.
I have a DataFrame (pulled down from HDFS with PySpark) with ~70 unique columns and about 600K rows.
Desired sample output:
     C1  C2  C3  C4  ... C70
C1    -   1   1   2
C2    1   -   0   2
C3    1   0   -   1
C4    2   2   1   -
...
C70
Sample DataFrame:
import numpy as np
import pandas as pd
raw_data = {'C1': [ 0, 2, 5, 0, 3], #...600K
'C2': [3, 0 , 2, 0, 0],
'C3': [0, 0, 0, 3, 3],
'C4': [2, 1, 1, 4, 0]}
df = pd.DataFrame(raw_data, columns = ['C1', 'C2', 'C3','C4'])
print(df)
I've tried using crosstab, pivot, pivot_table from pandas and think that the solution is using crosstab, but I can't get it in the desired output format (sorry if there is something obvious I'm missing). Any help appreciated!
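Clip positive values to 1 with clip (spelled clip_upper before pandas 1.0, where that method was removed), and then compute the dot product of the transposed frame with itself. Entry (a, b) of the result counts the rows in which both columns are positive:
i = df.clip(upper=1)  # on pandas < 1.0: df.clip_upper(1)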
j = i.T.dot(i)
j
    C1  C2  C3  C4
C1   3   1   1   2
C2   1   2   0   2
C3   1   0   2   1
C4   2   2   1   4
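The diagonal above holds each column's own nonzero count, while the desired output in the question shows "-" there. A minimal sketch of one way to blank it out before plotting, assuming the values are non-negative counts as in the sample data (the names present, co and co_offdiag are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'C1': [0, 2, 5, 0, 3],
    'C2': [3, 0, 2, 0, 0],
    'C3': [0, 0, 0, 3, 3],
    'C4': [2, 1, 1, 4, 0],
})

# 1 where a column is positive in a row, 0 otherwise.
# Assumption: no negative values, so this matches clipping at 1.
present = df.gt(0).astype(int)

# Entry (a, b) counts the rows in which both a and b are positive.
co = present.T.dot(present)

# Replace the diagonal with NaN to mirror the "-" cells in the question.
co_offdiag = co.astype(float).mask(np.eye(len(co), dtype=bool))
print(co_offdiag)
```

If seaborn is available, passing co_offdiag to seaborn.heatmap (for example with annot=True) should draw the pairwise counts, with the NaN diagonal left blank. The product is only 70 x 70 for the full data, so the heatmap itself stays small regardless of the row count.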