如何在数据框中按列分组并在循环中创建数据透视表

问题描述:

我有下表 df

ID  CATEG   LEVEL   COLS    VALUE   COMMENT
1    A       3      Apple    388    comment1
1    A       3      Orange   204    comment1
1    A       2      Orange   322    comment1
1    A       1      Orange   716    comment1
1    A       1      Apple    282    comment1
1    A       2      Apple    555    comment1
1    A              Berry    289    comment1
2    A              Car      316    comment1
1    B              Berry    297    comment1
1    B       3      Apple    756    comment1
1    B       2      Apple    460    comment1
1    B       3      Orange   497    comment1
1    B       2      Orange   831    comment1
1    B       1      Orange   225    comment1
1    B       1      Apple    395    comment1
2    B              Car      486    comment1
1    C       2      Orange   320    comment1
1    C       1      Orange   208    comment1
1    C       1      Apple    464    comment1
1    C       2      Apple    613    comment1
1    C       3      Apple    369    comment1
1    C              Berry    474    comment1
2    C              Car      888    comment1
1    C       3      Orange   345    comment1
2    B              Car      664    comment2

我想在以下位置创建此视图数据框,并为每个 ID 组写excel。 ID $ c的示例$ c> 1 。在我的示例中,只有一个注释,因此工作表名称应类似于 ID_COMMENT 1_comment1 :-

I want to create this view in dataframe and write in excel for each group of ID.Example for ID 1. In my sample there is only one comment so sheet name be like ID_COMMENT like 1_comment1:-

  Berry     Apple     Orange        
         1   2   3  1   2   3
A   289 388 555 282 204 322 716
B   297 756 460 395 497 831 225
C   474 369 613 464 345 320 208

如果级别 None / na 我应该能够基于 df > COLS 和评论单独使用名称 ID_NULL_COMMENT如工作表名称:-
2_NULL_comment1 工作表:-

If LEVEL is None/na I should be able to create/ split the df based on COLS and comments alone with name "ID_NULL_COMMENT" as sheet name like:- 2_NULL_comment1 sheet :-

   CATEG    Car
     A      316
     B      486
     C      888

2_NULL_comment2 表格:-

CATEG   Car
 B      664

我尝试了什么:

from pandas import ExcelWriter
writer = ExcelWriter('Values.xlsx')
distinct_id_df= np.unique(df[['ID']], axis=0)   
for ID in  distinct_id_df.iloc[:,0] :
    sample_df = pd.DataFrame()
    for df in sample_df:
        for i in(distinct_id_df):
            distinct_id_df = df.groupby['ID'].pivot_table('VALUE', ['LEVEL','CATEEG'],'COLS')
        sample_df = sample_df.append(df)
        print(sample_df.shape, '===>', datetime.now())
    sample_df.to_excel(writer,'{}''{}'.format(id).format(comments),index= False)

writer.save()

这显然不正确,我不知道可以正确地执行 pivot ,并且还停留在如何正确循环以放置在不同工作表中的问题上。

This is not correct clearly, Im unable to do the pivot correctly and also stuck on how to loop correctly to place in different sheet.

使用:

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2], 'CATEG': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B'], 'LEVEL': [3.0, 3.0, 2.0, 1.0, 1.0, 2.0,  np.nan,  np.nan,  np.nan, 3.0, 2.0, 3.0, 2.0, 1.0, 1.0,  np.nan, 2.0, 1.0, 1.0, 2.0, 3.0,  np.nan,  np.nan, 3.0,  np.nan], 'COLS': ['Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Apple', 'Berry', 'Car', 'Berry', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Car', 'Orange', 'Orange', 'Apple', 'Apple', 'Apple', 'Berry', 'Car', 'Orange', 'Car'], 'VALUE': [388, 204, 322, 716, 282, 555, 289, 316, 297, 756, 460, 497, 831, 225, 395, 486, 320, 208, 464, 613, 369, 474, 888, 345, 664], 'COMMENT': ['comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment2']})





#check misisng values
mask = df['LEVEL'].isna()

#split DataFrames for different processing
df1 = df[~mask]
df2 = df[mask]

#pivoting with differnet columns parameters
df1 = df1.pivot_table(index=['ID','COMMENT','CATEG'], 
                      columns=['COLS','LEVEL'],
                      values='VALUE')
# print (df1)

df2 = df2.pivot_table(index=['ID','COMMENT','CATEG'], columns='COLS',values='VALUE')
# print (df1)

from pandas import ExcelWriter
with pd.ExcelWriter('Values.xlsx') as writer: 
    
    #groupby by first 2 levels ID, COMMENT
    for (ids,comments), sample_df in df1.groupby(['ID','COMMENT']):
        #removed first 2 levels, also removed only NaNs columns
        df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
        #new sheetnames by f-strings
        name = f'{ids}_{comments}'
        #write to file
        df.to_excel(writer,sheet_name=name)
        
    for (ids,comments), sample_df in df2.groupby(['ID','COMMENT']):
        df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
        name = f'{ids}_NULL_{comments}'
        df.to_excel(writer,sheet_name=name)

另一种无需重复代码的解决方案:

Another solution without repeating code:

mask = df['LEVEL'].isna()

dfs = {'no_null': df[~mask], 'null': df[mask]}

from pandas import ExcelWriter
with pd.ExcelWriter('Values.xlsx') as writer: 
    
    for k, v in dfs.items():
        if k == 'no_null':
            add = ''
            cols = ['COLS','LEVEL']
        else:
             add = 'NULL_'
             cols = 'COLS'
        
        df = v.pivot_table(index=['ID','COMMENT','CATEG'], columns=cols, values='VALUE')
          
        for (ids,comments), sample_df in df.groupby(['ID','COMMENT']):
            df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
            name = f'{ids}_{add}{comments}'
            df.to_excel(writer,sheet_name=name)