pandas :将宽数据框重塑为多索引长

问题描述：

我有一个由许多单独的评分者生成的照片评分数据集.
为每个评估者提供了几幅图像进行评估，并且对于每幅图像，评估者提供了几种不同的评估以及描述.

I have a dataset of ratings of photographs, generated by many individual raters.
Each rater is given several images to rate, and for each image, a rater provides several different ratings, plus a description.

因此，例如，可能要求一名评分者对3张不同的照片进行评分，并针对每张照片的快乐，悲伤和有趣程度以0-5的比例提供单独的评分.另外，要求评分者提供每张照片的简短文字说明.

So, for example, a rater might be asked to rate 3 different photos, and provide separate ratings on a 0-5 scale for how happy, sad, and interesting each photo is. In addition, a rater is asked to provide a short text description of each photo.

同一张照片可以由多个评估者评分，但并非所有照片都由相同数量的评估者评分.

The same photograph may be rated by multiple raters, but not all photographs will have been rated by the same number of raters.

当前我的数据采用这种格式(每个url代表一张照片):

Currently my data are in this form (each url represents a photo):

rater_id | url1 | url2 | url3 | rating_1(1) | rating_2(1) | rating_1(2) | rating_2(2)   | rating_1(3) | rating_2(3) | description(1) | description(2) | description(3)
     001 |   a  |   b  |   c  |     3.0     |     2.5     |     4.0     |     1.5       |     5.0     |     5.0     |  sunny day     |  rainy day     |  foggy day
     002 |   a  |   b  |   d  |     1.0     |     4.5     |     3.0     |     3.5       |     1.0     |     3.5     |  sunshine      |  rain          |  snow

我要在这里实现一些转换.
首先，我想更改数据框，以便按照片url进行索引-将所有url字段(url1，url2等)融合为一个长列url.含义:

I have a few transformations I'm trying to achieve here.
First I want to change the data frame so that it is indexed by photo url - with all of the url fields (url1, url2, etc) melted into one long column url. Meaning:

url1 | url2 | url3
-----|------|-----
  a  |   b  |  c

成为

url
---
 a
 b
 c

在每个url索引中，都有一个rater_id的嵌套索引，该索引随后具有该评分者的评级和给定照片的描述的一行.
例如:

Within each url index, there is a nested index for rater_id, which then has a row of that rater's ratings and description for the given photo.
Eg.:

                | rating_1 | rating_2 | description
url | rater_id
 a  |      001  |    3.0   |   2.5    |  sunny day
    |      002  |    1.0   |   4.5    |  sunshine
----|-----------|----------|----------|------------
 b  |      001  |    4.0   |   1.5    |  rainy day
    |      002  |    4.5   |   3.0    |  rain
----|-----------|----------|----------|------------
 c  |      001  |    5.0   |   5.0    |  foggy day
----|-----------|----------|----------|------------
 d  |      002  |    1.0   |   3.5    |  snow

最后，我想汇总每个照片网址的评分和描述: -数值等级的均值和方差 -所有描述的制表符分隔的字符串 -为每张照片评分的评分者的数量

Finally, I want to aggregate ratings and descriptions for each photo url: - mean and variance of numeric ratings - tab-separated string of all descriptions - number of raters who have rated each photo

例如:

url | rating_1_avg | rating_1_var | rating_2_avg | rating_2_var | all_descriptions      | total_ratings 
 a  |     2.0      |     2.0      |     3.0      |     2.0      | sunny day    sunshine |      2
----|--------------|--------------|--------------|--------------|-----------------------|--------------
 b  |     4.25     |     0.125    |     2.25     |     1.125    | rainy day    rain     |      2
----|--------------|--------------|--------------|--------------|-----------------------|--------------
 c  |     5.0      |     NA       |     5.0      |     NA       | foggy day             |      1
----|--------------|--------------|--------------|--------------|-----------------------|--------------
 d  |     1.0      |     NA       |     3.5      |     NA       | snow                  |      1

我尝试了多种方法，其中熊猫重塑工具(包括melt和wide_to_long)，但是我不知道如何首先将照片网址转换为长格式，然后创建嵌套索引来排列数据，如上所述.我对Pandas groupby和基本聚合非常满意，但这超出了我的技能水平.任何帮助深表感谢！

I've tried a number of approaches with the Pandas reshaping tools, including melt and wide_to_long, but I can't figure out how to first get the photo urls into long format, and then created a nested index to arrange the data as I've presented above. I'm pretty comfortable with Pandas groupby and basic aggregation, but this is a bit beyond my skill level. Any help is much appreciated!

注意:我在这些虚拟数据中提供的字段名称与实际数据集中的名称不同，但它们完全遵循相同的命名约定.图片网址全是url1，url2等，并且评分字段表示为rating_<rating_category_number>(<url_number>)，例如. rating_1(2).描述字段表示为description(<url_number>)，例如. description(2).

Note: The fields I've given in these dummy data are not the exact names as in the actual dataset, but they follow the same naming conventions exactly. The photo urls are all url1, url2, etc, and the ratings fields are denoted as rating_<rating_category_number>(<url_number>), eg. rating_1(2). The description fields are denoted as description(<url_number>), eg. description(2).

以下是用于构建初始数据集的Python代码:

Here's Python code to build the initial dataset:

df = pd.DataFrame({'id': {0: '001', 1: '002'},
                   'url1': {0: 'a', 1: 'a'},
                   'url2': {0: 'b', 1: 'b'},
                   'url3': {0: 'c', 1: 'd'}})

df['rating_1(1)'] = [3.0, 1]
df['rating_2(1)'] = [2.5, 4.5]
df['rating_1(2)'] = [4.0, 3]
df['rating_2(2)'] = [1.5, 3.5]
df['rating_1(3)'] = [5.0, 1]
df['rating_2(3)'] = [5.0, 3.5]
df['description(1)'] = ['sunny day','sunshine']
df['description(2)'] = ['rainy day','rain']
df['description(3)'] = ['foggy day','snow']

答

您可以先通过 mean ， var ， count 和join:

You can first find columns for each category by str.contains and then use not well known pd.lreshape. Last aggregate columns by mean, var, count and join:

#select columns with each category
rat1 = df.columns[df.columns.str.contains(r'rating_1')].tolist()
print rat1
['rating_1(1)', 'rating_1(2)', 'rating_1(3)']

rat2 = df.columns[df.columns.str.contains(r'rating_2')].tolist()
url = df.columns[df.columns.str.contains(r'url')].tolist()
desc = df.columns[df.columns.str.contains(r'description')].tolist()

df =  pd.lreshape(df, {'rat1': rat1, 'rat2': rat2,'url': url,'desc': desc})
print df
  rater_id url  rat2  rat1       desc
0    '001'   a   2.5   3.0  sunny day
1    '002'   a   4.5   1.0   sunshine
2    '001'   b   1.5   4.0  rainy day
3    '002'   b   3.5   3.0       rain
4    '001'   c   5.0   5.0  foggy day
5    '002'   d   3.5   1.0       snow

#aggregate
df = df.groupby(['url']).agg({'rat1':['mean', 'var'],
                              'rat2':['mean', 'var'], 
                              'desc': ' '.join, 
                              'rater_id': 'count'})

#reset multiindex in columns
df.columns = ['_'.join(col) for col in df.columns.values]

print df
     rater_id_count  rat2_mean  rat2_var  rat1_mean  rat1_var  \
url                                                             
a                 2        3.5       2.0        2.0       2.0   
b                 2        2.5       2.0        3.5       0.5   
c                 1        5.0       NaN        5.0       NaN   
d                 1        3.5       NaN        1.0       NaN   

              desc_join  
url                      
a    sunny day sunshine  
b        rainy day rain  
c             foggy day  
d                  snow

pandas :将宽数据框重塑为多索引长

相关推荐