pandas :将宽数据框重塑为多索引长
我有一个由许多单独的评分者生成的照片评分数据集.
为每个评估者提供了几幅图像进行评估,并且对于每幅图像,评估者提供了几种不同的评估以及描述.
I have a dataset of ratings of photographs, generated by many individual raters.
Each rater is given several images to rate, and for each image, a rater provides several different ratings, plus a description.
因此,例如,可能要求一名评分者对3张不同的照片进行评分,并针对每张照片的快乐,悲伤和有趣程度以0-5的比例提供单独的评分.另外,要求评分者提供每张照片的简短文字说明.
So, for example, a rater might be asked to rate 3 different photos, and provide separate ratings on a 0-5 scale for how happy, sad, and interesting each photo is. In addition, a rater is asked to provide a short text description of each photo.
同一张照片可以由多个评估者评分,但并非所有照片都由相同数量的评估者评分.
The same photograph may be rated by multiple raters, but not all photographs will have been rated by the same number of raters.
当前我的数据采用这种格式(每个url代表一张照片):
Currently my data are in this form (each url represents a photo):
rater_id | url1 | url2 | url3 | rating_1(1) | rating_2(1) | rating_1(2) | rating_2(2) | rating_1(3) | rating_2(3) | description(1) | description(2) | description(3)
001 | a | b | c | 3.0 | 2.5 | 4.0 | 1.5 | 5.0 | 5.0 | sunny day | rainy day | foggy day
002 | a | b | d | 1.0 | 4.5 | 3.0 | 3.5 | 1.0 | 3.5 | sunshine | rain | snow
我要在这里实现一些转换.
首先,我想更改数据框,以便按照片url进行索引-将所有url字段(url1,url2等)融合为一个长列url
.含义:
I have a few transformations I'm trying to achieve here.
First I want to change the data frame so that it is indexed by photo url - with all of the url fields (url1, url2, etc) melted into one long column url
. Meaning:
url1 | url2 | url3
-----|------|-----
a | b | c
成为
url
---
a
b
c
在每个url
索引中,都有一个rater_id
的嵌套索引,该索引随后具有该评分者的评级和给定照片的描述的一行.
例如:
Within each url
index, there is a nested index for rater_id
, which then has a row of that rater's ratings and description for the given photo.
Eg.:
| rating_1 | rating_2 | description
url | rater_id
a | 001 | 3.0 | 2.5 | sunny day
| 002 | 1.0 | 4.5 | sunshine
----|-----------|----------|----------|------------
b | 001 | 4.0 | 1.5 | rainy day
| 002 | 4.5 | 3.0 | rain
----|-----------|----------|----------|------------
c | 001 | 5.0 | 5.0 | foggy day
----|-----------|----------|----------|------------
d | 002 | 1.0 | 3.5 | snow
最后,我想汇总每个照片网址的评分和描述: -数值等级的均值和方差 -所有描述的制表符分隔的字符串 -为每张照片评分的评分者的数量
Finally, I want to aggregate ratings and descriptions for each photo url: - mean and variance of numeric ratings - tab-separated string of all descriptions - number of raters who have rated each photo
例如:
url | rating_1_avg | rating_1_var | rating_2_avg | rating_2_var | all_descriptions | total_ratings
a | 2.0 | 2.0 | 3.0 | 2.0 | sunny day sunshine | 2
----|--------------|--------------|--------------|--------------|-----------------------|--------------
b | 4.25 | 0.125 | 2.25 | 1.125 | rainy day rain | 2
----|--------------|--------------|--------------|--------------|-----------------------|--------------
c | 5.0 | NA | 5.0 | NA | foggy day | 1
----|--------------|--------------|--------------|--------------|-----------------------|--------------
d | 1.0 | NA | 3.5 | NA | snow | 1
我尝试了多种方法,其中熊猫重塑工具(包括melt
和wide_to_long
),但是我不知道如何首先将照片网址转换为长格式,然后创建嵌套索引来排列数据,如上所述.我对Pandas groupby
和基本聚合非常满意,但这超出了我的技能水平.任何帮助深表感谢!
I've tried a number of approaches with the Pandas reshaping tools, including melt
and wide_to_long
, but I can't figure out how to first get the photo urls into long format, and then created a nested index to arrange the data as I've presented above. I'm pretty comfortable with Pandas groupby
and basic aggregation, but this is a bit beyond my skill level. Any help is much appreciated!
注意:我在这些虚拟数据中提供的字段名称与实际数据集中的名称不同,但它们完全遵循相同的命名约定.图片网址全是url1
,url2
等,并且评分字段表示为rating_<rating_category_number>(<url_number>)
,例如. rating_1(2)
.描述字段表示为description(<url_number>)
,例如. description(2)
.
Note: The fields I've given in these dummy data are not the exact names as in the actual dataset, but they follow the same naming conventions exactly. The photo urls are all url1
, url2
, etc, and the ratings fields are denoted as rating_<rating_category_number>(<url_number>)
, eg. rating_1(2)
. The description fields are denoted as description(<url_number>)
, eg. description(2)
.
以下是用于构建初始数据集的Python代码:
Here's Python code to build the initial dataset:
df = pd.DataFrame({'id': {0: '001', 1: '002'},
'url1': {0: 'a', 1: 'a'},
'url2': {0: 'b', 1: 'b'},
'url3': {0: 'c', 1: 'd'}})
df['rating_1(1)'] = [3.0, 1]
df['rating_2(1)'] = [2.5, 4.5]
df['rating_1(2)'] = [4.0, 3]
df['rating_2(2)'] = [1.5, 3.5]
df['rating_1(3)'] = [5.0, 1]
df['rating_2(3)'] = [5.0, 3.5]
df['description(1)'] = ['sunny day','sunshine']
df['description(2)'] = ['rainy day','rain']
df['description(3)'] = ['foggy day','snow']
您可以先通过 mean
, var
, count
和join
:
You can first find columns for each category by str.contains
and then use not well known pd.lreshape
. Last aggregate columns by mean
, var
, count
and join
:
#select columns with each category
rat1 = df.columns[df.columns.str.contains(r'rating_1')].tolist()
print rat1
['rating_1(1)', 'rating_1(2)', 'rating_1(3)']
rat2 = df.columns[df.columns.str.contains(r'rating_2')].tolist()
url = df.columns[df.columns.str.contains(r'url')].tolist()
desc = df.columns[df.columns.str.contains(r'description')].tolist()
df = pd.lreshape(df, {'rat1': rat1, 'rat2': rat2,'url': url,'desc': desc})
print df
rater_id url rat2 rat1 desc
0 '001' a 2.5 3.0 sunny day
1 '002' a 4.5 1.0 sunshine
2 '001' b 1.5 4.0 rainy day
3 '002' b 3.5 3.0 rain
4 '001' c 5.0 5.0 foggy day
5 '002' d 3.5 1.0 snow
#aggregate
df = df.groupby(['url']).agg({'rat1':['mean', 'var'],
'rat2':['mean', 'var'],
'desc': ' '.join,
'rater_id': 'count'})
#reset multiindex in columns
df.columns = ['_'.join(col) for col in df.columns.values]
print df
rater_id_count rat2_mean rat2_var rat1_mean rat1_var \
url
a 2 3.5 2.0 2.0 2.0
b 2 2.5 2.0 3.5 0.5
c 1 5.0 NaN 5.0 NaN
d 1 3.5 NaN 1.0 NaN
desc_join
url
a sunny day sunshine
b rainy day rain
c foggy day
d snow