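[Heartbeat Signal Classification] Datawhale Check-in, Task01: Understanding the Competition Task
This post mainly reviews the points I had forgotten.
Data overview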
import pandas as pd
# Forward slashes work on Windows and avoid backslash-escape problems in the string literal.
win_file_path = 'E:/competition-data/016_heartbeat_signals/'
train = pd.read_csv(win_file_path + 'train.csv')
test = pd.read_csv(win_file_path + 'testA.csv')
Training set
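train.info  # without parentheses this echoes the bound method and the frame's repr; train.info() prints the column/dtype summary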
<bound method DataFrame.info of id heartbeat_signals label
0 0 0.9912297987616655,0.9435330436439665,0.764677... 0.0
1 1 0.9714822034884503,0.9289687459588268,0.572932... 0.0
2 2 1.0,0.9591487564065292,0.7013782792997189,0.23... 2.0
3 3 0.9757952826275774,0.9340884687738161,0.659636... 0.0
4 4 0.0,0.055816398940721094,0.26129357194994196,0... 2.0
... ... ... ...
99995 99995 1.0,0.677705342021188,0.22239242747868546,0.25... 0.0
99996 99996 0.9268571578157265,0.9063471198026871,0.636993... 2.0
99997 99997 0.9258351628306013,0.5873839035878395,0.633226... 3.0
99998 99998 1.0,0.9947621698382489,0.8297017704865509,0.45... 2.0
99999 99999 0.9259994004527861,0.916476635326053,0.4042900... 0.0
[100000 rows x 3 columns]>
train.describe()
| | id | label |
|---|---|---|
| count | 100000.000000 | 100000.000000 |
| mean | 49999.500000 | 0.856960 |
| std | 28867.657797 | 1.217084 |
| min | 0.000000 | 0.000000 |
| 25% | 24999.750000 | 0.000000 |
| 50% | 49999.500000 | 0.000000 |
| 75% | 74999.250000 | 2.000000 |
| max | 99999.000000 | 3.000000 |
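The `heartbeat_signals` column holds one comma-separated string per row, so it has to be split into floats before any numeric work. Below is a minimal sketch of that step plus a per-class row count. It assumes a small hand-made frame (`toy`) in place of `train`, with made-up values, and that every signal has the same number of samples; the helper name `parse_signals` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`: same column names, made-up values.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "heartbeat_signals": ["1.0,0.5,0.25", "0.9,0.8,0.1", "0.0,0.2,0.4"],
    "label": [0.0, 2.0, 0.0],
})

def parse_signals(series):
    """Split each comma-separated string into a row of floats (hypothetical helper)."""
    return np.array([[float(v) for v in s.split(",")] for s in series])

# 2-D array: one row per signal, one column per sample point.
signals = parse_signals(toy["heartbeat_signals"])

# Number of rows per class, sorted by class label.
label_counts = toy["label"].value_counts().sort_index()

print(signals.shape)
print(label_counts)
```

Running the same two steps on `train` would show how unevenly the four labels are distributed, which matters for the choice of metric later on.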
Test set
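test.info  # without parentheses this echoes the bound method and the frame's repr; test.info() prints the column/dtype summary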
<bound method DataFrame.info of id heartbeat_signals
0 100000 0.9915713654170097,1.0,0.6318163407681274,0.13...
1 100001 0.6075533139615096,0.5417083883163654,0.340694...
2 100002 0.9752726292239277,0.6710965234906665,0.686758...
3 100003 0.9956348033996116,0.9170249621481004,0.521096...
4 100004 1.0,0.8879490481178918,0.745564725322326,0.531...
... ... ...
19995 119995 1.0,0.8330283177934747,0.6340472606311671,0.63...
19996 119996 1.0,0.8259705825857048,0.4521053488322387,0.08...
19997 119997 0.951744840752379,0.9162611283848351,0.6675251...
19998 119998 0.9276692903808186,0.6771898159607004,0.242906...
19999 119999 0.6653212231837624,0.527064114047737,0.5166625...
[20000 rows x 2 columns]>
test.describe()
| | id |
|---|---|
| count | 20000.000000 |
| mean | 109999.500000 |
| std | 5773.647028 |
| min | 100000.000000 |
| 25% | 104999.750000 |
| 50% | 109999.500000 |
| 75% | 114999.250000 |
| max | 119999.000000 |
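Evaluation metric
The formula is as follows:
There are n cases in total. For a given signal, if the true value is [y1, y2, y3, y4] and the model's predicted probabilities are [a1, a2, a3, a4], then the model's evaluation metric abs-sum is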
\[
\text{abs-sum} = \sum_{j=1}^{n} \sum_{i=1}^{4} \left| y_i - a_i \right|
\]
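The abs-sum formula can be sketched in a few lines of NumPy. This is an illustration, not the competition's official scorer: it assumes the true value of each case is a one-hot vector over the four classes, and the function name `abs_sum` and the sample numbers are made up.

```python
import numpy as np

def abs_sum(y_true, y_prob):
    """Sum of |y_i - a_i| over all cases and all classes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.abs(y_true - y_prob).sum())

# Two made-up cases with four classes each (assumed one-hot truth).
y_true = [[1, 0, 0, 0],
          [0, 0, 1, 0]]
y_prob = [[0.7, 0.1, 0.1, 0.1],   # contributes 0.3 + 0.1 + 0.1 + 0.1
          [0.0, 0.0, 1.0, 0.0]]   # perfect prediction, contributes 0

score = abs_sum(y_true, y_prob)
print(score)
```

A perfect prediction contributes 0 to the sum, so a smaller abs-sum means a better model.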
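Common evaluation metrics for multi-class classification are listed below.
Multi-class metrics are computed in exactly the same way as binary ones, except that recall, precision, accuracy, and the F1 score are computed for each class in turn.
1. Accuracy
Accuracy is a commonly used metric, but it is not suitable when the classes are imbalanced, and most medical data is imbalanced.
Accuracy: the proportion of correctly predicted samples among all samples.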
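2. Precision: defined with respect to the predictions. It is the probability that a sample predicted as positive is actually positive, i.e. TP / (TP + FP).
3. Recall: defined with respect to the original samples. It is TP / (TP + FN), i.e. the probability that a sample that is actually positive is predicted as positive.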
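To make the three definitions above concrete, here is a small sketch on made-up, imbalanced labels. It treats each class one-vs-rest, and it also shows why accuracy can mislead: a model that always predicts the majority class still scores 0.8. The helper name `per_class_counts` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
import numpy as np

def per_class_counts(y_true, y_pred, n_classes):
    """Return TP, FP, FN arrays with one entry per class (one-vs-rest)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    return tp, fp, fn

# Made-up imbalanced labels: 8 samples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10  # a model that always predicts the majority class

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
tp, fp, fn = per_class_counts(y_true, y_pred, n_classes=2)

# Guard against 0/0 when a class is never predicted or never present.
precision = np.where(tp + fp > 0, tp / np.maximum(tp + fp, 1), 0.0)
recall = np.where(tp + fn > 0, tp / np.maximum(tp + fn, 1), 0.0)

print(accuracy, precision, recall)
```

Accuracy looks acceptable here, yet recall for class 1 is 0: the minority class is never found.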
5. Macro-precision (macro-P)
Compute the precision of each class, then take the mean. Here n is the number of classes and p_i is the precision of class i.
\[
\text{macroP} = \frac{1}{n} \sum_{i=1}^{n} p_i
\]
6. Macro-recall (macro-R)
Compute the recall of each class, then take the mean. Here n is the number of classes and R_i is the recall of class i.
\[
\text{macroR} = \frac{1}{n} \sum_{i=1}^{n} R_i
\]
7. Macro-F1 (macro-F1)
\[
\text{macroF1} = \frac{2 \times \text{macroP} \times \text{macroR}}{\text{macroP} + \text{macroR}}
\]
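The three macro formulas can be sketched as follows, on made-up labels for three classes. The function name `macro_scores` is hypothetical, and a class with a zero denominator is scored 0.0 by convention here. Note that this follows the definition given in this post (the harmonic mean of macro-P and macro-R); scikit-learn's `f1_score(average='macro')` instead averages the per-class F1 scores, so the two values can differ.

```python
import numpy as np

def macro_scores(y_true, y_pred, n_classes):
    """Macro-P, macro-R, and macro-F1 as defined in this post."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        # Per-class precision and recall; 0.0 when the denominator is zero.
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    macro_p = float(np.mean(precisions))
    macro_r = float(np.mean(recalls))
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom > 0 else 0.0
    return macro_p, macro_r, macro_f1

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, n_classes=3)
print(macro_p, macro_r, macro_f1)
```

Because every class gets the same weight in the mean, a poorly handled minority class pulls the macro scores down as much as a majority class would.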
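Unlike the macro metrics above, micro-precision and micro-recall first average the corresponding TP, FP, TN, and FN entries across the per-class confusion matrices, then apply the P and R formulas to obtain micro-P and micro-R. Micro-F1 is then computed from micro-P and micro-R.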
8. Micro-precision (micro-P)
\[
\text{microP} = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}
\]
9. Micro-recall (micro-R)
\[
\text{microR} = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}
\]
10. Micro-F1 (micro-F1)
\[
\text{microF1} = \frac{2 \times \text{microP} \times \text{microR}}{\text{microP} + \text{microR}}
\]
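The micro variants can be sketched the same way: average the per-class TP, FP, and FN counts first, then apply the formulas once. The function name `micro_scores` and the labels are made up for illustration, and 0.0 is returned for a zero denominator by convention.

```python
import numpy as np

def micro_scores(y_true, y_pred, n_classes):
    """Micro-P, micro-R, and micro-F1 from class-averaged TP, FP, FN."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    # Average each count over the classes (the overlined terms in the formulas).
    tp_bar, fp_bar, fn_bar = tp.mean(), fp.mean(), fn.mean()
    micro_p = float(tp_bar / (tp_bar + fp_bar)) if tp_bar + fp_bar > 0 else 0.0
    micro_r = float(tp_bar / (tp_bar + fn_bar)) if tp_bar + fn_bar > 0 else 0.0
    denom = micro_p + micro_r
    micro_f1 = 2 * micro_p * micro_r / denom if denom > 0 else 0.0
    return micro_p, micro_r, micro_f1

# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

micro_p, micro_r, micro_f1 = micro_scores(y_true, y_pred, n_classes=3)
print(micro_p, micro_r, micro_f1)
```

When every sample has exactly one true label and one predicted label, each error counts once as an FP and once as an FN, so micro-P, micro-R, and micro-F1 all equal plain accuracy (4 of 6 correct here).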