Low ROC AUC score but high accuracy

Problem description:

I am using LogisticRegression from scikit-learn on a version of the flight delay data set.

I use pandas to select some columns:

df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]

I fill the NaN values with 0:

df = df.fillna({'ARR_DEL15': 0})

Make sure the categorical columns are marked with the 'category' data type:

df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')

Then I call get_dummies() from pandas:

df = pd.get_dummies(df)

Now I train and test my data set:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

lr = LogisticRegression()

train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]

test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]

lr.fit(train_set_x, train_set_y)

Once I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number, around 0.583:

from sklearn.metrics import roc_auc_score

probabilities = lr.predict_proba(test_set_x)

roc_auc_score(test_set_y, probabilities[:, 1])

Is there any reason why the ROC AUC is much lower than what the score method provides?

To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.

[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]
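
As a quick sanity check of that footnote, here is a minimal sketch reusing the lr, test_set_x and test_set_y from the question; for scikit-learn classifiers, score is indeed the mean accuracy of predict on the given data:

from sklearn.metrics import accuracy_score

# LogisticRegression.score returns the mean accuracy of predict on the given data...
acc_from_score = lr.score(test_set_x, test_set_y)

# ...which is the same number you get by taking hard labels and comparing them with the truth
acc_explicit = accuracy_score(test_set_y, lr.predict(test_set_x))

print(acc_from_score, acc_explicit)  # both around 0.867 in the question's case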

According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.

The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.

The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).

Now, this threshold, in methods like scikit-learn predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
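
A minimal sketch of that point, again reusing the fitted lr and test_set_x from the question (the 0.3 cut-off below is just an arbitrary illustration, not a recommendation):

import numpy as np

# probability of the positive class (ARR_DEL15 == 1) for each test sample
probs = lr.predict_proba(test_set_x)[:, 1]

# predict simply applies the default 0.5 cut-off to these probabilities
labels_default = (probs > 0.5).astype(int)
print(np.array_equal(labels_default, lr.predict(test_set_x).astype(int)))  # True (up to ties at exactly 0.5)

# nothing forces that choice; a lower cut-off would flag more flights as delayed
labels_low = (probs > 0.3).astype(int)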

The take-home point here is:

  • when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
  • when you ask for AUC (which, in contrast, uses probabilities returned with predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds (see the sketch right after this list)
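
To make the contrast concrete, here is a rough sketch (same fitted lr and test set as in the question; the threshold grid is arbitrary): accuracy has to be recomputed for every cut-off you might pick, whereas roc_auc_score never sees a cut-off at all:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

probs = lr.predict_proba(test_set_x)[:, 1]

# accuracy is a property of (classifier + threshold): it changes as the cut-off moves
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    acc = accuracy_score(test_set_y, (probs > threshold).astype(int))
    print(f"threshold={threshold:.1f}  accuracy={acc:.3f}")

# AUC is computed from the probabilities alone - no threshold involved
print("AUC:", roc_auc_score(test_set_y, probs))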

Given these clarifications, your particular example provides a very interesting case in point:

I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?

Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care; for all practical cases, what you care for is a classifier deployed with a specific threshold, and what this classifier does in a purely theoretical and abstract situation when averaged across all possible thresholds should pose very little interest for a practitioner (it does pose interest for a researcher coming up with a new algorithm, but I assume that this is not your case).

(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
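
For completeness, a sketch of those checks with scikit-learn (same fitted lr and test set as in the question):

from sklearn.metrics import precision_score, recall_score, confusion_matrix

pred = lr.predict(test_set_x)

# precision: of the flights predicted as delayed, how many really were delayed
# recall: of the flights that really were delayed, how many the model caught
print("precision:", precision_score(test_set_y, pred))
print("recall:", recall_score(test_set_y, pred))

# confusion matrix: rows are the true classes (0, 1), columns the predicted classes (0, 1)
print(confusion_matrix(test_set_y, pred))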

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:

Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

[...]

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system

Emphasis mine - see also On the dangers of AUC...