Low ROC AUC score but high accuracy






df = df.fillna({'ARR_DEL15': 0})


Make sure the categorical columns are marked with the 'category' data type:

df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')


df = pd.get_dummies(df)


Now I train and test my data set:

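from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

lr = LogisticRegression()

# train_test_split returns the training portion first, then the test portion
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)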

train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]

test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]

lr.fit(train_set_x, train_set_y)


When I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number, around 0.583:

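from sklearn.metrics import roc_auc_score

# column 1 holds the predicted probability of the positive class
probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])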

Is there any reason why the ROC AUC is much lower than what the score method provides?


To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.


[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]


In my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: it is commonly (and unfortunately) used like any other the-higher-the-better metric, such as accuracy, which naturally leads to puzzles like the one you describe.


The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
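As a rough illustration of what "across all possible thresholds" means, the sketch below sweeps the threshold over every distinct score in a tiny hand-made data set, records the (false positive rate, true positive rate) point at each step, and integrates the resulting ROC curve with the trapezoidal rule. The data and the helper names roc_points and area_under are invented for this example; in practice roc_auc_score does the equivalent work for you.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping the threshold over every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is labelled 1
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented labels and scores, for illustration only
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc = area_under(roc_points(y_true, scores))
print(round(auc, 3))  # 0.889 (exactly 8/9)
```

No single threshold is singled out anywhere in this computation, which is why the resulting number cannot be read as the performance of the classifier you would actually deploy.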


The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).

Now, in methods like scikit-learn's predict, which return labels (1/0), this threshold is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
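To make the role of the threshold concrete, here is a minimal sketch that turns probabilities into labels by hand. The probabilities are made up, and to_labels is a hypothetical helper, not a scikit-learn function; with the model from the question, the probabilities would be lr.predict_proba(test_set_x)[:, 1].

```python
def to_labels(probabilities, threshold=0.5):
    """Label a sample 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


probs = [0.10, 0.35, 0.45, 0.55, 0.80]  # invented positive-class probabilities

# Threshold 0.5 mirrors what predict effectively does for a binary
# probabilistic model such as LogisticRegression.
default_labels = to_labels(probs)

# A lower threshold flags more samples as 1.
lower_labels = to_labels(probs, threshold=0.3)

print(default_labels)  # [0, 0, 0, 1, 1]
print(lower_labels)    # [0, 1, 1, 1, 1]
```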


  • when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
  • when you ask for AUC (which, in contrast, uses probabilities returned with predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds
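The two bullets can be reproduced on a toy example. The sketch below uses invented labels and probabilities (8 negatives, 2 positives, every probability below 0.5) and two hypothetical helpers: accuracy_at applies a fixed threshold, while pairwise_auc uses the rank formulation of AUC, the fraction of (positive, negative) pairs in which the positive sample gets the higher score, which equals the area under the ROC curve. It is meant only to show that the two numbers are computed from different things, not to reproduce the figures in the question.

```python
def accuracy_at(y_true, probs, threshold=0.5):
    """Accuracy of the labels obtained by cutting the probabilities at a fixed threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def pairwise_auc(y_true, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for p, y in zip(probs, y_true) if y == 1]
    neg = [p for p, y in zip(probs, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented data: 8 negatives followed by 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.22, 0.33]

acc = accuracy_at(y_true, probs)   # every sample is labelled 0, so 8 of 10 are right
auc = pairwise_auc(y_true, probs)  # 10 of the 16 positive/negative pairs are ordered correctly

print(acc)  # 0.8
print(auc)  # 0.625
```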


Given these clarifications, your particular example provides a very interesting case in point:


I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?


Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care. For all practical purposes, what you care about is a classifier deployed with a specific threshold; what this classifier does in a purely theoretical and abstract situation, averaged across all possible thresholds, should be of very little interest to a practitioner (it is of interest to a researcher coming up with a new algorithm, but I assume that this is not your case).


(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
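As a small illustration of why accuracy can mislead on imbalanced data, the sketch below counts the confusion-matrix cells by hand for invented labels and predictions and derives precision and recall from them. The helper confusion_counts is hypothetical; scikit-learn offers confusion_matrix, precision_score and recall_score in sklearn.metrics for the same purpose.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tn = pairs.count((0, 0))
    fp = pairs.count((0, 1))
    fn = pairs.count((1, 0))
    tp = pairs.count((1, 1))
    return tn, fp, fn, tp


# Invented data: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard against no predicted positives
recall = tp / (tp + fn) if (tp + fn) else 0.0     # guard against no actual positives

print(tn, fp, fn, tp)               # 7 1 1 1
print(accuracy, precision, recall)  # 0.8 0.5 0.5
```

Here the accuracy of 0.8 looks respectable, yet only half of the actual positives are found and only half of the predicted positives are correct.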

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:


Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.


One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system

Emphasis mine - see also On the dangers of AUC...