Introduction
The discussion of binary predictors (e.g. retain/attrit) so far has focused on the AUC measure of an independent variable's power to predict the outcome. As with any performance measure, we have to make assumptions and then live with the consequences. In this case, we set the question up in a rather artificial way: we assume that we have to choose between two cases, one with each outcome, and use the predictor to guess which is which.

Despite its limitations, I find AUC useful for scans of predictors and for communicating to non-technical audiences. It's much easier to explain the "guess which" scenario than regression goodness-of-fit measures. So the purpose of this post is to delve into AUC and its dual nature as a probability interpretation and a geometrical one. Last time I claimed that AUC was:
- The probability of correctly distinguishing a case with outcome 1 from a case with outcome 0 when given the explanatory variable.
- The area under the ROC curve.
It is by no means obvious that these two things should be the same, so I'll use the 2x2 case to show it's true in the simplest setting. We already know that the outcome $Y$ only takes values 0 or 1, and in this simple case, the explanatory variable $X$ also takes only the values 0 or 1. For example, if you want to use "early admit" status to predict first-year retention, both variables are binary. I'll continue to use that case as an example for concreteness.
Preliminaries
Start with what is called the "confusion" matrix, where we take historical data and populate the four possible combinations of the predictor $X$ and outcome $Y$ with frequencies of occurrence:

$$\begin{array}{c|cc}
 & Y=0 & Y=1 \\ \hline
X=0 & A & B \\
X=1 & C & D
\end{array}$$

Here, $A$ is the fraction of all cases in the data where the predictor is zero (i.e. not early admit) and the outcome was zero (student did not return for the second year).
Recall that the setup for the AUC calculation is that we assume a random draw has occurred to choose a case where $Y=0$ and a case where $Y=1$. That means we need to condition the confusion matrix by dividing by the column sums to obtain conditional probabilities:

$$\begin{array}{c|cc}
 & Y=0 & Y=1 \\ \hline
X=0 & a = \frac{A}{A+C} & b = \frac{B}{B+D} \\
X=1 & c = \frac{C}{A+C} & d = \frac{D}{B+D}
\end{array}$$
where for convenience I've renamed the normalized values with lowercase letters. Now we can proceed to compare the probability and area interpretations of AUC.
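To make this concrete, here's a minimal Python sketch (mine, with made-up frequencies; any four numbers summing to one will do) that builds the conditioned matrix:

```python
import numpy as np

# Hypothetical joint frequencies: rows are the predictor X (0 = not early
# admit, 1 = early admit), columns are the outcome Y (0 = attrit, 1 = retain).
# These are [[A, B], [C, D]] and sum to 1.
joint = np.array([[0.20, 0.15],
                  [0.10, 0.55]])

# Condition on the outcome: divide each column by its sum to get
# [[a, b], [c, d]], where each column now sums to 1.
cond = joint / joint.sum(axis=0, keepdims=True)
print(cond)              # approximately [[0.667, 0.214], [0.333, 0.786]]
print(cond.sum(axis=0))  # [1. 1.]
```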
AUC as Probability
Suppose we are presented with two cases, and we know that one of them has $Y=0$ and the other has $Y=1$. If our predictor is any good, then $X=1$ should associate more with $Y=1$. It's also possible that the predictor is reversed (as in negatively correlated), but that just entails reversing the direction of the prediction. For example, if we used first-generation status to predict retention, we would probably find that those students had lower retention, so we would need to change the semantics of the predictor to NOT first-generation student to get the two variables to positively correlate. That's just for convenience in interpretation.
Assuming $X$ is useful, we'll then predict that if one case has $X=0$ and the other case has $X=1$, the second case is more likely to be the $Y=1$ case. In the example, if we are given a student who is early admit and one who is not, we'll assume it's the former who returns the next year.
When the random draw of the two cases is made, the probability that the $Y=1$ case has $X=1$ is $d$ in the matrix, and the probability that the $Y=0$ case has $X=0$ is $a$, so the probability that both happen independently is the product $ad$. Similarly, the probability that our predictor gets it exactly wrong is $bc$, and the probability of a tie (both cases are early admit or both are not) is $ab + cd$. In the case of a tie, we just have to guess, so the probability of a correct answer is one half.
This gives us a probability of correct classification of

$$\text{AUC} = ad + \frac{ab + cd}{2} = \frac{a + d}{2},$$

where the last step comes from using the definitions of the lower-case variables: since each column sums to one, $c = 1 - a$ and $b = 1 - d$, so $ab + cd = a + d - 2ad$ (convert back to the capital letters to see it another way).
In the language of classifiers, $a$ and $d$ are called the True Negative Rate (TNR) and True Positive Rate (TPR) of classification. So the AUC is the average of the two in the 2x2 case.
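Here's a quick check of both the closed form and the two-case game itself, continuing with the made-up frequencies from before:

```python
import numpy as np

# Conditional probabilities from the hypothetical matrix above.
a, c = 0.20 / 0.30, 0.10 / 0.30   # P(X=0 | Y=0), P(X=1 | Y=0)
b, d = 0.15 / 0.70, 0.55 / 0.70   # P(X=0 | Y=1), P(X=1 | Y=1)

# Closed form: win outright with probability a*d, half credit on ties.
print(a * d + (a * b + c * d) / 2)  # 0.726...
print((a + d) / 2)                  # identical

# Monte Carlo version of the two-case game as a sanity check.
rng = np.random.default_rng(0)
x0 = rng.random(200_000) < c        # predictor for the Y=0 case
x1 = rng.random(200_000) < d        # predictor for the Y=1 case
score = np.where(x1 > x0, 1.0, np.where(x1 == x0, 0.5, 0.0))
print(score.mean())                 # approximately (a + d) / 2
```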
AUC as Area
The Receiver Operating Characteristic (ROC) curve plots the False Positive Rate (FPR) on the horizontal axis and the TPR on the vertical axis, where FPR = 1 - TNR. The fourth variation on this theme is the False Negative Rate (FNR = 1 - TPR). Putting these names on our matrix, we have

$$\begin{array}{c|cc}
 & Y=0 & Y=1 \\ \hline
X=0 & a = \text{TNR} & b = \text{FNR} \\
X=1 & c = \text{FPR} & d = \text{TPR}
\end{array}$$
The ROC graph looks like this:
*[Figure: ROC curve for our binary classifier, the polygon from $(0,0)$ to $(c, d)$ to $(1,1)$.]*
The most satisfying way to find the area of a polygon is to use the shoelace formula,

$$\text{Area} = \frac{1}{2}\left|\sum_{i=1}^{n} \left(x_i\, y_{i+1} - x_{i+1}\, y_i\right)\right|, \qquad (x_{n+1}, y_{n+1}) = (x_1, y_1),$$

whereby we get a list of points to traverse counter-clockwise, starting and ending at $(0,0)$. For the region under our ROC curve, those points are $(0,0)$, $(1,0)$, $(1,1)$, and $(c,d)$.
Cross-multiplying, summing, and dividing by two (see the link for details) gives

$$\frac{1 + d - c}{2} = \frac{a + d}{2}$$

(using $c = 1 - a$), which is the same as we got before.
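A small sketch to confirm the arithmetic, again with the hypothetical numbers:

```python
# Shoelace area for the region under the ROC polygon, traversed
# counter-clockwise: (0,0) -> (1,0) -> (1,1) -> (c,d).
def shoelace(points):
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

a, d = 2 / 3, 11 / 14            # TNR and TPR from the example
c = 1 - a                        # FPR
print(shoelace([(0, 0), (1, 0), (1, 1), (c, d)]))  # 0.726...
print((a + d) / 2)               # the same
```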
Discussion
The illustration shows that for a binary predictor and a binary outcome, the probability of correct classification is the same as the area under the ROC curve, and that both are equal to the average of the true positive and true negative rates. Under more general conditions, it is still true that the probability and the area are equal. The single TPR and TNR values are not defined in that case; they specifically refer to the 2x2 version.
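As a sanity check on the general claim, here's a sketch comparing the pairwise-comparison probability with scikit-learn's ROC area for a continuous score (simulated data of my own, not from the original example):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
neg = rng.normal(0.0, 1.0, 1000)   # scores for Y=0 cases
pos = rng.normal(0.8, 1.0, 1000)   # scores for Y=1 cases

# Probability interpretation: P(pos score > neg score), ties count half.
diff = pos[:, None] - neg[None, :]
p_win = (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Area interpretation: area under the empirical ROC curve.
y_true = np.concatenate([np.zeros(1000), np.ones(1000)])
y_score = np.concatenate([neg, pos])
print(p_win, roc_auc_score(y_true, y_score))  # the two agree
```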
There are other statistical connections we can make to AUC, and the blog post linked above does some of that. For the 2x2 case, we have

$$\text{AUC} - \frac{1}{2} = \frac{d - c}{2} = \frac{AD - BC}{2(A+C)(B+D)}.$$

That is, if you compute how much area in the ROC lies above the "guessing" rate (the line from $(0,0)$ to $(1,1)$), you find it's $\frac{AD - BC}{2(A+C)(B+D)}$: the covariance $AD - BC$ of the two binary variables divided by twice the variance $(A+C)(B+D)$ of the outcome. This is almost the definition of the Pearson correlation between the predictor and outcome variables, which divides the same covariance by $\sqrt{\text{Var}(X)\,\text{Var}(Y)}$ instead.
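And numerically, with the same hypothetical frequencies as before:

```python
import numpy as np

# Hypothetical joint frequencies A, B, C, D from earlier (they sum to 1).
A, B, C, D = 0.20, 0.15, 0.10, 0.55

auc = (A / (A + C) + D / (B + D)) / 2       # (a + d) / 2

cov = A * D - B * C                         # Cov(X, Y) for binary variables
var_y = (B + D) * (A + C)                   # Var(Y)
var_x = (C + D) * (A + B)                   # Var(X)

print(auc - 0.5)                            # area above the diagonal: 0.226...
print(cov / (2 * var_y))                    # the same
print(cov / np.sqrt(var_x * var_y))         # Pearson (phi) correlation: 0.43...
```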