Introduction
The discussion of binary predictors (e.g. retain/attrit) so far has focused on the AUC measure of an independent variable's power to predict the outcome. As with any performance measure, we have to make assumptions and then live with the consequences. In this case, we set the question up in a rather artificial way: we assume that we have to choose between two cases \(y_0\) and \(y_1\), where the first case has outcome 0 (e.g. a student who failed to return, if predicting retention), and the second case has outcome 1 (e.g. a student who did return). The job of the independent variable \(x\) is to distinguish which case is which. This is a different task from taking all the \((x_i, y_i)\) pairs of data and, for example, using regression to minimize error. Regression models have other kinds of performance measures, although we can use AUC for a logistic (i.e. binary) regression. See Frank Harrell's blog for some commentary from an expert.
Despite its limitations, I find AUC useful for scans of predictors and for communicating to non-technical audiences. It's much easier to explain the "guess which" scenario than regression goodness-of-fit measures. So the purpose of this post is to delve into AUC and its dual nature: it has both a probability interpretation and a geometrical one. Last time I claimed that AUC was:
- The probability of distinguishing \(y_0\) from \(y_1\) when given the explanatory variable \(x\)
- The area under the ROC curve.
It is by no means obvious that these two things should be the same, so I'll use the 2x2 case, the simplest one, to show that it's true there. We already know that the outcome only takes values 0 or 1, and in this simple case the explanatory variable also takes only the values 0 or 1. For example, if you want to use "early admit" status to predict first year retention, both variables are binary. I'll continue to use that case as an example for concreteness.
Preliminaries
Start with what is called the "confusion" matrix, where we take historical data and populate the four possible outcomes of the predictor
$$ \begin{matrix} & y = 0 & y = 1 \\ x = 0 & A & B \\ x = 1 & C & D \end{matrix} $$
with frequencies of occurrence. Here, \(A\) is the fraction of all cases in the data where the predictor is zero (i.e. not early admit) and the outcome was zero (student did not return for second year).
Recall that the setup for the AUC calculation is that we assume a random draw has occurred to choose a case where \(y = 0\) and a case where \(y = 1\). That means we need to condition the confusion matrix by dividing by the column sums to obtain conditional probabilities:
$$ \begin{pmatrix} \frac{A}{A+C} & \frac{B}{B+D} \\ \frac{C}{A+C} & \frac{D}{B+D} \end{pmatrix} := \begin{pmatrix} a & b \\ c & d \end{pmatrix},$$
where for convenience I've renamed the normalized values with lowercase letters. Now we can proceed to compare the probability and area interpretations of AUC.
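As a minimal sketch of this normalization step, here is how it might look in Python. The counts are invented for illustration (they are not real retention data), and the function name `normalize_columns` is my own.

```python
def normalize_columns(A, B, C, D):
    """Divide each cell by its column sum to get P(x | y).

    Rows are x = 0, 1 and columns are y = 0, 1, matching the
    confusion matrix above. Inputs may be counts or fractions.
    """
    col0 = A + C  # all cases with y = 0
    col1 = B + D  # all cases with y = 1
    a, c = A / col0, C / col0
    b, d = B / col1, D / col1
    return a, b, c, d


# Hypothetical counts for 200 students, invented for illustration.
a, b, c, d = normalize_columns(A=30, B=50, C=10, D=110)
print(a, b, c, d)
```

Because each column is divided by its own total, \(a + c = 1\) and \(b + d = 1\) no matter what the counts are.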
AUC as Probability
Suppose we are presented with two cases, and we know that one of them has \(y=0\) and the other has \(y=1\). If our predictor is any good, then \(x = 1\) should associate more with \(y=1\). It's also possible that the predictor is reversed (as in negatively correlated), but that just entails reversing the direction. For example, if we used first generation status to predict retention, we would probably find that those students had lower retention, so we would need to change the semantics of the predictor to NOT first generation student to get the two variables to positively correlate. That's just for convenience in interpretation.
Assuming \(x\) is useful, we'll then predict that if one case has \(x = 0\) and the other case has \(x = 1\) that the second case is more likely to be the \(y=1\) case. In the example, if we are given a student who is early admit and one who is not, we'll assume it's the former who returns the next year.
When the random draw of the two cases is made, the probability that \(y_0\) has \(x = 0\) is \(a\) in the matrix, and the probability that \(y_1\) has \(x = 1\) is \(d\), so the probability that both happen independently is the product \(ad\). Similarly, the probability that our predictor gets it exactly wrong is \(bc\), and the probability of a tie (both cases are early admit or both are not) is \(ab + cd\). In the case of a tie, we just have to guess, so the probability of a correct answer is one half.
This gives us a probability of correct classification of $$ \begin{align} ad + \frac{1}{2}(ab + cd) & = \frac{2ad + ab + cd}{2} \\ & = \frac{a(b+d) + d(a+c)}{2} \\ & = \frac{a + d}{2} \end{align},$$
where the last step uses the fact that each column of the normalized matrix sums to one, so \(a + c = 1\) and \(b + d = 1\) (convert back to the capital letters to see it).
In the language of classifiers, \(a\) and \(d\) are called the True Negative Rate (TNR) and the True Positive Rate (TPR), respectively. So the AUC is the average of the two in the 2x2 case.
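The "guess which" probability can also be checked by brute force. The sketch below, in Python, enumerates every possible pairing of a \(y=0\) case with a \(y=1\) case, scores ties as one half, and compares the result with the average of TNR and TPR. The data are hypothetical and the function name `pairwise_auc` is my own.

```python
from itertools import product


def pairwise_auc(x0, x1):
    """Probability of picking the y = 1 case from a random (y0, y1) pair.

    x0: predictor values for the cases with y = 0
    x1: predictor values for the cases with y = 1
    A tie is scored as one half, i.e. a coin flip.
    """
    score = 0.0
    for u, v in product(x0, x1):
        if v > u:        # predictor points the right way
            score += 1.0
        elif v == u:     # tie: guess
            score += 0.5
    return score / (len(x0) * len(x1))


# Hypothetical data, invented for illustration.
x0 = [0] * 30 + [1] * 10    # 40 cases with y = 0
x1 = [0] * 50 + [1] * 110   # 160 cases with y = 1

a = x0.count(0) / len(x0)   # TNR
d = x1.count(1) / len(x1)   # TPR
print(pairwise_auc(x0, x1), (a + d) / 2)
```

The two printed numbers should agree, since the enumeration is just the \(ad + \frac{1}{2}(ab + cd)\) calculation done one pair at a time.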
AUC as Area
The Receiver Operating Characteristic (ROC) curve plots the False Positive Rate (FPR) on the horizontal axis and the TPR on the vertical axis, where FPR = 1 - TNR. The fourth variation on this theme is the False Negative Rate (FNR), where FNR = 1 - TPR. Putting these names on our matrix, we have
$$ \begin{matrix} a = TNR & b = FNR \\ c = FPR & d = TPR \end{matrix} $$
The ROC graph looks like this:
Figure: ROC curve for our binary classifier.
The most satisfying way to find the area of a polygon is to use this formula (the shoelace formula), which takes a list of vertices traversed counter-clockwise, starting and ending at (0,0):
$$ \begin{matrix} 0 & 0 \\ 1 & 0 \\ 1 & 1 \\ c & d \\ 0 & 0 \end{matrix} $$
Cross-multiplying, summing, and dividing by two (see the link for details) gives \( (1 + d - c)/2 \), which is the same as \( ( a + d) /2 \), as we got before.
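For anyone who would rather see the cross-multiplication spelled out, here is a small Python sketch of the shoelace calculation applied to the ROC polygon. The values of \(c\) and \(d\) are hypothetical, and `shoelace_area` is a name I made up for the helper.

```python
def shoelace_area(points):
    """Area of a simple polygon with vertices in counter-clockwise order.

    The first vertex does not need to be repeated at the end,
    because the loop wraps around to it.
    """
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1  # cross-multiply adjacent vertices
    return total / 2


c, d = 0.25, 0.6875  # hypothetical FPR and TPR
area = shoelace_area([(0, 0), (1, 0), (1, 1), (c, d)])
print(area, (1 + d - c) / 2)
```

Only two of the four edges contribute anything: the edge from (1,0) to (1,1) gives 1, and the edge from (1,1) to \((c,d)\) gives \(d - c\). The edges touching the origin contribute zero.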
Discussion
The illustration shows that for a binary predictor and a binary outcome, the probability of correct classification is the same as the area under the ROC curve, and that both are equal to the average of the true positive and true negative rates. Under more general conditions, it is still true that the probability and the area are equal. However, there is no single TPR and TNR in that case, since a predictor with more than two values gives a different pair of rates at each threshold, so the average-of-rates formula is specific to the 2x2 version.
There are other statistical connections we can make to AUC, and the blog post linked above does some of that. For the 2x2 case, we have
$$ \begin{align} Var(X) &= (A + B)(C + D) \\ Var(Y) &= (A + C)(B + D) \\ Cov(X,Y) & = AD - BC \end{align} $$
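If you compute how much area in the ROC lies above the "guessing" rate (the line from (0,0) to (1,1)), you find it's \( (ad - bc)/2 \). Converting back to capital letters, \( ad - bc = Cov(X,Y)/Var(Y) \), the covariance divided by the variance of the outcome. This is close to the definition of the Pearson correlation between the predictor and outcome variables, which instead divides the covariance by \( \sqrt{Var(X)Var(Y)} \).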
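As a rough numerical check of these relationships, here is a short Python sketch. The cell fractions are invented for illustration. It computes the moments of the two 0/1 variables from their definitions, then prints \(ad - bc\) next to the covariance over \(Var(Y)\) and twice the area between the ROC curve and the diagonal, with the Pearson correlation alongside for comparison.

```python
# Hypothetical cell fractions that sum to one (invented for illustration).
A, B, C, D = 0.15, 0.25, 0.05, 0.55

# Moments of two 0/1 variables, computed from their definitions.
p_x = C + D                  # P(x = 1)
p_y = B + D                  # P(y = 1)
var_x = p_x * (1 - p_x)
var_y = p_y * (1 - p_y)
cov = D - p_x * p_y          # E[xy] - E[x]E[y]

# Column-normalized rates from the confusion matrix.
a, c = A / (A + C), C / (A + C)
b, d = B / (B + D), D / (B + D)

auc = (a + d) / 2
pearson = cov / (var_x * var_y) ** 0.5

print(cov, A * D - B * C)                        # covariance, two ways
print(a * d - b * c, cov / var_y, 2 * (auc - 0.5))
print(pearson)
```

With other cell fractions the three numbers on the second printed line should still match each other, while the Pearson correlation will generally differ from them unless \(Var(X) = Var(Y)\).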