Introduction
In the last post on this topic, the AUC metric of classifier performance was seen to have a dual nature. It can be viewed as a probability calculation or the area under a particular curve (AUC = area under curve). These are mathematically equivalent, and the basic principle behind both is easy to understand and useful to know.
Recall that we want to distinguish a "success" case \(Y_1\) from a "failure" case \(Y_0\), based on some information we have about each case, \(X_1\) and \(X_0\). Unless the scale is reversed, we compare the two predictors and guess that \(Y_1\) is the one with the greater value of \(X\). If there is a tie, we have to flip a coin.
The explanatory power of a predictor in these conditions derives from
$$( Pr[X_1 > X_0] - Pr[X_1 < X_0] ) /2.$$
That is, we take the probability that the predictor gives us the correct answer, subtract the probability that it gives us the wrong answer, and halve the result. On the ROC graph, this halved difference is the area between the ROC curve and the diagonal guessing line.
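As a rough illustration of the pairwise comparison, the following R sketch estimates the AUC by comparing every success score with every failure score. The function name `pairwise_auc` and the score vectors are made up for this example; they are not the Titanic data.

```r
# Pairwise estimate of AUC: compare every success score with every failure score.
# x1 = predictor values for the Y = 1 cases, x0 = values for the Y = 0 cases.
pairwise_auc <- function(x1, x0) {
  wins <- outer(x1, x0, ">")   # TRUE where the success case has the larger score
  ties <- outer(x1, x0, "==")  # TRUE where the scores tie and we flip a coin
  mean(wins + 0.5 * ties)
}

# Hypothetical scores, made up for illustration
x1 <- c(3, 2, 3, 1)
x0 <- c(1, 2, 1, 3)

auc <- pairwise_auc(x1, x0)

# Excess over guessing: (Pr[correct] - Pr[wrong]) / 2
excess <- (mean(outer(x1, x0, ">")) - mean(outer(x1, x0, "<"))) / 2
```

For these made-up scores, `excess` equals `auc - 0.5`, which is the relationship the formula expresses.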
Titanic Example
The distribution of passenger class is different for survivors and non-survivors. The rates within each group are $$ \begin{matrix}
Pclass & Y_0 &Y_1\\
1 & 0.147 & 0.398\\
2 & 0.178 & 0.254\\
3 & 0.675 & 0.348\\
\end{matrix}$$
For example, for a randomly-selected survivor \(Y_1\), the probability of being in first class is about 40%, but for a non-survivor it's only 15%. So when a random selection is made from each of these columns to pick a \(Y_1\) and \(Y_0\), the combined probabilities are obtained by multiplying out each of the possibilities. See the R code at the end for how to do this. The result is a square matrix
$$\begin{matrix}
& & & Y_0 \\
& & Class1 & Class2 & Class3\\
& Class1 & 0.058 & 0.071 & 0.269\\
Y_1 & Class2 & 0.037 & 0.045 & 0.172\\
& Class3 & 0.051 & 0.062 & 0.235\\
\end{matrix}$$
showing for example that the joint probability of the survivor \(Y_1\) being in first class and the non-survivor \(Y_0\) being in third class is 0.398 × 0.675 ≈ 0.269, or 26.9%.
We will correctly classify the two cases (guessing which one of the two is \(Y_1\)) only when \(Y_1\) is in the higher passenger class (first class ranks above third, so this is the reversed scale where the lower class number wins), or when the two classes are the same and we guess correctly. To reflect the chance of a correct guess for each entry in the matrix above, we weight the entries accordingly to obtain
$$\begin{matrix}
& Class1 & Class2 & Class3\\
Class1 & 0.029 & 0.071 & 0.269\\
Class2 & 0.000 & 0.023 & 0.172\\
Class3 & 0.000 & 0.000 & 0.117\\
\end{matrix}$$
The entries above the diagonal haven't changed, and the ones on the diagonal (where the classes are equal) have been halved, since we have to flip a coin in those cases. Below the diagonal, the predictor is giving us the wrong answer: the survivor is in a lower passenger class (a higher class number) than the non-survivor, so those cells are zeroed out. They don't add to the probability of being correct.
We can obtain the AUC by simply adding up the probabilities in the last matrix, to get .68. Here's the ROC graph.
Figure: ROC graph for passenger class predicting survival
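Spelling out the sum over the weighted matrix, with the diagonal entries grouped first and the entries above the diagonal second (all values are the rounded entries shown in the matrix):

$$ AUC \approx (0.029 + 0.023 + 0.117) + (0.071 + 0.269 + 0.172) = 0.169 + 0.512 = 0.681 $$

The rounded entries give .681; either way the result rounds to the .68 quoted above.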
Improvement Over Guessing
The computation above gives us the whole AUC, which includes the cases in which we just guess correctly. Notice that the diagonal of the last matrix sums to only about 17%, far less than the 50% rate we can get by ignoring passenger class and just flipping a coin.
To get the excess probability over guessing, which in our case is .68 - .50 = .18, we have to use the formula advertised at the top. For each combination of passenger classes, like 1 vs 2, find the probability that we get the right answer, subtract the probability that we get the wrong answer, and divide by two. The resulting matrix looks like
$$ \begin{matrix}
& Class1 & Class2 & Class3\\
Class1 & 0.000 & 0.017 & 0.109\\
Class2 & -0.017 & 0.000 & 0.055\\
Class3 & -0.109 & -0.055 & 0.000\\
\end{matrix} $$
We only care about the upper triangle here, above the diagonal, which sums to .017 + .109 + .055 ≈ .18. (The entries below the diagonal just tell us what would happen if our predictor were reversed: we'd get an AUC that is less than .5 by .18.)
The steps in the calculation of the excess probability are the same as finding the area of a triangular region, which is one way the AUC can be calculated. The comparison is shown in the areas in the image below.
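The geometric route can be sketched in R as well. This is a minimal illustration, assuming the rounded class shares from the rates table and a rule that predicts survival whenever Pclass is at or below a cutoff; the variable names are hypothetical. It builds the ROC points and applies the trapezoid rule.

```r
# Class shares from the rates table (entries for Pclass 1, 2, 3)
p_y0 <- c(0.147, 0.178, 0.675)  # non-survivors
p_y1 <- c(0.398, 0.254, 0.348)  # survivors

# Rule "predict survival if Pclass <= k" for cutoffs k = 0, 1, 2, 3
fpr <- c(0, cumsum(p_y0))  # false positive rate at each cutoff
tpr <- c(0, cumsum(p_y1))  # true positive rate at each cutoff

# Trapezoid rule: width times average height of each ROC segment
auc_geometric <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Subtract the triangle under the diagonal guessing line
excess <- auc_geometric - 0.5
```

With these rounded rates the area comes to about .68 and the excess to about .18, in line with the sums over the probability matrices.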
[Edit] Why Divide by Two?
The divide-by-two in the formula may seem mysterious. Here's where it comes from. I'll demonstrate for the 2x2 case. The general result comes from the same idea.
If we go back to our column-normalized matrix
$$ \begin{matrix} & Y = 0 & Y = 1 \\ X = 0 & a & b \\ X = 1 & c & d \end{matrix} $$
and multiply columns (outer product of the second column with the first), we get the probability of each possibility:
$$ \begin{matrix} ab & bc \\ ad & cd \end{matrix} $$
The predictive power comes from the \(ad\) case, where the values of \(X\) are in the correct direction to pick \(Y = 1\). As we saw before,
$$ AUC = ad + \frac{1}{2}(ab+cd) $$
where the second term is due to guessing when the predictor values are the same for both cases. The algebraic trick we need is the fact that the second matrix above sums to 1: each column of the first matrix sums to 1, so \(ab + bc + ad + cd = (a + c)(b + d) = 1\). That lets us also write
$$ AUC = 1 - bc - \frac{1}{2}(ab+cd) $$
Then we just add the two AUC formulas and divide by two to get
$$ AUC = \frac{1}{2} + \frac{1}{2}(ad-bc) $$
The explanatory power beyond guessing is what we get after subtracting the 1/2. In the general case, we have to sum up all the pieces that look like \(ad-bc\), which amounts to taking the probability matrix, subtracting its transpose, summing anything above the diagonal, and dividing by two.
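The general recipe can be checked numerically. The sketch below assumes a hypothetical four-level predictor where, as with passenger class, the lower level favors \(Y = 1\), so the correct answers sit above the diagonal. The two distributions are invented for illustration.

```r
# Hypothetical distributions of a 4-level predictor, made up for illustration
p1 <- c(0.40, 0.30, 0.20, 0.10)  # distribution of X among the Y = 1 cases
p0 <- c(0.10, 0.20, 0.30, 0.40)  # distribution of X among the Y = 0 cases

# Joint probabilities: rows index X_1, columns index X_0
P <- outer(p1, p0)

# Direct sum: full weight above the diagonal, half weight on the diagonal
auc_direct <- sum(P[upper.tri(P)]) + 0.5 * sum(diag(P))

# Recentered form: subtract the transpose, sum above the diagonal, halve, add 1/2
D <- P - t(P)
auc_recentered <- 0.5 + 0.5 * sum(D[upper.tri(D)])
```

Both expressions give the same value, here .75, so the excess over guessing is .25.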
This, in turn, is just an application of a kind of probabilistic "recentering around .5" you can do with any event A. If A' is the event's complement, then
\( Pr(A) = 1 - Pr(A') \) so \(2 Pr(A) = Pr(A) + 1 - Pr(A') \), or
$$ Pr(A) = \frac{1}{2} + \frac{1}{2}(Pr(A) - Pr(A')) $$
The direct summing of probabilities in the frequency matrix is the left side of the equation. The summing of triangles in the geometric AUC calculation is the right side.
R code
library(tidyverse)
library(knitr) # optional -- pretty tables
# read in data
pass <- read_csv("titanic.csv")
# calculate the passenger class frequencies conditional on survival status
rates <- pass %>%
  select(Y = Survived, Pclass) %>%
  group_by(Y, Pclass) %>%
  summarize(N = n()) %>%
  mutate(P = N / sum(N)) %>%
  select(-N) %>%
  spread(Y, P, sep = "_")
# print out the matrix
# note: cleaner() is a helper that is not defined in this post;
# drop it from the pipe to run the code as-is
rates %>% kable(digits = 3, format = "latex") %>% cleaner()
# create the matrix of probabilities for class combinations for Y0 and Y1
# (rows index the survivor's class, columns the non-survivor's class)
class_prob <- outer(rates$Y_1, rates$Y_0)
rownames(class_prob) <- colnames(class_prob) <- c("Class1","Class2","Class3")
# print out the matrix
class_prob %>% kable(digits = 3, format = "latex") %>% cleaner()
# weight 1 above the diagonal: the survivor has the better (lower-numbered) class, so we guess right
# weight .5 on the diagonal: the classes tie, so we have to guess
# weight 0 below the diagonal: we get the wrong answer
case_weights <- upper.tri(class_prob) + 0
diag(case_weights) <- .5 # classes match, so guess
# print out the matrix
(class_prob*case_weights) %>% kable(digits = 3)
# compute the AUC
sum(class_prob*case_weights)
# compute the explanatory power in excess of guessing
prob_diff <- (class_prob - t(class_prob)) / 2
# print it
prob_diff %>% kable(digits = 3)
sum(prob_diff[upper.tri(prob_diff)])