Thursday, October 31, 2019

ROC Example

Introduction

The discussion of binary predictors (e.g. retain/attrit) so far has focused on the AUC measure of an independent variable's power to predict the outcome. As with any performance measure, we have to make assumptions and then live with the consequences. In this case, we set the question up in a rather artificial way: we assume that we have to choose between two cases \(y_0\) and \(y_1\), where the first case has outcome 0 (e.g. a student who failed to return, if predicting retention), and the second case has outcome 1 (e.g. a student who did return). The job of the independent variable \(x\) is to distinguish which case is which. This is a different task than taking all the \(x_i, y_i\) pairs of data and, for example, using regression to minimize error. Regression models have other kinds of performance measures, although we can use AUC for a logistic (i.e. binary) regression. See Frank Harrell's blog for some commentary from an expert.

Despite its limitations, I find AUC useful for scans of predictors and for communicating to non-technical audiences. It's much easier to explain the "guess which" scenario than regression goodness of fit measures. So the purpose of this post is to delve into AUC and its dual nature as a probability interpretation and a geometrical one. Last time I claimed that AUC was:

  • The probability of distinguishing \(y_0\) from \(y_1\) when given the explanatory variable \(x\)
  • The area under the ROC curve.
It is by no means obvious that these two things should be the same, so I'll use the 2x2 case to show it's true in the simplest setting. We already know that the outcome only takes values 0 or 1, and in this simple case, the explanatory variable also takes only the values 0 or 1. For example, if you want to use "early admit" status to predict first year retention, both variables are binary. I'll continue to use that case as an example for concreteness.

Preliminaries

Start with what is called the "confusion" matrix, where we take historical data and populate the four possible combinations of predictor and outcome

$$  \begin{matrix}  & y = 0 & y = 1 \\  x = 0 & A & B  \\ x = 1 &  C & D  \end{matrix} $$

with frequencies of occurrence. Here, \(A\) is the fraction of all cases in the data where the predictor is zero (i.e. not early admit) and the outcome is zero (the student did not return for the second year).

Recall that the setup for the AUC calculation is that we assume a random draw has occurred to choose a case where \(y = 0\) and a case where \(y = 1\). That means we need to condition the confusion matrix by dividing by the column sums to obtain conditional probabilities:

$$  \begin{pmatrix}    \frac{A}{A+C} & \frac{B}{B+D}  \\  \frac{C}{A+C} & \frac{D}{B+D}  \end{pmatrix} := \begin{pmatrix}    a & b  \\  c & d  \end{pmatrix},$$

where for convenience I've renamed the normalized values with lowercase letters. Now we can proceed to compare the probability and area interpretations of AUC.
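If you keep the confusion matrix in a small array, the conditioning step is a one-liner. Here's a minimal sketch with made-up frequencies (the values of \(A, B, C, D\) are hypothetical, chosen only to sum to one):

```python
# A minimal sketch: condition the confusion matrix on the outcome by
# dividing each column by its sum.  The frequencies are hypothetical.
import numpy as np

# rows are x = 0, 1; columns are y = 0, 1; entries are fractions of all cases
confusion = np.array([[0.30, 0.10],    # A, B
                      [0.20, 0.40]])   # C, D

conditional = confusion / confusion.sum(axis=0)   # columns now sum to 1
a, b = conditional[0]
c, d = conditional[1]
print(conditional)   # [[0.6, 0.2], [0.4, 0.8]]
```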

AUC as Probability

Suppose we are presented with two cases, and we know that one of them has \(y=0\) and the other has \(y=1\). If our predictor is any good, then \(x = 1\) should associate more with \(y=1\). It's also possible that the predictor is reversed (as in negatively correlated), but that just entails reversing the direction. For example, if we used first generation status to predict retention, we would probably find that those students had lower retention, so we would need to change the semantics of the predictor to NOT first generation student to get the two variables to positively correlate. That's just for convenience in interpretation. 

Assuming \(x\) is useful, we'll predict that if one case has \(x = 0\) and the other case has \(x = 1\), the second case is more likely to be the \(y=1\) case. In the example, if we are given a student who is early admit and one who is not, we'll assume it's the former who returns the next year.

When the random draw of the two cases is made, the probability that \(y_0\) has \(x = 0\) is \(a\) in the matrix, and the probability that \(y_1\) has \(x = 1\) is \(d\), so the probability that both happen independently is the product \(ad\). Similarly, the probability that our predictor gets it exactly wrong is \(bc\), and the probability of a tie (both cases are early admit or both are not) is \(ab + cd\). In the case of a tie, we just have to guess, so the probability of a correct answer is one half.

This gives us a probability of correct classification of $$ \begin{align} ad + \frac{1}{2}(ab + cd) & = \frac{2ad + ab + cd}{2} \\  & = \frac{a(b+d) + d(a+c)}{2} \\ & = \frac{a + d}{2} \end{align},$$

where the last step uses the fact that \(a + c = 1\) and \(b + d = 1\), which follows from the definition of the lower-case variables (convert back to the capital letters to see it).

In the language of classifiers, \(a\) and \(d\) are called the True Negative Rate (TNR) and the True Positive Rate (TPR). So the AUC is the average of the two in the 2x2 case.
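To see the probability interpretation in action, here's a quick simulation sketch of the guessing game, using hypothetical values of \(a, b, c, d\): draw one \(y=0\) case and one \(y=1\) case, predict that the case with the larger \(x\) is the \(y=1\) case, and flip a coin on ties.

```python
# Simulate the "guess which" game and compare the hit rate to (a + d)/2.
# The conditional probabilities a, b, c, d are hypothetical.
import random

a, b, c, d = 0.6, 0.2, 0.4, 0.8   # note a + c = 1 and b + d = 1
trials, correct = 100_000, 0

for _ in range(trials):
    x_for_y0 = 0 if random.random() < a else 1   # P(x = 0 | y = 0) = a
    x_for_y1 = 0 if random.random() < b else 1   # P(x = 0 | y = 1) = b
    if x_for_y1 > x_for_y0:
        correct += 1                              # predictor gets it right
    elif x_for_y1 == x_for_y0:
        correct += random.random() < 0.5          # tie: coin flip
    # otherwise the predictor is exactly wrong

print(correct / trials, (a + d) / 2)   # both close to 0.7
```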

AUC as Area

The Receiver-Operator Characteristic (ROC) curve plots the False Positive Rate (FPR) on the horizontal axis and the TPR on the vertical axis, where FPR = 1 - TNR. The fourth variation on this theme is the False Negative Rate (FNR), with FNR = 1 - TPR. Putting these names on our matrix, we have

$$ \begin{matrix} a = TNR & b = FNR \\ c = FPR & d = TPR \end{matrix} $$

The ROC graph looks like this:


ROC curve for our binary classifier.

The most satisfying way to find the area of a polygon is to use the shoelace formula, whereby we list the vertices to traverse counter-clockwise, starting and ending at (0,0).

$$ \begin{matrix} 0 & 0 \\ 1 & 0 \\ 1 & 1 \\ c & d \\ 0 & 0 \end{matrix} $$

Cross-multiplying, summing, and dividing by two (see the link for details) gives \( (1 + d - c)/2 \), which is the same as \( (a + d)/2 \) because \( a = 1 - c \). That matches the probability we computed before.
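Here's a minimal sketch of that calculation with hypothetical values of \(c\) and \(d\); the cross-multiply-and-sum step is the shoelace formula applied to the four-vertex ROC polygon.

```python
# Shoelace formula for the area of the ROC polygon, with hypothetical c and d.
def shoelace_area(points):
    """Polygon area, vertices counter-clockwise, first vertex repeated at the end."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        total += x1 * y2 - x2 * y1    # cross-multiply successive vertices
    return total / 2

a, d = 0.6, 0.8
c = 1 - a                             # FPR = 1 - TNR
polygon = [(0, 0), (1, 0), (1, 1), (c, d), (0, 0)]
print(shoelace_area(polygon), (a + d) / 2)   # both 0.7
```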

Discussion

The illustration shows that for a binary predictor and a binary outcome, the probability of correct classification is the same as the area under the ROC curve, and that both are equal to the average of the true positive and true negative rates. Under more general conditions, it is still true that the probability and the area are equal. A single TPR and TNR pair is not defined in that case--that formula refers specifically to the 2x2 version.

There are other statistical connections we can make to AUC, and the blog post linked above does some of that. For the 2x2 case, we have

$$ \begin{align} Var(X) &= (A + B)(C + D) \\ Var(Y) &= (A + C)(B + D) \\ Cov(X,Y) & = AD - BC \end{align} $$

If you compute how much area of the ROC lies above the "guessing" line (from (0,0) to (1,1)), you find it's \( (ad - bc)/2 \): the area under the ROC is \((a+d)/2\), and \(a + d - 1 = ad - bc\) since \(b = 1 - d\) and \(c = 1 - a\). The quantity \(ad - bc\) equals \( \frac{AD - BC}{(A+C)(B+D)} \), the covariance divided by the variance of \(Y\). That is almost the definition of the Pearson correlation between the predictor and outcome variables, which divides the covariance by the geometric mean of both variances.
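A quick numeric check of these identities, with hypothetical joint frequencies for \(A, B, C, D\):

```python
# Check the 2x2 identities above with hypothetical frequencies that sum to 1.
A, B, C, D = 0.30, 0.10, 0.20, 0.40

var_x = (A + B) * (C + D)
var_y = (A + C) * (B + D)
cov_xy = A * D - B * C

a, c = A / (A + C), C / (A + C)
b, d = B / (B + D), D / (B + D)

auc = (a + d) / 2
print(auc - 0.5, (a * d - b * c) / 2)      # area above the guessing line
print(a * d - b * c, cov_xy / var_y)       # same quantity, two ways
print(cov_xy / (var_x * var_y) ** 0.5)     # Pearson correlation, for comparison
```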







Monday, October 28, 2019

ROC and AUC

This post continues the previous topic of thinking in predictors. I showed a measure of predictiveness, called AUC, which can be thought of as the probability of correctly guessing which of two different outcomes is which. Here I'll show how to visualize predictive power and relate the graph to that probability.

Suppose we want to predict which students will do poorly academically in a gateway course, where we define that by the student receiving C, D, or F. If we can predict that outcome far enough in advance, maybe we can do something to help. Note that this is a binary outcome (i.e. binomial), meaning there are only two possibilities.

The predictors we might consider are class attendance, use of tutoring services, prior grades, etc. I'll use the first two of these as illustrations. For example, back to the cups of the prior post, suppose one cup has the ID of a student who earned a C, D, or F, and the other cup has the ID of a student who earned an A or B. You don't know which cup is which, but you are told that the student on the left came to class all the time, and the one on the right did not. You'd probably guess that the one on the left is the A/B student. The probability of that being correct on average is what I called AUC.

You can find many explanations of ROC curves and AUC online, but I find the following twist on it easier to understand.

Imagine that we have all our historical students listed in a spreadsheet, along with all the information we need to know about each in order to assess predictors. We need to know their course grade, and we need to know whatever information is potentially predictive of that grade. The grade will separate students into the sought group (CDF) and the ones who don't need help (AB). Any subset of the students will have a percentage of CDF and a percentage of AB students. We can visualize all possible subsets of students by plotting these as y and x.

Visualization of possible subsets of students

For example, the top right corner of the box contains 100% (all the way to the top) of the CDF students, and 100% (all the way to the right) of the AB students. That subset is the set of all students. At the bottom left, we have 0% of each, or no students--the empty set. If we had a perfect predictor, it would allow us to find the top left "pure" CDF subset just from the predictive information.

Randomly select half the students
Note that the number of students in each category (CDF versus AB) does not have to be equal, because we are using the percentage of each category separately. I put sample Ns on the figure above. Imagine that we randomly select half the students as our "predictor" of which ones will be CDF. On average, we'd expect to get half of the total in each type (i.e. 15 of the 30 CDFs and 50 of the 100 ABs). The red lines indicate 50% of each of the AB and CDF students (50 and 15, respectively). The point representing a 50% random selection of students is in the center of the diagram.

All random selections are on the diagonal on average

If we randomly chose 10% or 60% of the students instead of 50%, we'd still be on the diagonal, just lower to the left or higher to the right, depending on the percentage. Any random selection results on average in a subset represented by a point on the red line above. This is a handy reference line, because anything above that line means that the percent of CDF students is greater than random--this is what we want in a predictor.

Students who don't come to class

Now consider our possible predictors. Assume for the sake of illustration that when we look at the group of students who rarely (or never) come to class, ALL of them are in the CDF category. There are not very many of these students, maybe just 20% of all the CDFs, but this information is good to know. The green X marks the spot on the diagram that represents the subset of students who rarely/never come to class. Notice that the line from the bottom-left empty set goes straight up toward the set at the top left, which is just CDF and no AB students (the ideal set we could find with a perfect predictor).

Adding information about students who don't use tutoring
Now we can add more information we know about students. The next segment of the green line assumes that any student who does not come to class gets classified as CDF, and in addition we will classify any student who doesn't attend tutoring also as CDF. The combination of these two rules results in the second green X.

Notice that the slope of the line from the first X to the second is not vertical like the first line segment was. The new rule about tutoring is not nearly as good a predictor, because it's moving to the right, meaning that it includes AB students as well as CDF students. That's probably because some AB students don't need tutoring. However, the slope of the second line segment is still steeper than the red guessing line, meaning that the tutoring rule (no tutoring => CDF) is still better than random selection. The graph makes it visually easy to see this.
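If you want to build the green curve from data, the bookkeeping is just cumulative counts. Here's a sketch with hypothetical counts chosen to match the story above (the rule labels and numbers are made up):

```python
# Build the ROC points by applying classification rules in order and tracking
# the cumulative fraction of each group captured.  All counts are hypothetical.
n_cdf, n_ab = 30, 100   # group sizes from the earlier figure

# (rule, CDF students newly captured, AB students newly captured)
rules = [("rarely/never attends class",  6,  0),   # 20% of CDFs, no ABs
         ("no tutoring visits",         12, 25)]   # picks up some ABs too

tpr, fpr = [0.0], [0.0]          # start at the empty set (bottom left corner)
cdf_so_far = ab_so_far = 0
for rule, new_cdf, new_ab in rules:
    cdf_so_far += new_cdf
    ab_so_far += new_ab
    tpr.append(cdf_so_far / n_cdf)   # vertical axis: % of CDF students included
    fpr.append(ab_so_far / n_ab)     # horizontal axis: % of AB students included
    print(rule, round(tpr[-1], 2), round(fpr[-1], 2))
```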

In practice, after doing this analysis, we might decide to start with the students who aren't attending class, to see if we can solve that problem. It's inefficient to try to help students who don't need it, which makes the use of a predictor an economic decision, not just a mathematical one.

Graphs of this type are called Receiver-Operator Characteristic (ROC) curves, for historical reasons, and AUC stands for "area under the curve." Next time, I'll explain why this area and the probability we started with are the same thing.

Example: Surviving the Titanic

The passenger list from the Titanic, annotated with survival information, has become a common data set for prediction and machine learning exercises.
Titanic survivorship by passenger class
The graph shows a ROC curve using passenger class as a predictor of survivorship. The circled 1 shows us where the first class passengers are as a subset of all passengers. They comprise a little more than 40% of the passengers who survived (vertical axis), but less than 20% of the ones who died. Notice that the slope of the first line segment is significantly higher than the random selection line in red.

If we guess that all first class passengers will survive, and further assume that all second class passengers will too, we end up at the 2 on the diagram. Notice that the slope of the segment from 1 to 2 is better than random selection (the red reference line), but not by that much. If we also include 3rd class, we end up with all the passengers. Notice that the slope of the line from 2 to 3 is significantly less steep than the reference line--meaning the odds of surviving as a 3rd class passenger are worse than random selection from all passengers.
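As a sketch, the cumulative points on that curve could be computed like this, assuming a local copy of the usual Titanic teaching data set with pclass and survived columns (the file name and column names are assumptions; they vary between copies of the data):

```python
# Cumulative ROC points for passenger class as a predictor of survival.
# Assumes a local titanic.csv with 'pclass' and 'survived' (0/1) columns.
import pandas as pd

df = pd.read_csv("titanic.csv")
n_survived = (df["survived"] == 1).sum()
n_died = (df["survived"] == 0).sum()

tpr, fpr = [0.0], [0.0]
for pclass in (1, 2, 3):                   # add classes in predicted-survival order
    subset = df[df["pclass"] <= pclass]    # everyone predicted to survive so far
    tpr.append((subset["survived"] == 1).sum() / n_survived)
    fpr.append((subset["survived"] == 0).sum() / n_died)
    print(pclass, round(tpr[-1], 2), round(fpr[-1], 2))
```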

Quiz

Test your understanding by answering the questions about the figures below.
Answer the questions below

Questions:
  1. Which of these predictors has the highest probability of correct classification?
  2. Which has the lowest?
  3. Which of (a) and (d) would be preferred for a small intervention program?
Scroll down for answers.


























Answers:
  1. c has a curve that gets closest to the top left perfect prediction subset, and if you imagine the area being shaded under the green line, it's larger than any of the others.
  2. b is indistinguishable from random selection
  3. We prefer d because it initially goes straight up, meaning we can select a small treatment group with little error.


















Sunday, October 27, 2019

Thinking in Predictors

There are some ideas that help us see the world more clearly. For example, the idea that a flipped coin has no memory of past events helps us realize that a recent string of heads does not increase the likelihood of tails (if the coin-flipping is done fairly). Thinking in predictors is one of those ideas that can make complex problems a little easier to think about.

This post is an introduction to how to visualize predictive power in the simplest case, where the outcome we want to know about only has two values. Examples include predicting first year student retention, student graduation, and applicants enrolling after being admitted. Having only two outcomes means there are only two possible predictions in a particular case--we predict that the outcome is one or the other of the two possibilities.

Let's suppose that you want to predict which of the new first-year students will return for their second fall. Suppose for a moment that I have a crystal ball and know the answer for each of your students. I randomly choose one student who will retain and one who will not, and put their ID numbers under two cups on the table. Your job is to guess which one is the retained student.

With no additional information to go on, it's easy to see that your chances are 50% of choosing the correct cup.
Guess which is the cup with the retained student identifier
What if I tell you that the student associated with the right cup has a higher high school grade average than the one on the left? If your college is like mine, then predicting that the right cup holds the retained student raises your chances of being correct.
Now you have some information to use as a predictor. 
We can quantify how much this new information matters in increasing the probability of correct prediction. For a recent retention analysis, I calculated this probability for a few dozen possible success indicators. Here's part of that list, sorted by the best predictors.

List of predictors of two year retention
The first column is the name of the potential predictor variable. The next two columns are the predictive power for men and women, respectively, and the last column is how complete the data set is--the fraction of students for which we have that information. In this case, we're predicting two-year retention, and the first variable on the list (Attend) is the retention status itself.

Knowing the Attend variable would be having the answer before we have to choose the cup. Since the outcome variable can predict itself perfectly, the probability is 1 in the two AUC columns.

The second row (GPA) is the student's college grade average. If we know that information for each of the two students with IDs under the cups, and if we guess that the student with the higher GPA is the one who sticks around for two years, we'd be correct 68% of the time for males and 69% of the time for women students in the historical data. The data completeness rate is 99% because a few students leave before earning any grades. So as a measure of predictive power over two-year retention, grades have an AUC of about .68.
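As a sketch of how an AUC column like that can be computed, here are two equivalent calculations on simulated GPA and retention data: the direct pairwise "guess the higher GPA" win rate (ties counted as half) and scikit-learn's roc_auc_score. The numbers here are made up, not the real values behind the table.

```python
# Two equivalent AUC calculations on simulated data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
gpa = np.clip(rng.normal(3.0, 0.6, 500), 0, 4)                 # fake GPAs
retained = (gpa + rng.normal(0, 0.8, 500) > 3.0).astype(int)   # fake outcome

stay, leave = gpa[retained == 1], gpa[retained == 0]
wins = (stay[:, None] > leave[None, :]).mean()    # pairs the higher-GPA rule gets right
ties = (stay[:, None] == leave[None, :]).mean()   # ties: guess, so count as half
print(wins + 0.5 * ties)
print(roc_auc_score(retained, gpa))               # same number
```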

One of the more interesting predictors on the list is TFS_Transfer. This is a response to a question on the HERI freshman survey that asks about likelihood to transfer. This turns out to be a useful predictor for us, although we can only link IDs to 70% of the students.

Next time I'll explain why the columns with the probabilities are called AUCs and how you go about calculating them.