Monday, October 28, 2019

ROC and AUC

This post continues the previous topic of thinking in predictors. There I introduced a measure of predictiveness, called AUC, which can be thought of as the probability of correctly guessing which of two different outcomes is which. Here I'll show how to visualize predictive power and relate the graph to that probability.

Suppose we want to predict which students will do poorly academically in a gateway course, where we define doing poorly as the student receiving a C, D, or F. If we can predict that outcome far enough in advance, maybe we can do something to help. Note that this is a binary outcome (i.e. binomial), meaning there are only two possibilities.

The predictors we might consider are class attendance, use of tutoring services, prior grades, etc. I'll use the first two of these as illustrations. For example, back to the cups of the prior post, suppose one cup has the ID of a student who earned a C, D, or F, and the other cup has the ID of a student who earned an A or B. You don't know which cup is which, but you are told that the student on the left came to class all the time, and the one on the right did not. You'd probably guess that the one on the left is the A/B student. The probability of that being correct on average is what I called AUC.

You can find many explanations of ROC curves and AUC online, but I find the following twist on it easier to understand.

Imagine that we have all our historical students listed in a spreadsheet, along with all the information we need to know about each in order to assess predictors. We need to know their course grade, and we need to know whatever information is potentially predictive of that grade. The grade will separate students into the sought group (CDF) and the ones who don't need help (AB). Any subset of the students will contain some percentage of the CDF students and some percentage of the AB students. We can visualize all possible subsets of students by plotting these two percentages as y and x, respectively.
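As a sketch of this bookkeeping, here is a small Python function that maps any subset of students to its (x, y) point, where x is the fraction of AB students captured and y is the fraction of CDF students captured. The function name and the toy student records are hypothetical, just for illustration:

```python
def subset_point(subset, all_students):
    """Return (x, y): fraction of AB and CDF students captured by subset."""
    def is_cdf(s):
        return s["grade"] in {"C", "D", "F"}
    cdf_total = sum(1 for s in all_students if is_cdf(s))
    ab_total = len(all_students) - cdf_total
    cdf_in = sum(1 for s in subset if is_cdf(s))
    ab_in = len(subset) - cdf_in
    return ab_in / ab_total, cdf_in / cdf_total

# Toy data: two AB students and two CDF students.
students = [
    {"id": 1, "grade": "A"},
    {"id": 2, "grade": "C"},
    {"id": 3, "grade": "F"},
    {"id": 4, "grade": "B"},
]

print(subset_point(students, students))  # all students -> (1.0, 1.0)
print(subset_point([], students))        # empty set -> (0.0, 0.0)
print(subset_point([students[1], students[2]], students))  # pure CDF -> (0.0, 1.0)
```

The three printed points are exactly the corners discussed below: the whole roster at the top right, the empty set at the bottom left, and the "pure" CDF subset at the top left.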

Visualization of possible subsets of students

For example, the top right corner of the box contains 100% (all the way to the top) of the CDF students, and 100% (all the way to the right) of the AB students. That subset is the set of all students. At the bottom left, we have 0% of each, or no students--the empty set. If we had a perfect predictor, it would allow us to find the top left "pure" CDF subset just from the predictive information.

Randomly select half the students
Note that the numbers of students in each category (CDF versus AB) do not have to be equal, because we are using the percentage of each category separately. I put sample Ns on the figure above. Imagine that we randomly select half the students as our "predictor" of which ones will be CDF. On average, we'd expect to get half of the total in each type (i.e. 15 of the 30 CDFs and 50 of the 100 ABs). The red lines indicate 50% of each of the AB and CDF students (50 and 15, respectively). The point representing a 50% random selection of students is in the center of the diagram.

All random selections are on the diagonal on average

If we randomly chose 10% or 60% of the students instead of 50%, we'd still be on the diagonal, just lower to the left or higher to the right, depending on the percentage. Any random selection results on average in a subset represented by a point on the red line above. This is a handy reference line, because anything above that line means that the percent of CDF students is greater than random--this is what we want in a predictor.
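To see that random selection lands on the diagonal on average, here is a quick simulation using the sample Ns from the figure (30 CDF and 100 AB students). The setup is a sketch, not anything from the original data:

```python
import random

random.seed(0)
# 30 CDF students and 100 AB students, as in the example figure.
students = ["CDF"] * 30 + ["AB"] * 100

# Average over many random half-size selections:
# we expect to capture about 50% of each category.
trials = 10_000
cdf_frac = ab_frac = 0.0
for _ in range(trials):
    pick = random.sample(students, 65)  # half of the 130 students
    cdf_frac += pick.count("CDF") / 30
    ab_frac += pick.count("AB") / 100

print(cdf_frac / trials, ab_frac / trials)  # both close to 0.5
```

Any single random draw can land off the diagonal, but the average over many draws converges to the (0.5, 0.5) center point.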

Students who don't come to class

Now consider our possible predictors. Assume for the sake of illustration that when we look at the group of students who rarely (or never) come to class, ALL of them are in the CDF category. There are not very many of these students, maybe just 20% of all the CDFs, but this information is good to know. The green X marks the spot on the diagram that represents the subset of students who rarely or never come to class. Notice that the line from the bottom-left empty set goes straight up toward the set at the top left, which is all CDF and no AB students (the ideal set we could find with a perfect predictor).

Adding information about students who don't use tutoring
Now we can add more information we know about students. The next segment of the green line assumes that any student who does not come to class gets classified as CDF, and that, in addition, any student who doesn't attend tutoring is also classified as CDF. The combination of these two rules results in the second green X.

Notice that the slope of the line from the first X to the second is not vertical like the first line segment was. The new rule about tutoring is not nearly as good a predictor, because it's moving to the right, meaning that it includes AB students as well as CDF students. That's probably because some AB students don't need tutoring. However, the slope of the second line segment is still steeper than the red guessing line, meaning that the tutoring rule (no tutoring => CDF) is still better than random selection. The graph makes it visually easy to see this.
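The two-rule green line can be sketched as cumulative ROC points. The attendance counts below follow the 20%-of-CDFs assumption in the text; the tutoring counts are made up purely for illustration:

```python
# Each rule flags an additional group of students as predicted-CDF.
# The (cdf_flagged, ab_flagged) counts per rule are illustrative assumptions.
rules = [
    ("rarely attends class", 6, 0),    # 6 of 30 CDFs (20%), 0 of 100 ABs
    ("no tutoring",          15, 40),  # additional students flagged (made up)
]
cdf_total, ab_total = 30, 100

x = y = 0.0
points = [(0.0, 0.0)]
for name, cdf_n, ab_n in rules:
    x += ab_n / ab_total
    y += cdf_n / cdf_total
    points.append((x, y))
    # Segment slope: rise in CDF fraction over run in AB fraction.
    slope = (cdf_n / cdf_total) / (ab_n / ab_total) if ab_n else float("inf")
    print(f"{name}: point ({x:.2f}, {y:.2f}), segment slope {slope:.2f}")
# A segment slope greater than 1 beats the random-selection diagonal.
```

With these numbers, the attendance segment is vertical (infinite slope) and the tutoring segment has slope 1.25, steeper than the diagonal's slope of 1, matching the picture described above.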

In practice, after doing this analysis, we might decide to start with the students who aren't attending class, to see if we can solve that problem. It's inefficient to try to help students who don't need it, which makes the use of a predictor an economic decision, not just a mathematical one.

Graphs of this type are called Receiver Operating Characteristic (ROC) curves, a name that survives for historical reasons, and AUC stands for "area under the curve." Next time, I'll explain why this area and the probability we started with are the same thing.
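Since AUC is literally the area under this piecewise-linear curve, it can be computed with the trapezoid rule. The ROC points below are illustrative, roughly matching the two-rule example above:

```python
def auc(points):
    """Area under a piecewise-linear ROC curve via the trapezoid rule.

    points: (x, y) pairs sorted by x, running from (0, 0) to (1, 1).
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid: width * mean height
    return area

roc = [(0.0, 0.0), (0.0, 0.2), (0.4, 0.7), (1.0, 1.0)]
print(auc(roc))  # ~0.69 for this illustrative curve
```

For reference, the random-selection diagonal has an AUC of exactly 0.5, so 0.69 represents a modestly useful predictor.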

Example: Surviving the Titanic

The passenger list from the Titanic, annotated with survival information, has become a common data set for prediction and machine learning exercises.
Titanic survivorship by passenger class
The graph shows a ROC curve using passenger class as a predictor of survivorship. The circled 1 shows us where the first class passengers are as a subset of all passengers. They comprise a little more than 40% of the passengers who survived (vertical axis), but less than 20% of the ones who died. Notice that the slope of the first line segment is significantly higher than the random selection line in red.

If we guess that all first class passengers will survive, and further assume that all second class passengers will too, we end up at the 2 on the diagram. Notice that the slope of the segment from 1 to 2 is better than random selection (the red reference line), but not by that much. If we also include third class, we end up with all the passengers. Notice that the slope of the line from 2 to 3 is significantly less steep than the reference line--meaning the odds of surviving as a third-class passenger are worse than random selection from all passengers.
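The class-by-class points can be sketched the same way. The survivor and death shares below are read approximately off the figure described above; they are illustrative percentages, not exact counts from the Titanic dataset:

```python
# Approximate share of all survivors and all deaths in each passenger class.
# These values are illustrative, chosen to match the described figure.
survivor_share = [0.42, 0.28, 0.30]  # classes 1, 2, 3
death_share    = [0.18, 0.26, 0.56]  # classes 1, 2, 3

x = y = 0.0
points, slopes = [(0.0, 0.0)], []
for cls, (s, d) in enumerate(zip(survivor_share, death_share), start=1):
    x += d  # cumulative fraction of deaths captured
    y += s  # cumulative fraction of survivors captured
    points.append((round(x, 2), round(y, 2)))
    slopes.append(s / d)  # slope > 1 beats the random-selection diagonal
    print(f"through class {cls}: point {points[-1]}, segment slope {s / d:.2f}")
```

With these shares, the first-class segment is clearly steeper than the diagonal, the second-class segment only slightly so, and the third-class segment is well below it, just as the graph shows.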

Quiz

Test your understanding with the questions, which reference the figures below.
Answer the questions below

Questions:
  1. Which of these predictors has the highest probability of correct classification?
  2. Which has the lowest?
  3. Which of (a) and (d) would be preferred for a small intervention program?
Scroll down for answers.


























Answers:
  1. (c) has the curve that gets closest to the top-left perfect-prediction subset, and if you imagine the area being shaded under its green line, it's larger than for any of the others.
  2. (b) is indistinguishable from random selection.
  3. We prefer (d) because its curve initially goes straight up, meaning we can select a small treatment group with little error.

















