Suppose we want to predict which students will do poorly academically in a gateway course, where we define that by the student receiving C, D, or F. If we can predict that outcome far enough in advance, maybe we can do something to help. Note that this is a binary outcome (i.e. binomial), meaning there are only two possibilities.
The predictors we might consider are class attendance, use of tutoring services, prior grades, etc. I'll use the first two of these as illustrations. For example, back to the cups of the prior post, suppose one cup has the ID of a student who earned a C, D, or F, and the other cup has the ID of a student who earned an A or B. You don't know which cup is which, but you are told that the student on the left came to class all the time, and the one on the right did not. You'd probably guess that the one on the left is the A/B student. The probability of that being correct on average is what I called AUC.
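The cup game translates directly into a simulation. Here's a minimal sketch using made-up attendance distributions (the numbers are purely illustrative, not real student data): we repeatedly draw one student from each group and check how often the A/B student has the higher attendance. That frequency is an estimate of AUC.

```python
import random

random.seed(1)

# Hypothetical attendance rates: CDF students attend less on average.
# These distributions are invented for illustration.
cdf_attendance = [random.gauss(0.5, 0.2) for _ in range(1000)]  # C/D/F students
ab_attendance = [random.gauss(0.8, 0.2) for _ in range(1000)]   # A/B students

# AUC = probability that a randomly chosen A/B student has higher
# attendance than a randomly chosen C/D/F student (ties count half).
trials = 10_000
wins = 0.0
for _ in range(trials):
    c = random.choice(cdf_attendance)
    a = random.choice(ab_attendance)
    if a > c:
        wins += 1
    elif a == c:
        wins += 0.5

auc = wins / trials
print(round(auc, 2))
```

For these particular made-up distributions the estimate comes out around 0.85; a predictor no better than coin-flipping would hover near 0.5.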
You can find many explanations of ROC curves and AUC online, but I find the following twist on it easier to understand.
Imagine that we have all our historical students listed in a spreadsheet, along with all the information we need to know about each in order to assess predictors. We need to know their course grade, and we need to know whatever information is potentially predictive of that grade. The grade will separate students into the sought group (CDF) and the ones who don't need help (AB). Any subset of the students will have a percentage of CDF and a percentage of AB students. We can visualize all possible subsets of students by plotting these as y and x.
[Figure: Visualization of possible subsets of students]
For example, the top right corner of the box contains 100% (all the way to the top) of the CDF students, and 100% (all the way to the right) of the AB students. That subset is the set of all students. At the bottom left, we have 0% of each, or no students--the empty set. If we had a perfect predictor, it would allow us to find the top left "pure" CDF subset just from the predictive information.
[Figure: Randomly select half the students]

[Figure: All random selections are on the diagonal on average]
If we randomly chose 10% or 60% of the students instead of 50%, we'd still be on the diagonal, just lower to the left or higher to the right, depending on the percentage. Any random selection results on average in a subset represented by a point on the red line above. This is a handy reference line, because anything above that line means that the percent of CDF students is greater than random--this is what we want in a predictor.
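A quick sketch makes the diagonal claim concrete. Using a hypothetical roster (the 300/700 split is made up), we pick 40% of the students at random and check what fraction of each group landed in the selection:

```python
import random

random.seed(0)

# Hypothetical roster: 300 C/D/F students and 700 A/B students (made-up counts).
students = ["CDF"] * 300 + ["AB"] * 700

# Randomly select 40% of the roster, then measure what fraction of each
# group ended up in the selection.
picked = random.sample(students, k=400)
pct_cdf = picked.count("CDF") / 300  # y-coordinate
pct_ab = picked.count("AB") / 700    # x-coordinate
print(round(pct_cdf, 2), round(pct_ab, 2))  # both land near 0.40
```

Both fractions come out close to 0.40, i.e. a point near (0.4, 0.4) on the diagonal; larger rosters would land even closer.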
[Figure: Students who don't come to class]
Now consider our possible predictors. Assume for the sake of illustration that when we look at the group of students who rarely (or never) come to class, ALL of them are in the CDF category. There are not very many of these students, maybe just 20% of all the CDFs, but this information is good to know. The green X marks the spot on the diagram that represents the subset of students who rarely/never come to class. Notice that the line from the bottom left empty set goes straight up toward the set at the top left, which is just CDF and no AB students (the ideal set we could find with a perfect predictor).
[Figure: Adding information about students who don't use tutoring]
Notice that the slope of the line from the first X to the second is not vertical like the first line segment was. The new rule about tutoring is not nearly as good a predictor, because it's moving to the right, meaning that it includes AB students as well as CDF students. That's probably because some AB students don't need tutoring. However, the slope of the second line segment is still steeper than the red guessing line, meaning that the tutoring rule (no tutoring => CDF) is still better than random selection. The graph makes it visually easy to see this.
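The two-rule path described above can be sketched numerically. The coordinates below are the made-up figures from this example (attendance rule: 20% of CDF, 0% of AB; adding the tutoring rule: cumulatively 60% of CDF, 25% of AB, a hypothetical figure). The slope of each segment tells us whether that rule beats random guessing (slope 1), and the trapezoid rule gives the area under the whole path:

```python
# ROC-style path, as (x = fraction of AB students, y = fraction of CDF students).
# The middle two points are the illustrative attendance and tutoring rules.
points = [(0.0, 0.0), (0.0, 0.20), (0.25, 0.60), (1.0, 1.0)]

# Slope of each segment; anything steeper than 1 beats random selection.
for (x0, y0), (x1, y1) in zip(points, points[1:]):
    dx, dy = x1 - x0, y1 - y0
    slope = dy / dx if dx else float("inf")
    print(f"segment ({x0},{y0}) -> ({x1},{y1}): slope {slope:.2f}")

# Area under the path by the trapezoid rule.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(f"AUC = {auc:.3f}")
```

The attendance segment is vertical (infinite slope), the tutoring segment has slope 1.6 (still better than 1), and the final catch-all segment has slope below 1. The area works out to 0.700 for these numbers.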
In practice, after doing this analysis, we might decide to start with the students who aren't attending class, to see if we can solve that problem. It's inefficient to try to help students who don't need it, which makes the use of a predictor an economic decision, not just a mathematical one.
Graphs of this type are called Receiver Operating Characteristic (ROC) curves, for historical reasons, and AUC stands for "area under the curve." Next time, I'll explain why this area and the probability we started with are the same thing.
Example: Surviving the Titanic
The passenger list from the Titanic, annotated with survival information, has become a common data set for prediction and machine learning exercises.

[Figure: Titanic survivorship by passenger class]
If we guess that all first class passengers will survive, and further assume that all second class passengers will too, we end up at the 2 on the diagram. Notice that the slope of the segment from 1 to 2 is better than random selection (the red reference line), but not by that much. If we also include 3rd class, we end up with all the passengers. Notice that the slope of the line from 2 to 3 is significantly less steep than the reference line--meaning the odds of surviving as a 3rd class passenger are worse than random selection from all passengers.
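We can recompute the points on this curve directly from class-level counts. The counts below are quoted from memory from the widely used Kaggle training subset of the Titanic data, so treat them as illustrative and verify against your own copy. Here the "sought" group is the survivors (y-axis) and the x-axis accumulates non-survivors:

```python
# Approximate survivor/death counts by passenger class, from the commonly
# used Kaggle Titanic training subset (quoted from memory; illustrative only).
classes = [  # (passenger class, survivors, deaths)
    (1, 136, 80),
    (2, 87, 97),
    (3, 119, 372),
]
total_surv = sum(s for _, s, _ in classes)  # survivors overall
total_died = sum(d for _, _, d in classes)  # non-survivors overall

# Walk down the classes, accumulating survivors (y) and deaths (x),
# exactly as the curve in the figure does.
x = y = 0.0
print("point 0: (0.00, 0.00)")
for c, s, d in classes:
    y += s / total_surv
    x += d / total_died
    print(f"point {c}: ({x:.2f}, {y:.2f})")
```

With these counts, point 1 captures about 40% of survivors against only about 15% of deaths (a steep first segment), while the segment out to point 3 is much shallower than the diagonal, matching the observation about 3rd class above.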
Quiz
Test your understanding with the questions, which reference the figures below.

[Figure: Answer the questions below]
Questions:
- Which of these predictors has the highest probability of correct classification?
- Which has the lowest?
- Which of (a) and (d) would be preferred for a small intervention program?
Answers:
- (c) has a curve that gets closest to the top-left perfect-prediction subset, and if you imagine the area being shaded under the green line, it's larger than any of the others.
- (b) is indistinguishable from random selection.
- We prefer (d) because it initially goes straight up, meaning we can select a small treatment group with little error.