Sunday, October 27, 2019

Thinking in Predictors

There are some ideas that help us see the world more clearly. For example, the idea that a flipped coin has no memory of past events helps us realize that a recent string of heads does not increase the likelihood of tails (if the coin-flipping is done fairly). Thinking in predictors is one of those ideas that can make complex problems a little easier to think about.

This post introduces how to visualize predictive power in the simplest case, where the outcome we want to know about has only two values. Examples include predicting first-year student retention, student graduation, and whether admitted applicants enroll. Having only two outcomes means there are only two possible predictions in any particular case: we predict that the outcome is one or the other of the two possibilities.

Let's suppose that you want to predict which of the new first-year students will return for their second fall. Suppose for a moment that I have a crystal ball and know the answer for each of your students. I randomly choose one student who will be retained and one who will not, and put their ID numbers under two cups on the table. Your job is to guess which cup hides the retained student.

With no additional information to go on, it's easy to see that your chance of choosing the correct cup is 50%.

Figure: Guess which cup hides the retained student's identifier.

What if I tell you that the student under the right cup has a higher high school grade average than the student under the left cup? If your college is like mine, then guessing that the right cup holds the retained student raises your chances of being correct.

Figure: Now you have some information to use as a predictor.
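
The cup game is easy to simulate. Here is a rough sketch in Python that plays it many times, once guessing blindly and once guessing the cup with the higher high school GPA. The GPA values are invented for illustration (they are not my institution's data), and the function name `play_cup_game` and its parameters are just my own labels.

```python
import random

def play_cup_game(retained_scores, departed_scores, use_predictor,
                  rounds=20_000, seed=0):
    """Estimate how often we pick the retained student's cup."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(rounds):
        stay = rng.choice(retained_scores)   # score of a student who returns
        leave = rng.choice(departed_scores)  # score of a student who does not
        if use_predictor and stay != leave:
            # Guess the cup with the higher score.
            correct += stay > leave
        else:
            # No information (or a tie): flip a fair coin.
            correct += rng.random() < 0.5
    return correct / rounds

# Made-up high school GPAs, for illustration only (assumption, not real data).
data_rng = random.Random(1)
retained = [data_rng.gauss(3.4, 0.4) for _ in range(500)]
departed = [data_rng.gauss(3.1, 0.4) for _ in range(500)]

blind = play_cup_game(retained, departed, use_predictor=False)
informed = play_cup_game(retained, departed, use_predictor=True)
print(f"blind guess: {blind:.2f}, higher-GPA guess: {informed:.2f}")
```

With these invented numbers the blind guess should land near one half, and the higher-GPA guess should do noticeably better. How much better depends entirely on how far apart the two groups' grades are.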
We can quantify how much this new information increases the probability of a correct prediction. For a recent retention analysis, I calculated this probability for a few dozen possible success indicators. Here's part of that list, sorted from the strongest predictors down.

Table: List of predictors of two-year retention.

The first column is the name of the potential predictor variable. The next two columns give the predictive power for men and women, respectively, and the last column shows how complete the data are: the fraction of students for whom we have that information. In this case, we're predicting two-year retention, and the first variable on the list (Attend) is the retention status itself.

Knowing the Attend variable would be like having the answer before we have to choose a cup. Since the outcome variable predicts itself perfectly, the probability is 1 in both AUC columns.

The second row (GPA) is the student's college grade average. If we know that information for each of the two students with IDs under the cups, and if we guess that the student with the higher GPA is the one who sticks around for two years, we'd be correct 68% of the time for men and 69% of the time for women in the historical data. The data completeness rate is 99% because a few students leave before earning any grades. So as a measure of predictive power over two-year retention, grades have an AUC of about 0.68.
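
Instead of sampling, the guess-the-higher-GPA rule can be checked against every possible pair of one retained and one non-retained student. The sketch below does that directly. The name `pairwise_accuracy` is hypothetical, the toy data are made up, and counting a tie as half a correct guess (a coin flip) is my assumption, not something taken from the analysis above.

```python
def pairwise_accuracy(scores, outcomes):
    """Fraction of (retained, not retained) pairs in which the retained
    student has the higher score. Ties count as half a correct guess."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one student in each outcome group")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: college GPA and retention status (1 = retained).
gpa      = [3.8, 3.1, 2.4, 3.5, 3.3, 2.2]
retained = [1,   1,   0,   1,   0,   1]

print(pairwise_accuracy(gpa, retained))       # 5 of 8 pairs correct -> 0.625
print(pairwise_accuracy(retained, retained))  # outcome predicts itself -> 1.0
```

The second call mirrors the Attend row: using the outcome as its own predictor wins every pair. The double loop is fine for a toy example but grows with the product of the two group sizes, so a real data set would call for a rank-based shortcut or a library routine.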

One of the more interesting predictors on the list is TFS_Transfer. This is the response to a question on the HERI freshman survey that asks how likely the student is to transfer. It turns out to be a useful predictor for us, although we can link IDs to only 70% of the students.

Next time I'll explain why the columns with the probabilities are called AUCs and how you go about calculating them.





