## Sunday, January 27, 2013

### Finding Meaning in Data, Part II

In Part I, we took a look at a large data set from the perspective of trying to predict something. The example was artificial--there's no need to predict first generation status because it's easy to directly determine--but the survey items that link to that status tell us something about those students. So I'm using 'prediction' as a generic term to mean connections within a data set, and not necessarily chronologically.

But often we do want to make predictions of the future based on past information (inductive reasoning). I'll give an example below that expands on the first article by directly involving more than one predictor.

Here's a mock data set small enough to understand in full:

We want to predict graduation from the data we have at hand. Using the single-dimension predictors, the best one is Athlete status:

At this point, we could research this by interviewing students or others, or looking for more data. Or we can try to mine the available data further by involving more than one variable. In Part I, I used RapidMiner to do that for me, and that's a good option. Decision trees are particularly easy to understand and implement.

One of the key ideas in some of the standard models is conditional probability, meaning that if we restrict our attention to a subset of the whole data set, we may get better results for that subset. More discussion on the philosophy behind this will come in the next installment. For now, I'll use our sample data as an example.

Let's go 'meta', and ask about the predictors themselves, and in particular the Athlete flag, which accurately predicts 12 out of 16 cases now. Let's ask: can we predict when the Athlete flag works and when it doesn't? This is the same kind of problem that we're already working on, just asked at a higher level. Instead of asking which variables predict graduation, we want to ask which variables predict how well the Athlete flag works.

To do that, we can create an indicator column that has a 1 in it if the Athlete column accurately predicts graduation, and a zero if it doesn't. Here's what that looks like:

I've highlighted the cases where the Athlete flag == Graduation flag, and those cases have a 1 appearing in the new AthInd indicator column. Now we run the predictor algorithm to see what predicts success, and come up with:

We find that Gender is a good indicator of the success of Athlete as a predictor, and in particular, when Gender==0, it's a perfect predictor. So now we know that in this data set we can perfectly predict Graduation by using Athlete for all cases where Gender==0.

A simpler approach is to turn the problem around and imagine that the best 1-D predictor can become a conditional for another of the variables. In this case, we'd run our prediction process with the filter Athlete==1, and we find that Gender conditioned on athlete works just as well as the other way around: we can predict 100% of the graduates for Gender==0. This business of conditionals may seem murky, given this brief description. I will address it more fully in the next installment.

Real data is not as clean as the example. Picking up from the update in the first post, one of the indicators of high GPA as a college senior (in 1999) is RATE01_TFS--academic self-confidence as a freshman. If a student ranked himself/herself in the highest 10%, there's a 51% chance he/she will finish with (self-reported) A average grades. Using the easy method described above, we can condition on this case (RATE01_TFS==5) and see what the best predictors of that set are. Within this restricted set of cases, we find that the items below predict A students to the level shown:

• Took honors course: 74%
• Goal: Being very well off financially: not important: 70% (answered as a freshman)
• Goal: Being very well off financially: not important: 70% (answered as a senior)
• Never overslept to miss class or appointment: 69%
• Never failed to complete homework on time: 69%
• Very satisfied with amount of contact with faculty: 68%
• Less than an hour of week partying: 66%
• Self Rating: Drive to achieve (highest 10%): 65%
• Faculty Provide: Encouragement to pursue graduate/professional school: 64%
• Future Act: Be elected to an academic honor society: 64% (answered as a freshman)
• Goal: Being successful in a business of my own (not important): 63%

All of these improve on the initial accuracy of 51%, but at the cost of reducing the applicable pool of cases into smaller chunks.

With three questions from the freshman survey, we have found a way to correctly classify 79% of a subset of students into A/not A (lots of fine print here, including the assumption that they stay at the school long enough to become a senior, etc.). Here's the performance:

This is great. However, we have narrowed the scope from the original 5500 cases of A-students to about 800 by conditioning on the two items above (only one of the two had to be true: being well off financially being not important OR anticipating being elected to an academic honor society). However, this is not a bad thing--it gets us away from the notion that all A students are alike, and starts us on the path of discriminating different types. Note that to have confidence that we haven't "overfit" the data, the results need to be validated by testing the model against another year's data.