## Sunday, January 27, 2013

### Finding Meaning in Data, Part II

In Part I, we took a look at a large data set from the perspective of trying to predict something. The example was artificial--there's no need to predict first generation status because it's easy to directly determine--but the survey items that link to that status tell us something about those students. So I'm using 'prediction' as a generic term to mean connections within a data set, and not necessarily chronologically.

But often we do want to make predictions of the future based on past information (inductive reasoning). I'll give an example below that expands on the first article by directly involving more than one predictor.

Here's a mock data set small enough to understand in full:

We want to predict graduation from the data we have at hand. Using the single-dimension predictors, the best one is Athlete status:

At this point, we could research this by interviewing students or others, or looking for more data. Or we can try to mine the available data further by involving more than one variable. In Part I, I used RapidMiner to do that for me, and that's a good option. Decision trees are particularly easy to understand and implement.

One of the key ideas in some of the standard models is conditional probability, meaning that if we restrict our attention to a subset of the whole data set, we may get better results for that subset. More discussion on the philosophy behind this will come in the next installment. For now, I'll use our sample data as an example.

Let's go 'meta', and ask about the predictors themselves, and in particular the Athlete flag, which accurately predicts 12 out of 16 cases now. Let's ask: can we predict when the Athlete flag works and when it doesn't? This is the same kind of problem that we're already working on, just asked at a higher level. Instead of asking which variables predict graduation, we want to ask which variables predict how well the Athlete flag works.

To do that, we can create an indicator column that has a 1 in it if the Athlete column accurately predicts graduation, and a zero if it doesn't. Here's what that looks like:

I've highlighted the cases where the Athlete flag == Graduation flag, and those cases have a 1 appearing in the new AthInd indicator column. Now we run the predictor algorithm to see what predicts success, and come up with:

We find that Gender is a good indicator of the success of Athlete as a predictor, and in particular, when Gender==0, it's a perfect predictor. So now we know that in this data set we can perfectly predict Graduation by using Athlete for all cases where Gender==0.

A simpler approach is to turn the problem around and imagine that the best 1-D predictor can become a conditional for another of the variables. In this case, we'd run our prediction process with the filter Athlete==1, and we find that Gender conditioned on athlete works just as well as the other way around: we can predict 100% of the graduates for Gender==0. This business of conditionals may seem murky, given this brief description. I will address it more fully in the next installment.

Real data is not as clean as the example. Picking up from the update in the first post, one of the indicators of high GPA as a college senior (in 1999) is RATE01_TFS--academic self-confidence as a freshman. If a student ranked himself/herself in the highest 10%, there's a 51% chance he/she will finish with (self-reported) A average grades. Using the easy method described above, we can condition on this case (RATE01_TFS==5) and see what the best predictors of that set are. Within this restricted set of cases, we find that the items below predict A students to the level shown:

• Took honors course: 74%
• Goal: Being very well off financially: not important: 70% (answered as a freshman)
• Goal: Being very well off financially: not important: 70% (answered as a senior)
• Never overslept to miss class or appointment: 69%
• Never failed to complete homework on time: 69%
• Very satisfied with amount of contact with faculty: 68%
• Less than an hour of week partying: 66%
• Self Rating: Drive to achieve (highest 10%): 65%
• Faculty Provide: Encouragement to pursue graduate/professional school: 64%
• Future Act: Be elected to an academic honor society: 64% (answered as a freshman)
• Goal: Being successful in a business of my own (not important): 63%

All of these improve on the initial accuracy of 51%, but at the cost of reducing the applicable pool of cases into smaller chunks.

With three questions from the freshman survey, we have found a way to correctly classify 79% of a subset of students into A/not A (lots of fine print here, including the assumption that they stay at the school long enough to become a senior, etc.). Here's the performance:

This is great. However, we have narrowed the scope from the original 5500 cases of A-students to about 800 by conditioning on the two items above (only one of the two had to be true: being well off financially being not important OR anticipating being elected to an academic honor society). However, this is not a bad thing--it gets us away from the notion that all A students are alike, and starts us on the path of discriminating different types. Note that to have confidence that we haven't "overfit" the data, the results need to be validated by testing the model against another year's data.

## Saturday, January 26, 2013

### Finding Meaning in Data, Part I

Large data sets can be heartbreaking. You suspect that there's something in there--something really useful--maybe even a "knock it out of the park" bit of wisdom, but finding it is often not trivial. The Scylla and Charybdis in this classical tale are (*) wasting an inordinate amount of time to find nothing, or (@) trying too hard and bestowing confidence on something meaningless. The perils are everywhere. At one point, I had myself convinced I'd found a powerful predictor of attrition: students who didn't have financial aid packages. After chasing down that rabbit hole for too long, I discovered that the connection was real, but in the wrong direction: students who left had their packages scrubbed from the system (an argument for data warehousing).

The only solution, I think, is to make it easier to find interesting connections, and easier to know if they are real or spurious. To that end, I've strung together a Rube Goldberg contraption that looks for interestingness and reports it in rank order, interestingest on top. There's quite a bit to say about the hows and whys, but let me start with an example.

Yesterday I happened across an article about the disadvantages first-generation students face in college:
Given ... the ways in which students’ social class backgrounds shape their motives for attending college, we questioned whether universities provide students from these different backgrounds with an equal chance of success. — Assistant Professor Nicole Stephens
And if you're interested in that, there's a recent article in the Atlantic "Why Smart Poor Students Don't Apply to Selective Colleges [...]"
[T]he majority of these smart poor students don't apply to any selective college or university, according to a new paper by Caroline M. Hoxby and Christopher Avery -- even though the most selective schools would actually cost them less, after counting financial aid. Poor students with practically the same grades as their richer classmates are 75 percent less likely to apply to selective colleges.
Now 'poor' and 'first generation' are not the same thing, but they overlap substantially. We can test that by looking at some data (albeit old data).

The nice folks at HERI allow access to old survey data without much fuss, and I downloaded the Senior Survey results from their data archive to use for a demonstration. The survey is high quality and there are lots of samples. I limited myself to results from 1999, about 40K cases, of which somewhat more than half are flagged as not being in the 2-year college comparison group.

This is an item on the survey:

FIRSTGEN_TFS First generation status based on parent(s) with less than 'some college'
• (1) No
• (2) Yes
Wouldn't it be interesting to know how the other items on the survey relate to this one? My old approach would have been to load it up in SPSS and create a massive correlation table. Later I figured out how to graph and filter the results, but there are problems with this approach that I'll cover later.

Now, within about a minute I have a full report of what significant links there are, answering a specific question: if I had the whole survey except for this FIRSTGEN_TFS column, how well could I predict who was first generation? Here's the answer.
1. (.95) Father's education level is lower
2. (.94) Mother's educational level is lower
3. (.63) Distance from school to home is lower
4. (.63) Get more PELL grant money
5. (.61) More concerns about financing college education
6. (.60) Less likely to communicate via email with family
7. (.60) Financial assistance was more important in choice to attend
8. (.59) Don't use computers as often
9. (.59) Lower self-evaluation of academic ability as a Freshman (less true as a senior)
10. (.58) Wanted to stay near home
11. (.57) Are more likely to work full time while attending
12. (.57) Are less likely to discuss politics
13. (.57) Are less likely to have been a guest at a professor's home
14. (.57) Spend more time commuting
15. (.57) Are more likely to speak a language other than English at home.
16. (.57) Evaluate their own writing skills lower
17. (.57) Low tuition was more important in choice to attend
18. (.56) Have a goal of being very well off financially.
This list goes on for pages. The numbers in parentheses are a measure of predictive power, the AUC which will be described in a bit. Obviously mother and father's educational level are linked to the definition of first-generation student, but the others are not.

We can see in this list a clear concern, even back in 1999, about finances. There's also a hint that these students are not as savvy consumers as their higher-SES peers: the desire to stay close to home, for example. We could follow up by explicitly chasing ideas. What's the profile of high-GPA first generation students? Does gender make a difference? And so on. These are easily done until the point where we don't have much data left (you can only disaggregate so far).

Rather than doing that, let me show the details of this automated survey analysis. Each of these items on the list above comes with a full report, one of which is reproduced below.

This is a lot to take in. For the moment, focus on the table near the bottom called "Predictor Performance." If we just used the response to this item about the likelihood of having to get a job to pay for college, the predictor works by classifying anyone who responds with (4) Very good chance or (3) Some chance as a likely first-generation student. This would result in correct classification of 86% of the first generation students (true positives), diluted by 89% (from 1-.18) of the non-first-generation students (false positives). This is not a great predictor by itself, but it's useful information to know when recruiting and advising these students.

The value of a classifier like this is sometimes rated by the area under the curve (AUC) when drawing true positive rate versus false positive rate--a so-called ROC curve. That's the left-most graph above the table we were just looking at. The next one to the right shows where the two rates meet (slightly to the left of 3), which defines the cut-off for the classifier. The histogram to the right of that shows the frequency of responses, with blue being the first-gen-students and red the others. The final graph in that series gives the fraction of first-gen students who responded with 4,3,2,1 for this item, so if we chose a student who responded (4) to this item, there's a 19% chance that they were first generation, as opposed to a 9% chance if they responded (1).

Note that even though this is not a great predictor, we can have some confidence that it's meaningful because of the way the responses line up: 4,3,2,1. These are not required to be in order by the program that produces the report. Rather, the order is determined by the usefulness of the response as a predictor, sorted from highest to lowest. As a result, the In-Class fraction graph always shows higher values on the left.

The table at the very top shows the true positive and false positive rates (here, In-class means the group we have selected to contrast: students who checked the first-gen box in this case). The number of samples in each case is given underneath, and a one-tailed p-value for the difference of proportions is shown at the bottom, color coded for easy identification of anything less than .05.

We can take the whole list of good predictors and generate a table of predictions from each. These are each one-dimension, like the question above about the likelihood of needing a job while in college. We can see how these overlap by mapping out their correlations. That's what the graph below shows (down to correlation +/-.40).

The good news is that there are no negative correlations (predictors that contradict one another). The items cluster in fairly obvious ways and suggest how we might pare down the list for further consideration.

In order to create multi-dimensional models, we can try using standard machine learning algorithms using RapidMiner or Weka, which are free. An example using the former is shown below, employing a Bayesian learning algorithm to try to sort out the difference between first generation students and the others using the one-dimensional predictions already generated.

The result is a classifier that has usable accuracy. In the table below, first generation students (as self-reported on the survey) are coded as 1.

The model correctly classifies 15,219 non-first-gen students and 2,300 first gen students, and incorrectly classifies a total of 6,259 cases. This is better performance than any of the single predictors we found (leaving out parental education level, which is hardly fair to use). The AUC for this model is .76.

Note that this is only part of the work required. Ideally we create the model using one set of data, and test it with another. For example, we could use 1998 data to predict 1999 data.

This general approach can be used to predict enrollment, retention, academic success, and anything else you have good data on. There's much more to this, which is why the title has "Part 1" in it. Stay tuned...

Update: As another example, I contrasted students who report having an A average in their senior year to those who don't (for 4-year programs), and eliminated those in the B+/A- range to make the distinction less fuzzy. You can see the whole report on one-dimensional predictors here.

You'll see things you expect, like high academic confidence, and good study habits (not missing class, turning homework in on time), but there's also a clear link to less drinking and partying, intellectual curiosity (e.g. taking interdisciplinary courses), the role of professors as mentors and role models (e.g. being a guest in a professor's home), a trend toward helping others through tutoring and service learning, but the one that surprised me the most is shown below. It's an item that is asked on the freshman survey and repeated on this senior survey. They give virtually identical results:

This is matched by a question about having a successful business, with similar results.  Both show a clear tendency of the top grade earners to not be as motivated by money as students who get Bs or below (remember, this is 1999).

Continued in Part II