Saturday, January 26, 2013

Finding Meaning in Data, Part I

Large data sets can be heartbreaking. You suspect that there's something in there--something really useful--maybe even a "knock it out of the park" bit of wisdom, but finding it is often not trivial. The Scylla and Charybdis in this classical tale are (1) wasting an inordinate amount of time to find nothing, or (2) trying too hard and bestowing confidence on something meaningless. The perils are everywhere. At one point, I had myself convinced I'd found a powerful predictor of attrition: students who didn't have financial aid packages. After chasing down that rabbit hole for too long, I discovered that the connection was real, but in the wrong direction: students who left had their packages scrubbed from the system (an argument for data warehousing).

The only solution, I think, is to make it easier to find interesting connections, and easier to know if they are real or spurious. To that end, I've strung together a Rube Goldberg contraption that looks for interestingness and reports it in rank order, interestingest on top. There's quite a bit to say about the hows and whys, but let me start with an example.

Yesterday I happened across an article about the disadvantages first-generation students face in college:
Given ... the ways in which students’ social class backgrounds shape their motives for attending college, we questioned whether universities provide students from these different backgrounds with an equal chance of success. — Assistant Professor Nicole Stephens
And if you're interested in that, there's a recent article in The Atlantic, "Why Smart Poor Students Don't Apply to Selective Colleges [...]":
[T]he majority of these smart poor students don't apply to any selective college or university, according to a new paper by Caroline M. Hoxby and Christopher Avery -- even though the most selective schools would actually cost them less, after counting financial aid. Poor students with practically the same grades as their richer classmates are 75 percent less likely to apply to selective colleges.
Now 'poor' and 'first generation' are not the same thing, but they overlap substantially. We can test that by looking at some data (albeit old data).

The nice folks at HERI allow access to old survey data without much fuss, and I downloaded the Senior Survey results from their data archive to use for a demonstration. The survey is high quality and there are lots of samples. I limited myself to results from 1999, about 40K cases, of which somewhat more than half are flagged as not being in the 2-year college comparison group.

This is an item on the survey:

FIRSTGEN_TFS First generation status based on parent(s) with less than 'some college'
  • (1) No
  • (2) Yes
Wouldn't it be interesting to know how the other items on the survey relate to this one? My old approach would have been to load it up in SPSS and create a massive correlation table. Later I figured out how to graph and filter the results, but there are problems with this approach that I'll cover later.

Now, within about a minute I have a full report of what significant links there are, answering a specific question: if I had the whole survey except for this FIRSTGEN_TFS column, how well could I predict who was first generation? Here's the answer.
  1. (.95) Father's education level is lower
  2. (.94) Mother's educational level is lower
  3. (.63) Distance from school to home is lower
  4. (.63) Get more PELL grant money
  5. (.61) More concerns about financing college education
  6. (.60) Less likely to communicate via email with family
  7. (.60) Financial assistance was more important in choice to attend
  8. (.59) Don't use computers as often
  9. (.59) Lower self-evaluation of academic ability as a Freshman (less true as a senior)
  10. (.58) Wanted to stay near home
  11. (.57) Are more likely to work full time while attending
  12. (.57) Are less likely to discuss politics
  13. (.57) Are less likely to have been a guest at a professor's home
  14. (.57) Spend more time commuting
  15. (.57) Are more likely to speak a language other than English at home.
  16. (.57) Evaluate their own writing skills lower
  17. (.57) Low tuition was more important in choice to attend
  18. (.56) Have a goal of being very well off financially.
  19. (.56) Had lower (self-reported) college grade average
This list goes on for pages. The numbers in parentheses are a measure of predictive power, the AUC, which will be described in a bit. Obviously, mother's and father's education levels are linked to the definition of first-generation student, but the others are not.
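The screening step behind a list like this can be sketched in a few lines. This is not the actual program, just a minimal illustration of the idea, assuming scikit-learn and made-up item names: score each survey item by its single-item AUC against the first-generation flag, flip anything below .5, and sort.

```python
# A minimal sketch of the screening idea (not the actual report code):
# score every survey item by how well it alone separates the flagged
# group from the rest, using AUC, then sort from best to worst.
# Item names here are made up for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_items(items, in_class):
    """items: {name: numeric responses}; in_class: 0/1 labels."""
    y = np.asarray(in_class)
    scores = {}
    for name, responses in items.items():
        auc = roc_auc_score(y, np.asarray(responses, dtype=float))
        # An AUC below .5 just means the item predicts in reverse,
        # so flip it and let the report describe the direction.
        scores[name] = max(auc, 1.0 - auc)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Sorting the flipped AUCs from highest to lowest reproduces the shape of the list above: strongest single-item predictors on top.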

We can see in this list a clear concern, even back in 1999, about finances. There's also a hint that these students are not as savvy consumers as their higher-SES peers: the desire to stay close to home, for example. We could follow up by explicitly chasing ideas. What's the profile of high-GPA first generation students? Does gender make a difference? And so on. These are easily done until the point where we don't have much data left (you can only disaggregate so far).

Rather than doing that, let me show the details of this automated survey analysis. Each of these items on the list above comes with a full report, one of which is reproduced below.

This is a lot to take in. For the moment, focus on the table near the bottom called "Predictor Performance." If we just used the response to this item about the likelihood of having to get a job to pay for college, the predictor works by classifying anyone who responds with (4) Very good chance or (3) Some chance as a likely first-generation student. This would correctly classify 86% of the first-generation students (true positives), diluted by 82% (from 1-.18) of the non-first-generation students (false positives). This is not a great predictor by itself, but it's useful information to have when recruiting and advising these students.

The value of a classifier like this is sometimes rated by the area under the curve (AUC) when drawing the true positive rate versus the false positive rate--a so-called ROC curve. That's the left-most graph above the table we were just looking at. The next one to the right shows where the two rates meet (slightly to the left of 3), which defines the cut-off for the classifier. The histogram to the right of that shows the frequency of responses, with blue being the first-gen students and red the others. The final graph in that series gives the fraction of first-gen students among those who responded with 4, 3, 2, or 1 for this item, so if we chose a student who responded (4), there's a 19% chance that they were first generation, as opposed to a 9% chance if they responded (1).
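The cut-off rule and the ROC curve it traces can be tabulated directly. A minimal sketch (again, an illustration, not the report's actual code): for an ordinal item, evaluate the rule "predict in-class if response >= c" at every cutoff c, then integrate TPR over FPR with the trapezoid rule to get the AUC.

```python
# Sketch of the cutoff classifier behind the ROC curve. Each cutoff c
# yields one (TPR, FPR) point; sweeping c from high to low walks the
# curve from the origin toward (1, 1).
import numpy as np

def roc_points(responses, labels):
    r, y = np.asarray(responses), np.asarray(labels)
    pts = []
    for cut in sorted(set(r.tolist()), reverse=True):
        pred = r >= cut
        tpr = (pred & (y == 1)).sum() / (y == 1).sum()
        fpr = (pred & (y == 0)).sum() / (y == 0).sum()
        pts.append((cut, float(tpr), float(fpr)))
    return pts

def auc_from_points(pts):
    # trapezoid rule over (FPR, TPR), anchored at the origin
    xs = [0.0] + [p[2] for p in pts]
    ys = [0.0] + [p[1] for p in pts]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))
```

An AUC of .5 is a coin flip; 1.0 is perfect separation, which is why the list above tops out near .95 only for the items that nearly define first-generation status.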

Note that even though this is not a great predictor, we can have some confidence that it's meaningful because of the way the responses line up: 4,3,2,1. These are not required to be in order by the program that produces the report. Rather, the order is determined by the usefulness of the response as a predictor, sorted from highest to lowest. As a result, the In-Class fraction graph always shows higher values on the left.

The table at the very top shows the true positive and false positive rates (here, In-class means the group we have selected to contrast: students who checked the first-gen box in this case). The number of samples in each case is given underneath, and a one-tailed p-value for the difference of proportions is shown at the bottom, color coded for easy identification of anything less than .05.
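A one-tailed p-value for a difference of proportions like this can be computed with a pooled z-test. The function below is a plausible reconstruction of that calculation (not the report's actual code), using only the Python standard library.

```python
# Pooled, one-tailed z-test for the difference of two proportions:
# how likely is a gap this large between the in-class and out-of-class
# response rates if there were really no difference?
from math import sqrt, erf

def one_tailed_prop_test(k1, n1, k2, n2):
    """k1/n1 and k2/n2 are the two observed proportions."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                  # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 0.5 * (1.0 - erf(z / sqrt(2)))      # upper tail of normal CDF
```

Anything this returns below .05 would get the color coding described above.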

You can download the whole report here.

We can take the whole list of good predictors and generate a table of predictions from each. These are each one-dimensional, like the question above about the likelihood of needing a job while in college. We can see how they overlap by mapping out their correlations. That's what the graph below shows (down to correlations of +/-.40).

The good news is that there are no negative correlations (predictors that contradict one another). The items cluster in fairly obvious ways and suggest how we might pare down the list for further consideration.
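The pruning step can be sketched as: compute the correlation matrix of the prediction columns and report only the pairs at or beyond the +/-.40 threshold. Names and data below are placeholders, not survey items.

```python
# Sketch: correlate the one-dimensional prediction columns with each
# other and keep only the strongly related pairs, which is what the
# clustering graph is drawn from.
import numpy as np

def strong_pairs(preds, names, threshold=0.40):
    """preds: 2-D array, one column per predictor; returns (a, b, r)."""
    r = np.corrcoef(preds, rowvar=False)
    out = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                out.append((names[i], names[j], round(float(r[i, j]), 2)))
    return out
```

A negative r surviving the threshold would be the contradiction case mentioned above; here there were none.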

To create multi-dimensional models, we can try standard machine learning algorithms in RapidMiner or Weka, both of which are free. An example using the former is shown below, employing a Bayesian learning algorithm to sort out the difference between first-generation students and the others using the one-dimensional predictions already generated.

The result is a classifier that has usable accuracy. In the table below, first generation students (as self-reported on the survey) are coded as 1.

The model correctly classifies 15,219 non-first-gen students and 2,300 first gen students, and incorrectly classifies a total of 6,259 cases. This is better performance than any of the single predictors we found (leaving out parental education level, which is hardly fair to use). The AUC for this model is .76.
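For readers who prefer code to a GUI, the same kind of model can be sketched in Python with scikit-learn's naive Bayes. The data here are synthetic stand-ins, not the HERI survey, so the numbers won't match the table above; the point is only the shape of the workflow.

```python
# Naive Bayes over a set of one-dimensional predictors, as a stand-in
# for the RapidMiner step. Two weak predictors are shifted upward for
# the in-class group to mimic items like the ones found above.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.25).astype(int)        # ~25% flagged "first gen"
X = np.column_stack([
    rng.normal(0.6 * y, 1.0),                 # weak predictor 1
    rng.normal(0.4 * y, 1.0),                 # weak predictor 2
])
model = GaussianNB().fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
```

Combining several weak one-dimensional predictors this way typically yields an AUC better than any of them alone, which is the same effect the RapidMiner model shows.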

Note that this is only part of the work required. Ideally we create the model using one set of data, and test it with another. For example, we could use 1998 data to predict 1999 data.
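That held-out-year check is easy to sketch: fit on one cohort and score on the next. The cohorts below are synthetic stand-ins for the survey years, not the actual 1998 and 1999 data.

```python
# Fit on one cohort, evaluate on another, so the AUC reflects
# generalization rather than memorization of the training year.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def fit_and_holdout(X_train, y_train, X_test, y_test):
    model = GaussianNB().fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

def make_cohort(rng, n=1500):
    # synthetic stand-in for one survey year
    y = (rng.random(n) < 0.25).astype(int)
    X = np.column_stack([rng.normal(0.7 * y, 1.0),
                         rng.normal(0.7 * y, 1.0)])
    return X, y
```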

This general approach can be used to predict enrollment, retention, academic success, and anything else you have good data on. There's much more to this, which is why the title has "Part I" in it. Stay tuned...

Update: As another example, I contrasted students who report having an A average in their senior year to those who don't (for 4-year programs), and eliminated those in the B+/A- range to make the distinction less fuzzy. You can see the whole report on one-dimensional predictors here.

You'll see things you expect, like high academic confidence and good study habits (not missing class, turning homework in on time). But there's also a clear link to less drinking and partying, to intellectual curiosity (e.g. taking interdisciplinary courses), to professors serving as mentors and role models (e.g. being a guest in a professor's home), and to a trend toward helping others through tutoring and service learning. The one that surprised me the most is shown below. It's an item that is asked on the freshman survey and repeated on this senior survey, and the two give virtually identical results:

This is matched by a question about having a successful business, with similar results.  Both show a clear tendency of the top grade earners to not be as motivated by money as students who get Bs or below (remember, this is 1999).

Continued in Part II
