Monday, July 14, 2014

Finding and Using Predictors of Student Attrition

A while back, I wrote about finding meaning in data, and this has turned into a productive project. In this article I'll describe some findings on causes of student attrition, a conceptual framework of actions to prevent attrition, and an outline of the methods we use.

In order to find predictors of attrition, we need to find information that was gathered before the student left. A different approach is to ask the student why he or she is leaving during the withdrawal process, but I won't talk about that here. We use HERI's The Freshman Survey each fall and get a high response rate (the new students are all in a room together). Combining this with the kinds of information gathered during the admissions process gives several hundred individual pieces of information. These data rows are 'labeled' per student with the binary variables for attrition (first semester, second semester, and so on). Across several years of data from two different liberal arts colleges, we get the same kinds of predictors of attrition:

  • Low social engagement
  • High financial need
  • Poor academics
  • Psychology that leads to attrition: initial intent to transfer, extreme homesickness, and so on.
These can occur in various combinations. A student who is financially stressed and working two jobs off campus may find it hard to keep up her grades. At the AIR Forum (an institutional research conference) this June, we saw a poster from another liberal arts college that identified the same four categories. They used different methods, and we have a meeting set with them to compare notes.
For purposes of forming actions, we assume that these predictive conditions are causal. There's no way to prove that without randomized experiments, which are impossible, and doing nothing is not an option. In order to match up putative causes with actions, we relied on Vincent Tinto's latest book Completing College: Rethinking Institutional Action, taking his categories of action and cross-indexing them with our causes. Then we annotated it with the existing and proposed actions we were considering. The table below shows that conceptual framework for action.
Each letter designates some action. The ones at the bottom are actions related to getting better information. An example of how this approach generates thoughtful action is given next.

Social Engagement may happen through students attending club meetings, having work-study, playing sports, taking a class at the fitness center, and so on. This can be hard to track. This led us to consider adopting a software product that would do two things: (1) help students more easily find social activities to engage with, and (2) help us better track participation. As it turned out, there was a company in town that does exactly this, called Check I'm Here. We had them over for a demo, and then I went to their place to chat with Reuben Pressman, the CEO and founder. I was very impressed with the vision and passion of Reuben and his team. You can click through the link to their web site for a full rundown of features, but here's a quote from Reuben:

The philosophy is based around a continuous process to Manage, Track, Assess, & Engage students & organizations. It flows a lot like the MVP idea behind the book "The Lean Startup" that talks about a process of trying something, seeing how it goes, getting feedback, and making it better, then starting over again. We think of our engagement philosophy the same way:
  • Manage -- Organize and structure your organizations, events, and access to the platform.
  • Track -- Collect data in real-time and verify students live with mobile devices
  • Assess -- Integrate newly collected data and structurally combine it with existing data to give real-time assessment of what works and doesn't and what kinds of students are involved
  • Engage -- Use your new information to make educated decisions and use our tools for web and mobile to attract students in new ways
  • Rinse, and Repeat for more success!
A blog post talking more about our tracking directly is here. We take a focus on Assessing Involvement, Increasing Engagement, Retaining Students, and Successfully Allocating Funding.
Currently, we can get card-swipe counts for our fitness center, because access to it is controlled for security reasons. An analysis of the data gives some indication (this is not definitive) that students who use the fitness center more are retained at a higher rate than those who use it less or not at all. The effect shows up after about a year and amounts to about three percentage points in retention. The ability to capture student attendance at club events, academic lectures, and so on with an easy portable card-swipe system like Check I'm Here is very attractive. It also helps these things happen--students can check an app on their phones to see what's coming up, and register their interest in participating.


I put the methods last, because not everyone wants to know about the statistics. At the AIR Forum, I gave a talk on this general topic, which was recorded and is available through the organization. I think you might have to pay for access, though.

The problem of finding which variables matter among hundreds of potential ones is sometimes solved with step-wise linear regression, but in my experience this is problematic. For one thing, it assumes that relationships are linear, when they might well not be. Suppose the students who leave are those with the lowest and the highest grades. A U-shaped pattern like that wouldn't show up in a linear model, because the two ends cancel each other out. I suppose you could multiply the variables together pairwise to get non-linear terms, but now you've got tens of thousands of variables instead of hundreds.
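To make the grades example concrete, here is a minimal sketch in Python. The data are made up for illustration (GPAs from 0.0 to 4.0, with students leaving only at the two extremes), and `pearson` and `band` are hypothetical helper names, not part of the software described in this post. It compares the linear correlation with a simple three-band breakdown of the same data.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def band(gpa):
    """Hypothetical cut points: below 1.0 is low, above 3.0 is high."""
    if gpa < 1.0:
        return "low"
    if gpa > 3.0:
        return "high"
    return "middle"

# Made-up data: one student at each GPA from 0.0 to 4.0 in steps of 0.1.
gpas = [i / 10 for i in range(41)]
# Assumption for illustration: only the lowest and highest bands leave.
left = [1 if band(g) != "middle" else 0 for g in gpas]

r = pearson(gpas, left)

# Attrition rate within each band.
rates = {}
for name in ("low", "middle", "high"):
    outcomes = [y for g, y in zip(gpas, left) if band(g) == name]
    rates[name] = mean(outcomes)

print("linear correlation:", round(r, 6))
print("attrition rate by band:", rates)
```

Because the pattern is symmetric, the correlation comes out at essentially zero, while the banded rates show attrition concentrated at both ends. That is the reason for preferring binned, assumption-free comparisons over a linear fit at the screening stage.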

There are more sophisticated methods available now, like lasso, but they aren't attractive for what I want. As far as I can tell, they assume linearity too. Anyway, there's a very simple solution that doesn't assume anything. I began developing the software to quickly implement it two years ago, and you can see an early version here.

I've expanded that software to create a nice workflow that looks like this:
  1. Identify what you care about (e.g. retention, grades) and create a binary variable out of it per student.
  2. Accumulate all the other variables we have at our disposal that might be predictors of the one we care about. I put these in a large CSV file that may have 500 columns.
  3. Normalize scalar data to (usually) quartiles, and truncate nominal data (e.g. state abbreviations) by keeping only the most frequent ones and calling the rest 'other'. NAs can be included or not.
  4. Look at a map of correlates within these variables, to see if there is structure we'd expect (SAT should correlate with grades, for example)
  5. Run the univariate predictor algorithm against the one we care about, and rank these best to worst. This usually takes less than a minute to set up and run.
  6. Choose a few (1-6 or so) of the best predictors and see how they perform pairwise. This means taking them two at a time to see how much the predictive power improves when both are considered. 
  7. Take the best ones that seem independent and combine them in a linear model if that seems appropriate (the variables need to act like linear relationships). Cross-validate the model by generating it on half the data and testing against the other half; do this 100 times and plot the distribution of predictive power.
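Steps 3, 5, and 6 can be sketched in a few lines of Python. This is not the software described in this post; it is a rough illustration under stated assumptions. In particular, the post does not spell out the univariate predictor algorithm, so the sketch assumes one plausible version: score each student by the retention rate of the bin they fall in, and measure predictive power as the area under the ROC curve (AUC). All function names and the eight-student data set are made up.

```python
from collections import Counter

def quartile_bins(values):
    """Step 3: map numbers to quartile labels 1..4; None becomes 'NA'."""
    ordered = sorted(v for v in values if v is not None)
    n = len(ordered)
    cuts = [ordered[(n * k) // 4] for k in (1, 2, 3)]
    return ["NA" if v is None else 1 + sum(v >= c for c in cuts)
            for v in values]

def keep_top(values, k):
    """Step 3: keep the k most frequent categories, call the rest 'other'."""
    common = {v for v, _ in Counter(values).most_common(k)}
    return [v if v in common else "other" for v in values]

def auc(scores, labels):
    """Chance that a random 1 outscores a random 0 (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def univariate_auc(categories, labels):
    """Assumed scoring rule: each student gets their bin's outcome rate."""
    totals, hits = Counter(), Counter()
    for c, y in zip(categories, labels):
        totals[c] += 1
        hits[c] += y
    scores = [hits[c] / totals[c] for c in categories]
    return auc(scores, labels)

def rank_predictors(columns, labels):
    """Step 5: rank every column from best to worst by AUC."""
    ranked = [(name, univariate_auc(cats, labels))
              for name, cats in columns.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

def pair_auc(a, b, labels):
    """Step 6: treat each combination of two bins as one category."""
    return univariate_auc(list(zip(a, b)), labels)

# Made-up data for eight students (1 = retained, 0 = left).
hs_gpa = [2.1, 2.4, 2.8, 3.0, 3.2, 3.5, 3.7, 3.9]
state = ["FL", "FL", "FL", "GA", "FL", "NY", "GA", "TX"]
retained = [0, 0, 0, 1, 1, 1, 1, 1]

columns = {
    "state": keep_top(state, 2),
    "hs_gpa_quartile": quartile_bins(hs_gpa),
}
ranking = rank_predictors(columns, retained)
pair = pair_auc(columns["hs_gpa_quartile"], columns["state"], retained)

for name, value in ranking:
    print(name, round(value, 3))
print("pairwise:", round(pair, 3))
```

These scores are computed on the same rows used to learn the bin rates, so on a tiny toy set they are inflated, and the pairwise score especially so because almost every pair of bins holds a single student. That is the reason for the cross-validation in step 7.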
Once the data set is compiled, all of this takes no more than four or five minutes. A couple of sample ROC curves and AUC histograms are shown below.

This predictor is a four-variable model for a largish liberal arts college I worked with. It predicts student retention based on grades, finances, social engagement, and intent to transfer (as asked by the entering freshman CIRP survey).
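The split-half cross-validation from step 7 can be sketched as follows. Several things here are assumptions for illustration only: the data are simulated, the variable names `gpa_q` and `need_q` are invented, and in place of a fitted linear model the sketch uses a simple stand-in that adds up the per-bin retention rates learned from the training half. It prints summary statistics of the 100 AUC values rather than plotting a histogram.

```python
import random
from collections import Counter
from statistics import mean, median

def auc(scores, labels):
    """Chance that a random 1 outscores a random 0 (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_bin_rates(rows, labels):
    """Learn the retention rate for every (variable, bin) pair."""
    totals, hits = Counter(), Counter()
    for row, y in zip(rows, labels):
        for key in row.items():
            totals[key] += 1
            hits[key] += y
    return {key: hits[key] / totals[key] for key in totals}

def score(row, rates, default):
    """Stand-in model: sum of bin rates; unseen bins get the base rate."""
    return sum(rates.get(key, default) for key in row.items())

def split_half_aucs(rows, labels, repeats=100, seed=0):
    """Fit on a random half, test on the other half, repeat."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    half = len(idx) // 2
    aucs = []
    for _ in range(repeats):
        rng.shuffle(idx)
        train, test = idx[:half], idx[half:]
        y_test = [labels[i] for i in test]
        if len(set(y_test)) < 2:
            continue  # AUC is undefined when only one outcome is present
        rates = fit_bin_rates([rows[i] for i in train],
                              [labels[i] for i in train])
        base = mean(labels[i] for i in train)
        scores = [score(rows[i], rates, base) for i in test]
        aucs.append(auc(scores, y_test))
    return aucs

# Simulated students: retention is more likely with higher GPA quartile
# and less likely with higher financial-need quartile (made-up effect sizes).
rng = random.Random(1)
rows, labels = [], []
for _ in range(400):
    gpa_q, need_q = rng.randint(1, 4), rng.randint(1, 4)
    p_retain = 0.3 + 0.15 * gpa_q - 0.05 * need_q
    rows.append({"gpa_q": gpa_q, "need_q": need_q})
    labels.append(1 if rng.random() < p_retain else 0)

aucs = split_half_aucs(rows, labels)
print("runs:", len(aucs))
print("min / median / max AUC:",
      round(min(aucs), 3), round(median(aucs), 3), round(max(aucs), 3))
```

The spread of the held-out AUC values is what matters: a predictor that only looks good by accident will produce a distribution centered near 0.5, while a resilient one stays well above it across the random splits.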

At the end of this process, we can have some confidence in what predictors are resilient (not merely accidental), and how they work in combination with each other. The idea for me is not to try to predict individual students, but to understand the causes in such a way that we can treat them systematically. The example with social engagement is one such instance. Ideally, the actions taken are natural ones that make sense and have a good chance of improving the campus. 
