Tuesday, July 29, 2014

OK Trends

If you're not familiar with the OKCupid blog, check it out here. Christian Rudder slices and dices data from the dating site to try to reveal human nature. I find it great fun to follow along with his train of thought, which he presents engagingly and illustrates well with graphics. The articles could serve as examples to students of what 'critical thinking' might be.

The report linked above is particularly interesting because it addresses ethical issues. If you look at the comments, you'll see a range of reactions from "that's cool" to "how dare you!", including a couple by putative psychology researchers who mention IRB processes. This comes on the heels of Facebook's research on manipulating attitudes, and the resulting media fiasco.

This is a looming problem for us in higher education, too. As an example, imagine a software application that tracks students on campus by using the wi-fi network's access points and connections to cellphones. This could be used to identify student behaviors that are predictive of academic performance and retention (e.g. class attendance, social activity). Whereas taking roll manually in class is an accepted method of monitoring student behavior, cellphone tracking crosses a line into creepy. The only way to proceed with such a project would be transparently, in my opinion, which could be done with an opt-in program. In such a program, students would be given a description and an opportunity to sign up. In return, they would receive information back, probably both the detailed information being gathered and summary reports on the project. I have been looking for examples of colleges taking this approach. If you know of one, please let me know!

See also: "okcupid is the new facebook? more on the politics of algorithmic manipulation" at scatterplot.com.

Sunday, July 27, 2014

Survey Prospector

(last updated 5/30/2015)

Survey Prospector is a web-based interface for quickly exploring discrete data. It is intended to support a "want-know-do" cycle of intelligent action. It allows you to quickly execute a predictor-finding workflow to try to find potential cause/effect relationships you care about.

  1. Normalize scalar or nominal data into bins or small numbers of categories if necessary. 
  2. Apply any filters of interest (e.g. just males or just females).
  3. Identify a target (dependent) variable and create a binary classification.
  4. List the independent variables in decreasing order of predictive power over the dependent variable, with graphs and suitable statistics automatically generated.
  5. Browse these top predictors to get a sense of what is important, including linear and non-linear relationships between pairs of them.
  6. Visually inspect correlational maps between the most important independent variables.
  7. Create multivariate predictors using combinations of the best individual predictors.
  8. Cross-validate the model by keeping some data back to test the predictor on.
  9. Assign modeled probabilities to cases, e.g. to predict attrition. 
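As a rough illustration of steps 1, 3, and 4, here is a minimal Python sketch that bins scalar data into quartiles and ranks variables by the area under the ROC curve (AUC). It is not Survey Prospector's actual code: the function names, the synthetic data, and the choice of AUC on quartile bins as the ranking statistic are all assumptions made for the example.

```python
import random

def quartile_bins(values):
    """Step 1 (assumed form): map each scalar value to a quartile bin 1..4."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[n // 4], ordered[n // 2], ordered[(3 * n) // 4]]
    return [1 + sum(v >= c for c in cuts) for v in values]

def auc(scores, labels):
    """Chance that a random positive case outscores a random negative case
    (ties count one half). This equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def rank_predictors(columns, target):
    """Step 4 (assumed form): rank binned variables by AUC, ignoring direction."""
    ranked = []
    for name, values in columns.items():
        a = auc(quartile_bins(values), target)
        ranked.append((name, max(a, 1 - a)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Synthetic, purely illustrative data: hs_gpa is related to leaving, noise is not.
random.seed(0)
n = 400
gpa = [random.gauss(3.0, 0.5) for _ in range(n)]
noise = [random.random() for _ in range(n)]
# Step 3: binary target, 1 = student left (hypothetical relationship).
left = [1 if random.random() < (0.6 if g < 2.8 else 0.15) else 0 for g in gpa]

ranking = rank_predictors({"hs_gpa": gpa, "noise": noise}, left)
for name, strength in ranking:
    print(f"{name}: AUC = {strength:.2f}")
```

Note that AUC computed on ordered bins only rewards monotone relationships, so a real tool would need something more to catch the non-linear cases mentioned in step 5.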

This all happens in real time, so it can be used in meetings to answer questions on the spot if you like. A common malady of IR offices is that it's much easier to ask questions than to answer them. This tool can be used to effectively prioritize research. It lends itself to data warehousing, where you might build a longitudinal history of student data across a spectrum of types. Then it becomes trivial to ask and answer questions on the fly like "what student non-cognitives predict good grades their first semester?" or "what's the effect of work-study on first year attrition?"

Here is the application: Survey Prospector v2-1-15 [Note: if the online app doesn't work, it's because my hourly limit for the month has been reached.]. A video demo can be found here, using this data set. If you need a primer on predictors and ROC curves, try this.

Here's a video tour using a 45M CIRP national data set. You can do something similar with a small sample by downloading these files:
  • CIRP1999sample.csv, a random sample of 401 rows taken from the 38,844 in HERI's 1999 CIRP survey data set
  • CIRPvars.csv, an index to the items and responses

Technical details. If you want to try this on your own data, please don't upload anything with student identifiers or other sensitive information. The data should be in this format:
  • Data files need a header row with variable names.
  • Index files do not have a header. They are just a variable name, a comma, and the description without commas. Only one comma per line unless you want to put the description in quotes. Index files are optional, but they help decipher results later on. Download the example CIRPvars.csv listed above to see one.
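To make the index file format concrete, here is a small sketch that reads one with Python's standard csv module. The variable names and descriptions in the sample are made up for illustration and are not taken from CIRPvars.csv; read_index is a hypothetical helper, not part of the application.

```python
import csv
import io

def read_index(text):
    """Return {variable_name: description} from header-less index text.

    Each line is a variable name, a comma, and a description. A description
    that contains commas must be wrapped in double quotes.
    """
    index = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        index[row[0].strip()] = row[1].strip()
    return index

# Hypothetical index content (not the real CIRP variable names).
sample = (
    "AGE,Age of respondent\n"
    'HSGPA,"High school GPA, self-reported"\n'
)
index = read_index(sample)
for name, description in index.items():
    print(name, "->", description)
```

The csv module handles the quoting rule for you, so a quoted description keeps its internal comma instead of being split into extra fields.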
I'm in the process of creating a real website for this project, but it's not finished yet.


This is the main tab, where the predictors are sorted and displayed in order of importance.

The graphs show the data distribution (with case types in blue or pink), a ROC curve for assessing predictor power, ratios with confidence intervals to assess statistical significance, and a table with case numbers and ratios in order to identify useful thresholds. 
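For readers who want to see how a table of case numbers and ratios with confidence intervals might be computed, here is a minimal sketch. It assumes the "ratio" is the proportion of target cases within each bin and uses a Wilson score interval; the application may well compute these differently, and the bins and labels below are invented for the example.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)

def bin_table(bins, labels):
    """For each bin: (case count, proportion of target cases, interval)."""
    table = {}
    for b in sorted(set(bins)):
        ys = [y for x, y in zip(bins, labels) if x == b]
        k, n = sum(ys), len(ys)
        table[b] = (n, k / n, wilson_interval(k, n))
    return table

# Invented example: quartile bins of some predictor and a 0/1 target.
bins = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
for b, (n, rate, (lo, hi)) in bin_table(bins, labels).items():
    print(f"bin {b}: n={n}, rate={rate:.2f}, interval=({lo:.2f}, {hi:.2f})")
```

Bins whose intervals do not overlap are the ones where a threshold is likely to be useful; with only a handful of cases per bin, as here, the intervals are wide.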

The screen capture above shows a dynamic exploration of the correlations between best predictors (red lines are negative correlations). The sliders at the top allow for coarse- or fine-grain inspection of variables. This allows you to see how predictors cluster visually. In researching attrition, this technique easily allowed identification of major categories of risk evident in our data: academic, financial, social engagement, and psychological.  

Please leave feedback below or email me at deubanks.office@gmail.com.

Monday, July 14, 2014

Finding and Using Predictors of Student Attrition

A while back, I wrote about finding meaning in data, and this has turned into a productive project. In this article I'll describe some findings on causes of student attrition, a conceptual framework of actions to prevent attrition, and an outline of the methods we use.

In order to find predictors of attrition, we need to find information that was gathered before the student left. A different approach is to ask the student why he or she is leaving during the withdrawal process, but I won't talk about that here. We use HERI's The Freshman Survey each fall and get a high response rate (the new students are all in a room together). Combining this with the kinds of information gathered during the admissions process gives several hundred individual pieces of information. These data rows are 'labeled' per student with the binary variables for attrition (first semester, second semester, and so on). In several years of data, and relying on two different liberal arts colleges, we get the same kinds of predictors of attrition:

  • Low social engagement
  • High financial need
  • Poor academics
  • Psychology that leads to attrition: initial intent to transfer, extreme homesickness, and so on.
These can occur in various combinations. A student who is financially stressed and working two jobs off campus may find it hard to keep up her grades. At the AIR Forum (an institutional research conference) this June, we saw a poster from another liberal arts college that identified the same four categories. They used different methods, and we have a meeting set with them to compare notes.
For purposes of forming actions, we assume that these predictive conditions are causal. There's no way to prove that without randomized experiments, which are impossible, and doing nothing is not an option. In order to match up putative causes with actions, we relied on Vincent Tinto's latest book Completing College: Rethinking Institutional Action, taking his categories of action and cross-indexing them with our causes. Then we annotated it with the existing and proposed actions we were considering. The table below shows that conceptual framework for action.
Each letter designates some action. The ones at the bottom are actions related to getting better information. An example of how this approach generates thoughtful action is given next.

Social Engagement may happen through students attending club meetings, having work-study, playing sports, taking a class at the fitness center, and so on. This can be hard to track. This led us to consider adopting a software product that would do two things: (1) help students more easily find social activities to engage with, and (2) help us better track participation. As it turned out, there was a company in town that does exactly this, called Check I'm Here. We had them over for a demo, and then I went to their place to chat with Reuben Pressman, the CEO and founder. I was very impressed with the vision and passion of Reuben and his team. You can click through the link to their web site for a full rundown of features, but here's a quote from Reuben:

The philosophy is based around a continuous process to Manage, Track, Assess, & Engage students & organizations. It flows a lot like the MVP idea behind the book "The Lean Startup" that talks about a process of trying something, seeing how it goes, getting feedback, and making it better, then starting over again. We think of our engagement philosophy the same way:
  • Manage -- Organize and structure your organizations, events, and access to the platform.
  • Track -- Collect data in real-time and verify students live with mobile devices
  • Assess -- Integrate newly collected data and structurally combine it with existing data to give real-time assessment of what works and doesn't and what kinds of students are involved
  • Engage -- Use your new information to make educated decisions and use our tools for web and mobile to attract students in new ways
  • Rinse, and Repeat for more success!
A blog post talking more about our tracking directly is here. We take a focus on Assessing Involvement, Increasing Engagement, Retaining Students, and Successfully Allocating Funding.
Currently, we can get card-swipe counts for our fitness center, because access is controlled for security reasons. An analysis of the data gives some indication (this is not definitive) that students who use the fitness center more are retained at a higher rate than those who don't. The effect manifests itself after about a year, as a bonus of three percentage points in retention. The ability to capture student attendance at club events, academic lectures, and so on with an easy portable card-swipe system like Check I'm Here is very attractive. It also helps these things happen--students can check an app on their phones to see what's coming up, and register their interest in participating.


I put this last, because not everyone wants to know about the statistics. At the AIR forum, I gave a talk on this general topic, which was recorded and is available through the organization. I think you might have to pay for access, though.

The problem of finding which variables matter among hundreds of potential ones is sometimes solved with step-wise linear regression, but in my experience this is problematic. For one thing, it assumes that relationships are linear, when they might well not be. Suppose the students who leave are those with the lowest and the highest grades. That wouldn't show up in a linear model. I suppose you could cross multiply all the variables to get non-linear ones, but now you've got tens of thousands of variables instead of hundreds.
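The lowest-and-highest-grades case is easy to demonstrate with made-up numbers. In the sketch below, the data are synthetic and the 50% and 10% leave rates are arbitrary assumptions; the point is only that the linear correlation comes out near zero while the per-bin leave rates show the U shape plainly.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def leave_rate_by_bin(values, labels, width=1.0, bins=4):
    """Fraction of leavers within each equal-width bin of the predictor."""
    counts = [0] * bins
    leavers = [0] * bins
    for v, y in zip(values, labels):
        b = min(int(v // width), bins - 1)
        counts[b] += 1
        leavers[b] += y
    return [l / c for l, c in zip(leavers, counts)]

# Synthetic data: students at both grade extremes leave more often.
random.seed(1)
n = 2000
gpa = [random.uniform(0.0, 4.0) for _ in range(n)]
left = [
    1 if random.random() < (0.5 if (g < 1.0 or g > 3.0) else 0.1) else 0
    for g in gpa
]

r = pearson(gpa, left)
rates = leave_rate_by_bin(gpa, left)
print(f"linear correlation: {r:.3f}")
print("leave rate by GPA bin:", [round(x, 2) for x in rates])
```

A step-wise linear regression would pass over this variable, even though the outer bins carry several times the attrition risk of the middle ones.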

There are more sophisticated methods available now, like lasso, but they aren't attractive for what I want. As far as I can tell, they assume linearity too. Anyway, there's a very simple solution that doesn't assume anything. I began developing the software to quickly implement it two years ago, and you can see an early version here.

I've expanded that software to create a nice work flow that looks like this:
  1. Identify what you care about (e.g. retention, grades) and create a binary variable out of it per student
  2. Accumulate all the other variables we have at our disposal that might be predictors of the one we care about. I put these in a large CSV file that may have 500 columns
  3. Normalize scalar data to (usually) quartiles, and truncate nominal data (e.g. state abbreviations) by keeping only the most frequent ones and calling the rest 'other'. NAs can be included or not.
  4. Look at a map of correlates within these variables, to see if there is structure we'd expect (SAT should correlate with grades, for example)
  5. Run the univariate predictor algorithm against the one we care about, and rank these best to worst. This usually takes less than a minute to set up and run.
  6. Choose a few (1-6 or so) of the best predictors and see how they perform pairwise. This means taking them two at a time to see how much the predictive power improves when both are considered. 
  7. Take the best ones that seem independent and combine them in a linear model if that seems appropriate (the variables need to act like linear relationships). Cross-validate the model by generating it on half the data and testing it against the other half; do this 100 times and plot the distribution of predictive power.
Once the data set is compiled, all of this takes no more than four or five minutes. A couple of sample ROC curves and AUC histograms are shown below.
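The repeated split-half validation in step 7 can be sketched as follows. This is an illustration under stated assumptions, not the software's implementation: the "linear model" here is a deliberately crude stand-in (each weight is the difference in class means on the training half), the data are synthetic, and names like split_half_aucs are hypothetical.

```python
import random
import statistics

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fit_weights(rows, labels):
    """Crude linear scorer: weight = mean among positives minus mean among negatives."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    return [
        statistics.fmean([r[j] for r in pos]) - statistics.fmean([r[j] for r in neg])
        for j in range(len(rows[0]))
    ]

def split_half_aucs(rows, labels, repeats=100, seed=0):
    """Fit on a random half, test on the other half, and repeat."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    results = []
    for _ in range(repeats):
        rng.shuffle(idx)
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        w = fit_weights([rows[i] for i in train], [labels[i] for i in train])
        scores = [sum(wj * xj for wj, xj in zip(w, rows[i])) for i in test]
        results.append(auc(scores, [labels[i] for i in test]))
    return results

# Synthetic students: two quartile-binned predictors and a 0/1 attrition label.
rng = random.Random(42)
rows, labels = [], []
for _ in range(400):
    grades_q = rng.randint(1, 4)
    need_q = rng.randint(1, 4)
    p_leave = 0.05 + 0.08 * (4 - grades_q) + 0.06 * (need_q - 1)
    rows.append((grades_q, need_q))
    labels.append(1 if rng.random() < p_leave else 0)

aucs = split_half_aucs(rows, labels)
print(f"AUC over {len(aucs)} splits: min={min(aucs):.2f}, "
      f"median={statistics.median(aucs):.2f}, max={max(aucs):.2f}")
```

A histogram of the aucs list gives the distribution of predictive power; a tight distribution well above 0.5 suggests the predictors are not merely accidental.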

This predictor is a four-variable model for a largish liberal arts college I worked with. It predicts student retention based on grades, finances, social engagement, and intent to transfer (as asked by the entering freshman CIRP survey).

At the end of this process, we can have some confidence in what predictors are resilient (not merely accidental), and how they work in combination with each other. The idea for me is not to try to predict individual students, but to understand the causes in such a way that we can treat them systematically. The example with social engagement is one such instance. Ideally, the actions taken are natural ones that make sense and have a good chance of improving the campus. 

Saturday, July 12, 2014

A Cynical Argument for the Liberal Arts: Part Sixteen

Previously: Part Zero ... Part Fifteen

The "Cynical Business Award" goes to ReservationHop, which I read about in CBC News here. Quote:
A San Francisco startup that sells restaurant reservations it has made under assumed names is raising the ire of Silicon Valley critics.  
ReservationHop, as it's called, is a web app with a searchable database of reservations made at "top SF restaurants." The business model is summed up in the website's tagline: "We make reservations at the hottest restaurants in advance so you don't have to."  
Users can buy the reservations, starting at $5 apiece, and assume the fake identity ReservationHop used to book the table. "After payment, we'll give you the name to use when you arrive at the restaurant," the website says. 
 The "coin of the realm" in this case is the trust between diner and restaurant that is engaged when a reservation is placed. This is an informal social contract that assures the diner a table and assures the restaurant a customer. Sometimes these agreements are broken in either direction, but the system is valuable and ubiquitous. The ReservationHop model is to take advantage of the anonymity of phone calls to falsely gain the trust of the restaurant and essentially sell it at $5 a pop. This erodes trust, debasing the aforementioned coin of the realm. Maybe the long term goal of the company is to become the Ticketmaster of restaurant reservations.

One can imagine monetizing all the informal trust systems in society in this way. Here's a business model, free of charge: you know all those commercial parking lots that give you a time stamped ticket when you drive in? It's easy to subvert that with a little forethought. Imagine an app that you use when you are ready to leave. With it, you meet up with someone who is just entering the parking lot and exchange tickets with them. You get to leave by paying for almost no time in the lot. If they do the same, so can they, ad infinitum. Call it ParkingHop.

One can argue that these disruptive innovations lead to improvements. In both the cases above, the debasement of the trust-coin is due to anonymity, which nowadays can be easily fixed. The restaurant can just ask for a cell-phone number instead of a name to verify the customer, for example, and check it by calling the phone. This isn't perfect, but generally the fixing of personal identity to actions creates more responsible acts. The widely-quoted faking of Amazon.com book reviews, for example, is greatly facilitated by paid-for "sock puppet" reviewers taking on many identities. So anonymity can be a power multiplier, the way money is in politics. The natural "improvement," if we want to call it that, is better record keeping and personal registration of transactions. This is what the perennial struggle to get an intrusive "cybersecurity" law passed is all about (so kids can't download movies without paying for them), and the NSA's vacuuming up of all the data it can. We move from "trust," to "trust but verify," to "verify."

These are liberal artsy ideas about what it is to be human and what it is to be a society. The humanities are dangerous. How many millions have died because of religion or ideology? I've been wondering lately how we put that tension in the classroom. Imagine a history class with a real trigger warning: Don't take this class if you have a weak disposition. If you aren't genuinely terrified by the end of a class, I haven't done my job.