Sunday, July 27, 2014

Survey Prospector

(last updated 5/30/2015)

Survey Prospector is a web-based interface for quickly exploring discrete data. It is intended to support a "want-know-do" cycle of intelligent action. It lets you run a predictor-finding workflow quickly, to look for potential cause/effect relationships you care about. The workflow looks like this:

  1. Normalize scalar or nominal data into bins or small numbers of categories if necessary. 
  2. Apply any filters of interest (e.g. just males or just females).
  3. Identify a target (dependent) variable and create a binary classification.
  4. List the independent variables in decreasing order of predictive power over the dependent variable, with graphs and suitable statistics automatically generated.
  5. Browse these top predictors to get a sense of what is important, including linear and non-linear relationships between pairs of them.
  6. Visually inspect correlational maps between the most important independent variables.
  7. Create multivariate predictors using combinations of the best individual predictors.
  8. Cross-validate the model by keeping some data back to test the predictor on.
  9. Assign modeled probabilities to cases, e.g. to predict attrition. 
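To make steps 1, 3, and 4 concrete, here is a minimal sketch in Python. It is not the application's code: it assumes that "predictive power" is measured by the area under the ROC curve (AUC), and the column names, cut points, and toy data are all hypothetical.

```python
def to_bin(value, cut_points):
    """Step 1: map a scalar to a bin index from 0 to len(cut_points)."""
    return sum(value >= c for c in cut_points)


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # undefined without both classes; treat as no information
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_predictors(rows, target, is_positive):
    """Steps 3-4: binarize the target, then sort the other columns by AUC."""
    labels = [1 if is_positive(r[target]) else 0 for r in rows]
    ranked = []
    for name in rows[0]:
        if name == target:
            continue
        a = auc([r[name] for r in rows], labels)
        # An AUC below 0.5 is equally informative with the direction flipped.
        ranked.append((name, max(a, 1 - a)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: high school GPA, a work-study flag, first-term GPA.
raw = [
    {"hs_gpa": 3.9, "work_study": 0, "term_gpa": 3.7},
    {"hs_gpa": 3.4, "work_study": 1, "term_gpa": 3.1},
    {"hs_gpa": 2.8, "work_study": 0, "term_gpa": 2.2},
    {"hs_gpa": 2.5, "work_study": 1, "term_gpa": 2.9},
    {"hs_gpa": 3.6, "work_study": 1, "term_gpa": 3.4},
    {"hs_gpa": 2.2, "work_study": 0, "term_gpa": 1.9},
]
rows = [dict(r, hs_gpa=to_bin(r["hs_gpa"], [2.5, 3.0, 3.5])) for r in raw]
ranking = rank_predictors(rows, "term_gpa", lambda g: g >= 3.0)
for name, score in ranking:
    print(f"{name}: AUC = {score:.2f}")
```

The pairwise AUC is quadratic in the number of cases, which is fine for a small sample like this one; a rank-based formula would be the usual choice for larger data.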

This all happens in real time, so the tool can be used in meetings to answer questions as they come up. A common malady of IR offices is that it's much easier to ask questions than to answer them, and a tool like this can help prioritize research. It lends itself to data warehousing, where you might build a longitudinal history of student data across a spectrum of types. Then it becomes trivial to ask and answer questions on the fly, like "what student non-cognitives predict good grades their first semester?" or "what's the effect of work-study on first year attrition?"

Here is the application: Survey Prospector v2-1-15. [Note: if the online app doesn't work, it's because my hourly limit for the month has been reached.] A video demo can be found here, using this data set. If you need a primer on predictors and ROC curves, try this.

Here's a video tour using a 45M CIRP national data set. You can do something similar with a small sample by downloading these files:
  • CIRP1999sample.csv a random sample of 401 rows taken from the 38,844 in HERI's 1999 CIRP survey data set
  • CIRPvars.csv, an index to the items and responses

Technical details. If you want to try this on your own data, please don't upload anything with student identifiers or other sensitive information. The data should be in this format:
  • Data files need a header row with variable names.
  • Index files do not have a header. Each line is just a variable name, a comma, and the description. The description should not contain commas unless you put it in quotes. Index files are optional, but they help decipher results later on. Download the example CIRPvars.csv listed above to see one.
I'm in the process of creating a real website for this project, but it's not finished yet.
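If you want to check your files against the format described above before uploading, a short script along these lines can help. This is only a sketch using Python's standard csv module; the variable names and file contents below are made up for illustration, and the app itself may parse files differently.

```python
import csv
import io


def read_data(handle):
    """Data file: the first row is the header of variable names."""
    return list(csv.DictReader(handle))


def read_index(handle):
    """Index file: no header; each line is a name, a comma, a description."""
    index = {}
    for fields in csv.reader(handle):
        if not fields:
            continue  # skip blank lines
        # The csv module keeps a quoted description with commas in one field.
        # If an unquoted description was split on commas, rejoin the pieces.
        index[fields[0].strip()] = ",".join(fields[1:]).strip()
    return index


# Hypothetical contents standing in for real files.
data_text = "SEX,HSGPA\n1,4\n2,6\n"
index_text = 'SEX,Student sex\nHSGPA,"Average grade in high school, self-reported"\n'

data = read_data(io.StringIO(data_text))
index = read_index(io.StringIO(index_text))
for name in data[0]:
    print(name, "->", index.get(name, "(no description)"))
```

With files on disk, replace the io.StringIO objects with open(path, newline=""), which is what the csv module documentation recommends.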


This is the main tab, where the predictors are sorted and displayed in order of importance.

The graphs show the data distribution (with case types in blue or pink), a ROC curve for assessing predictor power, ratios with confidence intervals to assess statistical significance, and a table with case numbers and ratios in order to identify useful thresholds. 
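For readers who want to see how a table like that relates to the ROC curve, here is a rough sketch. The post does not say exactly which ratios or intervals the app reports, so the choices here are assumptions: the "ratio" is the share of positive cases at or above each threshold, and the interval is a simple normal approximation. The data are hypothetical.

```python
import math


def threshold_table(values, labels, z=1.96):
    """For each threshold t, summarize the cases with value >= t."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case of each type")
    table = []
    for t in sorted(set(values)):
        hits = [y for v, y in zip(values, labels) if v >= t]
        n, k = len(hits), sum(hits)
        rate = k / n
        half = z * math.sqrt(rate * (1 - rate) / n)  # normal approximation
        table.append({
            "threshold": t,
            "cases": n,
            "rate": rate,
            "ci": (max(0.0, rate - half), min(1.0, rate + half)),
            "tpr": k / n_pos,        # ROC curve y-coordinate
            "fpr": (n - k) / n_neg,  # ROC curve x-coordinate
        })
    return table


# Hypothetical binned predictor (0 to 3) and binary outcome.
values = [3, 2, 1, 1, 3, 0, 2, 0]
labels = [1, 1, 0, 0, 1, 0, 0, 1]
table = threshold_table(values, labels)
for row in table:
    low, high = row["ci"]
    print(f"t>={row['threshold']}: n={row['cases']}, rate={row['rate']:.2f} "
          f"[{low:.2f}, {high:.2f}], TPR={row['tpr']:.2f}, FPR={row['fpr']:.2f}")
```

Each row gives one point (FPR, TPR) on the ROC curve. A useful threshold is one where the rate is well away from the overall rate and the interval is narrow enough to trust, which is hard to achieve with very few cases.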

The screen capture above shows a dynamic exploration of the correlations between best predictors (red lines are negative correlations). The sliders at the top allow for coarse- or fine-grain inspection of variables. This allows you to see how predictors cluster visually. In researching attrition, this technique easily allowed identification of major categories of risk evident in our data: academic, financial, social engagement, and psychological.  
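The idea behind such a map can be sketched in a few lines. This is an illustration only, under the assumption that a slider acts as a cutoff on correlation strength and that the map uses Pearson correlations; the app's actual method is not described here, and the variables and numbers are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation undefined, draw no edge
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(columns, cutoff):
    """Return one edge for each pair of variables with |r| >= cutoff."""
    edges = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= cutoff:
            sign = "negative" if r < 0 else "positive"
            edges.append((a, b, round(r, 3), sign))
    return edges


# Hypothetical binned predictors, one list of values per variable.
columns = {
    "hs_gpa": [3, 2, 1, 1, 3, 0],
    "sat_bin": [3, 2, 1, 0, 2, 0],
    "unmet_need": [0, 1, 2, 3, 1, 3],
}
for cutoff in (0.5, 0.95):
    print(cutoff, correlation_edges(columns, cutoff))
```

Raising the cutoff keeps only the strongest links, which is one way clusters of related predictors become visible.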

Please leave feedback below or email me at
