Friday, September 30, 2011

A Recipe for Finding Correlates in Large Data Sets

The internet has revolutionized intelligence. I've seen articles about how it's making us dumber, and I don't know if that's true, but it's certainly made me spoiled. In the old days if I had a computer problem I would just use brute trial and error, often giving up before finding a solution. Now, I just assume that someone else has already had the same problem and kindly posted the solution on a message board somewhere. So a few Google searches almost always solves the problem. Not this time.

This problem is a bothersome thing that comes up occasionally, but not often enough that I've taken action on it. It happens when I have a large data set to analyze and I want to see what's related to what. It's easy enough in SPSS to generate a correlation table with everything I want to know, but it's too much information. If there are 100 items on a survey, the correlation matrix is 100x100 = 10,000 cells. Half of them are repeats, but that's still a lot to look at. So I wanted a way to filter out all the results except the ones with a certain significance level.

I poked around at scripting sites for SPSS, but couldn't find what I was looking for. The idea of writing code in a Basic-like language gives me hives too (don't get me wrong--I grew up on AppleSoft Basic, but somehow using it for this sort of thing just seems wrong).

So without further ado, here's the solution I found. I'm sure someone has a more elegant one, but this has the virtue of being simple.

How-to: Finding Significant Correlates

The task: take a set of numerical data (possibly with missing values) with column labels in a comma-separated file and produce a list of what is correlated with what other variables at some given cut-off for the correlation coefficients. Usually we would want to look for ones larger than a certain value.

Note that some names are definable. I was using CIRP data, so I called my data set that. I'll put the names you can define in bold. Everything else is verbatim. The hash # lines denote a comments, which you don't need to enter--it's just to explain what's going on.

Step One
Download R, the free stats package, if you don't have it already. Launch it to get the command prompt and run these commands (cribbed mostly from this site).

# choose a file for input data and name it something

# import the columns into R for analysis

# create a correlation matrix, using pairwise complete observation. other options can be found here
cirp.mat = cor(, use ="pairwise.complete.obs")

# output this potentially huge table to a text file. Note that here you use forward slashes even in Windows

Step Two
Download ActiveState Perl if you don't have it (that's for Windows). Run the following script to filter the table. You can change the file names and the threshold value as you like. [Edit: I had to replace the code below with an image because it wasn't rendering right. You can download the script here.]

Step Three
Go find the output file you just created. It will look like this:

YRSTUDY2 <-> YRSTUDY1 (0.579148687526634)
YRSTUDY3 <-> SATV (0.434618737520563)
YRSTUDY3 <-> SATW (0.491389963307668)
DISAB2 <-> ACTCOMP (-0.513776993632538)
DISAB4 <-> SATV (0.540769639192817)
DISAB4 <-> SATM (0.468981872216475)
DISAB4 <-> DISAB1 (0.493333333333333)

The variable names are linked by the <-> symbol to show a correlation, and the significance level (that is, the coefficient) is show in parenthesis. If you want the p-value, you'll have to do that separately.

Step Four (optional)
Find a nice way to display the results. I am preparing for a board report, and used Prezi to create graphs of connections showing self-reported behaviors, attitudes, and beliefs of a Freshman class. Here's a bit of it. A way to improve this display would be to incorporate the frequency of responses as well as the connections between items, perhaps using font size or color. [Update: see my following post on this topic.]

No comments:

Post a Comment