The internet has revolutionized intelligence. I've seen articles about how it's making us dumber, and I don't know if that's true, but it's certainly made me spoiled. In the old days if I had a computer problem I would just use brute trial and error, often giving up before finding a solution. Now, I just assume that someone else has already had the same problem and kindly posted the solution on a message board somewhere. So a few Google searches almost always solves the problem. Not this time.
This problem is a bothersome thing that comes up occasionally, but not often enough that I've taken action on it. It happens when I have a large data set to analyze and I want to see what's related to what. It's easy enough in SPSS to generate a correlation table with everything I want to know, but it's too much information. If there are 100 items on a survey, the correlation matrix is 100x100 = 10,000 cells. Half of them are repeats, but that's still a lot to look at. So I wanted a way to filter out all the results except the ones with a certain significance level.
I poked around at scripting sites for SPSS, but couldn't find what I was looking for. The idea of writing code in a Basic-like language gives me hives too (don't get me wrong--I grew up on AppleSoft Basic, but somehow using it for this sort of thing just seems wrong).
So without further ado, here's the solution I found. I'm sure someone has a more elegant one, but this has the virtue of being simple.
How-to: Finding Significant Correlates
The task: take a set of numerical data (possibly with missing values) with column labels in a comma-separated file and produce a list of what is correlated with what other variables at some given cut-off for the correlation coefficients. Usually we would want to look for ones larger than a certain value.
Note that some names are definable. I was using CIRP data, so I called my data set that. I'll put the names you can define in bold. Everything else is verbatim. The hash # lines denote a comments, which you don't need to enter--it's just to explain what's going on.
Step One
Download R, the free stats package, if you don't have it already. Launch it to get the command prompt and run these commands (cribbed mostly from this site).
# choose a file for input data and name it something
cirp.data=read.csv(file.choose())
# import the columns into R for analysis
attach(cirp.data)
# create a correlation matrix, using pairwise complete observation. other options can be found here
cirp.mat = cor(cirp.data, use ="pairwise.complete.obs")
# output this potentially huge table to a text file. Note that here you use forward slashes even in Windows
write.table(cirp.mat,"c:/cirpcor.txt")
Step Two
Download ActiveState Perl if you don't have it (that's for Windows). Run the following script to filter the table. You can change the file names and the threshold value as you like. [Edit: I had to replace the code below with an image because it wasn't rendering right. You can download the script here.]
Step Three
Go find the output file you just created. It will look like this:
YRSTUDY2 <-> YRSTUDY1 (0.579148687526634)
YRSTUDY3 <-> SATV (0.434618737520563)
YRSTUDY3 <-> SATW (0.491389963307668)
DISAB2 <-> ACTCOMP (-0.513776993632538)
DISAB4 <-> SATV (0.540769639192817)
DISAB4 <-> SATM (0.468981872216475)
DISAB4 <-> DISAB1 (0.493333333333333)
The variable names are linked by the <-> symbol to show a correlation, and the significance level (that is, the coefficient) is show in parenthesis. If you want the p-value, you'll have to do that separately.
Step Four (optional)
Find a nice way to display the results. I am preparing for a board report, and used Prezi to create graphs of connections showing self-reported behaviors, attitudes, and beliefs of a Freshman class. Here's a bit of it. A way to improve this display would be to incorporate the frequency of responses as well as the connections between items, perhaps using font size or color. [Update: see my following post on this topic.]
Subscribe to:
Post Comments (Atom)
-
The student/faculty ratio, which represents on average how many students there are for each faculty member, is a common metric of educationa...
-
(A parable for academic workers and those who direct their activities) by David W. Kammler, Professor Mathematics Department Southern Illino...
-
The annual NACUBO report on tuition discounts was covered in Inside Higher Ed back in April, including a figure showing historical rates. (...
-
Introduction Stephen Jay Gould promoted the idea of non-overlaping magisteria , or ways of knowing the world that can be separated into mutu...
-
In the last article , I showed a numerical example of how to increase the accuracy of a test by splitting it in half and judging the sub-sco...
-
Introduction Within the world of educational assessment, rubrics play a large role in the attempt to turn student learning into numbers. ...
-
I'm scheduled to give a talk on grade statistics on Monday 10/26, reviewing the work in the lead article of JAIE's edition on grades...
-
Inside Higher Ed today has a piece on " The Rise of Edupunk ." I didn't find much new in the article, except that perhaps mai...
-
I just came across a 2007 article by Daniel T. Willingham " Critical Thinking: Why is it so hard to teach? " Critical thinking is ...
-
"How much data do you have?" is an inevitable question for program-level data analysis. For example, assessment reports that attem...
No comments:
Post a Comment