Wednesday, June 27, 2012

Retrieving Named Arrays in Statistics::R

This is mostly a note to myself, but anyone else who uses Statistics::R from CPAN to drive the R statistics engine from Perl scripts may find it useful. I find myself needing to retrieve named arrays from R for use in programs, and there's no easy way to do that with the built-in functions. So here's the code:

sub get_hash { # custom method that returns an R vector as
               # a hash indexed by the vector's element names
   my ($self, $varname) = @_;
   my $values_str = $self->run(qq{cat($varname)});
   my $keys_str   = $self->run(qq{cat(names($varname))});
   # cat() separates elements with whitespace and may wrap long
   # vectors onto several lines, so split on whitespace runs
   # (split ' ' also discards any leading whitespace).
   my @values = split ' ', $values_str;
   my @keys   = split ' ', $keys_str;
   my %hash;
   @hash{@keys} = @values;   # pair each name with its value
   return \%hash;
}

Here's an example, using a matrix:
#take the number of rows in the matrix minus the column sums of blanks
$R->send(qq{n2= nrow(cols2.mat) - colSums(is.na(cols2.mat))});

#now we have a vector (with variable names) that has the number of non-blanks

#retrieve it from R into perl 
$n2 = $R->get_hash('n2');
 
In the example, the return value is a reference to a hash. You get the values out with something like:
$val_for_variable = $n2->{$name_of_variable};  # equivalently, ${$n2}{$name_of_variable}
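
For completeness, here's one way to wire it all together and call the method end to end. This is only a minimal sketch under my own assumptions: the glob assignment that installs get_hash as a Statistics::R method and the toy named vector x are my illustration, not part of Statistics::R itself.

use strict;
use warnings;
use Statistics::R;

# Install get_hash (defined above) as a method on Statistics::R
# so it can be called as $R->get_hash('varname').
*Statistics::R::get_hash = \&get_hash;

my $R = Statistics::R->new();
$R->run(q{x <- c(a = 1, b = 2, c = 3)});   # toy named vector, for illustration only
my $x = $R->get_hash('x');
print "$_ => $x->{$_}\n" for sort keys %$x;
$R->stop();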

Tuesday, June 12, 2012

Ed Tech Blogs

This is just a pointer to a nice list of K-12 Education Technology blogs, courtesy of EdTech. [link]


Browsing through the links, I found out about The Leap. Watch the video:



And also an idea apparently from Apple called Challenge Based Learning [pdf]. Excerpt:


Challenge Based Learning provides:
  • A flexible framework with multiple entry points
  • A scalable model with no proprietary systems or subscriptions
  • A focus on global challenges with local solutions
  • An authentic connection between academic disciplines and real world experience
  • A framework and workflow to develop 21st century skills
  • The purposeful use of technology for researching, analyzing, organizing, collaborating, communicating, publishing and reflecting.
  • The opportunity for learners to do something important now, rather than waiting until they are finished with their schooling
  • The documentation and assessment of the learning experience from challenge to solution
  • An environment for deep reflection on teaching and learning
  • A process that places students in charge of their learning
This sounds a lot like where I have been headed. The only thing missing is the connection to a professional portfolio.

Saturday, June 09, 2012

Outsourcing Prediction Models

Higher education can benefit from many sorts of predictors. One of the most common is modeling attrition: trying to figure out how to identify students who are going to leave before they actually do. Because you might be able to do something about it if you understand the problem, right?

Over the years, I've been underwhelmed by consultants who cater to higher education, and it seems to me that we get second-rate services.  I have seen only a few consultants that are worth the money, and I'll just mention one: SEM Works out of Greenville, NC. Jim Black seems to know everything there is about admissions, and I've seen direct benefits from his expertise.

There are various attractions to going outside for expertise. Maybe you want to make cuts, and need a hatchetman. Or maybe the decision makers are convinced that a "secret ingredient" is required, and the consulting firm has enough mystique to sell it. I don't believe much in secret ingredients. 

   

What's particularly galling is that the advanced methods of building predictive models come from universities. Then they get used at Amazon.com and Netflix and what have you, while our administrations still chug along on 50-year-old technology. So we go out and buy consultants. Who also use 50-year-old technology.

You may not be suspecting this twist, but I actually came across a solution to this problem; that's the whole reason I'm writing this post. I'll just let the free market do the talking now, from kaggle.com.


So instead of contracting up front for who-knows-what sort of analysis, you can actually bid out the work and get a solution before you pay! Of course, you have to think hard about what data to submit, and spend some time cleaning it and prepping it for such a submission. But judging from the rates on the site, it would be a lot cheaper, with an almost guaranteed better result, than buying a big-name consultant. I'll leave you with this quote from their About page:
The motivation behind Kaggle is simple: most organizations don't have access to the advanced machine learning and statistical techniques that would allow them to extract maximum value from their data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Kaggle corrects this mismatch by offering companies a cost-effective way to harness the 'cognitive surplus' of the world's best data scientists.
Update: Also see "Not so Expert" in The Economist.

Grade Transition Maps

There was a question on the ASSESS email list about evaluating grade distributions for courses. It's an interesting topic, and I dug out a data set and tried to construct some interesting non-parametric descriptions of it. Just looking at two years of fall and spring semester college grades, I first wrote a script to map transitions of grades that came in the same or earlier semesters with those that came later. You can download the code here (it's a perl script, but I had to save it as txt because of some weird Apache thing on the server).

The table below shows the fraction of grade transitions from the grade on the top to the grade down the side. So, of students who got a C one semester (the C column), nine percent got a D in the same or a subsequent semester. The columns sum to 100% because only A-F grades are considered here. I could have included Ws, but I didn't.


Note that in this data set, there's no such thing as a "D student." Those who get Ds don't tend to repeat the experience nearly as often as the other grades. Instead they tend to get As and Bs more often than Ds and Fs.
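
The linked script isn't reproduced here, but the counting it does is roughly this shape. This is a hypothetical sketch of my own: I'm assuming per-student grade histories stored as term/grade pairs, and I normalize each 'from' column to 100% as in the table above.

use strict;
use warnings;

# Hypothetical per-student histories: [term, grade] pairs (toy data, for illustration).
my %history = (
    student1 => [ [ 1, 'A' ], [ 1, 'B' ], [ 2, 'B' ] ],
    student2 => [ [ 1, 'C' ], [ 2, 'D' ], [ 2, 'A' ] ],
);

my %count;    # $count{$from}{$to} = number of observed transitions
for my $records (values %history) {
    for my $from (@$records) {
        for my $to (@$records) {
            next if $from == $to;             # don't pair a grade with itself
            next if $from->[0] > $to->[0];    # 'from' must be the same term or earlier
            $count{ $from->[1] }{ $to->[1] }++;
        }
    }
}

# Normalize each 'from' column so it sums to 100%.
for my $from (sort keys %count) {
    my $total = 0;
    $total += $_ for values %{ $count{$from} };
    for my $to (sort keys %{ $count{$from} }) {
        printf "%s -> %s: %.1f%%\n", $from, $to, 100 * $count{$from}{$to} / $total;
    }
}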

The first table was for all grades and all courses. The next one is just for computer science courses, and rather than showing the actual percentages as I did above, I took the difference with the first table. So positive numbers mean that the transition rate is higher in computer science (colored green); red and negative numbers mean a lower rate. You can see that B students are less likely to get As in computer science, for example.


The next three tables show the same thing for the disciplines shown.




Psychology gives a LOT of Ds--almost double the 'background' rate for A, B, and C prior grades. These are likely displaced Cs, because they give fewer of those. 

There's a nifty math trick we can play to do an average comparison. Imagine that a student took four classes in a row in the given discipline. The expected matrix of grade transitions would be the one I generated from the data, but raised to the fourth power. I ended up just doing this in Excel, but along the way I found a cool site for doing matrix math here. It will calculate eigenvalues and whatnot for you.

What happens is that the grade distributions converge, and you can see the 'asymptotic' effect of the discipline's grading scheme emerge, which is like a geometric average: the matrix columns all converge to the same vector. I've taken those and differenced them against the overall transition matrix, and then put these "averages" side by side below in order to compare disciplines.
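
If you'd rather script the matrix powers than use Excel, here's a bare-bones Perl sketch of the same trick. The helper subs and the 3x3 column-stochastic matrix are my own toy illustration, not the grade data above; increase the exponent and the columns drift toward a common vector, which is the 'asymptotic' profile just described.

use strict;
use warnings;

# Multiply two square matrices given as arrays of row arrayrefs.
sub mat_mult {
    my ($A, $B) = @_;
    my $n = scalar @$A;
    my $C = [];
    for my $i (0 .. $n - 1) {
        for my $j (0 .. $n - 1) {
            my $sum = 0;
            $sum += $A->[$i][$_] * $B->[$_][$j] for 0 .. $n - 1;
            $C->[$i][$j] = $sum;
        }
    }
    return $C;
}

# Raise a square matrix to a positive integer power by repeated multiplication.
sub mat_pow {
    my ($A, $power) = @_;
    my $result = $A;
    $result = mat_mult($result, $A) for 2 .. $power;
    return $result;
}

# Toy column-stochastic transition matrix over grades A, B, C:
# entry (i, j) = P(later grade i | earlier grade j); each column sums to 1.
my @grades = qw(A B C);
my $T = [
    [ 0.60, 0.30, 0.10 ],
    [ 0.30, 0.50, 0.30 ],
    [ 0.10, 0.20, 0.60 ],
];

my $T4 = mat_pow($T, 4);    # four courses in a row
for my $i (0 .. $#grades) {
    printf "%s: %s\n", $grades[$i],
        join( ' ', map { sprintf( '%.3f', $T4->[$i][$_] ) } 0 .. $#grades );
}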


All of these disciplines push students away from the higher grades to the lower ones, with computer science accumulating Cs, and English and psychology accumulating Ds and Fs. 

Because you don't have the overall matrix to compare against, here's the relative percent change over the 'background' rate of transition. The numbers below are no longer straight percentages of grade frequency (as in the tables above); they are the percent change over the overall average rate for each cell. The most dramatic one is psychology's 68% increase in the production of Ds over the 'background' transition rate.


So if these subjects are all lowering grades, which ones are raising them? It wasn't hard to find one, but I won't name it. Someone might object. The table below shows the transitions on the right and the asymptotic behavior on the left. Everything is compared to the overall background rate.


You can see this discipline pushes grades from the bottom to the top, transitioning 4% more often to Bs over time, which is a 14% increase over the background rate. It's similarly disproportionate with the rate of Cs and Ds, giving 16% and 18% fewer than the background rate.

One could refine this type of analysis to look at majors courses only, or even individual instructors, although large sample sizes are obviously needed.

Beautiful Data Abstraction

I came across "Up and Down the Ladder of Abstraction" today on the Dataisbeautiful subReddit. I'll include one graphic to entice you to read it, but it speaks for itself. It's brilliant.


This same process could be applied to analyzing enrollment management, for example. Wheels are spinning...

Friday, June 08, 2012

Bad Reliability, Part Two

In the last article, I showed a numerical example of how to increase the accuracy of a test by splitting it in half and judging the sub-scores in combination. I'm sure there's a general theorem that can be derived from that, but I haven't looked for it yet. I find it strange that in my whole career in education, I've never heard of anyone doing this in practice.

I first came across the idea that there is a tension between validity and reliability in Assessment Essentials by Palomba and Banta, page 89:

An […] issue related to the reliability of performance-based assessment deals with the trade-off between reliability and validity. As the performance task increases in complexity and authenticity, which serves to increase validity, the lack of standardization serves to decrease reliability.

So the idea is that reliability is generally good, up to the point where it interferes with validity. To analyze that more closely, we have to ask what we mean by validity.

Validity is often misunderstood to be an intrinsic property of a test or other assessment method. But 'validity' just means 'truth', and it really refers to statements made about the results from using the instrument in question. We might want to say "the test results show that students are proficient writers," for example. The problem with this is that we probably don't have a way (independent of the test) to see if the statement is true or not, so we can't actually check the validity. There are all sorts of hacks to try to get around this, like comparisons and statistical deconstructions, but I don't find any of these particularly convincing. See the Wikipedia page for more. Here's a nice image from that site that shows the usual conception of validity and reliability.

In the picture, validity is the same as accuracy, for example when shooting a rifle. Reliability would be precision.

This is an over-simplified picture of the situation that usually applies to educational testing, though. The conclusion of the previous article was that the situation depicted at top right is better (at least in some circumstances) than the one at bottom left, because we can discover useful information by repeated testing. There's no point in repeating a perfectly reliable test.

But there's another kind of 'bad reliability'.

Questions like "Can Tatiana write effectively?" can only be seriously considered if we establish what it would mean for that to be actually true. And because in education, we're supposed to be preparing students for life outside the ivy-covered walls, the answer has to be a real-world answer. It can't simply be correlations and factor analysis results from purely internal data.

The tension between trying to control for variation in order to understand 'ability' and the applicability of the results to situations where such controls are not present is nicely illuminated on the wiki page in this statement:
To get an experimental design you have to control for all interfering variables. That's why you often conduct your experiment in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant) you lose ecological or external validity because you establish an artificial lab setting. On the other hand with observational research you can't control for interfering variables (low internal validity) but you can measure in the natural (ecological) environment, at the place where behavior normally occurs. However, in doing so, you sacrifice internal validity.
Educational testing doesn't seem to concern itself overly much with external (ecological) validity, which is ironic given that the whole purpose of education is external performance. There are some really nice studies like the Berkeley Law study "Identification, Development, and Validation of Predictors for Successful Lawyering" by Shultz and Zedeck (2008), which found little external validity for either grades or standardized test results. It amazes me that our whole K-12 system has been turned into a standardized test apparatus without this sort of external validity check.

All tests are wrong all the time. It's only a matter of degree. To me, the only sensible sorts of questions that can be checked for validity are ones like this: "The probability of X happening is bounded by Y." This involves a minimum amount of theoretical construction in order to talk about probabilities, but avoids the reification fallacy of statements like "Tatiana has good writing ability."

Now the latter sort of statement can avoid a reification fallacy by converting it into a probabilistic assertion: "When others read Tatiana's writing, X proportion will rate it as good." This is now a statement about external reality that can be checked, and with a few assumptions, the results extrapolated into a probabilistic statement, which can itself be validated over time with more checking.

Imagine a perfectly reliable test. Given the same inputs it always produces the same outputs, meaning that all variation of any kind has been removed. Is this a good thing?

If we're measuring time or distance or energy or velocity, then perfect reliability is a great thing, and scientists spend a lot of energy perfecting their instruments to this end. But those are singular physical dimensions that correspond to external reality amazingly (some say unreasonably) well. The point is that, until you get to the quantum level where things get strange, physical quantities can be considered to have no intrinsic variation. Another way to say it is that all variation has to be accounted for in the model. If energy input doesn't equal energy output, then the difference had to go somewhere as heat or something.

It took a long time and a lot of work to come upon exactly the right way to model these dimensions and measure them. You can't just start with any old measurement and imagine that it can be done with perfect reliability:
Just because you provide the same input conditions doesn't mean you're entitled to expect the same outputs. These sorts of relationships in the real world are privileged and hard to find.
Unlike electrons, humans are intrinsically variable. So in order to squeeze out all the variability in human performance, we have to imagine idealizing them, and that automatically sacrifices reality for a convenient theory (like economists assuming that humans are perfectly rational).

When my daughter went off to take her (standardized) history test Monday, here are some of the factors that probably influenced her score: the review pages she looked at last, how much time she spent on each of them, the ones she chose to have me go over with her, how much sleep she got, the quality and quantity of the eggs I scrambled for her breakfast, the social issues and other stresses on her mind (like moving to a new city over the summer), excitement about her study-abroad trip next week, what stuck in her mind from classroom instruction, the quality of the assignments, her time invested in completing said assignments, and the unknowable effects of combinations of all these factors in her short and long-term memory. Add to this the total experience of having lived for almost fifteen years, the books she's read and culture she's been exposed to, the choice of vocabulary we use, etc., and we get a really complex picture, not much like a black box at all.

All of these factors are real, and affect performance on tests and whatever it is that tests are supposed to predict. One would think that educators ought not to just be interested in snapshots of idealized 'ability' but also the variability that comes with it.

Squeezing Reliability

Given that variability is intrinsic to most everything we look at, we can't just say that variability represents a threat to validity. Casinos make their profits based on controlled variation in games that give perfectly assessed unequivocal measurements. No one is going to claim that a dice roll is invalid because of the variability involved.

There is a real zeal in educational assessment to squeeze out all forms of variability in results, however, and the fact that variability is an important part of performance easily gets lost. I wrote about this in "The Philosophical Importance of Beans" a couple of years ago, after a disagreement on the ASSESS email list about whether or not there was a uniquely true assessment of the tastiness of green beans. The conversation there was similar to ones that go on about the use of rubrics, where the idea is that all raters are ideally supposed to agree on their ratings of commonly assessed work. I recommend the book Making the Grades: My Misadventures in the Standardized Testing Industry by Todd Farley for a detailed look at what it means to force such agreement.

What sort of rubric and rating system would be required to ensure perfectly reliable ratings on such things as "how good a movie is" or "whether P is a good public servant" or "the aesthetic value of this piece of art"?

All of these examples have intrinsic variability that cannot be removed without cost to validity. It's just a fact that people disagree about art and culture and performance of practically every kind. And this includes thinking and communications skills, which are commonly assessed in higher education.

We shouldn't assume that a test can be made more reliable without sacrificing validity. The phenomena where one can get nearly perfect reliability and retain all meaning in some sort of predictive model (like physics) are a very special sort of knowledge that was hard won. It's a property of the universe that we don't get to dictate.

Rather than trying to find these special relationships (which frankly may not exist in learning assessment), researchers seem to take it for granted that we are entitled to just assume them. In this happy view of the world, variability is just a nuisance that impedes the Platonic perfection of 'true ability.'

There are practical implications to the idea of "valid variability," as the previous post demonstrated. More on that next time.

Wednesday, June 06, 2012

Bad Reliability

You usually hear that validity implies reliability, but sometimes reliability is bad.  Here's why.

Suppose a medical test exists for a certain disease, but it will only return a true positive result if there is a sufficient amount of antibody in the blood. Because the amount of the antibody changes over time, the test may return false negatives, saying that the subject does not have the disease when in fact they do. So the test does not reliably give the same answer every time. But this is actually an advantage, because there is value in repeating the test. It would be bad if it gave you the same wrong answer reliably.

We can create a similar situation in educational testing. Suppose we want to know if students have mastered foreign language vocabulary, and we have a list of 5,000 common words we want them to know. Suppose we give a test of 100 items. It's possible (but not likely) that a student could know 98% of the material and still get a zero on the test.

If a student knows a fraction p of the 5,000-word list, and the test items are randomly chosen, then the student answers each item correctly with independent probability p. Assuming there are no other kinds of errors, we can find a confidence interval around p using p^, the fraction of items the student gets correct on the test:
[Confidence interval formula, courtesy of Wikipedia]
For a 95% confidence interval and n = 100, we have the error bounds shown below.
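
As a spot check on those bounds (assuming the pictured formula is the usual normal approximation), the half-width of the interval at p^ = 0.5 with n = 100 works out to about ten percentage points:

z \sqrt{ \frac{\hat{p}(1-\hat{p})}{n} }
  = 1.96 \sqrt{ \frac{0.5 \times 0.5}{100} }
  \approx 0.098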


If a student knows half the list, then about 5% of the time their score will be off by more than 10 percentage points. If we are testing a lot of students, this is not great news. We can increase the size of the test, of course. But what if we just gave a retest to any student who fails with a score of 60% or less?

If it's the same test, with perfect reliability, the score shouldn't change, so that doesn't help. What if instead we gave the students a different test of randomly selected items? This reduces test reliability because it's, well, a different test. But it increases the overall chance of eliminating false negatives, which is one measure of validity. So: less reliability, more validity.

Here's a cleverer idea. What if we thought of the single 100-item test as two 50-item tests? Assume that we want to pass only students who know 60% or more of the words, and that we accept a pass from either half of the test. How does that affect the accuracy of the results? I had to make some additional assumptions, in particular that the tested population had a mean ability of p = .7, with s = .1, distributed normally.
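
For anyone who wants to poke at this without the spreadsheet, here is a rough Monte Carlo sketch of the same setup. It's my own code, not the spreadsheet linked below: ability is drawn from Normal(0.7, 0.1), each of the 100 items is an independent pass/fail trial with probability p, "true" mastery means p >= 0.6, and the cutoffs are 55% for the single test and 60% on either half for the dual test. The exact numbers will differ somewhat from the spreadsheet's, but it's a quick way to experiment with the cutoffs.

use strict;
use warnings;

my $PI     = 4 * atan2( 1, 1 );
my $trials = 100_000;

# Normal deviate via the Box-Muller transform.
sub rnorm {
    my ( $mu, $sd ) = @_;
    my ( $u1, $u2 ) = ( rand() || 1e-12, rand() );
    return $mu + $sd * sqrt( -2 * log($u1) ) * cos( 2 * $PI * $u2 );
}

my %correct = ( single => 0, dual => 0 );
my %pass    = ( single => 0, dual => 0 );

for ( 1 .. $trials ) {
    my $p = rnorm( 0.7, 0.1 );              # simulated ability
    $p = 0 if $p < 0;
    $p = 1 if $p > 1;
    my $masters = ( $p >= 0.6 ) ? 1 : 0;    # the "truth" the test should recover

    my ( $half1, $half2 ) = ( 0, 0 );
    $half1 += ( rand() < $p ) ? 1 : 0 for 1 .. 50;
    $half2 += ( rand() < $p ) ? 1 : 0 for 1 .. 50;

    my $single = ( $half1 + $half2 >= 55 )        ? 1 : 0;   # 55% cutoff on one 100-item test
    my $dual   = ( $half1 >= 30 || $half2 >= 30 ) ? 1 : 0;   # 60% on either 50-item half

    $correct{single} += ( $single == $masters ) ? 1 : 0;
    $correct{dual}   += ( $dual == $masters )   ? 1 : 0;
    $pass{single}    += $single;
    $pass{dual}      += $dual;
}

printf "single test:    accuracy %.3f, pass rate %.3f\n",
    $correct{single} / $trials, $pass{single} / $trials;
printf "dual half-test: accuracy %.3f, pass rate %.3f\n",
    $correct{dual} / $trials, $pass{dual} / $trials;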

The results can be displayed visually on a ROC curve, shown below. If you want to download the spreadsheet and play with it, I put it here.

The single test method has a peak accuracy of 92% under these conditions, with an ideal cutoff score of 55%. This is a false positive rate of 34% and a true positive rate of 97%. The dual half test method has an accuracy of 93% with the cutoff score at 60%, which is more natural (since we want students to have mastered 60% of the material). This is a false positive rate of 18% and a true positive rate of 95%. The pass rate for the single test is 87% versus 83% for the dual test with the higher standard. The actual percentage who should pass under the stated conditions is 84%.

All in all, the dual-test method seems superior to the single test method. Given how many long tests students take, why aren't they generally scored this way? I'm certain someone has thought of this before, but for as long as I've been in education, I've never heard of split scoring.