Wednesday, November 16, 2011

A Perilous Tail

 A certain kind of intellectual honesty seems to be critical to systems that want to survive. Even without the subtleties I discussed earlier, it's obvious that a system that ignores reality can only survive as long as the environment is completely benign. By coincidence, I came across Daniel Kahneman's Thinking, Fast and Slow, which catalogs a number of ways in which we humans can fool ourselves. One instance of this is particularly relevant to training, management, and education. It occurs in any rating of performance that involves some element of luck.
Dr. Kahneman describes an episode with military training instructors, where he was talking about studies that show positive reinforcement is the key to better learning. This point of view was flatly contradicted by his audience, who claimed the following (my description):
When a cadet does a bad job at something, I yell at him. Usually he does better the next time. If I praise him for doing a good job, his performance almost always declines the next time!
The author describes this as an aha! moment for him. The resolution of this paradox is cleverly explained in the book. Here's my version.

As we have seen with SAT scores, the predictive validity of even well-researched tests can be poor (65% correct classification in the case of the SAT benchmark). The remaining variance may as well be chalked up to chance unless we have more information to bring to bear. In addition to measurement error, there can be chance involved in the performance itself. That is, many unpredictable environmental variables may come to bear on the outcome. Baseball games, for example, have a large amount of luck injected into the outcome, so that it's only over a large number of games that relative performance actually reveals itself (see Moneyball by Michael Lewis).
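If it seems surprising that a test can be well-researched and still misclassify a third of students, a small simulation makes it plausible. This is a hypothetical sketch, not the SAT study itself: I assume a score is true ability plus independent noise, with an assumed score-ability correlation of about 0.45, and classify students as above or below the median.

```python
import math
import random

random.seed(1)

r = 0.45        # assumed correlation between test score and true ability
n = 200_000
correct = 0
for _ in range(n):
    ability = random.gauss(0, 1)
    noise = random.gauss(0, 1)
    # Construct a score that correlates with ability at level r
    score = r * ability + math.sqrt(1 - r * r) * noise
    # Does the test put the student on the correct side of the median?
    if (ability > 0) == (score > 0):
        correct += 1

print(round(correct / n, 2))   # about 0.65
```

A correlation near 0.45 is in the range typically reported for college admissions tests, and it yields almost exactly the 65% classification rate quoted above.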

When luck is involved, something called regression to the mean happens--exceptional events are usually followed by unexceptional ones. To make this clear, you can do the following experiment to mimic the training instructors' experience. You need some dice.

Roll two dice and add. Consider higher sums as a good performance, and lower sums as poor. Feel free to strongly admonish the dice when they roll low numbers like 2 or 3, and lavish praise on them when they roll 11s and 12s. You'll find that the stern words work wonders--the rolls almost always improve afterwards! On the other hand, the praise is counter-productive since 11s and 12s are usually followed by lower rolls.

We can imagine most events happening in the 'fat part' of a bell curve, and we are generally ill-equipped to encounter events far out on the tails, which by definition are very rare. It's not just the paradox described above. Nassim Taleb wrote a whole book about this called The Black Swan. Other, more speculative thinkers have imagined thus:
If you assembled all the humans who ever have or ever will live into a distribution according to when they were born, the curve would likely look like some kind of hump with tails on both sides. If you chose one human at random, he or she would likely come from the fat part of the curve. Therefore, that's where we likely are at this moment in time--that is, we would expect to be typical rather than exceptional. If this is true, then it puts probabilistic bounds on our expectations for the duration of human civilization. The math varies, depending on your assumptions, but something like a few thousand years would be a reasonable upper bound using this method.
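For the curious, here is one crude version of that arithmetic. Every number below is an assumption chosen for illustration: roughly 100 billion humans born to date, a 95% confidence cutoff, and today's birth rate held constant. Assuming instead that the birth rate keeps growing compresses the bound toward a few thousand years.

```python
# All inputs are rough, illustrative assumptions.
born_so_far = 100e9        # humans born to date (assumed)
confidence = 0.95          # assume we are not among the earliest 5%

total_at_most = born_so_far / (1 - confidence)  # 2 trillion ever born
future_births = total_at_most - born_so_far     # 1.9 trillion to come

births_per_year = 140e6    # roughly the current rate (assumed constant)
years_left = future_births / births_per_year
print(round(years_left))   # on the order of ten thousand years
```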
In business, there's an idea called Six Sigma that is supposed to reduce process errors to an infinitesimal fraction (six sigma means six standard deviations from the mean; the conventional Six Sigma target, which allows a 1.5-sigma drift, works out to about 3.4 defects per million attempts, and a literal six sigma would be about one in a billion). Yesterday someone suggested to me that we might use Six Sigma in higher education. I laughed, not because there aren't useful ideas there (similar to institutional effectiveness), but because the inherent fuzziness of our core business--changing brains--is so fraught with unknowns. I think one sigma is about as good as we're likely to do. Although we're not well prepared to deal with it, we live in the tail of the distribution.
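Those tail percentages are easy to check directly from the Normal distribution; here's a sketch using only the standard library's complementary error function.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard Normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(upper_tail(1.0))   # ~0.159: one sigma leaves about 16% in the tail
print(upper_tail(4.5))   # ~3.4e-6: the conventional Six Sigma figure
                         #          (6 sigma minus the customary 1.5 shift)
print(upper_tail(6.0))   # ~1e-9:   a literal six sigma
```

The gap between one sigma (one error in six) and six sigma (one in a billion) is nine orders of magnitude--a useful sense of scale for the joke.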

What percentage of "value-added" indices are due to random chance, do you suppose? This statistical method of computing the theoretical filling of the learning vessel has been institutionalized to reward or punish teachers and schools. As a mathematician and assessment professional, I find it hair-raising to read the pat descriptions from the link above, like:
Q: How does value-added assessment sort out the teachers' contributions from the students' contributions?

A: Because individual students rather than cohorts are traced over time, each student serves as his or her own "baseline" or control, which removes virtually all of the influence of the unvarying characteristics of the student, such as race or socioeconomic factors.
Test scores are projected for students and then compared to the scores they actually achieve at the end of the school year. Classroom scores that equal or exceed projected values suggest that instruction was highly effective. Conversely, scores that are mostly below projections suggest that the instruction was ineffective.
Taking another page from Dr. Kahneman's book, this is an instance of solving a simple problem that superficially resembles the actual problem, because the original problem is too hard. It's easy to imagine that the distribution is tight and the tails insignificant, that we control and understand all the elements of chance that might contribute to a computed value-added parameter. Unfortunately, the direct link to reality doesn't reveal itself easily, and so there is no immediate feedback that would correct the problem by making it obviously wrong to observers. This is a case where we should be assiduously honest with our reasoning and doubts. Suppose the SAT's accuracy is representative, and the underlying achievement tests classify students correctly no more than 65% of the time. What fraction of the value-added score is simply random?
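To make that question concrete, here is a toy model--emphatically not the actual value-added methodology--in which each student's true gain has standard deviation 1 and each test score carries measurement noise with standard deviation 1.5 (both numbers assumed for illustration). The naive value-added estimate is just the post-test minus the pre-test.

```python
import random

random.seed(7)

n = 100_000
noise_sd = 1.5    # assumed test noise (sd), vs. a true-gain sd of 1
estimates = []
for _ in range(n):
    gain = random.gauss(0, 1)                # student's true gain
    pre_noise = random.gauss(0, noise_sd)    # noise in the fall test
    post_noise = random.gauss(0, noise_sd)   # noise in the spring test
    estimates.append(gain + post_noise - pre_noise)  # naive "value added"

# Share of the estimate's variance that is pure measurement noise:
mean = sum(estimates) / n
var_est = sum((e - mean) ** 2 for e in estimates) / n
noise_share = 2 * noise_sd**2 / var_est
print(round(noise_share, 2))   # about 0.82 in this toy model
```

Averaging over a classroom shrinks the noise term, but only the part that is independent from student to student; the unknown unknowns shared across a classroom don't average away.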

The general problem is caused by the unknown unknowns that plague complex observations. There is an elegant way out of this mess that doesn't involve advanced math or huge sample sizes. Moreover, it solves the most important problem in higher education. How's that for a cliff-hanger?
