Tuesday, August 25, 2009

Zog's Lemma: Assessment and the Amplification of Error

The concept of outcomes assessment is like a Swiss Army spatula: it's used in all kinds of ways. As opposed to the proverbial Russian Army hardware, ubiquitous on the Internets.
To some, outcomes assessment is a touchy-feely endeavor of encouraging the practitioners of higher education to do the right thing, close the right loop, bring a glowing smile to the visiting team. All that. I'm comfortable with that.

To others, it's a more serious matter, more scientific in approach, rather like making tick-marks on the door frame on a child's birthday to signify evidence of growth. We know that is serious business: with shoes or without, and where exactly is the top of the head? Does hair count or not? It grows too, after all.

The scientific approach requires belief in theory: that assessments are valid for the purposes we employ them for. In reality we usually have no way to know that from any solid physical theory. The belief has to be defended by whatever statistics can be summoned to make a case. (In truth, beliefs don't need to be defended at all; they only need to be believed.) What constitutes this validity? I'd like to address that question sideways. The discussion of what constitutes validity is a groove cut deeply into the literature of psychometrics, and I'd rather ask a more important question: of what use is it to believe in the validity of an assessment?

It's easy to make hay from the fact that the definition of validity itself isn't settled, but this is unfair, I think. It's a difficult philosophical nut to crack. For my purposes, predictive validity is the most important aspect of assessment. If an assessment doesn't tell us anything about what's going to happen in the future, I don't see much use in it. The Wikipedia article on predictive validity contains an interesting observation that is the crux of the value of an assessment:
[T]he utility (that is the benefit obtained by making decisions using the test) provided by a test with a correlation of .35 can be quite substantial.
Utility is a concept from economics that weights outcomes which would otherwise be numerically indistinguishable. A classic example is that $1000 means more to a poor person than it does to a rich person, even though it buys no more for the former; its relative worth to the penniless is simply greater. The point is that even imperfect predictive assessments are useful.

An example of this is the college admissions process. Let's suppose that with the data gathered on the application, the institution can estimate the probability of "success" of a student. Success could be retention to the second year, or GPA > 2.5, or graduation, or whatever you deem important. The statistics can be generated with a logistic regression, which will yield a model with a certain amount of predictive power. No model is perfect, and even in retrospect (feeding the original data back into the model) it will not correctly classify applicants as "predict success" or "predict failure" with 100% accuracy. There will be some proportion of false positives and false negatives. This can be visualized with a Receiver Operating Characteristic (ROC) curve; you can see a bunch of them on Google Images. The usefulness of the ROC curve is that it lets you visually explore where to set the threshold for decision-making. Set it too high and you get too many false negatives (rejecting too many qualified candidates); too low and you admit too many false positives.
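To make that concrete, here is a minimal sketch in Python using entirely made-up applicant data and scikit-learn; the variable names, coefficients, and the definition of "success" are all hypothetical, not a real admissions model.

```python
# A sketch of the threshold trade-off on synthetic data.  Everything here
# (variable names, coefficients, sample size) is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
n = 5000

# Fake predictors: a standardized test score and a high school GPA index.
test_score = rng.normal(0, 1, n)
hs_gpa = rng.normal(0, 1, n)

# "Success" (retention, GPA > 2.5, whatever) depends weakly on both,
# plus a lot of noise -- an imperfect predictor by construction.
success = (0.6 * test_score + 0.5 * hs_gpa + rng.normal(0, 1.5, n) > 0).astype(int)

X = np.column_stack([test_score, hs_gpa])
p_hat = LogisticRegression().fit(X, success).predict_proba(X)[:, 1]

# The ROC curve is just this trade-off evaluated at every possible threshold.
fpr, tpr, thresholds = roc_curve(success, p_hat)
print("area under ROC curve:", round(roc_auc_score(success, p_hat), 3))

# Raising the threshold rejects more students who would have succeeded
# (false negatives); lowering it admits more who will not (false positives).
for t in (0.3, 0.5, 0.7):
    admit = p_hat >= t
    fp = np.mean(admit & (success == 0))    # admitted but unsuccessful
    fn = np.mean(~admit & (success == 1))   # rejected but would have succeeded
    print(f"threshold {t}: false positives {fp:.1%}, false negatives {fn:.1%} of pool")
```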

For an institution, such a tool is obviously useful to believe in: one can work backwards from the desired size of the entering class to see where the threshold should be set. You could even estimate the number of false positives. Any power to discriminate between successful and unsuccessful students is better than none. Of course there are other factors, such as ability to pay, that make this more complicated. To keep things simple, I won't consider these distractions further.
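A back-of-the-envelope version of that backwards calculation, continuing the made-up example above (and ignoring yield, aid, and everything else that complicates real enrollment math):

```python
# Continuing the hypothetical example: pick the threshold implied by a target
# class size, then estimate the expected false positives among the admits.
target_class_size = 1000

ranked = np.argsort(p_hat)[::-1]          # applicants ranked by predicted success
admits = ranked[:target_class_size]
implied_threshold = p_hat[admits[-1]]     # the cutoff this class size implies

# If the model is roughly calibrated, each admit contributes (1 - p) expected
# failures, so the expected count of false positives is the sum of (1 - p).
expected_false_positives = (1 - p_hat[admits]).sum()
print("implied threshold:", round(implied_threshold, 2))
print("expected false positives in the class:", round(expected_false_positives))
```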

One can imagine a utopia springing from this arrangement, where the assessments continually get better and the applicants learn to distinguish themselves by giving signals that the assessments can recognize. The first doesn't seem to be happening, and the second has an unfortunate twist. An analogy from biology serves us well here.

The story of the peacock, according to evolutionary biologists, is that a showy mating display is worth the biological cost of making all those pretty feathers because of the payoff in reproduction. The similarity is that when mates choose each other they have limited information to go on--an assessment we could characterize with a ROC curve if we had all the facts. This produces a distortion that favors any apparent advantage in a potential mate, or in the case of the peacock, creates over time a completely artificial means of assessment. The peacock is saying "look, I'm so healthy I can drag around all this useless plumage and still escape predators."

So, if the value of partial assessments is clear from the institutional vantage, it's a different picture altogether from the applicant's point of view. It's interesting that InsideHigherEd has an article this morning on this very topic. According to the article, in response to increasing competitiveness at "top institutions":
[H]igh school students could respond to the pressure by taking more rigorous courses and studying more -- or they could focus their attentions on gaming the system and trying to impress.
A study by John Bound, Brad Hershbein, and Bridget Terry Long shows that while this perhaps motivates students to take a more rigorous curriculum, it also prompts them to spend more time in test preparation, or in games like trying to engineer more time to take the test. Peacock plumage? It's hard to say without seeing actual success rates. It could be that having the willingness to spend all that extra effort is itself a noncognitive predictor of success (that is, not related to the scores themselves, but to the personality traits of the applicant). As Rich Karlgaard put it in Forbes magazine, a degree from an elite institution is valuable because:
The degree simply puts an official stamp on the fact that the student was intelligent, hardworking and competitive enough to get into Harvard or Yale in the first place.
What is clear is that those with the means to game the system are better off than those without them. So we see things like test prep for kindergartners at $450/hr in the upper crust of society, but probably not so much in housing projects. This likely produces more false positives among the select group: it's as simple as money buying better access. [Update: see this InsideHigherEd article for some dramatic numbers to that effect. Year over year, SAT scores increased 8-9 points for $200,000+ families, and 0/1 points for the poorest group.]

The economic demand for false positives has become an industry under No Child Left Behind, and that doesn't seem to be changing. The New York Times has published letters from teachers on this topic. Some quotes:
  • [T]he use of test data for purposes of evaluating and compensating teachers will work against the education of the most vulnerable children. It is a mistake to conceptualize education as a “Race to the Top” (as federal grants to schools are titled) — for children or schools. (Julie Diamond)
  • Linking teacher evaluations to faulty standardized tests ignores the socioeconomic impact on a nation that is both rich and poor. Can a teacher confronting the poverty of some children in Bedford-Stuyvesant be made to compete with a teacher instructing affluent children in Scarsdale? (Maurice R. Berube)
  • My job went from teaching children to teaching test preparation in very little time. Many of our nation’s teachers have left their profession because the focus on testing leaves little room for passion, creativity or intellect. (Darcy Hicks)
  • Standardized tests are, by their nature, predictable. Most administrators and teachers, fearing failure and loss of position and/or bonuses, de-emphasize or delete those parts of the curriculum least likely to be tested. The students sense this and neglect serious studying because they know that they will be prepped for the big exams. (Martin Rudolph)
The economics of false positives is clear: test prep is an industry, and teachers, administrators, and schools are rated by how many they generate. Of course the object is not to create false positives; that's just the result of so much emphasis on an imperfect assessment.

Less attention is paid to false negatives. Research points to noncognitive traits like grit, planning for the future, and self-assessment as being important to actual success, but these are not directly accounted for in standardized assessments. To be sure, college admissions officers look at extra-curricular activities to try to add value to SAT, GPA, and curriculum, but I think it's safe to say that the cognitive assessments are primary. How badly do we underestimate actual performance?

That question comes up when we evaluate or create a predictive model for applicants. The resulting predicted (first-year) GPA can be used for admissions decisions and financial aid awards. In the Noel-Levitz leveraging schema, applicants are sorted into "low-ability" to "high-ability" bins for individual attention. The question of false negatives is the same as the question "how accurate is the description 'low-ability,' based on the predictors?" Not very accurate, as it turns out.

Every time I ran the statistics I got the same answer: at the very bottom end of our admit pool--those Presidential and provisional admits who were supposed to have the hardest time--about half of them performed well (I usually use GPA > 2.5 for that distinction). Since we reject students below that line, we should assume that about half of those students just below the cutoff would have done as well too. If this is typical, there are a LOT of false negatives.
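As a hedged illustration of that kind of check (the data below are simulated, not my institution's, and the "index" is just a stand-in for whatever composite a leveraging model uses):

```python
# Hypothetical illustration of the false-negative check: predict first-year
# GPA from a pre-college index, bin applicants by the prediction, and ask how
# often the "low-ability" bin actually clears GPA > 2.5.
import numpy as np

rng = np.random.default_rng(1)
n = 3000
index = rng.normal(0, 1, n)                               # stand-in for an SAT/HS GPA composite
actual_gpa = 2.7 + 0.3 * index + rng.normal(0, 0.6, n)    # weak true relationship
predicted_gpa = 2.7 + 0.3 * index                         # what the model would predict

edges = np.quantile(predicted_gpa, [0.2, 0.4, 0.6, 0.8])
quintile = np.digitize(predicted_gpa, edges)              # 0 = "low-ability" bin
for q in range(5):
    rate = np.mean(actual_gpa[quintile == q] > 2.5)
    print(f"predicted-ability quintile {q + 1}: {rate:.0%} actually earn GPA > 2.5")
```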

Remember, this line of thought applies to all outcomes assessments to one degree or another. Let's take a concrete example. Suppose we want to assess the vocabulary knowledge of German language students with an exam. Memorizing a single word (including declensions for nouns and conjugations for verbs) is low-complexity. The total complexity is that of one word times, let's say, 10,000 total items, less any compressibility of the data. All told, this is a lot of complexity (measured in bits) for a human. So it's a suitable subject for assessing, and the low complexity per item means that each item can be assessed with confidence.

Supposing that we do not have the resources to test our learners on all 10,000 items of vocabulary, we'll have to sample randomly and hope that the ratios are representative (or else spend a lot of time checking correlations and such). Maybe our assessment covers only 100 items, chosen at random from the 10,000. If Tatiana actually knows only K of the items, then there is a K/10,000 chance that she knows any individual word, so she will theoretically score K/10,000 on average (on a test of any size). Given this arrangement, what is the trade-off between false positives and false negatives?

We will assume we can use the normal distribution to approximate these Bernoulli trials. This only requires that K not be too large or too small--in those cases the chance of an error diminishes anyway. The mean proportion correct is p = K/10,000, and the standard error of the observed proportion on a 100-item test is SQRT(p(1-p)/100). One standard error is about 3% for large and small values of p, rising to about 5% for values in the middle. These errors would shrink for a test with more than 100 items and grow for a smaller one.
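A quick simulation to check those numbers (K = 7,000 below is just an example):

```python
# Check the sampling claim: a student who knows K of 10,000 words,
# tested on 100 randomly chosen items.
import numpy as np

rng = np.random.default_rng(2)
K, pool, n_items, n_trials = 7000, 10_000, 100, 10_000
knows = np.zeros(pool, dtype=bool)
knows[:K] = True                          # she knows the first K words

scores = np.array([knows[rng.choice(pool, n_items, replace=False)].mean()
                   for _ in range(n_trials)])

p = K / pool
print("mean score:", round(scores.mean(), 3), "vs. theory", p)
print("std. dev.: ", round(scores.std(), 3),
      "vs. SQRT(p(1-p)/100) =", round(np.sqrt(p * (1 - p) / n_items), 3))
```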

The conclusion is that if we set our cutoff for passing in a usual spot, say 70%, classification errors will be common for test-takers whose true knowledge lies in the range 65-75%. For example, if Tatiana actually knows 70% of the vocabulary items, she has only about a 50% chance of passing the test, because the distribution of her scores over all possible tests is (almost) symmetrical and centered on 70%.
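Concretely, here is the binomial calculation (using scipy; the knowledge levels are just examples):

```python
# Chance of passing a 100-item test with a 70% cutoff, as a function of the
# share of the vocabulary the student actually knows (binomial model).
from scipy.stats import binom

n_items, cutoff = 100, 70
for true_p in (0.60, 0.65, 0.70, 0.75, 0.80):
    p_pass = binom.sf(cutoff - 1, n_items, true_p)    # P(at least 70 items correct)
    print(f"knows {true_p:.0%} of the words -> passes about {p_pass:.0%} of the time")
```

Running this shows the pass probability climbing from well under a quarter at 65% knowledge to well over three quarters at 75%, with roughly even odds right at the cutoff.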

The actual number of false positives and negatives depends on where the skill ranges of the test-takers lie in relation to the cutoff value. The more of them there are close to the cutoff, the more errors there will be. I actually witnessed a multiple-choice placement test being graded one time, and inquired about the cutoff, only to find that the passing grade was actually less than the average score expected by chance! This certainly reduces false negatives, but I'm not sure about the overall result.

The vocabulary example is a best case, where the items themselves are of low complexity and (we hope) not subject to a lot of other kinds of error. When the predictive validity is actually quite low (as in the example where we explain a small percentage of the variance in the outcome), the proportion of errors in both directions is far worse. What does this mean, then, when we "believe" in a test like the SAT and adopt it as a de facto industry standard? Even together with high school GPA, these scores can typically explain only about 33% of the variance in first-year college grades.
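To put a rough number on "far worse," here is a hedged simulation: a single predictor that explains about a third of the outcome variance, with "predict success" and actual success both defined as being in the top half. The cutoffs and sample size are arbitrary.

```python
# Rough simulation: a predictor explaining ~33% of outcome variance
# (correlation ~0.57).  Classify "predicted success" and actual success by a
# median cutoff and count the errors.  All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
r = np.sqrt(0.33)                                  # correlation giving R^2 = 0.33
predictor = rng.normal(0, 1, n)
outcome = r * predictor + np.sqrt(1 - r**2) * rng.normal(0, 1, n)

predict_success = predictor > 0                    # top half on the predictor
actual_success = outcome > 0                       # top half on the outcome

fp = np.mean(predict_success & ~actual_success)    # predicted to succeed, didn't
fn = np.mean(~predict_success & actual_success)    # predicted to fail, succeeded
print(f"false positives: {fp:.0%} of the pool")
print(f"false negatives: {fn:.0%} of the pool")
```

Under those assumptions, something like 15% of the pool lands in each error category, which is to say roughly three in ten students get misclassified by a simple median split.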

First, it is of benefit to the institution to be able to imperfectly sort students by desirability. It is costly if the institution has to bid for the apparent best applicants with institutional aid, and there is incentive to look at noncognitives and better predictors, but I can't see a lot of progress in that direction. So the false positives get bid up with the rest. Meanwhile, the false negatives--those applicants who would succeed but don't show it on the predictor--get passed over for admission or merit aid.

A tentative conclusion is that a weak predictor gets amplified in a competitive environment. This is probably true in lots of domains, like the evolutionary biology example of the peacock. Probably some economist has his/her name attached to it: Zog's Lemma or something. I'll have to ask around. (I just googled it--apparently there is no Zog's Lemma.)

For the assessment types, there are two lessons. First, there are going to be errors in classification. Second, those errors get amplified as the assessment gains importance. This is an argument against high-stakes assessments, I suppose. I've always gotten good results by keeping assessments free from political pressures like instructor or program review. "Accountability" creates problems as it tries to solve them. Acknowledging that would be a fine thing.

PS, if you're interested in other kinds of amplifiers in science, read this.

UPDATE: see Amplification Amplification
