## Monday, September 05, 2011

### The Economics of Imperfect Tests

It's fascinating to me how attracted people are to rankings: colleges, sports teams, best cities to live in, most beautiful people, and so on can seemingly be put in an order. Of course, it's ridiculous if you stop to ask if the quality in question could really be as simple as a one-dimensional scalar that can be measured with such precision. But it doesn't stop new lists from being generated. Rank order statistics (e.g. saying that Denver is number one and Charlotte is number two on the list) come with their own sort of confidence intervals, so that we really should be saying "City C is rank R plus or minus E, with probability P." Computing these confidence bounds is not easy, and I've never seen it done on one of these lists.

Leaving aside the issue of bounding error, the generation of the numbers themselves is highly questionable. Often, as with US News rankings of colleges, a bunch of statistics are microwaved together in Frankenstein's covered dish to create the master ranking number. You can read the FAQ on the US News rankings here. It seems that consumers of these reports are in such a hurry, or have such limited attention spans, that we can only consider one comparative index. That is, we can't simultaneously consider graduation rate in one column and net cost in another to make compromise decisions. Rather, all the variables have assigned weights (by experts, of course), and everything is cooked down into a mush.

A more substantive example is the use of SAT scores for making decisions about admissions to college (in combination with other factors). In conversations in higher ed circles, SATs are sometimes used as a proxy for the academic potential of a student. It's inarguable that although there is some slight predictive validity for, say, first year college grades, tests like these aren't very good as absolute indicators of ability. And so it would seem on the surface of it that the tests are over-valued in the market place. I've argued that this is an opportunity for colleges that want to investigate non-cognitive indicators, or other alternative ways of better valuing student potential.

But the question I've entertained for the last week is what is the economic effect of an imperfect test. I imagine some economist has dealt with this somewhere, but here's a simple approach.

How do we make decisions with imperfect information? We can apply the answer to any number of situations, including college admissions or learning outcomes assessment, but let me take a simpler application as an analogue. Suppose we operate in a market where there is a proportion $g$ of good money and a proportion $b := 1-g$ of counterfeit, or bad, money. We also have a test that can imperfectly distinguish between the two. I've sketched a diagram to show how well the test works below.
A perfect test would avoid two types of errors--false negatives and false positives. These may happen with different rates. Write $t$ for the fraction of good coins that the test correctly reports as good, and $f$ for the fraction of bad coins that it correctly reports as bad, so that $1-t$ is the false negative rate and $1-f$ is the false positive rate. Suppose an agent we call the Source goes to the market to exchange coins for goods or services with another agent, the Receiver. Assume that the Receiver only accepts coins that test "good." It's interesting to see what happens.

The fraction of good coins that the Source apparently gives the Receiver is $gt+b(1-f)$.
The fraction of good coins actually received by the Receiver is $gt$.

So the Source will obtain more goods and services in return than are warranted, an excess of $b(1-f)$. The inefficiency can be expressed as the ratio $\frac{b(1-f)}{gt+b(1-f)}$, which is also the conditional probability Pr[coin is bad | test = "good"], that is, the chance that a coin accepted as good is a false positive. (Pr means probability, and the vertical bar is read "given.")
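The bookkeeping above is easy to check numerically. Here is a minimal sketch, reading $t$ as the rate at which good coins test good and $1-f$ as the false positive rate, as the formulas imply. The function name and the sample values ($g=0.9$, $t=0.95$, $f=0.8$) are made up for illustration.

```python
def exchange_summary(g, t, f):
    """Summarize one exchange when the Receiver keeps only coins that test good.

    g: proportion of good coins
    t: fraction of good coins that test good
    f: fraction of bad coins that test bad (so 1 - f is the false positive rate)
    """
    b = 1 - g                      # proportion of bad coins
    actually_good = g * t          # good coins that pass the test
    excess = b * (1 - f)           # bad coins that pass the test
    accepted = actually_good + excess
    # Share of accepted coins that are really bad; zero if nothing is accepted.
    inefficiency = excess / accepted if accepted > 0 else 0.0
    return {
        "accepted": accepted,
        "actually_good": actually_good,
        "excess": excess,
        "inefficiency": inefficiency,
    }


# Hypothetical numbers, chosen only to show the calculation.
for name, value in exchange_summary(g=0.9, t=0.95, f=0.8).items():
    print(f"{name}: {value:.4f}")
```

With a perfect test ($t=f=1$) the excess and the inefficiency both drop to zero, which is a quick sanity check on the formulas.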

There are two factors in $b(1-f)$: the fraction of bad coins and the false positive rate. So the Source has an incentive to increase both. Increasing the fraction of bad coins is easy to understand. Increasing $1-f$ means trying to fool the test. Students who take SAT prep tests are manipulating this fraction, for example.
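As a rough check that the inefficiency ratio moves the same way as the excess, here is a short derivation. The shorthand $p = 1-f$ (the false positive rate) and $R$ (the ratio) is introduced only for this note, and it assumes $g = 1-b$ with $t$ held fixed.

$$R = \frac{bp}{(1-b)t + bp}, \qquad \frac{\partial R}{\partial b} = \frac{pt}{\big((1-b)t + bp\big)^2}, \qquad \frac{\partial R}{\partial p} = \frac{b(1-b)t}{\big((1-b)t + bp\big)^2}$$

Both derivatives are non-negative, and strictly positive when $b$, $t$, and $p$ all lie strictly between 0 and 1. So under these assumptions, adding bad coins or fooling the test raises the share of accepted coins that are bad, not just the raw excess.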

So we have mathematical proof that it's better to give than to receive!
In a real market, it might work the other direction too. The Receiver might try to fool the Source with a fraction of worthless goods or services in return. In that case, the best test wins, which should lead to an evolutionary advantage to good tests. In fact, we do see that in real markets, at least for easily measured quantities like weight.

In many cases, the test only goes one direction. When you buy a house, for example, there's no issue about counterfeiting money since it all comes from the bank anyway. The only issue is whether your assessment of the value of the house is good. The seller (the Source) has economic incentive to fool your tests by hiding defects and exaggerating the value of upgrades, for example. It's interesting to me how poor the tests for real estate value seem to be.

In terms of college admissions or learning outcomes assessment, we don't hear much about testing inefficiency. The effects are readily apparent, however. One example is the idea of "teaching to the test" that crops up in conversations about standardized testing. If teachers receive economic or other benefit from delivering students who score above certain thresholds on a standardized test, then they are the Source, and the school system (or public) is the Receiver. It's somewhat nebulous what actual quality the tests are testing for, because I can't find any discussion of the inherent $b(1-f)$ inefficiency that must accompany any less-than-perfect test. For teachers, the supply of "currency" (their students) is fixed, and they don't have any incentive to keep back the "good" currency for themselves. It's a little different from the market scenario described above, but we can easily make the switch. The teachers are motivated to increase the number of tested positives, whether these are true or not. They would also find false negatives galling, since they would see these as cheating them out of goods they have delivered. Unlike the Source in the exchange market, they want to increase both $gt$ and $b(1-f)$, not just the latter. They are presumed to have the ability to transmute badly-performing students into academic achievers (shifting the ratio to a higher $g$), and they can also try to fool the test by decreasing $f$. It is generally assumed that the ethical solution is to do the former.
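The two strategies can be compared with a back-of-the-envelope calculation. This is a sketch under simplifying assumptions: $P$ is shorthand introduced here for the reported pass rate, $t$ is read as the rate at which "good" students test good and $1-f$ as the false positive rate, and $g$ and $f$ are treated as if they could be adjusted independently.

$$P = gt + (1-g)(1-f), \qquad \frac{\partial P}{\partial g} = t + f - 1, \qquad \frac{\partial P}{\partial (1-f)} = 1 - g = b$$

Under these assumptions, the payoff from real teaching (raising $g$) is proportional to $t+f-1$, which is close to zero when the test is weak. The payoff from fooling the test (raising $1-f$) is proportional to $b$, the share of students who have not mastered the material. The worse the test and the larger $b$, the more attractive the second strategy looks.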

A good case study in this instance is the New York City 2009 Regents exam results, as described in this Wall Street Journal article. The charge is made that the teachers manipulated test results to get higher "true" rates, and the evidence given clearly indicates this possibility. The graphs show that students are somehow getting pushed over the gap from not acceptable to acceptable, which is analogous to receiving a "good" result on the test. One of these is reproduced below.

Quoting from the article:
Mr. Rockoff, who reviewed the Regents data, said, "It looks like teachers are pushing kids over the edge. They are very reluctant to fail a kid who needs just one or two points to pass."
This could be construed as either teachers trying to fix false negatives or trying to create false positives. The judgments come down as if it were the latter. The argument that this effect is due to fiddling with $1-f$ instead of increasing $g$ is bolstered by this:
Mr. Rockoff points to the eighth-grade math scores in New York City for 2009, which aren't graded by the students' own teachers. There is no similar clustering at the break point for passing the test.
I find this very interesting because it is assumed that this is the normal situation--that if there were no $1-f$ "test spoofing" going on, then we should see a smooth curve with no bump. The implication is that teachers don't have a good understanding of how students will test--that is, despite the incentive to increase the skill levels of students so that they will convert from $b$ to $g$ (and have a better chance of testing "good"), they just don't know how to do it.
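The "smooth curve versus bump" contrast can be illustrated with a toy simulation. Everything in this sketch is an assumption made for illustration: the pass mark of 65, the normal score distribution, the 80% chance that a grader lifts a near-miss, and the function names. It is not a reconstruction of the Regents data or of any official spike/cluster analysis.

```python
import random
from collections import Counter

PASS_MARK = 65  # hypothetical passing score


def simulate_scores(n, bump_prob, seed=0):
    """Draw n integer scores. With probability bump_prob, a score that is
    1 or 2 points short of PASS_MARK is lifted to PASS_MARK."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        raw = round(rng.gauss(65, 12))   # assumed score distribution
        raw = max(0, min(100, raw))      # clamp to the 0-100 scale
        if PASS_MARK - 2 <= raw < PASS_MARK and rng.random() < bump_prob:
            raw = PASS_MARK
        scores.append(raw)
    return Counter(scores)


def cluster_ratio(counts):
    """Number of scores exactly at the pass mark, divided by the number
    of scores 1 or 2 points below it."""
    below = counts[PASS_MARK - 1] + counts[PASS_MARK - 2]
    return counts[PASS_MARK] / max(below, 1)


honest = simulate_scores(20000, bump_prob=0.0)
bumped = simulate_scores(20000, bump_prob=0.8)
print("no bumping:  ", round(cluster_ratio(honest), 2))
print("with bumping:", round(cluster_ratio(bumped), 2))
```

Without bumping, the count at the pass mark is roughly half the count in the two bins just below it, as a smooth curve would suggest. With bumping, the ratio jumps well above one, which is the kind of clustering the article describes.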

Consider an analogous situation on an assembly line for bags of grain. Your job is to make sure that each bag has no less than a kilo of grain, and you have a scale at your disposal to test. Your strategy would probably be to just top up the bags so they have a kilo of grain, and then move on to the next one. Mr. Rockoff (and probably most of us) assumes that this is not possible for educators. It's an admission that teaching and testing are not very well connected. Otherwise, we would expect to see teachers "topping off" the educational level of students to pass the test and then moving on to other students, to maximize the $gt$ part of their payoff. This quote shows that the school system administrators don't even believe this is possible:
After the audit, the state said it took a series of actions and plans to conduct annual "spike/cluster analysis of scores to identify schools with suspicious results."
It's ironic that on the one hand, this latent assumption questions the value of the tests themselves, and at the same time the system is built around their use. Other language in the article includes such expressions of certitude as this:
Michelle Costa, a high-school math teacher in New York City, said she often hears from friends who teach at other schools who [bump up scores on] tests, though she doesn't do it. "They are really doing the student a disservice since the student has so obviously not mastered the material," she said.
Missing the mark by a couple of points is equated to "obviously not mastering the material." There is no discussion about the inherent inefficiencies in the test, although there seems to be a review process that allows for some modification of scores (called scrubbing in the article).

This situation is set up so that the most sensible approach for teachers is to try to affect $f$. Teaching students how to take standardized tests is one way of doing that. This critique of one of the tests makes fascinating reading. Here's a short quote:
Do not attempt to write an original essay. You don't have time. Points are awarded and subtracted on the basis of a formula. Write the five-paragraph essay, even though you will never again have a personal or professional occasion to use this format. It requires no comprehension of the text you are citing, and you can feel smart for having wasted no time reflecting on the literature selections.
We don't usually know if the tests are meaningful. If we did, we would know the ratio $g:b$ both before and after the educational process, and we would be able to tell what the efficiency of the test was. This is essentially a question of test validity, which seems to get short shrift. Maybe it seems like a technicality, and maybe the consumers of the tests don't really understand the problem, but it's essential. Imagine buying a test for counterfeit currency without some known standard against which to judge it!

In education, the gold standard is predictive validity: we don't really care whether or not Tatiana can multiply single digit numbers on a test because that's not going to happen in real life. We care about whether she can use multiplication in some real world situation like calculating taxes, and that she be able to actually do that when it's needed, not in some artificial testing environment. If we identified these outcomes, we could ascertain the efficiency of the test. The College Board publishes reports of this nature, relating test scores to first year college grades, for example. From these reports we can see that the efficiency of the test is quite low, and it's a good guess that most academic standardized tests are equally poor.

Yet the default assumption is often that the test is 100% efficient, that is $f=t=1$: we always perfectly distinguish $g$ from $b$.

The perspective from the commercial test makers is enlightening. If the teachers, faced with little way of relating teaching to testing, and no hope of relating that to (probably hypothetical) predicted real outcomes, choose to modify $1-f$ as a strategy, what is the likely motivation of the test makers?

The efficiency of the test is certainly a selling point. Given the usual vagueness about real predictable outcomes, test makers can perhaps sell the idea that their products really are 100% efficient (no false positives or negatives). An informed consumer would have a clear goal as to the real outcomes that are to be predicted, and demand a ROC curve to show how effective the test is. In some situations, like K-12 testing, we have the confused situation of teachers not having a direct relationship to the tests, which have no accountability to predict real outcomes. It's similar to trying to be the "best city to live in," by optimizing the formula that produces the rankings.
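For readers who haven't met one, a ROC curve plots the true positive rate ($t$ in the notation here) against the false positive rate ($1-f$) as the passing threshold moves. Below is a minimal sketch of how the points are computed. It assumes we somehow know the real outcome for each case, which is exactly what is usually missing. The scores and outcomes are made up, and `roc_points` is a hypothetical helper, not a library function.

```python
def roc_points(scores, is_good):
    """Return (false_positive_rate, true_positive_rate) pairs, one per
    candidate threshold, where a case tests "good" if score >= threshold.
    Assumes at least one good and one bad case."""
    n_good = sum(is_good)
    n_bad = len(is_good) - n_good
    points = [(0.0, 0.0)]  # threshold above every score: nothing passes
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, g in zip(scores, is_good) if g and s >= threshold)
        fp = sum(1 for s, g in zip(scores, is_good) if not g and s >= threshold)
        points.append((fp / n_bad, tp / n_good))
    return points


# Made-up test scores and made-up "real outcomes" for six cases.
scores = [90, 80, 70, 60, 50, 40]
is_good = [True, True, False, True, False, False]

for fpr, tpr in roc_points(scores, is_good):
    print(f"1-f = {fpr:.2f}, t = {tpr:.2f}")
```

A perfectly efficient test would pass through the corner where $t=1$ and $1-f=0$. A test no better than a coin flip hugs the diagonal where $t = 1-f$.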

Since there is an assumed (but ironically unacknowledged) gap between teaching and testing, even the testing companies have no real incentive to try to improve teaching, that is, to improve the $g:b$ ratio. It's far easier for them to sell the means to fool their own tests. By teaching students how to optimize their time and deal with the complexities of the test itself, which very likely have nothing to do with either the material or any real predictable outcomes, test makers can sell to schools the means to increase the number of reported positives. They are selling ways to raise $t$ and $1-f$ without affecting $g$ at all. Of course, this is an economic advantage and doesn't come for free. Quoting again from the critique of one test:

[S]everal private, for-profit companies have already developed Regents-specific test-preparation courses for students who can afford their fees [...]

It's as if instead of trying to distinguish counterfeit coins from good ones, we all engage in trying to fool the test that imperfectly distinguishes between the two. That way we can pretend that there's more good money in circulation than there actually is.