Friday, June 08, 2012

Bad Reliability, Part Two

In the last article, I showed a numerical example of how to increase the accuracy of a test by splitting it in half and judging the sub-scores in combination. I'm sure there's a general theorem that can be derived from that, but I haven't looked for it yet. I find it strange that in my whole career in education, I've never heard of anyone doing this in practice.
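
I won't rehash the previous article's numbers here, but a minimal simulation gives the flavor. One way to read "judging the sub-scores in combination" is to decide only when the two half-test scores agree about the cutoff and to flag disagreements as uncertain. The cutoff, the noise level, and the abstention rule below are all invented for the illustration.

```python
import random

random.seed(1)

N = 100_000      # simulated examinees
CUT = 0.5        # proficiency cutoff on the 0-1 ability scale (assumed)
NOISE = 0.15     # std. dev. of per-half measurement error (assumed)

def half_score(ability):
    """One noisy half-test observation of the underlying ability."""
    return ability + random.gauss(0, NOISE)

errors_avg = errors_joint = flagged = 0
for _ in range(N):
    ability = random.random()          # true ability, uniform on [0, 1]
    proficient = ability > CUT
    h1, h2 = half_score(ability), half_score(ability)

    # Rule 1: judge from the averaged score alone.
    if ((h1 + h2) / 2 > CUT) != proficient:
        errors_avg += 1

    # Rule 2: use the halves in combination -- decide only when they
    # agree about the cutoff; disagreement flags the case as uncertain.
    if (h1 > CUT) != (h2 > CUT):
        flagged += 1
    elif (h1 > CUT) != proficient:
        errors_joint += 1

decided = N - flagged
print(f"error rate, averaged score:  {errors_avg / N:.3f}")
print(f"error rate, halves agreeing: {errors_joint / decided:.3f}")
print(f"flagged as uncertain:        {flagged / N:.3f}")
```

The agree-only rule abstains on a slice of the cases, but among the cases it does decide, the error rate comes out noticeably lower than judging from the averaged score alone.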

I first came across the idea that there is a tension between validity and reliability in Assessment Essentials by Palomba and Banta, page 89:

An […] issue related to the reliability of performance-based assessment deals with the trade-off between reliability and validity. As the performance task increases in complexity and authenticity, which serves to increase validity, the lack of standardization serves to decrease reliability.

So the idea is that reliability is generally good up to the point where it interferes with validity. To analyze that more closely, we have to ask what we mean by validity.

Validity is often misunderstood to be an intrinsic property of a test or other assessment method. But 'validity' just means 'truth', and it really refers to statements made about the results from using the instrument in question. We might want to say "the test results show that students are proficient writers," for example. The problem with this is that we probably don't have a way (independent of the test) to see if the statement is true or not, so we can't actually check the validity. There are all sorts of hacks to try to get around this, like comparisons and statistical deconstructions, but I don't find any of these particularly convincing. See the Wikipedia page for more. Here's a nice image from that site that shows the usual conception of validity and reliability.

In the picture, validity is the same as accuracy, as when shooting a rifle. Reliability would be precision.

This is an over-simplified picture of the situation that usually applies to educational testing, though. The conclusion of the previous article was that the situation depicted at top right is better (at least in some circumstances) than the one at bottom left, because we can discover useful information by repeated testing. There's no point in repeating a perfectly reliable test.
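
To see why in miniature: assuming the usual target diagram (top right meaning unreliable but unbiased, bottom left meaning reliable but biased), here's a toy Python simulation with invented numbers.

```python
import random

random.seed(2)
TRUE_VALUE = 70.0   # the quantity we're trying to measure (invented)

def noisy_unbiased():
    """Top right of the diagram: scattered, but centered on the truth."""
    return TRUE_VALUE + random.gauss(0, 10)

def precise_biased():
    """Bottom left: tightly grouped, but systematically off target."""
    return (TRUE_VALUE - 8) + random.gauss(0, 0.5)

for n in (1, 10, 100, 1000):
    avg_a = sum(noisy_unbiased() for _ in range(n)) / n
    avg_b = sum(precise_biased() for _ in range(n)) / n
    print(f"n={n:4d}  unreliable/unbiased: {avg_a:6.2f}  "
          f"reliable/biased: {avg_b:6.2f}")
```

Averaging repeated measurements pulls the unreliable-but-unbiased instrument toward the true value; no amount of repetition moves the reliable-but-biased one off its offset.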

But there's another kind of 'bad reliability'.

Questions like "Can Tatiana write effectively?" can only be seriously considered if we establish what it would mean for that to be actually true. And because in education, we're supposed to be preparing students for life outside the ivy-covered walls, the answer has to be a real-world answer. It can't simply be correlations and factor analysis results from purely internal data.

The tension between trying to control for variation in order to understand 'ability' and the applicability of the results to situations where such controls are not present is nicely illuminated on the wiki page in this statement:

To get an experimental design you have to control for all interfering variables. That's why you often conduct your experiment in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant) you lose ecological or external validity because you establish an artificial lab setting. On the other hand with observational research you can't control for interfering variables (low internal validity) but you can measure in the natural (ecological) environment, at the place where behavior normally occurs. However, in doing so, you sacrifice internal validity.

Educational testing doesn't seem to concern itself overly much with external (ecological) validity, which is ironic given that the whole purpose of education is external performance. There are some really nice studies like the Berkeley Law study "Identification, Development, and Validation of Predictors for Successful Lawyering" by Shultz and Zedeck (2008), which found little external validity for either grades or standardized test results. It amazes me that our whole K-12 system has been turned into a standardized test apparatus without this sort of external validity check.

All tests are wrong all the time. It's only a matter of degree. To me, the only sensible sorts of questions that can be checked for validity are ones like this: "The probability of X happening is bounded by Y." This involves a minimum amount of theoretical construction in order to talk about probabilities, but avoids the reification fallacy of statements like "Tatiana has good writing ability."

Now the latter sort of statement can avoid a reification fallacy by converting it into a probabilistic assertion: "When others read Tatiana's writing, X proportion will rate it as good." This is now a statement about external reality that can be checked, and with a few assumptions, the results extrapolated into a probabilistic statement, which can itself be validated over time with more checking.
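
To make that concrete, here's a minimal sketch; the 14-of-20 ratings are hypothetical, and the Wilson score interval is just one standard way to put bounds on the proportion.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    spread = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - spread, center + spread

# Hypothetical check: 14 of 20 readers rated a writing sample "good".
lo, hi = wilson_interval(14, 20)
print(f"observed proportion: {14/20:.2f}; 95% interval: ({lo:.2f}, {hi:.2f})")
```

More ratings over time shrink the interval, which is exactly the kind of ongoing validation the probabilistic statement invites.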

Imagine a perfectly reliable test. Given the same inputs, it always produces the same outputs, meaning that all variation of any kind has been removed. Is this a good thing?

If we're measuring time or distance or energy or velocity, then perfect reliability is a great thing, and scientists spend a lot of energy perfecting their instruments to this end. But those are singular physical dimensions that correspond to external reality amazingly (some say unreasonably) well. The point is that, until you get to the quantum level where things get strange, physical quantities can be considered to have no intrinsic variation. Another way to say it is that all variation has to be accounted for in the model. If energy input doesn't equal energy output, then the difference had to go somewhere as heat or something.

It took a long time and a lot of work to come upon exactly the right way to model these dimensions and measure them. You can't just start with any old measurement and imagine that it can be done with perfect reliability:
Providing the same input conditions gives you no right to expect the same outputs. These sorts of relationships in the real world are privileged and hard to find.
Unlike electrons, humans are intrinsically variable. So in order to squeeze out all the variability in human performance, we have to imagine idealizing them, and that automatically sacrifices reality for a convenient theory (like economists assuming that humans are perfectly rational).

When my daughter went off to take her (standardized) history test Monday, here are some of the factors that probably influenced her score: the review pages she looked at last, how much time she spent on each of them, the ones she chose to have me go over with her, how much sleep she got, the quality and quantity of the eggs I scrambled for her breakfast, the social issues and other stresses on her mind (like moving to a new city over the summer), excitement about her study-abroad trip next week, what stuck in her mind from classroom instruction, the quality of the assignments, her time invested in completing said assignments, and the unknowable effects of combinations of all these factors in her short- and long-term memory. Add to this the total experience of having lived for almost fifteen years, the books she's read and culture she's been exposed to, the choice of vocabulary we use, etc., and we get a really complex picture, not much like a black box at all.

All of these factors are real, and they affect performance on tests and on whatever it is that tests are supposed to predict. One would think that educators ought to be interested not just in snapshots of idealized 'ability' but also in the variability that comes with it.

Squeezing Reliability

Given that variability is intrinsic to most everything we look at, we can't just say that variability represents a threat to validity. Casinos make their profits based on controlled variation in games that give perfectly assessed unequivocal measurements. No one is going to claim that a dice roll is invalid because of the variability involved.
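
A few lines of simulation make the point: any single roll is unpredictable, yet the probability model behind it is stable and checkable to whatever precision you like.

```python
import random
from collections import Counter

random.seed(3)

# Each roll varies; the long-run frequencies do not.
for n in (100, 10_000, 1_000_000):
    counts = Counter(random.randint(1, 6) for _ in range(n))
    print(f"n={n:>9,}  observed P(6) = {counts[6] / n:.4f}"
          f"  (model: {1/6:.4f})")
```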

There is a real zeal in educational assessment to squeeze out all forms of variability in results, however, and the fact that variability is an important part of performance easily gets lost. I wrote about this in "The Philosophical Importance of Beans" a couple of years ago, after a disagreement on the ASSESS email list about whether or not there was a uniquely true assessment of the tastiness of green beans. The conversation there was similar to ones that go on about the use of rubrics, where the idea is that all raters are ideally supposed to agree on their ratings of commonly assessed work. I recommend the book Making the Grades: My Misadventures in the Standardized Testing Industry by Todd Farley, for a detailed look at what it means to force such agreement.

What sort of rubric and rating system would be required to ensure perfectly reliable ratings on such things as "how good a movie is" or "whether P is a good public servant" or "the aesthetic value of this piece of art"?

All of these examples have intrinsic variability that cannot be removed without cost to validity. It's just a fact that people disagree about art and culture and performance of practically every kind. And this includes thinking and communications skills, which are commonly assessed in higher education.

We shouldn't assume that a test can be made more reliable without sacrificing validity. The phenomena where one can get nearly perfect reliability and retain all meaning in some sort of predictive model (like physics) are a very special sort of knowledge that was hard won. It's a property of the universe that we don't get to dictate.

Rather than trying to find these special relationships (which frankly may not exist in learning assessment), researchers seem to take it for granted that we are entitled to simply assume them. In this happy view of the world, variability is just a nuisance that impedes the Platonic perfection of 'true ability.'

There are practical implications to the idea of "valid variability," as the previous post demonstrated. More on that next time.
