Thursday, April 30, 2009

Part Eight: If Testing isn't Measurement, What Is It?

Why Assessment is Hard: [Part 1] [Part 2] [Part 3] [Part 4] [Part 5] [Part 6] [Part 7]

Last time I argued that although we use the word "measurement" for educational outcomes in the same way it's used for weighing a bag of coconuts, it really doesn't mean the same thing. It is a kind of deception to substitute one meaning for another without warning the audience. Of course, this happens all the time in advertising, where "no fat" doesn't really mean no fat (such terms are, however, now defined by the FDA). In education, this verbal blurring of meaning has gotten us into trouble.

Maybe it's simply wishful thinking to imagine that we could have the kind of precise identification of progress in a learner that would correspond to the gradations on a measuring cup. Socrates' simile--that education is kindling a flame, not filling a cup--is apt: learning is primarily qualitative (the rearrangement of neurons and subtle changes in brain chemistry, perhaps) and not quantitative (pouring more learning stuff into the brain bucket). As another comparison, the strength of a chess position during a game is somewhat related to the overall number of pieces a player has, but far more important is the arrangement of those pieces.

The subject of quality versus quantity with regard to measurement deserves a whole discussion by itself, with the key question being how one imposes an order on a combinatorial set. I'll have to pass on that today and come back to it another time.

The sleight of hand that allows us to get away with using "measurement" out of context is probably due to the fluidity with which language works. I like to juxtapose two quotes that illustrate the difference between the language of measurement and normal language.
We say that a sentence is factually significant to any given person, if and only if, [she or] he knows how to verify the proposition which it purports to express—that is, if [she or] he knows what observations would lead [her or him], under certain conditions, to accept the proposition as being true, or reject it as being false. – A. J. Ayer, Language, Truth, and Logic

[T]he meaning of a word is its use in the language. – L. Wittgenstein, Philosophical Investigations
The first quote is a tenet of positivism, which has a scientific outlook. The second is more down-to-earth, corresponding to the way words are used in non-technical settings. In Assessing the Elephant I make a big deal of this distinction, between what I call monological and dialogical definitions. I also wrote a blog post about it here.

Words like "force" can have meanings in both domains. Over time, some common meanings get taken over by more scientific versions. What a "second" means becomes more precise every time physicists invent a more accurate clock. The word "measurement" by now has a meaning that's pretty soundly grounded in the positivist camp. That is, if someone says they measured how much oil is dripping from the bottom of the car, this generates certain expectations--a number and a unit, for example. There is an implied link to the physical universe.

But as we saw last time, the use of "measurement" in learning outcomes doesn't mean that. What exactly are we doing, though, when we assign a number to the results of some evidence of learning? It could be a test or portfolio rating, or whatever. If it's not measurement, what is it?

We can abstract our assessment procedures into some kind of statistical goo if we imagine that the test subject has some intrinsic ability to successfully complete the task at hand, but that this ability is perhaps occluded by noise or error of various sorts. Under the right probabilistic assumptions, we can then imagine that we are estimating this parameter--this ability to ace our assessment task. Typically the assessment will itself be a statistical mélange of tasks with different qualities. An English spelling test, for example, could draw on a staggering variety of words. If there are a million words in the language, and each of the ten items on a test can be any one of them, then the number of possible ten-item spelling tests is about
1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 (that is, 10^60).
So the learning outcomes question "can Stanislav spell?" depends heavily on what test we give him, if that's how we are assessing his ability. Perhaps his "true" ability (the parameter mentioned above) is the average score over all possible tests. Obviously that is somewhat impractical, since his pencil would have to move faster than the speed of light to finish within a lifetime. And this is just a simple spelling test. What are the qualitative possibilities for something complex like "effective writing" or "critical thinking"?
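If you want to check the arithmetic, here's a quick Python sketch. The million-word vocabulary and the ten-item test are just the round numbers from above, not a claim about English:

```python
import math

VOCAB = 10**6  # assume a million words in the language
ITEMS = 10     # a ten-item spelling test

# Ordered tests, any word allowed in any blank: a million choices
# for each of ten blanks gives 10^60 possible tests.
ordered = VOCAB ** ITEMS
print(f"ordered tests: 10^{len(str(ordered)) - 1}")

# Even requiring ten distinct words and ignoring their order
# still leaves an astronomical number of tests.
unordered = math.comb(VOCAB, ITEMS)
print(f"unordered tests: roughly 10^{len(str(unordered)) - 1}")
```

Either way you count, the space of possible tests dwarfs anything a test-taker could ever sit through.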

When we assess, we dip a little statistical ruler into a vast ocean of heaving possibilities, changing constantly as our subject's brain adapts to its environment. Even if we could find the "true" parameter we seek, it would be different tomorrow.

All of this is to say that we should be modest about what we suppose we've learned through our assessments. We are severely limited in the number of qualities (such as combinations of testable items) that we can assess. If we do our job really well, we might have a statistically sound snapshot of one moment in time: a probabilistic estimate of our subject's ability to perform on a general kind of assessment.
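Here's a minimal sketch of what that snapshot looks like under the crudest possible model--every test item an independent pass/fail trial governed by one latent ability parameter. The function, the 0.80 ability, and the ten-item test are all hypothetical illustrations, not anyone's real instrument:

```python
import random

def assess(true_ability: float, n_items: int, seed: int = 0):
    """Simulate one test sitting and estimate the latent ability parameter."""
    rng = random.Random(seed)
    # Each item is an independent pass/fail trial with P(pass) = true_ability.
    score = sum(rng.random() < true_ability for _ in range(n_items))
    p_hat = score / n_items
    # Normal-approximation 95% interval around the point estimate.
    se = (p_hat * (1 - p_hat) / n_items) ** 0.5
    return p_hat, (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Pretend Stanislav's unknowable "true" ability is 0.80
# and give him a ten-item test.
p_hat, (lo, hi) = assess(true_ability=0.80, n_items=10, seed=42)
print(f"estimate: {p_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Even granting all the model's generous assumptions, ten items leave an interval wide enough to drive a truck through--and that's before any of the qualitative objections above.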

If we stick to that approach--a modest probabilistic one--we can claim to be in positivist territory. But the results should be reported as such, in appropriately technical language. What actually happens is that a leap is made over the divide between Ayer and Wittgenstein, and we hear things like "The seniors were measured at 3.4 on critical thinking, whereas the freshmen were at 3.1, so let's break out the bubbly." In reality, the numbers are statistical parameter estimates of unknown quality that may or may not have anything to do with what people on the street would call critical thinking.
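To see how little a 0.3 gap might mean, here's a hedged simulation (all the numbers--the rubric weights, the thirty ratings per group--are hypothetical): two groups drawn from the identical score distribution, so any gap between them is pure noise.

```python
import random

rng = random.Random(7)

def rubric_scores(n):
    """Draw n ratings on a 1-4 rubric; both groups share this distribution."""
    return rng.choices([1, 2, 3, 4], weights=[5, 10, 40, 45], k=n)

def mean(xs):
    return sum(xs) / len(xs)

# With no real difference between the populations, how often does a
# gap of 0.3 or more appear between two samples of 30 ratings each?
trials = 10_000
big_gaps = sum(
    abs(mean(rubric_scores(30)) - mean(rubric_scores(30))) >= 0.3
    for _ in range(trials)
)
print(f"gap of 0.3+ in {big_gaps / trials:.0%} of trials")
```

A noise-only gap that size turns up in a sizable fraction of trials, so the bubbly should wait until the seniors-versus-freshmen comparison comes with at least a standard error attached.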

Note that I've only attempted to describe assessment-as-measurement in this installment. There are plenty of types of assessment that do not claim to be measurement, and so don't have to live up to the unrealistic expectations inherent in that claim. But there are plenty of outcomes assessments that do claim to be measurements, and they get used in policy as if they really were positivist-style tick marks on a competency ruler. Administrators at the highest levels probably do not have the patience to work through the limits of testing for themselves, and may take the marketing of "education measurement" at face value.

In summary, "measurement" belongs in positivist territory, and most educational outcomes assessments don't live up to that definition. Exacerbating the situation, "critical thinking" and "effective writing" don't live in positivist land--they are common expressions whose meanings are understood (with a large degree of fuzziness) by the population at large. Co-opting those words borrows from the Wittgenstein world for basic meaning and then assigns supposedly precise (Ayer) measurements. This is a rich topic, and I've glossed over some of the complexities. My answer to the question in the title is this: educational assessment is statistical parameter estimation, but how the estimated parameter corresponds to the physical world is uncertain, and it should be interpreted with great caution, especially when used to make predictions about general abilities.

1 comment:

  1. I am not sure about the future of assessment as we know it now. For example, in the UK, project-based learning is gradually replacing traditional modes of assessment in higher education as well as further education.
