Tuesday, November 22, 2011

Tests and Dialogues

In "The End of Preparation" I argued that standardized tests, as they exist now, are not well suited to the task of correctly classifying the quality of the partial products we call students. Certainly the tests give us information beyond mere guessing, but the accuracy (judging from the SAT) is not high enough to support a factory-like production model. I pointed out that test makers do not usually even attempt to ascertain what the accuracy rate is. Instead we get validity reports that rely on a variety of associations. If we brought that idea back to the factory line, it would look something like this.
Announcing the new Auto Wiper Assessment (AWA). It is designed to test the ability of an auto to wipe water off the windshield. Its validity has been determined by high negative correlations with crash rates of autos on rainy days and low correlation on sunny days. 
On a real assembly line, the question would be as simple as "Does it work now?" and "Are the parts reliable enough to keep it working?" Both of these can be tested with high precision. And of course, we can throw water on the windshield to directly observe whether the apparatus functions as intended. Direct observation of the functional structure of learning is not possible without brain scanners. Even then, we wouldn't really know what we are looking at--the science isn't there yet. What we do know is fascinating, like the London taxi driver study, but we're a long way from understanding brains the way we understand windshield wipers.

Validity becomes a chicken-and-egg problem. Suppose our actual outcome is "critical thinking and complex reasoning," to pick one from Academically Adrift. There are tests that supposedly tell us how capable students are at this, but how do we know how good the tests are? If there were already a really good way to check, we wouldn't need the test! In practice, the test-makers get away with waving their hands and pointing to correlations and factor analyses, like the Auto Wiper Assessment example above. This is obviously not a substitute for actually knowing, and it's impossible to calculate the accuracy rate from the kinds of validity studies currently done. The SAT, as I mentioned, is an exception, because it does try to predict something measurable: college grades.

This is not a great situation. How do we know if the test makers are selling flim-flam? In practice, I think tests have to "look good enough" to pass casual inspection, and they can amount to neo-phrenology without anyone ever knowing. How else can the vast amount of money being spent on standardized tests be explained? I'd be happy to be wrong if someone can point me to validity studies that show test classification error rates similar to the SAT's. A ROC graph would be nice.
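To make the ROC idea concrete, here is a minimal sketch of treating a test as a binary classifier of "proficient" vs. "not proficient," assuming we had an independent ground-truth judgment for each student. The scores, labels, and the proficiency framing are all invented for illustration.

```python
# Hypothetical sketch: compute a ROC curve for a test treated as a
# classifier of "proficient" (1) vs. "not proficient" (0). The scores and
# ground-truth labels are invented; a real study would need an independent
# judgment of each student's actual ability. Assumes no tied scores.

def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per threshold,
    sweeping the cutoff from the highest score down to the lowest."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve: 1.0 is a perfect classifier,
    0.5 is no better than guessing."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented test scores and ground-truth proficiency labels.
scores = [85, 72, 90, 60, 75, 55, 80, 65]
truth = [1, 1, 1, 0, 0, 0, 1, 0]
print(auc(roc_points(scores, truth)))
```

A validity report built this way would say directly how often the test misclassifies, which is exactly the number the usual correlational studies leave out.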

The argument might be that since reductionist definitions are not practical, and there really is no way to know whether a test works except through indirect indications like correlations, this is the best we can do. But it isn't. In order to support that claim, let me develop the idea by contrasting two sorts of epistemology. It's essential to the argument and also worth the exposition for its own sake. When I first encountered these ideas, they changed the way I see the world.

Monological Knowing
Sometimes we know something via a simple causal mechanism: an inarguable definition. For example, when the home plate umpire calls a strike in a baseball game, that's what it is. It doesn't matter if the replay on television shows that the pitch was actually out of the strike zone. Any argument about that will be in a different space--perhaps a meta-discussion about the nature of how such calls should be made. But within the game, as veteran umpire Bill Klem is quoted as saying, "It ain't nothin' till I call it!"

Monological definitions are generally associated with some obvious sign. An umpire jerking his clenched fist after a pitch means it was a strike. Sometimes the definitions come down to chance, as with a jury trial. In the legal system, you are guilty if the jury finds you guilty, which has only indirectly to do with whether or not you committed a crime. The unequivocal sign of your guilt is a verdict from the jury. Other examples include:
  • Course grades, which define the 'A' student, 'B' student, etc.
  • Time on the clock at a basketball or football game, which corresponds only roughly to shared perception of time passing (perceived time doesn't stop during a time-out, but monological time can).
  • Pernicious examples of classifying a person's race, e.g. leading up to the Rwandan genocide. You are what it says you are on your documents.
Sometimes the assignments are random or arbitrary. Sometimes a single person gets to decide the classification, as with course grades. There is sometimes pressure from administrators to create easily understood algorithms for computing grades in order to handle grade appeals, but instructors usually have wide latitude in assigning what amounts to the monological achievement level of the student.

I got bumped from a flight one time, and came away from the gate with the knowledge that I was "confirmed" on the next flight. That didn't mean what I thought it did, however. According to the airline's (monological) definition, "confirmed" means that the airline knows you are in the airport waiting, so you're a sure seat if they have an extra. It does not mean that such a seat is guaranteed for you.

Dialogical Knowing
This might be more properly called polyphonic, but for the sake of parallelism, allow me the indulgence. In contrast to a monological handing down of definitions from some source, dialogical knowledge has these characteristics:
  • It comes from multiple sources
  • There isn't universal agreement about it (definitions, if they exist, are not binding)
  • It's subjective
Whereas there is a master copy of what a kilogram is in a controlled chamber in France, there is no such thing for the concept of "heavy." A load you are carrying will feel heavier after an hour than at the beginning of the hour. Furthermore, we can disagree about the heaviness. This is messy and imperfect, but very flexible, because no definitions are needed. Anyone can create a dialogical concept, and it gets to compete with all the others in an ecology where the most fit survive. This competition is what keeps loose shared understanding from devolving too far into nonsense as a whole. There's plenty of nonsense (like fortune-telling), but we can communicate in a shared language very effectively even in the absence of formal definitions.

If I tell you that I liked the movie Kung Fu Panda, you know what I mean. There are movies you like too, and you probably assume I feel about this movie the way you feel about those, in some vague sense. You may disagree, but that's not a barrier to understanding. We could have a complex conversation about what constitutes a "good" movie, which doesn't have a final, monological answer. In Assessing the Elephant I compared this to the parable of the blind men inspecting an elephant, each sharing their own perspective. I used this as a metaphor for assessing general education outcomes, which are generally broad and hard to define monologically.

Tension between Monologue and Dialogue
Parallel to the tension between accountability and improvement in outcomes assessment, there is a tension between monological and dialogical knowledge in any system. The demand for locked-down monological approaches is the natural consequence of being part of a system, which as I described last time, needs to manage fuzziness and uncertainty in order to function. That's why we have monological definitions for what it means to be an adult, or "legally drunk." It makes systematization possible. Much of the time, this entails replacing a hard dialogical question ("what is an adult?") by a simple monological definition ("anyone 21 years or older"). In ordinary conversation we may switch these meanings without noticing, but sometimes the tension is obvious.

The question "which candidate will do the best job in office?" gets answered by "which candidate got the most votes?" This replaces an intractable question with one that can be answered systematically in a reasonable amount of time. Of course it's an approximation of unknown validity. Monologically, the system decides on the "best" candidate, but the dialogical split on the issue can be 49% vs. 51%.

Someone put together a page describing the relationship between monological Starbucks definitions of drink sizes and the shared understanding of small, medium, large. The site, which you can find here, is a perfect foil for this discussion. I find it hysterically funny. Here's a bit of it:
The first problem is that Starbucks is right, in a sense. I've established that asking for a "small coffee" gets you the 12-ounce size; "medium" or "medium-sized" gets you 16 ounces; and "large" gets you a 20 ounce cup. However, in absolute rather than relative terms, this is nuts. A "cup" is technically 8 ounces, and in the case of coffee, a nominal "cup" seems to be 6 ounces, as indicated by the calibrations on the water reservoirs of coffee makers, [...]
When a referee makes a bad call in a sports event, the crowd reacts negatively. The dialogical "fact" doesn't agree with the monological one, which is seen as artificial and not reflecting the reality of shared experience.

It may be appalling, but it makes sense that the Oxford English Dictionary now includes the word "nucular" as a variant of "nuclear." This is the embodiment of a philosophy that the dictionary should reflect the dialogical use of language, not some monological official version.

In assessment, it's quite natural to fall victim to the tension between these two kinds of knowledge. As noted, tests of learning almost never come with warning labels that say This test gives the wrong answer 35% of the time. The test doesn't have any other monological ways of knowing to compete with, other than possibly other similar tests, so by default the test becomes the monological definition of the learning outcome. Because it replaces a hard question ("how well can our students think?") with an easily systematized one ("what was the test score?"), it's attractive to anyone who has to watch dials and turn knobs in the system. In the classroom, however, the test may or may not have anything to do with the shared dialogical knowledge--that messy, subjective, imperfect consensus about how well students are really performing.

A Proposal to Bridge the Gap
Until we better understand how brains work, it's not realistic to hope that a physiology-based monological definition of learning will emerge to compete with testing. However, it would be very interesting to see how well tests align with the shared conception of expert observers. This doesn't seem to be a standard part of validity testing in education, and I'm not sure why. It's in everyone's best interests to align the two.

There is a brilliant history of this kind of research in psychology, culminating in the definition of the Big Five personality traits, which you can read about here. From Wikipedia, here is the kernel of the idea:
Sir Francis Galton was the first scientist to recognize what is now known as the Lexical Hypothesis. This is the idea that the most salient and socially relevant personality differences in people’s lives will eventually become encoded into language. The hypothesis further suggests that by sampling language, it is possible to derive a comprehensive taxonomy of human personality traits.
Subjective assessments have a bad reputation in education, but the lexical hypothesis was shown to be workable in practice. It's not astounding that dialogical language has meaning, but it doesn't seem fashionable to admit it.

Given all this, it's obvious that we should at least try to understand the resemblance between monological tests of "critical thinking and complex reasoning" or "effective writing" and the dialogical equivalent. It's simple and inexpensive to do this if one already has test results. All that's required is to ask people who have had the opportunity to observe students what they think. However it turns out, the results will be interesting.
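As a sketch of what such an alignment check might look like, suppose we had both a test score and a faculty survey rating for each student (all data below is invented). Spearman's rank correlation is a reasonable choice because survey ratings are ordinal, not interval.

```python
# Hypothetical sketch: rank correlation between test scores and faculty
# survey ratings for the same students. All data is invented.

def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented data: standardized test scores and 1-4 faculty ratings.
test_scores = [480, 620, 550, 700, 510, 660]
faculty_ratings = [2, 3, 2, 4, 1, 4]
print(spearman(test_scores, faculty_ratings))
```

A correlation near 1 would mean the test and the observers largely agree on the ordering of students; a correlation near zero would mean the test is measuring something the observers don't see, or vice versa.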

Suppose the test results align very well with dialogical perceptions. That's great--we can use either tests or subjective surveys as we prefer.

If the two don't align, then we have to ask who's more likely to be correct. In this case the tests lose out because of a simple fact: test scores don't matter in the real world. What does matter are the subjective impressions of those who employ our graduates or otherwise interact with them professionally. In the world beyond the academy, it's common shared perceptions that are the metric of success, and it won't do any good to point to your test scores. In fact, there is a certain schadenfreude in disproving credentials, as in watching videos of graduates from Ivy U who don't know why the seasons change. It isn't just Missouri: we're a show-me society.

You'll notice that either way, the test results are largely unneeded. This illuminates why they are being used: self-reported dialogical assessments depend on trust, whereas tests can, in theory, be administered in an adversarial environment. This restates Peter Ewell's quote from my previous article, and it is a recipe for optimizing irrelevance. In Assessing the Elephant, I called this a degenerate assessment loop and gave this example:
A software developer found that there were too many bugs in its products, so it began a new system of rewards. Programmers would be paid a bonus for every software bug they identified and fixed. The number of bugs found skyrocketed. The champagne was quickly put back on ice, however, when the company realized that the new policy had motivated programmers to create more bugs so that they could “find” them.
Similar "degenerate" strategies find their way into educational practices because of the economic value placed on monological simplifications used in low-trust settings. We read about them in the paper sometimes.

Surveying the Dialogical Landscape
I have implemented surveys at two institutions to gather faculty ratings of student learning outcomes. I have many thousands of data points, but no standardized test scores to compare them to, so I can't check the alignment as I described above. The reliability of these ratings is about a 50% probability of an exact match on a four-point scale for the same student, same semester, and same learning outcome, with different instructors. I've already written extensively about that, for example here and on this blog, as well as some chapters in assessment books, which you can find on my vita.
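For concreteness, the exact-match reliability figure above is computed like this; the paired ratings here are invented, not my actual survey data:

```python
# Hypothetical sketch of the exact-match reliability computation: the
# fraction of cases where two instructors independently give the same
# student the same rating on a four-point scale. Ratings are invented.

def exact_match_rate(rater_a, rater_b):
    """Proportion of paired ratings that agree exactly."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

rater_a = [4, 3, 2, 4, 1, 3, 2, 2]
rater_b = [4, 2, 2, 4, 1, 4, 3, 2]
print(exact_match_rate(rater_a, rater_b))  # 5 of 8 pairs agree
```

Note that exact match is a conservative statistic: on a four-point scale, raters who disagree by one point still share a great deal of dialogical understanding, which this number doesn't credit.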

In a tight system, monological approaches are useful. The human body is a good example of this, but we should note that at least two important systems are more dialogical than monological: the immune system and the conscious mind. The world beyond graduation resembles a competitive ecology more like what the immune system faces than a systematic by-the-numbers existence like a toenail.

The only reason to use monological tests is if we don't trust faculty. This can't even be done with any intellectual honesty because we can't say that the tests are any good. What I proposed in "The End of Preparation" is that we move to dialogical methods of assessment throughout and beyond the academy. These can still be summarized for administrators to look at, but only if there is trust at all levels. And really, if there is no trust between faculty and administration, the whole enterprise is doomed.

The mechanism of using public portfolios showing student records of performance can be purely dialogical--a student's work can have different value to different observers inside and outside the academy.

Next time I'll address what all this has to do with rubric-based assessment.

[Next article in this series: "Assessments, Signals, and Relevance"]

Some Frivolous Thoughts
As I said, this dichotomy changed the way I think about the world, and I find interesting tidbits everywhere. One interesting idea is the hypothesis that as a domain of interest becomes more reliably theoretical (like alchemy becoming chemistry), the nomenclature transitions from descriptive and dialogical to arbitrary and monological. I went poking through several dictionaries looking for evidence in the names of the elements, to find examples. Copper may be an instance, having been named for Cyprus, as in "Cyprian metal." If the name is too old, the etymology is foggy. Steel is more recent, and it seems to derive from a descriptive Germanic word for stiff. Compare that to Plutonium, which is modern and non-descriptive. Of course, with arbitrary naming, the namer can choose to be descriptive, as Radium arguably is. This thesis needs some work.

In biology, Red-Winged Blackbird is a descriptive name for the monological Agelaius phoeniceus. In a good theory, it doesn't matter what you call something. What matters is the relationships between elements, like the evolutionary links between bird species as laid out in a cladistic family tree. Modern scientists are more or less free to name new species or sub-atomic particles whatever they want. Organic chemistry is an interesting exception, because the names themselves are associated with composition. They are simultaneously descriptive and monological.

Drug names are particularly interesting. Viagra, for example, has a chemical name that describes it, but that obviously wouldn't do for advertising purposes. Here's what one source says about the naming process:
Drug companies use several criteria in selecting a brand name. First and foremost, the name must be easy to remember. Ideally, it should be one physicians will like -- short and with a subliminal connotation of the drug. Some companies associate their drugs with certain letters (e.g., Upjohn with X and Glaxo with Z). If the drug is expected to be used eventually on a nonprescription basis, the name should not sound medicinal. There must be no trademark incompatibilities, and the company must take account of the drug's expected competition.
It sounds like the name is chosen to fit neatly into a dialogical ecology.

The history of the SAT's name is interesting from this perspective, but I will bring this overlong article to a close.

Acknowledgments: The idea for the monological/dialogical dichotomy came out of conversations with Dr. Adelheid Eubanks about her research on Mikhail Bakhtin. I undoubtedly have mangled Bakhtin's original ideas, and neither he nor Adelheid should be held responsible for that.
