The students found this hilarious. Normally, a class officer would make sure the board was cleaned, but when I came to class the next day the left and right panels were clear--my bad luck box was still there. I used it to talk about game theory and Pascal's wager, which they found interesting. A weekend passed, and the following Monday the whole board was clean. The students had found a janitor who didn't know any English to erase it. But I digress.
More recently, beans came up in a discussion on the assessment list server about the importance of agreeing on ratings (sometimes called centering or leveling) before assigning rubric-based scores. I was trying to find an example where assessment was clearly subjective:
Suppose we ask 10 people to taste our green beans and rate them as "good" or "not good." Assuming we don't tip the scales somehow, we could legitimately report out to the public that of the 10 people who sampled the beans, 8 said they were good.

A response from Hugh Stoddard:
But if we have the raters center first, the 2 holdouts are going to have to be convinced that the beans really are good, or else we ignore them. Either way, we stamp GOOD on the can and give it a diploma, which may be misleading to 20% of the consumers.
Your green beans would be far more appealing to consumers if the raters discussed what constitutes a "good bean" and whether their judgement should be based on color, shape, size, sweetness, crunchiness, or nutritional value.

And a response to that from Joan Hawthorne:
Now we're getting into food where I feel a great deal of expertise after oh-so-many years of very dedicated eating. I know in general that green beans are nutritious and I'm glad about that. But hypothetical attributes of green bean goodness matter not one whit when it comes to my pleasure (or lack of pleasure) in eating a particular green bean dish. Ditto for chocolate, wine, coffee, whatever. If you're doing a shared rating of goodness for shared discussion, criteria may be helpful. But I don't care if the top chocolate -- according to rater scoring based on criteria definition -- is this one or that one. I care about which one appeals most to ME.

So who is right? I originally assumed that green bean tasting would be so subjective that the idea of using rubrics was clearly nonsense. But it took a while for me to really understand why. Here's my answer in a nutshell: more doesn't equal better, and life isn't really a linear combination.
More is worse. A little salt is good. More is bad. A little crispness is good, more is too much. Suppose Stanislav and Tatiana determine how salty they prefer their green beans. The graph might look like this:
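As a minimal sketch of such single-peaked curves (the peak locations and widths here are invented for illustration), each taster's enjoyment can be modeled as rising toward a personal optimum and falling off on either side:

```python
import numpy as np

# Hypothetical single-peaked "enjoyment vs. saltiness" curves.
# Peak locations and widths are invented for illustration.
salt = np.linspace(0, 10, 1001)  # saltiness, arbitrary units

def enjoyment(s, peak, width):
    """Enjoyment rises toward a personal optimum, then falls off."""
    return np.exp(-((s - peak) / width) ** 2)

stanislav = enjoyment(salt, peak=2.0, width=1.5)  # prefers less salt
tatiana = enjoyment(salt, peak=6.0, width=1.5)    # prefers more salt

# Each taster's optimum is their own peak; the midpoint between the
# peaks (salt = 4) sits in a valley for both tasters, so "splitting
# the difference" satisfies no one.
```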
Clearly, it's no good telling Stanislav that he's a wimp for not being able to tolerate more salt, or claiming that the "correct" value for optimum saltiness is somewhere between the two peaks. Notice that there are two conditions here that we imagine away when we work with rubrics and ratings:
- There may be legitimate differences in opinions and tastes that can't be "centered" away.
- More of an expressed trait isn't necessarily better.
The same is true of correctness in writing: more isn't always better. Consider:
- poetry, where one may bend the rules a little for style or form
- advertising, where space is at a premium
- dialogue or other less formal speech, where correctness can sound stilted.
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

About the only correctness present here is that the words take familiar forms (nouns, verbs, adjectives) despite not making sense. Is this poem fully correct with respect to use of language? Clearly not. Is that bad? No. It would lose its whole meaning if you corrected it to something like:

It was brilliant and the swaying boughs
Did gyrate and gleam in the wind,

That made me ill. We could get out of this by defining special kinds of correctness for each type of writing, but that still leaves the awkward question of how to assign a rating to something that is too much of something, AND it invalidates comparing one type of correctness to the other.
Tastes differ. Not everyone likes to read the same authors. We have different tolerances for style and tone, and even correctness. If I see a single "it's" where the author means "its," I usually stop reading; I take it as an indication of sloppiness. In academic writing, style is tricky, and tastes will vary. Too much informality, and you will not be taken seriously. Too much formality, and few people will slog through what you've written. But it's not necessary that we all agree on these things. Indeed, it's probably impossible.
I think Hugh Stoddard makes a good point when he suggests rating specific aspects of the green beans. You could apply that to saltiness, crispness, and so on. We still won't get agreement, but it would focus the conversation. There is, however, a snag with this project too: combinations aren't necessarily linear.
We see this effect in analyzing student evaluations from classes, where students fill in bubbles on a Likert scale to answer such questions as "Were the goals of the course clearly stated?" Students who rate the instructor low tend to rate everything else low too--the textbook, for example. Similarly, someone with a low tolerance for salt may be turned off by the texture of the beans simply because there was too much salt--one bad trait bleeds into the others.
The same is true in writing. I worked with a colleague who would fail any paper with a single spelling error, regardless of any other merits the paper might have. For him, correctness and style were tightly linked--errors ruin the style. Or imagine a well-reasoned essay about the Battle of Britain, where the student misspells the names of all the important leaders in Britain, Germany, and the US. Is that correctness or content or both?
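That veto rule is, in effect, a minimum rather than a weighted average. A toy comparison (the trait names, scores, and weights are invented for illustration):

```python
# Two ways of combining trait scores on a 0-10 scale.

def linear_combo(scores, weights):
    """Rubric-style score: a weighted average of independent traits."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def veto_combo(scores):
    """The colleague's rule: one bad trait sinks the whole paper."""
    return min(scores)

# A paper that is strong except for correctness (say, one spelling error).
scores = [9, 9, 9, 2]  # content, style, organization, correctness

print(linear_combo(scores, [1, 1, 1, 1]))  # 7.25 -- looks decent
print(veto_combo(scores))                  # 2 -- fails outright
```

The two rules agree when all traits are similar and diverge sharply when one trait is an outlier, which is exactly where rubric arithmetic gets contentious.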
These effects complicate the clean-looking lines of rubrics, and the way people tend to think of them. Rubrics are not precision instruments, but guides. Their most valuable product is not numbers but the ideas that come out of focused conversations. Rubrics are not measurement tools under any reasonable definition of measurement.
The point is not that rubrics are bad or that centering is bad. Rather, the point of this discussion is to consider the complexities of the situation when designing assessments and interpreting results. If we "agree to agree," we lose information, and sometimes that's probably okay. But other times it might be useful to try to understand what the subtleties are. Imagine forcing a diverse audience to watch the newest summer movie and then keeping them in the room until they all rated it the same way.

If we understand the complexities of what we're trying to do, we will report more honestly (more modestly, probably) and more usefully. If you use a writing rubric across different disciplines, blend all the ratings into one average, and then advertise that "we have measured the writing ability of our students to be X," the claim is misleading and not very useful. Unfortunately, enough of that happens that administrators, accreditors, and the Department of Education seem to think that such sentences are meaningful. That's a bad thing.
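As a tiny numeric illustration of how averaging hides this (the ratings are invented), a polarized audience can look identical to a lukewarm consensus:

```python
from statistics import mean

consensus = [3] * 10           # everyone mildly satisfied
polarized = [5] * 5 + [1] * 5  # two camps, like the movie audience

print(mean(consensus))  # 3
print(mean(polarized))  # 3 -- same average, very different stories
```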
Aren't beans interesting?