Because the intent seems to be to tap into crowd-sourcing for assessment (you assess some of my students, I assess some of yours, for example), I wonder whether the group has considered using Amazon's Mechanical Turk service as a cost-effective way of getting ratings from "the public."
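To make the suggestion a bit more concrete, here is a rough sketch (my illustration, not anything from the TLT group's work) of what posting an essay-rating task to Mechanical Turk could look like using Amazon's boto3 SDK. The essay URL, rubric wording, reward, and number of raters are all hypothetical placeholders.

```python
# A minimal sketch of posting a rating task to Mechanical Turk with boto3.
# The essay link, rubric text, reward, and rater count are hypothetical.
import boto3

# Use the requester sandbox so no real money changes hands while testing.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An HTMLQuestion wraps ordinary HTML (here, a 1-5 rating form) in MTurk's XML
# envelope. A production form would also include MTurk's small script that
# fills in the assignmentId field before submission.
question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html><body>
      <form name="mturk_form" method="post"
            action="https://workersandbox.mturk.com/mturk/externalSubmit">
        <input type="hidden" name="assignmentId" value="" id="assignmentId"/>
        <p>Read the student essay at the link below and rate it against the rubric.</p>
        <p><a href="https://example.edu/essays/123">Essay 123</a></p>
        <label>Overall rating (1 = weak, 5 = strong):
          <input type="number" name="rating" min="1" max="5"/>
        </label>
        <input type="submit"/>
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

# Ask three independent workers to rate the same essay for a few cents each.
hit = mturk.create_hit(
    Title="Rate a short student essay against a rubric",
    Description="Read one short essay and give it a 1-5 rating.",
    Keywords="rating, rubric, education",
    Reward="0.10",
    MaxAssignments=3,
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=question_xml,
)
print("HIT created:", hit["HIT"]["HITId"])
```

The interesting knob is MaxAssignments: paying several workers to rate the same artifact is what turns a cheap, noisy channel into something resembling crowd-sourced feedback.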
The blog post has some interesting analysis of results to date. There are a couple of points I'd like to comment on, but I only have time for one now: the question "how much inter-rater reliability is enough?" The graph shows it to be between 70 and 80 percent over a five-year period. The author, Nils Peterson, writes that
This inter-rater reliability is borderline and problematic because, when extrapolated to high stakes testing, or even grades, this marginal agreement speaks disconcertingly to the coherence (or lack thereof) of the program.

To me, the numbers actually sound really good. I don't know whether anyone has measured the inter-rater reliability of "normal" grading, for, say, a history midterm, but I doubt that it's any better. As complexity increases, reliability ought to decline in general. For example, minor changes in the way an assignment is written can change the scores, and so can minor changes to the rubric. These variations point to an inherent fuzziness that comes from compressing complex things down to simple ones. For an example of lossy compression, see this fascinating article at StackOverflow on sending images via Twitter:
(image by Quasimondo, licensed under a Creative Commons Attribution-Noncommercial license).
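The compression analogy can be made concrete with a toy sketch (my own, with made-up cutoffs): a fine-grained 0-100 judgment mapped onto a four-level rubric collapses quite different performances into the same level, and that lost detail is exactly where raters have room to disagree.

```python
# A toy illustration of rubric scoring as lossy compression: many distinct
# performances collapse to the same level, and the original detail cannot
# be recovered from the rubric score alone. Cutoffs are hypothetical.
def rubric_level(score_0_to_100: float) -> int:
    """Map a fine-grained score onto a hypothetical 4-level rubric."""
    cutoffs = [60, 75, 90]  # assumed level boundaries
    return 1 + sum(score_0_to_100 >= c for c in cutoffs)

for raw in (61, 68, 74):                  # quite different performances...
    print(raw, "->", rubric_level(raw))   # ...all compress to level 2
```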
How much reliability is required to be useful? That one's easy: anything better than random can be useful (electronic stock-trading programs, for example, try to ferret out very small correlations and make money on them). But if we have to stand up and swear that we believe in the results for a particular student, that's a different question. So the answer really turns on what we are using the data for. This is an interesting dilemma for administrations that try to make active use of assessments of any kind, and of course the unfortunate Assessment Director is caught in the crossfire. My contention is that too much faith is too often put in the assessments, and damage can easily result. One should not believe too much in one's convenient fictions. More on this topic later.
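In the meantime, to put "better than random" in perspective, here is a small worked example (made-up scores, not data from the project) comparing raw percent agreement with Cohen's kappa, which subtracts out the agreement two raters would reach by chance alone.

```python
# Made-up scores from two raters on a 4-point rubric.
# Raw percent agreement vs. Cohen's kappa, which corrects for chance agreement.
from collections import Counter

rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3, 3, 2]  # hypothetical scores
rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2, 3, 2]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: probability both raters pick the same score independently,
# estimated from each rater's own score distribution.
pa, pb = Counter(rater_a), Counter(rater_b)
expected = sum((pa[s] / n) * (pb[s] / n) for s in set(rater_a) | set(rater_b))

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.2f}")   # 0.75
print(f"chance agreement   = {expected:.2f}")   # about 0.31
print(f"Cohen's kappa      = {kappa:.2f}")      # about 0.64
```

On this made-up data, 75 percent raw agreement corresponds to a kappa of about 0.64: comfortably better than chance, but not the kind of certainty one would want before treating a single student's score as definitive.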
Dave
Thanks for an interesting analysis. One tiny correction, which probably means the blog post was unclearly worded: the presentation was made at the request of the TLT Group, in a series they were doing on rubrics. The work is original with the Center for Teaching Learning and Technology at Washington State University. ~Nils
Thanks! Fixed it.
David
I replied in a post and attempted to track back here, but it does not seem to have worked.
In any event, your post and my reply are the subject of our Wednesday Morning Reading Group, 8:30 Pacific. We'd love it if you could join us. Here is the announcement that just came out:

Hello MRGers-
The reading for this Wednesday, August 12, will be a post
http://highered.blogspot.com/2009/07/tlts-harvesting-feedback-project.html
from David Eubanks's 'Higher Ed (assessing the elephant)' blog, where he responds to some of the Harvesting Gradebook work Nils, Theron, and Jayme presented at the July 28 TLT webinar on the creative uses of rubrics.
Eubanks suggests that a cost-effective method for getting ratings from community members might use a mechanism like Amazon's Mechanical Turk. Nils ran with the suggestion, exploring Mechanical Turk and then responding in a blog post
http://communitylearning.wordpress.com/2009/08/03/crowd-sourcing-feedback/
comparing it to other crowd-sourced feedback systems.
Should make for a lively discussion. Hope you can make it. If you can't join us physically, please consider joining us via Zorap: http://www.zorap.com/JaymeJacobson
As always, feel free to post suggestions and comments in the blog space at the MRG site:
http://morningreadinggroup.ning.com
See you Wednesday!
Jayme