TLT's Harvesting Feedback Project

I encourage you to take a look at the ratings system demo from a presentation made at the request of the TLT Group in a series they were doing on rubrics. The work originates with the Center for Teaching Learning and Technology at Washington State University. For background, see the Community Based Learning blog. The design is very attractive, and it doesn't take long to work through the rubrics to rate a sample assignment and student work. Or you can skip to the results if you want to see the kind of assessment summaries produced. It's a fascinating project, and to me the most interesting design element is one not actually highlighted here, viz. that the plan is to be able to rate any kind of work anywhere on the Internet. The era of "enclosed garden" portfolio systems may (thankfully) be drawing to an end.

Because the intent seems to be to tap into crowd-sourcing for assessment (you assess some of my students, I assess some of yours, for example), I wonder if the group has considered using Amazon's Mechanical Turk service as a cost-effective way of getting ratings from "the public."

The blog post has some interesting analysis of results to date. There are a couple of points I'd like to comment on, but I only have time for one now. This is the question "how much inter-rater reliability is enough?" The graph shows inter-rater reliability between 70 and 80 percent over a five-year period. The author, Nils Peterson, writes that
This inter-rater reliability is borderline and problematic because, when extrapolated to high stakes testing, or even grades, this marginal agreement speaks disconcertingly to the coherence (or lack there of) of the program.
To me, the numbers actually sound really good. I don't know if anyone has measured the inter-rater reliability of "normal" grading, for, say, a history midterm, but I doubt that it's any better. As the complexity of the work being rated increases, reliability ought to decline in general. For example, minor changes in the way the assignment is written can change the scores. Minor changes to rubrics can too. These variations point to an inherent fuzziness that comes from compressing complex things down to simple ones. For an example of lossy compression, see this fascinating article at StackOverflow on sending images via Twitter.
How much reliability is required to be useful? That one's easy: anything better than random data can be useful (electronic stock trading programs, for example, try to ferret out very small correlations to make money on). But if we have to stand up and swear that we believe in the results for a particular student, that's a different question. So the question really turns on what we are using the data for. This is an interesting dilemma faced by administrations that try to actively use assessments of any kind. And of course the unfortunate Assessment Director will be caught in the crossfire. My contention is that too often too much faith is put in the assessments, and damage can easily result. One should not believe too much in one's convenient fictions. More on this topic later.
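To make the "better than random" standard concrete, here is a minimal Python sketch that compares raw percent agreement between two raters with Cohen's kappa, which subtracts the agreement you would expect by chance. I don't know which reliability statistic the WSU graph actually reports, so this is only an illustration. The rubric scores are invented for the example, and the function names are my own.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each scored independently at their own observed rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so report perfect agreement by convention (an assumption here).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores on a 1-4 rubric for ten pieces of student work.
scores_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
scores_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2]

print(f"percent agreement: {percent_agreement(scores_a, scores_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(scores_a, scores_b):.2f}")
```

With these made-up scores the raters agree on 70 percent of the items, but kappa comes out near 0.58, because about 29 percent agreement would be expected from chance alone. A kappa of zero would mean the ratings are no better than random. Whether 0.58 is "enough" still depends on what the scores are used for.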


  Dave
    Thanks for an interesting analysis. One tiny correction, which probably implies the blog post was unclearly worded. The presentation was made at the request of the TLT Group, in a series they were doing on rubrics. The work is original with the Center for Teaching Learning and Technology at Washington State University ~Nils

  Thanks! Fixed it.

  David

    I replied in a post and attempted to track back to here, but it does not seem to have worked.

