Because the intent seems to be to tap into crowd-sourcing for assessment (you assess some of my students, I assess some of yours, for example), I wonder if the group has considered using Amazon's Mechanical Turk service as a cost-effective way of getting ratings from "the public."
The blog post has some interesting analysis of results to date. There are a couple of points I'd like to comment on, but I only have time for one now: the question "how much inter-rater reliability is enough?" The graph shows inter-rater reliability to be between 70 and 80 percent over a five-year period. The author, Nil Peterson, writes that
"This inter-rater reliability is borderline and problematic because, when extrapolated to high stakes testing, or even grades, this marginal agreement speaks disconcertingly to the coherence (or lack there of) of the program."

To me, the numbers actually sound really good. I don't know if anyone has measured what the inter-rater reliability for "normal" grading is, say for a history midterm, but I doubt that it's any better. As the complexity of what is being assessed increases, reliability ought to decline in general. For example, minor changes in the way the assignment is written can create changes in the scores. Minor changes to rubrics can too. These variations point to an inherent fuzziness that comes from compressing complex things down to simple ones. For an example of lossy compression, see this fascinating article at StackOverflow on sending images via Twitter:
(image by Quasimondo, licensed under a Creative Commons Attribution-Noncommercial license).
How much reliability is required to be useful? That one's easy: anything better than random data can be useful (electronic stock trading programs, for example, try to ferret out very small correlations to make money on). But if we have to stand up and swear that we believe in the results for a particular student, that's a different question. So the question really turns on what we are using the data for. This is an interesting dilemma faced by administrations that try to actively use assessments of any kind. And of course the unfortunate Assessment Director will be caught in the crossfire. My contention is that too often too much faith is put in the assessments, and damage can easily result. One should not believe too much in one's convenient fictions. More on this topic later.
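To make "better than random" concrete, here is a minimal sketch in Python that compares raw percent agreement between two raters with Cohen's kappa, a standard statistic that discounts the agreement two raters would reach by chance alone. The scores are invented for illustration (a hypothetical 1-4 rubric, ten students) and are not taken from the study discussed above. I also don't know whether the 70 to 80 percent figure in the graph is raw agreement or a chance-corrected measure, so treat this only as a way of seeing the difference between the two.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each scored independently at their own observed rates.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical rubric scores (1-4) from two raters for the same ten students.
scores_a = [1, 2, 2, 3, 3, 3, 4, 4, 2, 3]
scores_b = [1, 2, 3, 3, 3, 2, 4, 4, 2, 3]

agreement = percent_agreement(scores_a, scores_b)
kappa = cohens_kappa(scores_a, scores_b)
print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

With these made-up scores the raters agree on 8 of 10 students, but some of that agreement would be expected even if both were guessing at their usual rates, so kappa comes out lower than the raw figure. A kappa near zero means the ratings are no better than random; anything clearly above zero carries some information, which is the sense of "useful" used above.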