Higher Ed/: Inter-rater Reliability

At the Assessment Institute this week, I saw presentations on three different ways to examine inter-rater reliability: classical parametric approaches such as Krippendorff's alpha; a paired-difference t-test to detect bias over time; and a Rasch model. I'm hoping to talk the presenters into summarizing these approaches and their uses in future blog posts. In the meantime, I will describe below a fourth approach, which I developed to look at the inter-rater reliability of (usually) rubric-based assessments.

I like non-parametric approaches because you can investigate the structure of a data set before you assume things about it. As a bonus you get pretty pictures to look at.

Even without inter-rater data, the frequency chart of how ratings are assigned can tell us interesting things. Ideally, all the ratings are used about equally often. (This is ideal for the descriptive power of the rubric, not necessarily for your learning outcomes goals!) The reasoning is the same as for tests: we want them to give us a wide range of results rather than, for example, a pronounced ceiling effect.

If we do have inter-rater data, then we can calculate the frequency of exact matches or near-misses and compare those rates to what we would expect to see if we just sampled ratings independently at random from the observed distribution of ratings (or alternatively from a uniform distribution). When shown visually, this can tell us a lot about how the rubric is performing.
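The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the script behind the report: the function name `match_rates` and the input layout (one list of ratings per subject) are assumptions made for the example. The chance rate is the probability that two independent draws from the pooled rating distribution agree, which is the sum of the squared frequencies.

```python
from collections import Counter
from itertools import combinations


def match_rates(ratings_by_subject):
    """Return (observed, expected) exact-match rates.

    ratings_by_subject: list of lists, one inner list of ratings per subject
    (a hypothetical layout chosen for this sketch).
    """
    # Pool every rating to estimate the overall distribution of scores.
    pooled = [r for ratings in ratings_by_subject for r in ratings]
    counts = Counter(pooled)
    total = len(pooled)

    # Chance that two independent draws from that distribution agree.
    expected = sum((c / total) ** 2 for c in counts.values())

    # Observed agreement over every pair of ratings of the same subject.
    pairs = [
        (a, b)
        for ratings in ratings_by_subject
        for a, b in combinations(ratings, 2)
    ]
    observed = sum(a == b for a, b in pairs) / len(pairs)
    return observed, expected


# Made-up ratings for four subjects, two raters each.
observed, expected = match_rates([[1, 1], [2, 2], [1, 2], [2, 2]])
print(observed, expected)
```

If the observed rate is not clearly above the expected one, the ratings are hard to tell apart from random draws.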

I'll give some examples, but first a legend to help you decipher the graphs.

The graph on the left shows how the raters distributed scores. Although the scale goes from one to four, one was hardly used at all, which is not great: it means we effectively have a three-point scale instead of a four-point scale. So one way to improve the rubric is to try to get some more discrimination between lower and middle values. For the three ratings that are used, the match frequencies (yellow line on the left graph) are higher than what we'd expect from random assignment (orange dots). The difference is shown at the top of each bar for convenience. Altogether, the raters agreed about 52% of the time versus about 35% from random assignment. That's good--it means the ratings probably aren't just random numbers.

The graph on the right shows the distribution of differences between pairs of raters so that you can see how frequent 'near misses' are. Perfect inter-rater reliability would lead to a single spike at zero. Differences of one are left and right of center, and so on. The green bars are actual frequencies, and the orange line is the distribution we would see if the choices were independently random, drawn from the frequencies in the left graph. So you can also see in this graph that exact matches (at zero on the graph) are much more frequent than one would expect from, say, rolling dice.
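For readers who want to reproduce the "independently random" reference line, here is a minimal sketch. It assumes the rating frequencies are given as a dictionary mapping each rating to its probability; the function name is hypothetical. Every ordered pair of ratings contributes the product of its two probabilities to the difference it produces.

```python
from collections import Counter


def expected_difference_distribution(frequencies):
    """Distribution of (rating A - rating B) for two independent raters.

    frequencies: dict mapping rating -> probability, summing to 1
    (an assumed input format for this sketch).
    """
    diff = Counter()
    for a, pa in frequencies.items():
        for b, pb in frequencies.items():
            # Independent draws, so the joint probability is the product.
            diff[a - b] += pa * pb
    return dict(sorted(diff.items()))


# Made-up two-point scale used equally often.
print(expected_difference_distribution({1: 0.5, 2: 0.5}))
```

The value at zero in this distribution is the same chance match rate used for the comparison in the left graph.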

Another example

Here we have a reasonably good distribution of scores, although they clump in the middle. The real problem is that raters aren't agreeing in the middle: the frequencies are actually slightly below random. The ends of the scale are in good shape, probably because it's easier to agree about extremes than distinguish cases in the middle. When I interviewed the program director, he had already realized that the rubric was a problem and the program was already fixing it.

The same report also shows the distribution of scores for each student and each rater. The latter is helpful in finding raters who are biased relative to others. The graphs below show Rater 9's rating differences on three different SLOs. They seem weighted toward -1, meaning that this rater scores a point lower than other raters much of the time.
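A per-rater difference tally like the one described above could be computed roughly as follows. This is a sketch under assumptions: the records are taken to be (subject, rater, score) tuples, and the function name is made up for the example. For each rater it counts how often their score differs by each amount from the other raters of the same subject.

```python
from collections import Counter, defaultdict


def rater_differences(records):
    """Tally each rater's score minus co-raters' scores on the same subject.

    records: iterable of (subject, rater, score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_subject = defaultdict(list)
    for subject, rater, score in records:
        by_subject[subject].append((rater, score))

    diffs = defaultdict(Counter)
    for scored in by_subject.values():
        for rater, score in scored:
            for other, other_score in scored:
                if other != rater:
                    # Negative values mean this rater scored lower.
                    diffs[rater][score - other_score] += 1
    return diffs


# Made-up data: rater A is always one point below rater B.
records = [("s1", "A", 2), ("s1", "B", 3), ("s2", "A", 1), ("s2", "B", 2)]
print(dict(rater_differences(records)))
```

A tally that piles up on one side of zero, as with Rater 9, suggests a rater who is systematically stricter or more lenient than the rest.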

When I get a chance, I will turn the script that creates this report into a web interface so you can try it yourself.

How Much Redundancy is Needed?

We can spend a lot of time doing multiple ratings--how much is enough? For the kind of analysis done above, it's easy to calculate the gain from adding redundancy. With n subjects who are each rated k times by different raters, each subject contributes C(k,2) pairs to check for matches, for n × C(k,2) pairs in total.

C(2,2) = 1 match per subject

C(3,2) = 3 matches per subject

C(4,2) = 6 matches per subject

C(5,2) = 10 matches per subject

So using one redundancy (two raters per subject) is okay, but you get disproportionately more power from adding one more rating (three times as many pairs to check for matches). So rule of thumb: three is great, four is super, more than that is probably wasting resources.
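The counts above are binomial coefficients, so they are easy to tabulate for any design. The helper below is a small sketch (the name `pairs_to_check` is made up for the example) that uses `math.comb`, available in Python 3.8 and later.

```python
from math import comb


def pairs_to_check(n_subjects, k_ratings):
    """Total rating pairs available when each subject is rated k times."""
    return n_subjects * comb(k_ratings, 2)


# Pairs per subject for two through five ratings.
for k in range(2, 6):
    print(k, pairs_to_check(1, k))
```

Going from two to three ratings adds 50% more rating work but triples the number of pairs, which is where the rule of thumb comes from.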

Connection to Elections

Since it's November of an election year, I'll add an observation I made while writing the code. In thinking about how voters make up their minds, we might hypothesize that they are influenced by others who express their views. This could happen in many ways, but let me consider just two.

A voter may tend to take an opinion more seriously based on the frequency of its occurrence in general.

But perhaps raw frequencies are not as important as agreements. That is, hearing two people agree about an opinion is perhaps more convincing than the mere frequency with which the opinion occurs, and that distribution is different--it's the same one used to calculate the 'random' inter-rater dots and graphs above. To calculate it, we square the frequencies and renormalize so they sum to one. Small frequencies become smaller and larger ones larger. The graphs below illustrate an example with four opinions.

Opinion four is the largest, but still a sub-majority of all opinions. However, if you got everyone around the campfire and counted up the agreements between them, opinion four becomes much more visible--the number of these 'handshakes' is now greater than 50% of the total. If this influences other opinions, it would seem to lead to a feedback mechanism that accelerates adoption of opinion four.
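The square-and-renormalize step can be sketched as follows. The counts in the usage line are made up for illustration and are not the values behind the graphs; they are chosen so that the largest opinion holds 40% of the raw total.

```python
def agreement_distribution(frequencies):
    """Square each frequency (or count) and rescale so the results sum to 1."""
    squared = [f * f for f in frequencies]
    total = sum(squared)
    return [s / total for s in squared]


# Hypothetical counts for four opinions: shares of 10%, 20%, 30%, 40%.
print(agreement_distribution([1, 2, 3, 4]))
```

With these illustrative counts, the fourth opinion's share rises from 40% of the raw total to 16/30, a little over 53% of the agreements, which is the kind of shift described above.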

I looked for research on this, and found some papers on how juries function, but nothing directly related to this subject (although the results on how juries work are frightening). It would be interesting to know the relative importance of frequency versus 'handshake' agreement.

Friday, November 02, 2012
