[Note: if the online app doesn't work, it's because my limit of usage hours for the month has been reached.]
The question of inter-rater agreement comes up when we have multiple assessments of the same subject using the same scale, for example rubric scores of student work with multiple raters. The point of using more than one judge of the material is to find out the extent to which they agree. Unfortunately, the existing statistics for doing this aren't much help in understanding what's going on with the rubric ratings. There are a variety of measures, including Cohen's kappa, Fleiss's kappa, and several others. These are designed to assess agreement between raters by comparing the nominal agreement with what we might find if the raters were just making assignments randomly; this is called "chance correction." I find that a single index is not very helpful, and although conditional versions are out there, they are not easy to find or use.
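To make the chance-correction idea concrete, here is a minimal Python sketch for the simplest case of two raters (essentially Cohen's kappa). The ratings are invented for illustration, and this is not the app's own code.

```python
# A minimal sketch of chance correction for two raters (Cohen's kappa).
# The ratings below are made up purely for illustration.
from collections import Counter

rater1 = [0, 1, 1, 2, 2, 1, 0, 2]
rater2 = [0, 1, 2, 2, 1, 1, 0, 2]
n = len(rater1)

# Observed agreement: fraction of subjects where the two raters match.
p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement: probability of a match if each rater assigned
# categories independently with his or her own observed frequencies.
f1, f2 = Counter(rater1), Counter(rater2)
p_chance = sum((f1[c] / n) * (f2[c] / n) for c in set(rater1) | set(rater2))

# Kappa: share of the agreement beyond chance that was actually achieved.
kappa = (p_obs - p_chance) / (1 - p_chance)
print(f"observed={p_obs:.2f}  chance={p_chance:.2f}  kappa={kappa:.2f}")
```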
So I created one that I like and have found very useful. The technical notes are in this paper, which is a pretty good draft at this point. I created an online app to do all the hard work, and put the code on GitHub. (Note: on 12/30 I renamed the repo to reflect the actual name.)
Here's an example of the output. It shows a three-point scale, from 0 to 2, with comparisons between each rating value and each other rating value. The lower right graph is 1 versus 2, which shows how much agreement there was between those two ratings. If a subject was rated with 1s and/or 2s, he or she shows up in this graph. The thin line is what we would expect from random assignments (given certain assumptions), and the darker line is actually one dot per observation. The vertical and horizontal components are perfect agreement--the subject got either all 1s or all 2s from all raters. The diagonal part is where the ratings are mixed, e.g. a 1 and a 2, or two of each.
The p-value is for a hypothesis test of the data against the null hypothesis of random assignments. Kappa is computed much like Fleiss's--it represents the fraction of the available agreement beyond chance that we actually see. So for the lower right graph, 45% of the possible 'extra' agreement is observed. Perfect agreement, with K=1, would look like an inverted L.
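For readers who want the arithmetic behind "available agreement beyond chance" with several raters per subject, here is a sketch of the textbook Fleiss computation. The counts are invented for illustration; the app's conditional kappa resembles this but may differ in detail.

```python
# A sketch of the standard Fleiss-style kappa for several raters per subject.
# counts[i, j] = number of raters who gave subject i the rating j (0, 1, 2).
# Assumes every subject was rated by the same number of raters.
import numpy as np

counts = np.array([
    [3, 0, 0],
    [0, 2, 1],
    [0, 1, 2],
    [0, 0, 3],
    [1, 2, 0],
])
n_subjects = counts.shape[0]
n_raters = counts.sum(axis=1)[0]

# Per-subject agreement: proportion of rater pairs that agree.
p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
p_obs = p_i.mean()

# Chance agreement from the overall rating frequencies.
p_j = counts.sum(axis=0) / (n_subjects * n_raters)
p_chance = np.sum(p_j ** 2)

# Fraction of the possible 'extra' agreement (beyond chance) that is observed.
kappa = (p_obs - p_chance) / (1 - p_chance)
print(f"observed={p_obs:.2f}  chance={p_chance:.2f}  kappa={kappa:.2f}")
```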
The program is designed to work with three types of data; follow the on-screen guides for loading it. You'll need the data in CSV format, which is easy to export from Excel.
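As a sketch of what a CSV of ratings can look like, here is one simple wide layout with one row per subject and one column per rater, written with pandas. This particular layout is just an assumption for illustration; the three formats the app actually accepts are described in its on-screen guides.

```python
# A hypothetical wide layout: one row per subject, one column per rater.
# Saving from Excel with "Save As CSV" produces the same kind of file.
import pandas as pd

ratings = pd.DataFrame({
    "Rater1": [0, 1, 2, 2],
    "Rater2": [0, 1, 1, 2],
    "Rater3": [1, 1, 2, 2],
})
ratings.to_csv("ratings.csv", index=False)
```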
I started a wiki on the GitHub page.