Friday, November 22, 2019

Problems with Rubric Data

Introduction

Within the world of educational assessment, rubrics play a large role in the attempt to turn student learning into numbers. Usually this starts with some work students have done, like research papers for an assignment (often called "artifacts" for some reason). These are then scored (i.e. graded) using a grid-like description of performance levels. Here's an outline I found at Berkeley's Center for Teaching and Learning:


The scale levels are almost always described in objective language relative to some absolute standard, like "The paper has no language usage errors (spelling, grammar, punctuation)." 

Raters review each student work sample and assign a level that corresponds to the description, using their best judgment. There are usually four or five separate attributes to rate for each sample. For written work these might be:
  • language correctness
  • organization
  • style
  • adherence to genre conventions (e.g. a letter to the editor doesn't look like a research paper).
The result of this work (and it is usually a lot of work) is ideally a data set that identifies the student being rated, identifies the rater, and records the numerical rating for each attribute of each sample. If papers are being rated this way, the final data set might begin like this:
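
Since the example table isn't reproduced here, here's a minimal sketch in Python of what such a data set might look like; the column names and values are my own invented illustration, not the original data.

    # A hypothetical rubric-rating data set: one row per (student, rater, sample).
    # All column names and values are invented for illustration.
    import pandas as pd

    ratings = pd.DataFrame({
        "student_id":   ["S001", "S001", "S002", "S003"],
        "rater_id":     ["R01",  "R02",  "R01",  "R02"],
        "correctness":  [4, 3, 2, 5],
        "organization": [3, 3, 2, 4],
        "style":        [4, 4, 3, 5],
        "genre":        [3, 4, 2, 4],
    })
    print(ratings)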


The data needs to be summarized in order to make sense of it. Often this is accomplished by averaging the scores or by finding the percent of scores greater than a threshold (e.g. scores greater than 3 are assumed to meet a minimum standard).
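
As a sketch of those two summaries on a single rubric attribute (the threshold of 3 follows the example in the parentheses above; the ratings themselves are invented):

    # Two common summaries of ordinal rubric scores for one attribute.
    import pandas as pd

    organization = pd.Series([4, 3, 2, 5, 3, 1, 4, 5])   # hypothetical ratings

    print("mean score:", organization.mean())
    print("percent greater than 3:", round((organization > 3).mean() * 100, 1))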

This is where the problems begin.

Problems with Rubric Data


Problem 1. There may not be a there there

The rubric rating process assumes that there is enough evidence within the piece of writing to make an informed judgment. In the case of written work, this is a sliding scale. For example, the length of the paper is probably proportional to the amount of evidence we have, so we theoretically should be able to make better decisions about longer papers. I don't know if anyone has ever tested this, but it's possible: measure the inter-rater reliability for each paper that has multiple readers and see if that correlates with the number of words in the paper. 
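
Here's a rough sketch of how that check might go, assuming each paper has exactly two raters and a recorded word count; all of the data below is simulated, and absolute disagreement between the two raters stands in (inversely) for per-paper reliability.

    # Does rater disagreement shrink as papers get longer? (simulated data)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_papers = 200
    word_count = rng.integers(300, 3000, n_papers)

    # Invented model: longer papers give raters more evidence, hence less disagreement.
    noise_scale = 1.5 - word_count / 3000
    rater_a = np.clip(np.round(rng.normal(3, 1, n_papers)), 1, 5)
    rater_b = np.clip(np.round(rater_a + rng.normal(0, noise_scale)), 1, 5)

    disagreement = np.abs(rater_a - rater_b)
    rho, p = stats.spearmanr(word_count, disagreement)
    print(f"Spearman correlation (word count vs. disagreement): {rho:.2f}, p = {p:.3f}")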

If papers that simply lack evidence are rated lower (rather than the ratings being left blank, say), then the scale isn't being used consistently: we're conflating the amount of evidence with the quality of the evidence. John Hathcoat did a nice presentation about this issue at the AAC&U meeting last year.


When I design rubrics now, I make them two-dimensional: one dimension assesses how much basis there is for judgment, and the other assesses quality. There's a dependency relationship between the two: we can't assess quality without sufficient evidence.

Problem 2. Growth versus Sorting

Raters of the student work are naturally influenced by what they see during the rating. That is, our putative objectivity and ideal rubric levels can go out the window when confronted with reality. Raters may simply rank the student work from worst to best using the scale they are given. The logic, which I've heard from raters, goes like this: "Paper 4 is obviously better than paper 3, so the score for paper 4 should be higher than the score for paper 3." 

What follows from this (very natural) rating style is a sorting of the papers. If the sorting is any good, the poor papers have low scores and the good papers have high scores. This sounds like a good thing, but it's not what was advertised. The intention of a so-called analytic rubric is to produce absolute measures of quality, not relative ones. Why does that matter?

If we just sort the students by paper quality, we'll get pretty much the same answer every time. The high grade earners will end up at the top of the pile and the low grade earners at the bottom. This pattern will, on average, persist over the years of college. The A-student freshmen will get 5s on their papers, and when they're seniors, they'll still be getting 5s. We can't measure progress with sorting.
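
Here's a toy simulation of that point, with all the numbers invented: true ability grows every year, but if raters effectively rank each cohort onto the scale, the average rating never moves.

    # Sorting hides growth: ability improves each year, but rank-based ratings don't.
    import numpy as np

    rng = np.random.default_rng(3)
    for year in range(1, 5):
        ability = rng.normal(loc=0.4 * year, scale=1.0, size=100)  # real growth of 0.4/year
        ranks = ability.argsort().argsort()                        # rank within the cohort
        ratings = 1 + (ranks * 5) // 100                           # quintiles mapped to 1..5
        print(f"year {year}: true mean ability {ability.mean():.2f}, mean rating {ratings.mean():.2f}")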

One project I worked on gathered rubric ratings of student work over a two-semester sequence, where the instruction and the training of instructors were closely supervised. Everyone was trained on the analytic rubric, which was administered online, capturing rubric ratings for many thousands of work samples.

The graph below shows the average change in rubric quality ratings over six written assignments (comprising two terms). 

The scores start low for the first assignment, then improve (higher is better) throughout the first term. This is evidence that within this one class, the analytic rubric is working to some extent, although the absolute difference in average score over the term (from about 3.0 to about 3.3) is quite small. At first glance, it's mostly sorting with a little growth component added in.

But when we get to the second term, the scores "reset" to a lower level than where the first term ended. We could interpret this in various ways, but over the two terms the scale is clearly not functioning the way an analytic rubric is intended to, or else the learning quickly evaporates over the holidays.

By way of contrast, here's a summary of thousands of ratings from a different project that does appear to show growth over time. As a bonus, the sorting effect is also demonstrated by disaggregating the ratings by high school GPA.


More on that project another time, but you can read about it here and here (requires access). 

Problem 3. Rater Agreement

The basic idea of rubric rating is common sense: we look at some stuff and then sort it into categories based on the instructions. But the invention of statistics was necessary because it turns out that common sense isn't a very good guide for things like this. The problem in this case is that the assigned categories (rubric ratings) may be meaningless. For example, we wouldn't simply flip a coin to determine if a student writing portfolio meets the graduation requirement for quality. But it could be that our elaborate rubric rating system has the same statistical properties as coin-flipping. We don't know unless we check.

The general concept is measurement reliability. It's complicated, and there are multiple measures to choose from. Each of those comes with statistical assumptions about the data or the intended use of the statistic. There are "best practices" that don't make any sense (like requiring kappa > .7), and there is a debate within the research community about how to resolve "paradoxes" related to unbalanced rating samples. I don't have a link, but see this paper on that topic:
Krippendorff, K. (2013). Commentary: A dissenting view on so-called paradoxes of reliability coefficients. Annals of the International Communication Association, 36(1), 481-499.
For my own analysis and some results I'll refer you to this paper, which won an award from the Association for Institutional Research. 
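
As one concrete illustration, here's a minimal agreement check on invented ratings from two raters, using quadratically weighted Cohen's kappa from scikit-learn; which coefficient is appropriate, and what threshold counts as "good enough," is exactly the contested part.

    # A basic rater-agreement check on hypothetical ratings from two raters.
    # Quadratic weights treat near-misses (3 vs. 4) as better than far misses (1 vs. 5).
    from sklearn.metrics import cohen_kappa_score

    rater_1 = [3, 4, 2, 5, 3, 3, 4, 1, 2, 4]
    rater_2 = [3, 3, 2, 4, 4, 3, 5, 2, 2, 4]

    kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
    print(f"weighted kappa: {kappa:.2f}")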

A short version of the reliability problem for rubric rating is that:
  • it's vital to understand the reliability before using the results
  • it's difficult and confusing to figure out how to do that
  • reliability problems often can't easily be fixed

Problem 4. Data Types

What we get from rubric scales is ordinal data, with the numbers (or other descriptors) serving the role of symbols, where we understand that there's a progression from lesser to greater demonstration of quality: A > B > C. It's common practice, as I did in the example for Problem 2, to simply average the numerical output and call that a measure of learning. 

The averages "kind of" work, but they make assumptions about the data that aren't necessarily true--chiefly that the rubric levels are evenly spaced, so that the difference between a 2 and a 3 means the same as the difference between a 4 and a 5. That approach also oversimplifies the data set and doesn't take advantage of richer statistical methods. Here are two references:
Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328-348. [pdf]
and 
Engelhard Jr, G., & Wind, S. (2017). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge. [Amazon.com]
The same idea applies to analyzing survey items on ordinal scales (e.g. disagree / agree).

Some advantages of using these more advanced methods are that we get information on:
  • rater bias
  • student ability
  • qualitative differences between rubric dimensions
  • how the ordinal scale maps to an ideal "latent" scale
Since the measurement scale for all of these is the same, it's called "invariant," which you can read more about in the second reference.
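
As a sketch of what one of these richer models looks like in practice, here's an ordered-logit fit using statsmodels' OrderedModel on invented data; the Rasch-style models in the second reference are more elaborate, but the idea of treating ratings as ordered categories rather than interval numbers is the same.

    # Ordered logit: rubric scores as ordinal categories, not interval numbers.
    # All data here is simulated for illustration.
    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    rng = np.random.default_rng(0)
    n = 500
    gpa = rng.normal(3.0, 0.5, n)                         # hypothetical student GPA
    latent = 0.8 * gpa + rng.logistic(size=n)             # unobserved writing quality
    rating = pd.cut(latent, [-np.inf, 1.5, 2.5, 3.5, np.inf], labels=[1, 2, 3, 4])

    df = pd.DataFrame({"gpa": gpa})
    df["rating"] = pd.Categorical(rating, ordered=True)

    model = OrderedModel(df["rating"], df[["gpa"]], distr="logit")
    result = model.fit(method="bfgs", disp=False)
    print(result.summary())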

Problem 5. Sample Size

It's time-consuming to use rubrics to rate student work samples. If the rating process isn't built into ordinary grading, for example, then this is an additional cost to get the data. That cost can limit sample sizes. How big a sample do you need? Probably at least 200.

That estimate comes from my own work, for example the power analysis shown below. It shows simulations of statistical tests to see how much actual difference in average ratings is required before it can be detected from samples. In the simulation, two sets of actual student ratings were drawn from a large pool of them (several thousand). In one set (A) I took the average. In the other set (B) I took the average and then added an artificial effect size to simulate learning over time, which you can see along the top of the table below. Then I automated t-tests to see if the p-value was less than .05, which would suggest a non-zero difference in sample averages--detecting the growth. The numbers in the table show the rate at which the t-test successfully identified a non-zero difference. The slide below comes from a presentation I did at the SACSCOC Summer Institute.



For context, a .4 change is about what we expect over a year in college. With 100 samples each of A and B, where the average of B was artificially inflated by .4 to simulate a year's development, the t-test will get the correct answer 76% of the time. The rest of the time, there will be no conclusion (at alpha = .05) that there's a difference. Rubric ratings are often statistically noisy, and to average out the noise requires enough samples to detect actual effects.
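
For readers who want to reproduce the idea, here's a small sketch of that kind of simulation. The rating distribution is invented (the original analysis resampled several thousand real ratings), and the artificial effect is applied to every rating in group B, which shifts its average by the stated amount.

    # Simulation-style power analysis: how often does a t-test detect a given shift?
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    population = rng.choice([1, 2, 3, 4, 5], size=5000, p=[0.05, 0.2, 0.4, 0.25, 0.1])

    def power(n, effect, trials=2000, alpha=0.05):
        """Fraction of trials where the t-test detects a shift of `effect` (n per group)."""
        hits = 0
        for _ in range(trials):
            a = rng.choice(population, size=n)
            b = rng.choice(population, size=n) + effect   # simulated growth
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
        return hits / trials

    for n in (50, 100, 200):
        print(n, round(power(n, effect=0.4), 2))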

For an article that essentially reaches the same conclusion, see:
Bacon, D. R., & Stewart, K. A. (2017). Why assessment will never work at many business schools: A call for better utilization of pedagogical research. Journal of Management Education, 41(2), 181-200. [pdf]
Because most rubric rating projects end up with a lot fewer than 200 samples, it's fair to conclude that most of them don't have sufficient data to do the kind of statistics we need to do--including testing reliability and checking for differences within student types. 

It's possible that we can get by with smaller samples if we use more sophisticated methods, like those found in Problem 4. I'll be working on that for a paper due in February, so stay tuned.

Problem 6. Standardization

Standardization in testing is supposed to help with reliability by reducing extraneous variables. In the case of rubric ratings, this often means restricting the student work to a single item: one research paper per student, for example. But assessing a student's writing ability is different from evaluating a single written work. In the most extreme case, the student might have just bought the paper on the Internet, but even under usual circumstances, using a single paper as a proxy for writing ability is a crude approximation. Consider a student who is a strong writer, but chooses to invest study time in another class. So she gets up early the day the paper is due, knocks out one draft, and gets an easy A-. A different student works very hard on draft after draft, sits for hours in the writing lab, visits the professor for feedback, and--after all that--earns an A-. Does the grade (or equivalent rubric rating) tell us anything about how that rating was earned? No--for that we need another source of information. Surveying students is one possibility. For a different approach, see my references at the end of Problem 2.

Some projects review a whole portfolio of student work to get more information about writing, for example. But rating a whole portfolio is an even bigger burden on the raters, reducing sample size and--because of the inevitable non-standardization of samples--lowering reliability.

Problem 7. Confounding

The AAC&U started a rubric-construction project with its LEAP initiative some years ago. The result is a set of VALUE rubrics. They also provide scorer training and will score your papers externally for a fee. The paper cited below was published last year. It's based on a data set that's large enough to work with, and includes students at different points in their college careers, so it's conceivable to estimate growth over time. This assumes that Problem 2 doesn't bias the data too much. 
Sullivan, D. F., & McConnell, K. D. (2018). It's the Assignments—A Ubiquitous and Inexpensive Strategy to Significantly Improve Higher-Order Learning. Change: The Magazine of Higher Learning, 50(5), 16-23. [requires access]
The authors found that they could detect plausible evidence of growth over time, but only after they included assignment difficulty as an explanatory variable. You can read the details in the paper.

If this finding holds up in subsequent work, it's not just an example of needing more information than what's in the rubric data to understand it; it's a subtle version of Problem 1. Suppose freshmen and seniors are both in a general education class that assigns a paper appropriate to freshmen. Assuming everything works perfectly with the rubric rating, we still may not be able to distinguish the two classes simply because the assignment isn't challenging enough to require demonstration of senior-level work. This is a cute idea, but it may not be true--we'll have to see what other studies turn up.

There are two lessons to draw from this. One is that it's essential to keep student IDs and course information in order to build explanatory models with the data. From my experience, college GPA correlates with practically every kind of assessment measure. So if we analyze rubric scores without taking that variable into account, we're likely to misunderstand the results.
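
To make that first lesson concrete, here's a hedged sketch of the kind of explanatory model it implies: rubric ratings regressed on class year, GPA, and assignment difficulty. The variable names and data are all hypothetical, rigged so that real growth is masked by harder assignments, roughly in the spirit of the paper's finding; it's a sketch, not their model.

    # Explanatory model sketch: rating ~ year + gpa + difficulty (simulated data).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 400
    year = rng.integers(1, 5, n)                               # 1 = freshman ... 4 = senior
    difficulty = np.clip(year + rng.integers(-1, 2, n), 1, 5)  # harder assignments later on
    gpa = rng.normal(3.0, 0.5, n)
    # Invented ratings: students grow with year, but harder assignments pull scores down.
    rating = (1.5 + 0.5 * gpa + 0.3 * year - 0.3 * difficulty
              + rng.normal(0, 0.5, n)).round().clip(1, 5)
    df = pd.DataFrame({"rating": rating, "year": year, "gpa": gpa, "difficulty": difficulty})

    naive = smf.ols("rating ~ year", data=df).fit()                         # growth looks flat
    adjusted = smf.ols("rating ~ year + gpa + difficulty", data=df).fit()   # growth appears
    print("naive year effect:   ", round(float(naive.params["year"]), 2))
    print("adjusted year effect:", round(float(adjusted.params["year"]), 2))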

The other lesson comes from the reaction of the assessment community to this paper: there was none. The VALUE rubrics are widely adopted across the country, so there are probably at least a thousand assessment offices that employ them. The paper challenges the validity of using all that data without taking into account course difficulty. One could have naturally expected a lot of heated discussion about this topic. I haven't seen anything at all, which to me is another sign that assessment offices mostly operate on the assumption that rubric ratings just work, rather than testing them to see if they do work.

Problem 8. Dimensionality

While most rubrics have five or so attributes to rate, the data that comes from them tends to be highly correlated. A dimensional analysis like Principal Component Analysis reveals the structure of these correlations. Most often, in my experience, the first principal component looks like a holistic score and captures the majority of the variation in ratings. This is important, because you may be working your raters too hard, trading away additional sample size for redundant data. If the correlations are too high, you're not measuring what you think you are.

To connect to Problem 7: correlate the first principal component with student GPA to see how much of what you're measuring with the rubric is already accounted for by grade averages.
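
Here's a minimal sketch of both steps on invented ratings (the attribute names are hypothetical): a PCA to see how much variance the first component soaks up, then its correlation with GPA.

    # PCA on simulated rubric ratings, then correlate the first component with GPA.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(7)
    n = 300
    holistic = rng.normal(0, 1, n)                       # one underlying "quality" factor
    ratings = pd.DataFrame({
        attr: np.clip(np.round(3 + holistic + rng.normal(0, 0.4, n)), 1, 5)
        for attr in ["correctness", "organization", "style", "genre"]
    })
    gpa = 3.0 + 0.4 * holistic + rng.normal(0, 0.3, n)   # hypothetical GPA, tied to the same factor

    pca = PCA()
    scores = pca.fit_transform(ratings)
    print("variance explained:", pca.explained_variance_ratio_.round(2))
    # The sign of a principal component is arbitrary, so report the absolute correlation.
    print("|corr(PC1, GPA)|:", round(abs(np.corrcoef(scores[:, 0], gpa)[0, 1]), 2))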

Problem 9. Validity

Do the ratings have any relationship to reality outside the rating context? We see reports that employers want strong problem-solving skills. Do our rubric ratings of "problem-solving" have any useful relationship to the way the employers understand it? There's no easy way to answer that question--it's another research project, e.g. start by surveying student internship supervisors. 

The trap to avoid is assuming that, just because we have all this language about what we think we're measuring (e.g. the words on the rubric), the numbers actually measure that. That question may not even make sense: if the reliability is too low, we're not measuring anything at all. And if the reviewers that matter--observers of our graduates later in life--don't themselves agree about "problem-solving" or whatever it is, the measurement problem is doomed from the start.

If we just take rating averages and announce to the world that "our student problem-solving ability improved on average from 3.2 to 3.5 last year," it's almost certainly not going to be understood in a legitimate way. This kind of thing leads people to think they know something when they don't. 

For a nice discussion of the complexities of this topic see:
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12. [pdf]

Discussion

In higher education administration, particularly in assessment offices, rubrics are used as an easy way to plausibly turn observations of student work products or performances into data. After 18 years of involvement with conferences and peer reviews, it's apparent to me that the complexities of dealing with this kind of data are almost uniformly ignored. Small, noisy data sets that are not critically examined are used to produce "measures of learning" that are almost certainly not that. It is possible to learn about student development from rubric ratings: there are plenty of academic journals that publish such research. It's just not common in assessment circles to see appropriate skeptical inquiry related to this type of data; rather, an unspoken faith in methods seems to rule the day.

Assessment of learning in higher education administration needs to pivot from faith-based research to data science, and the sooner the better.

2 comments:

  1. "Pivot[ing] from faith-based research to data science" would require significantly changing the training, hiring, and support processes and incentive structures of assessment. Many of us would love to invest a lot more time in psychometrically rigorous work but rarely if ever have the time, training, and support to do so. We also need to figure out how that would work alongside the incredibly complex and demanding political and social aspects of assessment work. Much of my work, for example, is translation for faculty in disparate disciplines, departments, and programs that each have their own histories, epistemologies, priorities, and reward structures.

    In my work with faculty, we often focus on incentives and reward structures. In other words, why would faculty do X (revise a curriculum, explore a different pedagogy, etc.)? If we use that perspective in this different situation, it seems possible that there simply isn't a uniform, lasting demand for highly rigorous, quantitative assessment data. Change that and assessment - processes, tools, professionals, etc. - will have to follow. Are rubrics fatally flawed or are they good enough (for what purposes and which audiences?)?

  2. You're right--such a change would require rethinking how assessment offices operate, particularly with respect to regulatory requirements. I'm trying to encourage assessment staff (as I am) to innovate. The good stuff we do with respect to faculty development and program review doesn't need to go away; we just replace the Sisyphean report-writing part of the job with more meaningful activity. That might be more faculty development for some institutions, and more research for others, as need be.

    Rubrics clearly can give useful results if we pay attention to the details listed above, i.e. as a research tool. They can also be useful informally as an aid to communicating standards to students, improving grading fairness, and so on. What I think is not so useful is the insistence that rubric ratings automatically make good data for telling us about the quality of a program. For example, in the assessment culture I'm most familiar with, an average of 5 rubric ratings would be unquestionably accepted as a measure of a program's quality, while a much larger number of course grades (which connect to everything else we know about a student) are considered "indirect evidence," and can't be used without a lot of pain involved.
