Friday, November 05, 2021

Kuder-Richardson formula 20

My academia.edu account emails me links to articles they think I'll like, and the recommendations are usually pretty good. A couple of weeks ago I came across a paper on the reliability of rubric ratings for critical thinking that way:

Saxton, E., Belanger, S., & Becker, W. (2012). The Critical Thinking Analytic Rubric (CTAR): Investigating intra-rater and inter-rater reliability of a scoring mechanism for critical thinking performance assessments. Assessing Writing, 17(4), 251-270. [link]

Rater agreement is a topic I've been interested in for a while, and the reliability of rubric ratings is important to the credibility of assessment work. I've worked with variance measures like intra-class correlation, and agreement statistics like the Fleiss kappa, but I don't recall seeing Cronbach's alpha used as a rater agreement statistic before. It usually comes up in assessing test items or survey components. 

Here's the original reference.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. [link]

In reading this paper, it's apparent that Cronbach was attempting to resolve a thorny problem in testing theory: how to estimate the reliability of a standardized test. Intuitively, reliability is the tendency for the test results to remain constant when we change unimportant details, like swapping out items for similar ones. More formally, reliability is intended to estimate how good the test is at measuring a test-taker's "true score."

The alpha coefficient is a generalization of a previous statistic called the Kuder-Richardson formula 20, which Cronbach notes is a mouthful and will never catch on. He was right!

Variance and Covariance

The alpha statistic is a variance measure, specifically a ratio of covariance to variance. It's a cute idea. Imagine that we have several test items concerning German noun declension (the endings of words based on gender, case, number, etc.). Which is correct:
  1. Das Auto ist rot.
  2. Der Auto ist rot.
  3. Den Auto ist rot.

 And so on. If we believe that there is a cohesive body of knowledge comprising German noun declension (some memorization and some rules), then we might imagine that the several test items on this subject might tell us about a test-taker's general ability. But if so, we would expect some consistency in the correct-response patterns. A knowledgeable test taker is likely to get them all correct, for example. On the other hand, for a set of questions that mixed grammar with math and Russian history, we would probably not assume that such correlations exist. 

As a simple example, imagine that we have three test items we believe should tap into the same learned ability, and the items are scaled so that each of them has variance one. Then the covariance matrix is the same as the correlation matrix:

$$  \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{bmatrix} $$

The alpha statistic is based on the ratio of the sum of the off-diagonal correlations to the sum of the whole matrix, asking how much of the total variance is covariance. The higher that ratio is, then--to a point--the more confidence we have that the items are in the same domain. Note that if the items are completely unrelated (correlations all zero), the numerator is zero, an indication of no inter-item reliability.
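To make that ratio concrete, here's a minimal R sketch (my own illustration, not anything from Cronbach's paper) that computes the covariance share of a covariance matrix:

  # share of the matrix total that comes from covariances (off-diagonal entries)
  cov_ratio <- function(C) {
    off_diag <- sum(C) - sum(diag(C))  # sum of all the covariances
    off_diag / sum(C)                  # fraction of the total that is covariance
  }

  # three items with unit variances and pairwise correlations of .7
  C <- matrix(.7, nrow = 3, ncol = 3)
  diag(C) <- 1
  cov_ratio(C)   # 4.2 / 7.2 = 0.583; all-zero correlations would give 0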

Scaling

One immediate problem with this idea is that it depends on the number of items. Suppose that instead of three items we have \(n\), so the matrix is \(n \times n\), comprising \(n\) diagonal elements and \(n^2 - n \) covariances. Suppose that these are all one. Then the ratio of the sum of covariance to the total is 

$$ \frac{n^2 - n}{n^2} = \frac{n - 1}{n}. $$

Therefore, if we want to keep the scale of the alpha statistic between zero and one, we have to scale by \( \frac{n}{n-1} \). Awkward. It suggests that the statistic isn't just telling us about item consistency, but also about how many items we have. In fact, we can increase alpha just by adding more items. Suppose all the item correlations are .7, still assuming variances are one. Then 

$$ \alpha = \frac{n}{n - 1} \frac{.7 (n^2 - n)}{.7 (n^2 - n) + n } = \frac{.7n}{.7n + .3} $$

which asymptotically approaches one as \( n \) grows large. Since we'd like to think of item correlation as the big deal here, it's not ideal that with fixed correlations, the reliability measure depends on how many items there are. This phenomenon is related to the Spearman-Brown prediction ("prophecy") formula, but I didn't track that reference down.
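A quick numerical check of that limit (my own sketch, nothing fancy):

  # alpha for n items with unit variances and all inter-item correlations equal to rho
  alpha_uniform <- function(n, rho = .7) rho * n / (rho * n + (1 - rho))
  sapply(c(2, 3, 10, 50, 200), alpha_uniform)
  # 0.82 0.88 0.96 0.99 1.00 (roughly): creeping toward one just by adding items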

This dependency on the number of items is manageable for tests and surveys as long as we keep it in mind, but it's more problematic for rater agreement.

It's fairly obvious that if all the items have the same variance, we can divide through by that common variance to get the correlation matrix, so alpha will not change in this case. But what if items have different variances? Suppose we have correlations of .7 as before, but items 1, 2, and 3 have standard deviations of 1, 10, and 100, respectively. Then the covariance matrix is

$$  \begin{bmatrix} 1 & 7 & 70 \\ 7 & 100 & 700 \\ 70 & 700 & 10000 \end{bmatrix} $$

The same-variance version has alpha = .875, but the mixed-variance version has alpha = .2, so the variance of individual items matters. This presents another headache for the user of this statistic, and perhaps we should try to ensure that variances are close to the same in practice. Hey, what does the wikipedia page advise?
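Before looking that up, here's a quick check of those two numbers (my own sketch; this version includes the n/(n-1) factor from the previous section):

  # Cronbach's alpha computed from a covariance (or correlation) matrix
  cronbach_alpha <- function(C) {
    n <- nrow(C)
    (n / (n - 1)) * (sum(C) - sum(diag(C))) / sum(C)
  }

  R <- matrix(.7, 3, 3); diag(R) <- 1    # equal (unit) variances: a correlation matrix
  s <- c(1, 10, 100)                     # mixed standard deviations
  C <- diag(s) %*% R %*% diag(s)         # the covariance matrix shown above

  cronbach_alpha(R)   # 0.875
  cronbach_alpha(C)   # 0.2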

Assumptions

I've treated the alpha statistic as a math object, but its roots are in standardized testing. Cronbach's paper addresses a problem extant at that time: how to estimate the reliability of a test, or at least of the items on a test that are intended to measure the same skill. From my browsing of that paper, it seems that one popular method at the time was to split the items in half and correlate scores from one half with the other. This is unsatisfactory, however, because the result depends on how we split the test. Alpha is shown in Cronbach's paper to be the average over all such splits, so it standardizes the measure.

However, there's more to this story, because the wiki page describes assumptions about the test item statistics that should be satisfied before alpha is a credible measure. The strictest assumption is called "tau equivalence," in which "data have equal covariances, but their variances may have different values." Note that this will always be true if we only have two items (or raters, as in the critical thinking paper), but generally I would think this is a stretch. 

Never mind the implausibility, however. Does tau-equivalence fix the problem identified in the previous section? It seems unreasonable that the reliability of a test should change if I simply rescale the scores on an item. Suppose I've got a survey with 1-5 Likert-type scales, and I compute alpha. I don't like the result, so I arbitrarily change one of the scales to 2-10 by doubling the values, and get a better alpha. That adjustment is obviously not desirable in a measure of reliability. But it's possible without preconditions on the covariance matrix. Does tau-equivalence prevent such shenanigans?
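Here's the rescaling trick in miniature (simulated, hypothetical Likert-ish data; the point is only that the covariance-based alpha moves while the correlations don't):

  set.seed(1)
  # three correlated 1-5 items built from a shared latent score plus noise
  z <- rnorm(200)
  x <- sapply(1:3, function(i) pmin(5, pmax(1, round(3 + z + rnorm(200)))))

  cronbach_alpha <- function(C) {
    n <- nrow(C)
    (n / (n - 1)) * (sum(C) - sum(diag(C))) / sum(C)
  }

  cronbach_alpha(cov(x))   # covariance-based alpha on the original 1-5 scales
  cronbach_alpha(cor(x))   # correlation-based ("standardized") alpha
  x[, 1] <- 2 * x[, 1]     # "rescale" item 1 from 1-5 to 2-10
  cronbach_alpha(cov(x))   # the covariance-based alpha changes...
  cronbach_alpha(cor(x))   # ...the correlation-based one does not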
 
For a discussion of the problems caused by the equal-covariance assumption and others see 

Trizano-Hermosilla, I., & Alvarado, J. M. (2016). Best alternatives to Cronbach's alpha reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in Psychology, 7, 769. [link]

The authors make the point that alpha keeps getting used, long after more reasonable methods have emerged. I think this is particularly true for rater reliability, which seems like an odd use for alpha.
 

Two Dimensions

The 2x2 case is useful, since it's automatically tau-equivalent (there is only one distinct covariance element, so it's equal to itself). Suppose we have item standard deviations of \( \sigma_1, \sigma_2 \) and a correlation of \( \rho_{12} \). Then the covariance will remain the same if we scale one of the items by a constant \( \beta \) and the other inversely. In detail, we have

$$  \begin{bmatrix} \beta^2 \sigma_1^2 & \rho_{12} \sigma_1 \sigma_2 \\ \rho_{12} \sigma_1 \sigma_2 & \frac{ \sigma_2^2}{\beta^2} \end{bmatrix} $$
 
Then 
 
$$ \alpha = 2  \frac{2 \rho_{12} \sigma_1 \sigma_2}{2 \rho_{12} \sigma_1 \sigma_2 +  \beta^2 \sigma_1^2 + \frac{ \sigma_2^2}{\beta^2} } $$

The two out front is that fudge factor we have to include to keep the values in range. Consider everything fixed except for \( \beta \), so the extreme values of alpha will occur when the denominator is largest or smallest. Taking the derivative of the denominator, setting to zero, and solving gives

$$ \beta = \pm \sqrt{\frac{\sigma_2}{\sigma_1}} $$
 
Since the negative solution doesn't make sense in this context, take the positive root: \( \beta^2 = \sigma_2 / \sigma_1 \), which makes both diagonal entries equal to \( \sigma_1 \sigma_2 \). (The denominator blows up as \( \beta \) goes to zero or infinity, so this critical point minimizes it and therefore maximizes alpha.) In other words, alpha is maximized when the two variances are equal, and the covariance matrix becomes
 
$$  \begin{bmatrix} \sigma_1 \sigma_2  & \rho_{12} \sigma_1 \sigma_2 \\ \rho_{12} \sigma_1 \sigma_2 & \sigma_1 \sigma_2 \end{bmatrix}. $$
 
At that point we can divide the whole thing by  \( \sigma_1 \sigma_2 \), which won't change alpha, to get the correlation matrix
 
 
$$  \begin{bmatrix} 1  & \rho  \\ \rho  & 1 \end{bmatrix} $$
 
So for the 2x2 case, the largest alpha occurs when the variances are equal, and we can just consider the correlation. My guess is that this is true more generally, and that a convexity argument (e.g., Jensen's inequality) could show it. But that's just a guess.
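A quick numeric check of the 2x2 claim (my own sketch, with arbitrary example values): sweep \( \beta \) and confirm the maximum lands at \( \beta^2 = \sigma_2 / \sigma_1 \) with value \( 2 \rho / (1 + \rho) \).

  sigma1 <- 2; sigma2 <- 5; rho <- .7

  alpha2 <- function(beta) {
    covar <- rho * sigma1 * sigma2
    2 * (2 * covar) / (2 * covar + beta^2 * sigma1^2 + sigma2^2 / beta^2)
  }

  beta <- seq(0.1, 5, by = 0.001)
  beta[which.max(alpha2(beta))]^2    # about 2.5 = sigma2 / sigma1
  max(alpha2(beta))                  # about 0.8235 = 2 * rho / (1 + rho)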
 
This fact for the 2x2 case seems to be an argument for scaling the variances to one before starting the computation. However, that isn't the usual approach when alpha is used as an internal-consistency reliability statistic for tests, as far as I can tell.

Rater Reliability

For two observers who assign numerical ratings to the same sources, the correlation between ratings is a natural statistic to assess reliability with. Note that the correlation calculation rescales the respective variances to one, which will maximize alpha as noted above.
 
In fact, calculating such correlations is the aim of the intra-class correlation (ICC), which you can read more about here:
 
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 

The simplest version of the ICC works by partitioning variance into between-students and within-students, with the former being of most importance: we want to be able to distinguish cases, and the within-student variation is seen as error or noise due to imperfect measurement (rater disagreement in this case). The simple version of the ICC, which Shrout and Fleiss call ICC(1,1), is equivalent to the correlation between scores for the same student. See here for a derivation of that.
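As a rough sketch of that variance partition (simulated, hypothetical ratings; ICC(1,1) is built from the one-way ANOVA mean squares as in Shrout & Fleiss):

  set.seed(2)
  n_students <- 30; k_raters <- 2
  quality <- rnorm(n_students)                 # each student's underlying level
  ratings <- data.frame(
    student = factor(rep(1:n_students, each = k_raters)),
    rating  = rep(quality, each = k_raters) + rnorm(n_students * k_raters, sd = 0.5)
  )

  ms  <- anova(aov(rating ~ student, data = ratings))   # one-way ANOVA table
  msb <- ms["student", "Mean Sq"]                       # between-student mean square
  msw <- ms["Residuals", "Mean Sq"]                     # within-student (error) mean square
  (msb - msw) / (msb + (k_raters - 1) * msw)            # ICC(1,1)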
 
With that we can see that for the 2x2 case using variances = 1, the correlation \( \rho \), which is also the ICC, is related to Cronbach's alpha via

 
$$ \alpha = 2  \frac{ \rho } { \rho  + 1 } $$
 
You can graph it on Google:

[graph of \( \alpha = 2 \rho / (\rho + 1) \) for \( \rho \) between 0 and 1]
The point selected shows (upper right) that an alpha level of about .7 is equivalent to a within-student correlation of about .54. 
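Inverting the relation confirms that readout:

$$ \rho = \frac{\alpha}{2 - \alpha}, \qquad \text{so } \alpha = .7 \text{ gives } \rho = \frac{.7}{1.3} \approx .54. $$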

Implications

I occasionally use the alpha statistic when assessing survey items. I first correlate all the discrete-scale responses by respondent, dropping blanks on a case-by-case basis (in R it's cor(x, use = "pairwise.complete.obs")). Then I collect the top five or so items that correlate well and calculate alpha for those. Given the analysis above, I'll start using the correlation matrix instead of the covariance matrix for the alpha calculation, in order to standardize the metric across scales. This is because our item responses can vary quite a bit in standard deviation.
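In code, that workflow looks roughly like this (a sketch of the habit described above, not a packaged recipe; survey_responses stands in for a hypothetical data frame of numeric item responses, and "top five by average correlation" is my rough stand-in for picking the items that hang together):

  # pairwise-complete correlations across all items
  R_all <- cor(survey_responses, use = "pairwise.complete.obs")

  # keep the five items with the highest average correlation to the others
  avg_r <- (rowSums(R_all) - 1) / (ncol(R_all) - 1)
  keep  <- names(sort(avg_r, decreasing = TRUE))[1:5]

  # alpha computed from the correlation (not covariance) matrix of that subset
  R_sub <- R_all[keep, keep]
  n <- nrow(R_sub)
  (n / (n - 1)) * (sum(R_sub) - sum(diag(R_sub))) / sum(R_sub)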

In the paper cited about critical thinking, the authors find alphas > .7 and cite the prevailing wisdom that this is good enough reliability. I tracked down that .7 thing one time, and it's just arbitrary--like the .05 p-value ledge for "significance." Not only is it arbitrary, but the same value (.7) shows up for other statistics as a minimum threshold. For example, correlations. 

The meaninglessness of that .7 thing can be seen by glancing at the graph above. If a .7 ICC is required for good-enough reliability, that equates to a .82 alpha, not .7 (I just played around with the graph, but the formula gives it directly: \( 2(.7)/1.7 \approx .82 \)). See the contradiction? Moreover, the square root of the ICC can also be interpreted as a correlation, which makes the .7 threshold even more of a non sequitur.
 
If you took the German mini-quiz, then the correct answer is "Das Auto ist rot." Unless you live in Cologne, in which case it's "Der Auto." Or so I'm told by a native who has a reliability of .7.
 

Edits

I just came across this article, which is relevant.
 

Despite its popularity, [alpha] is not well understood; John and Soto (2007) call it the misunderstood giant of psychological research.
[...]

This dependence on the number of items in a scale and the degree to which they covary also means that [alpha] does not indicate validity, even though it is commonly used to do so in even flagship journals [...]
There's a discussion of an item independence assumption and much more.
