My academia.edu account emails me links to articles they think I'll like, and the recommendations are usually pretty good. A couple of weeks ago I came across a paper on the reliability of rubric ratings for critical thinking that way:
Rater agreement is a topic I've been interested in for a while, and the reliability of rubric ratings is important to the credibility of assessment work. I've worked with variance measures like the intra-class correlation and agreement statistics like Fleiss' kappa, but I don't recall seeing Cronbach's alpha used as a rater agreement statistic before. It usually comes up in assessing test items or survey components.
Here's the original reference.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. [link]
In reading this paper, it's apparent that Cronbach was attempting to resolve a thorny problem in testing theory--how to estimate the reliability of a standardized test. Intuitively, reliability is the tendency for test results to remain constant when we change unimportant details, like swapping out items for similar ones. More formally, reliability is intended to estimate how good the test is at measuring a test-taker's "true score."
The alpha coefficient is a generalization of a previous statistic called the Kuder-Richardson formula 20, which Cronbach notes is a mouthful and will never catch on. He was right!
Variance and Covariance
As a motivating example, imagine a test of German noun declension whose items are built around variants of "Das Auto ist rot" ("The car is red"), each with a different form of the article:
- Das Auto ist rot.
- Der Auto ist rot.
- Den Auto ist rot.
And so on. If we believe that there is a cohesive body of knowledge comprising German noun declension (some memorization and some rules), then we might imagine that the several test items on this subject might tell us about a test-taker's general ability. But if so, we would expect some consistency in the correct-response patterns. A knowledgeable test taker is likely to get them all correct, for example. On the other hand, for a set of questions that mixed grammar with math and Russian history, we would probably not assume that such correlations exist.
As a simple example, imagine that we have three test items we believe should tap into the same learned ability, and the items are scaled so that each of them has variance one. Then the covariance matrix is the same as the correlation matrix:
$$ \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{bmatrix} $$
The alpha statistic is based on the ratio of the sum of the off-diagonal correlations to the sum of the whole matrix, asking: how much of the total variance is covariance? The higher that ratio is, then--to a point--the more confidence we have that the items are in the same domain. Note that if the items are completely unrelated (correlations all zero), the numerator is zero, an indication of no inter-item reliability.
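To make that ratio concrete, here's a minimal Python sketch of my own (assuming numpy; the function name and the example correlation of .5 are mine, not from the paper) that computes alpha directly from an item covariance matrix:

```python
import numpy as np

def cronbach_alpha(cov):
    """Cronbach's alpha from an item covariance matrix:
    alpha = n/(n-1) * (sum of covariances / sum of the whole matrix)."""
    cov = np.asarray(cov, dtype=float)
    n = cov.shape[0]
    total = cov.sum()                 # sum of every entry
    off_diag = total - np.trace(cov)  # the covariance (off-diagonal) part only
    return (n / (n - 1)) * (off_diag / total)

# Three standardized items (variance one), so covariance equals correlation.
rho = 0.5  # an illustrative common correlation
C = np.array([[1.0, rho, rho],
              [rho, 1.0, rho],
              [rho, rho, 1.0]])
print(cronbach_alpha(C))          # 0.75

# Completely unrelated items: the covariance part is zero, so alpha is zero.
print(cronbach_alpha(np.eye(3)))  # 0.0
```

The \( \frac{n}{n-1} \) factor out front is where the scaling issue below comes from.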
Scaling
One immediate problem with this idea is that it depends on the number of items. Suppose that instead of three items we have \(n\), so the matrix is \(n \times n\), comprising \(n\) diagonal elements and \(n^2 - n\) covariances. Suppose that all of these entries are one (unit variances and perfect correlations). Then the ratio of the sum of covariance to the total is
$$ \frac{n^2 - n}{n^2} = \frac{n - 1}{n}. $$
Therefore, if we want to keep the scale of the alpha statistic between zero and one, we have to scale by \( \frac{n}{n-1} \). Awkward. It suggests that the statistic isn't just telling us about item consistency, but also about how many items we have. In fact, we can increase alpha just by adding more items. Suppose all the item correlations are .7, still assuming variances are one. Then
$$ \alpha = \frac{n}{n - 1} \cdot \frac{.7 (n^2 - n)}{.7 (n^2 - n) + n } = \frac{.7n}{.7n + .3} $$
which asymptotically approaches one as \( n \) grows large. Since we'd like to think of item correlation as the big deal here, it's not ideal that, with fixed correlations, the reliability measure depends on how many items there are. This phenomenon can be linked to the Spearman-Brown prophecy formula, but I didn't track that reference down.
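Here's a short sketch of that limit under the same assumptions (unit variances, a common correlation of .7); the helper name and the particular values of \( n \) are just for illustration:

```python
import numpy as np

def alpha_equal_correlation(n, rho=0.7):
    """Alpha for n standardized items that share a common correlation rho."""
    cov = np.full((n, n), rho)
    np.fill_diagonal(cov, 1.0)            # unit variances on the diagonal
    total = cov.sum()
    off_diag = total - np.trace(cov)
    return (n / (n - 1)) * (off_diag / total)

for n in (2, 3, 5, 10, 50, 200):
    closed_form = 0.7 * n / (0.7 * n + 0.3)   # the formula above
    print(n, round(alpha_equal_correlation(n), 3), round(closed_form, 3))
# alpha creeps toward one as n grows, even though the correlations never change
```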
This dependency on the number of items is manageable for tests and surveys as long as we keep it in mind, but it's more problematic for rater agreement.
It's fairly obvious that if all the items have the same variance, we can just divide through by that common variance to get the correlation matrix, so alpha will not change in this case. But what if items have different variances? Suppose we have correlations of .7 as before, but items 1, 2, and 3 have standard deviations of 1, 10, and 100, respectively. Then the covariance matrix is
$$ \begin{bmatrix} 1 & 7 & 70 \\ 7 & 100 & 700 \\ 70 & 700 & 10000 \end{bmatrix} $$
The same-variance version has alpha = .875, but the mixed-variance version has alpha = .2, so the variance of individual items matters. This presents another headache for the user of this statistic, and perhaps we should try to ensure that variances are close to the same in practice. Hey, what does the Wikipedia page advise?
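As a quick check on those two numbers, here's a sketch that builds both covariance matrices and computes alpha for each (again assuming numpy; this is the same ratio as in the earlier snippet, written in its more common 1 - trace/total form):

```python
import numpy as np

def cronbach_alpha(cov):
    """alpha = n/(n-1) * (1 - trace/total), equivalent to the ratio above."""
    n = cov.shape[0]
    return (n / (n - 1)) * (1 - np.trace(cov) / cov.sum())

rho = 0.7
sds = np.array([1.0, 10.0, 100.0])   # standard deviations of the three items

# Same-variance version: the covariance matrix is just the correlation matrix.
R = np.full((3, 3), rho)
np.fill_diagonal(R, 1.0)

# Mixed-variance version: scale each correlation by the two standard deviations.
C = R * np.outer(sds, sds)           # reproduces the matrix above

print(cronbach_alpha(R))   # 0.875
print(cronbach_alpha(C))   # 0.2
```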
Assumptions
Two Dimensions
Rater Reliability
Implications
Edits
Despite its popularity, \( \alpha \) is not well understood; John and Soto (2007) call it the misunderstood giant of psychological research.
This dependence on the number of items in a scale and the degree to which they covary also means that \( \alpha \) does not indicate validity, even though it is commonly used to do so even in flagship journals [...] There's a discussion of an item independence assumption and much more.