
Friday, November 05, 2021

Kuder-Richardson formula 20

My academia.edu account emails me links to articles they think I'll like, and the recommendations are usually pretty good. A couple of weeks ago I came across a paper on the reliability of rubric ratings for critical thinking that way:

Saxton, E., Belanger, S., & Becker, W. (2012). The Critical Thinking Analytic Rubric (CTAR): Investigating intra-rater and inter-rater reliability of a scoring mechanism for critical thinking performance assessments. Assessing Writing, 17(4), 251-270. [link]

Rater agreement is a topic I've been interested in for a while, and the reliability of rubric ratings is important to the credibility of assessment work. I've worked with variance measures like intra-class correlation, and agreement statistics like the Fleiss kappa, but I don't recall seeing Cronbach's alpha used as a rater agreement statistic before. It usually comes up in assessing test items or survey components. 

Here's the original reference.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. [link]

In reading this paper, it's apparent that Cronbach was attempting to resolve a thorny problem in testing theory--how to estimate the reliability of a standardized test. Intuitively, reliability is the tendency for the test results to remain constant when we change unimportant details, like swapping out items for similar ones. More formally, reliability is intended to estimate how good the test is at measuring a test-taker's "true score."

The alpha coefficient is a generalization of a previous statistic called the Kuder-Richardson formula 20, which Cronbach notes is a mouthful and will never catch on. He was right!

Variance and Covariance

The alpha statistic is a variance measure, specifically a ratio of covariance to variance. It's a cute idea. Imagine that we have several test items concerning German noun declension (the endings of words based on gender, case, number, and so on). Which is correct:
  1. Das Auto ist rot.
  2. Der Auto ist rot.
  3. Den Auto ist rot.

 And so on. If we believe that there is a cohesive body of knowledge comprising German noun declension (some memorization and some rules), then we might imagine that the several test items on this subject might tell us about a test-taker's general ability. But if so, we would expect some consistency in the correct-response patterns. A knowledgeable test taker is likely to get them all correct, for example. On the other hand, for a set of questions that mixed grammar with math and Russian history, we would probably not assume that such correlations exist. 

As a simple example, imagine that we have three test items we believe should tap into the same learned ability, and the items are scaled so that each of them has variance one. Then the covariance matrix is the same as the correlation matrix:

$$  \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{bmatrix} $$

The alpha statistic is based on the ratio of the sum of the off-diagonal elements (the covariances) to the sum of the whole matrix, asking how much of the total variance is covariance. The higher that ratio is, then--to a point--the more confidence we have that the items are in the same domain. Note that if the items are completely unrelated (correlations all zero), the numerator is zero, an indication of no inter-item reliability.

Scaling

One immediate problem with this idea is that it depends on the number of items. Suppose that instead of three items we have \(n\), so the matrix is \(n \times n\), comprising \(n\) diagonal elements and \(n^2 - n \) covariances. Suppose that these are all one. Then the ratio of the sum of covariance to the total is 

$$ \frac{n^2 - n}{n^2} = \frac{n - 1}{n}. $$

Therefore, if we want to keep the scale of the alpha statistic between zero and one, we have to scale by \( \frac{n}{n-1} \). Awkward. It suggests that the statistic isn't just telling us about item consistency, but also about how many items we have. In fact, we can increase alpha just by adding more items. Suppose all the item correlations are .7, still assuming variances are one. Then 

$$ \alpha = \frac{n}{n - 1} \cdot \frac{.7 (n^2 - n)}{.7 (n^2 - n) + n } = \frac{.7n}{.7n + .3} $$

which asymptotically approaches one as \( n \) grows large. Since we'd like to think of item correlation as the big deal here, it's not ideal that with fixed correlations, the reliability measure depends on how many items there are. This phenomenon is related to the Spearman-Brown prediction formula, but I didn't track that reference down.
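To see the item-count effect numerically, here's a minimal R sketch (R being what I use for this sort of thing anyway) that computes alpha directly from the covariance-ratio definition and compares it to the closed form above; the helper function name is mine, not a standard one.

  # alpha = (n / (n - 1)) * (sum of off-diagonal covariances / sum of all entries)
  alpha_from_cov <- function(S) {
    n <- nrow(S)
    (n / (n - 1)) * (sum(S) - sum(diag(S))) / sum(S)
  }

  # unit variances, all correlations .7, for increasing numbers of items
  for (n in c(3, 5, 10, 50)) {
    R <- matrix(0.7, n, n)
    diag(R) <- 1
    cat(n, round(alpha_from_cov(R), 3), round(0.7 * n / (0.7 * n + 0.3), 3), "\n")
  }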

This dependency on the number of items is manageable for tests and surveys as long as we keep it in mind, but it's more problematic for rater agreement.

It's fairly obvious that if all the items have the same variance, we can divide through by that common variance to get the correlation matrix, and alpha will not change. But what if items have different variances? Suppose we have correlations of .7 as before, but items 1, 2, and 3 have standard deviations of 1, 10, and 100, respectively. Then the covariance matrix is

$$  \begin{bmatrix} 1 & 7 & 70 \\ 7 & 100 & 700 \\ 70 & 700 & 10000 \end{bmatrix} $$

The same-variance version has alpha = .875, but the mixed-variance version has alpha = .2, so the variance of individual items matters. This presents another headache for the user of this statistic, and perhaps we should try to ensure that variances are close to the same in practice. Hey, what does the wikipedia page advise?
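Before moving on, here's a quick check of those two alpha values (.875 and .2), a sketch using the same covariance-ratio definition of alpha and the standard deviations from the example:

  alpha_from_cov <- function(S) {   # same helper as in the sketch above
    n <- nrow(S)
    (n / (n - 1)) * (sum(S) - sum(diag(S))) / sum(S)
  }

  R <- matrix(0.7, 3, 3); diag(R) <- 1   # all correlations .7, unit variances
  sds <- c(1, 10, 100)                   # mixed standard deviations
  S <- diag(sds) %*% R %*% diag(sds)     # the covariance matrix shown above

  alpha_from_cov(R)   # 0.875 for the same-variance version
  alpha_from_cov(S)   # 0.2 for the mixed-variance version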

Assumptions

I've treated the alpha statistic as a math object, but its roots are in standardized testing. Cronbach's paper addresses a problem extant at that time: how to estimate the reliability of a test, or at least of the items on a test that are intended to measure the same skill. From my browsing of that paper, it seems that one popular method at the time was to split the items in half and correlate scores from one half with the other. This is unsatisfactory, however, because the result depends on how we split the test. Alpha is shown in Cronbach's paper to be the average over all such choices, so it standardizes the measure.

However, there's more to this story, because the wiki page describes assumptions about the test item statistics that should be satisfied before alpha is a credible measure. The strictest assumption is called "tau equivalence," in which "data have equal covariances, but their variances may have different values." Note that this will always be true if we only have two items (or raters, as in the critical thinking paper), but generally I would think this is a stretch. 

Never mind the implausibility, however. Does tau-equivalence fix the problem identified in the previous section? It seems unreasonable that the reliability of a test should change if I simply change the scores for items. Suppose I've got a survey with 1-5 Likert-type scales, and I compute alpha. I don't like the result, so I arbitrarily change one of the scales to 2-10 by doubling the values, and get a better alpha. That adjustment is obviously not desirable in a measure of reliability. But it's possible without preconditions on the covariance matrix. Does tau-equivalence prevent such shenanigans?
 
For a discussion of the problems caused by the equal-covariance assumption and others see 

Trizano-Hermosilla, I., & Alvarado, J. M. (2016). Best alternatives to Cronbach's alpha reliability in realistic conditions: congeneric and asymmetrical measurements. Frontiers in Psychology, 7, 769. [link]

The authors make the point that alpha keeps getting used, long after more reasonable methods have emerged. I think this is particularly true for rater reliability, which seems like an odd use for alpha.
 

Two Dimensions

The 2x2 case is useful, since it's automatically tau-equivalent (there is only one covariance element, so it's equal to itself). Suppose we have item standard deviations of \( \sigma_1, \sigma_2 \) and a correlation of \( \rho_{12} \). Then the covariance will remain the same if we scale one of the items proportionally by a constant \( \beta \) and the other inversely so. In detail, we have

$$  \begin{bmatrix} \beta^2 \sigma_1^2 & \rho_{12} \sigma_1 \sigma_2 \\ \rho_{12} \sigma_1 \sigma_2 & \frac{ \sigma_2^2}{\beta^2} \end{bmatrix} $$
 
Then 
 
$$ \alpha = 2  \frac{2 \rho_{12} \sigma_1 \sigma_2}{2 \rho_{12} \sigma_1 \sigma_2 +  \beta^2 \sigma_1^2 + \frac{ \sigma_2^2}{\beta^2} } $$

The two out front is that fudge factor we have to include to keep the values in range. Consider everything fixed except for \( \beta \), so the extreme values of alpha will occur when the denominator is largest or smallest. Taking the derivative of the denominator, setting to zero, and solving gives

$$ \beta = \pm \sqrt{\frac{\sigma_2}{\sigma_1}} $$
 
Since the negative solution doesn't make sense in this context, we can see with a little more effort that alpha is maximized when the two variances are each equal to \( \sigma_1 \sigma_2 \) so that we have
 
$$  \begin{bmatrix} \sigma_1 \sigma_2  & \rho_{12} \sigma_1 \sigma_2 \\ \rho_{12} \sigma_1 \sigma_2 & \sigma_1 \sigma_2 \end{bmatrix}. $$
 
At that point we can divide the whole thing by  \( \sigma_1 \sigma_2 \), which won't change alpha, to get the correlation matrix
 
 
$$  \begin{bmatrix} 1  & \rho  \\ \rho  & 1 \end{bmatrix} $$
 
So for the 2x2 case, the largest alpha occurs when the variances are equal, and we can just consider the correlation. My guess is that this is true more generally, and that a convexity argument could show it, e.g. via Jensen's Inequality. But that's just a guess.
 
This fact for the 2x2 case seems to be an argument for scaling the variances to one before starting the computation. However, that isn't the usual approach used when assessing alpha as a test-retest reliability statistic, as far as I can tell.   
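Here's a numerical check of the 2x2 claim, a sketch with arbitrary made-up values \( \sigma_1 = 2, \sigma_2 = 5, \rho_{12} = .6 \): sweep over \( \beta \), compute alpha from the covariance matrix, and compare the maximizer to \( \sqrt{\sigma_2 / \sigma_1} \).

  alpha_from_cov <- function(S) {
    n <- nrow(S)
    (n / (n - 1)) * (sum(S) - sum(diag(S))) / sum(S)
  }

  sigma1 <- 2; sigma2 <- 5; rho <- 0.6     # arbitrary example values
  beta <- seq(0.2, 3, by = 0.001)
  alpha <- sapply(beta, function(b) {
    S <- matrix(c(b^2 * sigma1^2,        rho * sigma1 * sigma2,
                  rho * sigma1 * sigma2, sigma2^2 / b^2), 2, 2)
    alpha_from_cov(S)
  })

  beta[which.max(alpha)]   # about 1.58
  sqrt(sigma2 / sigma1)    # 1.5811..., the predicted maximizer
  max(alpha)               # 0.75, which is 2 * rho / (rho + 1)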

Rater Reliability

For two observers who assign numerical ratings to the same sources, the correlation between ratings is a natural statistic to assess reliability with. Note that the correlation calculation rescales the respective variances to one, which will maximize alpha as noted above.
 
In fact, calculating such correlations is the aim of the intra-class correlation (ICC), which you can read more about here:
 
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 

The simplest version of ICC works by partitioning variance into between-students and within-students, with the former being of most importance: we want to be able to distinguish cases, and the within-student variation is seen as error or noise due to imperfect measurement (rater disagreement in this case). The simple version of the ICC, which Shrout calls ICC(1,1), is equivalent to the correlation between scores for the same student. See here for a derivation of that.
 
With that we can see that for the 2x2 case using variances = 1, the correlation \( \rho \), which is also the ICC, is related to Cronbach's alpha via

 
$$ \alpha = 2  \frac{ \rho } { \rho  + 1 } $$
 
You can graph it on google:


The point selected shows (upper right) that an alpha level of about .7 is equivalent to a within-student correlation of about .54. 
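Rather than eyeballing the graph, you can invert the formula: \( \rho = \alpha / (2 - \alpha) \). A two-line sketch:

  alpha_from_rho <- function(rho) 2 * rho / (rho + 1)
  rho_from_alpha <- function(a)   a / (2 - a)

  rho_from_alpha(0.7)   # 0.538..., so alpha = .7 is an ICC of about .54
  alpha_from_rho(0.7)   # 0.823..., so an ICC of .7 corresponds to an alpha of about .82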

Implications

I occasionally use the alpha statistic when assessing survey items. I first correlate all the discrete-scale responses by respondent, dropping blanks on a case-by-case basis (in R it's cor(x, use = "pairwise.complete.obs")). Then I collect the top five or so items that are correlated and calculate alpha for those. Given the analysis above, I'll start using the correlation matrix instead of the covariance matrix for the alpha calculation, in order to standardize the metric across scales. This is because our item responses can vary in standard deviation quite a bit.
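Here's a sketch of that workflow, assuming x is the data frame of discrete-scale responses (as in the cor() call above) and that five items is the target; none of this is a standard recipe, just the steps described in the paragraph.

  alpha_from_cov <- function(S) {
    n <- nrow(S)
    (n / (n - 1)) * (sum(S) - sum(diag(S))) / sum(S)
  }

  R <- cor(x, use = "pairwise.complete.obs")

  # average correlation of each item with the others, as a rough screen
  avg_r <- (rowSums(R) - 1) / (ncol(R) - 1)
  top_items <- names(sort(avg_r, decreasing = TRUE))[1:5]

  # alpha from the correlation (not covariance) matrix of the chosen items
  alpha_from_cov(R[top_items, top_items])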

In the paper cited about critical thinking, the authors find alphas > .7 and cite the prevailing wisdom that this is good enough reliability. I tracked down that .7 thing one time, and it's just arbitrary--like the .05 p-value ledge for "significance." Not only is it arbitrary, but the same value (.7) shows up for other statistics as a minimum threshold. For example, correlations. 

The meaninglessness of that .7 thing can be seen here by glancing at the graph above. If a .7 ICC is required for good enough reliability, that equates to a .82 alpha, not .7 (I just played around with the graph, but you could invert the formula to compute exactly). See the contradiction? Moreover, the square root of the ICC can also be interpreted as a correlation, which makes the .7 threshold even more of a non sequitur.
 
If you took the German mini-quiz, then the correct answer is "Das Auto ist rot." Unless you live in Cologne, in which case it's "Der Auto." Or so I'm told by a native who has a reliability of .7.
 

Edits

I just came across this article, which is relevant.
 

Despite its popularity, \( \alpha \) is not well understood; John and Soto (2007) call it the misunderstood giant of psychological research.
[...]

This dependence on the number of items in a scale and the degree to which they covary also means that \( \alpha \) does not indicate validity, even though it is commonly used to do so in even flagship journals [...]
There's a discussion of an item independence assumption and much more.

Friday, November 22, 2019

Problems with Rubric Data

Introduction

Within the world of educational assessment, rubrics play a large role in the attempt to turn student learning into numbers. Usually this starts with some work students have done, like research papers for an assignment (often called "artifacts" for some reason). These are then scored (i.e. graded) using a grid-like description of performance levels. Here's an outline I found at Berkeley's Center for Teaching and Learning:


The scale levels are almost always described in objective language relative to some absolute standard, like "The paper has no language usage errors (spelling, grammar, punctuation)." 

Raters review each student work sample and assign a level that corresponds to the description, using their best judgment. There are usually four or five separate attributes to rate for each sample. For written work these might be:
  • language correctness
  • organization
  • style
  • follows genre conventions (e.g. a letter to the editor doesn't look like a research paper).
The result of this work (and it is usually a lot of work) is ideally a data set that includes the identification of the student being rated, the identification of the rater, and the numerical ratings for each attribute for each sample. If papers are being rated this way, the final data set might begin like this:


The data needs to be summarized in order to make sense of it. Often this is accomplished by averaging the scores or by finding the percent of scores greater than a threshold (e.g. scores greater than 3 are assumed to meet a minimum standard).
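To make that concrete, here's a sketch of what such a data set and the two common summaries might look like; every name and number below is invented.

  ratings <- data.frame(
    student_id   = c("S01", "S01", "S02", "S02", "S03", "S03"),
    rater_id     = c("R1",  "R2",  "R1",  "R2",  "R1",  "R2"),
    language     = c(3, 4, 2, 2, 5, 4),
    organization = c(4, 4, 3, 2, 5, 5),
    style        = c(3, 3, 2, 3, 4, 4)
  )

  # summary 1: average rating per attribute
  colMeans(ratings[, c("language", "organization", "style")])

  # summary 2: percent of scores above a threshold (here, scores greater than 3)
  mean(ratings$language > 3)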

This is where the problems begin.

Problems with Rubric Data


Problem 1. There may not be a there there

The rubric rating process assumes that there is enough evidence within the piece of writing to make an informed judgment. In the case of written work, this is a sliding scale. For example, the length of the paper is probably proportional to the amount of evidence we have, so we theoretically should be able to make better decisions about longer papers. I don't know if anyone has ever tested this, but it's possible: measure the inter-rater reliability for each paper that has multiple readers and see if that correlates with the number of words in the paper. 
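One way to run that check, sketched with hypothetical columns (paper_id, rating, word_count): treat the spread of ratings within a paper as a rough inverse measure of agreement and correlate it with length.

  # paper_ratings: one row per (paper, rater), columns paper_id, rating, word_count
  spread  <- aggregate(rating ~ paper_id, data = paper_ratings, FUN = sd)
  lengths <- aggregate(word_count ~ paper_id, data = paper_ratings, FUN = function(w) w[1])
  merged  <- merge(spread, lengths, by = "paper_id")

  # a negative correlation would suggest longer papers get more consistent ratings
  cor(merged$word_count, merged$rating, use = "complete.obs")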

If papers that simply lack evidence are rated lower (rather than the ratings being left blank, say), then the scale isn't being used consistently: we're conflating the amount of evidence with the qualities of the evidence. John Hathcoat did a nice presentation about this issue at the AAC&U meeting last year.


When I design rubrics now, I make them two-dimensional: one assesses how much basis there is for judgment, and the other the quality. There's a dependency relationship between the two: we can't assess quality lacking sufficient evidence.

Problem 2. Growth versus Sorting

Raters of the student work are naturally influenced by what they see during the rating. That is, our putative objectivity and ideal rubric levels can go out the window when confronted with reality. Raters may simply rank the student work from worst to best using the scale they are given. The logic, which I've heard from raters, goes like this: "Paper 4 is obviously better than paper 3, so the score for paper 4 should be higher than the score for paper 3." 

What follows from this (very natural) rating style is a sorting of the papers. If the sorting is any good, the poor papers have low scores and the good papers have high scores. This sounds like a good thing, but it's not what was advertised. The intention of a so-called analytic rubric is to produce absolute measures of quality, not relative ones. Why does that matter?

If we just sort the students by paper quality, we'll get pretty much the same answer every time. The high grade earners will end up at the top of the pile and the low grade earners at the bottom. This pattern will, on average, persist over the years of college. The A-student freshmen will get 5s on their papers, and when they're seniors, they'll still be getting 5s. We can't measure progress with sorting.

One project I worked on gathered rubric ratings of student work over a two-semester sequence, where the instruction and training of instructors was closely supervised. Everyone was trained on the analytic rubric, which was administered online to capture many thousands of work samples with rubric ratings.

The graph below shows the average change in rubric quality ratings over six written assignments (comprising two terms). 

The scores start low for the first assignment, then improve (higher is better) throughout the first term. This is evidence that within this one class, the analytic rubric is working to some extent, although the absolute difference in average score over the term (about 3.0 to about 3.3) is quite low. At first glance, it's mostly sorting with a little growth component added in.

But when we get to the second term, the scores "reset" to a lower level than they finished the first term at. We could interpret this in various ways, but the scale clearly is not functioning as an analytic rubric is intended to over the two terms, or else learning quickly evaporates over the holidays. 

By way of contrast, here's a summary of thousands of ratings from a different project that does appear to show growth over time. As a bonus, the sorting effect is also demonstrated by disaggregating the ratings by high school GPA.


More on that project another time, but you can read about it here and here (requires access). 

Problem 3. Rater Agreement

The basic idea of rubric rating is common sense: we look at some stuff and then sort it into categories based on the instructions. But the invention of statistics was necessary because it turns out that common sense isn't a very good guide for things like this. The problem in this case is that the assigned categories (rubric ratings) may be meaningless. For example, we wouldn't simply flip a coin to determine if a student writing portfolio meets the graduation requirement for quality. But it could be that our elaborate rubric rating system has the same statistical properties as coin-flipping. We don't know unless we check.

The general concept is measurement reliability. It's complicated, and there are multiple measures to choose from. Each of those comes with statistical assumptions about the data or intended use of the statistic. There are "best practices" that don't make any sense (like kappa must be > .7), and there is a debate within the research community about how to resolve "paradoxes" related to unbalanced rating samples. I don't have a link, but see this paper on that topic:
Krippendorff, K. (2013). Commentary: A dissenting view on so-called paradoxes of reliability coefficients. Annals of the International Communication Association, 36(1), 481-499.
For my own analysis and some results I'll refer you to this paper, which won an award from the Association for Institutional Research. 

A short version of the reliability problem for rubric rating is that:
  • it's vital to understand the reliability before using the results
  • it's difficult and confusing to figure out how to do that
  • reliability problems often can't easily be fixed
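As one concrete example of checking: a sketch of raw percent agreement and Cohen's kappa for two raters, computed by hand on made-up scores. Packages such as irr and psych have functions for this, but the arithmetic is short enough to see directly.

  r1 <- c(3, 4, 2, 5, 3, 3, 4, 2, 1, 4)   # rater 1's scores (made up)
  r2 <- c(3, 4, 3, 5, 2, 3, 4, 2, 2, 4)   # rater 2's scores (made up)

  p_obs <- mean(r1 == r2)                 # raw percent agreement

  # chance agreement: product of the raters' marginal proportions, summed over levels
  lev <- sort(unique(c(r1, r2)))
  p1 <- table(factor(r1, levels = lev)) / length(r1)
  p2 <- table(factor(r2, levels = lev)) / length(r2)
  p_chance <- sum(p1 * p2)

  kappa <- (p_obs - p_chance) / (1 - p_chance)
  c(agreement = p_obs, kappa = kappa)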

Problem 4. Data Types

What we get from rubric scales is ordinal data, with the numbers (or other descriptors) serving the role of symbols, where we understand that there's a progression from lesser to greater demonstration of quality: A > B > C. It's common practice, as I did in the example for Problem 2, to simply average the numerical output and call that a measure of learning. 

The averages "kind of" work, but they make assumptions about the data that aren't necessarily true. That approach also oversimplifies the data set and doesn't take advantage of richer statistical methods. Here are two references:
Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328-348. [pdf]
and 
Engelhard Jr, G., & Wind, S. (2017). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Routledge. [Amazon.com]
The same idea applies to analyzing survey items on ordinal scales (e.g. disagree / agree).

Some advantages of using these more advanced methods are that we get information on:
  • rater bias
  • student ability
  • qualitative differences between rubric dimensions
  • how the ordinal scale maps to an ideal "latent" scale
Since the measurement scale for all of these is the same, it's called "invariant," which you can read more about in the second reference.
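As a sketch of what the ordinal-model route looks like, here is a proportional-odds fit using MASS::polr; the data frame df and its columns (rating, rater_id, term) are hypothetical stand-ins for a real ratings file.

  library(MASS)

  # the rating must be an ordered factor, not a number, for an ordinal model
  df$rating_f <- factor(df$rating, levels = 1:5, ordered = TRUE)

  # does latent quality shift across terms, allowing for rater effects?
  fit <- polr(rating_f ~ term + rater_id, data = df, Hess = TRUE)
  summary(fit)

Treating raters as fixed effects is the crudest version; a mixed model (e.g. clmm in the ordinal package) would be the next step if you want rater random effects.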

Problem 5. Sample Size

It's time-consuming to use rubrics to rate student work samples. If the rating process isn't built into ordinary grading, for example, then this is an additional cost to get the data. That cost can limit sample sizes. How big a sample do you need? Probably at least 200.

That estimate comes from my own work, for example the power analysis shown below. It shows simulations of statistical tests to see how much actual difference in average ratings is required before I can detect it from samples. In the simulation, two sets of actual student ratings were drawn from a large set of them (several thousand). In one set (A) I took the average. In the other set (B) I took the average and then added an artificial effect size to simulate learning over time, which you can see along the top of the table below. Then I automated t-tests to see if the p-value was less than .05, which would suggest a non-zero difference in sample averages--detecting the growth. The numbers in the table show the rate at which the t-test successfully identified a non-zero difference. The slide below comes from a presentation I did at the SACSCOC Summer Institute.



For context, a .4 change is about what we expect over a year in college. With 100 samples each of A and B, where the average of B was artificially inflated by .4 to simulate a year's development, the t-test will get the correct answer 76% of the time. The rest of the time, there will be no conclusion (at alpha = .05) that there's a difference. Rubric ratings are often statistically noisy, and to average out the noise requires enough samples to detect actual effects.
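Here's a stripped-down version of that simulation. Since I can't share the original pool of ratings, it draws from an invented discrete 1-5 distribution, so the exact detection rates will differ from the table, but the shape of the result is the same.

  set.seed(1)
  pool <- sample(1:5, 5000, replace = TRUE,
                 prob = c(.05, .15, .35, .30, .15))   # stand-in for the real rating pool

  power_sim <- function(n, effect, reps = 2000) {
    hits <- replicate(reps, {
      a <- sample(pool, n, replace = TRUE)
      b <- sample(pool, n, replace = TRUE) + effect   # artificial growth added to group B
      t.test(a, b)$p.value < .05
    })
    mean(hits)   # fraction of simulations where the t-test detects the difference
  }

  power_sim(n = 100, effect = 0.4)
  power_sim(n = 200, effect = 0.4)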

For an article that essentially reaches the same conclusion, see:
Bacon, D. R., & Stewart, K. A. (2017). Why assessment will never work at many business schools: A call for better utilization of pedagogical research. Journal of Management Education, 41(2), 181-200. [pdf]
Because most rubric rating projects end up with a lot fewer than 200 samples, it's fair to conclude that most of them don't have sufficient data to do the kind of statistics we need to do--including testing reliability and checking for differences within student types. 

It's possible that we can get by with smaller samples if we use more sophisticated methods, like those found in Problem 4. I'll be working on that for a paper due in February, so stay tuned.

Problem 6. Standardization

Standardization in testing is supposed to help with reliability by reducing extraneous variables. In the case of rubric ratings, this often means restricting the student work to a single item: one research paper per student, for example. But assessing a student's writing ability is different from evaluating a single written work. In the most extreme case, the student might have just bought the paper on the Internet, but even under usual circumstances, using a single paper as a proxy for writing ability is a crude approximation. Consider a student who is a strong writer, but chooses to invest study time in another class. So she gets up early the day the paper is due, knocks out one draft, and gets an easy A-. A different student works very hard on draft after draft, sits for hours in the writing lab, visits the professor for feedback, and--after all that--earns an A-. Does the grade (or equivalent rubric rating) tell us anything about how that rating was earned? No--for that we need another source of information. A survey of students is one possibility. For a different approach, see my references at the end of Problem 2.

Some approaches review a whole portfolio of student work to get more information about writing, for example. But rating a whole portfolio is an even bigger burden on the raters, reducing sample size and--because of the inevitable non-standardization of samples--lowering reliability.

Problem 7. Confounding

The AAC&U started a rubric-construction project with its LEAP initiative some years ago. The result is a set of VALUE rubrics. They also provide scorer training and will score your papers externally for a fee. The paper cited below was published last year. It's based on a data set that's large enough to work with, and includes students at different points in their college careers, so it's conceivable to estimate growth over time. This assumes that Problem 2 doesn't bias the data too much. 
Sullivan, D. F., & McConnell, K. D. (2018). It's the Assignments—A Ubiquitous and Inexpensive Strategy to Significantly Improve Higher-Order Learning. Change: The Magazine of Higher Learning, 50(5), 16-23 [requires access]
The authors found that they could detect plausible evidence of growth over time, but only after they included assignment difficulty as an explanatory variable. You can read the details in the paper.

If this finding holds up in subsequent work, it's not just an example of needing more information than what's in the rubric data to understand it; it's a subtle version of Problem 1. Suppose freshmen and seniors are both in a general education class that assigns a paper appropriate to freshmen. Assuming everything works perfectly with the rubric rating, we still may not be able to distinguish the two classes simply because the assignment isn't challenging enough to require demonstration of senior-level work. This is a cute idea, but it may not be true--we'll have to see what other studies turn up.

There are two lessons to draw from this. One is that it's essential to keep student IDs and course information in order to build explanatory models with the data. From my experience, college GPA correlates with practically every kind of assessment measure. So if we analyze rubric scores without taking that variable into account, we're likely to misunderstand the results.

The other lesson comes from the reaction of the assessment community to this paper: there was none. The VALUE rubrics are widely adopted across the country, so there are probably at least a thousand assessment offices that employ them. The paper challenges the validity of using all that data without taking into account course difficulty. One could have naturally expected a lot of heated discussion about this topic. I haven't seen anything at all, which to me is another sign that assessment offices mostly operate on the assumption that rubric ratings just work, rather than testing them to see if they do work.

Problem 8. Dimensionality

While most rubrics have five or so attributes to rate, the data that comes from these tends to be highly correlated. A dimensional analysis like Principal Component Analysis reveals the structure of these correlations. Most often in my experience the primary component looks like a holistic score, and captures the majority of the variation in ratings. This is important, because you may be working your raters too hard, trading away additional sample size for redundant data. If the correlations are too high, you're not measuring what you think you are.

To connect to Problem 7: correlate the first principal component with student GPA to see how much of what you're measuring with the rubric is already accounted for in grade averages.
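A sketch of both checks, assuming rubric_scores is a matrix with one column per rubric attribute and gpa is a vector aligned to its rows (both hypothetical names):

  pc <- prcomp(rubric_scores, scale. = TRUE)

  summary(pc)        # proportion of variance captured by each component
  pc$rotation[, 1]   # roughly uniform loadings on PC1 suggest a holistic score

  # Problem 7 check: how much of the first component is already in the GPA?
  cor(pc$x[, 1], gpa, use = "complete.obs")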

Problem 9. Validity

Do the ratings have any relationship to reality outside the rating context? We see reports that employers want strong problem-solving skills. Do our rubric ratings of "problem-solving" have any useful relationship to the way the employers understand it? There's no easy way to answer that question--it's another research project, e.g. start by surveying student internship supervisors. 

The trap to avoid is assuming that, just because we have all this language about what we think we're measuring (e.g. the words on the rubric), the numbers actually measure that. The question may not even make sense: if the reliability is too low, we're not measuring anything at all. And if the reviewers that matter--observers of our graduates later in life--don't themselves agree about "problem-solving" or whatever it is, the measurement problem is doomed from the start.

If we just take rating averages and announce to the world that "our student problem-solving ability improved on average from 3.2 to 3.5 last year," it's almost certainly not going to be understood in a legitimate way. This kind of thing leads people to think they know something when they don't. 

For a nice discussion of the complexities of this topic see:
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12. [pdf]

Discussion

In higher education administration, particularly in assessment offices, rubrics are used as an easy way to plausibly turn observations of student work products or performances into data. After being involved with conferences and peer reviews for 18 years, I find it apparent that the complexities of dealing with this kind of data are almost uniformly ignored. Small, noisy data sets that are not critically examined are used to produce "measures of learning" that are almost certainly not that. It is possible to learn about student development from rubric ratings: there are plenty of academic journals that publish such research. It's just not common in assessment circles to see appropriately skeptical inquiry into this type of data; rather, an unspoken faith in methods seems to rule the day.

Assessment of learning in higher education administration needs to pivot from faith-based research to data science, and the sooner the better.

Wednesday, April 28, 2010

Reflection on Generalization of Results

Blogging is sometimes painful.  The source of the discomfort is the airing of ideas and opinions that I might find ridiculous later (like maybe the next day).  Having an eternal memorial to one's dumb ideas is not attractive.  I suppose the only remedy is public reflection, which is no less discomforting.  To wit...

Yesterday I wrote:
The view from a discipline expert is naturally dubious of the claims that learning can be weighed up like a sack of potatoes, and the neural states of a hundred billion brain cells can be summarized in a seven-bit statistic with an accuracy and implicit model that can predict future behavior in some important respect.  Aren't critical thinkers supposed to be skeptical of claims like that?
I've mulled this over for a day.  A counter-argument might go like this:  A sack of potatoes has a very large number of atoms in it, and yet we can reduce those down to a single meaningful statistic (weight or mass) that is a statistical parameter determined from multiple measurements.  The true value of this parameter is presumed to exist, but we cannot know it except within some error bounds with some degree of probabilistic certainty.  This is not different from, say, an IQ test in those particulars.

I think that there is a difference, however.  Let's start with the basic assumption at work: that our neighborhood of the universe is reliable, meaning that if we repeat an experiment with the same initial conditions, we'll get the same outcomes.  Or, failing that, we'll get a well-defined distribution of outcomes (like the double slit experiment in quantum mechanics).  Moreover, we additionally assume that similar experiments yield similar results for a significant subset of all experiments.  This "smoothness" assumption grants us license to do inductive reasoning, to generalize results we have seen to ones we have not.  Without these assumptions, it's hard to see how we could do science. Restating the assumptions:
1. Reliability:  An experiment under the same conditions gives the same results, or (weaker version) a frequency distribution with relatively low entropy.

2. Continuity:  Experiments with "nearby" initial conditions give "nearby" results.
Condition 1 grants us license to assume the experiment relates to the physical universe.  If I'm the only one who ever sees unicorns in the yard, it's hard to justify the universality of the statement.  Condition 2 allows us to make inductive generalizations, which is necessary to make meaningful predictions about the future.  This is why the laws of physics are so powerful--with just a few descriptions, validated by a finite number of experiments, we can predict an infinite number of outcomes accurately across a landscape of experimental possibilities.

My implicit point in the quote above is that outcomes assessment may satisfy the first condition but not the second.  Let's look at an example or two.
Example.  A grade school teacher shows students how the times table works, and begins assessing them daily with a timed test to see how much they know.  This may be pretty reliable--if Tatiana doesn't know her 7s, she'll likely get them wrong consistently.  What is the continuity of the outcome?  Once a student routinely gets 100% on the test, what can we say?  We can say that Tatiana has learned her times tables (to 10 or whatever), and that seems like an accurate statement.  If I said instead that Tatiana can multiply numbers, this may or may not be true.  Maybe she doesn't know how to carry yet, and so can't multiply two-digit numbers.  Therefore, the result is not very generalizable. 
Example.  A university administers a general "critical thinking" standardized test to graduating students.  Careful trials have shown a reasonable level of reliability.  What is the continuity of the outcome?  If we say "our students who took the test scored x% on average," that's a statement of fact.  How far can we generalize?  I can argue statistically that the other students would have had similar scores.  I may be nervous about that, however, since I had to bribe students to take the test.  Can I make a general statement about the skill set students have learned?  Can I say "our graduates have demonstrated on average that they can think critically"?
To answer the last question we have to know the connection between the test and what's more generally defined as critical thinking.  This is a validity question.  But what we see on standardized tests are very particular types of items, not a whole spectrum of "critical thinking" across disciplines.  In order to be generally administered, they probably have to be that way. 

Can I generalize from one of these tests and say that good critical thinkers in, say, forming an argument, are also good critical thinkers in finding a mathematical proof or synthesizing an organic molecule or translating from Sanskrit or creating an advertisement or critiquing a poem?  I don't think so.  I think there is little generality between these.  Otherwise disciplines would not require special study--just learn general critical thinking and you're good to go.

I don't think the issue of generalization (what I called continuity)  in testing gets enough attention.  We talk about "test validity," which wallpapers over the issue that validity is really about a proposition.   How general those propositions can be and still be valid should be the central question.  When test-makers tell us they're going to measure the "value added" by our curriculum, there ought to be a bunch of technical work that shows exactly what that means.  In the most narrow sense, it's some statistic that gets crunched, and is only a data-compressed snapshot of an empirical observation.  But the intent is clearly to generalize that statistic into something far grander in meaning, in relation to the real world. 

Test makers don't have to do that work because of the sleight of hand between technical language and everyday speech.  We naturally conjure an image of what "value added" means--we know what the words mean individually, and can put them together.  Left unanalyzed, this sense is misleading.  The obvious way to see if that generalization can be made would be to scientifically survey everyone involved to see if the general-language notion of "value added" lines up nicely with the technical one.  This wouldn't be hard to do.  Suppose they are negatively correlated.  Wouldn't we be interested in that?

Harking back to the example in the quote, weighing potatoes under normal conditions satisfies both conditions.  With a good scale, I'll get very similar results every time I weigh.  And if I add a bit more spud I get a bit more weight.  So it's pretty reliable and continuous.  But not under all conditions.  If I wait long enough, water will evaporate out or bugs will eat them, changing the measurement.  Or if I take them into orbit, the scale will read differently.  The limits of generalization are trickier when talking about learning outcomes.  Even if we assume that under identical conditions, identical results will occur (condition 1) the continuity condition is hard to argue for.  First, we have to say what we mean by "nearby" experiments.  This is simple for weight measurements, but not for thinking exercises.  Is performance on a standardized test "near" the same activity in a job capacity?  Is writing "near" reading?  It seems to me that this kind of topological mapping would be a really useful enterprise for higher education to do.  At the simplest level it could just be a big correlation matrix that is reliably verified.  As it is, the implicit claims of generalizability of the standardized tests of thinking ability are too much to take on faith. 

So, I stand by the quoted paragraph. It just took some thinking about why.

Thursday, April 22, 2010

Validity and Measurement

My wife is in charge of an "applied research" center here at the university, and ordered some books on research to satisfy her curiosity about how the other half lives (she's a lit geek).  I browsed them, looking for definitions of validity and measurement.   

From The Research Methods Knowledge Base by William M. K. Trochim and James P. Donnelly:
When people think about validity in research, they tend to think in term of research components.  You might say that a measure is a valid one, that a  valid sample was drawn, or that the design had strong validity, but all of those statements are technically incorrect.  Measures, samples, and designs don't have validity--only propositions can be said to be valid.  Technically, you should say that a measure leads to valid conclusions or that a sample enables valid inferences, and so on.  It is a proposition, inference, or conclusion that can have validity.
 This is the usual definition, which most people seem to ignore.  In casual conversation, people in assessment land and in psychology say things like "use a valid measurement" routinely in my experience.  More problematic is the usual linkage between validity and reliability.  Reliability is repeatability of results, or as Wikipedia puts it nicely "the consistency of a set of measurements or measuring instrument [...]"

Do you notice anything illogical here?  If validity is about propositions (about some instrument), and reliability is a property of the test itself, we can't justify the common assertion that validity requires reliability.  It's like saying that a thermometer needs someone to read it correctly--there are two issues conflated.

Example:  Jim Bob steps onto the balcony of his hotel room on his first evening of an all-expense paid trip to Paris for winning the Bugtussel Bowling Championship.  He looks at the thermometer nailed to the door frame, and is amazed to see that it is 10 degrees.  It's chilly out, but he didn't think it was that cold!
Clearly Jim Bob mistook Celsius for Fahrenheit, and made an invalid conclusion.  This has nothing to do with the reliability of the thermometer. You may rightfully object that of course we can always reach invalid conclusions--the trick is to find valid ones, and for that we require reliable instruments.  Not so.

Example: A standardized test is given to 145 students.  The assessment director gets a list of the students who took the test.  Is this list valid?  Yes.  In what sense is it reliable?

Here, the proposition is valid--it reflects reality--but isn't reliable longitudinally.  It's reliable in the trivial sense that if we look at the same instance of testing, we'd see the same roster of students, but almost anything is reliable by that standard.  The next time we give the instrument, we won't have the same roster.  Does this unreliability mean that the roster is invalid?  Of course not.

If reliability is a sine qua non for validity, then no unique observation can be valid.  Your impressions and conclusions about a movie on first viewing or first date are invalid because they can't be repeated.  When you read a book you really like, finish it and tell your friend "I enjoyed it," this is invalid according to testing standards. You would have to first repeatedly read it for the first time and then assess the resulting statistics.

To underline the illogic of validity => reliability, consider the proposition "The inter-rater statistics on this method of assessment show it to be unreliable."  This is a common enough thing.  So how would we evaluate the validity of that statement?  If it is true that validity requires reliability, there must be an underlying reliability that is a prerequisite for the validity of the conclusion.  Does that mean that we have to show that the inter-rater statistics are reliably unreliable?  That makes no sense: one instance of unreliability is all that is required to demonstrate unreliability.  A drunk driver may be only occasionally unreliable, but still absolutely unreliable, right?

If I make the statement "no cow is ever brown" the validity of that can be negated by a single instance of a brown cow.  I don't have to be able to reliably find brown cows, just one.  Yes, I have to be sure that the cow I saw really was brown, but this is a very low bar for reliability, and not what we're talking about.  Therefore, some kinds of propositions do not require reliability in order to be valid.

If I hear on the news that the Dow went up 15 points today, how should I evaluate the validity of this statement?  I could check other sources, but this will just establish the one fact.  I cannot repeat the day over and over to see if the Dow repeats its performance each time.  There is no way to establish longitudinal reliability.  Does that negate the validity of the statement?

The role of reliability is to allow us to make an inductive leap: if X has happened consistently in the past, maybe X is a feature of the universe, and will continue to happen.  Every time we eat a hamburger (or anything else), we make such an inductive leap.  Sometimes we have to be really clever to find out where the reliable parts are--like measuring gravity.  So rather than a strong requirement about reliability and validity, we should say something like "we assume that reliability implies something persistent about the objective reality of the subject."

So, on to measurement.  I looked up that chapter in the book and found this:
Measurement is the process of observing and recording the observations that are collected as part of a research effort. (pg 56)
Contrast that with the nice Wikipedia definition:
In science, measurement is the process of obtaining the magnitude of a quantity, such as length or mass, relative to a unit of measurement, such as a meter or a kilogram.

From dictionary.com, a measurement is the "extent, size, etc., ascertained by measuring".

The scientific definitions lend themselves to units and physical dimensions.  The first definition, from the research methods book, is much more general.  Let's parse it.  There are three parts.  First, measurement is a process of observing.  Then we note that we should record the observation.  I would argue that this provision is unnecessary--we only talk about any datum as long as it's recorded somewhere, even if only in our minds.  If it's not recorded, it's nowhere to be found and irrelevant. We can, however, glean from this that 'measurement' is both a verb and a noun.

The last provision--that it must be part of a research effort--can also be discarded.  The purpose of the observer may not be known when the measurements are later used.  Perhaps an ancient astronomer made star charts for religious reasons.  Does that negate them as measurements, solely for that reason?  No.  So we are left simply with the first part: a measurement is a process of observing (verb), or the product of the same (noun).  This is much weaker than the scientific version, because we aren't tied to standard units of measurement.

My main objection to this is that there's no need to use another word.  If we mean observation, why don't we just say observation?

Edits: Fixed a vanished sentence fragment, and crossed out 'dogma', which is a silly word to use.  I hadn't had enough coffee yet. 

Thursday, April 30, 2009

Part Eight: If Testing isn't Measurement, What Is It?

Why Assessment is Hard: [Part 1] [Part 2] [Part 3] [Part 4] [Part 5] [Part 6] [Part 7]

Last time I argued that although we use the word "measurement" for educational outcomes in the same way it's used for weighing a bag of coconuts, it really doesn't mean the same thing. It is a kind of deception to substitute one meaning for another without warning the audience. Of course, this happens all the time in advertising, where "no fat" doesn't really mean no fat (such terms are, however, now defined by the FDA). In education, this verbal blurring of meaning has gotten us into trouble.

Maybe it's simply wishful thinking to imagine that we could have the kind of precise identification of progress in a learner that would correspond to the graduations on a measuring cup. Socrates' simile that education is kindling a flame, not filling a cup is apt--learning is primarily qualitative (the rearrangement of neurons and subtle changes in brain chemistry perhaps) and not quantitative (pouring more learning stuff into the brain bucket). As another comparison, the strength of a chess position during a game is somewhat related to the overall number of pieces a player has, but far more important is the arrangement of those pieces.

The subject of quality versus quantity with regard to measurement deserves a whole discussion by itself, with the key question being how does one impose an order on a combinatorial set. I'll have to pass that today and come back to it another time.

The sleight of hand that allows us to get away with using "measurement" out of context is probably due to the fluidity with which language works. I like to juxtapose two quotes that illustrate the difference between the language of measurement and normal language.
We say that a sentence is factually significant to any given person, if and only if, [she or] he knows how to verify the proposition which it purports to express—that is, if [she or] he knows what observations would lead [her or him], under certain conditions, to accept the proposition as being true, or reject it as being false. – A. J. Ayer, Language, Truth, and Logic

[T]he meaning of a word is its usage in the language. – L. Wittgenstein
The first quote is a tenet of positivism, which has a scientific outlook. The second is more down-to-earth, corresponding to the way words are used in non-technical settings. I make a big deal out of this distinction in Assessing the Elephant about what I call monological and dialogical definitions. I also wrote a blog post about it here.

Words like "force" can have meanings in both domains. Over time, some common meanings get taken over by more scientific versions. What a "second" means becomes more precise every time physicists invent a more accurate clock. The word "measurement" by now has a meaning that's pretty soundly grounded in the positivist camp. That is, if someone says they measured how much oil is dripping from the bottom of the car, this generates certain expectations--a number and a unit, for example. There is an implied link to the physical universe.

But as we saw last time, the use of "measurement" in learning outcomes doesn't mean that. What exactly are we doing, though, when we assign a number to the results of some evidence of learning? It could be a test or portfolio rating, or whatever. If it's not measurement, what is it?

We can abstract our assessment procedures into some kind of statistical goo if we imagine that the test subject has some intrinsic ability to successfully complete the task at hand, but that this ability is perhaps occluded by noise or error of various sorts. Under the right probabilistic assumptions, we can then imagine that we are estimating this parameter--this ability to ace our assessment task. Typically this assessment will itself be a statistical melange of tasks that have different qualities. A spelling test, for example, could have a staggering variety of words on it in English. If there are a million words in the language, then the number of ten-item spelling tests is about
1,000,000,000,000,000,000,000,000,000,000
,000,000,000,000,000,000,000,000,000,000.
So the learning outcomes question "can Stanislav spell?" depends heavily on what test we give him, if that's how we are assessing his ability. Perhaps his "true" ability (the parameter mentioned above) is the average score over all possible tests. Obviously that is somewhat impractical, since his pencil would have to move faster than the speed of light to finish within a lifetime. And this is just a simple spelling test. What are the qualitative possibilities for something complex like "effective writing" or "critical thinking?"

When we assess, we dip a little statistical ruler into a vast ocean of heaving possibilities, changing constantly as our subject's brain adapts to its environment. Even if we could find the "true" parameter we seek, it would be different tomorrow.

All of this is to say that we should be modest about what we suppose we've learned through our assessments. We are severely limited in the number of qualities (such as combinations of testable items) that we can assess. If we do our job really well, we might have a statistically sound snapshot of one moment in time: a probabilistic estimate of our subject's ability to perform on a general kind of assessment.

If we stick to that approach--a modest probabilistic one--we can claim to be in positivist territory. But the results should be reported as such, in appropriately technical language. What actually happens is that a leap is made over the divide between Ayer and Wittgenstein, and we hear things like "The seniors were measured at 3.4 on critical thinking, whereas the freshmen were at 3.1, so let's break out the bubbly." In reality, the numbers are some kind of statistical parameter estimation of unknown quality, that may or may not have anything to do with what people on the street would call critical thinking.

Note that I've only attempted to describe assessment as measurement in this installment. There are plenty of types of assessments that do not claim to be measurement, and don't have to live up to the inherent unrealistic expectations. But there are plenty of outcomes assessments that do claim to be measurements, and they get used in policy as if they really were positivist-style tick marks on a competency ruler. Administrators at the highest levels probably do not have the patience to work through for themselves the limits of testing, and may take marketing of "education measurement" at face value.

In summary, "measurement" belongs in positivist territory, and most educational outcomes assessments don't live up to that definition. Exacerbating this situation is that "critical thinking" and "effective writing" don't live in the positivist land--they are common expressions with meanings understood by the population at large (with a large degree of fuzziness). Co-opting those words borrows from the Wittgenstein world for basic meaning, and then assigns supposedly precise (Ayer) measurements. This is a rich topic, and I've glossed over some of the complexities. My answer to the question in the title is this: educational assessment is a statistical parameter estimation, but how that parameter corresponds to the physical world is uncertain, and should be interpreted with great caution, especially when using it to make predictions about general abilities.

Tuesday, April 28, 2009

Part Seven: Measurement, Smeasurement

Why Assessment is Hard: [Part one] [Part two] [Part three] [Part four] [Part five] [Part six]

In outcomes assessment we use the word 'measure' as a matter of course. Our task today is to make sense of this language.

Measurement is a word laden with meaning. It means more than assessment or judgment or rating. Consider the following statements.
  • I picked some strawberries today. We measured them to be very tasty!
  • I measured the kids before they went to bed--they were all happy.
  • We went to the art museum, and measured the artists' creativity.
To me this sounds quite odd. On the other hand, we might easily say:
  • I measured the bag of potatoes. It was five pounds.
  • I measured three cups of flour for the bread.
  • When the builder measured the door, he discovered it was crooked.
There is a difference in common language between a subjective, perhaps casual assessment, and a more rigorous objective and verifiable one. Objectivity and reliability might be said to be the hallmarks of measurement, but there's a lot more to it than that.

If we say we can measure something, we evoke a certain kind of image--a child's growth over time marked off on the closet wall perhaps. Because we reduce complex information to a single scalar, for convenience we usually choose some standard amount as a reference. We aren't required to create this unit of measurement, but any type of measurement should allow this possibility. Hence we have pounds and inches and so forth.

Despite all the language about measuring learning, there are no units. At least I've never seen any proposed. So I will take it upon myself to do that here: let's agree to call a unit of learning an Aha. So we can speak of Stanislav learning 3 Ahas per semester on average if we want. Of course, we need to define what an Aha actually is. I have come to this backwards, defining a unit without a procedure to measure the phenomenon. What might be the procedure for measuring learning?

Because of the objectivity and reliability criteria for real measurement, things like standardized tests come to mind. Good! We can measure Ahas by standardized test. Of course, these instruments aren't really objective (they are complex things, created by people who are influenced by culture, fad, and so forth) nor truly reliable (you can't test the same student twice, as Heraclitus might say). But if we wave our hands enough, we can imagine those problems away.

Still, there is a substantial problem before us. We can't put all knowledge of everything on this test, so what particular kinds of questions are there to be? We run smack into the question what kind of learning? Unlike length, of which there is only one type, or weight, or energy, or speed, there are multiple types of learning: learning to read, learning to jump rope, learning to keep quiet in committee meetings so you don't get volunteered for something. If an Aha is to be meaningful, we have to be specific about what kind of learning it is. But each type is different and needs its own unit. We could coordinate the language to paper over this difficulty, just like we have one kind of ounces for liquid and another kind of ounces for weight. But this is not recommended since it creates the illusion of sameness. Undeterred, we might propose different units for different types of learning: Reading-Aha, Jump-Rope-Aha, Committee-Aha, etc.

How specific do we need to be? Reading, for example, is not really a single skill. I'm no expert, but there are questions about vocabulary, recognition of letters and words (dyslexia might be an eussi), pronunciation, understanding of grammar, and so forth. So reading itself is just a kind of general topic, more or less like height and weight are "physical dimensions." In the same way that it would be silly to average someone's height and weight to produce a "size" unit, we don't want to mix the important dimensions of reading into one fuzzy grab bag and then have the audacity to call this a unit of measure. Where does this devolution stop? What is the bottom level--the basic building block of learning--that we can assign a unit to with confidence?

There may be an answer to that question. If you've read my opinions about assessing thinking on this blog, you'll know I find "critical thinking" too hard to define, and prefer the dichotomy of "analytical/deductive" and "creative/inductive" because those can be defined in a relatively precise (algorithmic) way. A couple of research papers tie electrical brain activity to creative thinking exercises. Science Daily has articles here and here. This is a topic I want to come back to later, but for now consider the point that neurological research may eventually have the ability to distinguish measurable differences in brain activity and potentially provide a physical basis for studying learning.

There are tremendous difficulties with this project, even if there is an identified physical connection. That's because brains are apparently networks of complex interactions, and by nature highly dimensional. It's going to be very hard to squash all those dimensions into one without sacrificing something important.

Note that none of these issues prevents us from talking about learning as if it were a real thing. It's meaningful if I say "Tatianna learned how to checkmate using only a rook and a king." Most of language is not about measurable quantities. We can make very general comparisons without being precise about it. Shakespeare:

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date . . . "Sonnet 18," 1–4

The amazing thing about language is that we converge on meanings without external definitions or units of measure. Meaning seems to evolve so that there is enough correlation between what you understand to be the case and what I understand that we can effectively communicate. This facility is so good that I think we can easily make false logical leaps. I would put it like this:
Normal subjective communication is not an inferior version of some idealized measurement.
We should not assume that just because we can effectively talk about love, understanding, compassion, or learning, that those things can be measured. Failing a real definition of an "atom of learning" and commensurate unit, we shouldn't use the word "measurement." But if learning assessments aren't measurements, what are they? I'll try to tackle that question next time.

Next: Part Eight