Monday, February 26, 2007

Questions of Validity

The debate on the Spellings Commission’s report continues. For me, the central question is how accreditation will change, and particularly to what extent we will be required to use standardized testing.

I have to confess that it was disheartening to me to see the intermediate reports from the Commission as they seemed to tilt from the beginning toward a standardized testing ‘solution’ to the ‘problem’ of comparing institutions to each other. One should not be surprised that a debate that originates in Washington turns out to be political, I suppose, but the irony is thick. To all appearances, the debate on how to measure critical thinking did not proceed very scientifically.

Personally, I see this rush to judgment about standardized testing as intellectually offensive. Here’s why.
  1. The problem that the Commission wants to solve is not clearly spelled out. They never really say in plain, measurable terms what evidence of learning should be measured.
  2. Having made the fundamental error of not defining the desired outcome, they confuse it with the assessment device that one might use to measure the outcome. Finding a test that makes you happy is not the same as defining the outcomes.
Real learning outcomes are complicated, messy, and hard to measure. Frankly, I think the Commission would have been much better off to focus on the results that a good education should bring, not the education itself. These could include:
  1. Employment
  2. Income earned
  3. Civic engagement
  4. Graduate school degrees
  5. Professional certification
All of these outcomes are measurable with data already on hand (if you’re the US government, that is). Where are the reports showing how an investment in a public or private education pays off over a lifetime? Yes, it’s important for graduates to be able to think and communicate effectively, but wouldn’t those abilities show up indirectly as higher employment and income? If not, they must not be that important to begin with.

This is the kind of self-evident homework I had hoped the Commission would do. Even their desire to compare institutions could be accommodated using indexes like these, although one would hope that they would take into account the different kinds of starting material.

But the testing steamroller is well underway at this point, so I’d like to explore that topic. It seems that we will be in the situation of having a test define a goal for us. That is, if score comparisons become the basis for performance ratings and institutional rankings, then what the test purports to measure is actually irrelevant.
Example. Joe thinks perhaps Mary doesn’t love him anymore. He reads in a magazine that emotional commitment causes a person’s pupils to dilate. Joe decides to measure Mary’s pupil dilation at certain intervals, and if they fall below a critical value he will leave her.
The example may sound absurd, but it is apt. Joe hasn’t really thought out what he means by ‘love’ and simply latches on to a convenient definition. We might call his functional definition love*, where the asterisk denotes a restricted definition (I borrow this notation from Philosophical Foundations of Neuroscience). Joe is committing a mereological fallacy: confusing a part with the whole.

There is no danger in defining a restricted type of learning* as long as we’re aware that it’s not the whole picture. Standardized tests can be valuable for comparison to other institutions—it’s hard to do that any other way. But if learning* takes on regulatory and perhaps financial importance, which would be the case if scores are used for ranking institutions, it will create a false economy.

A strong emphasis on learning* will undoubtedly produce it in buckets. As I mentioned earlier, it shouldn’t be hard to improve standardized test scores with a little coaching (an industry devoted to this will certainly spring up). My friend and colleague David Kammler has penned a very funny piece called The Well Intentioned Commissar that describes the effects of such a false economy.

But am I being too cynical here? Perhaps the test is so good that the difference between learning and learning* is acceptable. That is, maybe the test is valid for this purpose. If so, my hypothesis that scores can be quickly improved by a few hours of coaching must be wrong. Otherwise, why do we need to waste four years of college on these students?

If standardized testing is indeed to become mandatory for higher education, the Collegiate Learning Assessment (CLA) is likely to be one of the candidates. This instrument seems to have been a darling of Mr. Charles Miller, chair of the commission. The test and its putative purpose of measuring value-added learning outcomes have been criticized by George Kuh and Trudy Banta, among others, and have prompted a response from CLA researchers (see CLA: Facts & Fantasies for the latest salvo).

Let us ask our first question about the validity of this test. How long would it take to coach a freshman so that he or she improved to senior-level scores on the CLA? How many weeks of study do you think would be required to significantly improve a student’s score? I’d guess about four, but who knows. The point is that it’s likely to be a lot less than four years. Shouldn’t that give us pause? Why are we wasting four years of education on these students if all that matters is four weeks’ worth of knowledge?

This thought experiment shows that learning* is likely to be significantly different from what students get from a college education. Perhaps it is a small subset, but certainly not enough to compare institutions with. This clearly shows the need to define goals in a comprehensive way before trying to measure them. Assessing learning will always be incomplete and approximate, and the goal of an instrument precise enough to rate colleges and universities is probably unrealistic. I would argue it’s far better to choose metrics like the ones listed earlier (post-graduation employment, income, etc.) if we really have to have such a rating. What if we find out later that high CLA scores predict unemployment?

Standardized tests like the CLA are attractive because they eliminate complexity by reducing variables, which usually results in reasonably reliable instruments. But one can reliably get wrong answers too, like Joe with his pupil measurements. Richard Feynman commented on this subject.

I think the educational and psychological studies I mentioned are examples of what I would like to call cargo cult science. In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas--he's the controller--and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.
This quote makes me squirm a bit because I spend a fair amount of time trying to measure learning outcomes myself. Am I conducting ‘cargo cult’ science? Feynman goes on to make a pitch for intellectual honesty, which is how I see a way around his criticism.
I would like to add something that's not essential to the science, but something I kind of believe, which is that you should not fool the layman when you're talking as a scientist. I am not trying to tell you what to do about cheating on your wife, or fooling your girlfriend, or something like that, when you're not trying to be a scientist, but just trying to be an ordinary human being. We'll leave those problems up to you and your rabbi. I'm talking about a specific, extra type of integrity that is not lying, but bending over backwards to show how you are maybe wrong, that you ought to have when acting as a scientist. And this is our responsibility as scientists, certainly to other scientists, and I think to laymen.
As long as we’re careful to distinguish learning* from learning, and to expose the weaknesses of our methods, and above all not to try to fool people about these things, I think we have license to talk about measuring learning outcomes.

An excellent example of this kind of professionalism can be found in the College BASE marketing material:
[S]tudies support the conclusion that the scores yielded by College BASE are a reasonable basis for making particular inferences about the achievement of individual examinees, and that the scores neither reflect the assessment of unintended abilities, nor are they beyond reasonable error. While validity studies completed by the Assessment Resource Center provide evidence for using College BASE as an estimation of a student’s mastery of skills for purposes described in its test content, an institution must still document use of the examination for its own specific purposes. Technical assistance for implementing College BASE at an institution is available from the Assessment Resource Center.

Responsibility for ensuring other validity evidence rests with the testing institution. Within its assessment plan the institution should address issues related to test administration conditions, cut score requirements, and appropriate accommodations to students as needed. The needs of each individual institution are unique to its mission and its assessment program. The Assessment Resource Center can assist administrators in using individual and institutional data in appropriate and meaningful ways to improve student outcomes.
Why all the careful language? Because the test designers realize that equating learning* with learning is problematic. The quote above hints at the pains the designers have taken to try to find evidence of validity for the test. You can also request to be sent a Technical Manual. In the introductory material on validity in this manual they write:
The emphasis [on validity] is not on the instrument itself; rather, it is on the interpretation of the scores yielded by a test.
And later:
It is a common misconception that validity is a particular phenomenon whose presence in a test may be evaluated concretely and statistically. One often hears exclamations that a given test is “valid” or “not valid.” Such pronouncements are not credible, for they reflect neither the focus nor the complexity of validity.
I have heard IR directors use these very words. It’s usually a sentence like “We just use X standardized test because it’s been proven reliable and valid.” Valid for what? That’s how I interpret what the College BASE people are saying: how you interpret the results determines whether or not you’re conducting cargo cult science. I highly recommend taking a look at the whole document—it reads like a manual on how to construct a test. I’m afraid my own attempts at measurement fall far short of this standard.

To sum up where we are at this point: Think carefully about the goals, then try to find ways to assess them, but be skeptical of the ability of these instruments to render complete results. Like Descartes, you should spend a lot of time doubting yourself.

Compare the College BASE materials with a direct mail marketing flyer I got from the CLA. Here’s an excerpt from a box entitled “Validity and Reliability.”
The CLA measures were designed by nationally recognized experts in psychometrics and assessment, and field tested in order to ensure the highest levels of validity and reliability. For more information contact the CLA staff.
This is pure marketing hype, and violates Feynman’s dictum to not fool the public. There’s no indication that deciding validity is a complex process that depends heavily on what the results are to be used for. Perhaps we can charitably assume that the CLA marketing department hasn’t taken direction very well from the researchers. Surely the decision-makers who recommend this test are better informed, right?

Perhaps not. Richard Hersh, one of the main proponents of the CLA, writes in the Atlantic Monthly that:
For our purposes [grades and grade point averages] are nearly useless as indicators of overall educational quality--and not only because grade inflation has rendered GPAs so suspect that some corporate recruiters ask interviewees for their SAT scores instead. Grades are a matter of individual judgment, which varies wildly from class to class and school to school; they tend to reduce learning to what can be scored in short-answer form; and an A on a final exam or a term paper tells us nothing about how well a student will retain the knowledge and tools gained in coursework or apply them in novel situations.
He cannot be unaware that correlation between CLA scores and grade point averages is one of the pieces of evidence for validity of the test. According to a research article on the CLA website, GPA can ‘explain’ around 25% of the variance in CLA scores in the study they conducted. It seems quite odd to reject GPA in such strong language when there is such a strong relationship between the two. As a potential consumer of the product, I do not find this kind of disconnect reassuring.
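For readers who want to translate "25% of the variance" into a correlation, here is a minimal sketch. It assumes GPA was the only predictor in a simple linear model, which the article summary above does not confirm; under that assumption the variance explained is the square of the correlation. The function name is made up for illustration.

```python
import math

def implied_correlation(variance_explained: float) -> float:
    """Return |r| for a one-predictor linear model with the given R^2.

    Assumption: a single predictor, so R^2 equals r squared.
    Only the magnitude is recoverable, not the sign.
    """
    if not 0.0 <= variance_explained <= 1.0:
        raise ValueError("variance_explained must be between 0 and 1")
    return math.sqrt(variance_explained)

# The 25% figure quoted above, read as R^2 = 0.25.
print(f"implied |r| = {implied_correlation(0.25):.2f}")  # prints "implied |r| = 0.50"
```

If the study used several predictors at once, this shortcut does not apply and the figure would need to be read from the study itself.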

My own sense is that the recent success of the CLA is due more to marketing and lobbying than to efficacy of the test. This is not to make any judgment about the test; I’m very interested in the serious research being done on it. From the data we’ve gathered at Coker College, I think you can make an argument that the value-added idea works. Our method is the opposite of standardized testing, however, and could not reasonably be used to compare institutions. Incidentally, our research also seems to validate that grade point averages are meaningful: higher averages lead to more value added. I blogged about it here.

I do have some questions about the CLA, however. These may be easily answered by the experts.
  1. Are freshman/senior comparisons being done longitudinally (i.e. by waiting four years to retest)? If not, how do you control for survivorship bias? The freshmen who eventually become seniors are likely to have more ability than those who drop out along the way. Even controlling for SAT and GPA is not going to make this question go away.
  2. The heavy reliance on the SAT as a predictor of CLA scores seems like a dubious method to me. The SAT is not very good at predicting success in college (at least at mine), so how do you convince a skeptic that the ‘value added’ is meaningful? Maybe the students are just getting better at taking standardized tests.
  3. I think there will be significant problems in getting students to take a three-hour test for which they are not graded. I can’t speak to whether or not this will affect the results of the test, but I can tell you that it will cause IR directors to pull their hair out.
  4. How much can scores be improved with a weekend training session?
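The survivorship worry in question 1 can be made concrete with a toy simulation. This is only a sketch under invented assumptions, not a model of the CLA: ability is drawn from a standard normal distribution, a test score is ability plus noise, and the chance of persisting to senior year rises with ability along a logistic curve. Every name and parameter below is hypothetical.

```python
import math
import random
import statistics

def apparent_gain(n_students=10_000, true_gain=0.0, seed=0):
    """Senior mean minus freshman mean when both groups are tested in the same year.

    Illustrative assumptions only: ability ~ Normal(0, 1), score = ability
    plus Normal(0, 0.5) noise, and P(persist to senior year) = logistic(ability).
    """
    rng = random.Random(seed)

    def score(ability):
        return ability + rng.gauss(0.0, 0.5)

    # Freshman sample: everyone who enrolled this year.
    freshmen = [rng.gauss(0.0, 1.0) for _ in range(n_students)]

    # Senior sample: an earlier entering class, minus those who dropped out.
    entrants = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    seniors = [a for a in entrants if rng.random() < 1.0 / (1.0 + math.exp(-a))]

    freshman_scores = [score(a) for a in freshmen]
    senior_scores = [score(a) + true_gain for a in seniors]
    return statistics.mean(senior_scores) - statistics.mean(freshman_scores)

# Even with no learning at all (true_gain = 0), the seniors look better.
print(f"apparent gain with zero true gain: {apparent_gain(true_gain=0.0):.2f}")
```

In this toy setup the same-year comparison shows a positive "gain" even when nobody learned anything, purely because the weaker students are missing from the senior sample. A longitudinal design that retests the same students would not have this particular problem.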
I’m writing this from the perspective of an IR director who has to translate test results into action. In order to do this, I have to be able to explain in simple words what test results mean and how this relates to the curriculum and methods we employ. In short, how do I know I’m not engaging in Feynman’s cargo cult science?


  1. Anonymous 8:43 AM

    Let me take a shot at answering these...

    1. No, the testing is done all in the same year. And, to add to the problem, at many institutions, many seniors transferred into the institution and were NOT ever freshmen at that institution.

    2. The CLA administrators HAVE to include the SAT as a predictor in their models because the CLA and SAT are highly correlated. So if you don't include the SAT scores in your models, you end up with scores that are confounded by SAT (and thus, institutional comparisons are even less meaningful).

    3. Yes. Few students are willing to sit for a three-hour test (or even the 90-minute version), particularly when it has little meaning for them. There are no easy solutions to this problem and it seems clear that it would impact the validity of the test.

    4. Hard to say on this one...but it does make us aware of Berliner's principle: the more important a quantitative test becomes for accountability, the more it will distort the very thing it purports to measure. So if the CLA becomes important for institutions, they will find ways to game the system to get the scores they need.

  2. Per #3, I know of two large schools that are paying students to take the test. One requires that they spend a certain amount of time at the terminal before they can collect the cash. The other appeals to their altruism.