Tuesday, February 27, 2007

A Generalization of Value Added

It was probably coincidence, but about the same time I sent an email to the nice folks at CAE telling them about my blog post criticizing the marketing of the CLA, I got a newsletter from them called the Collegiate Learning Pulse for February 2007.

There’s too much interesting stuff in it for me to write about in one evening, so let me focus on one bit having to do with the relationship between SAT and CLA scores.
[S]ome believe that scores on tests of broad competencies would behave like SAT scores simply because there is a modest correlation between the two.
In footnote 6 of the referenced document, we find such a correlation:
When the school is the unit of analysis, the SAT by itself accounts for about 70% of the variance in performance task scores.
At least, I assume that’s the right one. Supposing that’s in the ballpark, the correlation coefficient would be the square root of .7, which rounds up to .84. I would probably choose a different adjective than ‘modest’ for this level of correlation.
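For the record, that is just the usual arithmetic relating explained variance to the correlation coefficient when there is a single predictor:

    r = \sqrt{R^2} = \sqrt{0.70} \approx 0.84

But there’s a more salient point here.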
[E]mpirical analyses of thousands of students show that the CLA’s measures are sensitive to the effects of instruction; e.g., even after holding SAT scores constant, seniors tend to earn significantly higher CLA scores than freshmen.
Seniors do better on the CLA than freshmen do when the SAT is the same. Two questions occur to me. First, if the freshmen and the seniors are different people (i.e. this isn’t a longitudinal study), there are important issues with survivorship. You simply can’t compare a group of students who may or may not make it to their senior year to those who demonstrably have done so. At least, you shouldn’t do that without justifying it. Maybe that’s been done, but if the issue has been put to rest, that fact needs to be mentioned with a reference. But I’ve raised this issue before. Enough already about survivorship.

The second thing is this. The freshmen are taking the CLA as freshmen, and the seniors as seniors. But they both probably took the SAT as high school students. So the argument that the CLA results are generally different from SAT scores doesn’t work unless you retest the seniors with the SAT. I suspect that those scores would increase too.

Let me pose this thought experiment. Suppose we administered the SAT to seniors and applied the same methodology that the CLA is using to calculate value-added outcomes. We’d have to find a different predictor, obviously, since the SAT is currently being used to predict CLA scores, as I understand it. We could probably use high school GPA as a measure of ability. If the results turned out to be roughly similar to the ones the CLA is getting, could we conclude that the SAT is just as useful as the CLA for measuring value-added?
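Here is a minimal sketch of how such an experiment might be scored. To be clear, this residual approach is only my rough stand-in for a value-added calculation, not the CLA’s actual formula, and the data file and column names (a hypothetical sat_retest.csv of school means) are invented for illustration.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file: one row per school, with the mean high school GPA of its
    # seniors and the mean score those seniors earned on an SAT retest.
    df = pd.read_csv("sat_retest.csv")

    # Predict the senior retest score from entering ability (high school GPA).
    X = sm.add_constant(df["mean_hs_gpa"])
    model = sm.OLS(df["mean_senior_sat_retest"], X).fit()

    # "Value added" here is simply how far a school's seniors land above or
    # below the score predicted from their entering ability.
    df["value_added"] = df["mean_senior_sat_retest"] - model.predict(X)
    print(df[["school", "value_added"]].sort_values("value_added", ascending=False))

The point of the sketch is only that nothing in the residual recipe is specific to the CLA; any test given to freshmen and seniors could be run through it.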

In other words, how general is the CLA method of calculating value-added scores? Can it be used with any standardized test of freshmen and seniors? Could the NSSE, for example, be used to calculate a value-added score for engagement using these methods?

Monday, February 26, 2007

Questions of Validity

The debate on the Spellings Commission’s report continues. For me, the central question is how accreditation will change, and particularly to what extent we will be required to use standardized testing.

I have to confess that it was disheartening to me to see the intermediate reports from the Commission as they seemed to tilt from the beginning toward a standardized testing ‘solution’ to the ‘problem’ of comparing institutions to each other. One should not be surprised that a debate that originates in Washington turns out to be political, I suppose, but the irony is thick. To all appearances, the debate on how to measure critical thinking did not proceed very scientifically.

Personally, I see this rush to judgment about standardized testing as intellectually offensive. Here’s why.
  1. The problem that the Commission wants to solve is not clearly spelled out. They never really say in plain, measurable terms what evidence of learning should be measured.
  2. Having made the fundamental error of not defining the desired outcome, they confuse it with the assessment device that one might use to measure the outcome. Finding a test that makes you happy is not the same as defining the outcomes.
Real learning outcomes are complicated, messy, and hard to measure. Frankly, I think the Commission would have been much better off to focus on the results that a good education should bring, not the education itself. These could include:
  1. Employment
  2. Income earned
  3. Civic engagement
  4. Graduate school degrees
  5. Professional certification
All of these outcomes are measurable with data already on hand (if you’re the US government, that is). Where are the reports showing how an investment in a public or private education pays off over a lifetime? Yes, it’s important for graduates to be able to think and communicate effectively, but wouldn’t those abilities show up indirectly as higher employment and income? If not, they must not be that important to begin with.

This is the kind of self-evident homework I would have hoped the Commission would have done. Even their desire to compare institutions could be accommodated using indexes like these, although one would hope that they would take into account the different kinds of starting material.

But the testing steamroller is well underway at this point, so I’d like to explore that topic. It seems that we will be in the situation of having a test define a goal for us. That is, if score comparisons become the basis for performance ratings and institutional rankings, then what the test purports to measure is actually irrelevant.
Example. Joe thinks perhaps Mary doesn’t love him anymore. He reads in a magazine that emotional commitment causes a person’s pupils to dilate. Joe decides to measure Mary’s pupil dilation at certain intervals, and if they fall below a critical value he will leave her.
The example may sound absurd, but it is apt. Joe hasn’t really thought out what he means by ‘love’ and simply latches on to a convenient definition. We might call his functional definition love*, where the asterisk denotes a restricted definition (I borrow this notation from Philosophical Foundations of Neuroscience). Joe is committing a mereological fallacy—confusing a part with the whole.

There is no danger in defining a restricted type of learning* as long as we’re aware that it’s not the whole picture. Standardized tests can be valuable for comparison to other institutions—it’s hard to do that any other way. But if learning* takes on regulatory and perhaps financial importance, which would be the case if scores are used for ranking institutions, it will create a false economy.

A strong emphasis on learning* will undoubtedly produce it in buckets. As I mentioned earlier, it shouldn’t be hard to improve standardized test scores with a little coaching (an industry devoted to this will certainly spring up). My friend and colleague David Kammler has penned a very funny piece called The Well Intentioned Commissar that describes the effects of such a false economy.

But am I being too cynical here? Perhaps the test is so good that the difference between learning and learning* is acceptable. That is, maybe the test is valid for this purpose. If so, my hypothesis that scores can be quickly improved by a few hours of coaching must be wrong. Otherwise, why do we need to waste four years of college on these students?

If standardized testing is indeed to become mandatory for higher education, the Collegiate Learning Assessment (CLA) is likely to be one of the candidates. This instrument seems to have been a darling of Mr. Charles Miller, chair of the commission. The test and its putative purpose of measuring value-added learning outcomes have been criticized by George Kuh and Trudy Banta, among others, and have prompted a response from CLA researchers (see CLA: Facts & Fantasies for the latest salvo).

Let us ask our first question about the validity of this test. How long would it take to coach a freshman so that he or she improved to senior-level scores on the CLA? How many weeks of study do you think would be required to significantly improve a student’s score? I’d guess about four, but who knows. The point is that it’s likely to be a lot less than four years. Shouldn’t that give us pause? Why are we wasting four years of education on these students if all that matters is four weeks’ worth of knowledge?

This thought experiment shows that learning* is likely to be significantly different from what students get from a college education. Perhaps it is a small subset, but certainly not enough to compare institutions with. This clearly shows the need to define goals in a comprehensive way before trying to measure them. Assessing learning will always be incomplete and approximate, and the goal of an instrument precise enough to rate colleges and universities is probably unrealistic. I would argue it’s far better to choose metrics like the ones listed earlier (post-graduation employment, income, etc.) if we really have to have such a rating. What if we find out later that high CLA scores predict unemployment?

Standardized tests like the CLA are attractive because they eliminate complexity by reducing variables, which usually results in reasonably reliable instruments. But one can reliably get wrong answers too, like Joe with his pupil measurements. Richard Feynman commented on this subject.

I think the educational and psychological studies I mentioned are examples of what I would like to call cargo cult science. In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas--he's the controller--and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.
This quote makes me squirm a bit because I spend a fair amount of time trying to measure learning outcomes myself. Am I conducting ‘cargo cult’ science? Feynman goes on to make a pitch for intellectual honesty, which is how I see a way around his criticism.
I would like to add something that's not essential to the science, but something I kind of believe, which is that you should not fool the layman when you're talking as a scientist. I am not trying to tell you what to do about cheating on your wife, or fooling your girlfriend, or something like that, when you're not trying to be a scientist, but just trying to be an ordinary human being. We'll leave those problems up to you and your rabbi. I'm talking about a specific, extra type of integrity that is not lying, but bending over backwards to show how you are maybe wrong, that you ought to have when acting as a scientist. And this is our responsibility as scientists, certainly to other scientists, and I think to laymen.
As long as we’re careful to distinguish learning* from learning, to expose the weaknesses of our methods, and above all not to try to fool people about these things, I think we have license to talk about measuring learning outcomes.

An excellent example of this kind of professionalism can be found in the College BASE marketing material:
[S]tudies support the conclusion that the scores yielded by College BASE are a reasonable basis for making particular inferences about the achievement of individual examinees, and that the scores neither reflect the assessment of unintended abilities, nor are they beyond reasonable error. While validity studies completed by the Assessment Resource Center provide evidence for using College BASE as an estimation of a student’s mastery of skills for purposes described in its test content, an institution must still document use of the examination for its own specific purposes. Technical assistance for implementing College BASE at an institution is available from the Assessment Resource Center.

Responsibility for ensuring other validity evidence rests with the testing institution. Within its assessment plan the institution should address issues related to test administration conditions, cut score requirements, and appropriate accommodations to students as needed. The needs of each individual institution are unique to its mission and its assessment program. The Assessment Resource Center can assist administrators in using individual and institutional data in appropriate and meaningful ways to improve student outcomes.
Why all the careful language? Because the test designers realize that equating learning* with learning is problematic. The quote above hints at the pains the designers have taken to try to find evidence of validity for the test. You can also request to be sent a Technical Manual. In the introductory material on validity in this manual they write:
The emphasis [on validity] is not on the instrument itself; rather, it is on the interpretation of the scores yielded by a test.
And later:
It is a common misconception that validity is a particular phenomenon whose presence in a test may be evaluated concretely and statistically. One often hears exclamations that a given test is “valid” or “not valid.” Such pronouncements are not credible, for they reflect neither the focus nor the complexity of validity.
I have heard IR directors use these very words. It’s usually a sentence like “We just use X standardized test because it’s been proven reliable and valid.” Valid for what? That’s how I interpret what the College BASE people are saying: how you interpret the results determines whether or not you’re conducting cargo cult science. I highly recommend taking a look at the whole document—it reads like a manual on how to construct a test. I’m afraid my own attempts at measurement fall far short of this standard.

To sum up where we are at this point: Think carefully about the goals, then try to find ways to assess them, but be skeptical of the ability of these instruments to render complete results. Like Descartes, you should spend a lot of time doubting yourself.

Compare the College BASE materials with a direct mail marketing flyer I got from the CLA. Here’s an excerpt from a box entitled “Validity and Reliability.”
The CLA measures were designed by nationally recognized experts in psychometrics and assessment, and field tested in order to ensure the highest levels of validity and reliability. For more information contact the CLA staff.
This is pure marketing hype, and violates Feynman’s dictum to not fool the public. There’s no indication that deciding validity is a complex process that depends heavily on what the results are to be used for. Perhaps we can charitably assume that the CLA marketing department hasn’t taken direction very well from the researchers. Surely the decision-makers who recommend this test are better informed, right?

Perhaps not. Richard Hersh, one of the main proponents of the CLA, writes in the Atlantic Monthly that:
For our purposes [grades and grade point averages] are nearly useless as indicators of overall educational quality--and not only because grade inflation has rendered GPAs so suspect that some corporate recruiters ask interviewees for their SAT scores instead. Grades are a matter of individual judgment, which varies wildly from class to class and school to school; they tend to reduce learning to what can be scored in short-answer form; and an A on a final exam or a term paper tells us nothing about how well a student will retain the knowledge and tools gained in coursework or apply them in novel situations.
He cannot be unaware that the correlation between CLA scores and grade point averages is one of the pieces of evidence for the validity of the test. According to a research article on the CLA website, GPA can ‘explain’ around 25% of the variance in CLA scores in the study they conducted. It seems quite odd to reject GPA in such strong language when there is such a strong relationship between the two. As a potential consumer of the product, I don’t find this kind of disconnect reassuring.

My own sense is that the recent success of the CLA is due more to marketing and lobbying than to the efficacy of the test. This is not to make any judgment about the test—I’m very interested in the serious research being done on it. From the data we’ve gathered at Coker College, I think you can make an argument that the value-added idea works. Our method is the opposite of standardized testing, however, and could not reasonably be used to compare institutions. Incidentally, our research also seems to validate that grade point averages are meaningful: higher averages lead to more value added. I blogged about it here.

I do have some questions about the CLA, however. These may be easily answered by the experts.
  1. Are freshman/senior comparisons being done longitudinally (i.e. by waiting four years to retest)? If not, how do you control for survivorship bias? The freshmen who eventually become seniors are likely to have more ability than those who drop out along the way. Even controlling for SAT and GPA is not going to make this question go away (see the sketch after this list).
  2. The heavy reliance on the SAT as a predictor of CLA scores seems like a dubious method to me. The SAT is not very good at predicting success in college (at least at mine), so how do you convince a skeptic that the ‘value added’ is meaningful? Maybe the students are just getting better at taking standardized tests.
  3. I think there will be significant problems in getting students to take a three-hour test for which they are not graded. I can’t speak to whether or not this will affect the results of the test, but I can tell you that it will cause IR directors to pull their hair out.
  4. How much can scores be improved with a weekend training session?
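To make the survivorship concern in question 1 concrete, here is a toy simulation (every number in it is invented) in which nobody learns anything between the freshman and senior year, yet the cross-sectional ‘seniors’ still outscore the freshmen because weaker students are more likely to have left.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    ability = rng.normal(0.0, 1.0, n)            # latent ability
    score = ability + rng.normal(0.0, 0.5, n)    # CLA-like score; no learning effect added

    # Assume lower-ability students are more likely to leave before senior year.
    p_persist = 1.0 / (1.0 + np.exp(-1.5 * ability))
    survives = rng.random(n) < p_persist

    print("freshman mean score:", round(score.mean(), 3))
    print("senior mean score:  ", round(score[survives].mean(), 3))
    # The 'senior' group scores higher even though no one gained anything.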
I’m writing this from the perspective of an IR director who has to translate test results into action. In order to do this, I have to be able to explain in simple words what test results mean and how this relates to the curriculum and methods we employ. In short, how do I know I’m not engaging in Feynman’s cargo cult science?

Saturday, February 17, 2007

First, Do No Harm

Our state meeting of independent schools' institutional researchers was yesterday. As usual, I picked up some new ideas. I learned about Remark Office OMR (optical mark recognition) software, which can be used to create your own scannable forms for doing surveys. This eliminates having to buy the expensive forms from Scantron (which work well, however--that's what we currently use without problems). The scanner is also apparently more forgiving about what kinds of marks are made. So no more sharpening 1,000 pencils for student evaluation day.

There was also a good discussion of the role of the IR office in strategic planning. The 'safe' option for an IR office is to produce reports and forward them on to the decision-makers. I did a fair amount of that myself when I was finding my way as a newly appointed IR director. The problem with that is that decision-makers aren't necessarily in the best position to use the data. Also, the information we work from is often incomplete and open to interpretation. Accreditors like SACS seem to imagine a perfect world where assessment inevitably leads to improvement. But the essence of leadership is courage, not data. In mulling this over, I have come to the conclusion that we in the IR business should help decision-makers as follows:
  1. First, do no harm. If the data are pretty strong against a current or proposed policy, use IR's influence to get the policy changed. Example: One of my colleagues at the meeting mentioned that he'd just finished a project having to do with AP testing. In this case, the faculty were convinced that students who used AP credit to skip the introductory class in a subject were ill-prepared to handle the second course. In looking at the data, however, he found that the opposite was true, except in one case. In that case the curriculum for the AP class wasn't a good match for the 'target' course. So he helped kill a proposal that would have eliminated AP credit at the university--a decision that would have had massive admissions implications.
  2. Most of the time, the data aren't going to point one way or the other. In these cases, it is better to encourage decision-makers to adopt policies that subjectively may have a positive effect. That is, rather than creating 'data paralysis' when there is no clear direction, keep moving and trying out new things. At worst you identify which policies are bad, and at best you may find the right one accidentally. Example. We adopted an initiative to improve student writing. We have some assessment data on writing, but nothing that clearly and unequivocally says DO THIS! So the committee assembled a variety of opinions about remedies and eventually adopted some of these as policy. They are quite reasonable and may be expected to improve the situation. In parallel, our assessments are improving, so we may be able to tell in a year or two whether or not we have succeeded. This is much better than simply giving up because there is no clear way forward.
  3. If the data clearly indicate the need for a new policy, do whatever you can to get it implemented. This is what an IR officer should live for--that one study or report that shows YES--this thing really matters and we should act on it. This is where the courage part of leadership comes into play, because said IR director will be out there making the case as the spokesperson. Simply passing it on isn't good enough. Make it happen! Example. We recently did a shotgun-style survey of our students and found stark differences between first-generation students and students whose parents graduated from college. The differences were in attitudes about money, plans after graduation, and attitudes toward offices and services provided by the college. Given the national statistics on attrition for this group, these data cry out for action. So I've used three committees to sell the idea--the IE committee, our retention committee, and the one on improving writing. The issue of preparing first-gen students for their first year cuts across all these areas.
In summary, I advocate that the IR director form opinions about the usefulness of the data and then take the lead in acting on them. If the data are equivocal, encourage the debate and look for opportunities to use data or at least improve assessment. But don't just publish a report full of charts and numbers with lists of standard errors and dump it on somebody's desk. Taking action by advocating for or against a policy takes courage, because you can find yourself on the 'wrong' side of the administration, AND you can be wrong! But I believe it vastly increases the usefulness of the IR office to the institution. There is a danger, of course, that IR will seem to be more politicized. But if you call them as you see them rather than being a sycophant to the administration, you'll be respected for it.

This relates to another role of IR--to provide artificial certainty. The question "how many students do we have?" probably has a dozen possible answers, depending on what you count as a student (part-time? FTE? degree-seekers?) and when you take the snapshot. Decision-makers don't need to know all the possibilities. So figure out a method of measuring this fuzzy number and then publish it as fact. And it shall be so. As a story about three umpires has it, the third says of called plays: "They ain't nothing until I call them."
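As a small illustration of how many defensible answers that question has, here is a toy calculation; the records, the 12-hour full-time threshold, and the FTE divisor are all invented, and every campus will have its own conventions.

    # The same enrollment records yield several defensible "how many students"
    # answers depending on the definition. All numbers are invented.
    students = [
        # (id, credit_hours, degree_seeking)
        (1, 15, True), (2, 12, True), (3, 6, True), (4, 3, False), (5, 9, True),
    ]

    headcount = len(students)
    full_time = sum(1 for _, hours, _ in students if hours >= 12)   # assumed threshold
    degree_seeking = sum(1 for *_, seeking in students if seeking)
    fte = sum(hours for _, hours, _ in students) / 12               # assumed divisor

    print(f"headcount={headcount}, full-time={full_time}, "
          f"degree-seeking={degree_seeking}, FTE={fte:.1f}")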

Monday, February 05, 2007

Predicting GPA

Every year we get an IR request to find better predictors for student success. This comes at me from three angles. One is the Standards Committee, which periodically reviews the cut-off values for admission and provisional admission. The other two interested parties are Admissions and the Budget Committee. We could improve the student body's entering statistics quite rapidly by eliminating the bottom half of the entering class, but that would cause heartburn in the Budget Committee when the calculations for revenue are done. Balancing these two isn't easy.

A large part of the problem is figuring out how to identify the students who are likely to fail. We can then either not admit them, or give them the help they need to succeed. Traditionally we have used the composite SAT and high school GPA for this purpose. A linear regression model is used to predict GPA at the College. As an experiment, I tried something new this year. In the Fall we gave 'Assessment Day' surveys to over a third of the traditional students. There were 100 questions, many of which pertain only to students already in attendance. Some of the questions could be answered by applicants, though, if we chose to do that. (For example: has either of your parents graduated from college?) I identified these and did a stepwise regression to see if any of them could add information to SAT and HSGPA in predicting grades. I found two. Both are surprising.
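For concreteness, here is a simplified sketch of that screening step: each candidate survey item is tested one at a time against the SAT/HSGPA baseline. This is not necessarily the exact stepwise procedure we ran, and the file and column names (assessment_day.csv, q1 through q100, and so on) are placeholders.

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("assessment_day.csv")       # placeholder file of matched records
    base_cols = ["sat_composite", "hs_gpa"]
    survey_items = [c for c in df.columns if c.startswith("q")]   # q1 .. q100

    base = sm.OLS(df["college_gpa"], sm.add_constant(df[base_cols])).fit()

    # Flag items that add information beyond SAT and high school GPA.
    for item in survey_items:
        X = sm.add_constant(df[base_cols + [item]])
        fit = sm.OLS(df["college_gpa"], X).fit()
        if fit.pvalues[item] < 0.05:
            print(f"{item}: coef={fit.params[item]:+.3f}, p={fit.pvalues[item]:.3f}, "
                  f"R2 gain={fit.rsquared - base.rsquared:.3f}")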

The two items are:
  • I work harder on homework than most students.
  • Family commitments often make studying difficult.
The responses were on a 1-5 Likert scale from Strongly Agree to Strongly Disagree. What's really interesting about these items is that they were significantly correlated with college grades (p < .03), and the effect sizes were as great as the SAT's. Strangely, both of the coefficients were the reverse of what I would have guessed.

The students who reported working harder on homework tended to have lower grades, and those who cited family commitments tended to have higher grades.

This is all experimental at present, but we're exploring the idea of using questions like these on an application form, so that we can better identify matches for the institution's goals. Without such additional information, our models can explain only about 44% of the variance in college grades. This is not bad--we can eliminate 56% of the 'bad' students at a cost of losing only 5% of the 'good' students. There are ways to improve this using demographic characteristics, but that has always seemed unethical to me, due to biases built into the SAT, for example.
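For readers wondering where a tradeoff figure like that comes from, here is a sketch of the calculation. The 2.0 'success' threshold, the 2.2 cutoff on predicted GPA, and the file of predictions are placeholders rather than our actual numbers.

    # Pick a cutoff on predicted GPA and see what fraction of unsuccessful vs.
    # successful students fall below it. Thresholds and file are placeholders.
    import pandas as pd

    df = pd.read_csv("predicted_vs_actual_gpa.csv")   # columns: predicted_gpa, actual_gpa

    success_threshold = 2.0     # 'good' student = actual GPA at or above this
    cutoff = 2.2                # would-be admission cutoff on predicted GPA

    bad = df["actual_gpa"] < success_threshold
    below_cutoff = df["predicted_gpa"] < cutoff

    bad_eliminated = (bad & below_cutoff).sum() / bad.sum()
    good_lost = (~bad & below_cutoff).sum() / (~bad).sum()

    print(f"'bad' students screened out: {bad_eliminated:.0%}")
    print(f"'good' students lost:        {good_lost:.0%}")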