Friday, June 10, 2011

An Addendum and Apology

On Wednesday I wrote about my take-aways from the AALHE meeting in Lexington, and drew on some remarks from Trudy Banta. Some of the responses I got were justly critical of the way I mangled Trudy's message about the wider importance of the SAT example. It's true--I botched it, and I'd like to set the record straight so that Trudy doesn't begin to doubt her communications skills.

In order not to further fold, spindle, or mutilate someone else's message, let me say that what follows is my own interpretation, and any faults should be ascribed to me alone.

In a comment on the still-infantile state of the art of measurement, Trudy observed that even after more than 80 years of development and efforts to improve the SAT for its intended purpose, there is as much disagreement as ever about the validity of the SAT as a predictor of success in college. I intended to emphasize this point and what it portends for other projects that are even more ambitious. In the original article I left this argument dangling rather ineptly. So here it is: I start with a closer look at the SAT, generalize to the characteristics of industrialized tests (meaning massive, usually commercial, standardized instruments), and then turn to how we can do better with more authentic assessments.

Even with all the research that has gone into the SAT, in absolute terms it still isn't very good. The 2008 validity study from the College Board gives us (page 2):
"[T]he weighted average correlation between SAT writing scores and English composition course grades was 0.32, after correcting for range restriction."
This is the most direct link I could find in the study between the SAT and learning outcomes. Here we have actual writing in a standardized environment, correlated with grades in courses where writing is taught. If the course grade is related to how well students can write, then their potential as demonstrated by the SAT writing component ought to line up with it. And it does, but only to the tune of explaining 10% of the variance in grades (0.32 squared is about 0.10). More generally, the variance in first-year college grades explained by all the components of the SAT combined is 25% (page 5, squaring the adjusted R).
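For readers who want to check the arithmetic, here is a minimal sketch in Python. The function name is my own invention for illustration, and the only inputs are the 0.32 correlation and the 25% figure quoted above. The "implied R" line just runs the squaring backwards; it is not a number quoted from the study.

```python
import math


def variance_explained(r: float) -> float:
    """Share of variance in the outcome accounted for by a predictor with correlation r."""
    return r ** 2


# Correlation between SAT writing scores and English composition grades,
# as quoted from the validity study.
sat_writing_r = 0.32
share = variance_explained(sat_writing_r)
print(f"r = {sat_writing_r:.2f}: {share:.1%} explained, {1 - share:.1%} unexplained")

# Running the same arithmetic backwards: if 25% of the variance is explained,
# the correlation behind it is the square root of 0.25.
# (Assumption: this is back-calculated from the 25% figure, not read from the study.)
implied_r = math.sqrt(0.25)
print(f"25% explained corresponds to R = {implied_r:.2f}")
```

The point of spelling it out is how quickly squaring shrinks a correlation that sounds respectable: most of the variation in grades is left unaccounted for.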

Any predictor with a nonzero correlation may be useful in the right context. But if we consider the SAT's performance as an upper limit on what industrialized tests can do, it's a warning for current and contemplated high-stakes applications. This is a good place to segue to the misstep in my prior article.

I gave three advantages of standardized tests. Here they are:
  • They can have good reliability, so they seem to measure something.
  • Because of this, they can create self-sustaining corporations that provide a standardized service to higher education, so there are fixed points of comparison.
  • Even if the validity isn't stellar, some information is better than none, right?
I was still thinking mostly of the SAT when I wrote this, so you might consider this the best-case scenario. However, I intended it to be clear (it wasn't) that this is damning with faint praise: although these bullets describe how standardized tests have colonized an ecological niche successfully, these reasons aren't sufficient to warrant their use for high-stakes purposes, such as value-added comparisons of institutions.

The missing "anti-bullets" that describe the corresponding disadvantages are respectively:
  • Depending solely on reliability is like searching for your lost keys under the lamp because that's where the light is. The drive for reliability creates artificial conditions like timed multiple-choice tests that have very little mechanical relationship to real-world application of knowledge. So reliability comes at a cost in validity. 
  • The size of the testing industry means that it can use its resources to circumvent professional review of its products and sell them directly to politicians. 
  • The validity comment is specious: "some" information isn't good enough in many circumstances. Imagine driving on a cliff's edge and only knowing where 10% of the road is. (An unfortunate real case is the publication of ranks of teacher "value-added" scores by the LA Times last year.) Exacerbating the general dearth of validity is the fact that validity has to be determined locally, after the test results have been applied (that is, by deciding whether some proposition about the actual results can be supported or not). That is very hard to do. Most of the time we don't know whether our applications are valid or not.
I would like to apologize to Trudy for bungling the transition between my summary of her remarks and the bullets in the original. I detracted from the point and may have misled or confused some readers. Mea culpa.

We can now step beyond the problems and look for solutions, and Trudy gave a list of alternative approaches. Reliability isn't produced in a factory in Iowa City: we can get good reliability using rubrics on authentic student work. Additionally, the convenience of industrialized testing can nowadays be matched with local technological solutions. If we want to use portfolios, we don't have to have a thousand file cabinets, or mail copies to prospective employers; it's all on the web.

The big win for local solutions is validity: with the right application of technique, we can link assessments solidly with pedagogy at many levels of resolution, from individual assignments up to the institutional level.

As noted above, the testing industry has a lot of ability to protect its own interests, including great access to policy makers. With that in mind, I should wrap back around to the theme of the policy conversations at the conference: the calls for accountability. It's not good enough to just resist the K-12-like solution; we have to find an alternative we find acceptable, and there are some good candidates available. That, I hope, is a clear and unmangled statement of the main message.

I don't mean to demonize "big testing," by the way. Maybe the best solution would be if we could convince them that authentic assessment and electronic portfolios are profitable new markets. I don't know what the chances of that are.
