
Wednesday, February 29, 2012

Test Fail

I came across two articles this morning with "test" and "fail" in the title. They're both worth a look.

"Failed tests" at The University of Chicago Magazine. Quote:
Neal, a professor in economics and the Committee on Education, insists it’s a “logical impossibility” that standardized tests, as they’re most often administered, could assess both teachers and students without compromising teacher integrity, student learning, or both. “The idea is that we want faculty held accountable for what students learn, so the tool that we use to measure what students learn is the tool that we should use to hold faculty accountable,” Neal says. “It’s all rhetorically very pleasing, but it has nothing to do with the economics of how you design incentive systems.”
Next is "Standardized Tests That Fail" in Inside HigherEd this morning. Quote:
“We find that placement tests do not yield strong predictions of how students will perform in college,” the researchers wrote. “In contrast, high school GPAs are useful for predicting many aspects of students’ college performance.”  
You may want to contrast that with another article by a test score true believer (my term). This one has stuck in my craw for a long time because it's so bad, but I'll say no more about it. Anyone with basic critical thinking skills can figure out what's wrong with it. "'Academically Adrift': The News Gets Worse and Worse" in The Chronicle. I can't disagree with the conclusion of the article, however, so I'll quote that:
For those who are dissatisfied with the methods or findings of Academically Adrift, who chafe at the way it has been absorbed by the politicians and commentariat, there is only one recourse: Get started on research of your own. Higher education needs a much broader examination of how and whether it succeeds in educating students. Some of that research will doubtless become fodder for reckless criticism. But there's no turning back now.
[Update 3/1/2012] Here's one more: "The True Story of Pascale Mauclair" in EdWize. It's a horror story about the abuses of publishing standardized test results that are used to rate public school teachers. The bold in the quote is mine.
On Friday evening, New York Post reporters appeared at the door of the father of Pascale Mauclair, a sixth grade teacher at P.S. 11, the Kathryn Phelan School, which is located in the Woodside section of Queens. They told Mauclair’s father that his daughter was one of the worst teachers in New York City, based solely on the [Teacher Data Results] reports, and that they were looking to interview her.

Wednesday, June 23, 2010

Proxy Problems

You may have seen this New York Times article about grades at Loyola Law School.  Quote:
The school is retroactively inflating its grades, tacking on 0.333 to every grade recorded in the last few years. The goal is to make its students look more attractive in a competitive job market.
Or from the same source, "Under Pressure, Teachers Tamper With Tests."  Quote:
The district said the educators had distributed a detailed study guide after stealing a look at the state science test by “tubing” it — squeezing a test booklet, without breaking its paper seal, to form an open tube so that questions inside could be seen and used in the guide.
Motivation?
Houston decided this year to use [standardized test] data to identify experienced teachers for dismissal
And then there's the metaphysics of time, expressed in the concern "Credit Hours Should Be Worth the Cost, House Panel Members Say" in The Chronicle.
The standard of a credit hour, which is not actually a full 60 minutes in most cases, is deeply embedded in higher education as a benchmark for earning a degree. But the definition of what constitutes a credit hour has become muddled in recent years with the increase in online education.
Or an example from the People's Republic of China, courtesy of Yong Zhao at Michigan State, where standardized tests have very high stakes:
[The test] puts tremendous pressure on students, resulting in significant psychological and emotional stress. Imagine yourself as a 6th grader from poor rural village—how well you do one exam could mean bankrupt your family or lift your family out of poverty, give your parents a job in the city, and the promise of going to a great college.
About the tests themselves:
[T]he selection criterion is only test scores in a number of limited subjects (Chinese, Math, and English in most cases). Nothing else counts. As a result, all students are driven to study for the tests. As I have written elsewhere, particularly in my book Catching Up or Leading the Way, such a test-driven education system has become China’s biggest obstacle to its dream of moving away from cheap-labor-based economy to an economy fueled by innovation and creativity. The government has been struggling, through many rounds of reforms, to move away from testing and test scores, but it has achieved very little because the test scores have been accepted as the gold standard of objectivity and fairness for assessing quality of education and students (as much as everyone hates it).
Changing the topic yet again, there's a June 13th article in The Chronicle entitled "We Must Stop the Avalanche of Low-Quality Research."  The thesis is:
While brilliant and progressive research continues apace here and there, the amount of redundant, inconsequential, and outright poor research has swelled in recent decades, filling countless pages in journals and monographs.

Then there are presidential aims of doubling the number of college graduates by 2020, juxtaposed with potentially "toxic degrees" and alpine debt generated by for-profit colleges (see last post).  This tension, as well as speculation about what it means for all of higher ed, is explored in The Chronicle's "New Grilling of For-Profits could Turn Up the Heat for All of Higher Education."  Here's the bit where the writing on the wall appears:
Many of the issues at stake, however, could mean harsher scrutiny for all of higher education, as worries about rapidly growing costs and low-quality education in one sector could raise questions about long-accepted practices throughout higher education.
Congress and colleges still lack a firm sense of "what our higher education system is producing," said Jamie P. Merisotis, president of the Lumina Foundation for Education. "The model of higher education is starting to evolve, but it's not clear to us what that evolution looks like," he said.

If you didn't read that and say "uh-oh," take another look :-).  Here's a hint, from a May article:
The [Education] department is already working with the National Governors Association and its next chairman, West Virginia's governor, Joe Manchin III, a Democrat, to develop college-graduation-rate goals for each state and eventually, for each institution of higher education. That kind of push is necessary, Mr. Duncan said, if the country is to meet President Obama's goal for the United States to have the world's highest proportion of residents with a college degree by 2020.
What do all these have in common?  In each case, we look at one thing and imagine that it is something else.  Grades equate to academic performance, the number of papers published equates to professional merit, credit hours equate to time and effort expended in learning, standardized test scores equate to academic potential for students and successful teaching for teachers, and graduation equates to (I presume) preparation for satisfying employment and a successful life.

All of these are proxies, and they each have problems. Anyone who works in assessment is in the business of creating proxies.  They might have problems too.  Problems are created, for example, if there is economic value associated with the outcome, which is true for all of the above. Dept of Ed is willing to pay for grads?  Hey, we'll give you grads.  Here's the bill.

Other problems are related to validity: how closely does the proxy track what you're actually interested in?  Note that for true statements of fact, validity is never an issue.  If I say "35 students took the survey," and that's true, there is nothing to question.  It's when I start to talk about what the survey results mean that I have to worry about the validity of statements like "students want a better salad bar in the cafeteria," or whatever.


There are some proxies that work pretty well. Money, for example.  A dollar bill is a proxy for anything you can buy for a dollar.  It's like a wild card. You'd think such a crazy idea would never work, but it mostly does.  Why?  Because it's very forcefully regulated. There's a strong incentive to run off your own $20 bills on the color copier, but you'll end up in jail, so you probably don't do that.  Where this proxy fails is where it's not tightly controlled. Like in banks, which can print their own money (for all practical purposes, this is how banks work).  If they create too much and, say, cause a booming market in stuff we can't really afford, problems occur.  Or if the government itself just starts printing the stuff wholesale. In short, if the scarcity of the proxy matches the scarcity of some set of important commodities, the buying power ought to behave itself.  (The Economist puts out a Big Mac index based on this idea of purchasing power parity.)
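For the curious, the Big Mac index amounts to a single division. Here's a minimal Python sketch with invented prices (not The Economist's data): the implied purchasing-power-parity rate is the local burger price divided by the U.S. price, and comparing it to the market exchange rate gives a rough over- or under-valuation.

```python
# Big Mac index in miniature. All numbers are invented for illustration.
price_local = 21.0   # hypothetical local-currency price of a Big Mac
price_usd = 5.0      # hypothetical U.S. price in dollars
market_rate = 5.2    # hypothetical market exchange rate (local units per USD)

implied_ppp_rate = price_local / price_usd      # 4.2 local units per USD
valuation = implied_ppp_rate / market_rate - 1  # about -19%: local currency looks undervalued

print(f"implied PPP rate: {implied_ppp_rate:.2f} per USD")
print(f"valuation vs. USD: {valuation:+.1%}")
```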

How would you put this kind of enforcement into practice in an assessment situation? First, you have to keep people from cheating--printing their own money, so to speak.  So researchers aren't allowed to break one long paper into two in order to get more publications.  Schools can't artificially inflate grades. Colleges can't crank out graduates who didn't complete a suitable curriculum satisfactorily.  Teachers can't see the standardized test before they're supposed to.  And so on.  This may not be possible for your assessment, in which case you should probably abandon it.  Like the first example: how in the world can you keep researchers from padding their vitae?

The harder problem is validity. With dollar bills, validity is enforced by law and custom.  It says "legal tender" on the bills, and while you could barter with tomatoes instead, that quickly becomes impractical.

Part of the difficulty with validity is that it changes once something has been announced to be a proxy.  For example, looking back at the 17th century, it might have made sense to rank researchers by the number of their publications, because they were probably not using that particular yardstick as a measure of success.  But once it's announced that papers = tenure, any validity you might have assumed before can no longer be assumed.  Either economic motivation (leading to gaming the system) or lack of it (apathy and inaccurately low performance) may affect validity.

A really bad proxy can ironically bring about the opposite of what you intend.  More on that here.

You would think with these problems that we would use proxies only as a last resort. But we seem to be hard-wired to want to use them.  Maybe there's some psychological reason for this.  Maybe it's part of the package for language-users. As exhibit A, I present one of the "solutions" to the posed problem of too many publications from the "low quality research" article above:
[M]ake more use of citation and journal "impact factors," from Thomson ISI. The scores measure the citation visibility of established journals and of researchers who publish in them. By that index, Nature and Science score about 30. Most major disciplinary journals, though, score 1 to 2, the vast majority score below 1, and some are hardly visible at all. If we add those scores to a researcher's publication record, the publications on a CV might look considerably different than a mere list does.
As some of the comments to this article point out, it's easier to inflate citations than it is even to crank out papers.  It's the siren song of the next, better proxy just...over...that...hill...  With a clever name like "impact factor," it must be good.

Moral: When you can, gather and analyze statements of fact about whatever it is you care about, rather than using a proxy.  More on this subject anon.

For other ideas about alternatives to grades and standardized tests, see fairtest.org, Joe Bower's blog, and Alfie Kohn's site.

Update: There's a fascinating article in Inside Higher Ed this morning called  "The White Noise of Accountability."  There are many proxies for "measuring accountability" mentioned, including this example:
The Louisiana Board of Regents, for example, will provide extra funding for institutions that increase not the percentage, but the numbers, of graduates by … allowing them to raise tuition.
The author's observation about student learning outcomes is (in my opinion) right on:
But if the issue is student learning, there is nothing wrong with -- and a good deal to be said for -- posting public examples of comprehensive examinations, summative projects, capstone course papers, etc. within the information environment, and doing so irrespective of anyone requesting such evidence of the distribution of knowledge and skills.
I interpret this as saying just publish performance information.  If we could get everyone interested in actual performance rather than proxies, we'd be a lot better off.  See "Getting Rid of Grades" for an extreme version of that idea.

The IHE article cites (and criticizes) an Education Sector report "Ready to Assemble: A Model State Higher Education Accountability System" and summarizes thus:
By the time one plows through Aldeman and Carey’s banquet, one is measuring everything that moves -- and even some things that don’t.
I took a closer look at the article to see what it says about learning outcomes--the heart of the matter. It doesn't take long to find problems:
[S]everal nonprofits have developed promising new ways to measure higher education quality that have become widely accepted and implemented by colleges and universities. The Collegiate Learning Assessment (CLA), which measures higher-order critical thinking and analytic reasoning skills, and the National Survey of Student Engagement (NSSE), which measures effective teaching practices, are two examples.
Standardized proxies.  The first one has serious validity and game theory problems (see "Questions of Validity" and  these related posts*), and the NSSE doesn't directly look at outputs at all.  There's more stuff about the CLA that looks like it came straight from the marketing department, and this assertion:
Colleges are often ranked by the academic standing of the students they enroll. But measures like the CLA allow states to hold colleges accountable for how much students learn while they’re in college.
Really?  That's a lot of faith in a proxy that doesn't even test discipline-specific material (e.g., you could ace the test and still be a lousy engineer or biologist or whatever your major was). There are other tests mentioned, but they are all standardized tests of general reasoning. Maybe the attraction of such things is that they give the illusion of easy answers to very difficult questions. As Yong Zhao put it in his article, describing the situation in the PRC:
The government has been struggling, through many rounds of reforms, to move away from testing and test scores, but it has achieved very little because the test scores have been accepted as the gold standard of objectivity and fairness for assessing quality of education and students (as much as everyone hates it).
The siren song goes on...

Related: "Fixing Assessment."

*These are all obviously my opinions. You can find out about the CLA yourself from their website. The test is unique and has some interesting merits as contrasted to standard fill-in-the-bubble tests.  My point is not that the test (or other standardized tests of general knowledge) can't be used effectively, but that assuming it's a suitable global measure for student learning at the college level is more weight than the proxy can bear. In the VSA and the "Ready to Assemble" article, the Spellings Report, and elsewhere, such tests are granted status that resembles an ultimate assessment of collegiate learning, which doesn't seem justified to me.

Thursday, May 27, 2010

Testing and Choice and Ruin

In a May 25th article in Education Week, Diane Ravitch, author of The Death and Life of the Great American School System: How Testing and Choice Are Undermining Education, summarizes what she thinks is wrong with Race to the Top, the Obama administration's version of NCLB.  Among her points are two game theory arguments:
The NCLB-induced obsession with testing and test-prep activities will intensify under Race to the Top because teachers will know that their future, their reputation, and their livelihood depend on getting the scores higher, by any means necessary.

By raising the stakes for tests even higher, Race to the Top will predictably produce more teaching to bad tests, more narrowing of the curriculum, more cheating, and more gaming the system. If scores rise, it will be the illusion of progress, rather than better education. By ratcheting up the consequences of test scores, education will be corrupted and cheapened. There will be even less time for history, geography, civics, foreign languages, literature, and other important subjects.

I wrote about this sort of effect in "The Irony of Good Intentions," and it's also related to the over-valuing of SAT scores in "Amplification Amplification."  There must be some psychological flaw in our species that leads us to accept easy answers over ones that work, often taking the form of some sort of credential.  For example:
  • Bond ratings substitute for real understanding of a company
  • Investment firm stock ratings do the same
  • Standardized test scores of complex behavior substitute for actual performance
  • Degrees and other credentials substitute for demonstrated performance
  • Paper currency stands in for something (what?) of value
  • Loan applications substitute for actual ability to repay
All of these are undoubtedly useful, but are also subject to bubbles.  We're in a credential bubble now, with online for-profits cranking out degrees, racking up massive loan debts (for students), and milking the government of aid dollars. 

Probably someone has done this already, but this inflation-of-value idea has an archeology we can discover in word origins. For example, the word "really" used to mean "actually," according to this etymology site.
really: early 15c., originally in reference to the presence of Christ in the Eucharist. Sense of "actually" is from early 15c. Purely emphatic use dates from c.1600; interrogative use (oh, really?) is first recorded 1815.
Now, really means something like "very."  The word "literal" has a precise meaning that describes a certain realness.  I have noticed it's becoming fashionable to use it merely as a false credential: on the radio I heard "and the audience was literally eating out of her hand."  Somehow I doubt that's true.

Similarly, words for "large" seem to have engorged themselves.  Things become Super, and Mega.  It can't be long before the advertising executives discover the untapped vein of other standard scientific prefixes: giga, tera, peta, exa, etc. 
Would you like fries with that?
Sure.
Do you want the Peta-pack or the Exa-plosion?
This arms race of "believe me!" may ultimately inflate so much that we require exponential notation to express how certain we are.  This may be another sign of Ray Kurzweil's Singularity:
Wow, that concert was REALLY^(10^25) good!
Oh yeah?  Well I saw one last week that was REALLY^(10^30) good!
And we'll probably be paying those teachers with these:

(image courtesy of Wikipedia)

Sunday, May 16, 2010

Time Flies Like an Arrow

...and fruit flies like a banana.

Words are tricky things.  Jerome Kagan puts it much more elegantly in his book What is Emotion? He takes issue with loose language in scientific writing like "yeasts cooperate," writing that:
Poets possess the license to use a predicate any way they wish, as long as an aesthetic effect is produced.  Sylvia Plath can write the swans are brazen and pots can bloom, but scientists are denied this creative freedom because the meaning of a predicate for a process, action, or function always depends on the objects participating in the process.
He elaborates on this point, using an example of psychologists attributing "fear" to mice who have been exposed to unpleasant stimuli, and shows convincingly that a physiological response is not the same as an emotion.  The larger point is that processes are more important than outcomes for scientific inquiry.  Assessing an emotional state as "afraid" is not as meaningful or useful as describing the process that caused a physiological reaction that we give a label to.

It's the identification and study of processes, not outcomes, that makes the hard sciences effective.  Anyone can see that the apple fell off the tree.  Counting and sorting the apples doesn't help us understand what happened. 

Both of these subtleties bedevil the business of learning outcomes assessment.  For the most part, we only study the results (the outcomes) of the educational enterprise.  There are some studies that try to identify better pedagogies (like 'active' vs 'passive' learning), but even with these it's not possible to build a procedure for predicting learning.  The only predictions that come out of learning assessment are of the sort "we saw x in the past, therefore we think we'll see x in the future."  If a student aces the calculus final, we believe that she will be able to do integrals in her engineering classes next term.

Even this simple inductive reasoning is subverted by what we call 'feature-creep' in programming.  Tests and other assessments can never actually cover the whole range of the learning we would like to assess, so we assume that we can broaden the scope.  Sometimes this is partially justified with other research, but usually not.  The inductive assumption becomes:
We saw x on our assessments, and we predict that we will see X in the future.
Here, X is some broadened version of x that uses the flexibility of language to paper over the difference.  I made a particular argument about this concerning the CLA in a recent post.  An analogy will make this clear.
A test for flying a plane is developed, with the usual theory and practice mix.  As part of the assessment, the test subject Joe has to solo on a single-engine Cessna.  This is what I've called x, the observation.  We now certify that "Joe can fly."  This is X, our generalized claim.  Clearly, this is misleading.  Joe probably can't fly anything except the Cessna with any aptitude.  Would we trust him with a helicopter, a jet fighter, or a 747?  Hardly.
Now the same thing in the context of writing:
A test for writing compositions is developed and standardized with rubrics and a 'centering' process to improve interrater reliability.  Joe scores high on the assessment and is labeled 'a good writer.'  But can we really claim that Joe can write advertising copy, or a lab report, or a law brief, or a math proof?  Hardly.
Generalizing beyond the scope of what has actually been observed is misleading unless one has good data to build an inductive argument from.  We can predict the orbits of celestial bodies we see for the first time because we have a model for the process.  Is there a model that tells us the relationship between writing an essay and writing a contract for services?

And yet if we look at the goals set out by well-respected organizations like AAC&U, we see (from the LEAP Initiative) these goals for learning:
  • Inquiry and analysis
  • Critical and creative thinking
  • Written and oral communication
  • Quantitative literacy
  • Information literacy
  • Teamwork and problem solving

All of these are general, just like "flying" or "writing."  This is where we begin to get into trouble.  The list is fine--the trouble is when we start to imagine we can assess these and use the results for prediction.  General skills require general assessments.  Judging from his autobiography, I think Chuck Yeager was pretty good at "flying," in a general sense--he spent his whole life doing it.  But in four years (or two years) of college we don't have the time or the mission to teach a student everything there is to know about even one of the items on the bullet list.  The list above only starts to make sense if we narrow each item to an endeavor that's achievable in our limited time frame.  For example:
  • Inquiry and analysis in world history
  • Critical and creative thinking in studio art
  • Written and oral communication in presidential politics
  • Quantitative literacy in basic statistics
  • Information literacy in using electronic resources
  • Teamwork and problem solving in basic circuit design
And of course, this is the normal way assessment works for improvement of a course or curriculum.  If we get enough data points of this specificity, we can study how generalizable these skills are.  We can compare notes between disciplines and see if critical thinking really means the same thing or not.  This is the Faculty of Core Skills approach.

But big general tests, big general rubrics, and calls for accountability at the institutional level all assume that x implies X.  This creates confusion and tension.  Victor Borden wrote recently about this in "The Accountability/Improvement Paradox":
In the academic literature and public debate about assessment of student learning outcomes, it has been widely argued that tension exists between the two predominant presses for higher education assessment: the academy's internally driven efforts as a community of professional practitioners to improve their programs and practices, and calls for accountability by various policy bodies representing the “consuming public.” 
 I don't see this as a paradox.  If you want a paradox, here's one from last week.  A paradox is a logical contradiction--this situation is an illogical one.  Borden gives a succinct summary of it (perhaps quoting Peter Ewell):
Assessment for improvement entails a granular (bottom-up), faculty-driven, formative approach with multiple, triangulated measures (both quantitative and qualitative) of program-specific activities and outcomes that are geared towards very context-specific actions. Conversely, assessment for accountability requires summative, policy-driven (top-down), standardized and comparable (typically quantitative) measures that are used for public communication across broad contexts.
This is a brilliant observation.  What sorts of big complex enterprises are measurable in this way?  Could you do this with a large business?  Sure--they are called audits.  A swarm of accountants goes through all the books and processes and a bunch of big documents are produced.  Or a bond rating agency like Standard & Poor's does something similar and arrives at a singular summative grade for your company--the bond rating.  We can argue about whether or not this is something accreditors do (or should do), but these processes look nothing at all like what passes for generalized outcomes assessment, for example in the Voluntary System of Accountability (NSSE, CLA...).  That would be the equivalent of giving the employees at Enron a survey on business practices, and extrapolating from a 10% response rate.

It's processes, not outcomes, that matter.  We only look at outcomes to get a handle on processes.  Audits look directly at processes for a good reason.  A giant fuzzy averaged outcome is worse than useless because it gives the illusion of usefulness thanks to the flexibility of the language.  The problem derives from misplaced expectations at the top:
Nancy Shulock describes an “accountability culture gap” between policy makers, who desire relatively simple, comparable, unambiguous information that provides clear evidence as to whether basic goals are achieved, and members of the academy, who find such bottom line approaches threatening, inappropriate, and demeaning of deeply held values.
As I have mentioned before, if policy makers want simple, comparable, unambiguous assessments of goals, they need to specify goals that can be assessed simply, comparably, and unambiguously.  So I went looking at the Department of Education to see if I could find them.  I came across a strategic plan, which seemed like the best place to look.  In the mission statement:
[W]e will encourage students to attend college and will continue to help families pay college costs. The Department intends that all students have the opportunity to achieve to their full academic potential. The Department will measure success not only by the outcomes of its programs, but also by the nation's ability to prepare students to succeed in a global economy as productive and responsible citizens and leaders.
That last bit is probably the most pertinent to this discussion. We could extrapolate these goals:
  • graduates should be competitive in the global economy
  • graduates should be productive
  • graduates should be good citizens
  • graduates should be leaders
What direct measures of these things might we imagine?  Here are some suggestions:
  • Employment rates
  • Salary histories
  • Voting records, percentage who hold public office, percentage who serve in the military
  • Percentage who own their own businesses
I had trouble with the leadership one.  You can probably think of something better than all of these.  But notice that not a single one has anything to do with learning outcomes as we view them in higher education.  You can stretch and say we have citizenship classes, or we teach global awareness, but it's not the same thing.

I find the goals wholly appropriate to a view from the national level.  These are strategic, which is what they ought to be.  They're still too fuzzy to be much good, but that's just my opinion.  I'd like to see goals targeted at science and technology like we did during the space race, but they didn't put me in charge.

Digging into the document, we find actual objectives spelled out.  Let's see how well I did at guessing.  It starts on page 25.
Objective 1: Increase success in and completion of quality postsecondary education.
To ensure that America’s students acquire the knowledge and skills needed to succeed in college and the 21st-century global marketplace, the Department will continue to support college preparatory programs and provide financial aid to make college more affordable. Coordinated efforts with states, institutions, and accrediting agencies will strengthen American higher education and hold institutions accountable for the quality of their educational programs, as well as their students’ academic performance and graduation rates.
The last phrase is hair-raising.  Hold higher education accountable for students' academic performance and graduation rates?  That's like holding the grocery store accountable for my cooking.  Sure, they have something to do with it--if the melon is rotten or whatever, but don't I bear some of the responsibility too?

That's just the objective.  Next come strategies for implementation.
Strategy 3. Prepare more graduates for employment in areas of vital interest to the United States, especially in critical-need languages, mathematics, and the sciences.
The Department will encourage students to pursue course work in critical-need foreign languages, mathematics, and the sciences by awarding grants to undergraduate and graduate students in these fields. The SMART grant program will award grants to Pell-eligible third- and fourth-year bachelor’s degree students majoring in the fields of the sciences, mathematics, technology, engineering and critical foreign languages. In addition, priority will be given to those languages and world regions identified as most critical to national interests.
Well, I got what I wished for.  This outlines objectives that are clearly strategic in nature and whose success is relatively easy to measure.  We can count the number of students who learn Arabic or graduate in Engineering.
Strategy 5 mentions learning outcomes for the first time, but the hammer falls in the next one:
Strategy 6. Expand the use of data collection instruments, such as the Integrated Postsecondary Education Data System (IPEDS), to assess student outcomes.
The Department will collaborate with the higher education community to develop and refine practices for collecting and reporting data on student achievement. To encourage information sharing, the Department will provide matching funds to colleges, universities, and states that collect and publicly report student learning outcomes.
We're going to put learning outcomes in IPEDS???  Didn't Conrad write about this? Where's all that useful stuff I dreamed up in one minute about employment, salaries, citizenship, and such?  Those pieces of the mission statement aren't even addressed.  Or maybe I missed them due to the lack of oxygen in my brain after reading strategy six.  It seems perfectly obvious that IPEDS will want comparable learning outcomes, which can only mean standardized tests like the ones the VSA requires.  If so, it won't be the end of the world, but it will be negative.  Beyond the expense and missed opportunity, we will institutionalize the kind of misplaced reasoning that in the earlier example would lead one to believe that Joe the new pilot can fly an SR-71 Blackbird.

Coming full circle to Kagan's book, he remarks that
A scientific procedure that later research revealed to be an insensitive source of evidence for a concept, because the procedure was based on false assumptions, is analogous to a magical ritual. 
He gives the example of Rorschach's inkblot tests.  That's hardly an apt analogy for standardized tests reported out nationally as institutional measures of success, however; inkblot tests are too subjective.  I suggest phrenology as a better comparison.

Postscript. As I was writing this, my daughter Epsilon was studying for her French "End of Course" standardized test (see this post).  I went down and reviewed a few sections with her.  She didn't know the names of the months very well, so I suggested making some flashcards.  She told me this: "I don't need to be able to spell them or tell you what they are, I only need to be able to recognize them.  It's multiple-choice."  The test only assesses reading and listening.  Speaking and writing are not part of it.  Since speech production and understanding are in different parts of the brain, we actually have some clue about the process in this case, and it tells us that the test is inadequate.  This may be okay--maybe those parts are covered somewhere else.  The point is that the subtleties matter.  If I said "Epsilon aced her French final," you might be tempted to think that she knows the names of the months.

She's down there making flash cards now. 

Tuesday, May 11, 2010

The End of the World Tests

My daughter Epsilon* has her End-Of-Grade (EOG) standardized test today in math. We spent last evening reviewing with the handy booklet provided. Here's one of the problems:
Joe is going to paint his bedroom: walls and ceiling. The room is 11' by 13' with 8' ceilings. There are three windows that have dimensions 4' x 6'. How much area does Joe have to paint?
This is typical of the sort of problem us math types come up with, and probably why students hate word problems. More fun than solving such problems is figuring out what's wrong with them. We had a good laugh because Joe doesn't have a door to his bedroom! I guess he goes in and out of the window.

It was interesting to watch the way Epsilon thinks about the problems. Clearly the teachers have a big job on their hands, and the simplest route to maximizing success on tests is to teach deductive processes through repetition. That is: there's a right way and a wrong way to do problems. Ironically, this can act like a rut on the test questions, which are multiple-choice. There were usually shortcuts we could employ that were quicker than using the deductive method. In case that dry description hasn't put you to sleep already, continuing the example above:
It's easy to see that the average wall length of Joe's bedroom is 12', so the perimeter of the room is 48'.  This is quicker than the cookbook method of adding up 2x11+2x13 on the calculator. Then we have to multiply by the 8-foot wall height, which is doubling three times.  Round 48 up to 50, double once: 100, twice: 200, thrice: 400.  Add in the approximately 12x12 ceiling to get 544, and subtract out the three 24-square-foot windows (round each up to 25, for 75 square feet) to get about 470 ft^2.  This is close enough to find the right answer on the list of possible choices.  And it passes the "smell test" because there are five surfaces to paint, each in the neighborhood of 10'x10'.
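For reference, here's the exact arithmetic as a quick Python sketch (the mental estimate above lands within about 15 square feet of it):

```python
# Exact computation for the practice problem quoted above (no door to subtract!).
length, width, height = 13, 11, 8          # room dimensions in feet
window_w, window_h, n_windows = 4, 6, 3    # three 4' x 6' windows

walls = 2 * (length + width) * height      # 2 * 24 * 8 = 384 sq ft
ceiling = length * width                   # 13 * 11    = 143 sq ft
windows = n_windows * window_w * window_h  # 3 * 4 * 6  = 72 sq ft

paint_area = walls + ceiling - windows     # 455 sq ft
print(paint_area)                          # the mental estimate gave ~470
```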

Part of my job was to let the kid know it's okay to take shortcuts to get the answer.  She's at that stage I remember well, where her intuition tells her what to do, but she can't articulate it.  Trusting that intuition will lead you astray occasionally, but in interesting ways.  Not trusting it dooms you to reliance on formulas. 

In terms of content, I was impressed by the number of topics addressed.  Seventh graders here are learning far more than I did in seventh-grade math.  I don't have enough information to know if the standardization of the curriculum in math is ultimately good or bad.  Math at this level is probably the best place to standardize because of the deductive nature of most of the work.  Whether or not it works in English is another question.

*That's not really her name; it's a tip of the hat to Paul Erdős, who called children "epsilons" because in real analysis proofs that letter is often used to denote a very small quantity.

Wednesday, April 28, 2010

Reflection on Generalization of Results

Blogging is sometimes painful.  The source of the discomfort is the airing of ideas and opinions that I might find ridiculous later (like maybe the next day).  Having an eternal memorial to one's dumb ideas is not attractive.  I suppose the only remedy is public reflection, which is no less discomforting.  To wit...

Yesterday I wrote:
The view from a discipline expert is naturally dubious of the claims that learning can be weighed up like a sack of potatoes, and the neural states of a hundred billion brain cells can be summarized in a seven-bit statistic with an accuracy and implicit model that can predict future behavior in some important respect.  Aren't critical thinkers supposed to be skeptical of claims like that?
I've mulled this over for a day.  A counter-argument might go like this:  A sack of potatoes has a very large number of atoms in it, and yet we can reduce those down to a single meaningful statistic (weight or mass) that is a statistical parameter determined from multiple measurements.  The true value of this parameter is presumed to exist, but we cannot know it except within some error bounds with some degree of probabilistic certainty.  This is not different from, say, an IQ test in those particulars.

I think that there is a difference, however.  Let's start with the basic assumption at work: that our neighborhood of the universe is reliable, meaning that if we repeat an experiment with the same initial conditions, we'll get the same outcomes.  Or, failing that, we'll get a well-defined distribution of outcomes (like the double-slit experiment in quantum mechanics).  Moreover, we assume that similar experiments yield similar results for a significant subset of all experiments.  This "smoothness" assumption grants us license to do inductive reasoning, to generalize results we have seen to ones we have not.  Without these assumptions, it's hard to see how we could do science. Restating the assumptions:
1. Reliability: An experiment under the same conditions gives the same results, or (weaker version) a frequency distribution with relatively low entropy.

2. Continuity: Experiments with "nearby" initial conditions give "nearby" results.
Condition 1 grants us license to assume the experiment relates to the physical universe.  If I'm the only one who ever sees unicorns in the yard, it's hard to justify the universality of the statement.  Condition 2 allows us to make inductive generalizations, which is necessary to make meaningful predictions about the future.  This is why the laws of physics are so powerful--with just a few descriptions, validated by a finite number of experiments, we can predict an infinite number of outcomes accurately across a landscape of experimental possibilities.

My implicit point in the quote above is that outcomes assessment may satisfy the first condition but not the second.  Let's look at an example or two.
Example.  A grade school teacher shows students how the times table works, and begins assessing them daily with a timed test to see how much they know.  This may be pretty reliable--if Tatiana doesn't know her 7s, she'll likely get them wrong consistently.  What is the continuity of the outcome?  Once a student routinely gets 100% on the test, what can we say?  We can say that Tatiana has learned her times tables (to 10 or whatever), and that seems like an accurate statement.  If I said instead that Tatiana can multiply numbers, this may or may not be true.  Maybe she doesn't know how to carry yet, and so can't multiply two-digit numbers.  Therefore, the result is not very generalizable. 
Example.  A university administers a general "critical thinking" standardized test to graduating students.  Careful trials have shown a reasonable level of reliability.  What is the continuity of the outcome?  If we say "our students who took the test scored x% on average," that's a statement of fact.  How far can we generalize?  I can argue statistically that the other students would have had similar scores.  I may be nervous about that, however, since I had to bribe students to take the test.  Can I make a general statement about the skill set students have learned?  Can I say "our graduates have demonstrated on average that they can think critically"?
To answer the last question we have to know the connection between the test and what's more generally defined as critical thinking.  This is a validity question.  But what we see on standardized tests are very particular types of items, not a whole spectrum of "critical thinking" across disciplines.  In order to be generally administered, they probably have to be that way. 

Can I generalize from one of these tests and say that good critical thinkers in, say, forming an argument, are also good critical thinkers in finding a mathematical proof or synthesizing an organic molecule or translating from Sanskrit or creating an advertisement or critiquing a poem?  I don't think so.  I think there is little generality between these.  Otherwise disciplines would not require special study--just learn general critical thinking and you're good to go.

I don't think the issue of generalization (what I called continuity)  in testing gets enough attention.  We talk about "test validity," which wallpapers over the issue that validity is really about a proposition.   How general those propositions can be and still be valid should be the central question.  When test-makers tell us they're going to measure the "value added" by our curriculum, there ought to be a bunch of technical work that shows exactly what that means.  In the most narrow sense, it's some statistic that gets crunched, and is only a data-compressed snapshot of an empirical observation.  But the intent is clearly to generalize that statistic into something far grander in meaning, in relation to the real world. 

Test makers don't have to do that work because of the sleight of hand between technical language and everyday speech.  We naturally conjure an image of what "value added" means--we know what the words mean individually, and can put them together.  Left unanalyzed, this sense is misleading.  The obvious way to see if that generalization can be made would be to scientifically survey everyone involved to see if the general-language notion of "value added" lines up nicely with the technical one.  This wouldn't be hard to do.  Suppose they are negatively correlated.  Wouldn't we be interested in that?

Harking back to the example in the quote, weighing potatoes under normal conditions satisfies both conditions.  With a good scale, I'll get very similar results every time I weigh.  And if I add a bit more spud I get a bit more weight.  So it's pretty reliable and continuous.  But not under all conditions.  If I wait long enough, water will evaporate out or bugs will eat them, changing the measurement.  Or if I take them into orbit, the scale will read differently.  The limits of generalization are trickier when talking about learning outcomes.  Even if we assume that under identical conditions, identical results will occur (condition 1), the continuity condition is hard to argue for.  First, we have to say what we mean by "nearby" experiments.  This is simple for weight measurements, but not for thinking exercises.  Is performance on a standardized test "near" the same activity in a job capacity?  Is writing "near" reading?  It seems to me that this kind of topological mapping would be a really useful enterprise for higher education to do.  At the simplest level it could just be a big correlation matrix that is reliably verified.  As it is, the implicit claims of generalizability of the standardized tests of thinking ability are too much to take on faith.
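To make the "big correlation matrix" idea concrete, here's a minimal sketch, assuming you had the same students assessed in several different contexts. The column names and numbers are hypothetical, purely for illustration:

```python
# Hypothetical data: one row per student, one column per assessment context.
import pandas as pd

scores = pd.DataFrame({
    "std_test_critical_thinking":  [62, 71, 55, 80, 68, 74, 59, 66],
    "essay_rubric":                [3.0, 3.5, 2.5, 4.0, 3.0, 3.5, 2.0, 3.0],
    "math_proof_exam":             [70, 60, 65, 85, 55, 75, 50, 72],
    "workplace_supervisor_rating": [4, 5, 3, 4, 4, 5, 3, 4],
})

# How well does performance in one context track performance in another?
# "Reliably verified" would mean replicating this matrix across cohorts
# and institutions, not computing it once.
print(scores.corr())   # Pearson correlation matrix
```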

So, I stand by the quoted paragraph. It just took some thinking about why.

Friday, April 23, 2010

Comparing CLA to Rubric Scores

“We found no statistically significant correlation between the CLA scores and the portfolio scores,” is the sentence that catches one's eye in this AAC&U feature article about University of Cincinnati's efforts to assess big fuzzy learning outcomes:
The students took the CLA, a ninety-minute nationally standardized test, during the same week in which faculty members assessed students’ e-portfolios using rubrics designed to measure effective communication and critical thinking. In the critical thinking rubric assessment, for example, faculty evaluated student proposals for experiential honors projects that they could potentially complete in upcoming years.  The faculty assessors were trained and their rubric assessments “normed” to ensure that interrater reliability was suitably high. 
 One administrator's conclusion about the mismatched scores is that:
The CLA can provide broad institutional data that satisfies VSA requirements, while rubric-based assessment provides better information to facilitate continuous program improvement. “When we talk about standardized tests, we always need to investigate how realistic the results are, how they allow for drill-down,” Robles says. “The CLA provides scores at the institutional level. It doesn’t give me a picture of how I can affect those specific students’ learning. So that’s where rubric assessment comes in—you can use it to look at data that’s compiled over time.”
You can find a PowerPoint show for the research here.  Here's a slide taken from that, which summarizes student perceptions of the CLA.
Here are the correlations in question:

It's hard to see how anything is related to anything else here, except maybe breaking arguments and analytical writing.  I would conclude that the CLA isn't assessing what's important to the faculty, and is therefore useless for making improvements.  Since UC is part of the VSA, they can't say that.  Instead they say:

The CLA is more valid? [choking on my coffee here] Valid for what? Saying that one school educates students better than another school?  The two bullets above seem Orwellian in juxtaposition.  How can an assessment be valid if it isn't useful for student-level diagnostics?  Yes, I understand that the CLA doesn't give the same items to each student, that it's intended to be used only to compare institutions or provide a "value-added" index, but the fact that cannot be escaped is that learning takes place within students, not aggregates of them.  At some point, the dots have to be connected between actual student performance and test results if they're going to really be good for anything.  Oh, but wait: here's how to do that.
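Incidentally, the "no statistically significant correlation" finding is the kind of thing anyone with paired scores can check in a few lines. A minimal sketch with hypothetical numbers (not Cincinnati's data):

```python
# Hypothetical paired scores, for illustration only.
from scipy import stats

cla_scores = [1050, 1120, 980, 1210, 1005, 1150, 1090, 995]   # CLA-style scale scores
rubric_scores = [3.0, 2.5, 3.5, 2.0, 3.0, 3.5, 2.5, 4.0]      # portfolio rubric, 1-4 scale

r, p = stats.pearsonr(cla_scores, rubric_scores)
print(f"r = {r:.2f}, p = {p:.3f}")
# "No statistically significant correlation" means p stays above the chosen
# alpha (usually 0.05), so the observed r could plausibly be noise.
```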

By the way, if you don't know the story of Cincinnatus, it's worth checking out. 

Monday, October 26, 2009

Impedance Mismatch

There's an article in InsideHigherEd this morning about NSSE and the "Wabash National Study," which studied changes in freshmen over their first year of liberal arts education. Their summary begins:
In our analysis of data from 3,081 students at 19 institutions in the first round of the study we found that, on the whole, students changed very little on the outcomes that we measured over their first year in college.
They provide an overview of findings (pdf) that gives more details. In particular:
[A]lthough students’ improvement on the CAAP Critical Thinking test was statistically significant, the change was so small (less than 1% increase) that it was practically meaningless.
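If "statistically significant but practically meaningless" sounds paradoxical, a tiny simulation shows how both can be true at once: with thousands of students, even a trivial average gain clears the significance bar while the effect size stays negligible. The numbers below are invented, not the Wabash data:

```python
# Invented illustration: a tiny true gain measured on a large sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 3081                                        # roughly the Wabash first-year sample size
gain = rng.normal(loc=0.05, scale=1.0, size=n)  # average gain of 0.05 standard deviations

res = stats.ttest_1samp(gain, popmean=0.0)
cohens_d = gain.mean() / gain.std(ddof=1)       # standardized effect size

print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}")  # p is likely below .05 at this n
print(f"Cohen's d = {cohens_d:.3f}")                     # ~0.05: practically negligible
```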
You can find two examples of the sorts of questions that the CAAP employs here (pdf). One asks a question about polling, referring to a hypothetical politician named Favor:
Favor's "unofficial poll" of her constituents at the Johnson County political rally would be more persuasive as evidence for her contentions if the group of people to whom she spoke had:
I. been randomly selected.
II. represented a broad spectrum of the population: young and old, white and non-white, male and female, etc.
III. not included an unusually large number of pharmacists.
The results of the CAAP and other surveys (such as student attitudes and behaviors) were correlated against six "teaching practices and institutional conditions:"
  • Good Teaching and High-Quality Interactions with Faculty
  • Academic Challenge and High Expectations
  • Diversity Experiences
  • Frequency of Interacting with Faculty and Staff
  • Interactions with Peers
  • Cooperative Learning
The summary of results shows that the "critical thinking" component was correlated positively and significantly with "good teaching" and "diversity experiences."

I'm not a fan of the status quo in general education, but it does seem only fair that, if we're going to judge accomplishment using tests, the tests align with the curriculum. Perhaps at the participating institutions this is the case, but I have trouble seeing where items like those in the critical thinking part of the CAAP are actually used in first-year liberal arts curricula. Certainly, many schools have critical thinking listed as a goal, but it gets defined in many ways. I don't like the term because of its fuzziness, and this example shows that well.

Take the example given about poll sampling. The answer can be arrived at by a bit of common sense, in which case this resembles perhaps an IQ test, or one might have encountered sampling in a Finite Math course or Intro to Psychology. But it's not exactly the level of material that students come to college for. In finite math, they might learn linear programming, which is a fairly complex analytical tool used to solve constraint problems. Of course, anyone who hasn't had that material would fail miserably.

But isn't that the point? By trying to "measure" (see Measurement Smeasurement for an explanation of the scare quotes) only the least common denominator of freshman learning--a prerequisite to standardization--aren't we really just applying an IQ-like test? It would be better to use targeted tests that correspond to the curriculum students actually take, rather than imagining that it can all be treated uniformly. Put another way, we shouldn't be surprised if students don't perform well on tests that don't correspond to what we've actually taught them.

Other posts on critical thinking are listed here.