Thursday, April 02, 2009

Why it's Difficult, Part Two

This morning we continue to mine the rich vein of reasons why writing convincing assessment outcomes reports is hard. Yesterday I introduced a couple of terms I didn't define and made a claim whose defense I set up but didn't complete. I usually write these posts in my "quiet happy time" early in the morning, with a warm cappuccino and the glow of two 22-inch Dell monitors for company. Sometimes the clock is not my friend, however, and I have to rush.

I sat in a meeting yesterday about electronic timeclocks for keeping track of hourly workers, especially student work-study participants. It occurred to me that 'timeclock' is a ridiculous excuse for a word. What else are clocks good for? I mean, we don't have 'drinkcups' or 'drivecars' or 'writepens' (although we do have inkpens, hmmmm). I think this is one of the things one particularly notices when studying a foreign language--the goofy bits and sudden connections. Like 'iceberg,' which literally means 'ice mountain' in German. The point of this digression is that language and its implied categories are a two-edged sword. On the one hand, they allow us to hide detail when it's unnecessary. On the other hand, these very words/concepts may trick us into using them for something unsuitable. More on that topic here.

Back to the topic at hand... First the definitions.

Summative assessment is generally numerical and should have some reliability in the sense that we'd expect the same results if we asked for the numbers independently. It may or may not be valid--that depends on what you use the numbers for. Standardized tests are good examples of summative assessment. Summative assessments can theoretically tell you where you are. Depending on the level of concreteness (or transparency), it may or may not allow you to see what to do about the situation. Stock market prices are good non-education-related examples of a summative assessment. Summative assessment is monological, and the hallmark of a mature science where unit measurements can be rigorously applied to natural phenomena.

Formative assessment is intended to generate complex data about some area of interest. It can be fuzzy and subjective. Student portfolios are a good example. Formative assessment can be dialogical, but isn't necessarily so. The idea with this type of effort is to develop an action plan by identifying strengths and weaknesses--best done in a dialogue with interested parties. When we decide whom to vote for in an election, we make formative assessments. Natural science proceeds from formative assessments (forming ideas about categories of phenomena) to summative ones in any discipline where a predictive theory is developed.

Transparency and opacity are opposites. I don't know that these terms are generally used in descriptions of outcomes assessment, but I needed a term to describe the presence or lack of a working cause-and-effect model or theory. Sometimes you know where you are but don't know what to do next (holding stocks, for example, gives you precise knowledge of how much value is in your portfolio, but that number doesn't tell you much about whether to sell or buy more). If it's clear from an assessment what efforts would improve the situation, that's what I'm terming transparent. Lacking such a mechanism, it's opaque. In my encounters with the assessment profession, including accreditation bodies, I think it's generally assumed that most or all assessments are transparent. I find this puzzling, since it's obviously not true in our everyday lives. Hard science proceeds by forming predictive models out of observation and analysis--becoming transparent through a lot of hard work and precise methods.

The claim I made yesterday was that
The problem is that many people assume that you should be able to show continuous improvement in some particular metric. This is virtually impossible to do once learning outcomes have reached a certain level of complexity.
If our goal is continuous improvement, we can use either formative or summative assessment. But if we have to have a metric--a single number that we can graph and show that the lines go up--it would need to be summative. Formative assessments and follow-up actions can be used to make a nice narrative (and invitation to dialogue), but aren't going to be convincing as hard science. The problem with highly complex learning outcomes, like effective writing or critical thinking, is that finding a valid summative assessment is probably impossible. I've written a lot about that in these pages, so I'll spare you a recapitulation. Others, including C. Palomba and T. Banta in their book Assessment Essentials, have noted that increasing complexity generally means decreasing validity. This is particularly true of summative assessments because the mere fact of summation means that complex data is drastically reduced into something much simpler, leaking validity like a sieve.

Note that opacity does not prevent improvement. Even without a mechanism for knowing which action would lead to improvement, you can still make improvements! In the natural world this happens all the time--an arctic fox does not engineer its coat to be white when there is snow on the ground. There was no planning beforehand. Rather, natural selection breeds lots of designs and keeps the ones that work. The key is to do something and see if it works. If it doesn't, throw out that idea and try another one. This obviously isn't as efficient as having a transparent model of success, but it works very, very well when applied correctly.
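If you like to see ideas in code, here is a minimal sketch of that blind trial-and-error loop in Python. Everything in it is made up for illustration (the improve_by_trial name, the measure callback, the list of candidate interventions); the only point is that no cause-and-effect model is ever consulted.

```python
import random

def improve_by_trial(measure, interventions, rounds=10):
    """Blind trial and error: try a change, keep it only if the metric improves.

    measure(kept_changes) returns a summative score (higher is better).
    No cause-and-effect model is consulted; the situation stays opaque.
    """
    kept = []
    best = measure(kept)                 # baseline score with no changes
    for _ in range(rounds):
        candidate = random.choice(interventions)
        trial = kept + [candidate]
        score = measure(trial)           # do something and see if it works
        if score > best:                 # it worked, so keep it
            kept, best = trial, score
        # otherwise, throw out that idea and try another one
    return kept, best
```

You could hand it any summative score you trust as measure and any list of pedagogical experiments as interventions; the loop neither knows nor cares why a change worked.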

Richard Hake, in his blog and on the ASSESS listserv, comments regularly about the effectiveness of pre/post testing for ascertaining what works and what doesn't. This is a good example of how summative assessments can be used to improve teaching even without a transparent cause-and-effect model. The key is to keep track of the conditions that led to the outcomes--to do science, in other words. In my opinion, the common idea that simply doing assessments will lead to obvious ideas about improvement is dubious: again, it assumes that all situations are transparent and that the assessments are valid.
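For the curious, here is a sketch of what "keeping track of the conditions" might look like, using the normalized gain statistic Hake popularized, (post - pre) / (100 - pre). The section records and their numbers below are invented purely for illustration.

```python
def normalized_gain(pre, post):
    """Hake's normalized gain: fraction of the possible improvement achieved."""
    return (post - pre) / (100.0 - pre)

# Hypothetical section records: pre/post scores (percent) tagged with the
# teaching condition that was in effect. The numbers are made up.
sections = [
    {"condition": "lecture",     "pre": 45, "post": 55},
    {"condition": "lecture",     "pre": 50, "post": 58},
    {"condition": "peer-review", "pre": 44, "post": 68},
    {"condition": "peer-review", "pre": 52, "post": 71},
]

gains = {}
for s in sections:
    gains.setdefault(s["condition"], []).append(normalized_gain(s["pre"], s["post"]))

for condition, g in gains.items():
    print(condition, round(sum(g) / len(g), 2))   # e.g. lecture 0.17, peer-review 0.41
```

The comparison between conditions, not the individual numbers, is what carries the information about what to try next.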

Developing a useful predictive theory is hard. Even with a good metric of success, there's no guarantee that such a thing is even possible. To see how theoretically difficult the problem is, read about the No Free Lunch Theorem.

Let me turn now to a related claim. I fear I'm exhausting your patience, dear reader, with all this philosophy. Let's turn to something eminently practical that bears on the difficulty question. To borrow from Speaker Tip O'Neill:
Claim: All learning is local.
What this means is that it's the details that get learned. Let me illustrate with an analogy, and then get to the practical bit (I promise).

You can't plant a tree. You can't even plant an apple tree. You can plant a particular seed, with its own genetic specs, and this seed may develop into what we will generically call a 'tree'. But a tree itself is a category, a Platonic idea that exists only as a mental convenience for data-compressing our perceptions.

Similarly, you can't teach writing. Not as a general concept. You can only teach the particular things that correspond to the actual practice of writing. You can teach vocabulary words and grammar rules. Through dialogue you can teach how criticism works. But 'writing' is too general a concept to teach, unless you mean particularly the physical act of putting pen to paper--that you can teach, obviously.

When we assess writing or critical thinking, we're assessing something we can't teach or affect directly. We can observe whether it improves or not if we have a good assessment process. As a Gedankenversuch, imagine the following situation.
A student is taken to the writing center. The tutors are told: this student doesn't write well. See what you can do to help. What would the tutors do? Undoubtedly they would have to go through a discovery process to see what particular knowledge and skills the subject had, and what areas seemed weak. Is this a native speaker of English? How much actual writing has the student done? Is there a portfolio to look at? What kinds of problems are evident? What kind of writing is desired? Poetry? Advertising copy? Article review?
Too general a definition of the goal represents a data compression that loses significant elements. As a second example, imagine a speech class underway.
Instructor: Billy, you need to improve your speaking.

Billy: What should I do?

Instructor: I just told you--improve your speaking.
Compare that to:
Instructor: Billy, you have a habit of using verbal tics like "ummmm" to fill pauses. Try to work on that. There are some tips on my website.

Billy: Okay. Thanks.
As a practical example of taking generalization too far, consider the practice of assigning course grades.

Nobody likes grades as an assessment measure, and for good reason: grades lump together all kinds of activities into a numerical stew that is served up as a single letter, generally A, B, C, D, or F. That is fewer than three bits of information per course. Think of how much information (in bits or pages or words, for example) is generated during the semester, and compare that to the three-bit summary. Obviously we have lost almost all of the information generated during the class. The general lesson: don't toss out the detailed data needed for assessment. Don't lump together things that are dissimilar. Don't average vocabulary exercises together with class participation into some kind of creole to be served up on the master grade sheet (can you tell I'm getting hungry?). To compound this problem, individual course grades are lumped together to create a GPA. Now we've compressed the whole college experience down to about five bits of information. For more on grades, see Andrew McCann's recent blog post here.
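The arithmetic behind those bit counts is just a base-2 logarithm. Here is the back-of-the-envelope version; the assumption that GPA is reported to one decimal place is mine, and the exact count isn't the point.

```python
import math

letter_grades = 5                 # A, B, C, D, F
print(math.log2(letter_grades))   # about 2.32 bits per course grade

# Assuming GPA is reported to one decimal place on a 0.0-4.0 scale,
# there are 41 possible values for the whole college career:
print(math.log2(41))              # about 5.36 bits
```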

The truly ironic thing is that, having reached the conclusion that grades are not very useful for assessing learning outcomes, most if not all institutions turn around and repeat exactly the same mistake! I'm speaking of the use of rubrics to assess learning outcomes.

If all learning is local, then the focus, and the best source of data, might be formative or summative data coming from rich, rubric-type specifications. That's not the problem. The problem arises when the individual elements are summarized into some kind of "assessment goop" to produce an abstract category. Let's take an example.
Suppose a foreign language class instructor tests vocabulary, verb conjugation, noun declension, and use of prepositions as an assessment. The specifications are well thought out in a rubric definition, and applied throughout the course to develop a rich set of assessment data. So far so good. But if these are then lumped together through some kind of numerical voodoo to produce an "overall assessment," we've committed the same sin that was supposed to be abolished by moving away from grades. This new category of "Word Use" or something can't be taught directly (too general), and we gain little through this data compression. It would be far better to preserve the details in full rather than summarize them. After all, computer technology can easily accommodate this volume of data.
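To see how much that numerical voodoo hides, imagine two students scored on the rubric above. The dimension names and the scores below are invented, but the averaging step is the typical move.

```python
# Hypothetical rubric scores (1-4) on the dimensions from the example above.
student_a = {"vocabulary": 4, "conjugation": 4, "declension": 1, "prepositions": 1}
student_b = {"vocabulary": 2, "conjugation": 3, "declension": 3, "prepositions": 2}

def overall(scores):
    """The 'assessment goop' step: average everything into one number."""
    return sum(scores.values()) / len(scores)

print(overall(student_a))   # 2.5
print(overall(student_b))   # 2.5, identical, though the students need very different help
```

Two very different students, one identical "overall assessment." The averaged number can't tell the instructor, the tutor, or the next teacher which of them needs help with declensions.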
Ultimately, one could hope that grades could be replaced altogether with a completely different kind of transcript. Instead of a neat summary that sheds its validity on the way to its three-bit conciseness, we'd have a messy but valid list of very detailed learning "atoms" and the student's success with each. This probably can't be standardized. Students can't really be standardized either. But think how much richer such a report would be for a prospective employer. Yes, it would take longer to read. Anyone wanting a quick screening device would be pretty much out of luck.
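What might one "atom" on such a transcript look like? Here is one possible shape, with field names I've invented just to make the idea concrete.

```python
# A hypothetical entry in a detailed transcript; every field name here is invented.
transcript_atom = {
    "skill": "conjugates regular -er verbs in the present tense",
    "course": "FRE 101",
    "evidence": ["quiz 3", "oral interview"],
    "result": "consistent success",   # kept as a description, not averaged into a grade
}
```

Multiply that by a few hundred entries and you have the messy, valid, hard-to-skim document I'm describing.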

The cynical side of me tells me that a secondary service would quickly spring up, probably from The College Board, that would (for a price) serve up standardized ratings of these rich transcripts. Soon, big employers would use these ratings to screen applicants, much as they use GPA now. But the difference would be that anyone willing to take the time to actually read the details could have an edge--there's a price to be paid for data compression. Fast food is convenient. It doesn't always taste good.

Next: Part Three
