Wednesday, November 02, 2011

Language of Assessment: Session Summary

This post summarizes the conclusions from my presentation yesterday at the Assessment Institute in Indianapolis. Many thanks to Trudy and everyone else who helps make this conference happen!

The topics below are taken from my Conclusions slide. They all concern the relationship between reality (actually affecting events) and language (observing, understanding, planning). This is pictured in the diagram below, where R = reality and L = language.

Language is easy. Reality is hard.
Language is mostly combinations of arbitrary signs (pointing with your finger and 'oink' being two exceptions), and its flexibility comes from the infinite number of ways we can arrange these signs (words). I used the example of QR codes like the one pictured below to contrast reality (a small square) with language (about $10^{300}$ different codes expressible--probably more than the number of atoms in the observable universe).
Source: Wikipedia
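To make the contrast concrete, here's a quick back-of-the-envelope calculation. The grid size is my assumption for illustration--real QR codes reserve many modules for timing patterns and error correction, so the usable code space is smaller than the raw pattern count:

```python
import math

# A Version 3 QR code is a 29x29 grid of black/white modules, so the raw
# pattern space is 2^(29*29). Compare that to the ~10^80 atoms usually
# estimated for the observable universe.
modules = 29 * 29                        # 841 binary cells
patterns_log10 = modules * math.log10(2)  # log10 of 2^841
print(f"about 10^{patterns_log10:.0f} raw patterns")  # about 10^253
```

A tiny square of ink can, in principle, distinguish among more patterns than there are atoms to print them on--which is the point about language's cheap expressiveness.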

The larger point here is to be careful that we don't become disconnected from reality. This can happen in a number of ways. We can, for example, do everything through planning, but then fail to execute. Or we can be fixated on some model, some preconceived understanding that we don't deviate from, and force observations to conform to this worldview. The language we choose, including the types of assessments and how they convert raw observations into data, necessarily limits us, and it's good to be cognizant of those limits.

Seek meaning: use contextual language
When talking to faculty members and other non-assessment experts, I recommend avoiding jargon like 'measurement', 'validity', and such. It's more productive to use language that they already use. I don't even like to use 'assessment', since the point of the process is really improvement, not just assessment, and the word is now a lightning rod for instant opposition in some quarters. Of course, if you have to get a report out of them for accreditation, it isn't going to matter much what you call it. The Happy Happy Fun-Time Learning Report isn't going to fool them. The deeper philosophy behind the advice under this heading is that the faculty members really are the experts--they know their students, the material, the history of the curriculum, the changes that have been made, and what is on tests, and have all this rich contextual information. They are already familiar with the idea of academic success or failure, and don't really need a new lexicon to discuss it.

Administrators like summative/monological language
From the Department of Education down, 'performance measures', 'accountability', and 'value-added' are words that reflect the top-down mindset. They often want to see dashboards that lead to easy conclusions about the success or failure of some endeavor.  For more on monological/dialogical, which I didn't cover, see Assessing the Elephant.

Faculty like formative/dialogical language
Faculty may also want to see summary graphs of indices (e.g. pass rates), but unless there's some model of cause and effect that we can build from the information, it's hard to figure out what to do with the data. In other words, we need to be able to construct a story that makes sense. For this, provide details like percentages (e.g. distributions instead of averages), or other connections like correlates, shown below (see my previous posts on that).

Rubrics are particularly useful, especially if the faculty members construct them using language that makes sense to them (e.g. fitting them into their cause/effect model). The scale used is particularly important. I didn't mention this in the talk, but I will here. Often we use a PAGE rubric (poor, average, good, excellent) for accomplishment levels, but this should only be done for skills or knowledge that are not going to be under long development. For example, you wouldn't want to use this to rate writing because there's really no upper limit on how good you can get. Moreover, a freshman who gets Excellent ratings in the first year is likely to get Excellent in the fourth year as well--demonstrating no improvement! I use a PAGE rubric for rating effort, since the concept of no effort to maximum effort seems a good fit. David Dirlam has a comprehensive way to create very far-reaching rubric language that I won't say more about here. Google him and find out more. I referenced him here.

I mentioned the FACS a few times in the talk. It's a rating scale based on a student's career. You can read all about it in this manuscript or through these links.

Predictive validity of tests is probably not great
I showed examples from the College Board's benchmark on the SAT, which gives a maximum 65% correct classification rate. More on that topic in this article.  The SAT is exceptional in that we actually can test its predictive validity. Validity studies for most standardized tests are underwhelming, and don't address this crucial point--the main point, really. If we are content to assume that being able to answer a test question correctly at this moment translates into some future ability without actually testing it, then okay. But I think that's unreasonable given the nature of education. At the national scale, there is also a strange lack of interest in finding out what all this investment in education actually produces in outcomes that can be accurately assessed, like employment and earnings (financial aid data + loan data + IRS data).
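To see what a 65% correct classification rate means in practice, here is a hypothetical sketch. All counts are invented for illustration--only the 65% figure comes from the College Board benchmark discussed above:

```python
# Hypothetical tally for 100 students, comparing a benchmark's prediction
# against what actually happened in the first year.
predicted_pass_did_pass = 50   # benchmark said pass, student passed
predicted_pass_did_fail = 20   # benchmark said pass, student failed
predicted_fail_did_pass = 15   # benchmark said fail, student passed
predicted_fail_did_fail = 15   # benchmark said fail, student failed

correct = predicted_pass_did_pass + predicted_fail_did_fail
total = (predicted_pass_did_pass + predicted_pass_did_fail
         + predicted_fail_did_pass + predicted_fail_did_fail)
print(f"correct classification rate: {correct / total:.0%}")  # 65%
```

Put this way, the benchmark mislabels roughly one student in three--which is why I think "probably not great" is the fair summary.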

Report summaries, not abstractions
I left this off the conclusions page, but added it here because it relates to the previous topic. Let me illustrate with an example. Suppose we have an English proficiency exam to place incoming students into a first writing course. This is a test we can (and should) check the predictive validity of by looking at success rates. We can make summary statements about test results like this:
The 2011 English proficiency test results showed that 79% met the faculty-established criterion to be placed into ENG 102.
The validity of this statement is unquestionable as long as the calculation was done correctly. Compare that to:
According to the 2011 English proficiency test, 79% of our students can write well.
This substitution of a known fact "met the criterion" with a notion "can write well" takes us from a perfectly valid statement to one that is completely subjective, depending on what the reader's idea of "can write well" is. If the reader is trusting, he/she may just pass over this, but the validity of the statement is essentially unknowable.

This isn't hair-splitting--it's essential. If we allow ourselves to jump from what's known to undefined and untestable abstractions, we may end up creating educational systems that follow this illusion into irrelevance (e.g. a completely artificial test-driven culture like No Child Left Behind). If we stick to what we actually know to be true, we don't have this problem. Of course, this comes into conflict with the summative view demanded by the DoE. We need to make the conversation more nuanced, I think.

Don't aggregate different kinds of stuff
This falls in the nuts and bolts category. You wouldn't normally add different sorts of data together, like age plus shoe size, and expect to get anything useful. Why do we add up rubric dimensions for an 'overall score'? The only purpose it serves is to save space, by reducing everything to one number. But it's really hard to make sense of the number, and comparing one score to another is problematic. For example, suppose student papers are rated on a 1-10 scale on Correctness (e.g. grammar and spelling), Style, and Audience. Is a three point deficit in correctness exactly made up for by a surfeit of style? It's hard to see how that could be the case.
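A small sketch of the problem (scores invented for illustration):

```python
# Two papers rated 1-10 on three rubric dimensions.
paper_a = {"Correctness": 9, "Style": 4, "Audience": 7}
paper_b = {"Correctness": 4, "Style": 9, "Audience": 7}

total_a = sum(paper_a.values())
total_b = sum(paper_b.values())
print(total_a, total_b)  # 20 20 -- identical 'overall scores'
```

The totals are identical, but the papers need opposite interventions: one student needs help with style, the other with grammar. The aggregate erases exactly the information a teacher would act on.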

Average data as a last resort: use richer displays
Learning outcomes data is complex. Reducing a rich dataset into one number is like burning it and poking through the ashes. Use averages (or better: medians) when you really only have room for one number. For Likert scale data, reporting out the percentage of Agree and Strongly Agree is often more meaningful to the reader than an average.
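A quick illustration with invented Likert data: the average lands on 'Neutral' even though almost nobody is neutral, while the percentage view keeps the split visible.

```python
# Likert responses coded 1 (Strongly Disagree) .. 5 (Strongly Agree).
responses = [5, 5, 4, 4, 4, 3, 2, 1, 1, 1]

mean = sum(responses) / len(responses)
agree_pct = sum(r >= 4 for r in responses) / len(responses)
print(f"mean = {mean:.1f}")        # 3.0 -- reads as 'Neutral'
print(f"agree = {agree_pct:.0%}")  # 50% chose Agree or Strongly Agree
```

This is a polarized group, not a neutral one--the single number is the ash left over from burning the dataset.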

Validity is not a property of a test, but of a statement
The quote below pretty much says it all.

It is a common misconception that validity is a particular phenomenon whose presence in a test may be evaluated concretely and statistically. One often hears exclamations that a given test is “valid” or “not valid.” Such pronouncements are not credible, for they reflect neither the focus nor the complexity of validity.
– College BASE Technical Manual

It is statements that are valid or not, not tests or results from tests--just as anything we say can be true or not (true = valid). You would think that this would require test makers and test users to only say things they know are true, but this is not the case. The temptation to make the leap to abstraction is irresistible. I wrote a paper on this topic here.

Complexity and reliability don't go together
If something takes more language to describe, it's more complex. When we look at something complex, there are by definition many ways of looking at it. This means that we often have a collision between reality and language when we assess learning. If we insist on a simple bad-to-good scale (one dimensional, in other words), we have to throw out much of the original information in order to data-compress it. It's similar to the complaint about averaging. Complex outcomes will have lower reliability in ratings because there are valid differing opinions. I used the example of rating the taste of green beans. You may like them al dente. I like them mushy. What's the correct response on the answer key?
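Here's a sketch of how this shows up as low inter-rater reliability. All scores are invented, and 'agreement' here is just the fraction of exact matches between two raters, not a formal reliability statistic:

```python
def agreement(a, b):
    """Fraction of items on which two raters gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Ten papers scored 1-5 on a simple, well-defined dimension...
simple_r1 = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
simple_r2 = [3, 4, 2, 5, 3, 4, 3, 2, 3, 5]

# ...and on a complex one, where raters legitimately weigh
# different aspects of the work.
complex_r1 = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
complex_r2 = [5, 2, 4, 3, 1, 5, 2, 4, 5, 3]

print(agreement(simple_r1, simple_r2))    # 0.9
print(agreement(complex_r1, complex_r2))  # 0.0
```

The disagreement on the complex dimension isn't rater error to be trained away; like the green beans, it reflects genuinely different valid perspectives squeezed onto one scale.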

You can see that the summative approach favored by policy-makers is not a very natural way to look at learning, and the gap widens the more complex the outcome becomes. A simple test for multiplication is probably sufficient because its complexity is so low. But a multiple-choice test for understanding of literature (like the one my daughter took last year) is simply a result of abstraction (test result = ability) and convenience (relatively cheap, easy to score).

Non-cognitives are important
Outcomes other than knowledge and skill include attitude and personal behaviors. These are increasingly recognized as important in higher ed. I've written more about that subject in my blog.

Opportunity to focus on accomplishment
This is the punch line, that provides a hopeful way out. If the critique above seems negative in parts, it's only intended to be an honest look at what we're doing so we can make the next evolutionary improvement. My suggestion is that we stick to facts we know when using assessment results (that is be more modest in our claims), and take two new approaches. One is based on the FACS--gather subjective opinions about our students so we can see if our tests predict those real-world opinions. (I didn't mention this in the talk). The second, and more important suggestion, is that we stop thinking of ourselves as preparing students for something after college. Instead we have the opportunity to lead them to authentic accomplishments that they can use to build their professional portfolio beginning in their first year.

We test our students constantly. According to the results, they graduate knowing how to think and communicate. That's what the diploma means. 
Our students demonstrate achievement through substantive works that stand on their own. You can look at them in the portfolios that they build during their time with us. You will see their actual accomplishments in their fields of study and be able to judge for yourself their quality.
The first approach is based on unprovable claims, and has students waiting on the 'factory floor' to receive their certificate. This is the summative industrial standardized-in, standardized-out approach. The second approach is more motivating (I suspect) to both students and faculty--it allows students to see real-world outcomes as they go along: ones that are incremental and achievable, that they can compare and compete with, and that retain meaning after graduation as part of their professional history. And it's free--just use LinkedIn for the 'portfolio' system.

I have a lot more to say about this, but it will have to wait.

Thanks to everyone who came to the session--please stay in touch and share your successes and failures with us so we can imitate the former and avoid the latter.
