Thursday, April 24, 2008

Why I don't understand the assessment results...

...and what I can do about it.

This post is for comments related to my session on Friday, April 25 in Cary, NC, at the NCSU Undergraduate Assessment Symposium. The description is:
In pursuit of generating general education assessment results, we may enthusiastically adopt methods that are convenient but artificial. We explore ways to ‘find’ authentic assessments and present summary reports that your grandma can understand. The benefits are a better integration of assessment with the curriculum, and a much easier time convincing faculty that you know what you’re doing. With examples, colorful charts, and a manageable dose of epistemology.
The PowerPoint slides will be posted on the college's assessment site. There you can also find information about Coker College's general education assessment program, the open source software we developed to manage accreditation documents, our compliance report, and other stuff.

UPDATE: Thanks to Dr. Pamela Steinke for noting some great questions during the talk. Here are my answers.

Q: To what degree does your institutional mission statement reflect your general education goals?

A: From the discussion, it seems as though Coker is a bit unusual in having liberal arts goals explicitly stated in the mission. The middle paragraph of our mission lists analytical and creative thinking, effective speaking and writing as educational goals. This makes it easier to rally the faculty around these.

Q: How often do the results of assessment data and assessment results get shared and acted upon at your institution?

A: For our general education assessment, we have a faculty meeting at the beginning of the year where a report is given on the big picture. Then departments get reports at the program and individual student level. Additionally, committees like the institutional effectiveness committee will use assessment data sporadically throughout the year.

Q: When trying to assess complex processes, the greater the reliability of your measure the less the validity. Do you have examples of this?

A: Remember that validity is in the eye of the beholder to a large extent. As reasoning creatures, we model phenomena with simple relationships (linear ones, for example), and the word "complex" in common usage can mean “hard to predict”. Complexity has a technical definition that is quite useful (see Kolmogorov Complexity on the web), but requires a fuller explanation than I can give here. It’s easy to create examples of high-complexity skills that can be reliably tested. For example, a single question can test if someone can pilot a 747: “Can you pilot a 747?” This would very likely be quite reliable in the sense that subsequent repetitions would give the same response. Is it valid? I wouldn’t trust my life to it! On the other hand, impressions from a first date can be assumed to be meaningful (valid), even though the reliability is zero: you can’t have two first dates with the same person, so the whole concept of reliability is meaningless in this instance. In daily life, most things we do fall in the second category.

A point I made in the session is that complex phenomena manifest themselves in more ways than simple ones. Testing for knowledge of simple multiplication, for example, consists of verifying that the subject can solve a few problem types. If we wanted to test for knowledge of the US tax code, on the other hand, how would we have complete confidence in our results? We'd literally have to test every part of the tax code--an impossible task for any individual, I imagine. Thus we tend to reduce complex outcomes to some subset. This can cause a mereological fallacy--substituting the part for the whole, as in the 747 example. This reduces validity if you care about the details that were left out.

Q: How important is it that general education assessments be done in context of the discipline? Can general education skills be assessed with validity outside of disciplinary context?

A: We assess liberal arts skills across the curriculum. I'm not quite sure which discipline the question refers to, but we get enough inter-rater reliability with our method to be happy about it (exact matches over half the time with different raters on the same student).
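
For readers who want to compute the same kind of figure, here is a minimal sketch of an exact-match rate between two raters. The ratings are made up for illustration, and the function name is my own; this is simple percent agreement, which does not correct for matches that would happen by chance.

```python
def exact_agreement(rater_a, rater_b):
    """Share of students on whom two raters gave exactly the same rating."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same students")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

# Hypothetical ratings of the same six students by two different raters.
first_rater = [3, 2, 4, 1, 3, 2]
second_rater = [3, 2, 3, 1, 4, 2]

agreement = exact_agreement(first_rater, second_rater)
print(f"Exact matches: {agreement:.0%}")
```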

Q: Means should only be calculated across the same unit. Do you have examples of inappropriate use of means?

A: Three pounds of grapes plus four pounds of nuts is seven pounds, but not of grapenuts. Aggregation works great with quantities, but not so well with qualities. (Averaging is just aggregation with a division afterwards.) So averaging math assessments with writing assessments is probably questionable. But this is exactly how GPAs are computed, of course! The average is pretty meaningless as a number. What a high GPA tells you, for example, is that a student did well in most of his or her classes. A proportion would do a better job. If you set B grades as your threshold, you could look at the percentage of course grades that meet that threshold and find that, say, Johnny has 34% and Mary 95%, which would more directly tell you what you’re interested in. A student who has a GPA of 2.0 could have had a 4.0 for a year, and then had a really rotten year because of some personal problem. Or it could be a solid C student. Are these the same thing? No. I realize that this sounds heretical, since grade point averages are the currency of the registrar's office, but it's the kind of figure you should take with a grain of salt. It tells you something, but maybe not the most important thing you care about. As in any kind of data compression, you run the risk of losing something important. There's a joke about engineers who want to account for the presence of humans in a building in order to anticipate their impact on the wireless network (water interferes with it). Their first assumption is "assume a human is a one meter diameter sphere of water." This data compression might work well in one context, but not another.
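
To make the GPA point concrete, here is a small sketch comparing the average with the proportion of grades at or above a B. The two transcripts are invented, the grade points assume a 4.0 scale with B = 3.0, and the function names are mine.

```python
GRADE_B = 3.0  # assumed threshold: a B on a 4.0 scale

def gpa(points):
    """Plain average of grade points, one entry per course."""
    return sum(points) / len(points)

def share_at_or_above(points, threshold=GRADE_B):
    """Proportion of course grades that meet the threshold."""
    return sum(1 for p in points if p >= threshold) / len(points)

# Hypothetical transcripts: one great year followed by one rotten year,
# versus a solid C student.
uneven = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
steady = [2.0] * 8

for name, grades in [("uneven", uneven), ("steady", steady)]:
    print(name, gpa(grades), share_at_or_above(grades))
```

Both transcripts have a GPA of 2.0, but the proportion of grades at B or better is 50% for the first and 0% for the second, so the proportion separates two students the average treats as identical.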

Q: What tools have you found especially helpful with data analysis and reporting? Pivot tables, logistic regression….

A: The combination of database + pivot table is extremely powerful. Logistic regression is more of a special-purpose tool, designed to predict binary events like attrition. A very useful trick with pivot tables is to redefine scalar quantities (decimal numbers or integers) into a 0 or 1. For example, you could classify students with a 3.0 GPA or above as 1 and the others as 0. When you use this field in the pivot table as data, tell it to average, and display the result as a percentage. It will then read off the percentage of the group (e.g. a demographic) that falls in that classification. When I get a chance, I’ll put more detailed instructions with screenshots on this blog.
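
The same 0-or-1 trick works outside a spreadsheet. The sketch below does by hand what the pivot table does: flag each student, then average the flags within each group. The records, the field names ("group", "gpa"), and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical student records; "group" stands in for any demographic field.
students = [
    {"group": "first-year", "gpa": 3.4},
    {"group": "first-year", "gpa": 2.1},
    {"group": "first-year", "gpa": 3.0},
    {"group": "first-year", "gpa": 2.8},
    {"group": "transfer", "gpa": 3.9},
    {"group": "transfer", "gpa": 3.2},
    {"group": "transfer", "gpa": 2.5},
    {"group": "transfer", "gpa": 3.6},
]

def percent_meeting(records, key, threshold=3.0):
    """Average a 0/1 flag per group, expressed as a percentage."""
    flags = defaultdict(list)
    for record in records:
        # Recode the scalar GPA as 1 (meets threshold) or 0 (does not).
        flags[record[key]].append(1 if record["gpa"] >= threshold else 0)
    return {group: 100 * sum(f) / len(f) for group, f in flags.items()}

print(percent_meeting(students, "group"))
```

Averaging a column of zeros and ones gives the fraction of ones, which is why telling the pivot table to average and display as a percentage yields the share of each group in the classification.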

Q: Share ways you get more useful results by analyzing data categorically rather than looking at means.

A: You can more easily see extremes. How many students perform very well or very poorly? This can get averaged out easily, and hence lost in reports of averages. But it’s individual students we deal with, not average students, so these extremes matter. Another example is to look at assessment data by gender, ethnicity, or classroom performance (grades).

Q: What important details do the averages hide? Do you have any examples in which reporting the minimum and maximum would be more appropriate?

A: Averages hide the composition of the results, obliterating the distribution. Would you rather teach a class with half brilliant students and half remedial, or one with nearly uniform preparation and ability? The average ability (if such a thing were to exist) is the same in both. A prominent figure in the testing industry visited our campus a couple of years ago, and I had the opportunity to review our results from an instrument he’d designed. I brought the sheets with averages on them. He said “I don’t know why we even publish this stuff—where are the distributions?” Unfortunately, it took me another two years to figure out what he meant. Instead of averages, consider defining a cut-off for acceptable/unacceptable. This creates a meaningful reduction in data that can be used simply with pivot tables, for example. Maxes and mins or the whole distribution can be given fairly easily. These are quite informative—usually more so than the average.
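
Here is a quick sketch of the two-classroom example, with invented scores on a 1 to 5 scale and an assumed cut-off of 3 for "acceptable". The function name and the scale are my own choices for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical scores: half remedial and half brilliant, versus uniform.
split_class = [1, 1, 1, 5, 5, 5]
uniform_class = [3, 3, 3, 3, 3, 3]

def summary(scores, cutoff=3):
    """Report the average alongside the figures the average hides."""
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "distribution": dict(sorted(Counter(scores).items())),
        "share_acceptable": sum(s >= cutoff for s in scores) / len(scores),
    }

print(summary(split_class))
print(summary(uniform_class))
```

Both classes average 3, yet the minimum, maximum, distribution, and share meeting the cut-off all differ, which is the information a report of averages throws away.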

Q: How can I present the data in a way that is most useful for faculty for improvement?

A: This may be a question to pose to the faculty, but certainly I would avoid the “lines go up” global graph of data, unless it’s just to provide context. If the report doesn’t connect their conceptual model of what they can change to the assessment results, they’ll simply be puzzled about what to do with it. The best case is if they are the ones generating the data to begin with—then they’ll have a better idea than anyone of what it means in functional terms. You can read how we try to accomplish that on our assessment website. We call it Assessing the Elephant.