Thursday, April 30, 2009

Part Eight: If Testing isn't Measurement, What Is It?

Why Assessment is Hard: [Part 1] [Part 2] [Part 3] [Part 4] [Part 5] [Part 6] [Part 7]

Last time I argued that although we use the word "measurement" for educational outcomes in the same way it's used for weighing a bag of coconuts, it really doesn't mean the same thing. It is a kind of deception to substitute one meaning for another without warning the audience. Of course, this happens all the time in advertising, where "no fat" doesn't really mean no fat (such terms are, however, now defined by the FDA). In education, this verbal blurring of meaning has gotten us into trouble.

Maybe it's simply wishful thinking to imagine that we could have the kind of precise identification of progress in a learner that would correspond to the gradations on a measuring cup. Socrates' simile--that education is kindling a flame, not filling a cup--is apt: learning is primarily qualitative (the rearrangement of neurons and subtle changes in brain chemistry, perhaps) and not quantitative (pouring more learning stuff into the brain bucket). As another comparison, the strength of a chess position during a game is somewhat related to the overall number of pieces a player has, but far more important is the arrangement of those pieces.

The subject of quality versus quantity with regard to measurement deserves a whole discussion by itself, with the key question being how one imposes an order on a combinatorial set. I'll have to pass on that today and come back to it another time.

The sleight of hand that allows us to get away with using "measurement" out of context is probably due to the fluidity with which language works. I like to juxtapose two quotes that illustrate the difference between the language of measurement and normal language.
We say that a sentence is factually significant to any given person, if and only if, [she or] he knows how to verify the proposition which it purports to express—that is, if [she or] he knows what observations would lead [her or him], under certain conditions, to accept the proposition as being true, or reject it as being false. – A. J. Ayer, Language, Truth, and Logic

[T]he meaning of a word is its usage in the language. – L. Wittgenstein
The first quote is a tenet of positivism, which has a scientific outlook. The second is more down-to-earth, corresponding to the way words are used in non-technical settings. I make a big deal out of this distinction in Assessing the Elephant about what I call monological and dialogical definitions. I also wrote a blog post about it here.

Words like "force" can have meanings in both domains. Over time, some common meanings get taken over by more scientific versions. What a "second" means becomes more precise every time physicists invent a more accurate clock. The word "measurement" by now has a meaning that's pretty soundly grounded in the positivist camp. That is, if someone says they measured how much oil is dripping from the bottom of the car, this generates certain expectations--a number and a unit, for example. There is an implied link to the physical universe.

But as we saw last time, the use of "measurement" in learning outcomes doesn't mean that. What exactly are we doing, though, when we assign a number to the results of some evidence of learning? It could be a test or portfolio rating, or whatever. If it's not measurement, what is it?

We can abstract our assessment procedures into some kind of statistical goo if we imagine that the test subject has some intrinsic ability to successfully complete the task at hand, but that this ability is perhaps occluded by noise or error of various sorts. Under the right probabilistic assumptions, we can then imagine that we are estimating this parameter--this ability to ace our assessment task. Typically the assessment will itself be a statistical melange of tasks that have different qualities. An English spelling test, for example, could draw on a staggering variety of words. If there are a million words in the language, then the number of ten-item spelling tests is about
1,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 (that is, 10^60).
So the learning outcomes question "can Stanislav spell?" depends heavily on what test we give him, if that's how we are assessing his ability. Perhaps his "true" ability (the parameter mentioned above) is the average score over all possible tests. Obviously that is somewhat impractical, since his pencil would have to move faster than the speed of light to finish within a lifetime. And this is just a simple spelling test. What are the qualitative possibilities for something complex like "effective writing" or "critical thinking?"
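For the combinatorially curious, here is a toy sketch in Python (the one-million-word vocabulary and the 0.8 per-word success rate are made-up numbers, just for illustration) showing where a figure that size comes from, and why any single short test gives only a noisy estimate of the "true" average score:

    import math
    import random

    VOCAB = 10**6   # assume a one-million-word vocabulary
    ITEMS = 10      # a ten-item spelling test

    # Ordered ten-item tests (repetition allowed): 10^60, the figure above.
    print(f"ordered ten-item tests:  {VOCAB**ITEMS:.2e}")
    # Even unordered tests of ten distinct words are astronomically numerous.
    print(f"distinct ten-word tests: {math.comb(VOCAB, ITEMS):.2e}")

    # Toy model: Stanislav spells any given word correctly with probability 0.8,
    # so his "true" average score is 8 out of 10--but any one test only estimates it.
    random.seed(1)
    scores = [sum(random.random() < 0.8 for _ in range(ITEMS)) for _ in range(5)]
    print("scores on five random ten-item tests:", scores)

Averaging over all the possible tests is exactly the impractical thought experiment described above; sampling a handful of them is what we actually do.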

When we assess, we dip a little statistical ruler into a vast ocean of heaving possibilities, changing constantly as our subject's brain adapts to its environment. Even if we could find the "true" parameter we seek, it would be different tomorrow.

All of this is to say that we should be modest about what we suppose we've learned through our assessments. We are severely limited in the number of qualities (such as combinations of testable items) that we can assess. If we do our job really well, we might have a statistically sound snapshot of one moment in time: a probabilistic estimate of our subject's ability to perform on a general kind of assessment.

If we stick to that approach--a modest probabilistic one--we can claim to be in positivist territory. But the results should be reported as such, in appropriately technical language. What actually happens is that a leap is made over the divide between Ayer and Wittgenstein, and we hear things like "The seniors were measured at 3.4 on critical thinking, whereas the freshmen were at 3.1, so let's break out the bubbly." In reality, the numbers are some kind of statistical parameter estimate of unknown quality, which may or may not have anything to do with what people on the street would call critical thinking.
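To give a feel for what "reported as such, in appropriately technical language" might look like, here is a minimal sketch in Python; the sample sizes and standard deviations are invented, since none are ever supplied in the celebratory version:

    import math

    # Hypothetical rubric results on a 1-5 scale (all figures invented).
    seniors  = {"mean": 3.4, "sd": 0.9, "n": 40}
    freshmen = {"mean": 3.1, "sd": 0.9, "n": 40}

    diff = seniors["mean"] - freshmen["mean"]
    se = math.sqrt(seniors["sd"]**2 / seniors["n"] + freshmen["sd"]**2 / freshmen["n"])

    # Rough 95% interval for the difference in group means (normal approximation).
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"difference = {diff:.2f}, standard error = {se:.2f}")
    print(f"approximate 95% interval: ({lo:.2f}, {hi:.2f})")
    # With these made-up numbers the interval straddles zero, so the "measured"
    # gap could easily be sampling noise rather than a gain in critical thinking.

Even that modest report says nothing about whether the rubric captures critical thinking as ordinarily understood; it only quantifies the sampling noise.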

Note that in this installment I've only dealt with assessment that presents itself as measurement. There are plenty of types of assessment that do not claim to be measurement, and they don't have to live up to the unrealistic expectations inherent in that claim. But there are plenty of outcomes assessments that do claim to be measurements, and they get used in policy as if they really were positivist-style tick marks on a competency ruler. Administrators at the highest levels probably do not have the patience to work through the limits of testing for themselves, and may take the marketing of "education measurement" at face value.

In summary, "measurement" belongs in positivist territory, and most educational outcomes assessments don't live up to that definition. Exacerbating this situation is that "critical thinking" and "effective writing" don't live in the positivist land--they are common expressions with meanings understood by the population at large (with a large degree of fuzziness). Co-opting those words borrows from the Wittgenstein world for basic meaning, and then assigns supposedly precise (Ayer) measurements. This is a rich topic, and I've glossed over some of the complexities. My answer to the question in the title is this: educational assessment is a statistical parameter estimation, but how that parameter corresponds to the physical world is uncertain, and should be interpreted with great caution, especially when using it to make predictions about general abilities.

Tuesday, April 28, 2009

Part Seven: Measurement, Smeasurement

Why Assessment is Hard: [Part one] [Part two] [Part three] [Part four] [Part five] [Part six]

In outcomes assessment we use the word 'measure' as a matter of course. Our task today is to make sense of this language.

Measurement is a word laden with meaning. It means more than assessment or judgment or rating. Consider the following statements.
  • I picked some strawberries today. We measured them to be very tasty!
  • I measured the kids before they went to bed--they were all happy.
  • We went to the art museum, and measured the artists' creativity.
To me this sounds quite odd. On the other hand, we might easily say:
  • I measured the bag of potatoes. It was five pounds.
  • I measured three cups of flour for the bread.
  • When the builder measured the door, he discovered it was crooked.
There is a difference in common language between a subjective, perhaps casual, assessment and a more rigorous one that is objective and verifiable. Objectivity and reliability might be said to be the hallmarks of measurement, but there's a lot more to it than that.

If we say we can measure something, we evoke a certain kind of image--a child's growth over time marked off on the closet wall perhaps. Because we reduce complex information to a single scalar, for convenience we usually choose some standard amount as a reference. We aren't required to create this unit of measurement, but any type of measurement should allow this possibility. Hence we have pounds and inches and so forth.

Despite all the language about measuring learning, there are no units. At least I've never seen any proposed. So I will take it upon myself to do that here: let's agree to call a unit of learning an Aha. So we can speak of Stanislav learning 3 Ahas per semester on average if we want. Of course, we need to define what an Aha actually is. I have come to this backwards, defining a unit without a procedure to measure the phenomenon. What might be the procedure for measuring learning?

Because of the objectivity and reliability criteria for real measurement, things like standardized tests come to mind. Good! We can measure Ahas by standardized test. Of course, these instruments aren't really objective (they are complex things, created by people who are influenced by culture, fad, and so forth) nor truly reliable (you can't test the same student twice, as Heraclitus might say). But if we wave our hands enough, we can imagine those problems away.

Still, there is a substantial problem before us. We can't put all knowledge of everything on this test, so what particular kinds of questions should be on it? We run smack into the question: what kind of learning? Unlike length, of which there is only one type, or weight, or energy, or speed, there are multiple types of learning: learning to read, learning to jump rope, learning to keep quiet in committee meetings so you don't get volunteered for something. If an Aha is to be meaningful, we have to be specific about what kind of learning it is. But each type is different and needs its own unit. We could coordinate the language to paper over this difficulty, just as we have one kind of ounce for liquid and another kind of ounce for weight. But this is not recommended, since it creates the illusion of sameness. Undeterred, we might propose different units for different types of learning: Reading-Aha, Jump-Rope-Aha, Committee-Aha, etc.

How specific do we need to be? Reading, for example, is not really a single skill. I'm no expert, but there are questions about vocabulary, recognition of letters and words (dyslexia might be an eussi), pronunciation, understanding of grammar, and so forth. So reading itself is just a kind of general topic, more or less like height and weight are "physical dimensions." In the same way that it would be silly to average someone's height and weight to produce a "size" unit, we don't want to mix the important dimensions of reading into one fuzzy grab bag and then have the audacity to call this a unit of measure. Where does this devolution stop? What is the bottom level--the basic building block of learning--that we can assign a unit to with confidence?

There may be an answer to that question. If you've read my opinions about assessing thinking on this blog, you'll know I find "critical thinking" too hard to define, and prefer the dichotomy of "analytical/deductive" and "creative/inductive" because those can be defined in a relatively precise (algorithmic) way. A couple of research papers tie electrical brain activity to creative thinking exercises. Science Daily has articles here and here. This is a topic I want to come back to later, but for now consider the point that neurological research may eventually have the ability to distinguish measurable differences in brain activity and potentially provide a physical basis for studying learning.

There are tremendous difficulties with this project, even if there is an identified physical connection. That's because brains are apparently networks of complex interactions, and high-dimensional by nature. It's going to be very hard to squash all those dimensions into one without sacrificing something important.

Note that none of these issues prevents us from talking about learning as if it were a real thing. It's meaningful if I say "Tatianna learned how to checkmate using only a rook and a king." Most of language is not about measurable quantities. We can make very general comparisons without being precise about it. Shakespeare:

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date . . . "Sonnet 18," 1–4

The amazing thing about language is that we converge on meanings without external definitions or units of measure. Meaning seems to evolve so that there is enough correlation between what you understand to be the case and what I understand that we can effectively communicate. This facility is so good that I think we can easily make false logical leaps. I would put it like this:
Normal subjective communication is not an inferior version of some idealized measurement.
We should not assume that just because we can effectively talk about love, understanding, compassion, or learning, that those things can be measured. Failing a real definition of an "atom of learning" and commensurate unit, we shouldn't use the word "measurement." But if learning assessments aren't measurements, what are they? I'll try to tackle that question next time.

Next: Part Eight

The Two Meeting Personalities

A while back, I wrote about the secret life of committees and mentioned an article about why you should never invite seven colleagues to a meeting. Later I mused about a meeting-driven life. There is yet more science on the topic, which I discovered today in Science Daily. The article dates from 2006 and explores the question of whether workplace meetings are a good thing or an evil one. It is based on research led by psychologist Steven G. Rogelberg of the University of North Carolina at Charlotte.

The research team found that in public, most of us decry meetings as time-wasters, but our true feelings may be different:
"When speaking publicly, people generally claim that they hate meetings," said Rogelberg, "but in the surveys you see a different story -- some people's private sentiments are much more positive.
One important factor turns out to be whether or not the meetings are well led. In my experience, that's about half of the picture. The other half is the communication styles of those present. I'm an Act III kind of guy--I'll ask for Act I and Act II if I'm interested, but normally one act is enough. I may not be typical, because there is often far more explanation than I can stand in one meeting. But back to the article's findings.

Apparently there are two kinds of us. Some really don't like meetings because of the sense of not getting things accomplished. The other type actually likes them because they are not burdened by an itching agenda, and they enjoy the social aspects of meetings.
"People who are high in accomplishment striving look at meetings more from the perspective of seeing them as barriers to getting real work done," Rogelberg said. "But the others may view meetings as a way to structure their day or a way to network and socialize. As a result, these people see meetings as a good thing."
As a practical matter, then, perhaps we should all identify ourselves as belonging to one camp or another. Pins or arm patches would do the trick, or little markings on staff rosters. Those who don't like meetings breaking up their day (Type I) could be scheduled accordingly, in blocks, but only when absolutely necessary. The meeting-social types (Type II) could gather frequently in exclusive groups for their sort. Of course these 'talky' committees might not accomplish much. Another Science Daily article notes about workplace groups that:
From the operating room to the executive board room, the benefits of working in teams have long been touted. But a new analysis of 22 years of applied psychological research shows that teams tend to discuss information they already know and that "talkier" teams are less effective.
The authors advise that the remedy is:
[T]eams communicate better when they engage in tasks where they are instructed to come up with a correct, or best, answer rather than a consensual solution.
In other words, don't make it a social event. This would make our Type I committee members less grumpy, at the expense of the Type II's enjoyment. I guess the lesson is that if anyone in the room is enjoying the process, the meeting isn't proceeding efficiently. Perhaps the Type II crowd could benefit from the new drug Despondex, which is designed to take the edge off of people who are too cheerful.

The good news, I suppose, is that groups still function better than individuals, according to this article and this one. Scientific proof that committees are here to stay.

But the Internet has created a whole new kind of group. In a graphic novel, it would be the committee that fell into the vat of toxic waste and woke up with new powers and strange motivations. As a force for good, see the power that is the Mechanical Turk. As a force for the strange, see this article about 4chan's hack of an online TIME poll. Both are examples of the whole being greater than the sum of its parts, on a large scale. Imagine if your whole institution met as a committee and actually got something done. Scary, isn't it?

Part Six: Scylla and Charybdis

Why Assessment is Hard: [Part one] [Part two] [Part three] [Part four] [Part five]

In the last installment of this assessment melodrama, our heroic assessment director (pictured below) was given two Homeric tasks: convince the denizens of the deep (teaching faculty) to integrate assessment into the classroom AND simultaneously create aggregate "measures" of learning for the high priests of wesaysoism (administration and regulatory bodies)*.
We might characterize these two challenges as the micro and the macro. It's hard enough to accomplish them without confusing the two, so I'll say a few more words about the distinction today and then start addressing my promise to talk about what we call "measurement" in assessing outcomes. This idea is almost ubiquitous in the field and pertains to both the macro and micro scales. In either case it can be wildly misused.

Micro Assessment is what I'm calling assessment for learning. The point of the assessment bit is to increase the effectiveness of the classroom, curriculum, and mission of the institution. This may mean coinciding with general education outcomes or something. The point I tried to make in yesterday's harrowing installment was that this effort has to be credible. One of my maxims of life is that people do things for reasons that make sense to them. If it doesn't make sense for the instructor to take the effort to integrate some learning outcomes into assignments, it won't happen. I keep talking about philosophy (in the common, not academic sense) as essential to this project, so it was interesting yesterday to read an article in the Atlantic by a philosopher, Matthew Stewart, who got into the business of management consulting and became quite successful at it. He is nevertheless rather harsh on his adopted profession:

The thing that makes modern management theory so painful to read isn’t usually the dearth of reliable empirical data. It’s that maddening papal infallibility. [...]

Each new fad calls attention to one virtue or another—first it’s efficiency, then quality, next it’s customer satisfaction, then supplier satisfaction, then self-satisfaction, and finally, at some point, it’s efficiency all over again. If it’s reminiscent of the kind of toothless wisdom offered in self-help literature, that’s because management theory is mostly a subgenre of self-help. Which isn’t to say it’s completely useless. But just as most people are able to lead fulfilling lives without consulting Deepak Chopra, most managers can probably spare themselves an education in management theory.

I make no judgments about Mr. Stewart's erstwhile profession (he left eventually), but the account makes a good cautionary tale for the intrepid assessment director: don't approach the task with "papal infallibility," and think more like a coach than a scientist.

Macro Assessment is where the real damage can be done. About the worst that can happen by annoying the faculty with administrivia is creeping irrelevance for the director and a lot of unpleasant meetings. But the macro assessments might actually get used in policy. For example, there's the notion of an exit exam. These can be implemented at different times, and for different purposes. One option is to create a graduation requirement of passing the test, so that it becomes the finger of fate, as it were, for many students. Education Week has an April 27 article on the topic, by Debra Viadero called "Scholars Probe Diverse Effects of Exit Exams." Part of a graphic is reproduced below, showing states that employ this technique in public schools.
More than anything, such testing is about faith. The effects of failing are damaging, not just because of the lost time and effort, but because of the apparent loss of faith in one's own abilities by those who fail, as noticed by independent researchers:

[... Researchers] in a study looking just at students who barely passed or barely failed that state’s exit exam in 10th grade, found that being labeled a failure can have a detrimental effect on low-income students in urban schools.

Even though students have plenty of opportunities to retake the exam—and most do—poor, inner-city students who just missed the passing cutoff in 10th grade are 8 percentage points less likely to graduate on time than demographically similar students who just barely passed, even though both groups scored at roughly the same levels on the 10th grade exam. Failing or passing the tests seems to have no statistically significant effect, though, on the probability of graduation for wealthier, suburban students.

The plausible chain of cause and effect is that a (minority or female) student's failure results in a downgraded self-assessment of his or her own abilities, which in turn causes a real degradation in the ability to pass the test.

What about our own faith in these tests? If we believe in the reliability and validity of the exit exam, we'd have to conclude that simply taking the test and seeing the result was enough to cause students to know less than they did before. There's no real way out of that conclusion if you believe that we are really measuring learning with such processes.

This is just one short example of the subtleties at play and the unexpected consequences of macro-assessment. There is lots more to discuss on the topic. What I'd like to do over the next few installments is begin to peel away the layers of assumptions about this idea of measurement and see what's there. What is it, really, that we put so much faith in? The article concludes with a quote from Thomas S. Dee, a Swarthmore College economist:
“The cynic in me worries that we’re just going to continue to see these policies proliferate, because it seems like an obvious way to convey the expectations that we should have for students,” Mr. Dee said, “and the negative effects appear to be hidden from public discussion.”
As Henry Louis Mencken said, “For every problem there is a solution which is simple, clean and wrong.” This is the second cautionary tale with which to begin to frame the discussion on outcomes measurement. Stay tuned...

Next: Part Seven

*Note that I mean no disrespect for either group, especially since I happen to be in both. But this is how they sometimes speak of each other, so it's useful to highlight this tension.

Sunday, April 26, 2009

Part Five: Creating Do-Gooders

Why Assessment is Hard: [Part one] [Part two] [Part three] [Part four]

Well, this is ironic. I was trying to come up with Plato's quote about men wanting to do good, and only needing to learn what that is. That was supposed to be the foil for some engaging line of thought. What I found was this. It's a service that writes essays for (I presume) college students for $12.95 per page. Their sample happens to be about Plato and Aristotle. Quoting from the essay:
Plato says that once someone understands the good then he or she will do it; he says “...what we desire is always something that is good” (pg.5). We can understand from this that Plato is saying individuals want to do good for themselves; we perform immoral deeds, because we don’t have the understanding of the good.
The existence of the quote where I found it pretty much negates its premise. And here I'd planned to make an argument that resistance to assessment by teachers is caused by them not knowing the Good. Well, I shall forge ahead anyway, and you can play along.

Even when something is clearly Good, it's not obvious that everyone will do it. Otherwise everyone with the means would eat plenty of fruits and veggies every day instead of fried pork rinds for breakfast, or whatever it is that leads to so much heart disease. So there's certainly the issue of how hard it is to do the right thing. But maybe we can take those two parts as necessary conditions to, for example, get teaching faculty to implement all of the beautiful assessment plans that have been cooked up.

First, the Good. If Professor Plum doesn't buy into the idea that this whole outcomes assessment thing is worthwhile, the project can only proceed through wesaysoism, which calls for continual monitoring and tedious administration. Administration is, of course, a hostile word to many faculty, so it's best if the message is delivered from one of their own. If the assessment director isn't in the classroom mixing up his or her own assessment recipes, the project is suspect. But this is only the first step. After all, faculty members have even more crazy ideas than administrators do--it's almost a prerequisite for the job (speaking as one, here).

No, you have to be convincing. There is a certain amount of chicken and proto-chicken here--you really need a program or two that does a good job so that you can prove that the idea can actually be carried out. If you start from zero, then the first priority is to find a spot of fertile ground and begin to cultivate such a program. Plan on this taking years. There are some natural alliances here, in unlikely places perhaps. Art programs already have assessment built in with their crits, as do creative writing programs. Finding a champion or two among the faculty that others respect is key. You can tell who's respected by who gets put on committees.

Unfortunately, the enterprise of assessment is a lot harder in practice than it looks on paper. So having a solid philosophy can help enormously. By this I mean picking your way carefully through the minefield of empirical demands and wesaysoism to find a path the others can follow. If the goals you set demand too much of science, you'll fail, because assessment isn't science. We don't actually measure anything, despite using that language. More on that later. As a result, if "closing the loop" means to you a scientific approach of finding the path to improvement and then acting on it deterministically, it will be like trying to teach a pig to dance: frustrating to you and annoying to the pig.

On the other hand, as we've already noted, relying simply on wesaysoism to get things done means you have to micro-manage every little thing, and the faculty will try to subvert you at every turn. So doing the work of sorting out for yourself how to make a convincing argument, based on cogent principles, is worth it. Read what other people have to say. You might check out Assessing the Elephant for a divergent view. But find something that makes sense.

I find that it helps to separate out thinking about classroom-level assessment and what we might call "big picture" assessment. The former should and can be indistinguishable from pedagogy--the integration of assessment directly with assignments and how stakeholders view them. As an example, we weren't happy with students' public speaking skills, so we started videotaping student seminar presentations and having the students critique their own performances. It's not rocket science. But it wouldn't have happened if we hadn't explicitly identified effective speaking as a goal and thought about what that means. And it seemed like a Good thing to do.

Big-picture assessment is extremely easy to do wrong, in my opinion. I think lots of low-cost subjective ratings are a good approach, but opinions will vary. In any event, don't imagine that it's easy to accumulate what happens in the classroom and make it applicable to the university. It's very difficult. Again, have a solid philosophy to back you up. Otherwise you'll be waving your arms around, trying to distract your audience from the logical holes.

In both micro- and macro-scale assessment, try to feed the faculty's own observations back to them in useful summary form (not averages--they're not much use). They don't respect anyone so much as themselves.

Being Good isn't enough. It also has to be easy-peasy. Lower the cost of assessment to the minimum realistic amount of time and energy. Don't meet if it can be done asynchronously. Use Etherpad instead. More generally, use technology in ways that simplify rather than complicate the process. It's all well and good to have a curriculum mapped out in microscopic detail, with boxes and rubrics for every conceivable contingency. But if no one uses it because it's too complicated, it's moot. A barrier to completion doesn't have to be very high to turn people away. Page load times matter; a few dozen milliseconds of delay can be enough to turn away customers. Too many clicks or poorly designed interfaces shed more users. It shouldn't be a death march to enter data into the system, however you do it.

I won't recommend a commercial system here, because I'm not an expert on them, and I also think you can create your own without that much trouble. You just need one decent programmer and a little patience. Again, philosophy is key to building the thing. Whether your system is paper or electronic or smoke signals, think about what the maximum useful information per effort is. It's easier to start small and grow than the other way around.

As a real example, a portfolio system I built for a client started off as the absolute bare minimum--it was really just a drop box, but with students and class sections conveniently set up automatically. Over time we added assessments to the submissions, customized per program. It would have been too much to start there. Remember the only way to solve opaque problems is an evolutionary approach.

Satisfying all the critics isn't easy. Accreditors want different things than the senior administration does, which will be yet different from what the faculty find useful. For the former, sad to say, rather meaningless line graphs showing positive slopes for some kind of outcome are usually enough. This is the kind of thing that looks good but probably doesn't mean much. So there is always the temptation to simply play the game of meeting those (very minimal) expectations. Don't do that, or you'll find yourself wondering why you didn't choose exotic plumbing as a career instead.

Convince the faculty, and everything else Good will follow. It's easy to make pretty graphs. It's much harder to lead an ongoing conversation that your very intelligent colleagues find convincing and insightful. And if you find yourself in trouble, you can always show them the website selling term papers. That should be good for an hour's distraction at least. Meanwhile you can slip out the back and work on your oratory or figure out a way to shave half a second off of page refresh times.

Next: Part Six

New Under the Silicon Sun

In the old days the computer aisle at the bookstore used to be full of old friends. There were books on MS-DOS, DR-DOS, Borland C++, and other titles I could recognize. The trick was to figure out the best book for the money. Today that's much easier with the likes of Amazon's rated comments, but the selection of subjects boggles the mind. I don't even understand what the majority of the books are about, judging from their titles. I'm trying to think back to when this happened, this explosion in the variety of ways to use the computer. Certainly things became unmanageable after the Internet exploded. There's always something new around the corner.

Too new to actually be here yet is Stephen Wolfram's Alpha, which is supposed to launch in May of this year. It's like a search engine, but with language processing and computation built in. This is not to say that you can't do computations in Google. See below, as it exponentiates i (square root of minus one) times pi to get the correct answer.
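That calculation is just Euler's identity, e^(i*pi) = -1, which any language with complex arithmetic will confirm; a two-line check in Python, purely for illustration:

    import cmath

    # e^(i*pi) is -1, up to a tiny floating-point imaginary part.
    print(cmath.exp(1j * cmath.pi))   # (-1+1.2246467991473532e-16j)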
But if you google "son of a grandson," Google won't draw a family tree for you. Supposedly, this is the sort of thing that five million lines of Mathematica running a search engine get for you, according to a review of Alpha on ReadWriteWeb.

One complaint about search engines is that it takes a critical eye to wade through the results to find pertinent information. Alpha is supposed to be smarter about the kinds of results it returns, especially for quantitative information. This is a specialization, and is unlikely to supplant the likes of Google or Yahoo for many of the kinds of searches frequently done. What's the difference? According to TechCrunch's review:
Basically it means that you can ask it factual questions and it computes answers for you.

It doesn’t simply return documents that (might) contain the answers, like Google does, and it isn’t just a giant database of knowledge, like the Wikipedia. It doesn’t simply parse natural language and then use that to retrieve documents, like Powerset, for example. Instead, Wolfram Alpha actually computes the answers to a wide range of questions — like questions that have factual answers such as “What country is Timbuktu in?” or “How many protons are in a hydrogen atom?” or “What is the average rainfall in Seattle?”

In order to do this, Alpha has to have all kinds of models in the background. In some sense, understanding has to be programmed in, to know that a proton is a component of an atom, for example.

There's probably a downside. Instead of just dealing with cut and pasted text in student papers, now there will be flawless pie charts and ratios too, I suppose, all a product of unthinking clicking around. It seems to me that this progression makes information literacy training more and more important.



Friday, April 24, 2009

The Future of Portfolios

A few weeks back in the public library I picked up a novel by Vernor Vinge called A Fire Upon the Deep, which may be the best sci-fi novel I've ever read. Soon after, I found myself reading his novel Rainbows End, about a near future where technology has become truly ubiquitous--one wears his or her computing environment, which overlays the 'real world' with graphics displayed on contact lenses to create new realities. Part of the story takes place in a classroom, and it's fascinating to see how Vinge (who is a mathematician and computer scientist) imagines the new technologies impacting pedagogy. This plays out as students work in collaboration to flex their creative muscles using networking, computing, physical "black box component" machines, and software tools. Except for the sophistication of the wearable computers, all of this is possible in some form today.

Portfolios can be accumulators for this sort of work. There are commercial systems available as well as open source ones like OSP (open source portfolio, literally enough), which integrates with the online learning system Sakai. The OSP home page advertises that portfolio owners have access to:

  1. tools to collect items that best represent their accomplishments, their learning, or their work;
  2. tools to reflect upon these items and their connections;
  3. tools to design a portfolio that showcases the best selections of this work;
  4. and tools to publish the portfolio to designated audiences.
The use of eportfolios has been championed by Trudy Banta at IUPUI, which has developed its own custom one in cooperation with OSP. I saw an early version at the IUPUI Assessment Institute some years ago. At that time there was a matrix into which students plugged artifacts that demonstrated their accomplishments in the general education learning outcomes set by the university.

Advantages of eportfolios are obvious. Years ago I was on a committee to review the old-fashioned kind--manila folders bursting with papers that supposedly documented student accomplishments in language and numeracy. The stakes were high--if a student failed to pass this review, his or her graduation could be held up. In practice, the reliability and validity of this review were very doubtful, and the record-keeping terribly time-consuming and imperfect. There was little transparency throughout the process, and it ultimately was abandoned. A few years later, as chair of the Institutional Effectiveness Committee, the accreditation process presented me with the problem of assessing our Quality Enhancement Plan, which addressed student writing. This time, we built an eportfolio. I used the ASAP approach (as simple as possible), so as to minimize support problems and maximize usage. It takes very little in the way of barriers to keep users away. In this, we were very successful--we gathered oodles of writing samples without having to try hard at all. I called the thing {iceBox} because that's what my grandmother always called her refrigerator, and it tickled me to give new meaning to the term.

Assessment was harder than collection. We tried what seemed like the obvious approach, creating a rubric for writing, assembling a sample of portfolios (each was a selection of three writing samples, so it was a "portfolio" only in a general sense), and assembling a committee to rate them. After spending a lot of effort, we concluded that this approach was a good example of an idea that seems obviously correct until you actually try it. It's much better to tie assessment very closely with coursework--that is, leave it centered in the classroom where it can have an immediate impact and has authenticity.

There are two lessons from my experience. First, software that does one thing simply and well is preferred over general tools that try to do everything. In Vinge's novel, this is analogous to the black-box components that fit together like Legos. For two excellent examples of this kind of thing, see my post on collaborative software. There are many, many components on the web that can be used to assemble portfolio-type materials of all kinds, from music to language to geography to graphics to publishing and beyond. Moreover, they are evolving all the time. A hyperlink can take you to an audio file of a speech, a movie, or a location in Second Life, to name three of a virtually unlimited number of species of potential portfolio artifacts.

The early cars looked like carriages, because that was an obvious transition from horse to no horse. Similarly, our first generation of eportfolios tries to be a big electronic manila folder. This too will evolve, I'm sure, because what's called for is not the design and publish components (3 and 4 in the OSP list), but the means to link together artifacts that can live anywhere in cyberspace. A student's portfolio per se is just the connections between presentations, and could consist of a single hyperlink to a blog as an entry point.

The second lesson I learned relates to the second requirement of a portfolio, if it is to be used for educational purposes. Assessment documentation has to get done in a way that makes sense; there are a lot of ways to do it wrong. Assessments need to be pertinent to ongoing educational processes in real-time, need to be authentic (related to class work), and the burden of creating these assessments needs to be minimal if they are to get done at all. That's a tall order.

Yesterday I mused about implicit rules in the academy. The Center for Teaching, Learning, & Technology (TLT) at Washington State University has produced a very interesting spectrum of assessment focus. You can download it here as a pdf, and I've reproduced the first line below as a sample.

Working through this sheet is like taking an inventory of implicit rules and assumptions concerning assessment practices. The third item, "Expert consensus from the community of practice validates the assessment instrument," sounds like the primary basis for the general education assessment we had the most success with.

This idea is related to portfolio assessment by WSU's TLT Center through the notion of a "Harvesting Gradebook," which I mentioned briefly yesterday as a kind of disruptive technology--in this case, an idea that challenges the implicit rules of how grades and grading work. On their blog, Nils Peterson writes that
As originally articulated by Gary [Brown], the gradebook “harvested” student work, storing copies of the work within itself where it was assessed.

On further discussion, the concept became inverted, what was “harvested” were assessments, from work that remained in-situ.

The key here is that the components of the portfolio can live anywhere on the web. The piece that brings them all together is the assessment. This fits perfectly with the evolution from "horse-drawn" portfolios to an ASAP component model. Do the assessment part really well, in other words, and don't try to recreate a music-editing program or web-based word processor inside your eportfolio software. A sample assignment contains notes about how to write a blog. My understanding is that it doesn't specify where or what software to use, but what the content should look like. This separates the means of creation from the assessment of the product, which simplifies and enriches the portfolio. I recommend taking a look at the sample survey here, which permits feedback from various types of audience with rubrics and comment boxes.
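To make the harvested-assessment idea concrete, here is a bare-bones sketch of the data model as I read it (plain Python, entirely my own invention--this is not WSU's implementation): the work stays wherever it lives on the web, and the portfolio keeps only links plus the rubric feedback attached to them.

    # Entirely hypothetical structure: each artifact is a URL plus harvested ratings.
    portfolio = [
        {
            "url": "https://example.edu/blog/essay-1",   # the work itself stays in-situ
            "assessments": [
                {"rater": "peer",       "criterion": "organization",    "score": 4},
                {"rater": "instructor", "criterion": "use of evidence", "score": 3,
                 "comment": "cite sources"},
            ],
        },
    ]

    for artifact in portfolio:
        scores = [(a["criterion"], a["score"]) for a in artifact["assessments"]]
        print(artifact["url"], scores)

The design choice is the same one argued above: don't rebuild the tools for making the work; just record where it is and what the raters said about it.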

There is a lot of material about this project on the WSU/TLT blog, and I look forward to learning more about it. It's interesting that for all the verbiage in the assessment community about grades being lousy assessments, this is the first time I've actually seen a proposal that would radically transform the practice of grading. I hope the registrar has a defibrillator in the office.

Thursday, April 23, 2009

Rules, Damned Rules, and Policy

My daughter has this thing about wanting pets. Because we aren't well situated to play host family to them (the gerbils were a disaster on our first attempt), I try various ruses to change the subject. When she was younger I latched onto the idea of using a laser pointer as a dog substitute (named Spot, inevitably, although Red was a contender). We took it for a lot of walks, always after dark, and watched proudly as it visited all the trees on the block.

Yesterday it was plants. Pet plants need minimal care, I figure. So we went to Home Despot to look. I particularly wanted rosemary, to replace the nice bush we had at the old house. We bought some 'pet' vegetables, but there was no rosemary, so we walked down to Kmort to look. Their outdoor section looked much like a concentration camp, and I'm sure if I spoke plant-ese I'd have been moved to tears by their pleas. My daughter wanted to ask the salespeople if she was allowed to water the plants, but there was no one to ask.

None of this has anything to do with rules, in case you're wondering. That connection came next, when I noticed a gaming store next to Kmort. In graduate school I co-authored a board game of sorts with a friend, so I wanted to peek in and wallow in the ambiance for a bit. My almost-teen daughter was properly horrified, which added to the attraction.

The place was filled with gamers at tables piled with miniatures from a dizzying variety of genres. There were sci-fi tableaux, with someone asking how far pulse rifles could shoot, fantasy sorts of things I couldn't recognize, and historical battles with box-like formations of hand-painted troops. Around the walls were stocked the complicated rule books I remembered.

You've probably figured out by this point that I played a lot of geeky games as a teen, with rule books that resemble the 1040 tax instruction booklet. The rules got increasingly more complex as time went on, until half the games seemed to consist of searching for the right sub-section with the table on the chance of successfully napkin-folding or whatever. For me, there reached a point where I didn't find it enjoyable anymore. That's when I started coding up rules in Applesoft, to use the computer to keep track of the complicated bits, and truly descended into full-fledged geekdom, from which I never really emerged.

This is all by way of introduction to "rules and the academy." (I hope this one worked better than sock muffins did.) There are places where rules are absolutely essential. You can identify them by the lack of thinking required of the tasked staffer. Storing backup tapes somewhere safe, following procedure with regard to transcripts, keeping track of financial accounts properly, and so on, are good examples. The lower the complexity, the more suitable a process is for rule-making. At the other end of the spectrum lie general responsibilities like "being president," which is too fuzzy to be described in a President's Operating Manual, or something.

If you like rules, stop by human resources. Here, as in other areas like IT, rules can be used to simply block things that staff don't want to do and generally accumulate power and influence. I read a (possibly apocryphal, but plausible) account of a man interviewing for a job, who was asked by the HR interviewer for the phone number of his previous employer, so as to verify the information on his resume. I can't find the original now, but it went something like this:
"I need to call your previous employer, Mr. Snark."
"Well, I was self-employed, so that would be me."
"Fine. What's the phone number?"
Mr. Snark, bemused, gives the number and watches the numbers being dialed. He pulls out his cell phone and answers on the first ring.
"Mr. Snark?"
"Yes, that's me."
"I need to verify some information about a previous employee."
In this Kafka-esque drama, the interview plays out in full, after which Mr. Snark is given the explanation that rules, after all, must be followed.

It's debated whether or not evolution produces more complexity in living things. I think it's probably true that complexity is more valuable in some situations than others. Like a string in a drawer, systems seem to bow to entropy almost immediately and become more complicated without much effort. Straightening them out is hard. As Machiavelli put it:
It must be considered that there is nothing more difficult to carry out nor more doubtful of success nor more dangerous to handle than to initiate a new order of things; for the reformer has enemies in all those who profit by the old order, and only lukewarm defenders in all those who would profit by the new order; this lukewarmness arising partly from the incredulity of mankind who does not truly believe in anything new until they actually have experience of it.
Whereas complexity naturally emerges in the form of additional rules, in order to properly simplify the resultant mess, the reformer has to overcome the natural resistance of those who benefit from the complications. Think of the US tax code.

So, like biological bodies renewing themselves through reproduction after entropy has corrupted them, it's a healthy process to change administrations and shake things up once in a while. It is perhaps the case that the most insidious rules are not affected by this, however. They may be invisible.

At least too-complex rules can be seen. There are many quasi-rules that are merely implied. Dress codes often are, and are enforced through social conventions, although explicit ones aren't uncommon. More dangerous, I think, to the mission of the academy are the implied rules that pertain to learning. Here are a few. You can add to the list.
  • Teaching only takes place in formal sessions
  • Learning is not as important as ratings like grades
  • Education proceeds by check marks on a sheet
  • Education is a service one pays for, just like having your car washed
  • Learning experiences can be made uniform, like an assembly line
  • Student collaboration, unless explicitly allowed, is cheating
The bureaucracy of higher education is necessary to organizing the massive endeavor, no doubt. But leaving unexamined the implications of these practices blinds us to some pernicious effects. Do we really have a right to complain if students see courses as milestones to be passed on a linear journey--points of momentary interest that can be forgotten? Doesn't the very structure of the process from advising through transcripts encourage that point of view?

In order to gain perspective, a complete rethink is in order. I was impressed recently with an idea by Gary Brown at Washington State University about a way to redefine the hoary old idea of a gradebook. No, I don't mean moving it to Excel. You can read more about this "harvesting gradebook" idea on the blog Center for Teaching, Learning & Technology. The authors of this project seem to be questioning what exactly grading is--a review of the associated implicit rules, as it were. It will be interesting to see where it leads. This is an example of the creative disruption that is called for in order to reach more than a superficial review of the stew of formal and informal complexity that comprise the academy.

Update: In the pursuit of sensible database policies this morning, I found myself wandering through the wilds of the FERPA rules, and discovered this gem [pdf].

Under FERPA a school may not disclose a student’s grades to another student without the prior written consent of the parent or eligible student. “Peer-grading” is a common educational practice in which teachers require students to exchange homework assignments, tests, and other papers, grade one another’s work, and then either call out the grade or turn in the work to the teacher for recordation. Even though peer-grading results in students finding out each other’s grades, the U.S. Supreme Court in 2002 issued a narrow holding in Owasso that this practice does not violate FERPA because grades on students’ papers are not “maintained” under the definition of “education records” and, therefore, would not be covered under FERPA at least until the teacher has collected and recorded them in the teacher’s grade book, a decision consistent with the Department’s longstanding position on peer-grading. The Court rejected assertions that students were “parties acting for” an institution when they scored each other’s work and that the student papers were, at that stage, “maintained” within the meaning of FERPA. Among other considerations, the Court expressed doubt that Congress intended to intervene in such a drastic fashion with traditional State functions or that the “federal power would exercise minute control over specific teaching methods and instructional dynamics in classrooms throughout the country.” The final regulations create a new exception to the definition of “education records” that excludes grades on peer-graded papers before they are collected and recorded by a teacher. This change clarifies that peer-grading does not violate FERPA.

Exceptions are the hallmark of complexity. If trivial exceptions can't be dealt with by simple common-sense methods, you're stuck with arguing trivialities at the highest, most formal level of adjudication. This is a recipe for entropy-induced "heat death," as it's called when one speaks of the end of time.

Wednesday, April 22, 2009

Discounting the Recession

Has the demand for the products of higher education risen or dropped during the recession? There are suggestions from some quarters that shoppers are more discerning, and no longer take for granted that an education financed by loans is worth the cost (see here for example).

The Lawlor Group, an educational consulting outfit with a snazzy website, advises in their recent publication "When Market Conditions and Public Perception Collide," by Amy Foster:

Forget what you thought you knew about supply and demand.

As for supply, the article shows a map of the United States with the projected change in high school graduates over the next decade. The picture varies dramatically from state to state. A section showing the Southeast is reproduced below.

In opposition to the numbers of college-seekers is the price of higher education, which as we know from unrelenting media coverage has been on the rise: 4.6% in recent years, compared to the consumer price index of 2.2% per annum. Much of the cost is deferred:
The amount of loan debt carried by the typical graduating senior has more than doubled over the past decade, from $9,250 to $19,200—that represents a 58 percent increase after adjusting for inflation. (pg. 11)
The debt load has been exacerbated, the author argues, because of low savings rates and stagnant earnings.
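As an aside, the figures in that quotation let you back out the inflation adjustment being used. A quick sketch in Python, using only the numbers quoted above:

    old_debt, new_debt = 9_250, 19_200   # typical senior's loan debt, per the quote
    real_increase = 0.58                 # "58 percent increase after adjusting for inflation"

    nominal_factor = new_debt / old_debt                      # about 2.08: "more than doubled"
    implied_inflation = nominal_factor / (1 + real_increase) - 1
    print(f"nominal increase: {nominal_factor - 1:.0%}")                              # about 108%
    print(f"implied cumulative inflation over the decade: {implied_inflation:.0%}")   # about 31%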

Financial aid policies make no sense in many cases. There is an over-emphasis on merit aid over need-based aid:
[M]erit-based aid is more efficient in attracting high-achieving students—the type of student body that can make a college appear more elite in national rankings like U.S. News & World Report’s America’s Best Colleges. (pg. 17)
If you follow the research reports in Postsecondary Education Opportunity you won't be surprised by this--it's been a trend for a long time for universities to bid up the price of the "best and brightest" and perversely throw more and more money at those who can best afford college to begin with. Socio-economic status correlates with attractiveness in the admissions funnel.

I've argued that low-SAT students are disproportionately under-priced because of this effect, and that using indicators like non-cognitive variables, we can find excellent students, build a diverse student body, and feel a lot better about how our institutional aid budget is spent.

With this as background, some numbers from Noel-Levitz are very interesting. In their "2009 Discounting Report" they analyze the financial aid strategies and outcomes for 121 private colleges that partner with the company. The data set is from 2007-08. Among the highlights, they note that:
  • Tuition rose 6% on average
  • Unfunded aid increased by about 10%
  • More aid is going to meet need
  • Discount rates increased 1%
  • Freshman enrollment increased 2.7%
  • Net revenue increased 8% !!!
Here, Net Revenue = Enrollment * Tuition * (1 - Discount rate). The net revenue per student is the real cost passed on to the student, on average. Charts in the paper show that while discount rates have remained largely stable at about 33%, the actual cost to students has risen dramatically, from $13,065 in 1999 to $19,660 in 2008, an annualized increase of 4.17%. The enduring strategy seems to be to raise tuition about 6%, give part of the increase back as aid, and come away with roughly a 4% rise in net cost per student.
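A back-of-the-envelope version of that arithmetic, as a sketch in Python (the starting tuition figure is arbitrary; the 6%, the 33-34% discount rates, and the per-student dollar amounts come from the report as summarized above):

    # Net revenue per student = tuition * (1 - discount rate), per the formula above.
    tuition = 30_000                      # arbitrary starting sticker price
    net_before = tuition * (1 - 0.33)

    # Raise the sticker price 6% while holding the discount rate at 33%:
    net_flat = (tuition * 1.06) * (1 - 0.33)
    print(f"net increase, discount rate flat:       {net_flat / net_before - 1:.1%}")   # 6.0%

    # Let the discount rate creep up a point (33% -> 34%), as the report observed:
    net_creep = (tuition * 1.06) * (1 - 0.34)
    print(f"net increase, discount rate up 1 point: {net_creep / net_before - 1:.1%}")  # about 4.4%

    # Sanity check on the reported trend in net cost per student ($13,065 to $19,660):
    print(f"annualized growth over the decade: {(19_660 / 13_065) ** 0.1 - 1:.2%}")     # 4.17%

In other words, the roughly 4% figure falls out of a 6% sticker increase combined with a slowly creeping discount rate.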

What is not spelled out in the article, but what a careful reader can discern from the juxtaposition of the two sources for this post, is that low-income students subsidize higher-income students. This is because the former don't get as much merit aid, owing to lower SAT scores and other common predictors of success. They may get more need-based aid, but those sources have not kept up with the rising real cost of education. So they take out loans so that (in effect) colleges can bid up the price of the most attractive students and secure a place in the US News rankings. This is a bit cynical perhaps, but not far from the truth.

This strategy is unsustainable. With the factors at play in the economy, this rate of cost increase will inevitably falter. Net revenue increases will have to shift from cost per student increases to enrollment increases. Institutions with excess capacity should benefit from this, if they take advantage of it by tailoring aid packages to meet need at an appropriate level, and by ignoring what US News has to say about their SAT scores.

Tuesday, April 21, 2009

Sock Muffins

Yesterday, as I was racing to get ready for the commute, I slipped on my shoes and jacket, clipped on the blackberry, grabbed the briefcase, and headed downstairs. I noticed about five hours later that I'd put on two different kinds of loafers, one brown and the other kind of reddish. But I digress. When I got downstairs, I made my second cup of coffee and waited for my wife. When she appeared, her eyes immediately went to the counter under the cupboard for the plates and glasses. There, lying in front of the mixer, were the dress socks I'd worn to the gala along with that rather ill-fitting rented tux. What the heck were they doing there? I knew it was no good to blame the sock fairies, because we'd had the house sprayed for them.

"What are these?" she asked? Meaning that she knew what they were--she just wanted to make sure I knew they didn't belong on the counter. I had to think fast.

"Oh, those are for the sock muffins," I said as casually as I could.

"Sock muffins?"

"Yeah. You pack dough in the socks, then squeeze them into little balls like making sausage links. Bake 'em right in the sock, and then cut it off afterwards. You know...sock muffins."

There was a long pause. She's German, and I could almost see her thinking that this might just be some crazy American thing. Then she asked, "Well, aren't these new socks?"
I'm afraid my ruse failed at that point because I started laughing. I'd make a lousy spy. The moral of the story is, if you do enough crazy things often enough, people will believe you're capable of anything.

At this point, what is called for is a pithy segue. If today's profundities included passing accreditation, this would be relatively easy, but I promised last time to look at this study on expectations vs. reality at elite institutions. I'm not clever enough to figure out how to go from sock muffins to predicting grades, so you'll have to use your imagination.

I missed a point last time: the predicted grades in the study were self-predictions made by students before taking classes. The biggest gap in this expectations game was for minorities, particularly Black students.

The article itself is written in economics-speak, which for a math guy is a little like a German reading Dutch. It's not something you can sit down and read casually. It took me a lot of flipping back and forth (during a long meeting) to determine that the expression in question just means: multiply the number of applicants of a race by the percentage of that population that lies above some cutoff (T*) in benefit. This gives the number enrolled in that category. The formula appears throughout the paper. I think I'd have been tempted to condense it to B(Tmin) and W(Tmin) to save a few eyeballs, but never mind.
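For concreteness, here is the same calculation as a toy Python sketch. The function, the normal distribution, and the numbers are all my own stand-ins, not the paper's model or notation.

from statistics import NormalDist

def enrolled(applicants, benefit_mean, benefit_sd, t_star):
    """Applicants times Pr(benefit > T*), assuming a normal benefit distribution."""
    share_above = 1 - NormalDist(benefit_mean, benefit_sd).cdf(t_star)
    return applicants * share_above

# Entirely made-up numbers, just to show the shape of the calculation:
print(enrolled(applicants=2000, benefit_mean=0.0, benefit_sd=1.0, t_star=1.0))
# about 317, i.e. the roughly 16% of a standard normal that lies above 1.0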

The comments on the article at insidehighered are interesting and worth browsing if you're interested in this topic. The idea of mismatch is clearly controversial. My interest in the article was more about getting insight into Duke's admissions process. But I will say that it seems obvious that providing more transparency about the likelihood of success is good for everyone. In my attrition studies (see here for example), it's obvious that a mismatch in expectations for new students is very hard to recover from. This has implications from marketing through the first year experience and beyond.

Duke uses variables gathered from the admissions process that in this report are called Achievement, Curriculum, Essay, Personal Qualities, Recommendations, Test Scores, high school GPA, and SAT. Note that at least two of these (achievement and personal qualities, and probably recommendations as well) speak to non-cognitive traits. How well does it work? From page 17 of the report:
Notice from Column 3 that controlling for Duke’s rankings increase the R^2 by more than 0.12, again suggesting substantial Duke private information. Note that this still leaves two-thirds of the variation in GPA unexplained, perhaps due to course selection and shocks to how students respond to college life.
So the 'private information,' which would not include things like high school GPA and SAT scores (since those are known to the student), accounts for about 12% of the variance in predicting first-year college grade averages. The two-thirds unexplained variance in those averages is about typical in my experience. I think this is the real story, and the real opportunity.
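To make the R^2 bookkeeping concrete, here is a small Python sketch of the comparison the quote describes: fit a grade predictor with and without the extra ratings and compare the variance explained. The data and variable names below are synthetic inventions of mine; only the procedure mirrors the paper.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
hs_gpa = rng.normal(3.3, 0.4, n)        # predictors the student already knows
sat = rng.normal(1300, 150, n)
ratings = rng.normal(0, 1, n)           # stand-in for the "private" admissions ratings
noise = rng.normal(0, 0.5, n)           # course selection, shocks to college life, etc.
college_gpa = 0.5 * hs_gpa + 0.001 * sat + 0.25 * ratings + noise

def r_squared(X, y):
    """Share of variance in y explained by a least-squares fit on X (with intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

base = r_squared(np.column_stack([hs_gpa, sat]), college_gpa)
full = r_squared(np.column_stack([hs_gpa, sat, ratings]), college_gpa)
print(base, full, full - base)          # the last number is the incremental R^2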

In the insidehighered article, one anonymous commenter writes:
Experience at my institution is that students from lousy high schools do not do well in their first year of college, but, if they're bright and willing, eventually catch on to what university-level work is all about. They also find what they want to study, and this adds to their motivation. So focusing solely on the grades in the first year of college vastly overstates the effect being studied.
This point is well taken. A better indicator of success is perhaps first year retention. If a student fails to return, that represents a failure in real terms: investment of time and money that cannot be fully recouped even if courses transfer. Our predictive powers are not stellar in that regard either.

Time to make the commute again. Guess I'd better go look for my socks.

Monday, April 20, 2009

Commodification of Education

These musings were sparked by an idea I had after reading three very different articles. The common theme is that large industries are under pressure to turn their products into commodities, implying mass production and uniformity of output. We saw this with the Spellings administration of the Department of Education, which sought a uniform measure to compare learning across institutions. We see it in the misplaced trust in uniform scales like the standardized tests commonly used for admissions (it's hard to justify the expense and attention, given how little of the variance in actual performance these tests explain).

First up are textbook factories. Edutopia, the George Lucas Educational Foundation's publication, has an interesting article, "A Textbook Example of What's Wrong with Education," by Tamim Ansary, who has been involved in creating textbooks for primary and secondary schools. He describes the preconception he held before taking his first editorial job as:
[...] filled with the idealistic belief that I'd be working with equally idealistic authors to create books that would excite teachers and fill young minds with Big Ideas.
He was quickly disillusioned. The book was almost finished, and the publisher hadn't signed an author yet. He describes how "basals" are actually created. These comprehensive collections of materials are very expensive and very cookbook, kowtowing to Texas's requirements more than to any other state's. To differentiate texts, the publishers must find the hot new philosophy dripping out of the university system's holding tank of bubbling theory. This is the magical ingredient that can create a big hit, or conversely cause a company to suffer through years of lost revenue if it guesses wrong. These decisions are now made by a handful of large corporations, only one of which resides in the US (McGraw-Hill).

"We pretend to teach 'em, they pretend to learn" is the title of the second piece, by Margaret Wente at theglobeandmail.com. It's an opinion piece about higher education in Canada based on an interview with an anonymous university professor, but it echoes complaints heard anywhere in North America, I wager. The claim is that a large percentage of students are not ready for university, and that they'd be better off in some other educational setting, but that the public equates university with education. That there's a stigma associated with other choices, presumably like two-year institutions. The author pins some of the blame on a cyncal practice of graduating students from high school who shouldn't graduate, just so the rates can improve. This is a one-size-fits-all solution that hollers 'commodity.' Students are all fed into the same hopper, with the same expectations out the other end.

I've heard others make the astute observation that this would be laughable as, say, a philosophy for setting a football roster. Obviously some players have more talent and work harder, are larger and stronger, and have more experience. Some will be more successful than others. They have specialties with concomitant strengths and weaknesses. A successful coach wouldn't dream of taking all comers and trying to turn them into generic 'players' with a common set of entrance requirements and a common set of exit requirements. It obviously makes no sense. It makes no sense in education either.

The third piece is harder research: "Does Affirmative Action Lead to Mismatch[?]", an insidehighered.com article by Scott Jaschik. Grist for the mill is a study [abstract] from the National Bureau of Economic Research (which sounds like a government agency, but isn't) about Duke University's admissions policies. At issue is the low performance of Black and Latino students. Duke should be applauded for the transparency here--they provide a generous helping of data stew. Summary tables show that Black students, in particular, are poorer and have lower predicted and actual first-year grade averages. The predicted scores are only .07 grade points lower than those of white students, but the actual ones are .43 grade points lower. The question being asked is to what extent this is due to a 'mismatch' in ability relative to the performance expected by the school.

I spent the $5 to buy the full research article, to see what their methods were. I was looking for a regression model that used the variables listed in the summary table of the insidehighered article. There are sub-scores given for the admissions decision: Achievement, Curriculum, Essay, Personal qualities, Recommendations, Test scores, and SAT. This is interesting because of the insight into Duke's admissions requirements, and because of my own interest in using non-cognitive variables for that purpose. The bigger picture here is the degree to which prediction, education, and our expectations should be homogeneous. Affirmative action is inconvenient to a commodified industry precisely because it makes exceptions. Therefore, expect pressure to remove or subvert such policies.

The article itself is quite mathy, and will take a day or two to digest. It has, among other things, some kind of maximization problem in it. There is some regression data used to build a grade point predictor, which is what I was looking for, but I can't do it justice in this post. Stay tuned for more tomorrow on this fascinating topic.

Data compression is attractive. The reducing assumption that students are the same, that they have similar needs, that our policies and actions are 'fairer' if everyone is treated identically, is a convenient simplification, and it fits the image of mass production and commodification well. It does not, however, sit comfortably with the reality that anyone who's ever raised a child knows: they're all different. This sounds to my own ears a bit too much like a straw-man argument (arguing demagogy-style against a hypothetical and ridiculous position), but I think there genuinely is a case to be made that higher education is too much like an industry and not enough like a parent, to continue the analogy.

Saturday, April 18, 2009

Practical Matters

I'll take a break from the heavy philosophy of assessment stuff, and write about two little tricks that make my life a bit easier in the hopes someone else may benefit as well.

The first applies to anyone who still grades the old-fashioned way and ends up subtracting from 100% a lot. It's generally a pain to subtract from 100 because of the 'borrowing' algorithm that we're taught in this country. In Germany and other countries they use a 9's complement approach that I'll show you here. I'll apply it specifically to the 100 problem, but it works for any situation. It's embarrassingly simple:
Because 100 = 99 + 1, subtracting from 100 is the same as subtracting from 99 and then adding one.
This makes a big difference because you never have to 'borrow' from 99. Moreover, the subtraction is easy-peasy because each digit subtraction is symmetrical: 9 - 4 = 5 and so 9 - 5 = 4. It's all very painless.
Example: If you add up a student's points off and get 27, you can 'flip' 27 around 99, and get 72. Add one and you're done: 73% correct.
Example: 100 - 59 would be 99 - 59 = 40, 40 + 1 = 41.
I know it's not really that difficult to subtract from 100 to begin with, but this makes the process almost fun, and after grading dozens of papers, anything that makes my head hurt a little less is welcome.
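For anyone who likes to see a trick written down mechanically, here is the same idea as a few lines of Python (my own sketch, for a 100-point scale): flip each digit of the points-off around 9, then add one.

def score_from_points_off(points_off):
    """100 - points_off, computed as (99 - points_off) + 1, with no borrowing."""
    tens, ones = divmod(points_off, 10)
    flipped = (9 - tens) * 10 + (9 - ones)   # digit-wise complement against 99
    return flipped + 1

print(score_from_points_off(27))   # 73
print(score_from_points_off(59))   # 41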

The second practical matter is a simple system I've evolved for organizing projects. I have a lot of projects going on simultaneously in all kinds of domains. There are lots of sophisticated project managers you can find online, and I've tried out many of them. I finally realized, however, that the amount of information I need to document about most projects is pretty minimal. The most important bit of information is simply recording that the project exists. If management runs through some complex system, there will always be things that fall off the table because someone didn't take the time to enter the information. Secondly, a simple green/yellow/red status flag is sufficient for most things. (Use blue instead of green if you have colorblind users.) Finally, a hyperlink to a file or other resource to document details of the project is desirable. I find everything I need with the combination of a mindmeister mind map and etherpad documents. Both are free for the basics, which is enough for me currently. A section of my project tree looks like this:
[Screenshot: a section of the project mind map.] The arrows on nodes are hyperlinks to resources. You can make the whole thing editable by others. Etherpad docs, by design, are very open and perfect for collaboration. I don't feel like I'm creating more work for myself using this combination of tools--it's very natural. For bigger projects, a more robust tool will probably be useful, but for keeping track of a large set of small and medium-sized projects, this is perfect. I could wish for calendar features and summary reports and such, but within my current scale of operations, this hasn't become critical.

Thursday, April 16, 2009

More Thoughts on Moravec's Paradox

A couple of weeks ago, I wrote about Moravec's Paradox: that many of the mental functions we consider difficult are actually computationally easy, but some things we find easy are very hard to replicate computationally. You're probably wondering what this has to do with anything. It has everything to do with outcomes assessment. Bear with me for a moment.

Let's consider some computational task, like solving a two-dimensional homogeneous linear system of differential equations. This may sound terribly complicated if you've never been trained to do it, but it's a completely cookbook exercise. Provided you know all the rules, you can proceed step by step from problem to solution. This doesn't diminish the fact that lots of students find it hard to master the calculus and linear algebra background needed to get to that point, but that's exactly the point embodied in the paradox. It's computationally easy, but seems hard because our brains aren't evolved to do ODEs. It's unfair to ask biology to have pre-adapted our brains for such computations, since they've only been around for a couple hundred years.
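To see just how cookbook it is, here is a minimal Python sketch (my own, with a made-up matrix and initial condition) of the standard recipe for x' = Ax: find the eigenvalues and eigenvectors, solve for the constants, and write down the solution.

import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])     # an arbitrary example system x' = A x
x0 = np.array([1.0, 0.0])        # initial condition x(0)

# Step 1: eigenvalues and eigenvectors of A
lam, V = np.linalg.eig(A)

# Step 2: constants c in x(t) = sum_i c_i * exp(lam_i * t) * v_i, from V c = x0
c = np.linalg.solve(V, x0)

def x(t):
    """The closed-form solution evaluated at time t."""
    return (V * np.exp(lam * t)) @ c

print(x(0.0))   # reproduces x0
print(x(1.0))   # the state one time unit later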

On the other hand, it's very easy to grade problems of this sort. The computational simplicity means that each step can be checked for correctness. In summary: learning is hard, but assessment is easy in this example. More specifically, assessment is easy because the whole task (solving systems of differential equations) can be logically broken up into smaller tasks, each of which is easy to assess. We can't really call this phenomenon "reductionism" because that means something else in the scientific context. Let's call it "componentism" instead. Yes, it's an ugly word. No staring, move along... Nothing to see folks, just an unfortunate etymological accident.

What about easy problems? Consider those sorts of things that our brain does for us without much effort, like recognizing letters of the alphabet, regardless of font or even noise that may obscure part of the letter. Moravec's thesis is that functions like visual processing are ancient bits of mental plumbing, honed over millions of years of evolution to a shiny efficiency.

In this list of highly optimized, highly complex abilities we surely must include social judgments. I'm not sure at what point our ancestors started living in groups, or when exactly speech became intelligible, but I would posit that the former is measured in millions of years and the latter in at least hundreds of thousands. Whatever consciousness is must have emerged during that time frame, I would suppose. Moreover, there would be heavy evolutionary pressure to favor genes that facilitated navigating social interactions. Among these is the ability to make subjective assessments about another person's social status, strength, health, age, personality, and intelligence.

This is a speculative argument, then, that subjective judgments that we humans make about one another regarding matters of intelligence are deeply wired and very complex. When we as instructors, employers, colleagues, or supervisors make subjective judgments about another person's academic abilities, we're tapping into this wiring.

Complex systems are not always componentized. A burgeoning field of complex-systems science demonstrates that:
  1. Interesting and useful behavior can be obtained from nearly chaotic systems.
  2. One cannot tinker with such systems the way one can with componentized systems. They behave unpredictably when you try.
As an example of the first, some researchers are building chaotic systems for what are called flip-flops in computer engineering. The advantage is that these new flip-flops have multiple states, whereas traditional ones only have on or off. To illustrate point 2 above, it's possible to create networks of electrical components through an evolutionary process and cause the outcome to perform some useful function, but reverse engineering the resulting circuit is impossible. Any tweak sends it careening off into some new functionality.
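This is not the circuitry from that research, just a toy Python illustration of point 2 using the logistic map: nudge a parameter by a hair and, after a few dozen steps, the trajectory ends up somewhere entirely different.

def logistic_tail(r, x0=0.2, steps=50):
    """Iterate x <- r * x * (1 - x) and return the last few values."""
    x = x0
    history = []
    for _ in range(steps):
        x = r * x * (1 - x)
        history.append(x)
    return history[-5:]

print(logistic_tail(3.9))        # chaotic regime
print(logistic_tail(3.9001))     # a tiny tweak to r: a very different tail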

It seems reasonable to me that our ability to assess thinking skills in others is complex and not componentized. We may create relationships between categories of thought, like
Effective writing = addresses purpose + addresses audience + correct language
+ good style + interesting content
but that does not mean that the part of our brain that assesses effective writing can be summed up in components in this way. As a trivial example, not all jokes are funny to all people. (Check out the world's funniest joke here.) Individual tastes and proclivities vary greatly, and there's no room in a components analysis for this sort of thing.

The movie industry makes a good example. There is huge economic incentive to make a movie that will sell. There is a years-long history of what works and what doesn't. There is a well-established theory of how to tell a story (usually in three acts in a 120 page script). Undoubtedly there are component analyses of everything that can go into making a movie, and yet few movies are blockbuster hits. Why is that? It seems to me that it's the same problem--what are imagined to be the components only poorly approximate the effect of the whole experience of watching a movie.

If this conclusion is valid, if a subjective judgment provided easily by our on-board wiring has only a poor relationship to the components we try to assemble it from with formal assessments, what then? First note that this does not really diminish the importance of the components. Style is still important to writing, even if we can't quantify exactly how. And when we teach, we have to teach something specific. So it's all well and good to dwell on components and make rubrics of them if you find it useful. But we should not then assume that we have "solved" the writing assessment problem, because that problem is likely addressable only as a whole.

We might agree to agree. By working hard, we can increase inter-rater reliability of our social judgments using a set of components and averaging those ratings (shudder). But agreeing to agree means that we subvert our original impression, sacrificing validity.

What do you suppose the inter-rater reliability is in the wild? Ask three people who know a friend of yours to rate his or her ability to drive. There is probably consensus, but it's probably not perfect. It doesn't need to be perfect in order to communicate. If we insisted on high inter-rater reliability for all the things we talk about in normal life, it would be a disaster. Part of the reason is that you don't need anywhere near 100% correlation in order to communicate effectively. The other, equally important, part is that disagreement is where new information comes from. This supposes that we can have a dialogue to synthesize our differing points of view, which should be a strength of the academy (even if it's not).
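If you want to put a number on 'consensus, but not perfect,' here is a quick Python sketch of Cohen's kappa for two raters, computed by hand; the driving ratings are invented for illustration.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two friends rating ten drivers on a good/ok/bad scale (made-up data):
a = ["good", "good", "ok", "bad", "good", "ok", "ok", "bad", "good", "ok"]
b = ["good", "ok", "ok", "bad", "good", "ok", "bad", "bad", "good", "good"]
print(cohens_kappa(a, b))   # well above zero, well below one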

We already come wired with the ability to deal with problems of differing levels of complexity. For algorithmic problems, like completing the 1040 tax form, it may make our brain hurt, but we can wend our way through the components step by step to find the refund amount. On the other hand, we have wonderfully complex and intricate ways of making quick decisions about fuzzy subjective sorts of things, like how well someone drives, or how good they are at solving problems, or how reliable they are as a friend. If we don't acknowledge the difference between the two sorts of assessment problems, we are relegated to a kind of low-complexity shadow world, and are likely to be surprised when the rest of the world (outside academia) doesn't see the value in our fancy numbers and charts depicting a student's ability to, say, resolve ethical dilemmas.

There is an opportunity to see what the difference is between the actual assessments (outside of the academy, where it matters) and our component analysis. This sort of research would lend credibility to the enterprise, and we might find that a few components actually do come close to approximating the holistic subjective ratings we internalize every day as a matter of course. But that is not something I can accept as an article of faith.