Thursday, May 28, 2009

The Invention, Part Three (comic)


[Previous comic] [Next comic]
Photo credit: (first panel) Garrettc, (second and fourth panels) Kaptain Kobold, (third panel) Josh Bancroft, (fifth and last panels) bjornmeansbear
This comic may be distributed under Creative Commons.

Convincing People

I came across this article on interviewing a while back, and have been mulling it over since. The proposition is that we are wired psychologically in an asymmetrical way. Alex at Technical Interview Patterns draws from a book called "Negotiation Genius," by Max Bazerman and Deepak Malhotra. It seemed to me as I read the list of strategies that some of them could be useful in making the case for assessment (often a difficult proposition). The strategies are quoted from the article.
Strategy 1. [P]eople are much more upset by prospective losses than they are pleased by equivalent gains.
Here the idea is to frame opportunities in the negative. Compare:
We have an opportunity to become known for our liberal arts studies program.
with
If we don't act, we'll sacrifice a chance to be the first ones to do this.
Maybe this is why using the "wesayso" stick of accreditation is so effective: do this or else you may have to go job hunting.
Strategy 2. Volume matters
This is the idea (which I seem to recall reading in Machiavelli's The Prince) of spreading out good news and administering bad news all at once. It's no coincidence that it's usual practice for the White House to dump bad news on a Friday afternoon, when it's less likely to be smeared out over a whole week's media frenzy.

So perhaps a strategy for assessment directors is to deliver all the requirements up front--program review, curriculum mapping, changes to syllabi, the whole thing in one go. That takes care of the bad news. Do we have any good news? :-)

One effort in that direction is to build an ongoing dialogue with key faculty. These meetings, held weekly or at least monthly, would highlight successes of the effort so they can be disseminated. This gives the (hopefully correct) impression that assessment isn't so hard, and that it's actually good for something. It's key that this knowledge gets propagated, so rotating the membership may be a good idea.

Wednesday, May 27, 2009

The Dynamics of Interbeing and Monological Imperatives

The title comes from this excerpt of Bill Watterson's brilliant Calvin & Hobbes strip:

The occasion for this was this Chronicle of Higher Education article "Community Colleges Need Improved Assessment Tools to Improve Basic-Skills Instruction, Report Says," which I found courtesy of Pat William's Assess This! blog.

Okay, so it's a bit unfair to unload all that post-modern double-speak when the article itself is actually clearly written. The point is not that the author, Mr. Keller, deliberately obfuscates the matter, but rather that he falls prey to--
The myth of assessment: if we only had better tests, it would be obvious how to improve teaching.
This shows up early in the piece (emphasis added):
To improve the success rates of students who are unprepared for college-level work, community colleges must develop richer forms of student-learning assessment, analyze the data to discover best teaching practices, and get faculty members more involved in the assessment process[.]
Although this isn't lit-crit gobbledygook like "monological imperatives," it's arguably more misleading because of the image of simplicity it conjures up--the idea that good analysis of test results will show us how to teach better. It's actually a lot more complicated than that. The article goes on to be more specific, describing the results of a paper, "Toward Informative Assessment and a Culture of Evidence," by Lloyd Bond:
[M]easures should be expanded to include more informative assessments such as value-added tests, common exams across course sections, and recordings of students reasoning their way through problem sets[.]
Value-added tests (such as pre-/post-testing) may show areas of improvement, but not why they improved. For that you'd need controlled experiments across sections using different methods, and even then you only have a correlation, not a cause. The same goes for common exams. Transcripts of student reasoning could be good fodder for an intelligent discussion about what goes right or wrong, but can't by themselves identify better teaching methods.

Ironically, the report itself doesn't take that approach at all (page 2 of the report, my emphasis added):
From the beginning of the project, the Carnegie team stressed the importance of having rich and reliable evidence—evidence of classroom performance, evidence of student understanding of content, evidence of larger trends toward progress to transfer level courses—to inform faculty discussion, innovation, collaboration and experimentation. Because teaching and learning in the classroom has been a central focus of the Carnegie Foundation’s work, our intent was to heighten the sensitivity of individual instructors, departments, and the larger institution generally to how systematically collected information about student learning can help them improve learning and instruction in a rational, incremental, and coherent way.
The tests themselves provide rough guideposts for the learning landscape. It's the intelligent minds reviewing such data that lead to possible improvements (from page 3):
[T]he development, scoring, and discussion of common examinations by a group of faculty is an enormously effective impetus to pedagogical innovation and improvement.
The effective process described is not successful because the exam showed the way, but rather because a dialogue among professionals sparks innovation. I've made the point before that when solving very difficult problems, the most robust approach is evolutionary--try something reasonable and see what happens. This report emphasizes that the "see what happens" part does not even rely on perfect data:
To summarize, encouraging a culture of evidence and inquiry does not require a program of tightly controlled, randomized educational experiments. The intent of SPECC was rather to spur the pedagogical and curricular imagination of participating faculty, foster a spirit of experimentation, strengthen capacity to generate and learn from data and evidence[...]
The important part is not the test, but what you do with the results. This is the opposite of the conclusion one would reach from reading the quotes in the review article, which immediately devolves into the Myth.

I recommend the primary source as a good investigation, not unduly burdened by the Myth, and full of interesting results. My point is that the general perception of assessment, from the Department of Higher Education on down in recent times, perpetuates the idea that all we need is a magic test to show us the way. In fact, it's far more important to foster dialogue among teachers, administrators, and students. Inculcating a common vocabulary about goals is a good way to start (one of the uses of rubrics). The "better test" myth simply feeds the maw of the standardized testing companies, which ironically produce the kind of data that is least useful to faculty who want to improve their teaching and curriculum.

We could describe the assessment investigation as:
  1. Be organized (be reasonably scientific, keep good records, don't fool yourself)
  2. Communicate effectively (develop a vocabulary, share ideas among stakeholders)
  3. Do something (try out things that might work and see what happens)
Note that all of this is mainly directed at the "soft" assessment problem of improving teaching and programs. The "hard" problem of how to "measure" education in a global sense can't be solved in this way.

Finally, to close the loop on the post-modern theme, it's fun to note the concession to political correctness that accompanies any discussion of student remediation, here from the introduction of the article:
[W]e have used several terms: pre-collegiate, developmental, remedial, and basic skills, recognizing that these are not synonymous and that, for better or worse, each brings its own history and values.
I suggest that we add to the list "differently-learned."

Tuesday, May 26, 2009

Assessment in the Wild

A big part of the justification for the Assessing the Elephant project was that when graduates have to actually demonstrate performance in "the wild," those assessments will not be done with formal instruments. No standardized tests or (necessarily) rubrics, although the HR department may in fact foist some kind of rubric on supervisors. I mean, rather, the informal everyday judgments about job effectiveness that ultimately determine whether or not one is promoted. The same applies to graduate school, doubly so for any complex field (like the humanities), where success is more subjective.

In 2003, Ronald T. Azuma wrote a survival guide intended for computer science graduate students. It's interesting to see the non-cognitive skills he highlights, including initiative:
One of the hallmarks of a senior graduate student is that he or she knows the types of tasks that require permission and those that don't.
Others include tenacity, flexibility, and interpersonal skills. Here Dr. Azuma writes:
Computer Science majors are not, in general, known for their interpersonal skills. [...] [Y]our success in graduate school and beyond depends a great deal upon your ability to build and maintain interpersonal relationships with your adviser, your committee, your research and support staff and your fellow students. [...] I did make a serious effort to learn and practice interpersonal skills, and those were crucial to my graduate student career and my current industrial research position.
He then cites "Organizations: The Soft and Gushy Side" by Kerry J. Patterson, published in the Fall 1991 issue of The Bent, which contains the nugget I want to highlight:
To determine performance rankings, we would place in front of a senior manager the names of the 10-50 people within his or her organization. Each name would be typed neatly in the middle of a three-by-five card. After asking the manager to rank the employees from top to bottom, the managers would then go through a card sort. Typically the executive would sort the names into three or four piles and then resort each pile again. Whatever the strategy, the exercise usually took only minutes. Just like that, the individual in charge of the professionals in question was able to rank, from top to bottom, as many as 50 people. It rarely took more than three minutes and a couple of head scratches and grunts. Three minutes. Although politics may appear ambiguous to those on the receiving end, those at the top were able to judge performance with crystal clarity.
This is what actual assessment looks like. It happens all the time, formally and informally. Our supervisors, colleagues, neighbors, and friends constantly assess us just as we do them--it's the nature of living in a tribe. Are these impressions valid? If there's enough feedback to produce inter-rater reliability, then that fact alone creates a kind of validity. For example, if a co-worker has few social skills, such that he becomes the butt of jokes at the office, this very fact probably makes it less likely he'll be promoted or enjoy the reciprocation of favors that makes teamwork effective. The implicit agreement about someone's characteristics is powerful.
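To make the inter-rater point concrete, here's a minimal sketch (in Python, with invented rankings for ten hypothetical employees) of how you might check whether two supervisors' informal card-sort rankings agree. The numbers are placeholders, not data from Patterson's study.

```python
# Hypothetical illustration: two supervisors independently rank the same ten
# employees (1 = best). Strong agreement between independent informal rankings
# is the inter-rater reliability that lends these judgments validity.
from scipy.stats import spearmanr

ranks_mgr1 = [1, 2, 3, 5, 4, 7, 6, 9, 8, 10]   # first manager's card sort
ranks_mgr2 = [2, 1, 3, 4, 6, 5, 7, 8, 10, 9]   # second manager's card sort

rho, p_value = spearmanr(ranks_mgr1, ranks_mgr2)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```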

Besides the ones mentioned above, self-control is another non-cognitive skill that deserves attention. It is mentioned in this New Yorker article simply called "Don't!" The article highlights a study by a psychology professor named Walter Mischel, who assessed children's ability to forgo an immediate reward in favor of a bigger one later (a measure of self-control). Mischel has tracked the children since 1981 to see what happened to them:
Once Mischel began analyzing the results, he noticed that low delayers, the children who [couldn't wait], seemed more likely to have behavioral problems, both in school and at home. They got lower S.A.T. scores. They struggled in stressful situations, often had trouble paying attention, and found it difficult to maintain friendships. The child who could wait fifteen minutes had an S.A.T. score that was, on average, two hundred and ten points higher than that of the kid who could wait only thirty seconds.
I had just finished reading this when my 11-year-old daughter came in and begged to go to the bookstore--she'd gotten a gift certificate for her birthday.
"But they're almost closed," I said, "you'll only have 15 minutes by the time we get there."
"I want to go anyway," she said.
"How about this," I proposed, "We go tomorrow instead, and you can have two hours if you want."
She didn't even think about it. We went right away. As it turns out, the store closed later than I thought, and we had an hour to shop, but it was jolting to have this conversation right after reading the New Yorker piece. Why isn't this kind of thing--including its assessment--part of the curriculum? Take a look at the AAC&U piece about trends in general education and see if you can spot non-cognitives anywhere. There are experiences that would probably lead to non-cognitive development (internships, for example), but they aren't addressed directly. I think it's time we seriously considered the non-academic factors that are important to success and began to talk about them.

Sunday, May 24, 2009

Marking up Portfolios

I blagged (here) a while back about a potential transformation in the way we think about eportfolios. The key idea is that the actual portfolio elements reside somewhere (anywhere) on the web. We're long past the point where a single program needs to host and present the portfolio. The multitude of ways to freely create content on the web far exceeds any one portfolio program's capabilities. It's true that these works potentially wouldn't be as tightly linked to a learning management system, with handy rubrics and such already linked up. But the trade-off may well be worth it.

One essential element is the need to be able to comment on work. Students may peer-review each other's work, and certainly the teacher would like to be able to do the same. I was brainstorming this problem this week because, as it turns out, I have to figure out how to create eportfolios for the fall semester. Our old software is going poof.

By serendipity, I saw a blog at blogged.com called Academic Productivity. Lo and behold, there was an article about a "web highlighter and sticky notes" application--exactly the sort of thing I'd been brainstorming about. The service is seen as a competitor for delicious.com, which I use religiously to maintain my bookmarks. It's called diigo.com, and it allows you to bookmark a page, add tags, organize into folders, mark as read or unread, and--best of all--highlight portions of text within a web page and add comments. These can be public, private, or restricted to within a particular group. The image below, snipped from a New York Times article, illustrates such a comment. If you browse to that page with your Diigo account activated, you should be able to see the note I left, since I made it public. (Note: the "Readers' Comments" box on the left has nothing to do with Diigo--it's the usual bloggish comments allowed in NYT.)

Best of all, the creators of the service have a special provision for educators, so that you can easily organize classes into groups. I haven't tried this out yet, but it sounds absolutely perfect.

Diigo comes with a Firefox plug-in that makes it easy to do your markups. The menu bar is shown below.


An "open" portfolio of the type I've described could have a home base in a public blog, which is a nice way to host content, with built-in commenting and RSS feeds. Plus, many kinds of media can be directly embedded, like graphics, sound, video, and anything that runs on Flash or Silverlight.

Clearly, even with markups, the loss of a built-in gradebook may be too much to suffer for some instructors. Perhaps Gary Brown's harvesting gradebook will come to the rescue there. In the meantime, I think it's worth trying out the markups as a pedagogical tool.

Thursday, May 21, 2009

The Invention, Part Two (comic)


[Previous comic] [Next comic]
Photo credit (top right, bottom panels): Kaptain Kobold
This comic may be distributed under Creative Commons.

The Higher Ed Buffet

An enthymeme implicit in any solution to the hard assessment problem is that education is a commodity. That is, if we could truly pin a meaningful number on the value-added (or even absolute accomplishment) of graduates from respective institutions of higher learning, and could rank colleges and universities in a scientific way, then we would have abstracted almost every distinguishing detail away from the college experience. Students are uniform raw materials for the industrial maw of enlightenment, and the output comprises finely packaged standardized brains, weighed and bar-coded and packed in bubble-wrap, ready for shipping. Employers and graduate schools could simply mail-order their inputs from Graduate.com, and an efficient market would quickly find price equilibrium.

This belief in wholesale data compression of the multitude of products delivered by any college into a single number is a staggering arrogance that ignores what Susan Jacoby calls "[T]he unquantifiable and more genuine learning whose importance within a society cannot be measured by test scores and can only be mourned in its absence." (The Age of American Unreason, pg. 149)

It's truly hard to imagine that people actually believe this absolute data reduction is possible, but it's at the heart of attempts like the CLA to compare "residual" differences in learning outcomes across institutions, and is evinced in comments like the one I quoted from "Steve" last time from InsideHigherEd:
20 years from now: Consumer Reports will be assessing the quality of BA degrees, right along side washing machines and flying-mobiles. Parents will ask, "why should we pay 3 times the cost when Consumer Reports says that there is only a 2% increase in quality?!"
Here, a second assumption compounds the first, viz., that college ratings easily translate into worth in dollars. I scratch my head over this sort of thing, which also came out of the Spellings Commission. If what we really care about is dollars, then why not just focus on the salary histories of graduates? The US government already publishes volumes of reports on such things as the average salary of an engineering graduate. Why not add one more dimension, so that the school issuing the diploma can be identified?

One valid reason why the learning = cost equation doesn't work is articulated by an anonymous commenter to today's article on the future of higher ed costs in InsideHigherEd:
Wealthy institutions, such as the small elite liberal arts colleges which charge over $50,000 in comprehensive fees, and private elite universities know that keeping prices high is the surest way to attract the wealthiest customers who will also become future donors. This is allays the motivating factor at my institution, where the president is always public about staying among the elite by charging high tuition and by regularly raising tuition above 6%. "we have to remain at the mean of our peers" is the justification.
I remember exactly this kind of conversation with institutional researchers at a round table discussion a couple of years ago. One elite college was raising rates dramatically year after year to "catch up" to the competition. Is it worth it? Is the Harvard experience worth more because of the contacts you will make? You bet it is. How is that going to be measured with a standardized rating system?

This suggests a kind of Red Queen Race among top institutions, fighting for the best students of the top socio-economic strata. I've argued before that many more institutions in addition to the top tier are affected by this treadmill, and those who suffer the most are highly talented, highly motivated students who don't have the right credentials to get admitted into the club, or get admitted but with insufficient aid. That is a real market inefficiency that can be partially addressed with non-cognitive assessments.

A few weeks back I found myself downtown looking for a take-out lunch. I was in a hurry, and the lunch crowd had descended, creating long lines at the sort of place I'd normally eat at. I finally found one place with no lines. It quickly became apparent why that was the case--the cheapest thing on the menu was $17, for some vegetarian "delight" sort of thing. The place was plush, quiet, and refined. Maybe the few patrons were there because of the food, but it seems to me that they were paying for exclusivity as well--no bustling hoi polloi to disturb their cogitations on credit swap derivatives (this was in the banking district). Maybe this is a good place to meet future clients.

Hard assessment sees a college as a factory. Instead, I think the comparison to a restaurant is more appropriate. To see this, imagine applying hard assessment to all of the eating establishments in your locale. The product of this exercise would be a listing of all the eateries with a number denoting the value-added of each. Note that this is not a score from the health department certifying that the kitchen is clean--that's all low complexity stuff. No, our hard assessment must take into account the rich experience of dining, and produce a single number that indicates with scientific precision the performance of the establishment. If you want to take it a step further, you can add the assumption that this metric must be comparable to dollars in some way, so that higher ranked restaurants can charge higher prices.

You can have fun with this analogy: the catalog of programs as a menu, the demographic served as the clientele, institutional aid = coupon-clipped discounts, professors prepare the culinary products, and so forth. No analogy is perfect, but the advantage is that most of us have direct experience with a small number of colleges or universities, but with a large number of restaurants. If anything, it ought to be easier to do hard assessments of restaurants than it is of colleges. (Please note, I'm not talking about "one-to-five stars" type assessments prepared by city guides. They make no pretensions to be scientific.)

In order to build our assessment, we'd have to start worrying about what the most important outcomes of the dining experience are. Is it customer satisfaction? Or rather the health benefits of the food? Or perhaps the ratio of calories to dollars spent? Then we must tackle the problem of how to average across what are really qualitative differences. How do we compare a fish-lover's opinion of the tuna and ginger plate with that of the customer who just discovered she's allergic to ginger and had it sent back in favor of a hamburger? How much can we rely on self-reported ratings by customers? Do we take into account the kind of customer who normally eats there, or do we try to randomly sample the population? If the final assessment is to be a single number rating for the restaurant, how do we weight each of these components? (If the answer to that is "we'll use factor analysis," then how do we subjectively decide what the primary dimension actually means?)
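To see how quickly the subjectivity creeps in, here's a toy sketch (invented restaurants, invented scores, arbitrary components) showing that the same ratings produce a different "best" restaurant under two equally defensible weightings:

```python
# Hypothetical illustration: three component scores (0-100) for three made-up
# restaurants, combined under two different weightings. The winner changes
# with the weights, which is exactly the subjective-design problem above.
components = {            # (customer satisfaction, health benefit, calories per dollar)
    "Bistro":    (90, 40, 30),
    "Salad Bar": (60, 95, 50),
    "Diner":     (70, 55, 90),
}

def composite(scores, weights):
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

for weights in [(0.6, 0.2, 0.2), (0.2, 0.6, 0.2)]:
    ranking = sorted(components, key=lambda r: composite(components[r], weights), reverse=True)
    print(weights, "->", ranking)
```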

If this seems like a task that is impossible to do while keeping a straight face, it is. We will quickly abandon science and have to make subjective decisions about the design of the grand assessment in order to come up with anything at all. I encourage you to actually try this thought experiment, using the eateries you frequent as your raw material. Remember that the numbers you assign have to be meaningful to other people, not just yourself; solipsistic assessments aren't publishable in Consumer Reports. As a final requirement, assuming your ratings are taken seriously--now you have to figure out how to keep restaurants from "gaming" your rating system to artificially increase their scores. Good luck.

Sunday, May 17, 2009

Assessment and Automation

A January article in InsideHigherEd ponders the worth and future of assessment, including mention of the creation of a National Institute for Learning Outcomes Assessment, staffed by such luminaries as NSSE creator George Kuh. As usual, the comments after the article are as enlightening as the formal presentation. It's hard to know if comments are in earnest or simply for the lulz, but some are certainly provocative. Take Robert Tucker's comment (if it is indeed Mr. Tucker of InterEd.com, as advertised--the comments are self-identified):
Sidebar to whiney profs. Should you ever need brain surgery (or a car, or HDTV, etc.), let us know, we’ll hook you up with someone who shares your beliefs; i.e., outcomes and impact can’t be measured (except by you when you decide how well your students are doing) and inputs are sufficient to assess quality.
I couldn't find any assessment advice at InterEd.com, but the company advertises itself as a higher ed consulting business (apparently specializing in adult ed). Taking the comment at face value, it asks us whiney profs to understand that clear outcomes are determinable and desirable when employing a brain surgeon or when buying durable goods. That's certainly true enough. The implication is that this should also be true of educational products. "Steve's" addendum to Mr. Tucker's comment extends this dubious logic. If you look closely, you can see the flailing arms.
[E]ven those who insist that what Higher Ed does cannot possibly be assessed still seem to want to have doctors who are licensed, lawyers who've passed the bar, and plumbers who've demonstrated their competence.
This is a common confusion, I think, and the reason for highlighting these embarrassing passages. Passing a bar exam and demonstrating competence are two different things. Would you rather fly with a pilot who's passed all the fill-in-the-bubble tests, or one who's landed at your airport successfully many times? I'm quite sure all of the Enron accountants were fully certified, and that malpractice is committed by doctors who passed their boards. I'm sure there are incompetent lawyers who've passed the bar. A plumber is competent if he or she makes enough money at it to stay in the plumbing business. A test result may have some correlation with demonstrated competence, but they are not the same thing.

"Steve's" closing comment is sadly indicative of the general lack of critical thinking exhibited in the debate on assessing critical thinking (and the like):
20 years from now: Consumer Reports will be assessing the quality of BA degrees, right along side washing machines and flying-mobiles. Parents will ask, "why should we pay 3 times the cost when Consumer Reports says that there is only a 2% increase in quality?!"
This is one of those statements that it makes no sense even to dissect; if you believe it to be true, then no amount of argument is going to change your mind. This sort of epistemological black hole seems common in American society. If you haven't read it, take a look at Susan Jacoby's The Age of American Unreason--an excellent book on the declining rationality of the body politic.

The kind of "measurement of learning" that could actually lead to accurate statistics in Consumer Reports is a very high bar. Let's agree to call that the Hard Assessment Problem. This is in contrast to the sort of thing that goes on in the classroom, where local assessments are used to improved pedagogy, content selection, and program design and coordination. Let's call that the Soft Assessment Problem. There are many successful examples of the latter. For hard assessment, I think we should be a lot more cautious about making claims. In an Education Week article (available here without subscription) Ronald A. Wolk, Chairman of the Big Picture Learning Board, lists five assumptions he claims are the root of education's ills. The first two are related to hard assessment and more generally a positivist approach to education:
Assumption One: The best way to improve student performance and close achievement gaps is to establish rigorous content standards and a core curriculum for all schools—preferably on a national basis.

Assumption Two: Standardized-test scores are an accurate measure of student learning and should be used to determine promotion and graduation.
Proponents of hard assessment ought to be informed by other kinds of "measurement," like the ratings of companies and their bond issues by large, well-paid firms like Standard & Poor's. Their project mainly involves crunching numbers to see what the prognosis is--much more quantitative than assessing learning. You'd think the success rate would be pretty good. I think the recent financial mess indicates clearly that such ratings are not very good.

If hard assessment is possible, why don't we test parents on their parenting skills? Anyone who flunks gets their kid hauled off to the orphanage. Is there any test we would invest that much confidence in? I hardly think so.

I actually sat down to write about something tangential, so here's the segue: think of the hard assessment problem backwards. If we can truly understand what goes into the measurement of a phenomenon, then we should be able to reproduce that phenomenon by varying inputs. For example, we can put a sack of potatoes on a scale and weigh it. We can equally add sand to the scale until it weighs the same amount--we can reproduce the measurement through artificial means.

To see where this is going, let me ask this: is being able to do integral calculus a skill that requires critical thinking ability? Before you answer, consider the contrapositive: without critical thinking ability there is no solving integrals. Not coincidentally, this train of thought happened just as Wolfram|Alpha launched. I asked it what the integral of 1/(x^2+1) was:


The mindless web service came back with the correct answer, a couple of graphs, and a Taylor series for the solution, just in case I want a polynomial approximation. Very nice! Was critical thinking involved? Well, not unless you want to concede that computers can think.
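If you'd like to reproduce the trick at home without Wolfram|Alpha, a few lines of SymPy do the same mindless, rule-driven work. This is just an illustration of the point, not what Alpha runs internally:

```python
# The same computation via SymPy's rule-based algorithms: integrate 1/(x^2+1)
# and produce a Taylor polynomial of the answer. No thinking required.
from sympy import symbols, integrate, atan, series

x = symbols('x')
antiderivative = integrate(1 / (x**2 + 1), x)   # -> atan(x)
approximation = series(atan(x), x, 0, 8)        # polynomial approximation, as Alpha offered
print(antiderivative)
print(approximation)
```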

You may complain that I've cheated--that of course there was critical thinking involved in the construction of the program, just as a reference book itself cannot think but is the product of many hours of cogitation. The computer itself is unthinking, and hence cannot think critically or any other way. But in this case, I think you would be wrong. In essence, the programmers have taught the computer to operate just like first-year math students. The processes (algorithms) are the same; it's just that the computer is very fast at doing them. Either the CPU qualifies as a critical thinker, or critical thinking is not required to solve the problem.

Please note that it's not necessary to think at all to create complex machines that can think. Evolution is the most powerful problem-solver yet evidenced. It's very slow, horribly wasteful, not goal oriented, and completely thoughtless, and yet it made us. To me, one of the central mysteries of the universe is how deductive processes created inductive processes, but that takes us too far afield.

This hypothesis (viz, if we can measure it, we can create it artificially) actually sounds too strong to me. But I think the distinction is useful nevertheless. We could agree to these conditions for hard assessment:
  • If I can assess some phenomenon quantitatively AND reproduce it artificially, then it is a hard assessment (not necessarily a measurement, for other reasons I've blagged about).
  • If I can assess it quantitatively BUT NOT reproduce it artificially, then let's call it an observation.
As an example, we can observe happiness but not assess it. We can observe critical thinking but not assess it (at least until AI develops). We can assess the ability to do multiplication tables or integral calculus. In the health fields, we can assess some conditions but only observe others.

This idea is a bit out of the box, and is probably useful only in limited contexts. But I think it is useful to draw attention to the transparency or opaqueness, and the complexity, of assessment tasks. Focusing on reproducibility is one way to do that. On the other hand, I may wake up tomorrow and think "what a stupid idea I've advertised on my blog," but that's the nature of complex debate, I fear. Even in one's own mind.

Friday, May 15, 2009

AP Exams and Complexity

The College Board offers AP tests, which are generally accepted for college credit, in certain subjects. High school courses are offered as preparation for the tests--a very nice arrangement for the College Board. The relationship between grades in the AP courses and scores on the exams was an issue in this article in the Jacksonville News, viz:

Duval [County public school] students passed 80 percent of their AP courses last year with a "C" or better. But only 23 percent of the national AP exams, taken near the end of those courses, were passed.

The national exam pass rate for public schools was 56 percent.

In other words, students successfully complete a course that is essentially preparation for a standardized test, and then fail said test. The College Board, which reviewed the results, blames the effect on under-prepared students taking the courses combined with inexperienced teachers. When I read this, I thought it was perhaps a good opportunity to look for evidence of a phenomenon I've said should exist: that standardized tests are more valid for low-complexity subjects of study. Here, complexity is meant in the computational sense (search for the word in my blog for lots more on the subject). If we assume all things equal (dubious, but I have no choice), then preparation courses for lower-complexity subjects ought to be more effective than those for higher-complexity subjects. This would manifest itself in the relationship between passing the course and passing the AP exam. This is all possible because the statistics for the school district in question are posted online.

In pure complexity terms, math is low complexity and languages are higher complexity. This is easy to see--math is a foreign language with little new vocabulary and a few rules. Spoken languages have massive vocabularies and many, often arbitrary-seeming, rules of grammar. So if my theory is any good, it ought to be the case that math courses can prepare a student better than language courses for a standardized test, all else being equal. Also, the overlap in students between the two kinds of courses is probably pretty good, since college-bound seniors will be taking both a foreign language and math. Of course, even learning a foreign language is mostly committing deductive processes to memory, and hence not of the highest complexity (inductive processes would be). So this is a contest between low complexity and, shall we say, medium complexity.

Here are the results:

I debated whether or not to include both calculus courses (clearly, more advanced students are in the BC section). I also assumed that the three languages are equally complex, although in practice Spanish dominates. If a student passed a calculus course, he or she had a 48% chance of passing the AP subject test. For languages (the more complex subject), only 39% of those who passed the course went on to succeed on the exam.

Does this prove anything? Not really--there are too many uncontrolled variables. But it's still fun to push this as far as it can go. If the complexity-to-difficulty relationship holds, we would expect the subject with the worst test/course pass ratio to be the most complex. Of course, sample size plays a role, so let's agree (before I look at the numbers) that there had to be at least N=50 to qualify. For all tests combined, the average test/course ratio was 29%. Anything lower than that would indicate lower-than-average preparation for the test (and higher complexity, maybe). The least effective (or most complex) course was a three-way tie, with a 16% conditional probability of passing the AP test given that the course had been passed. The three subjects were World History, Human Geography, and Microeconomics. These each had enrollments in the hundreds or thousands.
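For anyone who wants to replicate the arithmetic from the district's posted statistics, the calculation is just a conditional pass rate per subject. The counts below are invented placeholders, chosen only to reproduce the percentages quoted above:

```python
# Sketch of the test/course ratio: AP exam passes divided by course passes
# approximates Pr[pass exam | passed course] for each subject.
course_and_exam_passes = {          # subject: (passed course, passed AP exam) -- invented counts
    "Calculus AB":   (200, 96),
    "Spanish":       (300, 117),
    "World History": (500, 80),
}

for subject, (passed_course, passed_exam) in course_and_exam_passes.items():
    if passed_course >= 50:         # the N >= 50 cutoff agreed to above
        print(f"{subject}: {passed_exam / passed_course:.0%} exam passes per course pass")
```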

Are these subjects the hardest to test because of complexity? It's easy to guess that the first two might be, cluttered with endless facts and fuzzy theories. Microeconomics is much more like chemistry or physics, one would think. Chemistry scored a low 22%, but Physics B was 58%.

This was fun, but it's hard to really make the case that complexity is the driving force here. It does give me some ideas, however, about comparing difficulty (course pass rate) versus complexity (course to test pass ratio). Meanwhile, I still have the placement test project to try out...

[Update: here's an interesting article about the effectiveness of the calculus AP test]

Wednesday, May 13, 2009

Real Outcomes


One of the best assessments of college is the job and the paycheck. The chart above tells the tale (taken from here). Another table from the Wall Street Journal gives data by major, which is very interesting (hint: liberal arts comes out looking good)--see this post for details and comments. I argue there that the government is in the best position to do detailed studies by institution if they really are interested in learning outcomes. I had an opinion piece published about that idea in University Business a while back. Maybe the new administration would be more receptive to such an idea.

Another measure of success of higher education is the percentage of the population with degrees.
These are all solid, usable statistics with no statistical goo added. Notice that there aren't any averages in sight (unless you consider a percent to be an average over a binary variable).

This contrasts with the generally opaque assessments of learning outcomes. Part of the allure of standardized testing is that at least the outcome seems to be clear: it arrives in the form of a crisp number like the ones above. These can be turned into pretty graphs. The problem there is whether or not they actually mean anything. I've argued that they can, if the complexity of what's being tested is appropriately low. Testing calculus (low complexity) may work, but testing critical thinking (high complexity) probably won't.

Saturday, May 09, 2009

eBooks

Academic publishing is such a mess, it's hard to know where to start. Academic presses are cliquish and limited in capacity, journals are polluted with "publish or perish" read-only articles protected behind intellectual property walls created not by the authors but by the publishers, and textbooks... there's a long list there.

Math is easier than some subjects in this respect. I try to find out-of-print texts for my students, which offer the same benefits at a tenth the price of new books. Dover's softcover catalog is very reasonable too, of course.

Anything that looks like a crack in the existing order of things is interesting to me, so the introduction of an "academic Kindle" from Amazon attracted my attention. There's an article in the Chronicle about its potential use in higher education to serve up e-textbooks.

Electronic books have been around for a long time. When I was directing library operations, I made sure we got in on the ground floor of NetLibrary's electronic collections. It was an amazing opportunity at the time--thousands of titles for a one-time cost of a couple thousand dollars. Over time, the collection has grown to the tens of thousands, and the total investment has remained extremely reasonable. Reading off a computer screen is the disadvantage, but on the other hand it's extremely nice to be able to search a book with a few clicks.

I did not know about CourseSmart, however, until reading the article. This service allows you to buy and download electronic versions of current textbooks, suitable for reading on your laptop. Not for the Kindle, alas, at least not yet. I also wonder about the usefulness of the cute little reader for textbook use. The article touches on that, and after playing with the one I bought my wife for her birthday (the small-screen version two, not the one just announced), I think it's got a ways to go before it can duplicate the usefulness of a paper text or an ebook delivered on a powerful browsing platform. Part of the problem is color: it's nice for illustrations, but the Kindle's e-ink is black and white. The rest of the challenge I see is the interface. Reading a novel is quite different from using a textbook. Now if the Kindle had TWO screens like the pages of an open book, and you could point them at different pages, you'd have something. Perhaps a folding model with this feature is in our future. I sliced and pasted an image from Amazon to mock one up below:

Even without color, this would be a leap forward in the electronic textbook capabilities. It's hard to get this kind of visual real estate with a laptop.

One final observation: this generation types with their thumbs. I bet if I googled around, I could find a USB keyboard for a PC that looks like a phone's keypad. That interface will probably evolve to include better chording (use of multiple keys at once to identify keys more quickly) and word prediction. The Kindle's keypad needs to squish in to accommodate this trend. Maybe with a roll-out, plug-in keyboard option for old-timers like me...

Friday, May 08, 2009

State Standards

It's generally not hard to find K-12 state standards when you need them, but you can find them all in one place at shoonoodle.com. It's useful for teacher-education programs, but also sometimes for general assessment. For example, one of our projects looked at the transition between 12th-grade English and freshman composition. That was very productive, and it's worth having some high school teachers get together with the freshman faculty to compare notes. For us, the big differences were the hand-holding of students and the cultivation of a lack of independence--outcomes of standardization--both obvious contrasts between the high school and college experiences.

Learner Centered Technology

Ed Nuhfer sent me an interesting paper by Jody Paul entitled "Improving Educational Assessment By Incorporating Confidence Measurement, Analysis of Self-Awareness, and Performance Evaluation: The Computer-Based Alternative Assessment (CBAA) Project." Dr. Paul has a website with general information about learner-centered technology, which the paper explores in the context of multiple-choice testing.

At the heart of the method Jody constructs is the notion that we need to make room in our assessments for uncertainty. Or to reverse the idea, the confidence a student has in an answer is important. From the article:
[T]raditional scoring, which treats students' responses as absolute (effectively a 0 and 1 based probability distribution), begs the question: Is a student's knowledge black and white? How can a student express belief in the likelihood that an alternative may be correct? Further, how can a student's ability to carry out a process be traced and evaluated? Addressing these questions requires going beyond traditional multiple-choice testing techniques.
A couple of days ago I wrote about Ed Nuhfer's knowledge surveys, which approximate student confidence in subject material with a survey. Jody's idea extends this to a testing environment. Obviously there are differences between surveys and tests. One might expect students to be honest about their confidence in a survey, or perhaps underestimate it slightly, because they may see it as affecting the review and the test itself. On a test, a student has nothing obvious to gain by admitting uncertainty. That changes if "near misses" are partially rewarded. This is like partial credit on a pencil-and-paper test. But how can one indicate such subtleties on a multiple-choice test?

Dr. Paul's solution is to create software that allows rich responses from test-takers. A schematic of the interface is shown below, annotated with meanings of the various zones of response.

The response mechanism allows students to waffle about their answers. The analysis in the paper of different weighting strategies is quite detailed (and mathy). It raises, and attempts to answer, interesting questions about multiple-choice testing and the idea of rewarding partial knowledge.
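To give a flavor of how confidence-weighted scoring can reward partial knowledge, here's a small sketch using a quadratic (Brier-style) rule. This is my own illustration, not Dr. Paul's actual weighting scheme or interface:

```python
# Illustrative confidence-weighted scoring: the student reports a probability
# for each alternative, and a Brier-style rule rewards honest partial
# confidence instead of forcing an all-or-nothing choice. Lower is better.
def brier_score(reported_probs, correct_index):
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(reported_probs)
    )

confident_right = [1.0, 0.0, 0.0, 0.0]   # traditional "absolute" response, correct
hedged          = [0.6, 0.3, 0.1, 0.0]   # partial knowledge, honestly reported
confident_wrong = [0.0, 1.0, 0.0, 0.0]   # traditional response, wrong

for label, probs in [("confident & right", confident_right),
                     ("hedged", hedged),
                     ("confident & wrong", confident_wrong)]:
    print(f"{label}: score = {brier_score(probs, correct_index=0):.2f}")
```

Under a rule like this, a student who waffles intelligently does worse than one who is confidently right, but much better than one who is confidently wrong, which is the incentive structure partial credit is after.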

Wednesday, May 06, 2009

Blags and Electronic Resources

On the education subreddit I came across 100 Most Inspiring and Innovative Blogs for Educators. It has some good finds, including Jane's E-Learning Pick of the Day. Jane posts small articles with links to interesting online resources. Go look for yourself.

I found a couple of mind mappers I hadn't seen before, and one that a friend had mentioned called The Brain. The video demos look good.

There's also an open-source web-based HTML editor called Amaya I hadn't seen before.

By the way, the neologism "blag" comes from "blog" as an intentional misspelling, courtesy of xkcd:

So... "log" comes from a ship throwing a piece of wood over the side to see how fast it's moving, which comes to mean the book the readings are written in, which comes to mean any form of recording information regularly. Meanwhile "world wide web" gets shortened posthaste to "web," and the two words get put together to make "web log," which contracts to "blog" and now (if you're sufficiently geeky and want to be ironic) "blag." Isn't language fun?

Difficultie, er, Difficulty vs C0mp13xi7y

Sorry for the l337 garbage in the title. I've been absorbing the ideas from Dr. Ed Nuhfer (see previous post) about knowledge surveys, and in particular the idea that complexity and difficulty are orthogonal. That is to say, different dimensions. We could think of an abstraction of learning that incorporates these into a graph. Mathematicians love graphs. Well, applied mathematicians love graphs anyway.
I have used my considerable expertise with Paint to produce the glorious figure shown above. It's supposed to lay bare an abstraction of the learning process if we consider only difficulty (expressed here inversely, as probability of success) and complexity. I think it's fair to say that we sometimes think of learning as happening this way--that no matter what the task, the probability of success is increased by training, and that more complex tasks are inherently more difficult than less complex ones. This may, of course, be utterly wrong.

We encountered Moravec's Paradox in a previous episode here. The idea is that some things that are very complex seem easy. For example, judging another person's character, or determining if another beer is worth four bucks at the bar. So, it may be that the vertical dimension of the graph isn't conveying anything of interest. But I have a way to wiggle out of that problem.

If we restrict ourselves to thinking about purely deductive reasoning tasks, then success depends on the learner's ability to execute an algorithm. Multiplication tables or solving standard differential equations--it's all step-by-step reasoning. In this case, it seems reasonable to assume that increased complexity (number of steps involved) reduces chance of success. In fact, if we abstract out a constant probability of success for each step, then we'd expect an exponentially decaying probability of success as complexity increases because we're multiplying probabilities together (if they're independent anyway).

We could test this with math placement test results. An analysis of the problems on a standard college placement test should give us pretty quickly the approximate number of steps required to solve any given problem (with suitable definitions and standards). The test results will give us a statistical bound on the probability of success. Looking at the two together should be interesting. If we assume that Pr[success on item of complexity c] = exp(d*c), where d is some (negative) constant to be determined and c is the complexity measured in solution steps, then we could analyze which problems are unexpectedly easy or difficult. That is, we could analyze the preparation of students taking the placement test not just by looking at the number of problems they got correct, but by the level of complexity they can deal with.
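Here's a rough sketch of what that analysis might look like, with invented placement-test numbers standing in for real data: fit the decay constant d, then flag items that are easier or harder than the model predicts.

```python
# Sketch of the proposed analysis with made-up data: estimate d in
# Pr[success | complexity c] = exp(d*c) by least squares on log(success rate),
# then compare each item's observed rate to the model's prediction.
import numpy as np

steps        = np.array([1, 2, 3, 4, 5, 6])            # solution steps per item (hypothetical)
success_rate = np.array([0.90, 0.78, 0.70, 0.55, 0.52, 0.38])

d = np.dot(steps, np.log(success_rate)) / np.dot(steps, steps)  # best-fit slope through the origin
predicted = np.exp(d * steps)

for c, observed, expected in zip(steps, success_rate, predicted):
    flag = "easier than expected" if observed > expected else "harder than expected"
    print(f"complexity {c}: observed {observed:.2f}, model {expected:.2f} ({flag})")
print(f"estimated d = {d:.2f}")
```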

I've written a lot about complexity and analytical thinking, which you can sieve from the blog via the search box if you're interested. I won't recapitulate here.

I'll see if I can find some placement test data to try this out on. It should make for an interesting afternoon some day this summer. Stay tuned.

Update: It occurs to me upon reflection that if the placement test is multiple-choice, this may not work. Many kinds of deterministic processes are easier to verify backwards than they are to run forwards. For example, if a problem asks for the roots of a quadratic equation, it's likely easier to plug in the candidate answers and see if they work than it is to use the quadratic formula or complete the square (i.e., actually solve the problem directly). This would make the complexity weights dubious.
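A tiny example of the verify-versus-solve asymmetry (my own illustration): checking a multiple-choice candidate for a quadratic takes one substitution per option, while solving directly requires the whole procedure.

```python
# Verify backwards vs. solve forwards for x^2 - 5x + 6 = 0 (roots 2 and 3).
import math

a, b, c = 1, -5, 6

def is_root(x):
    """One substitution per candidate -- the cheap 'backwards' check."""
    return a * x**2 + b * x + c == 0

def solve_directly():
    """The full 'forwards' procedure via the quadratic formula."""
    disc = b**2 - 4 * a * c
    return ((-b + math.sqrt(disc)) / (2 * a), (-b - math.sqrt(disc)) / (2 * a))

print([x for x in (1, 2, 3, 4) if is_root(x)])   # screen the multiple-choice options
print(solve_directly())
```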

Tuesday, May 05, 2009

Response on Knowledge Surveys from Ed Nuhfer

Last time I mentioned that I had discovered Dr. Ed Nuhfer's work on knowledge self-assessments. I wrote Dr. Nuhfer to ask a couple of questions about the results, and he was kind enough to respond and give me permission to post his comments here. His remarks are fascinating.

In order to understand the context of my questions (in blue below), you may want to take a peek at the article I cited yesterday, "The Knowledge Survey: A Tool for All Reasons" by Ed Nuhfer, and examine Figure 2 in that paper. Don't be confused by the fact that the same graph is referenced as Figure A below. The graph shows self-assessed learning differences (pre- and post-) sorted by the Bloom's level of the item in question.

Hi David—let me see if I can help you here. You stated: "Wouldn’t one expect both graphs to be lower at the right side of the graph? That is, shouldn’t more complex tasks seem more challenging, and therefore inspire less confidence? Or are the data normalized in such a way as to hide this kind of thing?"

"A related question is, if you sort the results by difficulty (either pre or post estimate), is there some trend in the type of question that is more difficult?"

Good thinking. Let's start with the second question first. The patterns in pre- are largely governed by the students' backgrounds in relation to the content the question asks for. It's not difficulty so much as language. Remember these are averages—so on the average, students in a school have had some preparation in some areas and none in others. We find amazing consistence between different sections of the same course.

The post- is pretty consistent between same instructor through multiple course sections. Usually, there are differences we can see between different instructors doing same courses. If students are REALLY learning something, minds should be changed—some valleys in pre's should be peaks in posts. Often, this is not the case. If students come in with knowledge and we just teach to those same areas, we can end up with pre-post correlations of about 0.8. That's NOT what we should want.

On language," higher level" does not necessarily equate to "more difficult." Name the capital of your state is recall and it's easy; name the capitals of all the states is still recall, but it is more difficult. Learning names of twelve people in our class is fairly easy; learning the names of all people in the Chicago phone book is nearly impossible—and it's still just a low level task.

Next, let's see if I can answer that first question, starting with the Figure you sent me. That distribution of Bloom levels is from the very first knowledge survey I ever did—probably around 1992-1993.

Call this Figure A.

Next, let's look at the simple pre-post results in order of the items given in that same class. We now have two ways of looking at this same data.


Call this Figure B – the items are now in the order of course presentation.

Note the general drop in reported achievement (Figure B)-- after item approximately 160. That occurred because of poor course pacing—too much material run through in about the last two weeks.

What Figure A helps to answer is, of the material lost by galloping through it in the final weeks, what was the nature of this loss? Figure A shows that this loss was mostly low-level information. So now, in answer to your question, the higher Bloom challenges had already been done earlier, and there was plenty of time devoted to these. Students who scored badly on first drafts had opportunity to revisit the assignments and revise.

The greatest value of a Figure A lies in the course planning. BEFORE we inflict the course on students, we can know the level of challenge we are going to deliver. This example was for a sophomore course and was about right on the money for meeting most of those students' needs—not a lot of prior familiarity with this material (lots of blue in the figures is not good), but they had good understanding of most of the material by the end (lots of red is good). We hit Vygotsky's zone of proximal development pretty well on most of these students as a result.

The open-ended challenges used for high Bloom levels involved conceptual understanding of science and evaluation of hazards posed to oneself by asbestos and radon. Because these were considered the most important learning outcomes, those are what we focused on most and early, in order to be sure we met them.

There are always more facts that students could know, but beyond what was needed to meet the objectives of this particular course, we didn't worry much about what was lost in the last weeks—(lots of gap above the red is not good) because we had already met the planned learning outcomes very well.

This particular course was redesigned as the result of this assessment by cutting out most of the less essential material altogether. Better to "teach less better;" it really doesn't serve anyone well to present even low-level information through a "drive-by" when students aren't able to learn it well.

One thing I want to add that we stressed in the paper Delores Knipp and I wrote is that Figure A in itself cannot demonstrate that critical thinking occurred. Just because we ask high level Bloom questions doesn't mean that students respond with high-level answers. A reviewer needs to know answers to something like the following: "OK, I can see the high level challenges and I can see that students now register high confidence to meet these. But, just WHAT DID THE STUDENTS DO to demonstrate that knowledge?

To answer that, we need to show the actual assignments/projects and rubrics used to evaluate the students' responses. If we have the assignments, the rubrics and the knowledge survey results, we can then clearly see how well students really met high-level challenges with high-level responses.

Monday, May 04, 2009

Knowing What You Know

In "Categories of Risk" I dissected Donald Rumsfeld's factorization of epistemology and discovered a missing term: unknown knowns. Ed Nuhfer has put the idea of "known knowns" to work in the classroom.

I often find good stuff in the Inside Higher Ed article comments, like their April 28 piece "Assessment is Widespread." The conclusion of the article is that learning outcomes assessment happens more than commonly believed, and has for a long time. Some commenters beg to differ, pointing out that it may not be of much quality or may not even be used for anything. David Cleveland writes something more interesting:
Dr. Ed Nuhfer (California State University - Channel Islands) has worked extensively in the development of Knowledge Surveys that cause the faculty member to develop clear, chronological expected student learning outcomes and then conduct pre and post-tests on student confidence to demonstrate these skills.
The idea is to find out whether students think they know how to successfully answer items on a test. It didn't take long to track down research papers on the idea: "The Knowledge Survey: A Tool for All Reasons" by Ed Nuhfer and "Knowledge Surveys: What Do Students Bring to and Take from a Class?" by Delores Knipp.

This approach gets to the heart of the assessment problem in a very elegant way. It connects the learner with the material in a meta-analysis that forces him or her to think about the process of learning. At a bare minimum, the student must consider "do I know this or not?" But because of the way the items are structured, solutions are not so easily wrought, and the idea of partially knowing is natural. This could lead to rich classroom discussions about not just the material, but also the complexity (or difficulty) of the task, and why it may be difficult. I've burned a lot of blogging bits on the idea of complexity and its impact on how assessment works, so this approach naturally appeals to me. The idea that ignorance begets meta-ignorance was the subject of this post in February, based on a New York Times piece claiming that those who don't know, don't know they don't know. So it seems that teaching students to understand the limits of their understanding is extremely useful. One of the non-cognitive dimensions linked to student success identified by William E. Sedlacek is realistic self-appraisal.

This post just scrapes the surface of this topic. More to come.

Sunday, May 03, 2009

Assessment Reference Book


I have a chapter in a new reference book on assessment. Here is the information, so you can buy one for all your friends for the holidays. Imagine the faces of your colleagues lighting up with joy as they unwrap the weighty tome that is:

Handbook of Research on Assessment Technologies, Methods, and Applications in Higher Education

ISBN: 978-1-60566-667-9; 500 pp; May 2009

Published under the imprint Information Science Reference (formerly Idea Group Reference)

http://www.igi-global.com/reference/details.asp?id=34254

Edited by: Christopher S. Schreiner, University of Guam, Guam

DESCRIPTION

Educational institutions across the globe have begun to place value on the technology of assessment instruments as they reflect what is valued in learning and deemed worthy of measurement.

The Handbook of Research on Assessment Technologies, Methods, and Applications in Higher Education combines in-depth, multi-disciplinary research in learning assessment to provide a fresh look at its impact on academic life. A significant reference source for practitioners, academicians, and researchers in related fields, this Handbook of Research contains not only technological assessments, but also technologies and assumptions about assessment and learning involving race, cultural diversity, and creativity.

****************************************

"There is a stunning range of inquiry in this well-edited book, which certainly exceeds the bounds of the usual handbook in so far as it is always readable and stimulating. Every department chair and assessment coordinator needs a copy, but so do faculty members seeking to get aboard the assessment train that has already left the station. Bravo to IGI Global and the editor for gathering such exceptional essays and articles under one cover!"

- Dr. Michel Pharand, The Disraeli Project, Queen's University, Canada

****************************************

TOPICS COVERED

Assessment applications and initiatives

Assessment technologies and instruments

Collaborations for writing program assessment

Communication workshops and e-portfolios

Creativity assessment in higher education

Effective technologies to assess student learning

Faculty-focused environment for assessment

Instructional delivery formats

Method development for assessing a diversity goal

Multi-tier design assessment

Reporting race and ethnicity in international assessment

Technology of writing assessment and racial validity

For more information about Handbook of Research on Assessment Technologies, Methods, and Applications in Higher Education, you can view the title information sheet at http://www.igi-global.com/downloads/pdf/34254.pdf. To view the Table of Contents and a complete list of contributors online go to http://www.igi-global.com/reference/details.asp?ID=34254&v=tableOfContents. You can also view the first chapter of the publication at http://www.igi-global.com/downloads/excerpts/34254.pdf.

ABOUT THE EDITOR
Christopher S. Schreiner is Professor of English and Chair of the Division of English and Applied Linguistics at the University of Guam. Before teaching on Guam, he was Professor of Literature at Fukuoka Women’s University in Japan, and Professor of Integrated Arts and Sciences at Hiroshima University. He has coordinated assessment for the Division of English and Applied Linguistics in preparation for the WASC visit, and authored the summary assessment report for the grant-funded Project HATSA in the College of Liberal Arts and Social Sciences at the University of Guam. One of his recent articles, “Scanners and Readers: Digital Literacy and the Experience of Reading” appeared in the IGI Global book, Technology and Diversity in Higher Education (2007).



This copy courtesy of the publisher.

Friday, May 01, 2009

Statistical Goo

I've referred over the last few posts to "statistical goo" to mean numbers from grades, rubrics, surveys, standardized tests, or other sources that have no clear meaning once assembled. Often they are the result of averaging yet other numbers so that the goo is recursively opaque.

First, I should say that goo may not be totally useless. The gold standard of utility is often predictive validity. Grades are goo, for example, but they still have some predictive power: high school grades explain perhaps 20% of the variance in first-year college GPA (a correlation of roughly 0.45), and first-year grades predict the rest of the college career fairly well. But you have to tease the statistical effects apart from the real effects (where "real" means in the sense of the physical universe, not just a numerical artifact).

It is easy to imagine that if your data have a normal distribution (a bell curve, or Gaussian), this must mean something profound. But the ubiquity of the bell curve comes largely from the fact that this is how averages of random variables tend to distribute themselves. The graph below is courtesy of Wikipedia.
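Independently of that figure, a quick simulation makes the point concrete. The made-up "item scores" in this sketch are deliberately skewed, yet the per-student averages of thirty of them pile up into a respectable bell curve.

# Sketch: averages of non-normal random scores still distribute like a bell curve.
import random

random.seed(1)

def skewed_score():
    # A deliberately non-normal "item score": mostly low, occasionally high.
    return random.choice([0, 0, 0, 1, 1, 2, 4])

# Average 30 such scores for each of 5000 simulated students.
averages = [sum(skewed_score() for _ in range(30)) / 30 for _ in range(5000)]

# Crude text histogram: the averages cluster symmetrically around the mean.
lo, hi, bins = min(averages), max(averages), 15
width = (hi - lo) / bins
counts = [0] * bins
for a in averages:
    counts[min(int((a - lo) / width), bins - 1)] += 1
for i, c in enumerate(counts):
    print(f"{lo + i * width:5.2f} {'#' * (c // 20)}")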

It's easy to fool oneself, and as a bonus fool others. See the graph below, showing nine groups of students, divided up by their first-year college grade average. The graph tracks what happens over time to each group's cumulative GPA.

Imagine turning this graph loose on a planning committee. Obviously something dramatic is happening in the second year, because the distribution becomes much tighter. What could it be? The discussion could easily turn to advising programs, analysis of course completion, data mining on demographics or other information, and so forth. There's nothing wrong with those efforts; it's just that the graph doesn't really support them. You might want to take a moment to puzzle out why that is for yourself.

The central limit theorem is the formal way of talking about distributions of averages in a pretty general context. A companion fact is that the variance (and hence the standard deviation) of an average shrinks as the sample size grows. What happens between the first and second years of accumulated GPA? There are twice as many grades! Hence we would expect the variation to decrease. Another way of thinking of it is as a combinatorial problem. If you are a 4.0 student, there is only one way to maintain that average: get all As the second year. On the other hand, there are lots of ways to decrease your average: any combination of grades that is not all As (there are 3,124 of those; see the quick count below).
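The 3,124 works out if we assume, hypothetically, a year of five courses graded with the five letters A through F: of the 5^5 = 3,125 possible grade combinations, all but one pull a 4.0 down. A quick count under that assumption:

# Count grade combinations for a year of five courses with five possible letter grades.
from itertools import product

grades = ["A", "B", "C", "D", "F"]
combos = list(product(grades, repeat=5))            # 5**5 = 3125 combinations
not_all_As = [c for c in combos if set(c) != {"A"}]
print(len(combos), len(not_all_As))                 # prints: 3125 3124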

We must conclude that the GPA compression apparent in the graph is mostly a statistical artifact (we would check the actual variances to quantify this), and not due to some real-world factor like student abilities or the difficulty of the curriculum.
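To see that the compression requires no change in the students at all, here is a toy simulation; the course load, grade propensities, and cohort size are entirely made up. Each simulated student keeps the same grade tendencies in both years, yet the spread of cumulative GPAs still tightens once the second year's grades are folded in.

# Sketch: cumulative-GPA spread shrinks with more grades, with no change in students.
import random
from statistics import pstdev

random.seed(2)
POINTS = [4, 3, 2, 1, 0]        # A..F on the four-point scale
COURSES_PER_YEAR = 5            # hypothetical load

def draw_year(weights):
    # Draw one year of grades from a student's fixed grade propensities.
    return [random.choices(POINTS, weights=weights)[0] for _ in range(COURSES_PER_YEAR)]

# Every student keeps the same "true" grade propensities both years.
students = [[random.random() + 0.1 for _ in POINTS] for _ in range(2000)]

year1 = [draw_year(w) for w in students]
year2 = [draw_year(w) for w in students]

gpa_year1 = [sum(g) / len(g) for g in year1]
gpa_cumulative = [sum(a + b) / (2 * COURSES_PER_YEAR) for a, b in zip(year1, year2)]

print(f"std dev of first-year GPA:  {pstdev(gpa_year1):.3f}")
print(f"std dev of cumulative GPA:  {pstdev(gpa_cumulative):.3f}")  # smaller: sampling noise averages out

How much smaller the second number is depends on how much the simulated students really differ, which is exactly why we would check actual variances to decide how much of the real graph's compression is artifact.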

Another fallacy easily derived from the graph above is that the poorer students get better over time, since their groups' cumulative GPAs climb steadily with each year. We've already dispensed with part of that by means of the central limit theorem, but there are other factors at play too; note that the slope of the graph is sharpest at the bottom. And everybody knows that correlation doesn't imply causation. The Church of the Flying Spaghetti Monster, for example, holds that global warming is the result of the worldwide decline in pirates.


After some musing, you might conclude that the poor performers' GPAs improved because of dropouts. It's simply not possible to maintain a 1.0 GPA for long, so the lower groups' averages would rise because of survivorship. Not controlling for survivorship invalidates a lot of conclusions about learning. It's common practice not to control for it, however, because doing so requires waiting years for a cohort to cycle through the university's digestive system.
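Survivorship is just as easy to see with made-up numbers: remove the students whose GPA falls below some cutoff, and the surviving group's average rises even though no individual's grades changed.

# Sketch: dropping low-GPA students raises the group average with no one improving.
gpas = [0.8, 1.0, 1.2, 1.5, 1.9, 2.4]      # a hypothetical low-performing group

average_all = sum(gpas) / len(gpas)
survivors = [g for g in gpas if g >= 1.2]  # suppose students below 1.2 leave
average_survivors = sum(survivors) / len(survivors)

print(f"group average, everyone:  {average_all:.2f}")        # 1.47
print(f"group average, survivors: {average_survivors:.2f}")  # 1.75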

Avoid averages if you can; it's very easy to make goo. I've argued that when we assess learning, we are trying to put a ruler to inherently qualitative information. By that I mean information that has many dimensions, and which we deal with routinely using our complex on-board wetware without thinking about it much. When we average, it's like melting down a bronze sculpture and weighing the slag.

If you're stuck with averages, don't take the meaning of the resulting goo too seriously. You'll very likely have a nice bell-curve distribution to work with, but don't imagine that the mean value equates to some real-world quality we name in common language, like intelligence, effective writing, or critical thinking. To make that kind of connection, one has to build the bridge from both directions: What is the common perception of a student's work? How does it relate to the goo? In my experience you can find reasonable correlations between the two, but even if the "critical thinking" score correlates highly with subjective, natural-language assessments of critical thinking, it is still just a correlation, not a measurement. As such it can be very useful, but we should be careful how we talk about it.