Higher Ed/: June 2009

Thursday, June 18, 2009

Assessment from the Faulty (sic) Perspective

Faculty members, like people, come in types. These various species are described in administrative handbooks kept locked away from prying eyes—the same tomes that carry the secret works of wesaysoism. This ancient practice is rumored to have started with Julius Caesar as he crossed the Rubicon and continues to this day in modern garb. Those who govern, tax, administer, accredit, and enforce abide by these proven precepts, modernized and abstracted as they have been by the likes of von Neumann and Nash in the last century.

According to dusty volumes, one faculty archetype is the curmudgeon, described as “thou who disdaineth” according to a heavily inked codex, one of the few works to escape the flames at Alexandria. Administrators spend a lot of time complaining about these robed and hooded stumbling stones—almost as much time as the faculty spends complaining about wesaysoism and the lack of parking spaces combined. This is all for good reason, as the established practice of the academy has an even older and hoarier history, stretching back to Plato and beyond. Thus does irresistible force come to bear upon the deeply philosophical and generally immovable objection. Perhaps no better contemporary example of this conflict exists than that caused by the mandate of accountability, which implies the ability to justify one’s practices by the outcomes they produce. To which the curmudgeon may look up from his brewing treatise on the transcendental ontology of post-feminist non-Euclidean hypersquares and say “well, we made you, didn’t we?”

For the representative of the administration, it is important to understand the perspective of these faculty members, so that great things may be accomplished. In particular, deep knowledge from the codex is essential to the Assessor, that species of administrator who is tasked with grinding documentation of learning outcomes into report sausage to be consumed by accreditors. It is essential to the Assessor’s mental health that he have good rapport with at least some of the teaching faculty, and this includes the ability to communicate effectively. Unfortunately, this is often hampered by conflicting motivations, expectations, and even vocabulary. Fortunately, the recently-recovered codex entitled “Assessment from a Faulty Perspective” (one presumes that a tired scribe misplaced the ‘c’) has elucidated some of these tensions. The topics forming the glossary found below were carefully chosen from the codex and updated for modern sensibilities. For example ‘Assessor’ now replaces ‘Inquisitor,’ but could equally well be Assessment Director or Institutional Effectiveness Director. Experts, using a rubric from the codex itself, have rated the translation 4.5 and guaranteed its validity. In layman’s terms we might consider the contents New and Improved. Without further explanation, we present:

GADOT. Government, Administration, Deans, Other Trouble. Any government or other regulatory body, university committee, task force, or bureaucracy that makes trouble for faculty. To the faculty this is generally a source of frustration, as in “waiting for GADOT”. To the assessor, this is the font of wesaysoism, the ultimate authority to be used wisely or not, but to get things done.

Academic freedom. Academic freedom is to faculty what statistical goo is to the Assessor—a universal way of making annoying problems from GADOT go away. Its effectiveness varies, however, and tends to weaken when used repeatedly.

Assessment. This word is almost never defined before being used, probably in the assumption that everyone knows what it means. In practice, some will use it interchangeably with ‘measurement’, and others to imply a fuzzier sort of observation. The very use of the word often distinguishes an act as a separate activity, as in “I was teaching, but now I’m doing assessment.” As a result, its use should be limited, except as an irritant. For example “have you done your assessment report?” is particularly good at annoying someone. This is because as the Assessor, you almost certainly would have known if it were finished, and so it’s a passive-aggressive way of saying “go do your assessment report.” Faculty will often use the word as an expletive, often combining it with other words to resemble actual bureau-speak, as in “That belated assessment expects it in by Friday!” The etymology of the word is interesting, having begun as an onomatopoeia for a sneeze (ah-SEZment), which was replaced by “achoo,” “hazie,” and other contemporary versions only recently.

In order to minimize the Assessor’s pain in what will be—quite frankly—a long period of suffering in assembling even rudimentary learning outcomes reports, the word ‘assessment’ should be only whispered if it needs to be used at all. The intent is to encourage faculty to think:

assessment = teaching + paying attention,

which can be easily remembered with the mnemonic “Auntie Theresa prefers artichokes.” Paying attention means setting clear expectations and then trying to figure out later if they were met. Commonly, this takes the form of writing down on slips of paper what the purposes of a course or curriculum are.

As a concrete example, Calculus I has a long list of content topics: limits, sketching slopes, computing derivatives, related rates problems, and so on. Paying attention in this case might mean that the teaching faculty coordinate to choose a textbook, syllabus, and final exam. At a minimum they discuss what the course should deliver. Ideally this would be written down somewhere in a list. After they’re finished, the Assessor can cleverly copy and paste this list into a document and write “learning objectives” across the top. Above that he or she can write “assessment plan.”

Of course, this is only half of paying attention—we might call it assessment foreplay. The other half takes place after at least some of the course has been delivered. Here the objective is to see if students performed better than soap dishes would have in a controlled study (some use boxes of rocks instead for comparison). At this point, a terrible migraine-inspiring fog can easily set in for the faculty, and the Assessor must be most adroit. To describe and understand the consummate act of assessment, we must be familiar with a few more terms of vocabulary, however. So we will first consider those, and return to the topic of “closing the loop” to rejoin Auntie Theresa.

Measurement. To the faculty, who are often highly educated and have knowledge of science and other practical experience, “measurement” connotes units and standards, universal methods of comparison, and a datum that relates the physical world to a number in some reproducible way. For example: “I measured the wine in the bottle, and it was only 740 ml.” To the assessment profession, it means an estimate of a parameter for some (generally unknown) probability distribution for some set of observations. To this parameter, the Assessor often assigns an imaginative name like “thinking skill.” Mereological fallacies such as this (considering a part as the whole phenomenon) are often the result of over-reaching “measurements.”

The common definition of measurement and the one used in assessment are so far apart that there are bound to be digestive problems on both ends of the conversation. It is therefore recommended that the Assessor only use this word in situations where the audience will not ask too many discerning questions—board meetings or legislative councils, for example. In those cases, it sounds better to say “we measured our success at 90%” than to say “we can say with utter statistical confidence that we may have made a difference.”

The real problem with thinking of assessment as measurement is that it implies that it is a scientific endeavor. The intrepid Assessor who strikes a path through the waters of empiricism will quickly find out what centuries of human doubt and superstition attest to: science is damned hard.

We might call measurement the “hard” assessment problem. It can be profitably used for modest assignments. For example, suppose a goal of a course is to teach students to swim. Counting how many laps students have swum without having to be hauled out by the lifeguard would be a useful number to know. We can call this a measurement without waking up with night sweats. The tip-off is that there are clear units we can assign: laps per student. This is a physically verifiable number. Similarly, attendance records, number of papers (or words) written, typing speed, and cranial circumference are all measurable in the usual sense. Note that these all have units.

Sometimes a commercial standardized test will be mentioned in the same sentence as measurement. Beware of such advertisements, as they are no longer talking about things that can be laid alongside a yardstick or hoisted on a scale or timed with a watch. The use of the word “measurement” in this case really should come with a long list of disclaimers like a drug ad for restless fidget syndrome—very likely to cause cramps when used in outcomes assessment. Although toxicity varies, and in moderation these instruments may be used for useful purposes, they are by and large producers of statistical goo, which we address next.

Statistical Goo. The topic of statistical goo is large enough for its own book, so we will confine ourselves here to the main applications. There is almost no data problem that statistical goo cannot fix, and the reader may want to consult Darrell Huff’s classic How to Lie with Statistics for a more comprehensive treatment. It is a double-sticky goo, this numerical melange, both useful and a hindrance in turn. First we shall see how it can be a problem for the Assessor.

Goo generally results when numbers are averaged together, although this by no means exhausts the methods by which to squeeze all life from a data set and leave an inert glob in its place. The goo most familiar to faculty members is the venerable grade point average. The fundamental characteristic of goo is that by itself it is only an informational glob, and not useful for Auntie Theresa’s assessment purposes. We have not yet got to the closing the loop topic, but this paying attention activity hinges on the usefulness of the data at hand. And goo is just not very handy for this purpose.

As an example of how faculty can easily run afoul of goo, consider the following tale. Following the Assessor’s prescription, Alice and Bob discuss course objectives and agree on a common final exam. Once the exams have been scored, they sit down again and compare results. Alice’s average is 67% and Bob’s is 75%. What do they learn from this?

Well, in a word: nothing. Clearly, the goo has got the best of them if this is the level of detail they are working with. Even looking at individual student test scores is little help. What’s really called for is a more detailed analysis of student performance on different types of problems, viz. those tied to the original objectives. Goo occludes detail like a milkshake on your windshield. The lesson for the Assessor is to think thrice before averaging anything. The assessment professional who takes goo seriously as useful data is likely to have a career that is nasty, brutish, and short.
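To make the point concrete, here is a minimal Python sketch of the per-objective breakdown suggested above. The objective names and every score are invented for illustration (they are chosen only so that the section averages come out to the 67% and 75% in the tale), and `scores_by_objective`, `section_average`, and `gaps_by_objective` are hypothetical names rather than part of any real tool.

```python
from statistics import mean

# Hypothetical per-objective results (percent correct) for two sections.
# All numbers are invented; they are picked so the overall averages match
# the 67% and 75% in the story.
scores_by_objective = {
    "Alice": {"limits": 80, "derivatives": 85, "related rates": 36},
    "Bob": {"limits": 78, "derivatives": 82, "related rates": 65},
}

def section_average(objective_scores):
    """The 'goo': one number per section, with all detail averaged away."""
    return mean(objective_scores.values())

def gaps_by_objective(a, b):
    """Difference (b minus a) on each objective the two sections share."""
    return {name: b[name] - a[name] for name in a if name in b}

alice = scores_by_objective["Alice"]
bob = scores_by_objective["Bob"]

print("Averages:", section_average(alice), section_average(bob))
for objective, gap in gaps_by_objective(alice, bob).items():
    print(f"{objective}: Alice {alice[objective]}%, Bob {bob[objective]}%, gap {gap:+d}")
```

In this made-up data the eight-point difference in the averages turns out to be almost entirely a related-rates difference, which is something Alice and Bob can actually talk about.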

On the other hand, goo can be useful to the Assessor when writing up reports that won’t actually be used by anyone. Top level administrators, accreditors, and the rest of GADOT generally don’t have a lot of time for details, so feeding them a diet of goo is a convenience. If the resulting graphs have lines going up, the Assessor can claim credit for this putative success. If, on the other hand, the graph heads down or meanders, the Assessor can point to this as an area that needs improvement, and therefore a triumph of his cleverly designed assessment process in identifying this “problem.” Goo is like WD-40 and duct tape combined—there’s really no argument you can’t make, given enough data.

Later on we’ll take up the more advanced topic of anti-goo.

Learning outcomes. This term is viewed with suspicion by faculty, who feel that they already know what learning outcomes are and that they see them every day in the classroom. Terrible confusion lurks here for the unguided faculty member, who tries to implement “assessment.” In the worst case, he or she will expend a lot of effort and produce little to show for it, and feel punished later when the Assessor breaks the news. The key idea is that the outcomes have to be observable in some way or another, and eventually the faculty have to sift through these observations for pearls of wisdom. The more specific and detailed the outcomes are, the easier this is, but also the more objectives there must be.

Specificity is attractive because learning is “local.” For example: you can’t teach someone to “do math;” that’s far too general. You can teach someone to factor quadratics if they have the right background. Without further resolution “math” is like statistical goo—a generalization that is useful in some circumstances but not others. The trick to making useful learning outcomes is to find ones that are general enough to permit a small enough number to manage, but not so general as to become too gooey to be of use.

The danger to the Assessor is surplus enthusiasm. In his drive to reach the top of the GADOT org chart, and perhaps awakening a latent case of positivism, he may turn the identification of learning outcomes into a behemoth of bureaucracy, creating “alignments” and levels of competency and purchase expensive data systems to instantiate this educational cartography in inexorable logic. There is a certain attraction to this sort of thing—and it’s probably related to the same overreaching that leads to dreams of measurement for things that can’t be assigned units in the usual sense of length or weight standardization. With patience and a proper diet, the Assessor can overcome this condition. As someone once said: “theory and practice are the same in theory but not in practice.”

Alignment. Having broached this topic in the previous entry, we should address the topic of curriculum alignment properly. This is almost always evidenced by a so-called matrix—a grid with courses along one side and learning outcomes along the perpendicular. Creating the matrix may be referred to as curriculum mapping. On the surface, this seems like a puzzling idea. After all, if a learning outcome is assigned to one course, why should it appear in another course? Is that not a simple admission of failure? And if the learning outcome represents a more advanced level in one course than the other, then isn’t it qualitatively different and deserving of a new name? These are some of the paradoxes that can confront the unwary Assessor who tries to explain curriculum alignment to the faculty. As usual, these matters are best explored with simple examples.

A simple outcome like, say, “factoring quadratics” clearly should not appear as a learning objective in more than one sequence without some explanation. If the math curriculum contained this objective in every major course, it would be terribly inefficient.

On the other hand, “analytical thinking” might well appear in every course in the math curriculum, just like “effective writing” might appear in every course in History. The difference is the generality of the objective, of course. The distinction between specific and general objectives is a very important one, because they have to be treated differently in both assessment and in closing the loop.
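As a rough illustration, the matrix can be sketched in Python as a mapping from courses to the outcomes they claim. The course names, the outcome lists, and the labeling of which outcomes count as general are all hypothetical; the sketch only flags specific outcomes that show up in more than one course, which is the case that calls for some explanation.

```python
from collections import defaultdict

# Hypothetical curriculum map: course -> learning outcomes it claims.
curriculum_map = {
    "College Algebra": {"factoring quadratics", "analytical thinking"},
    "Calculus I": {"computing derivatives", "analytical thinking"},
    "Calculus II": {"factoring quadratics", "analytical thinking"},
}

# Assumption: someone has already labeled which outcomes are general.
general_outcomes = {"analytical thinking"}

def courses_by_outcome(cmap):
    """Flip the matrix: outcome -> sorted list of courses that claim it."""
    flipped = defaultdict(list)
    for course, outcomes in cmap.items():
        for outcome in outcomes:
            flipped[outcome].append(course)
    return {outcome: sorted(courses) for outcome, courses in flipped.items()}

def repeated_specific_outcomes(cmap, general):
    """Specific outcomes claimed by more than one course."""
    return {
        outcome: courses
        for outcome, courses in courses_by_outcome(cmap).items()
        if outcome not in general and len(courses) > 1
    }

print(repeated_specific_outcomes(curriculum_map, general_outcomes))
```

The general outcome is left alone even though it appears in every course; only the repeated specific outcome is reported for a second look.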

Rubric. To the faculty member, a rubric is often considered an accoutrement to a course, or perhaps a spandrel—an architectural detail that supports no structure, but is at best merely decorative. To the Assessor it is the sine qua non of classroom assessment—the results to be pored over to divine the minds of a mass of learners. At worst, rubrics become just another kind of grading. At best, they guide the intelligent dialogue that is the engine of change.

Dialogue. This is a better word than “meeting,” which will cause half of your audience to sneak quietly out to the nearest bar, and the other half—who love meetings—to sharpen their digressions and sidle closer. Whatever the name of it, focused discussion among peers with a common interest is the best assessment tool in the Assessor’s kit. As it pertains to the previous topic, dialogue can help identify objectives that are too general. The crux of the matter is that discussion among experts creates a social engine for intelligent change. Fostering this dialogue is the primary goal of the Assessor.

Hard and soft assessment. We are now ready to return to the topic of assessment with a better perspective, to consider what it is that “paying attention” means in two contexts: hard assessment (which we can sometimes legitimately call measurement) or soft assessment (which is subjective and prospers through dialogue).

We should acknowledge that everyone does soft assessment all the time. Sentiments like “he seems sad today,” or “Jonah gets seasick,” or “that waiter was mean to me” are all soft assessments—even though no one calls them that—which convey meaning because of the wonders of language. That is, even for abstract or difficult concepts like love, hate, intelligence, and happiness, these words communicate real information to the listener. This can only be possible because there is enough reliability (agreement among observers) for information to be communicated. Moreover, this reliability suggests that there is some connection between concepts and reality. This can be interesting and complicated, as in “Paris Hilton is famous,” which is only true because we agree it’s true. But for purposes of this discussion, let’s make the leap of faith that if we observe that “Stanislav is a poor reader,” and we get some agreement among other observers, that this is a valid observation. It is, after all, this sort of informal assessment that allows society to function.

This approach has some almost miraculous advantages. First, the Assessor doesn’t have to explain the method—faculty already do it. If you asked them to list their top students, a group of professors can generally sketch out who should be on that list—provided that they all have experience with the students. If you’ve ever witnessed a whole department trying to grant the “best” student an award, and watched faculty members try to describe how good a student is to someone who hasn’t had direct experience, you’ll understand the paucity of our ability to communicate the rich observations that we assimilate.

Soft assessment is the primary means of closing Auntie Theresa’s loop, which we will come to in a moment.

Hard assessment could be measuring things—counting, weighing, timing, and so forth. As we’ve seen in the section on measurement, much of what passes for measurement is really just a sterilized type of observation, which falls far short of measurement, as it’s commonly thought of, but simultaneously lacks the richness of direct personal observation. But even if we don’t call it measurement, we can call it hard assessment to reflect the fact that it is difficult, brittle, and inflexible. It still can be useful, however. If we think of hard assessment for the moment as a standardized test or survey, we can use such instruments for general or specific learning outcomes. The key is the complexity of what’s being tested or surveyed. Relating the results back to the classroom is generally the hard part for hard assessment. Suppose you know how many books students check out on average from the library—an indirect measure of literacy, perhaps. How do you improve your teaching based on that? This problem is common, and a good example of:

Zza’s Paradox: We can assess generally, but we can only teach specifics.

There is an inherent asymmetry between teaching and assessing—we can observe far more kinds of things than we can change.

Anti-goo. A type of hard assessment is a subjective survey about student performance--just ask faculty in an organized way what they think of the thinking and communications skills of their students, for example. Rather than reporting these out as averages, use frequencies and maximums and minimums. If averaging creates goo, these simpler statistical data are the anti-goo.
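A small Python sketch of what such a summary might look like follows. The ratings and the 1 to 4 scale are invented for illustration, and `anti_goo` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical faculty ratings of one skill on a 1-4 scale (invented data).
ratings = [1, 1, 2, 4, 4, 4, 1, 3, 4, 1]

def anti_goo(values):
    """Summarize without averaging: counts per rating, plus the extremes."""
    counts = Counter(values)
    return {
        "frequencies": dict(sorted(counts.items())),
        "minimum": min(values),
        "maximum": max(values),
    }

summary = anti_goo(ratings)
print(summary)
```

The average of these invented ratings is 2.5, a score no student actually received; the frequencies show most students sitting at one end of the scale or the other.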

Grades. To faculty, grades are a necessary evil to make GADOT happy, but one with a long tradition, and now accepted as a given. To Assessors, grades are useless, unless finding validating statistics with a new standardized test, in which case correlations with grades are printed on the marketing material.

Reliability and validity. Some faculty members may use these words to annoy assessors by pointing out the flaws in designs of tests and surveys. Often assessors will use the same terms as a stick to advertise some standardized instrument, for which these words are the equivalent of “new and improved” for consumer products. Except for the simplest objects of study, reliability almost always comes at a cost in validity. You can train raters to call a dog a cat and get high reliability, but what about those who aren’t in on the secret? How will they know that dogs are now cats? They don’t, in fact, so the new private language has limited utility. Faculty members, being intelligent people, intuit that we get along just fine in normal language without any attempts to improve inter-rater reliability, so will view suspiciously any attempts to subvert their onboard judgments with agree-to-agree agreements.

Ratings. Faculty look at average scores and shrug. What good does it do to know that sophomores are .2 points better at “subversive thinking” than freshmen? To the Assessor this is often important, so he or she can create graphs with positive slopes to highlight on reports. It makes him look good. But ratings can be reported out usefully—just avoid averages. Use proportions instead. Say two-thirds of Tatianna’s results were considered remedial, rather than that she scored 1.7 relative to a group average of 2.2.
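As a sketch of the difference, the Python below reports the same ratings both ways. The three ratings, the scale, and the convention that a rating of 1 means "remedial" are assumptions made up for this example; they are chosen only so the numbers line up with the 1.7 and two-thirds mentioned above.

```python
from fractions import Fraction

# Hypothetical rubric convention: a rating of 1 is labeled "remedial".
REMEDIAL = 1

def proportion_at_level(ratings, level):
    """Exact share of ratings equal to the given level."""
    return Fraction(sum(1 for r in ratings if r == level), len(ratings))

# Invented ratings for one student from three raters.
tatianna = [1, 1, 3]

average = sum(tatianna) / len(tatianna)
share_remedial = proportion_at_level(tatianna, REMEDIAL)

print(f"average: {average:.1f}")
print(f"rated remedial: {share_remedial}")
```

The proportion says something a reader can act on, while the average would look the same for many quite different sets of ratings.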

Closing the loop. To faculty, this means scrambling at the end of the year to put together reports that will make the annoying emails from administrators stop piling up in their inbox. To Assessors, closing the loop is the holy grail—a chalice of continual improvement, as it were. In her haste to nudge the faculty in the right direction, the Assessor will often emphasize the wrong things, such as rubrics and ratings, and hard assessments, and try to push the faculty into a mindset of a pickle factory trying to increase output. This is annoying to the faculty and ultimately self-defeating. The result the Assessor really wants is evidence of change. The assessment data will almost always be fuzzy; there will very rarely be a headline screaming across the IR report: DO THIS! Here’s the step-by-step process:

1. Get faculty to sit down together and talk about the learning objectives. Call them goals, so as not to arouse their suspicions, and have plenty of alcohol on hand.

2. Provide someone to take minutes if you can, or be there yourself. In any event, someone needs to record the salient bits.

3. For each goal, dig out and discuss any information—opinions, ratings, grades, or other evidence—that bears on the goal. Let the hive mind do its job and come up with suggestions. Some won’t be very practical, but don’t worry about that.

4. Sit with the chair or program director and see if you can get agreement on a few items for action. Most likely the faculty will have decided this themselves.

5. The hard part will always be to get them to write this stuff down, even when they do it successfully. Provide all the support you can there for the lazier groups.

This works for a model of continuous improvement. If you are stuck with a ‘minimum standards’ model, where ‘measurement’ is taken seriously, then all is probably lost. Get a prescription for Xanax and listen to soft jazz while you read the want ads.

Conclusion. It’s really in everyone’s best interests if the faculty and the Assessor are not at war with one another, and deploying the nukes of wesaysoism and academic freedom. There is a moderate path that creates the least pain all around, and as a bonus has a good chance of improving teaching and learning.

[Note: This was originally written as a chapter in a book, but ended up here instead. I never did final editing on it, and didn't add hyperlinks.]

Wednesday, June 17, 2009

Universities and Economic Growth

Strong well-reasoned opinions are always welcome. On my current lean diet of blogging time I have only the opportunity to pass the link on to you, however. I'll add an update later if I get my assignments finished. Here's Phil Greenspun of MIT on the topic du jour: Universities and Economic Growth.

As a bonus, here's a concept I stumbled upon last night while writing. It surely is the source of a lot of confusion in "closing the loop."

Zza’s Paradox-
We can assess generally, but we can only teach specifics.

Tuesday, June 16, 2009

Summer Plans

My blogging will be hit or miss for the next month, as I take the kid to Germany to stay with the grandparents for a while. Before leaving, I have a book chapter to finish, which may consume most of my blogging time. I had thought I'd try to find an assessment director or equivalent at University of Cologne to interview about the Bologna process while I'm there. We'll see if that pans out.

I blogged a while back about the Dunning-Kruger Effect, and I'm still mulling it over. The original paper is here, and a video for popular consumption can be found here. Some questions are:

How does this affect the results of knowledge surveys?
Given the findings, there seems to be an absolute characteristic of expertise. That is, at some critical point in knowledge level, one can accurately self-assess. What would allow us to predict what that level is ahead of time?
Can we teach self-assessment as a skill separate from the domain of expertise?

Just thinking about the commute, it would be great if every driver were given a realistic assessment of his or her ability to negotiate the highways!

Sunday, June 14, 2009

Solving Complex Problems

We've encountered the observation before that solving really hard problems has to rely on an evolutionary trial and error approach. We might describe this as active inductive reasoning. In general, induction is just looking for patterns in data and formalizing them. The 'active' part means we're continually looking for more data. What actually happens, though, is not so straightforward, and bears deeper inspection. For me, this is the crux of the "critical thinking" assessment debate, which anyone who reads this blog knows I've tried to refactor into more easily identifiable patterns of thought: deductive and inductive.

The cause for this iteration of that line of thought is two-fold: an article in the New York Times called "The Case for Working with Your Hands," from May 21, and a trip I took yesterday to Seagrove, North Carolina. In the article, Matthew B. Crawford describes how he left a high-paying career based on academic credentials and began fixing motorcycles full time. There are some beautiful descriptions of the inductive/deductive synthesis in his problem solving:

In fixing motorcycles you come up with several imagined trains of cause and effect for manifest symptoms, and you judge their likelihood before tearing anything down. This imagining relies on a mental library that you develop.

The possible cause and effect relationships are deductive rules that the successful mechanic accumulates in his mental library. But these rules are not enough to solve problems. These comprise only the theory. The imagining is the inductive process--positing solutions to test against the known facts and theory. But this is too simple a description, because there is cost involved:

There is always a risk of introducing new complications when working on old motorcycles, and this enters the diagnostic logic. Measured in likelihood of screw-ups, the cost is not identical for all avenues of inquiry when deciding which hypothesis to pursue. Imagine you’re trying to figure out why a bike won’t start. The fasteners holding the engine covers on 1970s-era Hondas are Phillips head, and they are almost always rounded out and corroded. Do you really want to check the condition of the starter clutch if each of eight screws will need to be drilled out and extracted, risking damage to the engine case?

What might seem in theory like a good line of inquiry may be expensive to carry out in practice. So over the whole deductive/inductive process operates an economic decision-making process that must weigh possible costs against probabilities of success. Because of the uniqueness of the situation, this may be guesswork--loose "deductive" rules we may as well call heuristics.

Theory is a powerful drug, however (witness the wars of ideology that continue into this century). The good problem solver knows when to ignore it, which according to the author is most of the time:

There probably aren’t many jobs that can be reduced to rule-following and still be done well. But in many jobs there is an attempt to do just this, and the perversity of it may go unnoticed by those who design the work process. Mechanics face something like this problem in the factory service manuals that we use. These manuals tell you to be systematic in eliminating variables, presenting an idealized image of diagnostic work. But they never take into account the risks of working on old machines. So you put the manual away and consider the facts before you.

The article itself is not a treatise on problem-solving, despite the quotes above. Crawford's thesis is that jobs like fixing motorcycles are no less mentally demanding than those of accountants and lawyers. In his words (taken from different paragraphs):

When we praise people who do work that is straightforwardly useful, the praise often betrays an assumption that they had no other options.

A gifted young person who chooses to become a mechanic rather than to accumulate academic credentials is viewed as eccentric, if not self-destructive.

The trades suffer from low prestige, and I believe this is based on a simple mistake. Because the work is dirty, many people assume it is also stupid.

Dr. Crawford comes to this line of thought after finishing a Ph.D. in political philosophy at the University of Chicago and working as an executive director in a Washington think tank. Compare the problem solving process he encountered there:

It sometimes required me to reason backward, from desired conclusion to suitable premise. The organization had taken certain positions, and there were some facts it was more fond of than others. As its figurehead, I was making arguments I didn’t fully buy myself. Further, my boss seemed intent on retraining me according to a certain cognitive style — that of the corporate world, from which he had recently come. This style demanded that I project an image of rationality but not indulge too much in actual reasoning.

Later in the article he reflects on this distinction in thinking styles (emphasis added):

[M]echanical work has required me to cultivate different intellectual habits. Further, habits of mind have an ethical dimension that we don’t often think about. Good diagnosis requires attentiveness to the machine, almost a conversation with it, rather than assertiveness, as in the position papers produced on K Street. Cognitive psychologists speak of “metacognition,” which is the activity of stepping back and thinking about your own thinking. It is what you do when you stop for a moment in your pursuit of a solution, and wonder whether your understanding of the problem is adequate.

I think it would be an easy shot to wish that if only our captains of finance had taken this step back into metacognition..., but that would advertise a higher level of understanding about our current woes than I possess. Crawford's exposition is too graceful for that. The sense I get from reading the article, however, is apposite to both motorcycles and derivatives: complex problems are humbling, and if approached on their own terms (rather than using an ideological end-around, for example), cause us to grow through struggles within ourselves. This is the meta- part of meta-cognition. Crawford makes the fascinating observation that many professions--highly paid ones--do not easily allow for this sort of growth. He highlights the "moral trap" of the archetypal middle manager on page four.

An example of the anesthetic properties of theory over actual problem-solving is given in a Kafkaesque description of a job he held writing abstracts for academic journals (emphasis added):

My job was structured on the supposition that in writing an abstract of an article there is a method that merely needs to be applied, and that this can be done without understanding the text. [...] [I]t became clear [my instructor] was in a position similar to that of a veteran Soviet bureaucrat who must work on two levels at once: reality and official ideology.

Doing things by rote can solve some complicated problems, as computers routinely demonstrate. But of course in that case, the 'rote' solution can be quite complex, requiring millions of bytes of code. Simple prescriptions for complex problems are unfortunately common--any fixed ideology is a likely suspect--and seem to cause the kind of logical contortions described in the quote above.

I was reminded of the Times article because of a trip I took yesterday, to Seagrove, North Carolina. It's a unique place, a center for handmade pottery. I bought the piece shown below for $30 from William and Pamela Kennedy of Uwharrie Crystalline Pottery.

Mr. Kennedy told me that the technique he uses involves seeding the glaze with zinc, which begins to crystallize as it cools. Essentially, this re-creates in a kiln a geological process that might happen with rock far below the surface or near volcanism. At a different place, called Dirt Works, I bought a shallow bowl in the style shown below.

The master potter, Dan Triece, was kind enough to let us watch him work. His most common question is "how long does it take to make a piece?" His answer: about 3 years and fourteen minutes. You can figure out what he meant.*

Obviously there are problem-solving connections between the zen of motorcycle maintenance and hand crafting bits of clay into art. They are different problems, of course, but the differences serve to highlight the role of inductive reasoning.

With motorcycles, there are obvious empirical tests that have to be met. Does the thing run when I'm finished with it? Is it dripping oil? For the potter, physical reality also intrudes. Kilns affect the clay differently, depending on whether they are electric, gas-fired, or heated with wood. There is the chemistry and physics of glaze and high temperatures to reckon with, all of which may thwart the desired outcome. But it seems to me that beyond these physical limitations there are more possibilities for a distinct and obvious style to emerge with pottery than with motorcycle maintenance. This may be a blessing or a curse, but it seems to me that the inductive process becomes applied to solving problems in one's own mind in the artistic process. We might call this the "do I like it?" loop, which gets tested over and over again for the artist.

The underlying complexity of the physical problems to be solved PLUS whatever potential there is for creative expression together represent the challenge for the expert in any field, whether motorcycles or pottery or mythology or politics. In addition, there are the social complexities, like the one Dr. Crawford mentions, and for the artist the problem of finding an audience. These act like physical constraints in some ways--there are causes and effects--but maybe only in a Dr. Seuss kind of way (you may puzzle until your puzzler is sore...).

What's the point? Shouldn't we have this sort of discussion with every student at a liberal arts institution? Shouldn't part of the curriculum include ideas about possible futures--not in the "what do you want to be when you grow up?" frame, but one informed by the psychology of happiness, the life experiences of experts in all sorts of fields (fixing engines included), an exploration of the styles of thought that resonate with the student (deductive, inductive, creative?), and at least some of the attempts of western civilization to answer the question "why am I here at all?"

If anything, the Times article ought to convince an educator that self-reflection is absolutely key to intellectual development. Take one more step, and we might conclude that we ought to be assessing thinking skills through reflection on the part of the student, and teaching that as a skill for its own sake. But there's more than that too. Developing one's intellectual maturity doesn't occur in isolation--it strikes me that the account given by Dr. Crawford illuminates what we might call an aesthetic maturity, finding for oneself a place in the grand scheme. Perhaps this idea has too much baggage to get through a faculty senate that isn't populated with existentialists, but what do we gain by graduating nominally stellar thinkers who are unhappy with their lives? I think Dr. Crawford's implication is right on: preparing for a place in the "information economy" is not the only reason to seek higher education.

*Three years to learn, after which it only takes fourteen minutes to 'throw' the pot on the wheel.

Friday, June 12, 2009

Closing the Loop (comic)

Photo credit (first and third panels) ldanderen, graphs from Wolfram Alpha (second panel): Esther Dyson via Flickr
This comic may be distributed under Creative Commons.

Thursday, June 11, 2009

Meeting Salad

The other day I mentioned that I was trying to bring some order to meetings with a new form I'd created. This all started from "meeting salad." If the image below of papers randomly drawn from my briefcase seems familiar, then you too have produced your share of the stuff.

The name "meeting salad" is one I drew from Thomas Pynchon's The Crying of Lot 49. Here's the quote, describing what was swept out of "Mucho's" car (emphasis added):

[Y]ou had to look at the actual residue of these lives, and there was no way of telling what things had been truly refused [...] and what had simply (perhaps tragically) been lost: clipped coupons promising savings of 5 or 10 cents, trading stamps, pink flyers advertising specials at the markets, butts, tooth-shy combs, rags of old underwear or dresses [...], all the bits and pieces coated uniformly, like a salad of despair [...].

The description seems apt. I had googled "meeting notes" to see if someone had published an open-source solution, and found this as the top link:

A common bad habit I have come across with managers and executives in recent years is the accumulation of unprocessed meeting notes. It is heartbreaking to see so much effort go into the creation of meetings and the capturing of what goes on, and the stress created and value lost from irresponsible management of the results. At least 80 percent of the professionals I work with have pockets of unprocessed meeting notes nested away in spiral notebooks, folders, drawers and piles of papers.

So it seems that meeting salad is being produced in mass quantities. The site's recommendation was basically to have more self-discipline and to go through the notes once in a while. I need more than that. Using the form as a start, I will build a little database and form application on top of it, so that my project coordinator or other assistant can enter the data. The beauty of this is that each little leaf of meeting salad will be tagged with a destination, like IT or Assessment Committee. Then, when I'm going to meet with some person or group I can simply query the database for unresolved items and have fodder for a meeting agenda. Right now I'm just carrying the notebook around with me everywhere, but when there are many pages it's not convenient to scan through looking for Earthquake Prevention Committee or whatever.
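Here is a rough sketch of what that little database might look like. It is only an illustration under my own assumptions, not the finished application: the table layout, the function names (open_db, add_item, resolve_item, agenda_for), and the sample items are all made up for the example, and it uses Python's built-in sqlite3 module simply because that needs no setup.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the meeting-salad database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               id          INTEGER PRIMARY KEY,
               destination TEXT NOT NULL,              -- e.g. IT, Assessment Committee
               description TEXT NOT NULL,
               resolved    INTEGER NOT NULL DEFAULT 0  -- 0 = still open
           )"""
    )
    return conn


def add_item(conn, destination, description):
    """File one leaf of meeting salad under its destination; return its id."""
    cur = conn.execute(
        "INSERT INTO items (destination, description) VALUES (?, ?)",
        (destination, description),
    )
    conn.commit()
    return cur.lastrowid


def resolve_item(conn, item_id):
    """Mark an item as dealt with so it drops off future agendas."""
    conn.execute("UPDATE items SET resolved = 1 WHERE id = ?", (item_id,))
    conn.commit()


def agenda_for(conn, destination):
    """Return the unresolved items for one person or group, oldest first."""
    rows = conn.execute(
        "SELECT description FROM items "
        "WHERE destination = ? AND resolved = 0 ORDER BY id",
        (destination,),
    )
    return [row[0] for row in rows]


# Made-up sample items, just to show the query at work.
conn = open_db()
first = add_item(conn, "IT", "Ask about the backup schedule")
add_item(conn, "Assessment Committee", "Circulate draft rubric")
add_item(conn, "IT", "Request a test server")
resolve_item(conn, first)
print(agenda_for(conn, "IT"))  # prints ['Request a test server']
```

The one design choice worth noting is that the agenda is nothing more than a filter on destination and the resolved flag, so a web form or a phone browser only has to ask that one question.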

Once it's built I'll post it at meetingsalad.com, and you can use it too, if you want. I bought the domain name yesterday for a sawbuck, but there's absolutely nothing there yet. If you have comments or suggestions, I'd love to hear them.

On a related note, organizing email is an equal pain. I know that google is coming out with their Wave thing to replace email soon, and maybe that will help (if you haven't heard of it, you've been under a rock). But this morning I stumbled upon Xobni--a tool that integrates with MS Outlook, which I use for email. The problem with traditional email, Outlook included, is that it produces what we may as well call "email salad" -- heaps and heaps of data that are only loosely organized. Yes, you can sort and you can search, but I often find myself dreading sifting through two weeks of messages to find the attachment I was looking for. (Of course, we should be archiving docs with something like openIGOR, and my group does that, but not everyone does.)

Xobni is a free tool that looks like it can help with email salad by providing more intelligence about correspondence. Primarily this is done by focusing on individuals and networks as the main unit of analysis, rather than an individual email. You can see screenshots at their site. It's very pretty and installed cleanly on my machine (although the install dialog was hidden behind another window, which caused me to think it was frozen for a while). The image below (from their website) hints at some of the features, including integration with social networking sites, threaded conversations, message statistics, and easily browsed attachments by person.

Tuesday, June 09, 2009

Clemson Fiasco

It's hard to avoid the buzz in higher ed circles about the supposed "gaming" of the US News college rankings by Clemson University. This concerns an IR conference, where according to InsideHigherEd:

A presentation by Catherine Watt, the former institutional researcher and now a staff member at Clemson University, laid bare in a way that is usually left to the imagination the steps that Clemson has (rather brazenly) taken since 2001 to move from 38th to 22nd in U.S. News's ranking of public research universities.

It's no secret that reputation, as rated by the mutual opinions of college administrators, is part of the stew of variables that US News uses in cooking up its menu of Top 10 Universities. The controversy comes from a particular remark by Ms. Watt about these (quoting again from the article):

And to actual gasps from some members of the audience, Watt said that Clemson officials, in filling out the reputational survey form for presidents, rate "all programs other than Clemson below average," to make the university look better. "And I'm confident my president is not the only one who does that," Watt said.

I think there is a fair amount of schadenfreude evident in the remarks and "analysis" that follow this denouement, but I find the attention paid to it a bit bizarre. Maybe it's that colleges are supposed to compete on the basketball court, but not for rankings. In my mind, they're exactly the same thing: a more or less arbitrary set of rules and a payoff for winning (scoring high). A coach that didn't take advantage of time-outs to advance his or her team, for example, wouldn't last long. Why is the sentiment different when it's no longer sports, but US News rankings? Shouldn't universities compete aggressively for grants? Sure. Do the best tuition leveraging to compete for the "best" students? Sure. Rate down their competition on a subjective survey so they look better? Why not--it's the reasonable thing to do.

If US News wants to pass off their statistical goo of average SATs, endowment size, etc., as something meaningful, that's fine. The result is the fiasco--a word my wife tells me originally meant an ill-formed piece of pottery that was then broken. If US News wants to pass off this fiasco as a genuine "hard assessment" of institutions, then institutions ought to be applauded for competing as hard as they can in this artificial arena. Of course, the unfortunate IR director gets caught in the middle of this, because ethical standards will not allow the publication of outright lies.

I have a solution that will make everyone happy. Instead of sending the US News survey to the IR office, send it to the PR office. Let the staffers there apply the same subjectivity that applies to the "reputation" scores to the rest of the data. Fill in the blanks however feels right, in light of the image of the university that is to be cast by this--let's face it--PR survey. To pressure IR staff to do this is unethical and unnecessary. After all, if you choose a student to highlight on a big billboard beside the interstate, does the PR office take great pains to select one randomly, that best represents the student body (at least in a stochastic sense)? Of course not!

It probably serves as a good reminder that games like this are played all the time, and the sort admitted to by the Clemson staffer is particularly harmless. I'm currently reading a book by Anne Applebaum called simply Gulag, which is a history of that Soviet institution. One section deals with survival, and describes an escape attempt from a remote work camp. Two thieves were conspiring to break out, but had no food to sustain them on their long walk. Their solution? Invite a third man along.

Monday, June 08, 2009

Noncognitives as Indicators of Success

As Institutional Research Director at Coker College some years ago, I was involved in creating a predicted GPA matrix, which was used to set merit awards in our financial aid leveraging process. I noticed that the amount of variance explained by the traditional measures of high school GPA and SAT (or ACT equivalent) was not great--perhaps 30% in the linear model using both variables. I tried using other data at hand, and found that these two variables were about the best usable ones at our disposal. Unusable ones included sex and income.

This last was troubling because it was clear that students from a low socio-economic group got lower grades and standardized test scores on average, and hence received less merit aid. I began to wonder if it could really be true that these students were less able to do college work. Clearly some of them could do quite well--even at the very bottom of our acceptance pool, rated by predicted GPA, about half the students would do quite well academically. This was the genesis of the idea to try to find ways to identify the actual potential of these poorly-rated students.

There are two advantages to this idea. The first is that it's fairer to students--rather than the usual (implicit) practice of adding merit aid on top of high family incomes, a closer inspection can even out the aid (and accepted applications, for that matter). Secondly, it's beneficial to the institution. This is because many institutions compete for high-GPA, high-SAT students--they are visibly "high potential" applicants, despite the 70% of variance in actual performance that is unexplained by these predictors. By developing better predictors, it would be possible to partially avoid the bidding war that drives up discount rates and puts the cost of running a university disproportionately on low-predicted students (who, as we've noted, are on average of lower socio-economic status to begin with).

It's somewhat curious that, given this win-win advantage to better predictors, there is not wider acknowledgment that the usual cognitive predictors (GPA, SAT, class rank, etc.) are not sufficient. It's doubly odd since the College Board itself makes no secret of the limitations of the SAT, for example. This applies in general as well as to particular demographics. From the 2008 College Board Research Report Differential Validity and Prediction of the SAT: "The results for race/ethnicity show that for the individual SAT sections, the SAT is most predictive for white students, with correlations ranging from 0.46 to 0.51. The SAT appears less predictive for underrepresented groups, in general, with correlations ranging from 0.40 to 0.46." Those correlations are really only useful as an adjunct to high school grades (and perhaps rank), so one has to take them with a grain of salt. Even so, squaring these to get r² yields 20% or so of the variance in first-year college grades. This is hardly enough to justify the confidence that the general public and even higher education administrators seem to have in the test. I imagine that college ratings systems like U.S. News accentuate this distortion.
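For anyone who wants to check the squaring, here is a small sketch of the arithmetic. The correlations are the ones quoted from the College Board report above; the function name variance_explained and the labels are mine, and the calculation treats each SAT section as a single predictor taken by itself, which is a simplification.

```python
# Correlations quoted from the College Board report above; labels are my own.
correlations = {
    "white students, low end": 0.46,
    "white students, high end": 0.51,
    "underrepresented groups, low end": 0.40,
    "underrepresented groups, high end": 0.46,
}


def variance_explained(r):
    """Square a correlation to get the share of variance one predictor accounts for."""
    return r * r


for label, r in correlations.items():
    print(f"{label}: r = {r:.2f}, r squared = {variance_explained(r):.0%}")
```

The squares come out between roughly 16% and 26%, which is where the "20% or so" figure comes from.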

We have therefore established that there is great potential for indicators other than the usual cognitive ones to chip away at the 70% or so of variance in performance that remains unexplained. Why noncognitives?

Recently, results from an unscientific but interesting faculty poll were announced in a staff meeting at my university. The faculty members had been asked what qualities they would like to see in students. As the deans read their lists, I categorized them for my own amusement and found that two thirds of the traits listed were things like "works hard," "is interested in the subject," and "takes studies seriously." In other words, descriptions of noncognitives--attitudes and behaviors other than grades and test scores. It had long been clear to me by that point that we needed better insight into those qualities of our students, but I was surprised to see how much research had been done on it already.

Even the big standardized test companies are becoming interested in noncognitives. In 2008 ETS announced that it would begin testing the use of noncognitives for augmenting the GRE with a "Personal Potential Index". In this case, this means using standardized letters of recommendation from advisors and professors. The College Board has funded a project (GRANT) at Michigan State University to study noncognitives.

Below are a few selected sources from the literature on noncognitives.

The University of Chicago Chronicle, Jan. 8, 2004, Vol. 23 No 7:

Heckman, the Henry Schultz Distinguished Service Professor in Economics and one of the world’s leading figures in the study of human capital policy, has found that programs that encourage non-cognitive skills effectively promote long-term success for participants.

Although current policies, particularly those related to school reform, put heavy emphasis on test scores, practical experience and academic research show that non-cognitive skills also lead to achievement.

"Numerous instances can be cited of people with high IQs who fail to achieve success in life because they lacked self-discipline and of people with low IQs who succeeded by virtue of persistence, reliability and self-discipline," Heckman writes in the forthcoming book, Inequality in America: What Role for Human Capital Policies? (Massachusetts Institute of Technology Press), which he co-authored with Alan Krueger.

Noncognitive predictors of academic performance: Going beyond the traditional measures, by Susan DeAngelis in Journal of Allied Health, Spring 2003.

The purpose of this study was to provide an initial investigation into the potential use of the PSI as a noncognitive predictive measure of academic success. As programs continue to experience demands by allied health professions for their graduates, admissions committees should employ highly predictive and valid criteria to select the most qualified applicants. Although it is impossible to select only candidates who are ultimately successful, admissions decisions must be based on the best data available to reduce the risk of attrition and to increase the number of newly licensed graduates entering the profession. The preliminary findings indicate that the PSI moderately enhanced the predictive capacity of the traditional cognitive measures of entering GPA and ACT score.

Probably the best source for the practical use of noncognitives in higher education, and a rich source of literature on the topic, is William E. Sedlacek's Beyond the Big Test: Noncognitive Assessment in Higher Education. The book describes itself thus:

William E. Sedlacek--one of the nation's leading authorities on the topic of noncognitive assessment--challenges the use of the SAT and other standardized tests as the sole assessment tool for college and university admissions. In Beyond the Big Test, Sedlacek presents a noncognitive assessment method that can be used in concert with the standardized tests. This assessment measures what students know by evaluating what they can do and how they deal with a wide range of problems in different contexts. Beyond the Big Test is filled with examples of assessment tools and illustrative case studies that clearly show how educators have used this innovative method to:

• Select a class diverse on dimensions of race, gender, and culture in a practical, legal, and ethical way

• Teach a diverse class employing techniques that reach all students

• Counsel and advise students in ways that consider their culture, race, and gender

• Award financial aid to students with potential who do not necessarily have the highest grades and test scores

• Assess the readiness of an institution to educate and provide services for a diverse student body

Sedlacek identifies eight dimensions of interest:

1. Positive self-concept
2. Realistic self-appraisal
3. Successfully handling the system
4. Preference for long-term goals
5. Availability of strong support person
6. Leadership experience
7. Community involvement
8. Knowledge acquired in a field

He also gives research findings and processes for reviewing applications, for example, to identify and rate these traits.

Research like the sources cited above shows promise in the use of noncognitive predictors. And there is clearly a need for better predictors of student success. Nevertheless, we must be careful not to underestimate the difficulties or imagine that "noncognitives" are a panacea. Work by Dr. Sedlacek and others points to modest gains in predictive power based on surveys and other methods of gathering information about attitudes and behaviors. In his paper with Julie R. Ancis "Predicting the Academic Achievement of Female Students Using the SAT and Noncognitive Variables," the noncognitive instrument only predicted a handful of percentage points in the variance of GPA, a result that has been indicated by others as well.

Additional complications are introduced if the methods of assessing noncognitives are standardized and published. Even tests of cognitive skills are gamed, as with SAT preparation tests that teach not just content, but also test-taking strategies. It's easy to imagine how much easier it would be for an applicant to a competitive institution to "game" a formalized noncognitive survey. This imposes constraints on how we can approach the problem.

Within these limitations, there is still opportunity to develop the use of noncognitives in useful ways. Obviously, applying such indicators in an admissions setting must be done carefully. But the usefulness of knowledge about what traits lead to success is not limited to admissions. Student intervention, orientation, and even the curriculum itself can benefit from consideration of skills that are not traditionally academic. The AAC&U's LEAP initiative, for example, lists noncognitives such as teamwork and civic engagement in its recommendations for general education.

In summary: there is motivation to find new predictors of academic performance, and there are indications that noncognitives are promising. There are, however, obstacles that can be best met with a collaborative, wide-ranging effort that focuses on practical effect.

Friday, June 05, 2009

It's all in the name (comic)

Photo credits (top) wetwebwork, me, caruba (middle) Ryan Somma, (bottom) Bob.Fornal, vgm8383
This comic may be distributed under Creative Commons.

Thursday, June 04, 2009

Neologisms and Peeves

I received one of those viral emails this week advertising neologisms from a Washington Post contest, including things like:

Esplanade -- v., to attempt an explanation while drunk.

I thought they were funny, so I tried to find the original source. Interestingly, there seems to be no page at the Washington Post that corresponds to the description:

Once again, The Washington Post has published the winning submissions to its yearly neologism contest, in which readers are asked to supply alternative meanings for common words.

By googling the sentence above, one can find the whole list at many places on the web, except apparently at the Washington Post. With a little more work, I discovered this page on the Post's Style section site, with the list and attributions to the original authors. Here one can discover that "esplanade" is a creation of Kevin Mellema in Falls Church. These attributions are missing from the version circulating on the web. I have not checked to see if the whole word list is the same, but it looks so at first glance.

Now for the peeve. (Sorry, there's only one.) I got an email yesterday which said, in effect:

We've been asked to create monthly financial reports, which we will share with the bigwigs on a semi-annual basis.

Although one can make out the sense of the statement, the problem is with the word 'basis.' Everybody seems to do things on a basis nowadays. "I go to the store on a daily basis," instead of "I go to the store daily." It's an awful, pretentious circumlocution that ought to be execrated. But it isn't, and probably won't be. That doesn't stop me from complaining about it. Consider:

I'm paid weekly on an hourly basis.

This is precise and understandable. The speaker probably punches a time clock--hence the hourly basis, which is an accounting concept. But the paychecks come weekly. Now read:

I'm paid on a weekly basis on an hourly basis.

What a confusing mess! In the email I got, the 'basis' part gets applied to the periodic distribution, and the financial accounting part leaves it out--exactly backwards. What's intended is:

We've been asked to report semi-annually on finances, summarized on a monthly basis.

Now we know the reports go twice a year, and should include each month's activities in summary.

This problem is so bad--this unnecessary ubiquity of 'basis'--that evolution and information theory have begun to gnaw at it. By this I mean that the extra verbiage in saying "on a basis" instead of adding the suffix to make an adverb takes longer to say. This economic fact competes with the (imagined) weight added to the speaker's words by the extra baggage. This sometimes results in pressure to shorten the circumlocution. That's how we get acronyms and abbreviations. So I've heard "on a daily" popping up. As in "I run this report on a daily."

In my nightmares, 'basis' takes over as the root of all adverbs, just as the genitive seems to be eroding away in German in favor of the dative (The car of Stanislav instead of Stanislav's car, for example). If you listen carefully you'll hear this happening in the wild. "Basis" is creeping out of the domain of purely temporal adverbs and into the general population, where it can breed like kudzu. Soon we'll be talking like this:

After the accident, I drive on a slow basis.

I spend my money on a careful basis now.

I love you on a complete basis.

The horror, the horror.

The Dropbox Idea

As part of the portfolio idea I've been working on, I need some kind of central location for providing minimal organization for what documents go with what course--whether it's a hyperlink to something on the web or an actual document stored in a private archive. I built something similar to this once before. We called it iceBox, and it was immediately very successful, mostly because of its simplicity. A screenshot from the prof's point of view is shown below.

Student last names have been redacted because of FERPA. Here, Joey L has submitted a document called "Web Interface", which he's designated as a rough draft, and which has not yet been reviewed by the instructor. As soon as the instructor downloads it, this turns to "Reviewed." Later we added a feedback link too. I used this almost exclusively for the last math course I taught, taking probability assignments as Excel documents, and giving the grade via the simple feedback button.

The student interface is similar, and also very simple. The main difference is that it has a form for uploading a file. When this service was launched, there was no advertising beyond an email to faculty to try it out if they wanted. Despite this, it has been extremely successful. I attribute this mainly to simplicity and (related) reliability. In IT we had almost no support calls on the service. You can see a live usage graph, which shows a peak of 600 files per week during finals (with a student population of about 1000). The graph uses Maani Flash-based charts, which are very handy.

You can get a dropbox for yourself. This sort of thing isn't just good for portfolios. Imagine if all of your work files were automatically synched across your home computer, laptop, and desk computer at work, AND that you could get them through a web interface. That's the intent of DropBox, which you can see all about in this short screencast. 2GB is free, or you can pay $200/year for 100GB. I recently signed up to iDrive for online backup, which does something similar at a cheaper price ($50/yr for 120GB). This service won't synch across computers, however. It will allow downloads from the web. Another difference is that DropBox will allow you to publish any of your files to the web, so you can hyperlink to anything in your archive, to paste in an email, for example. This sounds very useful, and once my iDrive backup is complete (it takes days) I'll sign up for a DropBox account and try it out.

Wednesday, June 03, 2009

Managing Meeting Entropy

I end up having a lot of meetings--more than I used to. And I'm beginning to drown in meeting notes, which usually consist of an agenda with doodles, circles, arrows, and almost indecipherable half-sentences. Then there's some yellow pad page that is associated with more asterisks (circled), boxes, and more scribbled instructions to myself. Frankly, it can become a mess very quickly, and these things start to form a depressing pile of "meeting goo" after a while. It reminds me of a line from Thomas Pynchon's The Crying of Lot 49: a "salad of despair."

I googled around for suggestions about how to manage such a mess, but didn't come across anything particularly useful. So yesterday I created a meeting note template, shown below.

Eight or so of these will fit on a single hole-punched sheet. The L M H is priority, and my code for the little boxes is action item (*), information (i), question (?), and idea (light bulb). Under that is "ticket," meaning it needs to be formally tracked in the ticket system. Underneath is the destination--the areas of my responsibility. There's just enough room for a few sentences of description on the right. I explained this all to my project coordinator yesterday, and we're trying it out, but in the meetings I had yesterday I can already feel a calm descending from this minimal bit of organization.
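For the programmers in the audience, the slip translates naturally into a small record type. This is only a sketch of how the fields above might be modeled; the class and field names (Priority, Kind, MeetingNote, needs_ticket) are my own inventions for illustration, and the sample notes are made up.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    LOW = "L"
    MEDIUM = "M"
    HIGH = "H"


class Kind(Enum):
    ACTION = "*"
    INFORMATION = "i"
    QUESTION = "?"
    IDEA = "idea"  # drawn as a light bulb on the paper form


@dataclass
class MeetingNote:
    priority: Priority
    kind: Kind
    destination: str            # area of responsibility the note belongs to
    description: str            # the few sentences on the right of the slip
    needs_ticket: bool = False  # the "ticket" box


def by_priority(notes):
    """Return notes sorted with high priority first."""
    order = {Priority.HIGH: 0, Priority.MEDIUM: 1, Priority.LOW: 2}
    return sorted(notes, key=lambda note: order[note.priority])


# Made-up examples of filled-in slips.
notes = [
    MeetingNote(Priority.LOW, Kind.IDEA, "Assessment Committee", "Try a shared rubric"),
    MeetingNote(Priority.HIGH, Kind.ACTION, "IT", "Replace the lab printer", needs_ticket=True),
]
for note in by_priority(notes):
    print(note.priority.value, note.kind.value, note.destination, "-", note.description)
```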

The next step is to create a simple database to keep track of these, so I can retrieve them from my phone or other browser.

From the archives, see also: The Secret Life of Committees.

Monday, June 01, 2009

Online Markup

My latest project is to create a way for students and faculty to mark up documents on the web without a lot of infrastructure. This is part of a free-form portfolio project, where we worry mostly about saving hyperlinks so we can find things later and less about actually storing the objects themselves. One nice thing about this is that the creator (the student) has complete ownership over her creation, and can see it as something other than "just for class" work.

As an example, we could ask our students to sign up for a google account and create a google doc. These can be published to the web, like this one I created (from a screenplay). If you have a diigo account, you should be able to see the markups I put on it as well.

A second method is a bit more ambitious. I figured that we might want more control over the documents sometimes. We might not want students to have to worry about google accounts, for one thing. We also might want to keep copies of some kinds of work more private and store them on local servers. Students can all use word processors and with a little help can produce rich text files (RTF) rather than more complex varieties, like (shudder) docx.

With a little CGI magic we can create a drop-box service where RTF files get morphed into HTML files (web pages). The advantage of that is that then they can be marked up with diigo. This should work locally (at least it does in my tests), but could also be on a public web site using authentication.

So I built one. You can see a converted RTF here (notes for a short talk I gave at the new library opening). If you're logged into diigo you should see the markup on it too. You can try it out for yourself by browsing to the plain upload page and submitting an RTF file. The conversion is far from perfect. Not all kinds of markup and formatting come over. Bold did, and numbered lists, but not hyperlinks or graphics. I think I can overcome some of this with time, working with the perl module that performs the magic.
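
To give a feel for what the conversion step involves, here is a toy sketch in Python. It is not the perl module mentioned above and not the code behind the upload page. It is a hypothetical illustration that understands only two RTF control words, \b (bold on, or off with \b0) and \par (paragraph break), and silently drops everything else. It also ignores RTF group structure, so the font and color tables in a real word-processor file would leak through as text.

```python
import html
import re

# One token is a control word (optional numeric parameter, optional trailing
# space), a control symbol, a brace, or a run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\([^a-z])|([{}])|([^\\{}]+)",
                   re.IGNORECASE)


def rtf_to_html(rtf):
    """Convert a very small subset of RTF (bold, paragraphs) to HTML."""
    paragraphs = []
    current = []
    bold = False

    def flush():
        # Close the paragraph in progress, if it has any content.
        if current:
            paragraphs.append("<p>" + "".join(current) + "</p>")
            current.clear()

    def emit(text):
        # Raw line breaks in an RTF file are not significant, so drop them.
        text = text.replace("\r", "").replace("\n", "")
        if text:
            escaped = html.escape(text)
            current.append("<b>" + escaped + "</b>" if bold else escaped)

    for word, param, symbol, brace, text in TOKEN.findall(rtf):
        if word == "b":
            bold = param != "0"      # \b turns bold on, \b0 turns it off
        elif word == "par":
            flush()
        elif symbol in ("\\", "{", "}"):
            emit(symbol)             # escaped literal backslash or brace
        elif text:
            emit(text)
        # Other control words and the grouping braces are ignored.
    flush()
    return "\n".join(paragraphs)
```

Feeding it a string such as r"{\rtf1\ansi Hello \b bold\b0  text\par Next}" yields two HTML paragraphs with the word "bold" wrapped in b tags. Hyperlinks and graphics would each need their own handling, which is where a real converter earns its keep.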

All in all, this was a successful experiment. With modification, it could become a repository for RTFs and hyperlinks, which, in combination with all the resources of the web, would make an unlimited platform for coursework interactions.

Evolving Teaching with Formative Assessments

I found a very thoughtful article by Paul Black and Dylan Wiliam hosted at Phi Delta Kappa called "Inside the Black Box: Raising Standards Through Classroom Assessment." The article summarizes the authors' research in attempting to answer these questions:

Is there evidence that improving formative assessment raises standards?
Is there evidence that there is room for improvement?
Is there evidence about how to improve formative assessment?

The answer, they say, is a clear yes to all three. They make the point that teaching is currently largely about management, and more attention is paid to the bureaucratic duties than teaching effectiveness. The recommendation is to do more than stick a few assessment pieces onto the existing structure like ornamentation:

The research studies [...] show very clearly that effective programs of formative assessment involve far more than the addition of a few observations and tests to an existing program. They require careful scrutiny of all the main components of a teaching plan. Indeed, it is clear that instruction and formative assessment are indivisible.

As an illustration of the collision between management and teaching, consider the following passage from the paper.

One common problem is that, following a question, teachers do not wait long enough to allow pupils to think out their answers. When a teacher answers his or her own question after only two or three seconds and when a minute of silence is not tolerable, there is no possibility that a pupil can think out what to say.
There are then two consequences. One is that, because the only questions that can produce answers in such a short time are questions of fact, these predominate. The other is that pupils don't even try to think out a response. Because they know that the answer, followed by another question, will come along in a few seconds, there is no point in trying.

So here, the assessment (asking questions) is deemed ineffective. "What emerges very clearly here is the indivisibility of instruction and formative assessment practices."

The article closes with advice on implementation:

What teachers need is a variety of living examples of implementation, as practiced by teachers with whom they can identify and from whom they can derive the confidence that they can do better. They need to see examples of what doing better means in practice.

Assessment directors or coordinators will recognize this as good advice. Training everyone to re-think their teaching in a mass re-education project isn't likely to work. It takes time and focus, diffusing knowledge out from local experts who are trusted. This difficulty is due to the individualized nature of the problem--every teacher has to find ways that work for him or her, for his or her subject area. There is no BandAid fix:

This study suggests that assessment, as it occurs in schools, is far from a merely technical problem. Rather, it is deeply social and personal.

This account rings true to me, and underlines what teachers and many in the assessment profession already know--that it's not enough to mandate "assessment" from on high and expect anything good from it. Approaching the problem as a bureaucratic exercise leads naturally to "assessments" of the same type--reams of statistics dissociated from the practice of teaching and learning. Actual improvement has to start with more personal motivations--"yes, I care about how my students learn." Leaders can move an institution in that direction, but I don't think it can be mandated and then considered solved.