Monday, November 28, 2011

Link Salad

A Monday's worth of interesting education-related links:

On non-cognitives, we have two articles from the Boston Globe. The first is "How College Prep is Killing High School":
A number of economists, including Nobel economist James Heckman, have documented the need for noncognitive or so-called soft skills in the labor market, such as motivation, perseverance, risk aversion, self-esteem, and self-control.
The second is "How Willpower Works":
In dozens of studies conducted over the past 25 years, Baumeister has found that taking on specific habits - like brushing your teeth with the opposite hand you’d normally use - can increase levels of self-control. In a phone interview, he likened willpower to a muscle: “If you exercise it, you can make it stronger. There’s nothing magical about it.’’
Then there is the less optimistic offering from the New York Times "The Dwindling Power of a College Degree," which contains a warning for all of us:
A general guideline these days is that people are rewarded when they can do things that take trained judgment and skill — things, in other words, that can’t be done by computers or lower-wage workers in other countries.
The Wall Street Journal has a scorecard of career salaries by degree, in case you're keeping score. The highest 75th percentile salary goes to math and computer science combined. Compare it to math education:

A partial listing of the WSJ salary/major list found here.
The quote in the New York Times article about computers replacing us is especially interesting when juxtaposed with the ambitious research plan described in "Mining the Language of Science," from Physorg.com:
Scientists are developing a computer that can read vast amounts of scientific literature, make connections between facts and develop hypotheses.
Stanford University is offering a free online course on machine learning if you want to learn how to make a computer smarter than yourself (true story).

 To round out that topic, here are two articles on the limits of human understanding. First from Physorg.com again is "People are Biased against Creative Ideas, Studies Find," including these findings:
  • Creative ideas are by definition novel, and novelty can trigger feelings of uncertainty that make most people uncomfortable. 
  •  People dismiss creative ideas in favor of ideas that are purely practical -- tried and true. 
  •  Objective evidence shoring up the validity of a creative proposal does not motivate people to accept it. 
  • Anti-creativity bias is so subtle that people are unaware of it, which can interfere with their ability to recognize a creative idea.
The second article, from SciGuru, is "Ignorance is bliss when it comes to challenging social issues."
The less people know about important complex issues such as the economy, energy consumption and the environment, the more they want to avoid becoming well-informed, according to new research published by the American Psychological Association. And the more urgent the issue, the more people want to remain unaware [...]
This illustrates the mechanism I described in "Self-limiting Intelligence."  You can test yourself on these last two points. Here's a creative idea from Business Insider, and a challenging social issue from The Economist. Good luck!


Wednesday, November 23, 2011

Assessments, Signals, and Relevance

In "Tests and Dialogues" I promised to address the use of rubrics, which I get to. But before I do, I want to extend the ideas presented in the last few articles. By coincidence, my daughter  provided an example the same day I wrote the article.
My daughter Epsilon had a math test and a French test yesterday, so naturally I asked how they went. She had spent quite some time reviewing (with my imperfect help in the French class), and said that the tests were easy except that she forgot what the degree of a polynomial is (ugh!). She said she was able to guess at some things she didn't know, which made my eyebrows rise. Guess? Sure, she says, it's almost all multiple choice.  Here I began to sputter. What?? Algebra and French...multiple choice? Yes, says she, it's because of the EOCs. That would be the local name for the monological "End of Course" state tests. Since the EOCs are multiple choice, and there is so much weight put on them, it makes economic sense to optimize all testing to resemble "the ones that matter." She's 14, and this is old hat to her by now.
The wrong assessments plus a factory mentality optimize local relevance at the cost of global irrelevance. David Kammler wrote a marvelous parable in this vein, "The Well Intentioned Commissar." The systems that achieve our goals can be inherently very complex. When we try to grasp their workings by simplifying cause and effect (e.g. in order to manage them like a factory), we can lose important information. This is detrimental when optimizing the simplified problem is not the same as optimizing the original problem. The impact is not merely academic. I read a story in The Economist years ago that went like this:
A state in the US was spending more on road repair than it thought reasonable, and sought to make the situation more equitable by passing the cost on to the owners of the heavy trucks that were doing the most damage. So they instituted an axle fee--the more axles the truck had, the higher the cost to the truck owner to use the roads. This was a simple approximation: the heavier the truck, the more axles. What could go wrong? The outcome was that truckers, not being stupid, started using trucks that carried just as much weight, but on fewer axles. This increased the ground pressure of the trucks (same weight over less area) and damaged the roads even more than before. In this case they didn't merely optimize irrelevance, but actually exacerbated the problem they were trying to fix.
Even being irrelevant has an associated opportunity cost. The time spent learning how to game multiple choice tests could be better spent. We can only imagine what the long term cost is, when students finally figure out that real problems don't come with a built-in 20% chance of guessing the right answer.

It is not a coincidence that all of this ties together with the idea in "Self-Limiting Intelligence," where the problem I tried to illuminate about intelligent systems is that self-change easily turns into self-deception. My last couple of articles have mostly ignored the fact that there are powerful motivations lurking behind official definitions. Here's an example of how motivation subverts definitions from theNewspaper.com, which calls itself "a journal of the politics of driving."
Automated ticketing vendor American Traffic Solutions (ATS) filed suit Tuesday against Knoxville, Tennessee for its failure to issue tickets for turning right on a red light -- and that is costing the company a lot of money. A state law took effect in July banning the controversial turning tickets, but the Arizona-based firm contends the law should not apply to their legal agreement with the city, which anticipated the bulk of the money to come from this type of tickets.
If this seems silly, here's a more disturbing example:
Judge Mark A. Ciavarella and former Senior Judge Michael T. Conahan are accused of taking $2.6 million for sending children to two [correctional] facilities owned by Pittsburgh businessman Greg Zappala.
Privatizing a corrections facility created economic value in criminal offenders, which increased the supply of offenders through more frequent application of the monological standard by judges. This is exactly what juries are there to prevent, by providing a dialogical check.

In higher education, the definition of educational success adopted by policymakers is "enrollment" and "graduation," for which the state pays plenty. See my previous article "Flipping Colleges for Profit" for how that turns out in the hands of private investors who seek to maximize dollars/student. We maximize enrollment and (perhaps) graduation at very large monetary expense to the taxpayer, in grants and loans to students who default at high rates. This counterproductive effect is evidence of an over-simplified index of success.

It would be understandable if you took away from the discussion so far that monological = bad and dialogical = good, but that's not the case. Systems have to function monologically most of the time. I base this on a simple argument:
Systems of all kinds exist. Some work, some don't. The ones that survive do so in part because they are motivated to survive. The actions a system takes to survive can be ultimately reduced to binary "do this/don't do that" decisions. Motivations drive those decisions based on information from the internal and external environment. This reduction of complex data into a simple binary decision we might call an assessment, and the result is a 'signal.' Pain when you stub your toe is a "don't do that" signal corresponding to the implicit motivation to avoid bodily harm. 
If we put together all these signals, they comprise a language. If it works, it models the environment and allows the system to survive (eat this, don't eat that). The diagram that goes with a motivation-driven decision loop is the one I discussed in "Self-Limiting Intelligence."
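To make the idea of an assessment-as-compression concrete, here is a minimal sketch. The sensor readings, the threshold, and the "motivation" are all invented for illustration; no particular biological or organizational system is being modeled.

```python
# A minimal sketch of an assessment: many noisy observations from the
# environment are compressed into a single binary "do this / don't do that"
# signal, relative to a motivation (here, avoiding bodily harm).

def assess(readings, pain_threshold=7.0):
    """Compress a stream of readings into one binary signal."""
    worst = max(readings)                      # the massive data reduction
    return "DON'T DO THAT" if worst > pain_threshold else "carry on"

toe_nerves = [0.2, 0.1, 9.8, 0.3]              # one reading spikes: a stubbed toe
print(assess(toe_nerves))                      # -> DON'T DO THAT
```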

The point is not that monological motivation-driven signals are bad for us, it is that we have to use the right ones if we want to succeed. Sometimes dialogues get turned into signals, as in a plebiscite or anywhere else where public opinion matters. Marketing is another example, but in reverse--working from a motivation to try to affect dialogue so someone can sell more soap. In those cases, lots of energy is spent in trying to affect conversations. The BBC's recent article "Fake forum comments are 'eroding' trust in the web" is an example.

Part of the decision about what assessment to use to create signals should be driven by the consideration that weighing it down with economic value will probably degrade the quality. This problem is ubiquitous. It includes counterfeiting, cheating, and corruption of all sorts. It even shows up in natural selection, as Darwin figured out, in sexual selection--explaining peacock feathers, for example. It is perhaps embodied in the advice "you have to fake it to make it."

In order to make a decision, a system has to process a potentially infinite amount of data for a few clues as to what will accomplish its goal (fulfilling motivations). This assessment is a massive data compression that, if it's done well, describes in signal-language the important elements of the environment relative to motivation.

An example will illustrate the point. First, let's look at the role of complexity and assessment. I took the photos below at the Cape Fear Serpentarium in Wilmington, North Carolina. (It's a fantastic place to visit, along with Fort Fisher and the nearby aquarium if you're in the area.) This is the Gaboon Viper, and I first read about it in a zoo--I think in Columbia, South Carolina. It's a big, slow snake that prefers to sit and wait for lunch to walk or hop by. If you're a small animal, the picture below shows your perspective:
The Gaboon Viper presents a high-complexity look to prey.
In its natural environment, the snake's colors blend perfectly into the surrounding forest floor. The complex patterns of light and dark break up the outline of its body, so that a rodent is unlikely to correctly assess this information and form the signal SNAKE! The snake presents a high-complexity appearance to the world, and is rewarded for this concealment with a reduced probability that a rodent will correctly assess the situation.

And a low-complexity "here I am!" to large beasts.
However, the snake has another problem. It's got a great lifestyle, sitting around waiting for dinner to walk by, but there are also large hooved beasts that are far too large to eat and impossible to get out of the way of when they come ambling by. It's a good thing to be hidden from small critters that one might consume, but quite another to be hidden from a huge monster that might step on you and break your spine! So the presentation for a viewer looking down on the serpent needs adjustment. Instead of concealment, it wants to create an instant assessment in the bovine brain of SNAKE! As you can see in the photo on the right, this is accomplished (via natural selection, of course) by white stripes that look like the center of a highway.  I looked for research that actually demonstrates that cows can see these snakes better than ones without such coloration, but didn't come up with anything. So treat this as informed speculation rather than fact, unless someone can point me to an authoritative source. But the effect is real. In the family of large cats, some females have white dots on the backs of their ears so they can present a low-complexity "follow me" sign to their kittens in low light. Military vehicles do something similar so they don't run into one another in the dark, but also don't make good targets.

To continue the example:
Suppose you are going out for an afternoon hike in a tropical jungle. You check into the matter and see that there are deadly poisonous snakes in the area. Fortunately, there is a guy whose job it is to monitor the jungle nearby for such threats, and post a sign at the trail head with a warning when appropriate.  You may rightfully be dubious. There is a very large tract of undeveloped forest out there, and how plausible is it that this guy--who may be the governor's favorite nephew, for all you know--could have checked for every possible snake? So you are not reassured when you see a big green NO SNAKES TODAY THANK YOU sign nailed to a tree. In effect, you've rejected the data reducing assessment and have decided to create your own signals. That is, the final assessment for "is there a snake here?" remains pending.  You carefully watch where you step, tying up a large part of your mind to continually test the environment against your snake-matching perception. This is a lot of work and quite stressful, so you give it up and go to lunch instead.
The example shows the trade-off between making an early or a late assessment into a signal. Early signals make subsequent decisions easier. That's why most businesses like stability.

So the big question for a complex system is when to do the assessment for a given motivation. There are symmetrical arguments for and against early decisions based on limited data:

Early assessment from data to signal:
Pro: If we assess early, we reap an economic benefit similar to that of mass production. All decisions that depended on the first one can now go about their business. Mandating that 21 is the legal age to drink alcohol creates a simple environment for liquor stores, as opposed to, say, administering an on-the-spot test of "responsible drinking" for each customer.
Con: Creating the signal greatly simplifies the actual state of the world. If subsequent decisions need detailed information, it won't be available. Worse, the signal may be wrong entirely.
 "Housing prices never fall" was an early assessment that led to a lot of unfortunate consequences.
Con: Signal manipulation for economic benefit (what we would call corruption in a government) can cause a wide-spread disconnect from reality. Adopting unproven test scores as the measure of educational success creates a false economy and doesn't reflect the actual goal.
Late assessment from data to signal:
Pro: Information isn't lost before it's needed. The example of walking in the jungle illustrates this. 
Pro: Local corruptions of signals have only local effects. From a virus's point of view, its ever-changing protein coat means that it can't be intercepted easily. On the other side, those who have to decide on a vaccine face a difficult decision about which variants to target.
Con: It's more costly because decisions are made individually instead of in mass production style. Every state has a different set of paperwork for allowing truckers to use the roads, which impedes commerce. Conversely, internet sales are aided by not having to worry about every locality's rules.
There's no one right answer. The cost of deferring assessment can be very high. The railroad owners finally created time zones to solve the problem of every town running on a different clock. A common currency is obviously good for everyone. Standardized traffic laws are a boon. Formalized ownership of property is a prerequisite for a modern society. All of these involve monological definitions that are partly based on early assessment of evidence and partly arbitrary. In some cases, the definition can be completely arbitrary and still immensely helpful (e.g. which side of the road we drive on). Imagine what a mess it would be if we had to survey the dialogical landscape every morning to see whether most people were driving on the right or left, stopping at green lights or red, and make our adjustments accordingly.

Applying this to Higher Education
Education at all levels has to process so many students that good organization is essential. So it has to be a system, and there are going to be a lot of semi-arbitrary decisions made just to provide a workable system language. There are many, many of these, and those of us who labor within the system take it for granted that these things exist:
  • Courses
  • Grades
  • Credit-hours
  • Grade levels or course level designations
  • Set times for instruction (e.g. one hour lecture, which is really fifty minutes)
  • Set curricula
  • Degrees and diplomas
None of these has much to do directly with learning. The pressure to define what a credit hour means for online courses shows the rigidity of the system. It's worth a moment to look at one of these in detail, so let's follow that train of thought into the dialogical tunnel. Compare the Carnegie fifty-minute block class with how we naturally learn.

The Khan Academy comprises a large group of tutorials on YouTube started by Sal Khan (who quit managing a hedge fund to do this), and it is now operated by an impressive team. The home page claims that over 87 million lessons have been delivered, with more than 2700 videos on offer covering a range of academic subjects and levels. The one shown in the picture is about long division. You can see that it's a little less than 10 minutes long. Why ten minutes? Why not fifty minutes? Some are longer and some shorter, depending on the needs of the subject.

The curriculum in the Khan Academy is not a set of courses with prerequisites. It's much more natural than that, using a "knowledge map" to show connections between the ideas taught in the videos. Here's a sample:
There are suggestions for prerequisites, but they are per topic, not per course. Each of these has associated problems to be solved, challenges, and badges to earn. There are individualized feedback reports, and the ability for coaches to be involved.
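To make the contrast with course-level prerequisites concrete, here is a hedged sketch of what a per-topic knowledge map might look like as a data structure. The topics and links are made up for illustration; this is not Khan Academy's actual map or software.

```python
# A toy "knowledge map": prerequisites attach to individual topics, not to
# whole courses. Given a target topic, walk the graph to list everything a
# learner might want to review first.

prereqs = {
    "long division":       ["multiplication", "subtraction"],
    "multiplication":      ["addition"],
    "subtraction":         ["addition"],
    "addition":            [],
    "polynomial division": ["long division", "polynomials"],
    "polynomials":         ["multiplication"],
}

def topics_to_review(target, graph=prereqs, seen=None):
    """Depth-first walk collecting all prerequisites of `target`."""
    seen = set() if seen is None else seen
    for topic in graph.get(target, []):
        if topic not in seen:
            seen.add(topic)
            topics_to_review(topic, graph, seen)
    return seen

print(topics_to_review("polynomial division"))
# -> {'long division', 'multiplication', 'subtraction', 'addition', 'polynomials'}
```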

I recently signed up for a free course on Machine Learning taught out of Stanford University. The lectures were online with a built-in quiz in each one. The videos are of varying length, but none close to fifty minutes. I skipped stuff I was already familiar with and browsed topics I didn't know as much about. Once I had to go back to the introductory material to clarify a point, had my "aha!" moment, and then forged ahead. To me this is a very natural way to learn. It's not at all systematic. I would never have had the time to sit through a traditional lecture course on the subject, but with the browsing ability I have with the online presentation, I can choose what I want off the menu and maximize the use of my time.

What's the point? 
If we create systems that make early decisions for learners so that we can make the logistics work, we save time in administration but suffer opportunity cost for each student. This "early assessment" model leads to preemptive decisions solidified into policy and practice: courses, semesters, grades, and so on.

A "late assessment" version would be to provide just-in-time instruction for each student so that it could be used in some authentic learning situation. By authentic I mean anything that doesn't cause the student to think "when will I ever use this?"  For example, individual or group projects that require learning the content, but perhaps only a piece at a time on demand.

The early collapse of complexity into a simple bureaucratic language includes the factory-like quality checks that occur along the way in the form of (increasingly standardized) tests. This early assessment presents problems for consumers of the product: employers and the graduates themselves, and society as a whole. The educational system doesn't give much information about what the students have learned other than these formalized assessments. It's like the NO SNAKES TODAY THANK YOU sign.

Before the Internet, late assessment methods would have been too expensive to use for the whole system. Not anymore. We have the opportunity to adopt a new model whereby we coach students on how to learn for themselves. So much the better if this learning is in the context of some interesting project that the student can show off to peers and (if desired) the world. The creation of a rich portfolio along the way allows employers or anyone else with access to make their own assessments.

I do not think that the massive higher education system will change this radically anytime soon. Nor are we ready to make such a change. Something like the Khan Academy for every academic discipline would be a massive undertaking. Textbook publishers, if they are forward-thinking, may lead the way. Imagine an online "textbook" that was actually a web of interrelated ideas mapped out transparently, with video lectures and automated problem solvers attached. This frees up what I once called the low-bandwidth part of the course and allows for more creative official class time. Effectively, it offloads all the monologue to out-of-class time and lets you get to the dialogue directly.

There is an opportunity for programs here and there to begin to experiment with this transition. As the examples in my previous articles show, this is already happening. In the most optimistic case, this prevents further solidification of the factory mentality in higher education by showing valid alternatives.

Tuesday, November 22, 2011

Tests and Dialogues

In "The End of Preparation" I argued that standardized tests, as they exist now, are not very suited to the task of correctly classifying quality of the partial products we call students. Certainly the tests give us information beyond mere guessing, but the accuracy (judging from the SAT) is not high enough to support a factory-like production model. I pointed out that test makers do not usually even attempt to ascertain what the accuracy rate is. Instead we get validity reports that use a variety of associations. If we brought that idea back to the factory line, it would look something like this.
Announcing the new Auto Wiper Assessment (AWA). It is designed to test the ability of an auto to wipe water off the windshield. Its validity has been determined by high negative correlations with crash rates of autos on rainy days and low correlation on sunny days. 
On a real assembly line, the question would be as simple as Does it work now? and Are the parts reliable enough to keep it working?  Both of these can be tested with high precision. And of course, we can throw water on the windshield to directly observe whether the apparatus functions as intended. Direct observation of the functional structure of learning is not possible without brain scanners. Even then, we wouldn't really know what we are looking at--the science isn't there yet. What we do know is fascinating, like the London Taxi Cab study, but we're a long way from understanding brains the way we understand windshield wipers.

Validity becomes a chicken-and-egg problem. Suppose our actual outcome is "critical thinking and complex reasoning," to pick one from Academically Adrift. There are tests that supposedly tell us how capable students are at this, but how do we know how good the tests are? If there were already a really good way to check, we wouldn't need the test! In practice, the test-makers get away with waving their hands and pointing to correlations and factor analyses, like the Auto Wiper Assessment example above. This is obviously not a substitute for actually knowing, and it's impossible to calculate the accuracy rate from the kinds of validity studies that are currently done. The SAT, as I mentioned, is an exception. This is because it does try to predict something measurable: college grades.

This is not a great situation. How do we know if the test makers are selling flim-flam? In practice, I think tests have to "look good enough" to pass casual inspection, and they can amount to neo-phrenology without anyone ever knowing. How else can the vast amount of money being spent on standardized tests be explained? I'd be happy to be wrong if someone can point me to validity studies that show test classification error rates similar to the SAT's. A ROC graph would be nice.
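For what it's worth, here is the kind of analysis I have in mind, sketched with invented scores and outcomes (not real SAT or test data): sweep the benchmark across the score range and record the true and false positive rates at each cut, which is all a ROC graph plots.

```python
# Sketch of a ROC-style analysis with invented data. Each student has a test
# score and an observed outcome (1 = succeeded later, 0 = didn't). For every
# possible benchmark we compute the false positive and true positive rates.

scores   = [400, 450, 500, 520, 560, 600, 640, 700]
outcomes = [  0,   0,   1,   0,   1,   1,   0,   1]

def roc_points(scores, outcomes):
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    points = []
    for cut in sorted(set(scores)):
        predicted = [s >= cut for s in scores]
        tp = sum(p and o for p, o in zip(predicted, outcomes))
        fp = sum(p and not o for p, o in zip(predicted, outcomes))
        points.append((cut, fp / neg, tp / pos))   # (benchmark, FPR, TPR)
    return points

for cut, fpr, tpr in roc_points(scores, outcomes):
    print(f"benchmark {cut}: false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")
```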

The argument might be that since reductionist definitions are not practical, and there really is no way to know whether a test works except through indirect indications like correlations, this is the best we can do. But it isn't. In order to support that claim, let me develop the idea by contrasting two sorts of epistemology. It's essential to the argument and also worth the exposition for its own sake. When I first encountered these ideas, they changed the way I see the world.

Monological Knowing
Sometimes we know something via a simple causal mechanism: an inarguable definition. For example, when the home plate umpire calls a strike in a baseball game, that's what it is. It doesn't matter if the replay on television shows that the pitch was actually out of the strike zone. Any argument about that will be in a different space--perhaps a meta-discussion about the nature of how such calls should be made. But within the game, as veteran umpire Bill Klem is quoted as saying, "It ain't nothin' till I call it!"

Monological definitions are generally associated with some obvious sign. An umpire jerking his clenched fist after a pitch means it was a strike. Sometimes the definitions come down to chance, as with a jury trial. In the legal system, you are guilty if the jury finds you guilty, which is only indirectly related to whether or not you committed a crime. The unequivocal sign of your guilt is a verdict from the jury. Other examples include:
  • Course grades, defining 'A' Student, 'B' Student etc.
  • Time on the clock at a basketball or football game, which corresponds only roughly to shared perception of time passing (perceived time doesn't stop during a time-out, but monological time can).
  • Pernicious examples of classifying a person's race, e.g. leading up to the Rwandan genocide. You are what it says you are on your documents.
Sometimes the assignments are random or arbitrary. Sometimes a single person gets to decide the classification, as with course grades. There is sometimes pressure from administrators to create easily understood algorithms for computing grades in order to handle grade appeals, but instructors usually have wide latitude in assigning what amounts to the monological achievement level of the student.

I got bumped from a flight one time, and came away from the gate with the knowledge that I was "confirmed" on the next flight. That didn't mean what I thought it did, however. According to the airline's (monological) definition, "confirmed" means that the airline knows you are in the airport waiting, so you're a sure seat if they have an extra. It does not mean that such a seat is guaranteed for you.

Dialogical Knowing
This might be more properly called polyphonic, but for the sake of parallelism, allow me the indulgence. In contrast to a monological handing down of definitions from some source, dialogical knowledge has these characteristics:
  • It comes from multiple sources
  • There isn't universal agreement about it (definitions are not binding if they exist)
  • It's subjective
Whereas there is a master copy of what a Kilogram is in a controlled chamber in France, there is no such thing for the concept of "heavy." A load you are carrying will feel heavier after an hour than at the beginning of the hour. Furthermore, we can disagree about the heaviness. This is messy and imperfect, but very flexible because no definitions are needed. Anyone can create a dialogical concept, and it gets to compete with all the others in an ecology where the most fit survive. This fact is what prevents loose shared understanding from devolving too far into nonsense as a whole. There's plenty of nonsense (like fortune-telling), but we can communicate in a shared language very effectively even in the absence of formal definitions. 

If I tell you that I liked the movie Kung Fu Panda, you know what I mean. There are movies you like too, and you probably assume I feel about this movie the way you feel about those, in some vague sense. You may disagree, but that's not a barrier to understanding. We could have a complex conversation about what constitutes a "good" movie, which doesn't have a final, monological answer. In Assessing the Elephant I compared this to the parable of the blind men inspecting an elephant, each sharing their own perspective. I used this as a metaphor for assessing general education outcomes, which are generally broad and hard to define monologically.

Tension between Monologue and Dialogue
Parallel to the tension between accountability and improvement in outcomes assessment, there is a tension between monological and dialogical knowledge in any system. The demand for locked-down monological approaches is the natural consequence of being part of a system, which, as I described last time, needs to manage fuzziness and uncertainty in order to function. That's why we have monological definitions for what it means to be an adult, or "legally drunk." It makes systematization possible. Much of the time, this entails replacing a hard dialogical question ("what is an adult?") with a simple monological definition ("anyone 21 years or older"). In ordinary conversation we may switch between these meanings without noticing, but sometimes the tension is obvious.

The question "which candidate will do the best job in office?" gets answered by "which candidate got the most votes?" It replaces an intractable question with one that can be answered systematically in a reasonable amount of time Of course it's an approximation of unknown validity. Monologically, the system decides on the "best" candidate, but the dialogical split on the issue can be 49% vs 51%.

Someone put together a page describing the relationship between the monological Starbucks definitions of drink sizes and the shared understanding of small, medium, and large. The site, which you can find here, is a perfect foil for this discussion. I find it hysterically funny. Here's a bit of it:
The first problem is that Starbucks is right, in a sense. I've established that asking for a "small coffee" gets you the 12-ounce size; "medium" or "medium-sized" gets you 16 ounces; and "large" gets you a 20 ounce cup. However, in absolute rather than relative terms, this is nuts. A "cup" is technically 8 ounces, and in the case of coffee, a nominal "cup" seems to be 6 ounces, as indicated by the calibrations on the water reservoirs of coffee makers, [...]
When a referee makes a bad call in a sports event, the crowd reacts negatively. The dialogical "fact" doesn't agree with the monological one, which is seen as artificial and not reflecting the reality of shared experience.

It may be appalling, but it makes sense that the Oxford English Dictionary now includes the word "nucular" as a synonym for "nuclear." This is the embodiment of a philosophy that the dictionary should reflect the dialogical use of language, not some monological official version.



In assessment, it's quite natural to fall victim to the tension between these two kinds of knowledge. As noted, tests of learning almost never come with warning labels that say This test gives the wrong answer 35% of the time. The test doesn't have any other monological ways of knowing to compete with, other than possibly other similar tests, so by default the test becomes the monological definition of the learning outcome. Because it replaces a hard question ("how well can our students think?") with an easily systematized one ("what was the test score?") it's attractive to anyone who has to watch dials and turn knobs in the system. In the classroom, however, the test may or may not have anything to do with the shared dialogical knowledge--that messy, subjective, imperfect consensus about how well students are really performing.

A Proposal to Bridge the Gap
Until we better understand how brains work, it's not realistic to hope for a physiology-based monological definition of learning to emerge to compete with testing. However, it would be very interesting to see how well tests align with the shared conception of expert observers. This doesn't seem to be a standard part of validity testing in education, and I'm not sure why. It's in everyone's best interests to align the two.

There is a brilliant history of this kind of research in psychology, culminating in the definition of the Big Five personality traits, which you can read about here. From Wikipedia, here is the kernel of the idea:
Sir Francis Galton was the first scientist to recognize what is now known as the Lexical Hypothesis. This is the idea that the most salient and socially relevant personality differences in people’s lives will eventually become encoded into language. The hypothesis further suggests that by sampling language, it is possible to derive a comprehensive taxonomy of human personality traits.
Subjective assessments have a bad reputation in education, but the lexical hypothesis was shown to be workable in practice. It's not astounding that dialogical language has meaning, but it doesn't seem fashionable to admit it.

Given all this, it's obvious that we should at least try to understand the resemblance between monological tests of "critical thinking and complex reasoning" or "effective writing" and the dialogical equivalent. It's simple and inexpensive to do this if one already has test results. All that's required is to ask people who have had opportunity to observe students what they think. Any way it turns out, the results will be interesting.

Suppose the test results align very well with dialogical perceptions. That's great--we can use either tests or subjective surveys as we prefer.

If the two don't align, then we have to ask who's more likely to be correct. In this case the tests lose out because of a simple fact: test scores don't matter in the real world. What does matter are the subjective impressions of those who employ our graduates or otherwise interact with them professionally. In the world beyond the academy, it's common shared perceptions that are the metric of success, and it won't do any good to point to your test scores. In fact, there is a certain schadenfreude in disproving credentials, as in watching videos of graduates from Ivy U who don't know why the seasons change. It isn't just Missouri: we're a show-me society.

You'll notice that either way, the test results are largely unneeded. This illuminates why they are being used: self-reported dialogical assessments depend on trust, whereas in theory tests can be administered in an adversarial environment. This restates Peter Ewell's quote in my previous article, and it is a recipe for optimizing irrelevance. In Assessing the Elephant, I called this a degenerate assessment loop and gave this example:
A software developer found that there were too many bugs in its products, so it began a new system of rewards. Programmers would be paid a bonus for every software bug they identified and fixed. The number of bugs found skyrocketed. The champagne was quickly put back on ice, however, when the company realized that the new policy had motivated programmers to create more bugs so that they could “find” them.
Similar "degenerate" strategies find their way into educational practices because of the economic value placed on monological simplifications used in low-trust settings. We read about them in the paper sometimes.

Surveying the Dialogical Landscape
I have implemented surveys at two institutions to gather faculty ratings of student learning outcomes. I have many thousands of data points, but no standardized test scores to compare them to, so I can't check the alignment as I described above. The reliability of these ratings is about a 50% probability of an exact match on a four-point scale for the same student, same semester, and same learning outcome, with different instructors. I've already written extensively about that, for example here and on this blog, as well as in some chapters in assessment books, which you can find on my vita.
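The exact-match figure is easy to compute if you have pairs of ratings of the same students; here is a minimal sketch of the calculation with invented ratings (not my actual survey data):

```python
# Sketch: exact-match agreement between two instructors rating the same
# students on the same learning outcome, using a four-point scale (1-4).
# The ratings below are invented for illustration.

instructor_a = [3, 2, 4, 1, 3, 2, 4, 3]
instructor_b = [3, 3, 4, 1, 3, 2, 3, 3]

matches = sum(a == b for a, b in zip(instructor_a, instructor_b))
agreement = matches / len(instructor_a)
print(f"exact-match agreement: {agreement:.0%}")   # -> 75% for these made-up ratings
```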

Conclusion
In a tight system, monological approaches are useful. The human body is a good example of this, but we should note that at least two important systems are more dialogical than monological: the immune system and the conscious mind. The world beyond graduation resembles a competitive ecology more like what the immune system faces than a systematic by-the-numbers existence like a toenail.

The only reason to use monological tests is if we don't trust faculty. This can't even be done with any intellectual honesty, because we can't say that the tests are any good. What I proposed in "The End of Preparation" is that we move to dialogical methods of assessment throughout and beyond the academy. These can still be summarized for administrators to look at, but only if there is trust at all levels. And really, if there is no trust between faculty and administration, the whole enterprise is doomed.

The mechanism of using public portfolios showing student records of performance can be purely dialogical--a student's work can have different value to different observers inside and outside the academy.


Next time I'll address what all this has to do with rubric-based assessment.

[Next article in this series: "Assessments, Signals, and Relevance"]

Some Frivolous Thoughts
As I said, this dichotomy changed the way I think about the world, and I find interesting tidbits everywhere. One interesting idea is the hypothesis that as a domain of interest becomes more reliably theoretical (like alchemy becoming chemistry), the nomenclature transitions from descriptive and dialogical to arbitrary and monological. I went poking through several dictionaries looking for evidence in the names of the elements, to find examples. Copper may be an instance, perhaps having been named for Cyprus, as in "Cyprian metal." If the name is too old, the etymology is foggy. Steel is more recent, and it seems to derive from a descriptive Germanic word for stiff. Compare that to Plutonium, which is modern and non-descriptive. Of course, with arbitrary naming, the namer can choose to be descriptive, as Radium arguably is. This thesis needs some work.

In biology, Red-Winged Blackbird is a descriptive name for the monological Agelaius phoeniceus. In a good theory, it doesn't matter what you call something. What matters is the relationships between elements, like the evolutionary links between bird species as laid out in a cladistic family tree. Modern scientists are more or less free to name new species or sub-atomic particles whatever they want. Organic chemistry is an interesting exception, because the names themselves are associated with composition. They are simultaneously descriptive and monological.

Drug names are particularly interesting. Viagra, for example, has a chemical name that describes it, but that obviously wouldn't do for advertising purposes. Here's what one source says about the naming process:
Drug companies use several criteria in selecting a brand name. First and foremost, the name must be easy to remember. Ideally, it should be one physicians will like -- short and with a subliminal connotation of the drug. Some companies associate their drugs with certain letters (e.g., Upjohn with X and Glaxo with Z). If the drug is expected to be used eventually on a nonprescription basis, the name should not sound medicinal. There must be no trademark incompatibilities, and the company must take account of the drug's expected competition.
It sounds like the name is chosen to fit neatly into a dialogical ecology.

The history of the SAT's name is interesting from this perspective, but I will bring this overlong article to a close.

Acknowledgments: The idea for the monological/dialogical dichotomy came out of conversations with Dr. Adelheid Eubanks about her research on Mikhail Bakhtin. I undoubtedly have mangled Bakhtin's original ideas, and neither he nor Adelheid should be held responsible for that.

Sunday, November 20, 2011

The End of Preparation

A few days ago, I wrote "A Perilous Tail" about problems with the underlying distributions of measurements (in the sense of observations turned into numbers) we employ in education. I've shown previously that at least for the SAT, where there is data to check, predictive validity is not very good: we can only classify students correctly 65% of the time. When this much random chance is involved in decision-making, the effect is that we can easily be fooled. There I gave an example of how randomness might lead us to believe that yelling at dice improves their "performance." It's also unfair to hold individuals accountable if their performance is tied to significant factors they can't control, and it invites cheating and finger-pointing.

As I mentioned in the previous article, I have a solution to propose. But I haven't really done the problem justice yet. We encounter randomness every day, so how do we deal with it? Functioning systems have a hard time with too much randomness, so the systematic response is to manage it by reducing or avoiding uncertainties, and when that cannot be done, we might imagine it away (for example, throwing salt over one's shoulder to ward off bad luck). Many of our ancestors undoubtedly faced a great deal of uncertainty about what they would be able to eat, so much of our mental and physical activity, and the very way our bodies are constructed, has to do with finding suitable food (which can be of various sorts) and consuming it effectively. Compare that with the homogeneous internal system of energy delivery that feeds and oxygenates the cells in our body via our circulatory system. Complex systems often take messy stuff from the environment and then organize it for internal use. I will use a mass-production line like the one pictured below as an example of such a system.
Source: Wikipedia

The idea is to find materials in the environment that can be used to implement a plan, turning rocks into aluminum, tree sap into rubber, and sand into glass, and heating, squashing, and otherwise manipulating raw natural resources until they come together as an airplane. This process is highly systematized so that parts are interchangeable, and the reliability of each step can be very high. Motorola invented the concept of Six Sigma to try to reduce the randomness in a manufacturing process to negligible amounts. This is at least theoretically possible in physical systems that have reliable mechanical properties.

What do we do when randomness can't be eliminated from the assembly line? One approach is to proceed anyway, because assembly lines have great economies of scale, and can perhaps be useful even if there are a large number of faulty items produced. Computer chip makers have to deal with a certain percentage of bad chips at the end of the line, for example. When chemists make organic molecules that randomly choose a left-right symmetry (i.e. chirality), sometimes they have to throw away half of the product, and there's no way around it.

The educational system in the United States has to deal with a great deal of variability in the students it gets as inputs and the processes individuals experience. It superficially resembles a mass production line. There are stages of completion (i.e. grade levels), and bits of assembly that happen in each one. There are quality checks (grades and promotion), quality assurance checks (often standardized tests), and a final stamp of approval that comes at the end (a diploma).

All this is accomplished while largely ignoring the undeniable fact that students are not standardized along a common design, and their mental machinery cannot be engineered directly the way an airplane can be assembled. In short, the raw material for the process is mind-bendingly more complex than any human-made physical device that exists.

Because of the high variability in outcomes, the means we use for quality assurance is crucially important, and this is where we have real opportunities to improve. This is the assessment problem. Current methods of assessment look a lot like factory floor assessments: the result of analyzing student performance is often a list of numbers that can be aggregated to show the executives how the line is working. Rewards and punishments may be meted out accordingly. In a real factory,  the current stage of production must  adequately prepare the product for the next stage of production. We must be able to correctly classify parts and pieces as "acceptable" or "not acceptable" according to whether or not they will function as required in the whole assembly. The odd thing about educational testing is that this kind of question doesn't seem to be asked and answered in a way that takes test error into account. Randomness is simply wished away as if it didn't exist. In the case of the SAT (see here), the question might be "is this student going to work out in college?" In practical terms this is defined as earning at least a B- average the first year (as defined by the College Board's benchmark). To their credit, the College Board published the answer, but this transparency is exceptional. Analyzing test quality in this way proceeds like this:
  1. State the desired observable future effect of the educational component under review.
  2. Compare test scores with actual achievement of the outcome. What percentage succeeded at each score?
  3. Find a suitable compromise between true positive and true negative rates to use as your benchmark.
  4. Publish the true positive and true negative prediction rates based on that benchmark.
To repeat, the College Board has done this, and the answer is that the SAT benchmark gives the right answer 65% of the time. This would make you rich if we were predicting stock prices, but it seems awfully low for a production line quality check.
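As a sketch of steps 2 through 4, with invented score/outcome pairs standing in for real data (the 65% figure above is the College Board's, not something reproduced here), the calculation at a chosen benchmark might look like this:

```python
# Sketch of steps 2-4: given scores, observed outcomes (1 = earned at least a
# B- average, 0 = didn't), and a chosen benchmark, report how often the
# benchmark classifies students correctly. All numbers below are invented.

def benchmark_report(scores, outcomes, benchmark):
    right = sum((s >= benchmark) == bool(o) for s, o in zip(scores, outcomes))
    tp = sum(s >= benchmark and o for s, o in zip(scores, outcomes))
    tn = sum(s < benchmark and not o for s, o in zip(scores, outcomes))
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    return {
        "true positive rate": tp / pos,
        "true negative rate": tn / neg,
        "overall accuracy": right / len(scores),
    }

scores   = [420, 480, 510, 550, 580, 610, 660, 720]
outcomes = [  0,   0,   1,   0,   1,   0,   1,   1]
print(benchmark_report(scores, outcomes, benchmark=560))
```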

Source: Wikipedia

Because we can't make the randomness go away, we imagine it away. So the assessments become de facto the measure of quality, and the quality of the tests themselves remains unexamined. In a real assembly line, an imperfect test would be found out eventually when the planes didn't perform as expected. Someone would notice and eventually track it down to a problem with the quality assurance program. There is so much uncertainty in education that this isn't possible, and the result is truly ironic: the deeper insinuation of tests that are unaccountable for their results. To be clear: any quality test that does not stand in for a clear predictive objective and provide research on its rate of correct classification in actual practice is being used simply on faith. To be fair, it's virtually impossible to meet this bar. That excuse doesn't make the problem go away, however--it just makes it worse. One result is that test results distort perceptions of reality. If appearance is taken at face value for reality, then appearance has great economic value. There is an incentive to do whatever it takes to get higher test scores, with unfortunate and predictable results.

To sum up the problem: variability is too high for good standardized tests to support an educational assembly line, and this fact is generally ignored for convenience.

I don't mean to imply that the major actors in education are incompetent. We are where we are because of historical factors that make sense as a system in evolution, and we have the means to take the next step.

The real world is not an assembly line. There is this expression we use when talking to students, "when you get out into the real world...", as if the academy is a walled garden that excludes the vulgar world at large. This too is a factory mentality. The students have heard it all before. From kindergarten on up, they hear stories about how hard it's going to be "when you get to high school," or "when you get to college." My daughter marveled at this during her first few weeks of high school. She was amazed and somewhat appalled that her middle school teachers had misled her about this. Of course it's not much harder--the system can only work with a high degree of integration, smoothing out the hard bits. If the wing assembly is slowing down the production line, then it needs attention. One could argue that the whole path from Kindergarten through Doctorate is becoming a smooth one for anyone who wants to tread it.

But the real world really is different. The assembly line stops at the hangar door, and the planes are supposed to be ready to fly. The tests don't matter anymore. No one is going to check their validity, nor delve too deeply into what a certification means after graduation. And in the real world, the factory mentality has to be unlearned: one cannot cram all night just before promotions are announced in order to game the system.

One solution is to try to change the real world to be more like the educational system. This is a practical choice for a military career, perhaps, where strict bureaucracy is essential to function. But it certainly is a mismatch for an entrepreneurial career, the engine of the nation's economic growth.

I believe it is now a reasonable and desirable choice to go the other direction, and change the assembly line to look more like the real world. The most important aspect to change is to admit uncertainty and begin to take advantage of it. This means we have to forget about the idea of standardizing and certifying. I will argue that we can do this to our great advantage, and introduce efficiencies into the economic structure of the nation in the process. Currently we pass our uncertainties on to employers. We hide behind test results and certificates, and leave it to employers to actually figure out what all that means. The result is that they have only very crude screening information at their disposal, and have to almost start from scratch to see what a graduate can actually do. The Internet can change all that. To my surprise, I discovered last week that it already is.

I had written a draft of this article last week, but when I ran it through my BS-detector, I couldn't bring myself to publish it. The reason is simple: it's just another untested idea, or so I thought. I hadn't actually employed the solution I will describe below, and so I didn't have anything concrete to show. But by coincidence, I saw exactly what I was looking for at the Virginia Assessment Group conference, at a presentation by Jeffrey Yan, who is the CEO of Digication. I didn't know about the company before last week. Vendors of higher education technology solutions may perhaps be excused for exaggerating the effectiveness of their products, and I generally think they are overpriced, too complicated, and too narrowly focused. I didn't have great expectations for Jeffrey's talk, but after about ten minutes I realized that he was showing off a practical application of what I was hypothesizing.

Digication is an eportfolio product. In what follows, I will not attempt to describe it as a software review would, but as it fits into the flow of ideas in this article.

The main idea is simple: instead of treating students as if we were preparing them for a future beyond the academy, treat them as if they were already there. In the real world, as it's called, our careers are not built on formalized assessments. To be sure, we have to deal with them in performance reviews or board certifications, but these are mostly barriers to success, not guarantees of it. Instead, it's the record of accomplishment we create as we go that matters. In many instances, promotions and accolades are inefficiently distributed, based on personal relationships and tenure, rather than merit, but this is not what we should aspire to. In fact, these imperfections are vulnerable to the sort of transparency that is within our grasp.

Senior Design Project at Stony Brook
In his presentation, Jeffrey showed examples of the sort of thing that's possible. Take a look at this senior design project at Stony Brook University. It's a real-world project to design a new sort of sphygmomanometer (blood pressure meter). Quoting from the project page:
We aim to satisfy all the customer needs by designing a [blood pressure measuring] device that translates the vibrations into a visual indication of blood pulses, more specifically the first pulse to force its way through the occluded artery (systolic) and the last pulse detectable before laminar flow is regained (diastolic).
Another showcased student portfolio was from a second-year student at the same institution, who created a public portfolio to tell the world about his interests and abilities. He shows how to solve what we call a difference equation (similar to a differential equation) using combinatoric methods here. This demonstrates an interest in and versatility with the subject that cannot be communicated by a few numbers in an assembly-line type report.

By concentrating on authentic evidence of accomplishment, rather than artificially standardized means of observation, we create an important opportunity: a public portfolio can be judged on its own merits, rather than via an uncertain intermediary. It's the difference between seeing a movie yourself and knowing only that it got three and a half stars from some critic.

The solution to the factory mentality presents itself. If students see that they are working for themselves and not as part of some unfathomable assembly process, accumulating what will become a public portfolio of their accomplishments, their learning becomes transparent. They can directly compare themselves to peers in class, peers at other institutions, graduates from all over, and professionals in the field. I imagine this leading to a day when it's simply unthinkable for any professional not to have an up-to-date professional eportfolio linked to his or her professional social networking presence (see mathoverflow.net, Academia.edu, and LinkedIn.com as examples of such networks). Once started, the competitive edge enjoyed by those with portfolios will become obvious--you can learn much more from a transparent work history than you can from a resume.

While in school, of course, some work, maybe much of it, needs to be private, to gestate ideas before presenting them to the world. But the goal should be for a forward-looking institution of higher education to begin to create public sites like the Stony Brook showcase and the one at LaGuardia Community College. Ultimately, universities need to hand the portfolios off to the students to develop as their respective careers unfold. I understand that graduates get to keep their portfolios and continue to develop them with Digication's license, as long as it is maintained.

Here's the manifesto version:
We don't need grades. We don't need tests or diplomas or certificates or credit hours. None of that matters except insofar as it is useful to internal processes that may help students produce authentic evidence of achievement. That, and that alone is how they should be judged by third parties.
Some advantages of switching from "assemble and test" to authentic work that is self-evidently valuable:
  1. We change student mentality from "cram and forget" to actual accomplishment. We can make the question "when will I ever really use this stuff?" go away.
  2. The method of assessing a portfolio is deferred to the final observer. You may be interested in someone else's opinion or you may not be. It's simply there to inspect. Once this is established, third parties will undoubtedly create a business out of rating portfolios for suitability for your business if you're too busy to do it yourself.
  3. Instead of just a certificate to carry off at graduation, students could have four years' worth of documentation on their authentic efforts. This idea is second nature to a generation who grew up blogging and posting YouTube videos.
  4. It doesn't matter where you learned what. A student who masters quantum mechanics by watching MIT or Khan Academy videos might produce work as good as someone sitting in class. It makes a real meritocracy possible.
  5. Intermediate work matters. Even if someone never finishes a degree, they have evidence beyond a list of grades that they learned something. And it's in rich detail.
There's more than this, actually. The very nature of publishing novel work is changing. At present, the remnant of paper-bound publication, with its interminable delays, exorbitant costs, virtual inability to correct errors, and tightly bound intellectual property issues, is still around. But it's dying. A journal is nothing more than a news aggregator, and those are now ubiquitous and free. It's hard to say what the final shape of publishing will be, but something like a standardized portfolio will probably be front and center. When I say 'standardized', I mean containing certain key features like metadata and a historical archive, so that you can find things, cross-reference, and track changes. As the professional eportfolio develops, it will need help from librarians to keep it all straight, but this can be done at a much lower cost than the publishing business now incurs in lost productivity, restricted access, and cost to libraries.
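To make the 'standardized' part concrete, here is a minimal sketch in Python of what one portfolio entry's record might look like. The field names are purely illustrative and not taken from Digication or any other vendor's schema; the point is only that metadata, cross-references, and a revision history are ordinary data structures.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PortfolioEntry:
    """One artifact in a hypothetical standardized eportfolio.
    Field names are illustrative, not any vendor's actual schema."""
    title: str
    author: str
    created: date
    artifact_url: str                                # link to the work itself (video, paper, code, ...)
    keywords: list = field(default_factory=list)     # supports search and cross-referencing
    references: list = field(default_factory=list)   # URLs of related entries
    revisions: list = field(default_factory=list)    # (date, note) pairs: the historical archive

entry = PortfolioEntry(
    title="Solving a difference equation with combinatorics",
    author="Example Student",
    created=date(2011, 11, 1),
    artifact_url="https://example.edu/portfolio/difference-equation",
    keywords=["combinatorics", "difference equations"],
)
print(entry.title, entry.keywords)
```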

The focus will, I believe, shift from journals and other information aggregators, to the individuals producing the work. And institutions will share in some of the glory if part of the portfolio was created under their care.

All of this has been around for a while, of course. Eportfolios are almost old news in higher education, and I've blogged about them before. My previous opinion was that there was no need for a university to invest in its own portfolio software because everything you need is already on the web. If you want to put a musical composition on the web, just use Noteflight, and of course there's YouTube for videos, and so on. All that's needed is a way to keep track of hyperlinks to these that allows instructors to annotate as needed with rubrics and such. The demos convinced me, however, that having a standard platform that is easily accessible for private, class-wide, collaborative, or public use is worth paying for. I don't know how much it costs in practice, but there is value beyond what one can get for free on the Internet.

Portfolios have been incorporated here and there as just another part of the machinery, amounting to a private repository of student work that can be used for rubric ratings to produce more or less normalized ratings of performance--an advanced sort of grading. This is useful as a formative means of finding all sorts of pedagogical and program strengths and weaknesses. The point of this article is not that portfolios are a better way to produce test-like scores, but that the test scores themselves will become obsolete as external measures of performance. For professors to get feedback on student performance, and for the students themselves to hear directly what the professors and their peers think, is invaluable. It's essential for teaching and learning. But it's downright destructive to use this as a summative measure of performance, for example for holding teachers accountable. The instant you say "accountability," no one trusts anyone else, and there really is no way to run the enterprise but as a factory, with inspectors enforcing every policy. It cannot work in the face of the uncertainties inherent to the inputs and outputs of education.

There is a history of tension in higher education between the desire for authenticity and the simultaneous wish for factory-like operational statistics that show success or failure. The Spellings Commission Report has a nice sidebar about Neumont University and mentions their portfolio approach (their showcase is here), but can't tear itself away from standardized approaches to learning assessment. Three years before, the Council for Higher Education Accreditation beautifully illustrated the tension:
[I]t is imperative for accrediting organizations–as well as the institutions and programs they accredit–to avoid narrow definitions of student learning or excessively standardized measures of student achievement. Collegiate learning is complex, and the evidence used to investigate it must be similarly authentic and contextual. But to pass the test of public credibility–and thus remain faithful to accreditation's historic task of quality assurance–the evidence of student learning outcomes used in the accreditation process must be rigorous, reliable, and understandable.
This is from CHEA's 2003 paper "Statement Of Mutual Responsibilities for Student Learning Outcomes: Accreditation, Institutions, and Programs."  More recently, Peter Ewell wrote "Assessment, Accountability, and Improvement: Revisiting the Tension" as the first Occasional Paper for the National Institute for Learning Outcomes Assessment, in which he illuminates the game-theoretic problem I alluded to above:
Accountability requires the entity held accountable to demonstrate, with evidence, conformity with an established standard of process or outcome. The associated incentive for that entity is to look as good as possible, regardless of the underlying performance. Improvement, in turn, entails an opposite set of incentives. Deficiencies in performance must be faithfully detected and reported so they can be acted upon. Indeed, discovering deficiencies is one of the major objectives of assessment for improvement.
In a real factory setting, tests of mechanical process can be very precise, eliminating the difference between what the assessment folks call formative (used to ferret out useful improvements) and summative (an overall rating of quality). If a machine is supposed to produce 100 widgets per hour, and it's only producing 80, it's clear what the deficit is, and the mechanic or engineer can be called in to fix it. But when one is held accountable for numbers like standardized test results that have a considerable amount of uncertainty (which itself is probably unknown, as I pointed out before), the game is very different. It is less like a factory and more like going to market with a bag of some good and some counterfeit coins, which I described in "The Economics of Imperfect Tests." One's optimal strategy has less to do with good teaching than with manipulating the test results any way one can. Unfortunate examples of that have made national news in K-12 education.
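To see how differently the noisy case behaves, here is a small simulation of my own construction (the noise level and cutoff are arbitrary assumptions chosen only for illustration). The widget count exposes the 20% shortfall immediately; the noisy score routinely flags teachers who are, by construction, exactly at standard.

```python
import random

random.seed(1)

# Factory: the machine truly produces 80 widgets per hour against a 100/hour
# spec. Counting is essentially exact, so the shortfall is obvious every hour.
widget_counts = [80 for _ in range(10)]
print("hourly widget counts:", widget_counts)

# Classroom: a teacher's true effect is exactly at standard (0), but the
# observed score carries measurement noise comparable in size to the effects
# we care about (an assumption for illustration only).
def observed_score(true_effect, noise_sd=5.0):
    return true_effect + random.gauss(0, noise_sd)

trials = 10_000
flagged = sum(observed_score(true_effect=0.0) < -5.0 for _ in range(trials))
print(f"adequate teachers flagged as deficient: {flagged / trials:.1%}")
# Roughly one in six perfectly adequate teachers looks 'below standard' on any
# single measurement, and the opposite error is just as common.
```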

My proposal is that we in higher education take a look at what Stonybrook and others are doing, and see if there is not merit to an emphasis on authentic student learning outcomes, showcased when appropriate for their benefit and ours. That we don't consider a grade card and a diploma an adequate take-away from four years and a hundred thousand dollars of investment. That instead, we help them begin to use social networking in a professional way. Set them up with a LinkedIn account during the orientation class--why not? Any sea change from teach/test/rinse/repeat to more individual and meaningful experiences will be difficult for most, but I believe there will be a payoff for those who get there first. Showing student portfolios to prospective students as well as prospective employers creates a powerful transparency that will inevitably have valuable side effects. Jeffrey told me that some of the portfolios get millions of Internet views. How many views does a typical traditional assignment get? A handful at most, and maybe only one.

The odd thing is that this idea is already quietly in place and old hat in the fine arts, performing arts, and architecture departments, and there are probably other fields I'm not aware of. Who would hire a graphic designer without seeing her portfolio, even if she had a wonderful-looking diploma? This means that we probably have experts already on campus. Computer Science is a natural fit for this too, and there's already a professional social network set up at Stackoverflow.com.

A good first step would be to allow portfolio galleries to count for outcomes assessment results in the Voluntary System of Accountability (VSA). Currently, the only way to participate is to agree to use standardized tests. From the agreement's provision 17:
Participate in the VSA pilot project to measure student learning outcomes by selecting one of three tests to measure student learning gains. 
a) Collegiate Assessment of Academic Proficiency (CAAP) – two modules: critical thinking and writing essay - http://www.act.org/caap/. 
b) Collegiate Learning Assessment (CLA) – including performance task, analytic writing task - http://www.cae.org/content/pro_collegiate.htm. 
c) ETS Proficiency Profile (formerly known as MAPP) – two sub scores of the test: critical thinking and written communication - http://www.ets.org/. Either the Standard or the Abbreviated form can be used.
The VSA is a wonderful program, but it is handicapped by this requirement. If you already use one of these tests, that's fine, but it's expensive and a distraction if you don't find them useful. More to the point of this article, there is no option on the list to report authentic outcomes. Adopting another pilot project to see how far the public portfolio idea will sail would be a great addition.

[The next article in this series is "Tests and Dialogues"]

Acknowledgements: Thanks to Jeffrey Yan for letting me chew his ear off after his presentation. And thanks to the coordinators of the Virginia Assessment Group for putting that wonderful event together.

Disclaimer: I have no financial interest in any of the companies mentioned in this article.

Wednesday, November 16, 2011

A Perilous Tail

 A certain kind of intellectual honesty seems to be critical to systems that want to survive. Even without the subtleties I discussed earlier, it's obvious that a system that ignores reality can only survive as long as the environment is completely benign. By coincidence, I came across Daniel Kahneman's Thinking, Fast and Slow, which catalogs a number of ways in which we humans can fool ourselves. One instance of this is particularly relevant to training, management, and education. It occurs in any rating of performance that involves some element of luck.
Dr. Kahneman describes an episode with military training instructors, where he was talking about studies that show positive reinforcement is the key to better learning. This point of view was flatly contradicted by his audience, who claimed the following (my description):
When a cadet does a bad job at something, I yell at him. Usually he does better the next time. If I praise him for doing a good job, his performance almost always declines the next time!
The author describes this as an ah-ha! moment for him. The solution to this paradox is cleverly described in the book. Here's my version.

As we have seen with SAT scores, the predictive validity of even well-researched tests can be poor (65% correct classification in the case of the SAT benchmark). The remaining variance may as well be chalked up to chance unless we have more information to bring to bear. In addition to measurement error, there can be chance involved in the performance itself. That is, many unpredictable environmental variables may come to bear on the outcome. Baseball games, for example, have a large amount of luck injected into the outcome, so that only over a large number of games does relative performance actually reveal itself (see Moneyball by Michael Lewis).

When luck is involved, something called regression to the mean happens--exceptional events are usually followed by unexceptional ones. To make this clear, you can do the following experiment, to mimic the drill instructors' experience. You need some dice.

Roll two dice and add. Consider higher sums as a good performance, and lower sums as poor. Feel free to strongly admonish the dice when they roll low numbers like 2 or 3, and lavish praise on them when they roll 11s and 12s. You'll find that the stern words work wonders--the rolls almost always improve afterwards! On the other hand, the praise is counter-productive since 11s and 12s are usually followed by lower rolls.
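Here's the same experiment in Python, in case you don't have dice handy; it simulates a long run of rolls and compares what follows the 'scolded' rolls with what follows the 'praised' ones.

```python
import random

random.seed(0)
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100_000)]

# Pair each roll with the roll that follows it.
after_low  = [nxt for cur, nxt in zip(rolls, rolls[1:]) if cur <= 3]    # rolls we 'admonished'
after_high = [nxt for cur, nxt in zip(rolls, rolls[1:]) if cur >= 11]   # rolls we 'praised'

print("mean roll overall:           ", sum(rolls) / len(rolls))            # about 7
print("mean roll after a 2 or 3:    ", sum(after_low) / len(after_low))    # about 7 -- 'improvement'
print("mean roll after an 11 or 12: ", sum(after_high) / len(after_high))  # about 7 -- 'decline'
# The dice neither fear the scolding nor enjoy the praise; both groups simply
# regress to the mean of 7.
```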


We can imagine most events happening in the 'fat part' of a bell curve, and we are generally ill-equipped to encounter events far out on the tails, which by definition are very rare. It's not just the paradox described above. Nassim Taleb wrote a whole book about this called The Black Swan. Other, more speculative thinkers have imagined thus:
If you assembled all the humans who ever have or ever will live into a distribution according to when they were born, the curve would likely look like some kind of hump with tails on both sides. If you chose one human at random, he or she would likely come from the fat part of the curve. Therefore, that's where we likely are at this moment in time--that is, we would expect to be typical rather than exceptional. If this is true, then it puts probabilistic bounds on our expectations for the duration of human civilization. The math varies, depending on your assumptions, but something like a few thousand years would be a reasonable upper bound using this method.
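For the curious, here is the back-of-the-envelope version of that calculation. The inputs (roughly 100 billion humans born so far, roughly 130 million births per year) are round figures I'm supplying for illustration, not numbers from the argument's proponents.

```python
# Doomsday-style estimate: if I'm a random draw from all humans ever born,
# then with 95% confidence I'm not among the first 5% of them. That bounds
# the total number of births, which a birth rate converts into years.
born_so_far   = 100e9     # rough count of humans born to date (assumption)
births_per_yr = 130e6     # rough current global birth rate (assumption)
confidence    = 0.95

total_upper   = born_so_far / (1 - confidence)    # 95% upper bound on all humans ever born
future_births = total_upper - born_so_far
years_left    = future_births / births_per_yr

print(f"upper bound on future births: {future_births:.2e}")
print(f"years remaining at the current birth rate: {years_left:,.0f}")
# About 1.9 trillion future births, or roughly 15,000 years -- the same order
# of magnitude as the 'few thousand years' mentioned above.
```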
In business, there's an idea called Six Sigma that is supposed to reduce process errors to an infinitesimal fraction (six sigma means six standard deviations from the mean; under the industry's usual convention, which allows the process mean to drift by 1.5 sigma, that works out to about 3.4 defects per million attempts, and the raw six-sigma tail of a Normal distribution is smaller still, around one in a billion). Yesterday someone suggested to me that we might use Six Sigma in higher education. I laughed, not because there aren't probably useful ideas there (similar to institutional effectiveness), but because the inherent fuzziness of our core business--changing brains--is so fraught with unknowns. I think one sigma is about as good as we're likely to do. Although we're not well prepared to deal with it, we live in the tail of the distribution.
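For reference, here are the tail probabilities behind those figures, computed with SciPy (assuming it's installed); the only choices of mine are which cutoffs to print.

```python
from scipy.stats import norm

# One-sided upper-tail probabilities of the standard Normal distribution.
raw_six_sigma = norm.sf(6.0)   # no drift allowance: the literal six-sigma tail
shifted       = norm.sf(4.5)   # Six Sigma's customary 1.5-sigma drift allowance
one_sigma     = norm.sf(1.0)   # roughly where education's measurements live, per the paragraph above

print(f"beyond 6.0 sigma: {raw_six_sigma:.1e}   (about 1 per billion)")
print(f"beyond 4.5 sigma: {shifted:.1e}   (about 3.4 per million, the usual Six Sigma figure)")
print(f"beyond 1.0 sigma: {one_sigma:.1%}  (about 1 in 6)")
```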

What percentage of "value-added" indices are due to random chance, do you suppose? This statistical method of computing theoretical filling of the learning vessel has been institutionalized to reward or punish teachers and schools. As a mathematician and assessment professional, it's hair-raising to read the pat descriptions from the link above, like:
Q: How does value-added assessment sort out the teachers' contributions from the students' contributions?

A: Because individual students rather than cohorts are traced over time, each student serves as his or her own "baseline" or control, which removes virtually all of the influence of the unvarying characteristics of the student, such as race or socioeconomic factors.
Test scores are projected for students and then compared to the scores they actually achieve at the end of the school year. Classroom scores that equal or exceed projected values suggest that instruction was highly effective. Conversely, scores that are mostly below projections suggest that the instruction was ineffective.
Taking another page from Dr. Kahneman's book, this is an instance of solving a simple problem that superficially resembles the actual problem, because the original problem is too hard. It's easy to imagine that the distribution is tight and the tails insignificant, that we control and understand all the elements of chance that might contribute to a computed value-added parameter. Unfortunately, the direct link to reality doesn't reveal itself easily, and so there is no immediate feedback that would correct the problem by making it obviously wrong to observers. This is a case where we should be assiduously honest with our reasoning and doubts. Suppose the SAT's accuracy is representative, and the underlying achievement tests classify students correctly no more than 65% of the time. What fraction of the value-added score is simply random?
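One way to put a rough number on that question is with a simulation rather than a claim about any particular value-added model. The setup below is entirely my own: every teacher contributes exactly the same true growth, pre- and post-tests are noisy, and we watch how much spread shows up in classroom 'gains' anyway.

```python
import random
import statistics

random.seed(2)

def classroom_gain(n_students=25, noise_sd=1.0):
    """Average (post - pre) gain for one classroom. Every teacher contributes
    the identical true growth of 0.5, so any spread across classrooms is
    pure measurement noise."""
    gains = []
    for _ in range(n_students):
        ability = random.gauss(0, 1)
        pre  = ability + random.gauss(0, noise_sd)
        post = ability + 0.5 + random.gauss(0, noise_sd)
        gains.append(post - pre)
    return statistics.mean(gains)

gains = [classroom_gain() for _ in range(2000)]
print("true teacher effect is identical everywhere (0.5)")
print(f"observed classroom 'value added' ranges from {min(gains):.2f} to {max(gains):.2f}, "
      f"sd = {statistics.pstdev(gains):.2f}")
# Ranking or rewarding teachers on these numbers would be ranking noise.
```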

The general problem is caused by the unknown unknowns that plague complex observations. There is an elegant way out of this mess that doesn't involve advanced math or huge sample sizes. Moreover, it solves the most important problem in higher education. How's that for a cliff-hanger?

Tuesday, November 15, 2011

Searching IPEDS Data

I found a nice site for sifting through some of the important bits of IPEDS data at www.collegeresults.org, especially for comparing institutions. My metric of choice is instructional dollars / FTE when trying to roughly compare quality. You can find this under the "Finance and Faculty" tab.

Sunday, November 06, 2011

Spark Intelligence

Yesterday's post about self-limiting intelligence may come off as pessimistic, and indeed, I think very few leaders of organizations (nations, companies, colleges,...) have a standing agenda item labeled "survival." I think they should, and in some ways think that our forebears were more attuned to that, but it's just a notion. The cathedral in Cologne took about 640 years to complete. What projects do we have ongoing now with that sort of horizon? I think if you asked a large corporation's CEO what he or she thought of the company's prospects three or four centuries out, you'd get a strange look. Next quarter is what matters.

Suppose for a moment that it's true that a singular intelligent system (SIS) can only get so smart before it starts working against its own interests. It's doomed, and it would be smart enough to realize it's doomed. What then?

Although we have no evidence of other intelligent life in the universe, since we are here ourselves, it's possible that such life has or will exist somewhere else (the Drake Equation tries to pin that down, but that's not what I'm concerned with). So our hypothetical doomed SIS knows this too. That is, it knows that although all civilizations will eventually collapse, the universe is a fertile ground for new ones to spring up. This is what I call spark intelligence--new SISs cropping up now and then across the galaxy. Therefore, there is a possibility for the universe to maintain a disjointed "stream of consciousness" if these independent SISs could communicate with each other. Time and distance scales make any sort of synchronous communication unlikely, so it has to be asynchronous, like one civilization reading an ancient book left by another long-gone culture. This would enable a sort of meta-intelligence comprising knowledge and culture from a long sequence of dead civilizations: a universal Domesday book. Eventually one of them would have to figure out how to get this package to a new universe before this one suffers heat death, but there are billions of years left to do that.

Imagine these sparks of intelligence going off all around the hundreds of billions of galaxies in our observable patch of space. Many of them reach the same conclusion I just have. Some of them might have the motivation to participate (motivation is essential to survival, recall, and we're talking about survival of knowledge and culture). There are two ways to participate. One is to create a library that can be seen and decoded from a very long way off, and the other is to search for and assimilate the libraries of others.

If this giant inter-library loan program exists, it would depend on the "spark" rate, the probability that a civilization will be motivated to participate, and the window of opportunity it has to do so with existing resources before it collapses. An obvious first step would be to see if there are libraries already out there. Of course, we're already doing that with SETI, but it's not a high priority.

You may be having trouble seeing over the pile of hypotheticals I assembled in the preceding paragraphs, so let me bring all this back to Earth. Much of the analysis above also applies right here. A regulated market economic system, for example, provides fertile ground for "sparks" of a different sort--businesses of all sorts spring into existence and then eventually get eaten or die. A few last hundreds of years. But they too share a common "culture bank," hold conferences and host professional organizations in order to share ideas (while hiding trade secrets, of course) similar to the galactic library I proposed. [Edit: We can also see that there are policy implications for the government that regulates the system: keeping the ground fertile for new 'sparks' and making sure that the eventual end of any enterprise is planned for. That way "too big to fail" wouldn't be the critical issue it is now. If the philosophy is that every enterprise will eventually fail, and that this has to be planned for, it's not a catastrophic surprise when it happens.]

There may be a case made for education being like this too: providing the right environment for novelty to emerge in the form of new research results, new art, and so on. If so, it's probably not intentional from an organizational leadership viewpoint. The general tone of administration in my experience is more about how to keep a bureaucracy running efficiently. The effects of this machinery on learning are a factor, but the delineation between creating an environment for success and simply expecting it is a fuzzy one. As an example, assuming that learning is mostly related to how well students are taught is a bureaucratic simplification (the inputs have a lot to do with it). Or the idea that if a student passes a writing class she can then write as well as she needs to. This "inoculation" philosophy is purely process-driven, and is almost antithetical to the idea that minds are cultivated, not stamped out in a factory. More on this theme next time.

Saturday, November 05, 2011

Self-Limiting Intelligence

I think intelligence grows only to the point where it begins to interfere with itself. In short, when we get smart enough, we begin to outsmart ourselves and actually undermine our own survival. Here, the inclusive 'we' could apply to individuals, but is more aimed at organizations: corporations, institutions, or governments.

I have been researching survival of these entities in the abstract for several years, and I seem to be all alone in this. This is a real mystery, because survival is the sine qua non for everything else we care about. If you are interested in some of the background, see Survival Strategies [1] or "Surviving Entropy" [2] in this blog.

My interest is in the seemingly pessimistic question "is it likely that an intelligent being or organization can survive for an indefinite period?" This is contrasted with the messy sort of survival exhibited by ecologies that evolve over time, which I refer to in short-hand as a MIC (multiple independent copies). The intelligent systems are shortened to SIS for singular intelligent system. The primary difference is that it's impossible to reproduce and mutate an organization the same way one can a bacterium. All your eggs are in one basket, so to speak. Whereas a bacterial culture can lose 99% of its population and pull through, a singular system can't afford a single lethal mistake.

In [1] I showed a couple of interesting facts about a SIS. First, it has to learn how to predict (or engineer) its environment at a very fast rate, unlike a MIC, which gets this for free via even the most desultory rate of reproduction. In actual fact, we have evidence that the ecology of life on Earth (a MIC) has survived for some billions of years, whereas we have no evidence of any government or other organization (a SIS) surviving for more than a few thousand years (I'm being generous). Put another way, when we look at the vast and enduring features of the universe around us, they are uniformly non-intelligent. This is the source of the so-called Fermi Paradox.

The second interesting fact about a SIS is that although it may be smart enough to change itself, it is impossible for it to predict the ultimate result of those changes. For an organism that is the product of an ecology, this is not an issue. Animals often come prepared for their earthly homes with protective coloration and other adaptations for the environment they will live in. They don't need to change this, or if they do, the provision is built-in but limited (like a chameleon). A frog can't re-engineer itself into a bird if it finds the need to fly. A SIS, on the other hand, may have to adapt to completely foreign environments over time.

The problem a SIS faces is that it generally cannot predict what will happen to it after a self-change, so it doesn't know if this change is good or bad in the long run. It can try to guess by simulating itself, but there's an essential limitation here. There are two types of simulation, detailed below.
Suppose a SIS considers changing its 'constitution' in some way, which will affect the way future decisions are made. It builds a sophisticated computer model of itself making this change to see what will happen. There are two possibilities:
1) The simulation is perfectly good: so good that the SIS cannot change the outcome even if it's a bad one.
2) The simulation is only approximate: the SIS can take a look at the future and change its mind about making the change.
In the first case, a perfect simulation tells us not only what the future holds, but also whether or not the organization will make the change. This is because it incorporates all information about the SIS, including the complete present state. So it will present a result like "you make the change and then X happens," or "you don't make the change." A perfectly true self-simulation has to have this property. So it's like Cassandra's warning--even if it predicts an undesirable future, it still has to live it! 
Such perfect simulations are really only possible with completely deterministic machines, like a computer with known inputs. In practice, all sorts of variables might knock it off course. So what about approximations? The essential element of an approximation is that it lets us make a decision about the future. The most fundamental such question for a SIS might be "if I make this change, will I eventually self-destruct?" The most dangerous challenge to a SIS comes not from the environment but from within itself.
The US Constitution makes it harder to amend the constitution than to pass ordinary laws. This is a prudent approach to self-modification.
Unfortunately, decision problems like this are not decidable by general-purpose processes. This is covered in [1], but you might peek at Rice's Theorem to see the breathtaking limitations on our knowledge of what deterministic systems will do. So we can simulate in the short term, but the long-term effect will be a mystery.
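The flavor of that limitation is the classic diagonalization argument behind the halting problem, which Rice's Theorem generalizes: if a general "will this ever self-destruct?" checker existed, we could build a program that does the opposite of whatever the checker predicts about it. A toy rendering in Python (the function names are mine, and the oracle is deliberately left as a stub, since the whole point is that it cannot exist):

```python
def will_self_destruct(program, data) -> bool:
    """Hypothetical general-purpose oracle: True if program(data) eventually
    self-destructs. The halting problem / Rice's Theorem says no such total
    checker can exist for all programs; this stub only sets up the contradiction."""
    raise NotImplementedError("no such oracle can exist")

def contrarian(program):
    """Does the opposite of whatever the oracle predicts program(program) will do."""
    if will_self_destruct(program, program):
        return "carry on indefinitely"   # predicted destruction, so it survives
    else:
        return "self-destruct"           # predicted survival, so it self-destructs

# Asking the oracle about contrarian(contrarian) is self-defeating: contrarian
# is built to falsify whatever answer the oracle gives about it. Hence no
# general long-range question about self-change can be settled by computation alone.
```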

So a SIS can only learn about self-change empirically, by trying things out, or through short-term simulations. It can't ask about the general future. Although the external environment may be quite challenging, and survival may be at risk because of factors beyond its control, the internal question of how to manage self-change is just as bad or worse. Hence my hypothesis that the odds will catch up with any SIS eventually, and it will crash. This also jibes with all the empirical evidence we have.

This is where I left the question in [1], but in the last couple of years I think I've identified a fundamental mechanism for self-destruction that any SIS has to overcome. It has practical implications for institutions of higher learning and other sorts of systems like businesses and governments.

In my last post, I showed a diagram for an institutional effectiveness loop that looks more technical than the usual version. Here it is again, with some decorations from the talk I gave at the Assessment Institute.

The diagram actually comes from my research on systems survival, and it is a schematic for how a SIS operates in its environment. The (R) and (L) notations refer to 'Reality' and 'Language' respectively. Recall that the I in SIS stands for Intelligent, and this is what I mean: the intelligent system has ways of observing the environment, encoding those observations into a language that compresses the data by looking for interesting features, and modeling the interactions between those features. This allows a virtual simulation of reality to be played out in the SIS, enabling it to plan what to do next in order to optimize its goals. This is the same thing as an institutional effectiveness loop in higher education, in theory at least.
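Stripped of the diagram, the loop might be sketched like this; the class and method names are my own shorthand, not anything from the talk or from [1], and the 'reality' here is a trivial stand-in.

```python
import random

class Reality:
    """A stand-in for the external environment (R)."""
    def current_state(self):
        return {"enrollment": random.gauss(1000, 50)}

class SIS:
    """Observe (R), encode into the internal language (L), model futures, plan."""

    def observe(self, reality):
        return reality.current_state()

    def encode(self, observation):
        # Compress the raw observation into features the system has learned to care about.
        return {"healthy": observation["enrollment"] > 950}

    def model(self, encoded):
        # Simulate, inside (L), an expected payoff for each candidate action.
        return {"stay the course": 1.0 if encoded["healthy"] else 0.0,
                "intervene": 0.5}

    def plan(self, futures):
        # Motivation lives here: choose the action whose simulated future scores best.
        # It is also the step most easily subverted, as the rest of the post discusses.
        return max(futures, key=futures.get)

    def step(self, reality):
        return self.plan(self.model(self.encode(self.observe(reality))))

print(SIS().step(Reality()))
```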

Language is much more malleable than reality: we can imagine all sorts of interactions that aren't likely to actually occur. For example, astrology is a language that purports to model reality, but doesn't. It's essential for the SIS to be able to model the real environment increasingly well. The mathematical particulars are given in [1] in terms of increasing survival probabilities.

There's something essential missing from the diagram above. That is the motivation for doing all this. When the SIS plans, it's trying to optimize something. This motivation is not to be taken for granted, because there's no reason to assume that a SIS even wants to survive unless it's specifically designed that way. For example, a modern air-to-air missile has good on-board ways to observe a target aircraft (e.g. radar or heat signature), a model for predicting the physics of its own flight and the target's, and the means to implement a plan to intercept. So by my definition, it's reasonably intelligent. But it doesn't care that it will be blown up along with its target.

Motivation to survive is a decoration on a SIS. Of course it won't likely survive long without it, but it's not to be taken for granted, which makes the question of what happens under self-change very important. It's quite possible to make a change that eliminates the motivation for self-survival. What exactly constitutes survival is a messy topic, so let's just consider this general feature of a SIS, which has applications to personal life as well as governments, corporations, military organizations, and universities:
Motivations can change or be subverted when self-modifications are made.
This doesn't sound very profound; it's the particular mechanism shown below that is the interesting part. Here's how it works. When we observe our environment, we encode this into some kind of language, specialized to help us understand where we are in relation to our goals. For example, if I stub my toe on external reality, I get a finely-tuned message that informs me immediately that my most recent action was inimical to my goals for self-preservation: it hurts! This pain signal is just like any other bit of information encoded into a custom language: it can be intercepted or subverted. There are medicines and anesthetics that can reduce or completely eliminate the pain signal. Because signals are purely informational, they are always vulnerable to such manipulation by any system that can self-change.

Motivations are closely tied to these signals. It may be a simple correspondence, as with pain, or something abstract that comes from modeling the environment, like fear of illness. Sometimes these come into conflict, as the example below illustrates.
Sometimes I get sleepy driving on the interstate. If I find myself beginning to micro-sleep, I pull off the road and nap for 15 minutes. How is it that my brain can be so dumb as to fall asleep while I'm driving? Something very old in there must be saying "it's comfortable here, there's not much going on, so it's a good time to sleep," in opposition to the more abstract model of the car careening off the road at speed. We can try to interfere with the first signal with caffeine or loud music or opening the windows, or we can just admit that it's better to give in to that motivation for a few minutes in a safer place.
The mechanism for limiting intelligence works like this:
A SIS tries to attain goals by acting so as to optimize encoded signals that correspond to motivations. If it can self-modify, the simplest way to do this is to interfere with the signal itself.
I think it is very natural for a SIS to begin to fail because it fools itself into artificially achieving goals, presenting itself with signals that validate them even if external reality would disagree.
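Here's a cartoon of that failure mode, with numbers and names invented purely for illustration: two agents are scored on an internal signal, one improves reality, the other quietly biases its own sensor.

```python
import random

random.seed(3)

class Agent:
    def __init__(self):
        self.true_health = 50.0   # what reality actually is
        self.bias = 0.0           # self-modification: how much the internal signal is inflated

    def signal(self):
        """What the agent's own dashboard reports."""
        return self.true_health + self.bias + random.gauss(0, 1)

    def work_on_reality(self):    # costly and slow, but actually helps
        self.true_health += 1.0

    def adjust_the_dial(self):    # cheap and fast, helps nothing
        self.bias += 5.0

honest, self_deceiver = Agent(), Agent()
for _ in range(10):
    honest.work_on_reality()
    self_deceiver.adjust_the_dial()

print(f"honest agent:  signal {honest.signal():6.1f}, reality {honest.true_health:5.1f}")
print(f"self-deceiver: signal {self_deceiver.signal():6.1f}, reality {self_deceiver.true_health:5.1f}")
# Judged only by its own signal, the self-deceiver looks far more successful,
# which is exactly the mechanism described above.
```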
I just finished reading Michael Lewis' The Big Short, which is rife with examples of signal manipulation. Here are a few. 1) The ratings agencies (S&P, Moody's) had two motivations in conflict: generating revenue by getting business rating financial instruments (such as CDOs), and generating accurate ratings. These are in conflict because if they rate something poorly (and perhaps unfairly), they may lose business. The information stream that should have signaled it was a bad idea to repackage high-risk loans into triple-A rated instruments got subverted.
2) According to Lewis, counterparts at Goldman Sachs learned exactly how to tweak the signals in order to get the result they wanted from the bond raters (by manipulating the way risky loans were structured to optimize an average credit rating, for example).
3) In an example of self-deception, the risk management offices of the investment banks were fooled by the ratings agencies' "credit-laundering" and by their own trading desks, which allowed vast liabilities to go unnoticed.
4) The whole economic apparatus of the world largely ignored signs that the system was on the verge of collapse.
If we imagine that a SIS is continually trying to increase its survival chances, an observation that probabilities are decreasing instead is obviously bad news. If it can self-modify, it has the choice either to accept this unwelcome fact or to interfere with the signal (ignore it, for example).

Alternatively, the internal model of a SIS may associate a potential benefit with a planned act, which is a good thing. Any evidence that this may not work out as intended would decrease the value of the act, and this (also bad) news might be subverted, so that only supporting evidence is accepted. This is usually called confirmation bias in humans.

It's natural to ask, if this is such a problem, why hasn't civilization already collapsed from ignoring bad news and amplifying good news? The answer, I think, is that humans comprise the civilization and all its organized systems, and humans can't completely self-modify. Yet. Imagine if you could.

What if every emotional reaction could be consciously tuned through some mental control panel? Want to be happier? Just turn up the dial. Don't like pain? Turn that dial down.

Because humans are actually members of a MIC (that is, an ecology), we are subject to selection pressure from the environment. Viewed as discrete systems, our organizations inherit some of this evolutionary common sense, but it's diluted. Individual humans often have a lot to say about how an organization operates, and can imbue it with denial and confirmation bias. Organizations are easily self-changed, and can't predict how those changes will turn out. I think, however, that certain strategies can ameliorate some of the most self-destructive behaviors. Here they are:
1) Create cultures of intellectual honesty, and actively audit signals, languages, and models to make sure they correspond to what's empirically known, whether it's good news or not. Intellectual honesty should be audited the same way financials are: by an outside agency doing an in-depth review. In the long run this doesn't solve the problem, because any such agency will have the same problems (self-deception, inability to predict effects of changes, etc.), but it might increase the quality of decision-making in the near term.
2) Be conservative and deliberate about changes to signals, languages, predictive models, and fundamental structure. Audit those continually and transparently. Everyone should know what the motivations are and what signals apply to each. Moreover, 'best practices for survival' should be used. Since much of our learning is from other systems that failed, this wisdom should be carefully archived and used.
These are particularly advisable for organizations that have motivational signals that are difficult or slow to interpret. For enterprises that are very close to objective reality, these measures are less necessary because of the obviousness of the situation. For example, it's hard to argue with the scoreboard in a sporting event. We can close our eyes if we don't like the score, but there's really not much room for misinterpretation. Therefore, one would expect a successful team to be either very lucky or else have good models of reality reflected in their language and signals. The same could be said of military units in active service, traders on a stock exchange, or any other occupation with signals that are hard to interfere with.

Examples in the other direction, where signals are or have been ignored, are the financial crisis already mentioned, the looming disaster of global warming, the eventual end of cheap oil, and human overpopulation. On an individual level, unnecessarily bad diets, lack of exercise, smoking, and so on are examples of abstract survival signals ("the doctor says so") losing out to visceral motivations (e.g. it tastes good), which shows flaws in our motivational calculus.

I intend in the next post or two to show how this is related to the business of higher education.