Friday, December 16, 2011

Free Hypothesis-Generating Software

In the last year there have been announcements of two free software packages that use machine learning techniques to mine data for relationships. The resulting mathematical formulas can be used to form hypotheses about the underlying phenomena (i.e. whatever the data represents).

The first one I have mentioned before. It's Eureqa from Cornell, which uses symbolic regression. There is an example on the Eureqa site that poses this sample problem:
This page describes an illustrative run of genetic programming in which the goal is to automatically create a computer program whose output is equal to the values of the quadratic polynomial x² + x + 1 in the range from –1 to +1. That is, the goal is to automatically create a computer program that matches certain numerical data. This process is sometimes called system identification or symbolic regression.
The program proceeds as an evolutionary search. The graph pictured below is a schematic of the way the topology of the evolved "critters" is formed.
A family tree of mathematical functions. (Image Source: geneticprogramming.com)
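To make the fitness idea concrete, here is a minimal sketch of how candidate formulas get scored against the target data. The candidates are hand-built lambdas rather than evolved expression trees, and this is illustrative only, not Eureqa's actual algorithm:

```python
# Minimal sketch of how symbolic regression scores candidate formulas
# against the target data y = x^2 + x + 1 on [-1, 1]. Illustration of the
# fitness idea only, not Eureqa's actual algorithm.

# Sample the "unknown" system at a handful of points.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [x * x + x + 1 for x in xs]

# A few hand-built candidate formulas (a real GP system evolves these
# as expression trees via crossover and mutation).
candidates = {
    "x + 1":       lambda x: x + 1,
    "x*x + 1":     lambda x: x * x + 1,
    "x*x + x + 1": lambda x: x * x + x + 1,
    "2*x":         lambda x: 2 * x,
}

def fitness(f):
    """Mean absolute error between a candidate and the observed data."""
    return sum(abs(f(x) - y) for x, y in zip(xs, ys)) / len(xs)

for name, f in sorted(candidates.items(), key=lambda kv: fitness(kv[1])):
    print(f"{name:15s} error = {fitness(f):.3f}")
# The exact formula scores zero error and would be kept for the next
# generation; near-misses like "x*x + 1" score better than wild misses
# like "2*x", which is what gives an evolutionary search something to climb.
```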
There is a limitation to genetic programming that is also a threat to any intelligent endeavor: the problem may not be amenable to evolutionary strategies. For some problems, the only way forward is exhaustive search. Only if the solution space is "smooth," in the sense that good solutions are "near" almost-good solutions, will the genetic approach find solutions faster than exhaustive search. On a philosophical note, modern successes in the physical sciences suggest that the universe is kind to us in this regard. The "unreasonable effectiveness" of mathematics (the title of an article by Eugene Wigner) in producing formulas that model real-world physics is a hopeful sign that we may be able to decode the external environment so that we can predict it before it kills us. (The internal organization of complex systems is another matter, and there's not much success to look to there.) Note, however, that even here formulas have not really been evolutionary, but revolutionary. Newton's laws of motion are derivable from Einstein's relativity, but not vice versa. The "minor tweak" approach doesn't work very often, Einstein's Cosmological Constant notwithstanding.

The second data miner is aptly called MINE, and comes from the Broad Institute of Harvard and MIT. You can read about it on their site broadinstitute.org. The actual program is hosted at exploredata.net, where you can download a Java implementation with an R interface. Here's a description from the site:
One way of beginning to explore a many-dimensional dataset is to calculate some measure of dependence for each pair of variables, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic used to measure dependence should have the following two heuristic properties. 
Generality: with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships. 
Equitability: the statistic should give similar scores to equally noisy relationships of different types. For instance, a linear relationship with an R² of 0.80 should receive approximately the same score as a sinusoidal relationship with an R² of 0.80.
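The "score every pair, rank, inspect the top" workflow they describe is easy to sketch. Here is a minimal version using ordinary Pearson correlation as the dependence score on hypothetical survey columns; MINE's MIC statistic is more general (it also catches nonlinear relationships), so this only illustrates the strategy, not the statistic:

```python
# Score every pair of variables, rank the pairs, and look at the top.
# Pearson correlation stands in for MINE's MIC here.
from itertools import combinations
from statistics import correlation  # Python 3.10+

# Hypothetical survey data: each key is a question, each list a column
# of respondent answers.
data = {
    "study_hours":      [5, 12, 8, 2, 10, 7],
    "gpa":              [2.9, 3.8, 3.4, 2.1, 3.6, 3.1],
    "texting_in_class": [4, 1, 2, 5, 1, 3],
}

scores = {
    (a, b): correlation(data[a], data[b])
    for a, b in combinations(data, 2)
}

# Rank pairs by the strength of the association, strongest first.
for pair, r in sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(pair, round(r, 2))
```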
It's interesting that this is a generalization of the approach in my correlation mapper software; the difference is that I have only considered linear relationships. For survey data, it's probably not useful to look beyond linear relationships, but I look forward to trying the package out to see what pops up. It looks easy to install and run, and I can plug it into my Perl script to automatically produce output that complements my existing methods. A project for Christmas break, which is coming up fast.

Update: I came across an article at RealClimate.org that illustrates the danger of models without explanations. Providing a correlation between items, or a more sophisticated pattern based on Fourier analysis or the like, isn't a substitute for a credible explanatory mechanism. Take a look at the article and comments for more.

By coincidence, I am reading Emanuel Derman's book Models.Behaving.Badly: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life. It has technical parts, which I find quite interesting, and more philosophical parts that leave me scratching my head. The last chapter, which I haven't read yet, advertises "How to cope with the inadequacies of models, via ethics and pragmatism." Stay tuned...

Update 2: You can read technical information about MINE in this article and supplementary material.

Saturday, December 10, 2011

Randomness and Prediction

I saw a question at stackoverflow.com asking why computer programs can't produce true random numbers. I can't locate the exact page now, but here's a similar one. The question spooked around in my head all day, despite my head-down work to catch up on paperwork after being away at the SACSCOC meeting (see the tweets). After coming home, I finally gave in and wrote some notes down on the topic. It has applications to assessment, believe it or not.

According to complexity theory, "random" means infinitely complex. Complexity is the length of the shortest complete description of something, for example a list of numbers. If we are given an infinitely long list of numbers like
1, 1, 1, 1, 1, ....
it's easy to see that it has very low complexity. We can describe the sequence as "all ones." Similarly, the powers of two make a low-complexity sequence, as does any other simple arithmetic sequence. We could create a more complicated computer program that tries to produce numbers that are as "mixed-up" as possible--this is what pseudo-random number generators do--but if we have access to the program (i.e. the description of the sequence), we can perfectly predict every number in the sequence. It's hard to call that random.
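A minimal demonstration of that point: a pseudo-random generator is perfectly predictable by anyone who holds its description (the algorithm plus the seed).

```python
# A pseudo-random sequence is completely determined by its description.
import random

gen_a = random.Random(42)   # the "generator"
gen_b = random.Random(42)   # the "predictor", holding the same description

observed  = [gen_a.randint(0, 1) for _ in range(20)]
predicted = [gen_b.randint(0, 1) for _ in range(20)]

print(observed == predicted)   # True: 100% prediction accuracy
```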

Truly random numbers (as far as we know) come from real-world phenomena like radioactive decay. You can get a certain amount of this so-called "entropy" for free from internet sources like HotBits. I use such services for my artificial life experiments (insert maniacal laugh here). Real randomness is a valuable commodity, and I'm constantly running over my limit for what I can get for free from these sites. Here's a description from their site of where the numbers come from:
HotBits is an Internet resource that brings genuine random numbers, generated by a process fundamentally governed by the inherent uncertainty in the quantum mechanical laws of nature, directly to your computer in a variety of forms. HotBits are generated by timing successive pairs of radioactive decays detected by a Geiger-Müller tube interfaced to a computer.
What would be involved if you wanted to predict a sequence of such numbers (which will come in binary as ones and zeros)? As far as we know, radioactive decay is not predictable from knowing the physical state of the system (see Bell's Theorem for more on such things).

Even in a mechanical system such as a spinning basket of ping-pong balls, like those used for selecting the winning numbers in lotteries, a complete description of the system that is sufficient to allow you to predict which balls will emerge to declare the winner would be a very long set of formulas and data. In other words, even if it's not infinitely complex, it's very, very complex.

But what if we want partial credit? This was the big idea that had half my brain working all day. What if we are content to predict some fraction of the sequence, and not every single output? (Like "lossy" compression of image files versus exact compressors for executables.) For example, if I flip a coin over and over, and I confidently "predict" for each flip that it will come up heads, I will be right about half the time (in fact, any predictor I use is going to be right about half the time, on average). So even with the simplest possible predictor, I can get 50% accuracy.

Imagine that we have an infinitely complex binary sequence S-INF that comes from some real source like radioactive decay. We write a program to do the following:
For the first 999,999 bits, we output a 1 each time. For the one-millionth bit we output the first S-INF bit, which is random. Then we repeat this process forever, using the next unused S-INF bit each time, so that very long strings of 1s are followed by a single random one or zero. Call this sequence S-1.
It should be clear that a perfect predictor of S-1 is impossible with a finite program. It's still infinitely complex because of the injection of bits from S-INF. But on the other hand, we can accurately predict the sequence a large portion of the time. If we just predict that the output will be 1 every time, we'll be wrong only once in every two million bits, on average.
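Here is a sketch of the construction and the accuracy of the "always predict 1" strategy. For speed, blocks of 1,000 bits stand in for the million-bit blocks in the text, and os.urandom supplies the stand-in for S-INF; the error-rate argument scales the same way.

```python
# S-1 in miniature: long runs of ones, each ending in one "random" bit.
import os

BLOCK = 1_000          # 999 ones followed by one random bit
N_BLOCKS = 10_000

errors = 0
for _ in range(N_BLOCKS):
    random_bit = os.urandom(1)[0] & 1     # stand-in for the next S-INF bit
    # Predicting 1 for every position: the only possible miss in a block
    # is the final random bit, and only when it comes up 0.
    if random_bit == 0:
        errors += 1

total_bits = BLOCK * N_BLOCKS
print(f"error rate: {errors / total_bits:.6f}")   # about 1 in 2,000 here
# With million-bit blocks the same argument gives roughly one error per
# two million bits, exactly as described above.
```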

There is a big difference between randomness and predictability. If we take this another step, we could imagine making a picture of prediction-difficulty for a given sequence. An example from history may make this clearer.
Before Galileo, many people presumably thought that heavy objects fell faster than lighter objects. This is a simple predictor of a physical system. It works fine with feathers and rocks, but it gives the wrong answer for rocks of different weights. Galileo and then Newton added more description (formulas, units, measurements) that allowed much better predictions. These turned out to be insufficient for very large scale cases, and Einstein added even more description (more math) to create a more complex way of predicting gravitational effects. We know that even relativity is incomplete, however, because it doesn't work on very small scales, so theorists are trying ideas like string theory to find an even more complex predictor that will work in more instances. This process of scientific discovery increases the complexity of the description and increases the accuracy of predictions as it does. 
Perfect prediction in the real world can be very, very complex, or even infinitely complex. That means that there isn't enough time or space to do the job perfectly. As someone else has noted (Arthur C. Clarke?), some systems are so complex that the only way to "predict" what happens is to watch the system itself. But even very complex systems may be predictable with high probability, as we have seen. What is the relationship between the complexity of our best predictor and the probability of a correct prediction? This will depend on the system we are predicting--there are many possibilities. Below are some graphs to illustrate predictors of a binary sequence. Recall that a constant "predictor" can't do worse than 50%.

The optimistic case is pictured below. This is embodied in the philosophy of progress--as long as we keep working hard, creating more elaborate (and accurate) formulas, the results will come in the form of better and better predictors.

 The worst case is shown below. No matter how hard we try, we can't do better than guessing. This is the case with radioactive decay (as far as anyone knows).
The graph below is more like the actual progress of science as a "punctuated equilibrium." There are increasingly large complexity deserts, where no improvement is seen. Compare the relatively few scientists that led to Newton's revolution or the efforts of Einstein and his collaborators to the massive undertaking that is string theory (and its competition, like loop quantum gravity).
Note that merely increasing the complexity of a predictor is easy; the hard part is figuring out how to increase prediction rates. You can always make a formula or description more complex, but doing so doesn't guarantee that the predictions are any better. Generally speaking, there is no computable (that is, systematic or deterministic) method for automatically finding optimal predictors for a given complexity level. You might think that you could just try every single program of a given complexity level and proceed by exhausting the possibilities, but you run into the Halting Problem. There are practical ways to tackle the problem, though. This is a topic from the part of computer science called machine learning. A new tool that appeared this year from Cornell University is Eureqa, a program for finding formulas to fit patterns in data sets using an evolutionary approach.

Next time I will apply this idea to testing and outcomes assessment. It's very cool.

Friday, December 09, 2011

X-Raying Survey Data

I continue to develop and use the software I patched together to look at correlates (or covariates) within large scalar or ordinal data sets like surveys. I have gotten requests from several institutions in and out of higher ed to do these. A couple of interesting graphs that resulted are shown below, with permission of the owners of the data, who shall remain anonymous. Both of these are HERI surveys. I have found the HERI surveys the most revealing, partly because they discriminate so well between different dimensions. Some other surveys seem to produce (in the data sets I've seen) big globs of correlated items that are hard to get meaning from.

First is the CIRP Freshman Survey at a private college. It neatly divides the survey respondents into clusters. Rich urban kids (negatively correlated with working-class and middle-class kids), athletes, the religious, and the environmentally conscious all show up clearly. I've labeled the optional questions with an approximation of the prompt. [Download full-sized graph]
Next is the Your First College Year survey at a different private college. I find the link between texting in class and recommending the school to others particularly interesting; that's at the bottom. Red lines are negative correlations. [Download full-sized graph]

Higher Ed's 1%

I found this chart at the Chronicle very interesting. It shows the ratio of presidential pay to average professor pay. For example, at Stevenson University, it's 16 to 1.



My rule of thumb is that the quality of an institution is ranked by Instructional Costs/FTE. Here are some selected institutions from the right side of the chart, along with their "Z-scores", courtesy of CollegeResultsOnline (2009 data).





Friday, December 02, 2011

Rubrics as Dialogue

In the last few articles, beginning with "The End of Preparation", I have contrasted two epistemologies. One proceeds by definition, which I called monological, and the other emerges from dialogue. These are distinct and equally useful ways of understanding the world. We could point to the discipline of physics as a very successful monological system. It allows scientists and engineers to model physical systems and design systems that will work as intended. It allows us to understand what the sun is, and where the atoms in our bodies came from, all as part of a broad model that uses consistent monological language. This successful union of reality, language, and model is what makes the physical sciences so powerful. With such a language we can think precisely about complexity, for example, which is the size of the description of some physical system. A jar of marbles being shaken is complex because each marble's position and velocity are different, requiring a long list of physical attributes to describe. If the marbles are all at rest in a square arrangement, the description in physical language can be compressed, and the state is less complex. Physicists would refer to this as high entropy versus low entropy.
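The "size of the description" intuition can be made concrete with a general-purpose compressor standing in for a physical description. A tiny sketch (the marble jar is only a metaphor here): an ordered arrangement compresses far better than a shaken-up one.

```python
# Compressed length as a stand-in for the size of a description:
# low entropy compresses well, high entropy does not.
import random, zlib

ordered = bytes(range(256)) * 40     # the marbles "at rest in a square"
shaken  = bytearray(ordered)
random.shuffle(shaken)               # same marbles, scrambled

print(len(zlib.compress(bytes(ordered))))   # short description (low entropy)
print(len(zlib.compress(bytes(shaken))))    # much longer description (high entropy)
```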

By comparison, the language and ways of knowing that create popular culture are dialogical. There are no rules set down about what new words (or memes, if you prefer) will arise, and no deterministic rules that could be applied to predict cultural evolution. A stock exchange has some elements of monologism (precise definitions regarding financial transactions, for example), but the evolution of prices is dialogical--unpredictable consensus between buyers and sellers.

One of the characteristics that distinguishes a monological language from a dialogical one is that in the former case, the names can be arbitrary. What matters is their relationship in the model that's used for understanding the world. For example, electricity comes in Volts and Amperes, and its power is measured in Watts. These are names of scientists, as are the Ohm, Henry, and Farad, terms that refer to electrical properties of circuit elements. If they were dialogical names, they would more likely be "Zap," "Spark," and "Shock" or something similarly descriptive. This is because in a dialogue, it's an asset for words to be descriptive--you don't have to waste extra time saying what it is you meant. By contrast, it's enough to know that V=IR when calculating voltage in a circuit. It doesn't matter whether we call it Volts or Zaps.

It's a trope to poke fun at academics who speak in high-falutin' language just to say something ordinary. When Sheldon in Big Bang Theory gets stuck on a climbing wall, he says "I feel somewhat like an inverse tangent function that's approaching an asymptote," which is then reinforced by his desperate follow-up "What part of 'an inverse tangent function approaching an asymptote' did you not understand?" [video clip]  Some might argue that some academic disciplines that are inherently more dialogical use language that's unnecessarily opaque. This point was publicly made in the "Sokal Affair," where a scientist submitted a jargon-laden meaningless paper to a humanities journal as a hoax, and it was published.

Using Rubrics for Assessment
In order to connect these ideas to the assessment practice of using rubrics, let me first review what they are.

The term "rubric" in learning outcomes assessment means a matrix that indexes competencies versus accomplishment levels. For example, a rubric for rating student essays might include a "correctness" competency, which is probably one of several on the rubric. There would be a scale attached to correctness, which might be Poor, Average, Good, Excellent (PAGE), or one tied to a development sequence like "Beginning" through "Mastering." In our Faculty Assessment of Core Skills survey, we use Developmental, Fresh/Soph, Jr/Sr, Graduate to relate the scale to the expectations of faculty.

A rubric alone is not enough to do much good. A fully-developed process using rubrics might go something like this, starting with developing your own.

  1. Define a learning objective in ordinary academic language. "Students who graduate from Comp 101 should be able to write a standard essay that uses appropriate voice, is addressed to the target audience, is effective in communicating its content, and is free from errors."
  2. The competencies identified in the outcomes statement are clear: voice, audience, content, and correctness. These define the rows of the rubric matrix.
  3. Decide on a scale and language to go with it, e.g. PAGE.
  4. Describe the levels of each competency in language that is helpful to students. It's better to be positive than negative--that is, define what you want, not what you don't want when possible. There are many resources on constructing rubrics you can consult. The AAC&U's VALUE rubrics are examples to refer to.
  5. The rubric should be used in creating assignments, and distributed with the assignment, so the student is clear about expectations. Use of rubrics in grading varies--it's not necessary to tie an assessment to a grade, but there are some obvious advantages if you do.
  6. Rating the assignment that was designed with the rubric in mind should not be a challenge. If it's essential to have reliable results, then multiple raters can be used, and training sessions can reduce some of the variability in rating. Nevertheless, it's not an exact process.
  7. Over time you create a library of samples to show students (and raters) what constitutes each achievement level.
Note that the way to do this is NOT to take some rubric someone else has created and apply it to assignments that were not created with the rubric in mind. That's how I did it the first time, and wasted everyone's time.

Rubrics as Constructed Language
The learning objectives that rubrics are employed to assess are often complex, so even though an attempt is made to define the levels of accomplishment, these descriptions are in ordinary language. That is, there's no formal deductive structure or accompanying model that deterministically generates output ratings from inputs. Instead, the ratings rely on the judgment of professionals, who are free to disagree with one another. If your attitude is that there is one true rating that all raters must eventually agree on, you're likely to be frustrated. One problem is that the competencies, like content and correctness in the example, are not independent. If there are too many spelling and grammar mistakes on a paper to gain any sort of comprehension, then content, style, voice, and so on are also going to be degraded. One rater of writing samples I remember was adamant that a single spelling mistake implied that all other ratings would be lowered as well.

So using rubrics is dialogical, but by way of a nice compromise. The power in rubrics comes from restraining the language we use to describe student work, according to a public set of definitions. Even though these are not rigorous, they are still extremely useful in focusing attention on the issues that are deemed important. In addition, rubrics create a common language in the learning domain. It's important for students not just to know content, but also to know how professionals critique content, and rubrics are a way to do that. They can be used for self-reflection or peer review to reinforce the use of that language.

The advantage of generating useful language is one reason I only use a PAGE scale as a last resort. Terms like poor, average, and so on are too generic, and too easily made relative. An excellent freshman paper and an excellent senior paper should not be the same thing, right? Bad choices in these terms early on can have long-term consequences when you want to do a longitudinal analysis of ratings. 

There is a tendency among some to view rubric rating as a more monological process, but I can't see how this can be supported for most learning outcomes. In my opinion, they are most useful in creating a common language to employ in teaching, to rein in the vast lexicon that might naturally be used and focus on the elements that we agree are the most important. This has positive benefits for everyone concerned.


Thursday, December 01, 2011

Chef's Salad

Here's another serving of link salad. The articles referenced connect to recent topics of discussion.


HERI just released a report on college graduation rates. They give details on regressions to predict completion, and provide correct classification rates for them. Here's an example:

Note that SAT scores don't add any information once high school GPA is accounted for. The correct classification rate can be compared to the base rate of graduates. For example, if a test correctly predicts a coin flip 50% of the time, on the face of it this isn't very impressive. But it's actually more complicated than that. I have a kind of complexity theory approach to this sketched out on scrap paper, and will write about that later. In this case, the rate of four-year graduation from page 7 of the paper is 38.9%, so a correct classification rate of 68.3% can be compared to the strategy of predicting that no student will graduate, which is correct 61.1% of the time. Even by this crude comparison, the predictor looks useful.
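The arithmetic behind that comparison, spelled out in a few lines (the figures are the ones quoted above):

```python
# Base-rate comparison: if 38.9% of students graduate in four years,
# always predicting "will not graduate" is correct 61.1% of the time,
# so a model has to beat that figure, not 50%.
grad_rate = 0.389
baseline_accuracy = 1 - grad_rate      # predict "no" for everyone
model_accuracy = 0.683                 # reported correct classification rate

print(f"baseline: {baseline_accuracy:.1%}")                       # 61.1%
print(f"model:    {model_accuracy:.1%}")                          # 68.3%
print(f"lift over baseline: {model_accuracy - baseline_accuracy:.1%}")
```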

HERI provides an associated calculator that lets you try out different scenarios related to graduation. Very cool.

The Virginia Assessment Group just published their new edition of Research & Practice in Assessment. I hope to be able to read it on the plane tomorrow, on the way to the annual SACS-COC meeting in Orlando. Last time I was there I got to see a shuttle launch, which was amazing. Not likely this time.

Coursekit is a new learning management system that wants to be more like social media. This relates to the topic of connecting professional portfolios to a social network. I learned about Coursekit in this Wired Campus article in The Chronicle. Even more intriguing to me is Commons in a Box, a separate open source project to create professional networks. Quoting from the article in The Chronicle:
Educational groups, scholarly associations, and other nonprofit organizations will be able to leverage the Commons in a Box to give their members a space in which to present themselves as scholars to the public, to share their work, to locate and communicate with peers, and to engage in collaborative scholarship.
The original source is the CUNY Academic Commons.

Monday, November 28, 2011

Link Salad

A Monday's worth of interesting education-related links:

On non-cognitives, we have two articles from the Boston Globe. The first is "How College Prep is Killing High School":
A number of economists, including Nobel economist James Heckman, have documented the need for noncognitive or so-called soft skills in the labor market, such as motivation, perseverance, risk aversion, self-esteem, and self-control.
The second is "How Willpower Works":
In dozens of studies conducted over the past 25 years, Baumeister has found that taking on specific habits - like brushing your teeth with the opposite hand you’d normally use - can increase levels of self-control. In a phone interview, he likened willpower to a muscle: “If you exercise it, you can make it stronger. There’s nothing magical about it.’’
Then there is the less optimistic offering from the New York Times "The Dwindling Power of a College Degree," which contains a warning for all of us:
A general guideline these days is that people are rewarded when they can do things that take trained judgment and skill — things, in other words, that can’t be done by computers or lower-wage workers in other countries.
The Wall Street Journal has a scorecard of career salaries by degree, in case you're keeping score. The highest 75th percentile salary goes to math and computer science combined. Compare it to math education:

A partial listing of the WSJ salary/major list found here.
The quote in the New York Times article about computers replacing us is especially interesting when juxtaposed with the ambitious research plan described in "Mining the Language of Science," from Physorg.com:
Scientists are developing a computer that can read vast amounts of scientific literature, make connections between facts and develop hypotheses.
Stanford University is offering a free online course on machine learning if you want to learn how to make a computer smarter than yourself (true story).

 To round out that topic, here are two articles on the limits of human understanding. First from Physorg.com again is "People are Biased against Creative Ideas, Studies Find," including these findings:
  • Creative ideas are by definition novel, and novelty can trigger feelings of uncertainty that make most people uncomfortable. 
  •  People dismiss creative ideas in favor of ideas that are purely practical -- tried and true. 
  •  Objective evidence shoring up the validity of a creative proposal does not motivate people to accept it. 
  • Anti-creativity bias is so subtle that people are unaware of it, which can interfere with their ability to recognize a creative idea.
The second article, from SciGuru, is "Ignorance is bliss when it comes to challenging social issues."
The less people know about important complex issues such as the economy, energy consumption and the environment, the more they want to avoid becoming well-informed, according to new research published by the American Psychological Association. And the more urgent the issue, the more people want to remain unaware [...]
This illustrates the mechanism I described in "Self-limiting Intelligence."  You can test yourself on these last two points. Here's a creative idea from Business Insider, and a challenging social issue from The Economist. Good luck!


Wednesday, November 23, 2011

Assessments, Signals, and Relevance

In "Tests and Dialogues" I promised to address the use of rubrics, which I get to. But before I do, I want to extend the ideas presented in the last few articles. By coincidence, my daughter  provided an example the same day I wrote the article.
My daughter Epsilon had a math test and a French test yesterday, so naturally I asked how they went. She had spent quite some time reviewing (with my imperfect help in the French class), and said that the tests were easy except that she forgot what the degree of a polynomial is (ugh!). She said she was able to guess at some things she didn't know, which made my eyebrows rise. Guess? Sure, she says, it's almost all multiple choice.  Here I began to sputter. What?? Algebra and French...multiple choice? Yes, says she, it's because of the EOCs. That would be the local name for the monological "End of Course" state tests. Since the EOCs are multiple choice, and there is so much weight put on them, it makes economic sense to optimize all testing to resemble "the ones that matter." She's 14, and this is old hat to her by now.
The wrong assessments plus a factory mentality optimize for local relevance at the cost of global irrelevance. David Kammler wrote a marvelous parable in this vein, "The Well Intentioned Commissar." Achieving a goal can be inherently very complex, and when we try to grasp its workings by simplifying cause and effect (e.g. in order to manage like a factory), we can lose important information. This is detrimental when optimizing the simplified problem is not the same as optimizing the original problem. The impact is not merely academic. I read a story in The Economist years ago that went like this:
A state in the US was spending more on road repair than it thought reasonable, and sought to make the situation more equitable by passing the cost on to the owners of the heavy trucks that were doing the most damage. So it instituted an axle fee--the more axles a truck had, the higher the cost to the owner to use the roads. This was a simple approximation: the heavier the truck, the more axles. What could go wrong? The outcome was that truckers, not being stupid, started using trucks that carried just as much weight, but on fewer axles. This increased the ground pressure of the trucks (same weight over less area) and damaged the roads even more than before. In this case they didn't merely optimize irrelevance, but actually exacerbated the problem they were trying to fix.
Even being irrelevant has an associated opportunity cost. The time spent learning how to game multiple choice tests could be better spent. We can only imagine what the long term cost is, when students finally figure out that real problems don't come with a built-in 20% chance of guessing the right answer.

It is not a coincidence that all of this ties together with the idea in "Self-Limiting Intelligence," where the problem I tried to illuminate about intelligent systems is that self-change easily turns into self-deception. My last couple of articles have mostly ignored the fact that there are powerful motivations lurking behind official definitions. Here's an example of how motivation subverts definitions from theNewspaper.com, which calls itself "a journal of the politics of driving."
Automated ticketing vendor American Traffic Solutions (ATS) filed suit Tuesday against Knoxville, Tennessee for its failure to issue tickets for turning right on a red light -- and that is costing the company a lot of money. A state law took effect in July banning the controversial turning tickets, but the Arizona-based firm contends the law should not apply to their legal agreement with the city, which anticipated the bulk of the money to come from this type of tickets.
If this seems silly, here's a more disturbing example:
Judge Mark A. Ciavarella and former Senior Judge Michael T. Conahan are accused of taking $2.6 million for sending children to two [correctional] facilities owned by Pittsburgh businessman Greg Zappala.
Privatizing a corrections facility created an economic value for criminal offenders, which increased the supply of offenders through more aggressive application of the monological standard by judges. This is what juries are there to prevent, by providing a dialogical check.

In higher education, the definition of educational success adopted by policymakers is "enrollment" and "graduation," for which the state pays plenty. See my previous article "Flipping Colleges for Profit" for how that turns out in the hands of private investors who seek to maximize dollars per student. We maximize enrollment and (perhaps) graduation at very large monetary expense to the taxpayer, in grants and loans to students who default at high rates. This counterproductive effect is evidence of an over-simplified index of success.

It would be understandable if you took away from the discussion so far that monological = bad and dialogical = good, but that's not the case. Systems have to function monologically most of the time. I base this on a simple argument:
Systems of all kinds exist. Some work, some don't. The ones that survive do so in part because they are motivated to survive. The actions a system takes to survive can be ultimately reduced to binary "do this/don't do that" decisions. Motivations drive those decisions based on information from the internal and external environment. This reduction of complex data into a simple binary decision we might call an assessment, and the result is a 'signal.' Pain when you stub your toe is a "don't do that" signal corresponding to the implicit motivation to avoid bodily harm. 
If we put together all these signals, they comprise a language. If it works, it models the environment and allows the system to survive (eat this, don't eat that). The diagram that goes with a motivation-driven decision loop is the one I discussed in "Self-Limiting Intelligence."

The point is not that monological motivation-driven signals are bad for us, it is that we have to use the right ones if we want to succeed. Sometimes dialogues get turned into signals, as in a plebiscite or anywhere else where public opinion matters. Marketing is another example, but in reverse--working from a motivation to try to affect dialogue so someone can sell more soap. In those cases, lots of energy is spent in trying to affect conversations. The BBC's recent article "Fake forum comments are 'eroding' trust in the web" is an example.

Part of the decision about what assessment to use to create signals should be driven by the consideration that weighing it down with economic value will probably degrade the quality. This problem is ubiquitous. It includes counterfeiting, cheating, and corruption of all sorts. It even shows up in natural selection, as Darwin figured out, in sexual selection--explaining peacock feathers, for example. It is perhaps embodied in the advice "you have to fake it to make it."

In order to make a decision, a system has to process a potentially infinite amount of data for a few clues as to what will accomplish its goal (fulfilling motivations). This assessment is a massive data compression that, if it's done well, describes in signal-language the important elements of the environment relative to motivation.

An example will illustrate the point. First, let's look at the role of complexity and assessment. I took the photos below at the Cape Fear Serpentarium in Wilmington, North Carolina. (It's a fantastic place to visit, along with Fort Fisher and the nearby aquarium if you're in the area.) This is the Gaboon Viper, and I first read about it in a zoo--I think in Columbia, South Carolina. It's a big, slow snake that prefers to sit and wait for lunch to walk or hop by. If you're a small animal, the picture below shows your perspective:
The Gaboon Viper presents a high-complexity look to prey.
In its natural environment, the snake's colors fit perfectly into the surrounding forest floor. The complex patterns of light and dark break up the outline of its form, so that a rodent is unlikely to correctly assess this information and form the signal SNAKE! The snake presents a high-complexity visual appearance to the world, and is rewarded for this concealment by a reduced probability that a rodent will correctly assess the situation.

And a low-complexity "here I am!" to large beasts.
However, the snake has another problem. It's got a great lifestyle, sitting around waiting for dinner to walk by, but there are also large hooved beasts that are far too large to eat and impossible to get out of the way of when they come ambling by. It's a good thing to be hidden from small critters that one might consume, but quite another to be hidden from a huge monster that might step on you and break your spine! So the presentation for a viewer looking down on the serpent needs adjustment. Instead of concealment, it wants to create an instant assessment in the bovine brain of SNAKE! As you can see in the photo on the right, this is accomplished (via natural selection, of course) by white stripes that look like the center of a highway.  I looked for research that actually demonstrates that cows can see these snakes better than ones without such coloration, but didn't come up with anything. So treat this as informed speculation rather than fact, unless someone can point me to an authoritative source. But the effect is real. In the family of large cats, some females have white dots on the backs of their ears so they can present a low-complexity "follow me" sign to their kittens in low light. Military vehicles do something similar so they don't run into one another in the dark, but also don't make good targets.

To continue the example:
Suppose you are going out for an afternoon hike in a tropical jungle. You check into the matter and see that there are deadly poisonous snakes in the area. Fortunately, there is a guy whose job it is to monitor the jungle nearby for such threats, and post a sign at the trail head with a warning when appropriate.  You may rightfully be dubious. There is a very large tract of undeveloped forest out there, and how plausible is it that this guy--who may be the governor's favorite nephew, for all you know--could have checked for every possible snake? So you are not reassured when you see a big green NO SNAKES TODAY THANK YOU sign nailed to a tree. In effect, you've rejected the data reducing assessment and have decided to create your own signals. That is, the final assessment for "is there a snake here?" remains pending.  You carefully watch where you step, tying up a large part of your mind to continually test the environment against your snake-matching perception. This is a lot of work and quite stressful, so you give it up and go to lunch instead.
The example shows the trade-off in an early or late assessment into a signal. Early signals make subsequent decisions easier. That's why most businesses like stability.

So the big question for a complex system is when to do the assessment for a given motivation. There are symmetrical arguments for and against early decisions based on limited data:

Early assessment from data to signal:
Pro: If we assess early, we reap an economic benefit similar to mass production. All decisions that depended on the first one can now go about their business. Mandating that 21 is the legal age to drink alcohol creates a simple environment for liquor stores, as opposed to, say, administering an on-the-spot test for "responsible drinking" for each customer.
Con: Creating the signal greatly simplifies the actual state of the world. If subsequent decisions need detailed information, it won't be available. Worse, the signal may be wrong entirely. "Housing prices never fall" was an early assessment that led to a lot of unfortunate consequences.
Con: Signal manipulation for economic benefit (what we would call corruption in a government) can cause a widespread disconnect from reality. Adopting unproven test scores as the measure of educational success creates a false economy and doesn't reflect the actual goal.
Late assessment from data to signal:
Pro: Information isn't lost before it's needed. The example of walking in the jungle illustrates this. 
Pro: Local corruptions of signals have only local effects. From a virus's point of view, its ever-changing protein coat means that it can't be intercepted easily. On the other side, those who have to decide on a vaccine have a difficult decision to make about which variants to target.
Con: It's more costly because decisions are made individually instead of in mass production style. Every state has a different set of paperwork for allowing truckers to use the roads, which impedes commerce. Conversely, internet sales are aided by not having to worry about every locality's rules.
There's no one right answer. The cost for deferring assessment can be very high. The railroad owners finally created time zones to solve the problem of every town running on a different clock. A common currency is obviously good for everyone. Standardized traffic laws are a boon. Formalized ownership of property is a prerequisite for a modern society. All of these involve monological definitions that are somewhat based on early assessment of evidence and are somewhat arbitrary. In some cases, it can be completely arbitrary and still immensely helpful (e.g. which side of the road we drive on). Imagine what a mess it would be if we had to survey the dialogical landscape every morning to see whether most people were driving on the right or left, stopping at green lights or red, and make our adjustments accordingly.

Applying this to Higher Education
Education at all levels has to process so many students that good organization is essential. So it has to be a system, and there are going to be a lot of semi-arbitrary decisions made just to provide a workable system language. There are many, many of these, and those of us who labor within the system take it for granted that these things exist:
  • Courses
  • Grades
  • Credit-hours
  • Grade levels or course level designations
  • Set times for instruction (e.g. one hour lecture, which is really fifty minutes)
  • Set curricula
  • Degrees and diplomas
None of these have much directly to do with learning. The pressure to define what a credit hour means for online courses shows the rigidity of the system. It's worth a moment to look at one of these in detail, so let's follow that train of thought into the dialogical tunnel. Compare the Carnegie fifty-minute block class with how we naturally learn.

The Khan Academy comprises a large group of tutorials on YouTube started by Sal Khan (who quit managing a hedge fund to do this) and now operated by an impressive team. The home page claims that over 87 million lessons have been delivered, with more than 2700 videos on offer covering a range of academic subjects and levels. The one shown in the picture is about long division. You can see that it's a little less than 10 minutes long. Why ten minutes? Why not fifty minutes? Some are longer and some shorter, depending on the needs of the subject.

The curriculum in the Khan Academy is not a set of courses with prerequisites. It's much more natural than that, using a "knowledge map" to show connections between the ideas taught in the videos. Here's a sample:
There are suggestions for prerequisites, but they are per topic, not per course. Each of these has associated problems to be solved, challenges, and badges to earn. There are individualized feedback reports, and the ability for coaches to be involved.
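A knowledge map like this is essentially a prerequisite graph over topics rather than over courses. A tiny sketch with invented topics shows the idea: a topic is unlocked once its prerequisites are mastered, no semester or course boundary required.

```python
# Invented prerequisite graph over topics (not Khan Academy's actual map).
prerequisites = {
    "long division":  ["multiplication", "subtraction"],
    "multiplication": ["addition"],
    "subtraction":    ["addition"],
    "addition":       [],
}

def ready_to_learn(topic, mastered):
    """A topic is unlocked once all of its prerequisites are mastered."""
    return all(p in mastered for p in prerequisites[topic])

print(ready_to_learn("long division", {"addition", "multiplication"}))                  # False
print(ready_to_learn("long division", {"addition", "multiplication", "subtraction"}))   # True
```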

I recently signed up for a free course on Machine Learning taught out of Stanford University. The lectures were online with a built-in quiz in each one. The videos are of varying length, but none close to fifty minutes. I skipped stuff I was already familiar with and browsed topics I didn't know as much about. Once I had to go back to the introductory material to clarify a point, had my "aha!" moment, and then forged ahead. To me this is a very natural way to learn. It's not at all systematic. I would never have had the time to sit through a traditional lecture course on the subject, but with the browsing ability I have with the online presentation, I can choose what I want off the menu and maximize the use of my time.

What's the point? 
If we create systems that make early decisions for learners so that we can make the logistics work, we save time in administration but suffer opportunity cost for each student. This "early assessment" model leads to preemptive decisions solidified into policy and practice: courses, semesters, grades, and so on.

A "late assessment" version would be to provide just-in-time instruction for each student so that it could be used in some authentic learning situation. By authentic I mean anything that doesn't cause the student to think "when will I ever use this?"  For example, individual or group projects that require learning the content, but perhaps only a piece at a time on demand.

The early collapse of complexity into a simple bureaucratic language includes the factory-like quality checks that occur along the way in the form of (increasingly standardized) tests. This early assessment presents problems for consumers of the product: employers and the graduates themselves, and society as a whole. The educational system doesn't give much information about what the students have learned other than these formalized assessments. It's like the NO SNAKES TODAY THANK YOU sign.

Before the Internet, late assessment methods would have been too expensive to use for the whole system. Not anymore. We have the opportunity to adopt a new model whereby we coach students on how to learn for themselves. So much the better if this learning is in the context of some interesting project that the student can show off to peers and (if desired) the world. The creation of a rich portfolio along the way allows employers or anyone else with access to make their own assessments.

I do not think that the massive higher education system will change this radically anytime soon. Nor are we ready to make such a change. Something like the Khan Academy for every academic discipline would be a massive undertaking. Textbook publishers, if they are forward thinking, may lead the way. Imagine an online "textbook" that was actually a web of interrelated ideas mapped out transparently, with video lectures and automated problem solvers associated. This frees up what I once called the low bandwidth part of the course and allows for more creative official class time. Effectively, it offloads all the monologue to out-of-class time and lets you get to the dialogue directly.

There is an opportunity for programs here and there to begin to experiment with this transition. As the examples in my previous articles show, this is already happening. In the most optimistic case, this prevents further solidification of the factory mentality in higher education by showing valid alternatives.

Tuesday, November 22, 2011

Tests and Dialogues

In "The End of Preparation" I argued that standardized tests, as they exist now, are not very suited to the task of correctly classifying quality of the partial products we call students. Certainly the tests give us information beyond mere guessing, but the accuracy (judging from the SAT) is not high enough to support a factory-like production model. I pointed out that test makers do not usually even attempt to ascertain what the accuracy rate is. Instead we get validity reports that use a variety of associations. If we brought that idea back to the factory line, it would look something like this.
Announcing the new Auto Wiper Assessment (AWA). It is designed to test the ability of an auto to wipe water off the windshield. Its validity has been determined by high negative correlations with crash rates of autos on rainy days and low correlation on sunny days. 
On a real assembly line, the question would be as simple as Does it work now? and Are the parts reliable enough to keep it working?  Both of these can be tested with high precision. And of course, we can throw water on the windshield to directly observe whether the apparatus functions as intended. Direct observation of the functional structure of learning is not possible without brain scanners. Even then, we wouldn't really know what we are looking at--the science isn't there yet. What we do know is fascinating, like the London Taxi Cab study, but we're a long way from understanding brains the way we understand windshield wipers.

Validity becomes a chicken-and-egg problem. Suppose our actual outcome is "critical thinking and complex reasoning," to pick one from Academically Adrift. There are tests that supposedly tell us how capable students are at this, but how do we know how good the tests are? If there were already a really good way to check, we wouldn't need the test! In practice, the test-makers get away with waving their hands and pointing to correlations and factor analyses, like the Auto Wiper Assessment example above. This is obviously not a substitute for actually knowing, and it's impossible to calculate the accuracy rate from the kind of current validity studies that are done. The SAT, as I mentioned, is an exception. This is because it does try to predict something measurable: college grades.

This is not a great situation. How do we know if the test makers are selling flim-flam? In practice, I think tests have to "look good enough" to pass casual inspection, and they can amount to neo-phrenology without anyone ever knowing. How else can the vast amount of money being spent on standardized tests be explained? I'd be happy to be wrong if someone can point me to validity studies that show test classification error rates similar to the SAT's. A ROC graph would be nice.
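Producing that kind of evidence is not hard once a test score can be paired with some later observable outcome. A sketch with invented data, using scikit-learn's ROC utilities (the outcome labels and scores below are made up purely for illustration):

```python
# Treat the test score as a predictor of a later observable outcome and
# report classification performance directly, rather than correlations.
from sklearn.metrics import roc_curve, roc_auc_score

# Invented: 1 = student later judged proficient, 0 = not; scores are the
# standardized test results for the same students.
outcome = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
score   = [72, 55, 80, 61, 58, 40, 77, 65, 69, 50]

fpr, tpr, thresholds = roc_curve(outcome, score)
print("AUC:", roc_auc_score(outcome, score))
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"cutoff {th}: true-positive rate {t:.2f}, false-positive rate {f:.2f}")
# Publishing a curve like this (on real data) would let a reader see the
# classification error rate at any cutoff, which correlational validity
# reports do not show.
```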

The argument might be that since reductionist definitions are not practical, and there really is no way to know whether a test works except through indirect indications like correlations, this is the best we can do. But it isn't. In order to support that claim, let me develop the idea by contrasting two sorts of epistemology. It's essential to the argument and also worth the exposition for its own sake. When I first encountered these ideas, they changed the way I see the world.

Monological Knowing
Sometimes we know something via a simple causal mechanism: an inarguable definition. For example, when the home plate umpire calls a strike in a baseball game, that's what it is. It doesn't matter if the replay on television shows that the pitch was actually out of the strike zone. Any argument about that will be in a different space--perhaps a meta-discussion about the nature of how such calls should be made. But within the game, as veteran umpire Bill Klem is quoted as saying, "It ain't nothin' till I call it!"

Monological definitions are generally associated with some obvious sign. An umpire jerking his clenched fist after a pitch means it was a strike. Sometimes the definitions come down to chance, as with a jury trial. In the legal system, you are guilty if the jury finds you guilty, which has only indirectly to do with whether or not you committed a crime. The unequivocal sign of your guilt is a verdict from the jury. Other examples include:
  • Course grades, defining 'A' Student, 'B' Student etc.
  • Time on the clock at a basketball or football game, which corresponds only roughly to shared perception of time passing (perceived time doesn't stop during a time-out, but monological time can).
  • Pernicious examples of classifying a person's race, e.g. leading up to the Rwandan genocide. You are what it says you are on your documents.
Sometimes the assignments are random or arbitrary. Sometimes a single person gets to decide the classification, as with course grades. There is sometimes pressure from administrators to create easily understood algorithms for computing grades in order to handle grade appeals, but instructors usually have wide latitude in assigning what amounts to the monological achievement level of the student.

I got bumped from a flight one time, and came away from the gate with the knowledge that I was "confirmed" on the next flight. That didn't mean what I thought it did, however. According to the airline's (monological) definition, "confirmed" means that the airline knows you are in the airport waiting, so you're a sure seat if they have an extra. It does not mean that such a seat is guaranteed for you.

Dialogical Knowing
This might be more properly called polyphonic, but for the sake of parallelism, allow me the indulgence. In contrast to a monological handing down of definitions from some source, dialogical knowledge has these characteristics:
  • It comes from multiple sources
  • There isn't universal agreement about it (definitions are not binding if they exist)
  • It's subjective
Whereas there is a master copy of what a Kilogram is in a controlled chamber in France, there is no such thing for the concept of "heavy." A load you are carrying will feel heavier after an hour than at the beginning of the hour. Furthermore, we can disagree about the heaviness. This is messy and imperfect, but very flexible because no definitions are needed. Anyone can create a dialogical concept, and it gets to compete with all the others in an ecology where the most fit survive. This fact is what prevents loose shared understanding from devolving too far into nonsense as a whole. There's plenty of nonsense (like fortune-telling), but we can communicate in a shared language very effectively even in the absence of formal definitions. 

If I tell you that I liked the movie Kung Fu Panda, you know what I mean. There are movies you like too, and you probably assume I feel about this movie the way you feel about those, in some vague sense. You may disagree, but that's not a barrier to understanding. We could have a complex conversation about what constitutes a "good" movie, which doesn't have a final, monological answer. In Assessing the Elephant I compared this to the parable of the blind men inspecting an elephant, each sharing their own perspective. I used this as a metaphor for assessing general education outcomes, which are generally broad and hard to define monologically.

Tension between Monologue and Dialogue
Parallel to the tension between accountability and improvement in outcomes assessment, there is a tension between monological and dialogical knowledge in any system. The demand for locked-down monological approaches is the natural consequence of being part of a system, which, as I described last time, needs to manage fuzziness and uncertainty in order to function. That's why we have monological definitions for what it means to be an adult, or "legally drunk." It makes systematization possible. Much of the time, this entails replacing a hard dialogical question ("what is an adult?") with a simple monological definition ("anyone 21 years or older"). In ordinary conversation we may switch these meanings without noticing, but sometimes the tension is obvious.

The question "which candidate will do the best job in office?" gets answered by "which candidate got the most votes?" It replaces an intractable question with one that can be answered systematically in a reasonable amount of time Of course it's an approximation of unknown validity. Monologically, the system decides on the "best" candidate, but the dialogical split on the issue can be 49% vs 51%.

Someone put together a page describing the relationship between monological Starbucks definitions of drink sizes and the shared understanding of small, medium, large. The site, which you can find here, is a perfect foil for this discussion. I find it hysterically funny. Here's a bit of it:
The first problem is that Starbucks is right, in a sense. I've established that asking for a "small coffee" gets you the 12-ounce size; "medium" or "medium-sized" gets you 16 ounces; and "large" gets you a 20 ounce cup. However, in absolute rather than relative terms, this is nuts. A "cup" is technically 8 ounces, and in the case of coffee, a nominal "cup" seems to be 6 ounces, as indicated by the calibrations on the water reservoirs of coffee makers, [...]
When a referee makes a bad call in a sports event, the crowd reacts negatively. The dialogical "fact" doesn't agree with the monological one, which is seen as artificial and not reflecting the reality of shared experience.

It may be appalling, but it makes sense that the Oxford English Dictionary now includes the word "nucular" as a synonym for "nuclear." This is the embodiment of a philosophy that the dictionary should reflect the dialogical use of language, not some monological official version.



In assessment, it's quite natural to fall victim to the tension between these two kinds of knowledge. As noted, tests of learning almost never come with warning labels that say This test gives the wrong answer 35% of the time. The test doesn't have any other monological ways of knowing to compete with, other than possibly other similar tests, so by default the test becomes the monological definition of the learning outcome. Because it replaces a hard question ("how well can our students think?") with an easily systematized one ("what was the test score?"), it's attractive to anyone who has to watch dials and turn knobs in the system. In the classroom, however, the test may or may not have anything to do with the shared dialogical knowledge--that messy, subjective, imperfect consensus about how well students are really performing.

A Proposal to Bridge the Gap
Until we better understand how brains work, it's not realistic to hope for a physiology-based monological definition of learning to emerge to compete with testing. However, it would be very interesting to see how well tests align with the shared conception of expert observers. This doesn't seem to be a standard part of validity testing in education, and I'm not sure why. It's in everyone's best interests to align the two.

There is a brilliant history of this kind of research in psychology, culminating in the definition of the Big Five personality traits, which you can read about here. From Wikipedia, here is the kernel of the idea:
Sir Francis Galton was the first scientist to recognize what is now known as the Lexical Hypothesis. This is the idea that the most salient and socially relevant personality differences in people’s lives will eventually become encoded into language. The hypothesis further suggests that by sampling language, it is possible to derive a comprehensive taxonomy of human personality traits.
Subjective assessments have a bad reputation in education, but the lexical hypothesis was shown to be workable in practice. It's not astounding that dialogical language has meaning, but it doesn't seem fashionable to admit it.

Given all this, it's obvious that we should at least try to understand the correspondence between monological tests of "critical thinking and complex reasoning" or "effective writing" and the dialogical equivalent. It's simple and inexpensive to do this if one already has test results. All that's required is to ask people who have had the opportunity to observe students what they think. However it turns out, the results will be interesting.

Suppose the test results align very well with dialogical perceptions. That's great--we can use either tests or subjective surveys as we prefer.

If the two don't align, then we have to ask who's more likely to be correct. In this case the tests lose out because of a simple fact: test scores don't matter in the real world. What does matter are the subjective impressions of those who employ our graduates or otherwise interact with them professionally. In the world beyond the academy, it's common shared perceptions that are the metric of success, and it won't do any good to point to your test scores. In fact, there is a certain schadenfreude in disproving credentials, as in watching videos of graduates from Ivy U who don't know why the seasons change. It isn't just Missouri: we're a show-me society.

You'll notice that either way, the test results are largely unneeded. This illuminates why they are being used: self-reported dialogical assessments depend on trust, whereas tests can, in theory, be administered in an adversarial environment. This restates Peter Ewell's point, quoted in my previous article, and it is a recipe for optimizing irrelevance. In Assessing the Elephant, I called this a degenerate assessment loop and gave this example:
A software developer found that there were too many bugs in its products, so it began a new system of rewards. Programmers would be paid a bonus for every software bug they identified and fixed. The number of bugs found skyrocketed. The champagne was quickly put back on ice, however, when the company realized that the new policy had motivated programmers to create more bugs so that they could “find” them.
Similar "degenerate" strategies find their way into educational practices because of the economic value placed on monological simplifications used in low-trust settings. We read about them in the paper sometimes.

Surveying the Dialogical Landscape
I have implemented surveys at two institutions to gather faculty ratings of student learning outcomes. I have many thousands of data points, but no standardized test scores to compare them to, so I can't check the alignment as I described above. The reliability of these ratings is roughly a 50% probability of an exact match on a four-point scale for the same student, same semester, and same learning outcome, as rated by different instructors. I've already written extensively about that, for example here and on this blog, as well as in some chapters in assessment books, which you can find on my vita.
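
For the curious, here is a minimal sketch of the kind of exact-match reliability check described above. The paired ratings below are invented for illustration (roughly tuned to give about a 50% match rate), not my actual survey data, and the "echo" probability is an arbitrary assumption; the real analysis would read pairs of ratings from the survey records instead.

from collections import Counter
import random

random.seed(1)

# Invented paired ratings (instructor A, instructor B) on a 1-4 scale for the
# same student, outcome, and semester. The 0.4 "echo" probability is an
# arbitrary assumption chosen to yield roughly 50% exact agreement.
pairs = []
for _ in range(1000):
    a = random.randint(1, 4)
    b = a if random.random() < 0.4 else random.randint(1, 4)
    pairs.append((a, b))

n = len(pairs)
observed = sum(a == b for a, b in pairs) / n  # exact-match rate

# Agreement expected by chance if the two raters were independent,
# given their marginal rating distributions.
marg_a = Counter(a for a, _ in pairs)
marg_b = Counter(b for _, b in pairs)
chance = sum((marg_a[k] / n) * (marg_b[k] / n) for k in range(1, 5))

print("exact-match agreement: %.2f" % observed)
print("agreement expected by chance: %.2f" % chance)
print("Cohen's kappa: %.2f" % ((observed - chance) / (1 - chance)))

Comparing the observed match rate to the chance rate (and to kappa) is the point: a 50% exact match on a four-point scale is considerably better than independent raters would produce by accident.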

Conclusion
In a tight system, monological approaches are useful. The human body is a good example of this, but we should note that at least two important systems are more dialogical than monological: the immune system and the conscious mind. The world beyond graduation resembles a competitive ecology more like what the immune system faces than a systematic by-the-numbers existence like a toenail.

The only reason to use monological tests is if we don't trust faculty. This can't even be done with any intellectual honesty, because we can't say that the tests are any good. What I proposed in "The End of Preparation" is that we move to dialogical methods of assessment throughout and beyond the academy. These can still be summarized for administrators to look at, but only if there is trust at all levels. And really, if there is no trust between faculty and administration, the whole enterprise is doomed.

The mechanism of using public portfolios showing student records of performance can be purely dialogical--a student's work can have different value to different observers inside and outside the academy.


Next time I'll address what all this has to do with rubric-based assessment.

[Next article in this series: "Assessments, Signals, and Relevance"]

Some Frivolous Thoughts
As I said, this dichotomy changed the way I think about the world, and I find interesting tidbits everywhere. One interesting idea is the hypothesis that as a domain of interest becomes more reliably theoretical (like alchemy becoming chemistry), the nomenclature transitions from descriptive and dialogical to arbitrary and monological. I went poking through several dictionaries looking at the names of the elements to find examples. Copper may be an instance, perhaps having been named for Cyprus, as in "Cyprian metal." If the name is too old, the etymology is foggy. Steel is more recent, and it seems to derive from a descriptive Germanic word for stiff. Compare that to Plutonium, which is modern and non-descriptive. Of course, with arbitrary naming, the namer can choose to be descriptive, as Radium arguably is. This thesis needs some work.

In biology, Red-Winged Blackbird is a descriptive name for the monological Agelaius phoeniceus. In a good theory, it doesn't matter what you call something. What matters is the relationships between elements, like the evolutionary links between bird species as laid out in a cladistic family tree. Modern scientists are more or less free to name new species or sub-atomic particles whatever they want. Organic chemistry is an interesting exception, because the names themselves are associated with composition. They are simultaneously descriptive and monological.

Drug names are particularly interesting. Viagra, for example, has a chemical name that describes it, but that obviously wouldn't do for advertising purposes. Here's what one source says about the naming process:
Drug companies use several criteria in selecting a brand name. First and foremost, the name must be easy to remember. Ideally, it should be one physicians will like -- short and with a subliminal connotation of the drug. Some companies associate their drugs with certain letters (e.g., Upjohn with X and Glaxo with Z). If the drug is expected to be used eventually on a nonprescription basis, the name should not sound medicinal. There must be no trademark incompatibilities, and the company must take account of the drug's expected competition.
It sounds like the name is chosen to fit neatly into a dialogical ecology.

The history of the SAT's name is interesting from this perspective, but I will bring this overlong article to a close.

Acknowledgments: The idea for the monological/dialogical dichotomy came out of conversations with Dr. Adelheid Eubanks about her research on Mikhail Bakhtin. I undoubtedly have mangled Bakhtin's original ideas, and neither he nor Adelheid should be held responsible for that.

Sunday, November 20, 2011

The End of Preparation

A few days ago, I wrote "A Perilous Tail" about problems with the underlying distributions of measurements (in the sense of observations turned into numbers) we employ in education. I've shown previously that, at least for the SAT, where there is data to check, predictive validity is not very good: we can only classify students correctly 65% of the time. When this much random chance is involved in decision-making, we can easily be fooled. There I cited the example of how randomness might lead us to believe that yelling at dice improves their "performance." It's also unfair to hold individuals accountable if their performance is tied to significant factors they can't control, and it invites cheating and finger-pointing.
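
To make "easily fooled" concrete, here is a back-of-the-envelope sketch. The 65% correct-classification rate is the figure quoted above; the 30% base rate of students who really won't succeed is an invented assumption for illustration, not a real statistic.

# How often is a "won't succeed" flag actually right? Assumes the test is
# correct 65% of the time in both directions; the 30% base rate is invented.
accuracy = 0.65    # correct-classification rate (assumed symmetric)
base_rate = 0.30   # hypothetical share of students who really won't succeed

true_flags = base_rate * accuracy               # at risk and flagged
false_flags = (1 - base_rate) * (1 - accuracy)  # doing fine but flagged anyway

ppv = true_flags / (true_flags + false_flags)
print("Of flagged students, only %.0f%% are actually at risk." % (100 * ppv))
# With these assumptions, about 44% -- most of the flags are wrong.

In other words, with a modest base rate, a 65%-accurate classifier flags more students who would have succeeded than students who are actually at risk.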

As I mentioned in the previous article, I have a solution to propose. But I haven't really done the problem justice yet. We encounter randomness every day, so how do we deal with it? Functioning systems have a hard time with too much randomness, so the systematic response is to manage it by reducing or avoiding uncertainties, and when that cannot be done, we might imagine it away (for example, throwing salt over one's shoulder to ward off bad luck). Many of our ancestors undoubtedly faced a great deal of uncertainty about what they would be able to eat, so much of our mental and physical activity, and the very way our bodies are constructed, has to do with finding suitable food (which can be of various sorts) and consuming it effectively. Compare that with the homogeneous internal system of energy delivery that feeds and oxygenates the cells in our body via our circulatory system. Complex systems often take messy stuff from the environment and then organize it for internal use. I will use a mass-production line like the one pictured below as an example of such a system.
Source: Wikipedia

The idea is to find materials in the environment that can be used to implement a plan: turning rocks into aluminum, tree sap into rubber, and sand into glass, and heating, squashing, and otherwise manipulating raw natural resources until they come together as an airplane. This process is highly systematized so that parts are interchangeable, and the reliability of each step can be very high. Motorola invented the concept of Six Sigma to try to reduce the randomness in a manufacturing process to negligible amounts. This is at least theoretically possible in physical systems that have reliable mechanical properties.

What do we do when randomness can't be eliminated from the assembly line? One approach is to proceed anyway, because assembly lines have great economies of scale, and can perhaps be useful even if there are a large number of faulty items produced. Computer chip makers have to deal with a certain percentage of bad chips at the end of the line, for example. When chemists make organic molecules that randomly choose a left-right symmetry (i.e. chirality), sometimes they have to throw away half of the product, and there's no way around it.

The educational system in the United States has to deal with a great deal of variability in the students it gets as inputs and the processes individuals experience. It superficially resembles a mass production line. There are stages of completion (i.e. grade levels), and bits of assembly that happen in each one. There are quality checks (grades and promotion), quality assurance checks (often standardized tests), and a final stamp of approval that comes at the end (a diploma).

All this is accomplished while largely ignoring the undeniable fact that students are not standardized along a common design, and their mental machinery cannot be engineered directly the way an airplane can be assembled. In short, the raw material for the process is mind-bendingly more complex than any human-made physical device that exists.

Because of the high variability in outcomes, the means we use for quality assurance is crucially important, and this is where we have real opportunities to improve. This is the assessment problem. Current methods of assessment look a lot like factory floor assessments: the result of analyzing student performance is often a list of numbers that can be aggregated to show the executives how the line is working. Rewards and punishments may be meted out accordingly. In a real factory, the current stage of production must adequately prepare the product for the next stage of production. We must be able to correctly classify parts and pieces as "acceptable" or "not acceptable" according to whether or not they will function as required in the whole assembly. The odd thing about educational testing is that this kind of question doesn't seem to be asked and answered in a way that takes test error into account. Randomness is simply wished away as if it didn't exist. In the case of the SAT (see here), the question might be "is this student going to work out in college?" In practical terms this is defined as earning at least a B- average the first year (as defined by the College Board's benchmark). To their credit, the College Board published the answer, but this transparency is exceptional. Analyzing test quality in this way proceeds like this:
  1. State the desired observable future effect of the educational component under review.
  2. Compare test scores with actual achievement of the outcome. What percentage succeeded at each score?
  3. Find a suitable compromise between true positive and true negative rates to use as your benchmark.
  4. Publish the true positive and true negative prediction rates at that benchmark.
To repeat, the College Board has done this, and the answer is that the SAT benchmark gives the right answer 65% of the time. This would make you rich if we were predicting stock prices, but it seems awfully low for a production line quality check.
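
Here is a rough sketch of those four steps in code. Everything in it is simulated for illustration--the scores, the success probabilities, and the benchmark range are invented assumptions, not College Board data--but the procedure is the one described in the list above.

import random

random.seed(0)

# Step 1: the desired observable effect is "earned at least a B- average the
# first year." Simulate (score, succeeded) pairs with a modest, assumed link.
students = []
for _ in range(5000):
    score = random.gauss(500, 100)
    p_success = min(max((score - 300) / 400, 0.05), 0.95)  # invented relationship
    students.append((score, random.random() < p_success))

# Steps 2 and 3: for each candidate benchmark, compute the true positive and
# true negative rates, then pick a compromise (here: the best average of the two).
def rates(benchmark):
    successes = [s for s, ok in students if ok]
    failures = [s for s, ok in students if not ok]
    tpr = sum(1 for s in successes if s >= benchmark) / len(successes)
    tnr = sum(1 for s in failures if s < benchmark) / len(failures)
    return tpr, tnr

best = max(range(300, 701, 10), key=lambda b: sum(rates(b)) / 2)

# Step 4: publish the rates at that benchmark.
tpr, tnr = rates(best)
print("benchmark=%d, true positive rate=%.2f, true negative rate=%.2f" % (best, tpr, tnr))

Run against real scores and real first-year outcomes instead of simulated ones, that last line is exactly the kind of disclosure this article is asking for.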

Source: Wikipedia

Because we can't make the randomness go away, we imagine it away. So the assessments become, de facto, the measure of quality, and the quality of the tests themselves remains unexamined. In a real assembly line, an imperfect test would be found out eventually when the planes didn't perform as expected. Someone would notice and eventually track it down to a problem with the quality assurance program. There is so much uncertainty in education that this isn't possible, and the result is truly ironic: the deeper insinuation of tests that are unaccountable for their results. To be clear: any quality test that does not stand in for a clear predictive objective, with published research on its rate of correct classification in actual practice, is being used simply on faith. To be fair, it's virtually impossible to meet this bar. That excuse doesn't make the problem go away, however--it just makes it worse. One result is that test results distort perceptions of reality. If appearance is taken at face value as reality, then appearance has great economic value. There is incentive to do whatever it takes to get higher test scores, with unfortunate and predictable results.

To sum up the problem: variability is too high for good standardized tests to support an educational assembly line, and this fact is generally ignored for convenience.

I don't mean to imply that the major actors in education are incompetent. We are where we are because of historical factors that make sense as a system in evolution, and we have the means to take the next step.

The real world is not an assembly line. There is this expression we use when talking to students, "when you get out into the real world...", as if the academy were a walled garden that excludes the vulgar world at large. This too is a factory mentality. The students have heard it all before. From kindergarten on up, they hear stories about how hard it's going to be "when you get to high school," or "when you get to college." My daughter marveled at this during her first few weeks of high school. She was amazed and somewhat appalled that her middle school teachers had misled her about this. Of course it's not much harder--the system can only work with a high degree of integration, smoothing out the hard bits. If the wing assembly is slowing down the production line, then it needs attention. One could argue that the whole path from Kindergarten through Doctorate is becoming a smooth one for anyone who wants to tread it.

But the real world really is different. The assembly line stops at the hangar door, and the planes are supposed to be ready to fly. The tests don't matter anymore. No one is going to check their validity, nor delve too deeply into what a certification means after graduation. And in the real world, the factory mentality has to be unlearned: one cannot cram all night just before promotions are announced in order to game the system.

One solution is to try to change the real world to be more like the educational system. This is a practical choice for a military career, perhaps, where strict bureaucracy is essential to function. But it certainly is a mismatch for an entrepreneurial career, the engine of the nation's economic growth.

I believe it is now a reasonable and desirable choice to go the other direction, and change the assembly line to look more like the real world. The most important aspect to change is to admit uncertainty and begin to take advantage of it. This means we have to forget about the idea of standardizing and certifying. I will argue that we can do this to our great advantage, and introduce efficiencies into the economic structure of the nation in the process. Currently we pass our uncertainties on to employers. We hide behind test results and certificates, and leave it to employers to actually figure out what all that means. The result is that they have only very crude screening information at their disposal, and have to almost start from scratch to see what a graduate can actually do. The Internet can change all that. To my surprise, I discovered last week that it already is.

I had written a draft of this article last week, but when I ran it through my BS-detector, I couldn't bring myself to publish it. The reason is simple: it's just another untested idea, or so I thought. I hadn't actually employed the solution I will describe below, and so I didn't have anything concrete to show. But by coincidence, I saw exactly what I was looking for at the Virginia Assessment Group conference, at a presentation by Jeffrey Yan, who is the CEO of Digication. I didn't know about the company before last week. Vendors of higher education technology solutions may perhaps be excused for exaggerating the effectiveness of their products, and I generally think they are overpriced, too complicated, and too narrowly focused. I didn't have great expectations for Jeffrey's talk, but after about ten minutes I realized that he was showing off a practical application of what I was hypothesizing.

Digication is an eportfolio product. In what follows, I will not attempt to describe it as a software review would, but as it fits into the flow of ideas in this article.

The main idea is simple: instead of treating students as if we were preparing them for a future beyond the academy, treat them as if they were already there. In the real world, as it's called, our careers are not built on formalized assessments. To be sure, we have to deal with them in performance reviews or board certifications, but these are mostly barriers to success, not guarantees of it. Instead, it's the record of accomplishment we create as we go that matters. In many instances, promotions and accolades are inefficiently distributed, based on personal relationships and tenure, rather than merit, but this is not what we should aspire to. In fact, these imperfections are vulnerable to the sort of transparency that is within our grasp.

Senior Design Project at Stony Brook
In his presentation, Jeffrey showed examples of the sort of thing that's possible. Take a look at this senior design project at Stony Brook University. It's a real-world project to design a new sort of sphygmomanometer (blood pressure meter). Quoting from the project page:
We aim to satisfy all the customer needs by designing a [blood pressure measuring] device that translates the vibrations into a visual indication of blood pulses, more specifically the first pulse to force its way through the occluded artery (systolic) and the last pulse detectable before laminar flow is regained (diastolic).
Another showcased student portfolio was from a second-year student at the same institution, who created a public portfolio to tell the world about his interests and abilities. He shows how to solve what we call a difference equation (similar to a differential equation) using combinatoric methods here. This shows an interest in, and versatility with, the subject that cannot be communicated with a few numbers in an assembly-line type report.

By concentrating on authentic evidence of accomplishment, rather than artificially standardized means of observation, we create an important opportunity: a public portfolio can be judged on its own merits, rather than via an uncertain intermediary. It's the difference between seeing a movie yourself and knowing only that it got three and a half stars from some critic.

The solution to the factory mentality presents itself. If students see that they are working for themselves and not as part of some unfathomable assembly process, accumulating what will become a public portfolio of their accomplishments, their learning becomes transparent. They can directly compare themselves to peers in class, peers at other institutions, graduates from all over, and professionals in the field. I imagine this leading to a day when it's simply unthinkable for any professional not to have an up-to-date professional eportfolio linked to his or her professional social networking presence (see mathoverflow.net, Academia.edu, and LinkedIn.com as examples of such networks). Once started, the competitive edge enjoyed by those with portfolios will become obvious--you can learn much more from a transparent work history than you can from a resume.

While in school, of course, some work, maybe much of it, needs to be private, to gestate ideas before presenting them to the world. But the goal should be for a forward-looking institution of higher education to begin to create public sites like the Stony Brook showcase and the one at LaGuardia Community College. Ultimately, universities need to hand the portfolios off to the students to develop as their respective careers unfold. I understand that graduates get to keep their portfolios and continue to develop them under Digication's license, as long as it is maintained.

Here's the manifesto version:
We don't need grades. We don't need tests or diplomas or certificates or credit hours. None of that matters except insofar as it is useful to internal processes that may help students produce authentic evidence of achievement. That, and that alone, is how they should be judged by third parties.
Some advantages of switching from "assemble and test" to authentic work that is self-evidently valuable:
  1. We change student mentality from "cram and forget" to actual accomplishment. We can make the question "when will I ever really use this stuff?" go away.
  2. The method of assessing a portfolio is deferred to the final observer. You may be interested in someone else's opinion or you may not be. It's simply there to inspect. Once this is established, third parties will undoubtedly create a business out of rating portfolios for suitability for your business if you're too busy to do it yourself.
  3. Instead of just a certificate to carry off at graduation, students could have four years' worth of documentation on their authentic efforts. This idea is second nature to a generation who grew up blogging and posting YouTube videos.
  4. It doesn't matter where you learned what. A student who masters quantum mechanics by watching MIT or Khan Academy videos might produce work as good as someone sitting in class. It makes a real meritocracy possible.
  5. Intermediate work matters. Even if someone never finishes a degree, they have evidence beyond a list of grades that they learned something. And it's in rich detail.
There's more than this, actually. The very nature of publishing novel work is changing. At present, the remnant of paper-bound publication, with its interminable delays, exorbitant costs, virtual inability to correct errors, and tightly bound intellectual property issues, is still around. But it's dying. A journal is nothing more than a news aggregator, and those are now ubiquitous and free. It's hard to say what the final shape of publishing will be, but something like a standardized portfolio will probably be front and center. When I say 'standardized', I mean containing certain key features like metadata and a historical archive, so that you can find things, cross-reference, and track changes. As the professional eportfolio develops, it will need help from librarians to keep it all straight, but this can be done at a much lower cost than the publishing business now incurs in lost productivity, restricted access, and cost to libraries.

The focus will, I believe, shift from journals and other information aggregators, to the individuals producing the work. And institutions will share in some of the glory if part of the portfolio was created under their care.

All of this has been around for a while, of course. Eportfolios are almost old news in higher education, and I've blogged about them before. My previous opinion was that there was no need for a university to invest in its own portfolio software because everything you already need is on the web. If you want to put a musical composition on the web, just use Noteflight, and of course there's YouTube for videos, and so on. All that's needed is a way to keep track of hyperlinks to these in a way that can allow instructors to annotate as needed with rubrics and such. The demos convinced me, however, that having a standard platform that can be easily accessible for private, class-wide, collaborative, or public use is worth paying for. I don't know how much it costs in practice, but there is value beyond what one can get for free on the Internet.

Portfolios have been incorporated here and there as just another part of the machinery, amounting to a private repository of student work that can be used for rubric ratings to produce more or less normalized ratings of performance--an advanced sort of grading. This is useful as a formative means of finding all sorts of pedagogical and program strengths and weaknesses. The point of this article is not that portfolios are a better way to produce test-like scores, but that the test scores themselves will become obsolete as external measures of performance. For professors to get feedback on student performance, and for the students themselves to hear directly what the professors and their peers think, is invaluable. It's essential for teaching and learning. But it's downright destructive to use this as a summative measure of performance, for example for holding teachers accountable. The instant you say "accountability," no one trusts anyone else, and there really is no way to run the enterprise but as a factory, with inspectors enforcing every policy. It cannot work in the face of the uncertainties inherent to the inputs and outputs of education.

There is a history of tension in higher education between the desire for authenticity and the simultaneous wish for factory-like operational statistics that show success or failure. The Spellings Commission Report has a nice sidebar about Neumont University and mentions their portfolio approach (their showcase is here), but can't tear itself away from standardized approaches to learning assessment. Three years before, the Council for Higher Education Accreditation beautifully illustrated the tension:
[I]t is imperative for accrediting organizations–as well as the institutions and programs they accredit–to avoid narrow definitions of student learning or excessively standardized measures of student achievement. Collegiate learning is complex, and the evidence used to investigate it must be similarly authentic and contextual. But to pass the test of public credibility–and thus remain faithful to accreditation’s historic task of quality assurance–the evidence of student learning outcomes used in the accreditation process must be rigorous, reliable, and understandable.
This is from CHEA's 2003 paper "Statement Of Mutual Responsibilities for Student Learning Outcomes: Accreditation, Institutions, and Programs."  More recently, Peter Ewell wrote "Assessment, Accountability, and Improvement: Revisiting the Tension" as the first Occasional Paper for the National Institute for Learning Outcomes Assessment, in which he illuminates the game-theoretic problem I alluded to above:
Accountability requires the entity held accountable to demonstrate, with evidence, conformity with an established standard of process or outcome. The associated incentive for that entity is to look as good as possible, regardless of the underlying performance. Improvement, in turn, entails an opposite set of incentives. Deficiencies in performance must be faithfully detected and reported so they can be acted upon. Indeed, discovering deficiencies is one of the major objectives of assessment for improvement.
In a real factory setting, tests of mechanical process can be very precise, eliminating the difference between what the assessment folks call formative (used to ferret out useful improvements) and summative (an overall rating of quality). If a machine is supposed to produce 100 widgets per hour, and it's only producing 80, it's clear what the deficit is, and the mechanic or engineer can be called in to fix it. But when one is held accountable for numbers like standardized test results that have a considerable amount of uncertainty (which itself is probably unknown, as I pointed out before), the game is very different. It is less like a factory and more like going to market with a bag of some good and some counterfeit coins, which I described in "The Economics of Imperfect Tests." One's optimal strategy has less to do with good teaching than with manipulating the test results any way one can. Unfortunate examples of that have made national news in K-12 education.

My proposal is that we in higher education take a look at what Stony Brook and others are doing, and see if there is not merit to an emphasis on authentic student learning outcomes, showcased when appropriate for their and our benefit. That we don't consider a grade card and a diploma an adequate take-away from four years and a hundred thousand dollars of investment. That instead, we help them begin to use social networking in a professional way. Set them up with a LinkedIn account during the orientation class--why not? Any sea change from teach/test/rinse/repeat to more individual and meaningful experiences will be difficult for most, but I believe there will be a payoff for those who get there first. Showing student portfolios to prospective students as well as prospective employers creates a powerful transparency that will inevitably have valuable side effects. Jeffrey said that some of the portfolios get millions of Internet views. How many views does a typical traditional assignment get? A handful at most, and maybe only one.

The odd thing is that this idea is already quietly in place and old hat in the fine arts, performing arts, and architecture departments, and there are probably some I'm not aware of. Who would hire a graphic designer without seeing her portfolio, even if she had a wonderful-looking diploma? This means that we probably have experts already on campus. Computer Science is a natural fit for this too, and there's already a professional social network set up at Stackoverflow.com.

A good first step would be to allow portfolio galleries to count for outcomes assessment results in the Voluntary System of Accountability (VSA). Currently, the only way to participate is to agree to use standardized tests. From the agreement's provision 17:
Participate in the VSA pilot project to measure student learning outcomes by selecting one of three tests to measure student learning gains. 
a) Collegiate Assessment of Academic Proficiency (CAAP) – two modules: critical thinking and writing essay - http://www.act.org/caap/. 
b) Collegiate Learning Assessment (CLA) – including performance task, analytic writing task - http://www.cae.org/content/pro_collegiate.htm. 
c) ETS Proficiency Profile (formerly known as MAPP) – two sub scores of the test: critical thinking and written communication - http://www.ets.org/. Either the Standard or the Abbreviated form can be used.
The VSA is a wonderful program, but it is handicapped by this requirement. If you already use one of these tests, that's fine, but it's expensive and a distraction if you don't find them useful. More to the point of this article, there is no option on the list to report authentic outcomes. Adopting another pilot project to see how far the public portfolio idea will sail would be a great addition.

[The next article in this series is "Tests and Dialogues"]

Acknowledgements: Thanks to Jeffrey Yan for letting me chew his ear off after his presentation. And thanks to the coordinators of the Virginia Assessment Group for putting that wonderful event together.

Disclaimer: I have no financial interest in any of the companies mentioned in this article.