Thursday, April 29, 2010

Surviving Entropy

My 'for-fun' research project is on survival strategies for complex systems.  I found an interesting result last night regarding systems survival.  To make sense of it, you need to understand the context.  The big picture is a system (think biological organism or intelligent civilization) that moves through time trying to maintain a state--surviving, in other words.  The system is made up of components that degrade via entropy.  As a simpler analogy, imagine a satellite that has an on-board computer, but has lost all but the most basic programming.  All that the thing can do is upload and run a new program you send it.  See the amazing graphic below.  This is like a system moving through time and being subject to random internal changes.  The program as it is sent is the original version, and the program as it arrives is the one after some time has gone by.



To simulate entropy, imagine that the satellite's radios cannot do any kind of error correction or detection--you have to build that into the code of the program you send.  This is a very difficult challenge--to write a program that can detect and correct errors in itself.  It can't be done with any kind of certainty because the functionality of the program may be corrupted by the very errors it needs to correct. 

To do the math, we need some definitions, which the graphic below is intended to illustrate.  The rows below are the symbols we send to the satellite's computer, comprising a program.  In the illustration the rows are 28 blocks long, representing 28 symbols (very short for a real program).  Each one of these is vulnerable to corruption due to noise in the channel.  The received code is marked red in a block if a transmission error has caused the wrong symbol to be received.  Our first parameter n is the number of symbols.  So n=28 in the illustrations.

If the probability of a symbol arriving intact is p, and the probabilities are independent of one another, then the probability of all of them arriving intact is

\[ p^{n}. \]

Because I'm really interested in what happens in nature as survival systems are subjected to entropy, we need not restrict ourselves to sending a single copy of the program.  Nature is massively parallel, so imagine now that we have lots of computers running simultaneously and we can send several copies (call that parameter k).  In the graphic below, k=5, since there are five layers (five mutually redundant copies).
The probability of at least one program surviving is given by

\[ 1 - \left(1 - p^{n}\right)^{k}. \]
Now imagine that our program can be broken up into self-contained chunks. Each segment has the property that it can detect if it is broken some of the time (I call this the Am I Me? test) and the ones that are working correctly can detect and incapacitate copies of their chunk that aren't working (the Are You Me? test). All we need is one surviving chunk from each type. This is illustrated below, with the vertical dividers showing chunks. Dark green blocks are members of chunk-instances that survive. The number of chunks is parameter m.
In the simplest case where all the chunks are the same size (m divides n), the probability of survival is now

\[ \left(1 - \left(1 - p^{\,n/m}\right)^{k}\right)^{m}. \]
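For the curious, here is a quick numerical sketch of the three formulas above.  The function names are mine, and the values of n, k, m, and p are arbitrary (chosen so that m divides n), just to see what the numbers look like:

```python
# Numerical sketch of the three survival formulas above.

def p_single(p, n):
    """Probability that one unsegmented copy of n symbols arrives intact."""
    return p ** n

def p_redundant(p, n, k):
    """Probability that at least one of k independent whole copies arrives intact."""
    return 1 - (1 - p ** n) ** k

def p_segmented(p, n, k, m):
    """Probability of survival when the program is split into m equal chunks,
    each sent in k copies, and every chunk needs at least one intact copy."""
    return (1 - (1 - p ** (n // m)) ** k) ** m   # n // m since m divides n

p, n, k, m = 0.99, 28, 5, 4
print(p_single(p, n))          # about 0.755
print(p_redundant(p, n, k))    # about 0.999
print(p_segmented(p, n, k, m)) # about 0.99999
```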
As you would guess, when k is larger the probability of survival (correct uploading of the program) increases, and as m grows the same is true.  What's interesting is that the best trade-off between the two is different, depending on whether p is near zero or near one.  When the probability p of any one symbol arriving correctly is small, the advantage of the segmented transmission (m > 1) over unsegmented (m = 1) is

\[ \lim_{p \to 0} \frac{\text{seg}}{\text{unseg}} = k^{\,m-1}. \]

This means that for unreliable transmission of symbols, once k is modestly large, it pays to concentrate on building up m because of the way exponents work.  So more segments is the best survival strategy.  On the other hand, if the chances of an individual symbol arriving intact are high, we have

\[ \lim_{p \to 1} \frac{1-\text{seg}}{1-\text{unseg}} = \frac{1}{m^{\,k-1}}. \]
[Edit: actually the limit is for (1-seg)/(1-unseg), which is a comparison of the chance of NOT surviving, but the rest of the conclusions hold.  See proofs.]  The proofs of these limits are too long to include in the margins of this blog.  Notice the symmetry?  The situation is now reversed, and once we have a modest number of segments, we're better off building more redundancy in them.  Also notice that the number of program steps n is not present in either limit.  This means that the strategies are scalable.
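If you'd rather check the limits numerically than take the proofs on faith, something like this works.  Exact rational arithmetic sidesteps float underflow when p is tiny; the parameter values are arbitrary:

```python
# Numerical check of the two limits; parameters arbitrary (m divides n).
from fractions import Fraction

n, k, m = 28, 5, 4

def unseg(p):
    return 1 - (1 - p ** n) ** k

def seg(p):
    return (1 - (1 - p ** (n // m)) ** k) ** m

# As p -> 0, seg/unseg should approach k**(m - 1) = 125.
for p in (Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)):
    print(float(seg(p) / unseg(p)))

# As p -> 1, (1 - seg)/(1 - unseg) should approach 1/m**(k - 1) = 1/256.
for p in (Fraction(9, 10), Fraction(99, 100), Fraction(999, 1000)):
    print(float((1 - seg(p)) / (1 - unseg(p))))
```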

I'm looking for real world examples of this phenomenon, either in human-designed systems or biological ones.  For example, in comparing molecules to cells in the human body, we can assume that molecular mechanisms are very reliable, with high probability of persisting across time (p close to one).  Cells have both Am I Me? and Are You Me? tests (programmed cell death and the immune system, respectively), so they qualify as components made up of molecules.  Many cells of different types make up a body, so the number of types is m and the number of each type is k.  Without further assumptions, the model above would predict that we would see many more copies of each cell type than number of cell types.  This seems to be the case: there are trillions of cells and hundreds of cell types.

How robust is this?  Are there other examples?  I'd like to find an example where p is low.

Wednesday, April 28, 2010

Reflection on Generalization of Results

Blogging is sometimes painful.  The source of the discomfort is the airing of ideas and opinions that I might find ridiculous later (like maybe the next day).  Having an eternal memorial to one's dumb ideas is not attractive.  I suppose the only remedy is public reflection, which is no less discomforting.  To wit...

Yesterday I wrote:
The view from a discipline expert is naturally dubious of the claims that learning can be weighed up like a sack of potatoes, and the neural states of a hundred billion brain cells can be summarized in a seven-bit statistic with an accuracy and implicit model that can predict future behavior in some important respect.  Aren't critical thinkers supposed to be skeptical of claims like that?
I've mulled this over for a day.  A counter-argument might go like this:  A sack of potatoes has a very large number of atoms in it, and yet we can reduce those down to a single meaningful statistic (weight or mass) that is a statistical parameter determined from multiple measurements.  The true value of this parameter is presumed to exist, but we cannot know it except within some error bounds with some degree of probabilistic certainty.  This is not different from, say, an IQ test in those particulars.

I think that there is a difference, however.  Let's start with the basic assumption at work: that our neighborhood of the universe is reliable, meaning that if we repeat an experiment with the same initial conditions, we'll get the same outcomes.  Or, failing that, we'll get a well-defined distribution of outcomes (like the double slit experiment in quantum mechanics).  Moreover, we additionally assume that similar experiments yield similar results for a significant subset of all experiments.  This "smoothness" assumption grants us license to do inductive reasoning, to generalize results we have seen to ones we have not.  Without these assumptions, it's hard to see how we could do science. Restating the assumptions:
1. Reliability: An experiment under the same conditions gives the same results, or (weaker version) a frequency distribution with relatively low entropy.

2. Continuity: Experiments with "nearby" initial conditions give "nearby" results.
Condition 1 grants us license to assume the experiment relates to the physical universe.  If I'm the only one who ever sees unicorns in the yard, it's hard to justify the universality of the statement.  Condition 2 allows us to make inductive generalizations, which is necessary to make meaningful predictions about the future.  This is why the laws of physics are so powerful--with just a few descriptions, validated by a finite number of experiments, we can predict an infinite number of outcomes accurately across a landscape of experimental possibilities. 

My implicit point in the quote above is that outcomes assessment may satisfy the first condition but not the second.  Let's look at an example or two.
Example.  A grade school teacher shows students how the times table works, and begins assessing them daily with a timed test to see how much they know.  This may be pretty reliable--if Tatiana doesn't know her 7s, she'll likely get them wrong consistently.  What is the continuity of the outcome?  Once a student routinely gets 100% on the test, what can we say?  We can say that Tatiana has learned her times tables (to 10 or whatever), and that seems like an accurate statement.  If I said instead that Tatiana can multiply numbers, this may or may not be true.  Maybe she doesn't know how to carry yet, and so can't multiply two-digit numbers.  Therefore, the result is not very generalizable. 
Example.  A university administers a general "critical thinking" standardized test to graduating students.  Careful trials have shown a reasonable level of reliability.  What is the continuity of the outcome?  If we say "our students who took the test scored x% on average," that's a statement of fact.  How far can we generalize?  I can argue statistically that the other students would have had similar scores.  I may be nervous about that, however, since I had to bribe students to take the test.  Can I make a general statement about the skill set students have learned?  Can I say "our graduates have demonstrated on average that they can think critically"? 
To answer the last question we have to know the connection between the test and what's more generally defined as critical thinking.  This is a validity question.  But what we see on standardized tests are very particular types of items, not a whole spectrum of "critical thinking" across disciplines.  In order to be generally administered, they probably have to be that way. 

Can I generalize from one of these tests and say that good critical thinkers in, say, forming an argument, are also good critical thinkers in finding a mathematical proof or synthesizing an organic molecule or translating from Sanskrit or creating an advertisement or critiquing a poem?  I don't think so.  I think there is little generality between these.  Otherwise disciplines would not require special study--just learn general critical thinking and you're good to go.

I don't think the issue of generalization (what I called continuity)  in testing gets enough attention.  We talk about "test validity," which wallpapers over the issue that validity is really about a proposition.   How general those propositions can be and still be valid should be the central question.  When test-makers tell us they're going to measure the "value added" by our curriculum, there ought to be a bunch of technical work that shows exactly what that means.  In the most narrow sense, it's some statistic that gets crunched, and is only a data-compressed snapshot of an empirical observation.  But the intent is clearly to generalize that statistic into something far grander in meaning, in relation to the real world. 

Test makers don't have to do that work because of the sleight of hand between technical language and everyday speech.  We naturally conjure an image of what "value added" means--we know what the words mean individually, and can put them together.  Left unanalyzed, this sense is misleading.  The obvious way to see if that generalization can be made would be to scientifically survey everyone involved to see if the general-language notion of "value added" lines up nicely with the technical one.  This wouldn't be hard to do.  Suppose they are negatively correlated.  Wouldn't we be interested in that?

Harking back to the example in the quote, weighing potatoes under normal conditions satisfies both conditions.  With a good scale, I'll get very similar results every time I weigh.  And if I add a bit more spud I get a bit more weight.  So it's pretty reliable and continuous.  But not under all conditions.  If I wait long enough, water will evaporate out or bugs will eat them, changing the measurement.  Or if I take them into orbit, the scale will read differently.  The limits of generalization are trickier when talking about learning outcomes.  Even if we assume that under identical conditions, identical results will occur (condition 1) the continuity condition is hard to argue for.  First, we have to say what we mean by "nearby" experiments.  This is simple for weight measurements, but not for thinking exercises.  Is performance on a standardized test "near" the same activity in a job capacity?  Is writing "near" reading?  It seems to me that this kind of topological mapping would be a really useful enterprise for higher education to do.  At the simplest level it could just be a big correlation matrix that is reliably verified.  As it is, the implicit claims of generalizability of the standardized tests of thinking ability are too much to take on faith. 

So, I stand by the quoted paragraph. It just took some thinking about why.

Tuesday, April 27, 2010

Opening Doors

"Opening Doors to Faculty Involvement in Assessment" is the title of a new paper by Pat Hutchings, published by the National Institute for Learning Outcomes Assessment. Here's the thesis:
The assessment literature is replete with admonitions about the importance of faculty involvement, a kind of gold standard widely understood to be the key to assessment’s impact “on the ground,” in classrooms where teachers and students meet. Unfortunately, much of what has been done in the name of assessment has failed to engage large numbers of faculty in significant ways.
She ultimately suggests some remedies:
  1. Build assessment around the regular, ongoing work of teaching and learning;
  2. Make a place for assessment in faculty development;
  3. Integrate assessment into the preparation of graduate students;
  4. Reframe assessment as scholarship;
  5. Create campus spaces and occasions for constructive assessment conversation and action; and
  6. Involve students in assessment.
There's a lot to dissect here. Let's start with the big picture: what is faculty-driven assessment supposed to achieve? If the answer is better classroom instruction, that's one thing. That's easy. If the answer is to ensure that graduates are prepared for the work force in the name of accountability, that's another issue entirely. In this paper, the ultimate outcome isn't clear to me. I've asked this question at conferences--including once to Peter Ewell (who wrote a foreword to this piece), asking for the government to give us summative information about employment histories of graduates by institution (from the IRS, I presume) so that we could actually see what happens to them, at least in terms of earning power. I wrote an article for U. Business with the same plea. I've asked at the state level. Forget about standardized tests--this would give us actual information, not proxies--for something close to accountability. We could look at total cost compared to financial outcomes and employment chances. This matters because it could affect big curriculum changes in a way that classroom-centered assessment cannot. We can debate whether such a mercantile view of education has merit, but the results could be surprising, as this Wall Street Journal report, and my analysis of it, shows. The article does not give us a hard goal as the outcome of assessment, which is ironically typical of discussions about assessment.  This confusion between micro and macro is one of the obstacles to getting anything done.  In his foreword, Peter Ewell writes:
Now we have creative and authentic standardized general skills tests like the Collegiate Learning Assessment (CLA) and the Critical-Thinking Assessment Test (CAT), as well as a range of solid techniques like curriculum mapping, rubric-based grading, and electronic portfolios. These technical developments have yielded valid mechanisms for gathering evidence of student performance that look a lot more like how faculty do this than ScanTron forms and bubble sheets.
The assessment techniques here range from general standardized tests suitable (maybe) only for comparing institutions to archival techniques for individual student work, encompassing a spectrum of approaches to assessment, and more or less suitable depending on the ultimate outcome.  The descriptors "authentic" and "valid" are contingent on what the use is. The CLA isn't useful for determining if a math major has learned any math, or a dance major has learned any dance.  Is it useful for predicting employment?  Who knows? Although it doesn't look like a bubble sheet, it correlates very highly with SAT, so the effect is arguably the same.  Mechanisms, of course, can't be valid (only propositions can), and I think the excerpt more than anything is a statement about what is fashionable in the view of top-down assessment management. 

The author talks explicitly about  the management of assessment activities (pg 9):
“If one endorsed the idea that, say, a truly successful liberal arts education is transformative or inspires wonder, the language of inputs and outputs and ‘value added’ leaves one cold” (Struck, 2007, p. 2). In short, it is striking how quickly assessment can come to be seen as part of “the management culture” (Walvoord, 2004, p. 7) rather than as a process at the heart of faculty’s work and interactions with students.
I think this is accurate, and one has to admit that assessment is largely driven by management, starting with accreditors in many cases. The best classroom-assessment cultures are probably built bottom-up from the faculty, but the impetus for assessment is top-down. This is not helped by the language and culture of the community of assessment professionals, which is heavily influenced by the testing = measurement = reality philosophy at home in Educational Psychology programs and standardized testing companies. 

The view from a discipline expert is naturally dubious of the claims that learning can be weighed up like a sack of potatoes, and the neural states of a hundred billion brain cells can be summarized in a seven-bit statistic with an accuracy and implicit model that can predict future behavior in some important respect.  Aren't critical thinkers supposed to be skeptical of claims like that?

Management can be hypocritical too.   The standard line is that grades aren't assessments, implying that grades are independent of learning.  If this is true, the whole schema for assigning and recording grades is a colossal fraud that the management (from the feds down) ought to be rooting out and replacing with assessments they believe in.  How many institutions other than WGU don't give grades?  And why do test makers use GPAs in making arguments for validity? (Edit:  CLA, for example.)

On the other hand, it makes perfect sense to a faculty member to focus on what happens in the classroom.  Good teachers, chairs, and program coordinators already make improvements based on what they see.  It makes sense to institutionalize this by rewarding the activity, advertising techniques that seem to work, and focusing attention on learning.  "Opening Doors" primarily focuses on how this might be done, and is recommended reading for assessment directors who work with faculty as facilitators.  One of the problems is that PhDs are given degrees for knowledge of a discipline, not teaching effectiveness.  Once employed, they find out about routine administration, and the questions identified by the author ("what purposes and goals are most important, whether those goals are met, and how to do better") are left out in hyperspace:
Ironically, however, they have not been questions that naturally arise in the daily work of the professoriate or, say, in department meetings, which are more likely to deal with parking and schedules than with student learning.
Have you ever thought about how much bureaucracy there is in a university?  All the apportionment of time and resources, the forms and signatures, accounting and correspondence, distractions and procedures?  And what is the management approach to inducing a culture of assessment?  More bureaucracy.  Like giant databases that are supposed to give measurements of layered dimensions of learning outcomes.  From page 12 we have:
Some campuses are now employing online data management systems, like E-Lumen and TracDat, that invite faculty input into and access to assessment data (Hutchings, 2009). With developments like these facilitating faculty interest and engagement in ways impossible (or impossibly time consuming or technical) in assessment’s early days, new opportunities are on the rise.
Anyone familiar with this sort of system knows it doesn't facilitate faculty interest and engagement; it sends most of them running to higheredjobs.com. Turning teachers into bureaucrats isn't the answer.

On the other side, the author talks about anecdotal evidence for (pg 7)
[...] assessment’s power to prompt collective faculty conversation about purposes, often for the first time; about discovering the need to be more explicit about goals for student learning; about finding better ways to know whether those goals are being met; and about shaping and sharing feedback that can strengthen student learning.
A couple of paragraphs earlier she quotes one faculty member as saying “assessment is asking whether my students are learning what I am teaching.”  This makes sense.  I gave a lecture once on the fundamental theorem of calculus that I thought was simply brilliant in clarity and exposition.  My bubble was burst almost immediately--I realized from their reactions and Q&A that the students hadn't understood it.  It was all a bunch of gobble-dee-goop symbols on the board to them.  It makes sense to try to fix that.  Please just don't try to do it like No Child Left Behind's ubiquitous bureaucracy or other top-down arrogance.  At the top, please just clearly articulate a goal that you can provide real, unequivocal evidence for (employment statistics, salaries, graduation rates, family size, language fluency, NOT some vague learning outcome--that's only a means to an end, if it can even be said to exist).  Even then, remember that higher ed might be compared to the final process in an assembly line where the finishing touches are put on a car.  Outcomes like employment typically start right after graduation, but that doesn't mean that higher ed can be solely held responsible for the result.  If you want to raise the intellectual capacity of the country, figure out how to turn off all the vacuous "flickering lights" entertainment that inundates young minds--I bet that would have a massive effect on literacy rates.  But I'm probably biased because I grew up without a TV.  At any rate, this is not a higher ed problem, this is a societal problem--one that Ibn Khaldun (أبو زيد عبد الرحمن بن محمد بن خلدون الحضرمي) wrote about in The Muqaddimah in the 14th century: dynasties fail because the success leads inevitably to failure (my paraphrasing).

The author sorts through some of the reasons for slow adoption of assessment.  One is that the "work of assessment is an uneasy match with institutional reward systems."  I think this is on the money.  If you look at your institution's way of evaluating teaching, chances are it relies heavily on a "customer-service" survey in the form of a standardized teaching evaluation done by students.  A formal version of ratemyprofessor.com.  This might seem fair, since everyone gets the same survey, but it's not--it's just easy.  The author mentions later on the Peer Review of Teaching Project, which was new to me.  This looks like a rich and healthy approach to teaching evaluation that would naturally involve assessment activities.  I'm looking for something like this to start a conversation at my university.  Here are some key questions identified from the website:
  • How can I show the intellectual work of teaching that takes place inside and outside of my classroom?
  • How can I systematically investigate, analyze, and document my students’ learning?
  • How can I communicate this intellectual work to campus or disciplinary conversations?
In my "view from the battlefield," the author makes a mistake in endorsing standardized testing as an answer--a lurch back into the bureaucratic viewpoint (pg 12):
The Collegiate Learning Assessment (CLA), for instance, forgoes reductive multiple-choice formats in favor of authentic tasks that would be at home in the best classrooms; CLA leaders now offer workshops to help faculty design similar tasks for their own classrooms, the idea being that these activities are precisely what students need to build and improve their critical thinking and problem-solving skills.
Here's an example of one such CLA prompt on a "make-an-argument" item, taken from one of their advertisements:
Government money would be better spent on preventing crime than in dealing with criminals after the fact.
Feel free to gasp with horror, but this sort of thing would not be at home in any of the classes I've ever taught in math or computer science.  Am I supposed to stop teaching computer architecture for a day and hold a discussion about rhetoric?  The prompt is obviously too general to have a correct answer, so I suppose the point is to see whether or not the respondent can argue well.  That's all wonderful, but it's not what the student studies math for.  Let me put it another way: would you rather fly on a plane that was designed by an engineer who knew a lot about engineering or one that got top marks on the prompt above?  To propose to compare institutions or give a measurement of "value-added" based on this stuff is ludicrous in the space where discipline-based instruction happens.  Maybe it makes sense in political science or a rhetoric class.

On the other hand, this sort of thing would look really great to politicians who deal with questions like this every day.  To them, this may be authentic.  To discipline experts, probably not.  But I have a solution: shouldn't they be learning that stuff in high school or even earlier?  I realize it's the antithesis of No Child Left Behind-style thinking, but maybe it's worth considering...

The second half of the quote above is frightening, implying that it's a good thing that the test maker can become a consultant on how to improve scores on the test.  Let's take that as a critical thinking exercise of the sort that CLA espouses.  Here are my hypotheses (I wrote about this first here):
  1. CLA is taken seriously as a way to assess value-added learning and compare institutions (this is what they advertise)
  2. CLA consultants can effectively increase an institution's scores on the test (this isn't hard to believe, since they know how the thing is scored).
Conclusion: there is economic benefit to using CLA + consultants in that it makes your institution look better relative to those who don't.  This conclusion is independent of any assumptions about learning.  It creates a system where the testing company controls the inputs and the outputs, much like SAT and SAT prep.  It's a successful business model: create a problem and sell the solution.  Unless you're really, really sure that your conclusions about the test results are useful, it's not smart to be on the receiving end of this unless you can just afford to blow the money and call it advertising.

Despite the odd misstep into virulent bureaucracy and too much enthusiasm for top-down assessments not tied to objective top-level goals, the article gives excellent advice for building the culture of assessment our accreditors are always going on about.  The recommendations are practical and useful, and can address the assessment problem at the troop level.  The big issue of getting the top-down approach fixed is not the topic of the article, and given where we are, is probably too much to hope for.

Monday, April 26, 2010

Loans

A College Board report "Who Borrows Most?" includes this striking graphic on student loan loads. 

The for-profits clearly take the description seriously.  The article makes a pitch for financial literacy for students.  I don't see that making much of a dent, though.  I think what would make a difference is disqualifying for-profits from being eligible for government-backed loans and federal aid, like Pell grants.  Why should tax dollars go to feed the bottom line of these companies?

Friday, April 23, 2010

Survival Strategies Talk

Last week I gave a talk on my math research to a group of undergrads and faculty at Augsburg College.  University of St. Thomas in St. Paul put me up a couple of nights so I could do so.  Here are the notes and presentation.  The latter will change as I add stuff to it--I use the Prezi as a mindmap for the paper I'm writing.

The blocks in the text are where to click next on the presentation, but it's approximate.  I don't have a YouTube version yet.




My topic this afternoon is survival.  We'll use math to talk about philosophy.  Both are valuable. █ I created this from data from the Wall Street Journal.  █.  According to their data, math and philosophy majors show the greatest gain in salaries. █ Leibniz knew something about both of these subjects.  █ █.

So, by survival strategies, I don’t mean buy a cabin in the woods and stock up on poodles and rifle shells█.  Mathematicians never study anything that interesting. █
 
For some reason, it’s easier to talk about something surviving than it is to talk about it being alive. So let’s go with that.  The data on your computer’s hard drive █ may survive a system crash, even though it was never  alive.

Let’s take that example.  How many of you have lost data that’s important to you?  It’s not much fun.

What are the chances of this happening?  Let’s say that the probability that your data survives a year is some number p. █ Then the chance of surviving two years is p squared (assuming it doesn’t degrade over time) and so on. At the nth year we have p to the n power.  What kind of graph does p to the n have?  █

So no matter how good it is, the hard drive has probability zero of surviving forever.

If a fixed survival probability p is a problem, we have to figure out how to increase it.  So the more general formula, where the survival probabilities can change from year to year, is here.  █  The small p is the annual probability, and the big P is the probability that our subject will get that old.  The big pi means multiply.  So lower case p sub 5 is the chance that it will survive year five, given it's already survived years one through four, whereas capital P sub five is the total probability that we get through year five. 

In order to live forever, I guess it's obvious that we want the probabilities to go up rather than remain constant.  How fast do they need to go up?

There’s a nice form we can use to describe increasing probabilities.  Here it is—a double exponential. █  If we multiply a sequence of annual probabilities like this, the geometric series formula helps us out,  █ so we can simplify total probability big Pn easily and see that as n goes to infinity, the chance of our subject surviving is p to the one over one minus b.
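In symbols, assuming the double exponential means annual probabilities of the form p_i = p^(b^(i-1)) with 0 < b < 1 (that's the form that yields the limit just quoted):

\[ P_n \;=\; \prod_{i=1}^{n} p^{\,b^{\,i-1}} \;=\; p^{\,1 + b + b^2 + \cdots + b^{\,n-1}} \;=\; p^{\frac{1-b^{\,n}}{1-b}} \;\longrightarrow\; p^{\frac{1}{1-b}} \quad \text{as } n \to \infty. \]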

This is the best we can hope for, by the way—a limit greater than zero.  A 100% survival guarantee is not in the cards.  If, in practice, these numbers are low for advanced civilizations, we wouldn't expect to see them roaming around the galaxy, for example.
 
█Here are some graphs of survival probabilities for the double exponential with different bs.  We want b to be small.

It's interesting to compare the double exponential formula to actual census data. █  Here, the bars on the left are US populations of certain ages, and the bars on the right are the double exponential probabilities with b set to 1.08.  We have to have b less than one to get a shot at indefinite survival. Some people, like Ray Kurzweil, think humans will be able to live forever, so this 1.08 figure would make a nice benchmark. 

So…back to hard drives.  What do we do in practice to increase the odds of survival? █  Backups. █ How many backups do you need in order for your data to live forever?  The number of backups can’t be fixed.  Is that clear?  It has to grow.  How fast does it have to grow?  What if you added one every year?  Is that enough? Do we need exponential growth? [poll]█

The way to figure this out is first assume the backups are independent, which is an approximation at best.  But it lets us multiply the chances of failure (we’ll call these q) to get the chance of total failure.

The linear growth model might look like this. █   In year one we have a hard drive.  In year two we add a USB stick.  In year three we add a SIM card, then a DVD, then hard copy.  If any one of these fails it gets replaced.  The only way for total failure is if all backups fail during a year.  We can bound the survival probability from below █ and show that this method is successful.  So linear growth is enough.  This graph █ compares this backup scheme with the double exponential curve we saw a minute ago.  The curves are different but have the same limit.
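In code, the bound works out like this.  A minimal sketch, assuming each of the t copies present in year t fails independently with probability q during that year (q and the horizon are arbitrary):

```python
# Linear-growth backup scheme: t independent copies in year t, each failing
# with probability q, so the chance of getting through year t is 1 - q**t.
q = 0.5           # annual failure probability of a single copy (arbitrary)
P = 1.0           # running survival probability
for t in range(1, 101):
    P *= 1 - q ** t
print(P)          # settles near a positive limit (about 0.289 for q = 0.5),
                  # so linearly many backups already give a nonzero chance
                  # of surviving forever
```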

You don’t need exponential growth. Of course real independence is impossible.  If we think about ecologies of biological life, and how these manage to survive, they combine exponential growth with variance in design to create multiple copies that develop independence from one another.  It would be a cool undergrad research project to figure out how to visualize this.  █ 

So, my friends, an ECOLOGICAL solution █ is a very powerful one to the survival problem.  How long has life existed continuously on this planet?  A few billion years, right?  █ No single organism has lived that long, but the ecological solution to the problem has worked for that long. 

Before we congratulate ourselves, however, let’s think for a moment.  Are all of the things that we would like to survive copyable?  What about the Mona Lisa?  Is a reproduction just as good?  What about yourself?  A nation?  Making copies isn’t always an option.

So what about individual or singular survival? █ Here’s where it gets really interesting.  Assume we have some subject that we can’t █ create an ecology out of. █  We have to increase the probabilities of survival directly. So this is simple-sounding.  First, figure out what possible actions are the best ones for us, then do that.  █  We could say this is the “be smart” strategy.  █  Now I chose a fitness place as an example for a reason.  How many of you consider yourselves intelligent human beings?  I don’t need to look—everybody raised their hand, right?  Okay, how many of you believe that exercise is essential to good health?  How many of those who do actually exercise? 

My point is that knowing what to do and doing it are different.  We’ll come back to that.  First, let’s focus on knowing what to do.

█A word from our sponsors.  Seriously, eat your fiber while you’re young.  Go to wikipedia and look up diverticulitis.  If you forget everything else I say today, remember that.

█ Well, we have some clues.  We call it science.  Frankly it’s amazing.  As a civilization, we are exploring every nook and cranny—and every crook and nanny too—of our environment, to find out how it works.  We are doing VIRTUALLY what evolution does physically, but much much faster. 

Niels Bohr said that predictions are hard, especially about the future.  Whereas an ecology can just try out stuff physically and throw away the mistakes as the consequence of natural selection, our intelligent survival machine can only make virtual mistakes.  This presents a problem we'll call induction risk. █

What’s the next element of this sequence?  We can guess it’s a zero because that’s the simplest explanation.  But is the simplest explanation always right?

What does a turkey think before the holidays? █  Probably—man, life is good.  Eating and sleeping all day.  It’s like a resort.  Then comes Thanksgiving. 

We have two clear and opposite points of view from classical philosophy. █  Epicurus—don't throw away possible solutions.  █  Occam's Razor—keep the simplest explanation.  More modern versions include updating a priori probabilities█, and ones based on Kolmogorov complexity theory█.  But in my opinion these are engineering hacks to the basic problem that Hume█ identified:  telling the future isn't 100% possible.  So we're left with trial and error█ and imperfect models of the world. We cannot look at past experience and predict the future with certainty, and this induction risk █ means that even when we think we know what to do, we may be wrong.  Someone said that in theory, theory and practice are the same.  In practice they're not.

Okay, well, blah, philosophy this, blah, blah.  Where’s the computer stuff?  █  As a critical thinker, what do you do when you don’t understand something complicated?  Look at simple examples.  Make a model that only takes into account a few key characteristics.  Let’s do that.


So we're going to create an ecology of algorithms—computer programs that have to try to stay alive in an environment by predicting what comes next.  You can think of these critters as an analogy to a physical ecology like those found in the wild, or as a virtual ecology of possible solutions in an intelligent being's imagination.  Either way, the goal is to inductively find the solution to a simple problem.  Or else.

Here’s how it works.  █ It’s a game like rock paper scissors.  The environment can play a 0, a 1, or some other positive integer.  The critter can do the same, with the results listed on the chart.

The critter does not get to see what the environment plays until after it’s made its own choice, so it has to rely on what it thinks the environment will do based on past experience.  In other words, induction. 

The critter can self-destruct by playing a zero.  Think of a computer program crashing.  That’s what this is—a fatal internal problem.  This is the halt code.

Or it can play a one and participate in the environment—think of this as eating something it found.  If it’s harmful, it dies.  If it’s good it survives, and otherwise it just waits to play again. 

Or it can just observe by playing something other than a zero or one.

Here's an example. █ People sometimes debate whether fire can be considered alive or not because it eats and breathes and reproduces and whatnot.  We can sidestep that and look at what a fire's survival strategy is.  It's pretty simple—it burns anything it can, right?  There's not much subtlety.  Look at the environment and the critter and tell me how long it will survive.  Until the first environmental zero. 
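Roughly, in code; the actual payoff chart is on the slide, so reading environment 0 as harmful, 1 as food, and anything else as neutral is an assumption for illustration:

```python
# A rough sketch of the game's rules and the fire strategy.

def play_round(critter, env):
    """Return 'dead', 'ate', or 'waited' for one round."""
    if critter == 0:
        return "dead"        # self-halt: the critter destroys itself
    if critter == 1:         # the critter participates (tries to eat)
        if env == 0:
            return "dead"    # it ate something harmful
        if env == 1:
            return "ate"     # it ate something good -- the ones it must accumulate
        return "waited"
    return "waited"          # playing 2, 3, ... is just observing

# The fire strategy: always play 1 ("burn anything you can").
# It survives exactly until the first environmental zero.
environment = [1, 2, 1, 1, 2, 0, 1]
for t, env in enumerate(environment):
    outcome = play_round(1, env)
    print(t, env, outcome)
    if outcome == "dead":
        break
```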

So if a fire is alive, its survival complexity is very low. It’s a constant, which is the least complex thing there is. 

Another example.  █What would you guess the environment is doing here?  We’ll see in a minute how  the simulation does with this. 

Now the goal of a critter is to accumulate an infinite number of ones.  It can't just sit there and passively watch without participating in order to qualify for survival.  There are a lot of interesting things we can do at this point.  We can classify environments by how hard they are to survive.  If there's NO infinite subsequence of 1s in the environmental plays, then it's not survivable at all.  We can say more than that by talking about the computational complexity of survivable algorithms, and whether or not they're even reachable by a critter based on inductive reasoning.  These are fascinating questions.  But let's get to the sim. 

█ I created a simple programming language out of an existing Turing-complete one called—well, the name is rude.  If you're interested I'll tell you after.  I wanted the language to have no syntax other than symbol order—so every string of commands is an acceptable program.  And I made the symbols mostly symmetrical so they look like bugs when strung together. 

This is implemented in perl, and you can have the code if you want it.  Here’s my first attempt. █  I set up a repeating environment 012012… and wrote a program to solve it.  The numbers are the state vector for the critter—numbers it can store for its own use.  Only one of these is getting used here, so I figured it would be an opportunity to improve on my design by evolving toward a smaller state vector.  I cranked up the starvation rate to help it along.  But what happened is that it found this solution pretty quickly. █  It’s like “dude, just look at the old environment and add one.”  So, okay, I felt a little stupid. 
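Written out in plain Python rather than the bug language, the strategy it found amounts to something like this (the function and variable names are mine):

```python
# The "look at the old environment and add one" solution. On the cycle
# 0,1,2,0,1,2,... playing (previous + 1) means the critter never plays 0
# (so it never self-halts) and plays 1 exactly when the environment is
# about to serve up a 1.

def critter_play(previous_env):
    return previous_env + 1

env_cycle = [0, 1, 2]
previous = 2             # assume it has already watched one full cycle
ones_eaten = 0
for t in range(12):
    env = env_cycle[t % 3]
    play = critter_play(previous)
    if play == 1 and env == 1:
        ones_eaten += 1  # a correctly timed one
    previous = env
print(ones_eaten)        # 4 -- one meal per cycle, and it never dies
```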

█ There are a lot of parameters and tweaks I won't go into. The language █ and interpreter change as I try different things out, but I keep a catalog of all of them.  Basically, successful critters get to reproduce, and their offspring mutate according to a parameter. 

[short demo]

So one thing to look at is how these virtual ecologies compare to real ones.  █  What you see here is a set of three scripted runs.  Each run is a hundred individual simulations averaged.  The severity of the environment increases to see what effect this has on population size and diversity.  Basically the critters have to learn to count to three.  They start in an entirely benign environment—an endless buffet that our “burn baby burn” strategy would work fine in.  The only risk is self-halting.  Then it transitions to a 012 cycle, so the critters have to time the ones to survive.

There’s a lot going on here, but this is population in the middle.  Transitioning from the buffet causes the population to crash.  The lines at the bottom show how many critters died from eating the wrong thing versus self-halting.  And the lines at the top give diversity. 

This is what we would expect to see.  As the environment becomes more severe, successful critters take over, which works, but limits the diversity.  Changing to a different environment once it has specialized is almost guaranteed to make the population crash and burn. 

█The statistical statement of this is Fisher’s Fundamental Theorem, generalized by Price.  I haven’t formally computed these parameters yet—that would be a good undergrad research project.

Here’s another one, where three populations are given increasingly hard environments.  Population is at the bottom, and diversity at the top.  There’s more going on here I’ll get to in a minute.  Here are some of the solutions from these three hundred runs.  █

Not one of these looks at the old environment.  The command for that is the greater than sign—it looks like an antenna.  No antennas on any of these critters—they all survive by counting out the environment, synchronizing with it.  In fact, it’s very hard to breed critters of the cause-effect sort.  I have more to say about that, but you’ve had your computational fix now—let’s zoom back to the induction risk problem.

You'll note that part of what I was looking at in those graphs was a diagnosis of how often self-halts happen.  The importance of this depends on whether these singular intelligent survival machines are really computational engines or not. 

For the sake of argument, I’m going to assume something called the  █Church-Turing Thesis applies here, and wave my hands.   See me waving?

█ If our survival machine is an intelligent robot or a bureaucratic civilization that runs on computable rules, then we have to consider self-halting.

Everybody probably knows about Turing's Halting problem.  Basically, you can't know in general if a machine will give you the blue-screen of death. █  Before you dismiss this as something that only happens in Windows, let's look at a real world example.

█ In the New York Times, Paul Krugman writes this.    So basically a government self-destructed because of one bad decision that locked up the system.

More ominous are the accounts of how close we came to nuclear war in the 20th century.  If we consider human civilization as a whole, I think it’s fair to say pulling that trigger is a self-halt.  █ You can read more theory about this sort of thing from Peter Suber. 

The key to this for me is this assumption. █ It goes like this:

  1. We have to learn about the environment to learn how to reduce induction risk.
  2. As part of that, we learn how to change our physical form.
  3. Our behavior derives from our physical form.
  4. Therefore, we can and must be able to self-modify.

█Note that with drugs and genetic engineering we can already begin to do this.

Gregory Chaitin█ has this number he calls omega, which is the average halting probability for a random computational function.  We’re interested to know what one minus omega is—that would be our average chance of survival from self-halting.  But Chaitin proved that we can’t know this probability.  So there’s some unknown survival tax, if you will, because of the self-halt risk. 

Recall that the two requirements for a singular intelligence to survive are that it learns what to do to increase its chances, and then does it.  This second part shouldn't be taken for granted.

Evolution has partially solved this problem with emotional loadings—we get hungry when we don't eat, and we have a strong survival instinct.  This genetic knowledge is different from our real-time apparatus.  As an example, imagine that someone tells you a cobra is loose in your tent.  You lie down anyway and take a nap.  Sounds hard to do, right?  Now imagine falling asleep while driving on the interstate.  People do that all the time.  What's the difference?  We might crudely say the first is physical and the second is virtual.  For a self-altering critter, everything becomes virtual.  This is related to something called Moravec's Paradox, which we don't have time to delve into.

I think the usual assumption is that an artificial intelligence would behave and think like we do.  So we program it to be hungry when it needs power, and it will then plug itself into the wall.  I think what would more likely happen is it would just say to itself, "I don't like being hungry," and turn itself off, rather like Claude Shannon's so-called "ultimate machine," whose only function was to turn itself off if someone turned it on. 

One interesting question is this—is there a way to understand computation in a topological sense █ so that we could imagine a surface with attractors of what we might call computational orbits as these processes self-modify?  Are there attractors where the processes avoid self-halt under robust environmental conditions?  I think it's fair to say that's a hard question.

I’ve started taking a look at what happens in the ecologies generated by my critters.  Here’s one.  █  The graphs show self-halting frequency over time for runs of critters with increasingly difficult environments.  The general pattern is clear—the more difficult the environment, the more the self-halts.  This is because there is more churn in the population—more deaths and births, more critters born with self-halting tendencies.  But what’s interesting is that as the population solves the problem posed by the environment, the frequency drops. 

This would not be surprising in a static environment, but this one has a twist.  █

Let's start over. █ Imagine you're working for NASA and you have a space probe that forgot all its programming.  You have to transmit the code back up to it.  Unfortunately, its communications protocols are compromised and there is no error checking or error correction on the other end.  Your program, however, is allowed to self-modify.  How do you ensure that the program that runs on the probe is uncorrupted?

We know that when things move through time, they tend to get messy. █  Formally, this is called the second law of thermodynamics.  When our hypothetical intelligent survival machine moves through time, it gets messed up too.  One would think that if it’s to survive, it has to deal with that.  Let’s see what the implications are.

In the simulation, I created a random state-change matrix that flipped program symbols according to the rule generated.  Here’s an example.  █  The greater than sign gets changed to the left bracket 91% of the time between generations of critters for every critter in the population.  You might think this would present a challenge to survival.  If you’ve done any programming, you know that randomly changing code is unlikely to have a good outcome.
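A sketch of that kind of entropy rule; the substitution table here is made up for illustration, since the real state-change matrix is generated randomly for each run:

```python
# Between generations, every symbol of every critter's program can be
# swapped for another symbol with some probability.
import random

SWAP = {
    ">": ("[", 0.91),   # '>' becomes '[' 91% of the time
    "+": ("-", 0.25),   # a second, made-up entry
}

def degrade(program, rng=random):
    out = []
    for symbol in program:
        replacement, prob = SWAP.get(symbol, (symbol, 0.0))
        out.append(replacement if rng.random() < prob else symbol)
    return "".join(out)

print(degrade("><+>[-]>"))   # most of the '>' symbols come back as '['
```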

But although no individual can survive this, an ecology can. █ And we can look at the transitions as we increase this entropy. 

Looking at individual runs rather than averages is instructive.  Here’s without random symbol swapping.  █  Almost all of them survived.  And these are the populations that had to deal with it. █  It’s a much different picture. 

So if an ecology can survive, can an intelligence?

Let me suggest one way around our space probe problem.  First, the real world isn't exactly like the scenario I described.  The real world isn't a single processor running a single program, but a massively parallel "computer".  So even if we can't create an ecology of copies of ourselves, we could send copies in parallel.  Here's the difference. █  Now instead of having to have the whole thing survive intact, you only need one of several copies intact.  This is like the control systems on an airplane—redundancy. 

In any of these, no matter what sort of error correction might be used, there are two interesting questions these modules have to be able to answer.

Am I me?  Or was I corrupted in transit?  If the answer is no, the best thing to do in the modular case is self-destruct.

The second question is are you me?  If we’re trying to maintain state—an identity—then replacing fellow defects with good copies is essential.

Can you think of any big complex thing that works this way?

Bodies, companies,…

We could go further and compartmentalize.  Now I only need one copy per module to survive. 

This leads to two interesting criteria for survival in the face of entropy:  an AM I ME test and an ARE YOU ME test.  We see both of these in compartmentalized biological organisms in the form of programmed cell death and the immune system. 
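In code, the compartmentalized scheme amounts to something like this Monte Carlo sketch, assuming symbols are corrupted independently and that the two tests catch every corrupted copy (the parameter values are arbitrary):

```python
# m chunk types, k copies of each, every symbol corrupted independently with
# probability 1 - p. The whole program survives if every chunk type has at
# least one uncorrupted copy.
import random

def survives(n, k, m, p, rng=random):
    chunk_len = n // m
    for _ in range(m):                            # each chunk type
        intact_copy = any(
            all(rng.random() < p for _ in range(chunk_len))
            for _ in range(k))
        if not intact_copy:
            return False
    return True

n, k, m, p, trials = 28, 5, 4, 0.9, 20000
estimate = sum(survives(n, k, m, p) for _ in range(trials)) / trials
print(estimate)   # lands near the closed form (1 - (1 - p**(n/m))**k)**m, about 0.85
```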

The enduring features of the universe are many and varied rather than singular and intelligent.  Think about that.

[Note: I ran out of time.  The full paper will have more details.]

Comparing CLA to Rubric Scores

“We found no statistically significant correlation between the CLA scores and the portfolio scores,” is the sentence that catches one's eye in this AAC&U feature article about University of Cincinnati's efforts to assess big fuzzy learning outcomes:
The students took the CLA, a ninety-minute nationally standardized test, during the same week in which faculty members assessed students’ e-portfolios using rubrics designed to measure effective communication and critical thinking. In the critical thinking rubric assessment, for example, faculty evaluated student proposals for experiential honors projects that they could potentially complete in upcoming years.  The faculty assessors were trained and their rubric assessments “normed” to ensure that interrater reliability was suitably high. 
 One administrator's conclusion about the mismatched scores is that:
The CLA can provide broad institutional data that satisfies VSA requirements, while rubric-based assessment provides better information to facilitate continuous program improvement. “When we talk about standardized tests, we always need to investigate how realistic the results are, how they allow for drill-down,” Robles says. “The CLA provides scores at the institutional level. It doesn’t give me a picture of how I can affect those specific students’ learning. So that’s where rubric assessment comes in—you can use it to look at data that’s compiled over time.”
You can find a PowerPoint show for the research here.  Here's a slide taken from that, which summarizes student perceptions of the CLA.
Here are the correlations in question:

It's hard to see how anything is related to anything else here, except maybe breaking arguments and analytical writing.  I would conclude that the CLA isn't assessing what's important to the faculty, and is therefore useless for making improvements.  Since UC is part of the VSA, they can't say that.  Instead they say:

The CLA is more valid? [choking on my coffee here] Valid for what? Saying that one school educates students better than another school?  The two bullets above seem Orwellian in juxtaposition.  How can an assessment be valid if it isn't useful for student-level diagnostics?  Yes, I understand that the CLA doesn't give the same items to each student, that it's intended to be only used to compare institutions or provide a "value-added" index, but the fact that cannot be escaped is that learning takes place within students, not aggregates of them.  At some point, the dots have to be connected between actual student performance and test results if they're going to really be good for anything.  Oh, but wait: here's how to do that.

By the way, if you don't know the story of Cincinnatus, it's worth checking out. 

Thursday, April 22, 2010

Survey Inventory

AIR has a nice project to inventory instruments related to assessing quality of education.  Quote:
This web site provides an inventory of resources designed to assist higher education faculty and staff in the challenging task of assessing academic and support programs as well as institutional effectiveness, more broadly.
The home page for the project is here, and the inventory itself is here.  A screen capture from the first two is shown below.

Validity and Measurement

My wife is in charge of an "applied research" center here at the university, and ordered some books on research to satisfy her curiosity about how the other half lives (she's a lit geek).  I browsed them, looking for definitions of validity and measurement.   

From The Research Methods Knowledge Base by William M. K. Trochim and James P. Donnelly:
When people think about validity in research, they tend to think in terms of research components.  You might say that a measure is a valid one, that a valid sample was drawn, or that the design had strong validity, but all of those statements are technically incorrect.  Measures, samples, and designs don't have validity--only propositions can be said to be valid.  Technically, you should say that a measure leads to valid conclusions or that a sample enables valid inferences, and so on.  It is a proposition, inference, or conclusion that can have validity.
 This is the usual definition, which most people seem to ignore.  In casual conversation, people in assessment land and in psychology say things like "use a valid measurement" routinely in my experience.  More problematic is the usual linkage between validity and reliability.  Reliability is repeatability of results, or as Wikipedia puts it nicely "the consistency of a set of measurements or measuring instrument [...]"

Do you notice anything illogical here?  If validity is about propositions (about some instrument), and reliability is a property of the test itself, we can't justify the common assertion that validity requires reliability.  It's like saying that a thermometer needs someone to read it correctly--there are two issues conflated.

Example:  Jim Bob steps onto the balcony of his hotel room on his first evening of an all-expense paid trip to Paris for winning the Bugtussel Bowling Championship.  He looks at the thermometer nailed to the door frame, and is amazed to see that it is 10 degrees.  It's chilly out, but he didn't think it was that cold!
Clearly Jim Bob mistook Celsius for Fahrenheit, and made an invalid conclusion.  This has nothing to do with the reliability of the thermometer. You may rightfully object that of course we can always reach invalid conclusions--the trick is to find valid ones, and for that we require reliable instruments.  Not so.

Example: A standardized test is given to 145 students.  The assessment director gets a list of the students who took the test.  Is this list valid?  Yes.  In what sense is it reliable?

Here, the proposition is valid--it reflects reality--but isn't reliable longitudinally.  It's reliable in the trivial sense that if we look at the same instance of testing, we'd see the same roster of students, but almost anything is reliable by that standard.  The next time we give the instrument, we won't have the same roster.  Does this unreliability mean that the roster is invalid?  Of course not.

If reliability is a sine qua non for validity, then no unique observation can be valid.  Your impressions and conclusions about a movie on first viewing or first date are invalid because they can't be repeated.  When you read a book you really like, finish it and tell your friend "I enjoyed it," this is invalid according to testing standards. You would have to first repeatedly read it for the first time and then assess the resulting statistics.

To underline the illogic of validity => reliability, consider the proposition "The inter-rater statistics on this method of assessment show it to be unreliable."  This is a common enough thing.  So how would we evaluate the validity of that statement?  If it is true that validity requires reliability, there must be an underlying reliability that is a prerequisite for the validity of the conclusion.  Does that mean that we have to show that the inter-rater statistics are reliably unreliable?  That makes no sense: one instance of unreliability is all that is required to demonstrate unreliability.  A drunk driver may be only occasionally unreliable, but still absolutely unreliable, right?

If I make the statement "no cow is ever brown" the validity of that can be negated by a single instance of a brown cow.  I don't have to be able to reliably find brown cows, just one.  Yes, I have to be sure that the cow I saw really was brown, but this is a very low bar for reliability, and not what we're talking about.  Therefore, some kinds of propositions do not require reliability in order to be valid.

If I hear on the news that the Dow went up 15 points today, how should I evaluate the validity of this statement?  I could check other sources, but this will just establish the one fact.  I cannot repeat the day over and over to see if the Dow repeats its performance each time.  There is no way to establish longitudinal reliability.  Does that negate the validity of the statement?

The role of reliability is to allow us to make an inductive leap: if X has happened consistently in the past, maybe X is a feature of the universe, and will continue to happen.  Every time we eat a hamburger (or anything else), we make such an inductive leap.  Sometimes we have to be really clever to find out where the reliable parts are--like measuring gravity.  So rather than a strong requirement about reliability and validity, we should say something like "we assume that reliability implies something persistent about the objective reality of the subject."

So, on to measurement.  I looked up that chapter in the book and found this:
Measurement is the process of observing and recording the observations that are collected as part of a research effort. (pg 56)
Contrast that with the nice Wikipedia definition:
In science, measurement is the process of obtaining the magnitude of a quantity, such as length or mass, relative to a unit of measurement, such as a meter or a kilogram.

From dictionary.com, a measurement is the "extent, size, etc., ascertained by measuring".

The scientific definitions lend themselves to units and physical dimensions.  The first definition, from the research methods book, is much more general.  Let's parse it.  There are three parts.  First, measurement is a process of observing.  Then we note that we should record the observation.  I would argue that this provision is unnecessary--we only talk about any datum as long as it's recorded somewhere, even if only in our minds.  If it's not recorded, it's nowhere to be found and irrelevant. We can, however, glean from this that 'measurement' is both a verb and a noun.

The last provision--that it must be part of a research effort--can also be discarded.  The purpose of the observer may not be known when the measurements are later used.  Perhaps an ancient astronomer made star charts for religious reasons.  Does that negate them as measurements, solely for that reason?  No.  So we are left simply with the first part: a measurement is a process of observing (verb), or the product of the same (noun).  This is much weaker than the scientific version, because we aren't tied to standard units of measurement.

My main objection to this is that there's no need to use another word.  If we mean observation, why don't we just say observation?

Edits: Fixed a vanished sentence fragment, and crossed out 'dogma', which is a silly word to use.  I hadn't had enough coffee yet. 

Math Links

Saturday, April 10, 2010

2010 Atlantic Assessment Conference

I'm co-presenting two talks on Monday at the 2010 AAC.

CS22 (10:30am): Ubiquitous Core Skills Assessment with Kaye Crook. Presentation is here.

The mission and general education goals of Coker College include a common list of "core skills": effective speaking, effective writing, analytical thinking, and creative thinking. For six years Coker has assessed these across the whole curriculum using a simple faculty driven method. Methods, results, and uses will be discussed. Note that the method is based on direct familiarity with student work; assessing a section of 300 students wouldn't work, so this is most suitable for smaller schools.

CS36 (3:15pm): Strategic Planning and Stakeholder Analysis with Kelli Rainey.  Presentation is here.

Johnson C. Smith University is in the process of transforming itself. We will give an overview in this session of the planning processes we are using, including stakeholder analysis, logic models, strategic benchmarking, project management, and the data systems that track these.