Wednesday, April 28, 2010

Reflection on Generalization of Results

Blogging is sometimes painful.  The source of the discomfort is the airing of ideas and opinions that I might find ridiculous later (like maybe the next day).  Having an eternal memorial to one's dumb ideas is not attractive.  I suppose the only remedy is public reflection, which is no less discomforting.  To wit...

Yesterday I wrote:
The view from a discipline expert is naturally dubious of the claims that learning can be weighed up like a sack of potatoes, and the neural states of a hundred billion brain cells can be summarized in a seven-bit statistic with an accuracy and implicit model that can predict future behavior in some important respect.  Aren't critical thinkers supposed to be skeptical of claims like that?
I've mulled this over for a day.  A counter-argument might go like this:  A sack of potatoes has a very large number of atoms in it, and yet we can reduce those down to a single meaningful statistic (weight or mass) that is a statistical parameter determined from multiple measurements.  The true value of this parameter is presumed to exist, but we cannot know it except within some error bounds with some degree of probabilistic certainty.  This is not different from, say, an IQ test in those particulars.
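To make the counter-argument concrete, here is a minimal sketch of the potato-weighing statistic, assuming a handful of repeated readings from a scale (the numbers are invented for illustration):

# Minimal sketch: estimating the "true" weight of a sack of potatoes from
# repeated weighings.  The readings below are invented for illustration.
import math
import statistics

weighings = [10.21, 10.18, 10.25, 10.19, 10.22, 10.20]   # hypothetical scale readings, kg

mean = statistics.mean(weighings)
sem = statistics.stdev(weighings) / math.sqrt(len(weighings))   # standard error of the mean

# Rough 95% error bounds via the normal approximation (about 1.96 standard errors)
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"estimated weight: {mean:.3f} kg, 95% bounds: ({low:.3f}, {high:.3f})")

The "true" weight never appears in the calculation; we only ever get a point estimate plus error bounds, which is exactly the sense in which the analogy to a test score holds.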

I think that there is a difference, however.  Let's start with the basic assumption at work: that our neighborhood of the universe is reliable, meaning that if we repeat an experiment with the same initial conditions, we'll get the same outcomes.  Or, failing that, we'll get a well-defined distribution of outcomes (like the double-slit experiment in quantum mechanics).  We also assume that similar experiments yield similar results for a significant subset of all experiments.  This "smoothness" assumption grants us license to do inductive reasoning, to generalize results we have seen to ones we have not.  Without these assumptions, it's hard to see how we could do science. Restating the assumptions:
1. Reliability: An experiment under the same conditions gives the same results, or (in a weaker version) a frequency distribution of results with relatively low entropy (see the sketch below).

2. Continuity: Experiments with "nearby" initial conditions give "nearby" results.
Condition 1 grants us license to assume the experiment relates to the physical universe.  If I'm the only one who ever sees unicorns in the yard, it's hard to justify the universality of the statement.  Condition 2 allows us to make inductive generalizations, which we need in order to make meaningful predictions about the future.  This is why the laws of physics are so powerful--with just a few descriptions, validated by a finite number of experiments, we can predict an infinite number of outcomes accurately across a landscape of experimental possibilities.
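For what it's worth, the weaker version of condition 1 can be made concrete: "low entropy" just means the outcome frequencies pile up on a few results rather than spreading out evenly.  A minimal sketch, with invented frequencies:

# Sketch of condition 1's weaker form: a reliable experiment concentrates its
# outcomes, so the frequency distribution has low Shannon entropy.
# The outcome probabilities below are invented for illustration.
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a discrete outcome distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

reliable   = [0.90, 0.05, 0.03, 0.02]   # one outcome dominates
unreliable = [0.25, 0.25, 0.25, 0.25]   # outcomes spread evenly

print(entropy(reliable))     # about 0.6 bits: low entropy
print(entropy(unreliable))   # 2.0 bits: the maximum for four outcomes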

My implicit point in the quote above is that outcomes assessment may satisfy the first condition but not the second.  Let's look at an example or two.
Example.  A grade school teacher shows students how the times table works, and begins assessing them daily with a timed test to see how much they know.  This may be pretty reliable--if Tatiana doesn't know her 7s, she'll likely get them wrong consistently.  What is the continuity of the outcome?  Once a student routinely gets 100% on the test, what can we say?  We can say that Tatiana has learned her times tables (to 10 or whatever), and that seems like an accurate statement.  If I said instead that Tatiana can multiply numbers, this may or may not be true.  Maybe she doesn't know how to carry yet, and so can't multiply two-digit numbers.  Therefore, the result is not very generalizable. 
Example.  A university administers a general "critical thinking" standardized test to graduating students.  Careful trials have shown a reasonable level of reliability.  What is the continuity of the outcome?  If we say "our students who took the test scored x% on average," that's a statement of fact.  How far can we generalize?  I can argue statistically that the other students would have had similar scores.  I may be nervous about that, however, since I had to bribe students to take the test.  Can I make a general statement about the skill set students have learned?  Can I say "our graduates have demonstrated on average that they can think critically"?
To answer the last question we have to know the connection between the test and what's more generally defined as critical thinking.  This is a validity question.  But what we see on standardized tests are very particular types of items, not a whole spectrum of "critical thinking" across disciplines.  In order to be generally administered, they probably have to be that way. 

Can I generalize from one of these tests and say that good critical thinkers in, say, forming an argument, are also good critical thinkers in finding a mathematical proof or synthesizing an organic molecule or translating from Sanskrit or creating an advertisement or critiquing a poem?  I don't think so.  I think there is little generality between these.  Otherwise disciplines would not require special study--just learn general critical thinking and you're good to go.

I don't think the issue of generalization (what I called continuity) in testing gets enough attention.  We talk about "test validity," which papers over the fact that validity is really about a proposition.  How general those propositions can be and still be valid should be the central question.  When test-makers tell us they're going to measure the "value added" by our curriculum, there ought to be a bunch of technical work that shows exactly what that means.  In the narrowest sense, it's some statistic that gets crunched, and is only a data-compressed snapshot of an empirical observation.  But the intent is clearly to generalize that statistic into something far grander in meaning, in relation to the real world.
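To be concrete about that narrowest sense, here is a minimal sketch of one way such a statistic might get crunched: a plain mean gain between an entering and a graduating cohort on the same test.  The scores are invented, and real value-added models are more elaborate than this, but the shape of the output is the same: a single compressed number.

# One narrow, technical reading of "value added": the mean gain between an
# entering cohort and a graduating cohort on the same test.  Scores are
# invented; real value-added models are more elaborate (regression
# adjustments, expected-score comparisons, etc.).
import statistics

freshman_scores = [54, 61, 58, 49, 65, 57]
senior_scores   = [63, 66, 60, 58, 71, 64]

value_added = statistics.mean(senior_scores) - statistics.mean(freshman_scores)
print(f"'value added': {value_added:.1f} points")

# Nothing in this arithmetic licenses the grander everyday-language reading
# of the phrase; it is a data-compressed snapshot of two samples.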

Test makers don't have to do that work because of the sleight of hand between technical language and everyday speech.  We naturally conjure an image of what "value added" means--we know what the words mean individually, and can put them together.  Left unanalyzed, this sense is misleading.  The obvious way to see if that generalization can be made would be to scientifically survey everyone involved to see if the general-language notion of "value added" lines up nicely with the technical one.  This wouldn't be hard to do.  Suppose they are negatively correlated.  Wouldn't we be interested in that?

Harking back to the example in the quote, weighing potatoes under normal conditions satisfies both conditions.  With a good scale, I'll get very similar results every time I weigh.  And if I add a bit more spud, I get a bit more weight.  So it's pretty reliable and continuous.  But not under all conditions.  If I wait long enough, water will evaporate out or bugs will eat them, changing the measurement.  Or if I take them into orbit, the scale will read differently.

The limits of generalization are trickier when talking about learning outcomes.  Even if we assume that under identical conditions, identical results will occur (condition 1), the continuity condition is hard to argue for.  First, we have to say what we mean by "nearby" experiments.  This is simple for weight measurements, but not for thinking exercises.  Is performance on a standardized test "near" the same activity in a job capacity?  Is writing "near" reading?  It seems to me that this kind of topological mapping would be a really useful enterprise for higher education to do.  At the simplest level, it could just be a big correlation matrix that is reliably verified.  As it is, the implicit claims of generalizability of the standardized tests of thinking ability are too much to take on faith.
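As a sketch of what that correlation matrix might look like at its very simplest, suppose we had per-student scores on a few different kinds of "thinking" tasks.  The task names and scores below are invented for illustration:

# Sketch of the "big correlation matrix" idea: how strongly does performance
# on one kind of task track performance on another?  All scores are invented.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-student scores on different "thinking" tasks.
scores = {
    "standardized_test": [72, 85, 60, 90, 78],
    "job_task_rating":   [70, 80, 65, 88, 74],
    "writing_sample":    [88, 62, 75, 70, 90],
}

names = list(scores)
for a in names:
    row = [f"{pearson(scores[a], scores[b]):5.2f}" for b in names]
    print(f"{a:18s} " + " ".join(row))

If general critical thinking transferred as advertised, the off-diagonal entries would all be strongly positive.  Whether they are is an empirical question, which is the point.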

So, I stand by the quoted paragraph. It just took some thinking about why.
