Friday, December 21, 2012

StudentLauncher: Kickstarter for Education

If you haven't yet heard of the Kickstarter phenomenon, it's a way to "crowdsource" funding for a project you may have. Instead of finding a large donor for your entrepreneurial venture, you can make a pitch to a wide audience and gather up lots of small (e.g. $25) contributions. In return each donor gets some small reward. For example, I've put together a Kickstarter project in order to pay for editing the first part of my sci-fi novel Life Artificial, but it's not quite ready to launch.

Now there is a site that enables the crowdsourcing of educational projects: StudentLauncher.org. The idea is simply explained in a graphic on the site's home page:


The projects can be quite ambitious. Here's the description of 'Pencils of Promise' from North Carolina State:
Currently 75 million children do not have access to education. Pencils of Promise aims to lower this number by as much as possible, building one school at a time in developing countries. These children have such promise in them and such a burning desire to learn. PoP at NCSU supports PoP's mission through awareness and fundraising. Only $25 sends a child to school for a year! You can make a difference. With $500 we can send 20 children to school for a year. Help us help others because a generation empowered empowers the world.
There's also a video.

If you read my post "The End of Preparation" (which we are turning into a book proposal), you'll be familiar with my argument that 1. students can do things that matter while they are students, and 2. education should primarily be about doing things that matter instead of preparing to do things that matter. StudentLauncher.org has the potential to be a powerful enabler of that idea. Kudos to site creator Tom Krieglstein for this great idea.

Monday, November 26, 2012

Generating Curricular Nets

I recently developed some code to take student enrollment information and convert it into a visual map of the curriculum, showing how enrollments flow from one course to another. For example, you'd expect a lot of BIO 101 students to take BIO 102 within the next two semesters. In order to 'x-ray' course offerings, I have to set thresholds for displaying links--for example, a minimum transfer of 30% of the enrollment from one course to another in order for a link to show up. There are many ways to add meta-data in the form of text and color, for example using the thickness of the graph edges (the connecting lines) to signify the magnitude of the flow. This is a directed graph, so it has arrows you can't see at the resolution I've provided. Other data include the course name, enrollment statistics, and the college represented. The tool can also isolate one part of the curriculum at a time to produce more fine-grained graphs.
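For concreteness, here is a minimal sketch of that pipeline in Perl (not the actual code), assuming enrollment records arrive as student,term,course rows on standard input with numeric term codes; the two-term look-ahead and the 30% threshold follow the example above, but the field layout is made up.

#!/usr/bin/perl
# Sketch: build a thresholded course-flow graph from enrollment records.
# Assumes student,term,course rows on standard input with numeric term codes.
use strict;
use warnings;

my %taken;    # $taken{$student}{$term} = [ courses taken that term ]
while (my $line = <STDIN>) {
    chomp $line;
    my ($student, $term, $course) = split /,/, $line;
    push @{ $taken{$student}{$term} }, $course;
}

my (%flow, %enrollment);
for my $student (keys %taken) {
    my @terms = sort { $a <=> $b } keys %{ $taken{$student} };
    for my $i (0 .. $#terms) {
        for my $from (@{ $taken{$student}{ $terms[$i] } }) {
            $enrollment{$from}++;
            # look ahead up to two terms for destination courses
            my $last = $i + 2 > $#terms ? $#terms : $i + 2;
            for my $j ($i + 1 .. $last) {
                $flow{$from}{$_}++ for @{ $taken{$student}{ $terms[$j] } };
            }
        }
    }
}

# Emit a Graphviz digraph, keeping only links that carry at least 30% of
# the source course's enrollment; edge thickness scales with the flow.
print "digraph curriculum {\n";
for my $from (keys %flow) {
    for my $to (keys %{ $flow{$from} }) {
        my $frac = $flow{$from}{$to} / $enrollment{$from};
        next if $frac < 0.30;
        printf qq{  "%s" -> "%s" [penwidth=%.1f];\n}, $from, $to, 1 + 10 * $frac;
    }
}
print "}\n";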

The graph below shows a whole institution's curriculum. The sciences, which are highly structured, clump together in the middle. Less strongly linked structures are visible as constellations around the center. I particularly like the dog shape at lower left. This sort of thing can be used to see where the log-jams are, and to compare what advisors think is happening to what actually is.


Friday, November 09, 2012

Application Trajectories

For any tuition-driven college, the run-up to the arrival of the fall incoming class can be an exciting time. There are ways to lessen that excitement, and one of the simplest is to track the S-curves associated with key enrollment indices. An example is shown below.


In this (made-up) example, historical accepted applications are shown by application date as they accumulate from some initial week. It's a good exercise to find two or three years of data to see how stable this curve is:

1. Get a table of all accepted applicants, showing the date of their acceptance.
2. Use Excel's weeknum(x) function, or something similar, to convert the dates into weeks, and normalize so that 1 = the first week, etc. This first week doesn't have to correspond to the actual recruiting season. You just need a fixed point of comparison.
3. Accumulate these as a growing sum to create the S-curve by week.
4. Plot multiple years side-by-side.
5. If this is successful, the curves will be pretty close to multiples of one another. That is, they will have the same shape but perhaps different amplitudes. You can normalize them by dividing each by its total, so that each curve reaches 1 at the right of the S. This is your distribution curve. You may want to average the last two years.

Once you have a historical distribution curve, you can multiply it by your admit goal to get the trajectory you hope to see during the current cycle. The graph above illustrates the case where the current numbers are on track to meet the goal. If the current numbers drift off the curve, you'll have lots of warning, and can plan for it.
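Here is a minimal Perl sketch of steps 3-5 plus the goal-scaling, assuming one acceptance week number per input line (the output of step 2); the admit goal of 1,200 is made up.

#!/usr/bin/perl
# Sketch: turn acceptance weeks into a normalized S-curve and scale it
# by an admit goal to get a target trajectory for the current cycle.
use strict;
use warnings;

my $goal = 1200;              # this cycle's admit goal (made up)
my (%count_by_week, $max_week);

while (my $week = <STDIN>) {  # one acceptance week number per line
    chomp $week;
    $count_by_week{$week}++;
    $max_week = $week if !defined $max_week or $week > $max_week;
}

# Accumulate into an S-curve, then normalize so the final value is 1.
my ($running, @cumulative) = (0);
for my $week (1 .. $max_week) {
    $running += $count_by_week{$week} // 0;
    push @cumulative, $running;
}
my $total = $cumulative[-1];

printf "%-5s %-13s %s\n", 'week', 'distribution', 'target';
for my $week (1 .. $max_week) {
    my $fraction = $cumulative[ $week - 1 ] / $total;
    printf "%-5d %-13.3f %.1f\n", $week, $fraction, $fraction * $goal;
}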

Note that this probably works with deposits and other indicators too. I've only ever used it with accepted applicants.



Sunday, November 04, 2012

Nominal Realities


‎"A poet's understanding of reality comes to him together with his verse, which always contains some element of anticipation of the future." -- N. Mandelstam

Last week I presented a paper "Nominal Realities and the Subversion of Intelligence" at the Southern Comparative Literature Association's annual meeting in Las Vegas. It was a strange turn of events that led me (a mathematician by training) to such an event, but well worth it. The ideas were influenced by my work in complex systems and assessment as well (not unrelated).

Friday, November 02, 2012

Inter-rater Reliability

At the Assessment Institute this week, I saw presentations on three different ways to examine inter-rater reliability. These included classical parametric approaches, such as Krippendorff's Alpha, a paired-difference t-test to detect bias over time, and a Rasch model. I'm hoping to talk the presenters into summarizing these approaches and their uses in future blog posts. In the meantime, I will describe below a fourth approach I developed in order to look at inter-rater reliability of (usually) rubric-based assessments.

I like non-parametric approaches because you can investigate the structure of a data set before you assume things about it. As a bonus you get pretty pictures to look at.

Even without inter-rater data, the frequency chart of how ratings are assigned can tell us interesting things. Ideally, all the ratings are used about equally often. (This is ideal for the descriptive power of the rubric, not necessarily for your learning outcomes goals!) This is for the same reasons that we want to create tests that give us a wide range of results instead of a pronounced ceiling effect, for example. 

If we do have inter-rater data, then we can calculate the frequency of exact matches or near-misses and compare those rates to what we would expect to see if we just sampled ratings randomly from the distribution (or alternatively from a uniform distribution). When shown visually, this can tell us a lot about how the rubric is performing.
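As a concrete sketch of the 'random assignment' baseline, here is a small Perl fragment that takes a rating distribution (the frequencies here are made up) and computes the expected exact-match rate, the uniform-distribution alternative, and the expected distribution of rater differences under independent random rating.

#!/usr/bin/perl
# Sketch: expected agreement if two raters drew scores independently
# from the observed rating distribution. These frequencies are made up.
use strict;
use warnings;

my %freq = ( 1 => 0.02, 2 => 0.28, 3 => 0.45, 4 => 0.25 );

# Chance of an exact match under random assignment: sum of squared frequencies.
my $random_match = 0;
$random_match += $_**2 for values %freq;
printf "expected exact-match rate from random assignment: %.2f\n", $random_match;

my $k = scalar keys %freq;
printf "uniform baseline (1/k): %.2f\n", 1 / $k;

# Expected distribution of (rater A - rater B) differences under independence.
my %diff;
for my $a (keys %freq) {
    for my $b (keys %freq) {
        $diff{ $a - $b } += $freq{$a} * $freq{$b};
    }
}
printf "difference %+d: %.3f\n", $_, $diff{$_} for sort { $a <=> $b } keys %diff;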

I'll give some examples, but first a legend to help you decipher the graphs.


The graph on the left shows how the raters distributed scores. Although the scale goes from one to four, one was hardly used at all, which is not great: it means we effectively have a three-point scale instead of a four-point scale. So one way to improve the rubric is to try to get some more discrimination between lower and middle values. For the three ratings that are used, the match frequencies (yellow line on the left graph) are higher than what we'd expect from random assignment (orange dots). The difference is shown at the top of each graph bar for convenience. Altogether, the raters agreed about 52% of the time versus about 35% from random assignment. That's good--it means the ratings probably aren't just random numbers.

The graph on the right shows the distribution of differences between pairs of raters so that you can see how frequent 'near misses' are. Perfect inter-rater reliability would lead to a single spike at zero. Differences of one are left and right of center, and so on. The green bars are actual frequencies, and the orange line is the distribution we would see if the choices were independently random, drawn from the frequencies in the left graph. So you can also see in this graph that exact matches (at zero on the graph) are much more frequent than one would expect from, say, rolling dice.

Another example

Here we have a reasonably good distribution of scores, although they clump in the middle. The real problem is that raters aren't agreeing in the middle: the match frequencies are actually slightly below random. The ends of the scale are in good shape, probably because it's easier to agree about extremes than to distinguish cases in the middle. When I interviewed the program director, he had already realized that the rubric was a problem, and the program was fixing it.

The same report also shows the distribution of scores for each student and each rater. The latter is helpful in finding raters who are biased relative to others. The graphs below show Rater 9's rating differences on three different SLOs. They seem weighted toward -1, meaning that he or she rates a point lower than other raters much of the time.



When I get a chance, I will turn the script that creates this report into a web interface so you can try it yourself.

How Much Redundancy is Needed?

We can spend a lot of time doing multiple ratings--how much is enough? For the kind of analysis done above, it's easy to calculate the gain in adding redundancy. With n subjects who are each rated k times by different raters, there are C(k,2) pairs to check for matches.

C(2,2) = 1 match per subject

C(3,2) = 3 matches per subject

C(4,2) = 6 matches per subject

C(5,2) = 10 matches per subject

Using one redundancy (two raters per subject) is okay, but you get disproportionately more power from adding one more rating (three times as many pairs to check for matches). So as a rule of thumb: three is great, four is super, and more than that is probably wasting resources.

Connection to Elections

Since it's November of an election year, I'll add an observation I made while writing the code. In thinking about how voters make up their minds, we might hypothesize that they are influenced by others who express their views. This could happen in many ways, but let me consider just two.

A voter may tend to take an opinion more seriously based on the frequency of its occurrence in general.

But perhaps raw frequencies are not as important as agreements. That is, hearing two people agree about an opinion is perhaps more convincing than the raw frequency with which the opinion occurs, and that distribution is different--it's the same one used to calculate the 'random' inter-rater dots and graphs above. To calculate it, we square the frequencies and renormalize. Small frequencies become smaller and large ones larger. The graphs below illustrate an example with four opinions.


Opinion four is the largest, but still a sub-majority of all opinions. However, if you got everyone around the campfire and counted up the agreements between them, opinion four becomes much more visible--the number of these 'handshakes' is now greater than 50% of the total. If this influences other opinions, it would seem to lead to a feedback mechanism that accelerates adoption of opinion four.
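Here is the square-and-renormalize step as a small Perl sketch; the four opinion frequencies are made up to mirror the figure, with the largest opinion just under a majority.

#!/usr/bin/perl
# Sketch: share of pairwise agreements ("handshakes") held by each opinion,
# obtained by squaring the frequencies and renormalizing. Made-up numbers.
use strict;
use warnings;
use List::Util qw(sum);

my @freq = (0.10, 0.15, 0.30, 0.45);    # opinion four is largest but under 50%

my @squared   = map { $_**2 } @freq;
my $total     = sum @squared;
my @handshake = map { $_ / $total } @squared;

for my $i (0 .. $#freq) {
    printf "opinion %d: frequency %.2f -> share of agreements %.2f\n",
        $i + 1, $freq[$i], $handshake[$i];
}
# With these numbers, opinion four goes from 45% of the opinions
# to about 62% of the agreements.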

I looked for research on this, and found some papers on how juries function, but nothing directly related to this subject (although the results on how juries work are frightening). It would be interesting to know the relative importance of frequency versus 'handshake' agreement. 

Wednesday, June 27, 2012

Retrieving Named Arrays in Statistics::R

This is mostly a note to myself, but anyone else who uses Statistics::R from cpan to link Perl scripts to the statistics engine R can benefit. I find myself needing to retrieve named arrays from R for use in programs, and there's no easy way to do that with the built-in functions. So here's the code:

sub get_hash { # custom function that returns a vector in 
               # a hash indexed by variable names 
   my ($self, $varname) = @_;
   my $values_str = $self->run(qq{cat($varname)});
   my $keys_str = $self->run(qq{cat(names($varname))});
   my @values = split(/ /,$values_str);
   my @keys = split(/ /,$keys_str);
   my %hash;
   my $v;
   my $k;
   while(@keys) {
    $k = pop(@keys);
    $v = pop(@values);
    $hash{$k} = $v;
   }
   return \%hash;
}

Here's an example, using a matrix:
#take the number of rows in the matrix minus the column sums of blanks
$R->send(qq{n2= nrow(cols2.mat) - colSums(is.na(cols2.mat))});

#now we have a vector (with variable names) that has the number of non-blanks

#retrieve it from R into perl 
$n2 = $R->get_hash('n2');
 
In the example, the return value is a reference to a hash. You get the values out with something like:
$val_for_variable = ${$n2}{$name_of_variable};

Tuesday, June 12, 2012

Ed Tech Blogs

This is just a pointer to a nice list of K-12 Education Technology blogs, courtesy of EdTech. [link]


Browsing through the links, I found out about The Leap. Watch the video:



And also an idea apparently from Apple called Challenge Based Learning [pdf]. Excerpt:


Challenge Based Learning provides:
  • A flexible framework with multiple entry points
  • A scalable model with no proprietary systems or subscriptions
  • A focus on global challenges with local solutions
  • An authentic connection between academic disciplines and real world experience
  • A framework and workflow to develop 21st century skills
  • The purposeful use of technology for researching, analyzing, organizing, collaborating, communicating, publishing and reflecting.
  • The opportunity for learners to do something important now, rather than waiting until they are finished with their schooling
  • The documentation and assessment of the learning experience from challenge to solution
  • An environment for deep reflection on teaching and learning
  • A process that places students in charge of their learning
This sounds a lot like where I have been headed. The only thing missing is the connection to a professional portfolio.

Saturday, June 09, 2012

Outsourcing Prediction Models

Higher education can benefit from many sorts of predictors. One of the most common is modeling attrition: trying to figure out how to identify students who are going to leave before they actually do. Because you might be able to do something about it if you understand the problem, right?

Over the years, I've been underwhelmed by consultants who cater to higher education, and it seems to me that we get second-rate services.  I have seen only a few consultants that are worth the money, and I'll just mention one: SEM Works out of Greenville, NC. Jim Black seems to know everything there is about admissions, and I've seen direct benefits from his expertise.

There are various attractions to going outside for expertise. Maybe you want to make cuts, and need a hatchetman. Or maybe the decision makers are convinced that a "secret ingredient" is required, and the consulting firm has enough mystique to sell it. I don't believe much in secret ingredients. 

   

What's particularly galling is that the advanced methods of building predictive models come from universities. Then they get used at Amazon.com and Netflix and what have you, while our administrations still chug along on 50 year old technology. So we go out and buy consultants. Who also use 50 year old technology.

You may not be expecting this twist, but I actually came across a solution to this problem. That's the whole reason I'm writing this post. I'll let the free market do the talking now, courtesy of kaggle.com.


So instead of contracting up front for who-knows-what sort of analysis, you can actually bid out the work and get a solution before you pay! Of course, you have to think hard about what data to submit, and spend some time cleaning it and prepping it for such a submission. But judging from the rates on the site, it would be a lot cheaper than hiring a big-name consultant, with an almost guaranteed better result. I'll leave you with this quote from their About page:
The motivation behind Kaggle is simple: most organizations don't have access to the advanced machine learning and statistical techniques that would allow them to extract maximum value from their data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Kaggle corrects this mismatch by offering companies a cost-effective way to harness the 'cognitive surplus' of the world's best data scientists.
Update: Also see "Not so Expert" in The Economist.

Grade Transition Maps

There was a question on the ASSESS email list about evaluating grade distributions for courses. It's an interesting topic, so I dug out a data set and tried to construct some non-parametric descriptions of it. Looking at two years of fall and spring semester college grades, I first wrote a script to map transitions from grades earned in the same or earlier semesters to grades earned later. You can download the code here (it's a Perl script, but I had to save it as txt because of some weird Apache thing on the server).

The table below shows the fraction of grade transitions from the grade on the top to the grade down the side. So of the students who got a C one semester (the C column), nine percent got a D in the same or a subsequent semester. The columns sum to 100% because only A-F grades are considered here. I could have included Ws, but I didn't.
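The tabulation itself is straightforward. Here is a minimal Perl sketch (not the downloadable script) that assumes student,term,grade rows on standard input and column-normalizes the counts, with W and other non-letter grades excluded as in the table.

#!/usr/bin/perl
# Sketch: tabulate grade-to-grade transitions, column-normalized so that
# each "from" grade's column sums to 100%. Assumes student,term,grade rows
# on standard input; only A-F grades are counted, as in the post.
use strict;
use warnings;

my @grades   = qw(A B C D F);
my %is_grade = map { $_ => 1 } @grades;

my %history;    # $history{$student} = [ [term, grade], ... ]
while (my $line = <STDIN>) {
    chomp $line;
    my ($student, $term, $grade) = split /,/, $line;
    next unless $is_grade{$grade};
    push @{ $history{$student} }, [ $term, $grade ];
}

my %count;      # $count{$from_grade}{$to_grade}
for my $student (keys %history) {
    my @records = @{ $history{$student} };
    for my $earlier (@records) {
        for my $later (@records) {
            next if $later == $earlier;                  # skip the same record
            next unless $later->[0] >= $earlier->[0];    # same or later term
            $count{ $earlier->[1] }{ $later->[1] }++;
        }
    }
}

my %col_total;
for my $from (@grades) {
    $col_total{$from} += $count{$from}{$_} // 0 for @grades;
}

printf "%-3s", '';
printf " %5s", $_ for @grades;
print "\n";
for my $to (@grades) {
    printf "%-3s", $to;
    for my $from (@grades) {
        my $frac = $col_total{$from} ? ( $count{$from}{$to} // 0 ) / $col_total{$from} : 0;
        printf " %5.1f", 100 * $frac;
    }
    print "\n";
}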


Note that in this data set, there's no such thing as a "D student." Those who get Ds don't tend to repeat the experience nearly as often as the other grades. Instead they tend to get As and Bs more often than Ds and Fs.

The first table was for all grades and all courses. The next one is just for computer science courses, and rather than showing the actual percentages as I did above, I took the difference with the first table. So positive numbers mean that the transition is higher in computer science (colored green). Red and negative means a lower rate. You can see that B students are less likely to get As in computer science, for example.


The next three tables show the same thing for the disciplines shown.




Psychology gives a LOT of Ds--almost double the 'background' rate for A, B, and C prior grades. These are likely displaced Cs, because they give fewer of those. 

There's a nifty math trick we can play to do an average comparison. Imagine that a student took four classes in a row in the given discipline. The expected matrix of grade transitions would be the one I generated from the data, but raised to the fourth power. I ended up just doing this in Excel, but along the way I found a cool site for doing matrix math here. It will calculate eigenvalues and whatnot for you.
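If you would rather script it than use Excel, here is a minimal Perl sketch of the matrix power; the transition values below are placeholders, not the real data.

#!/usr/bin/perl
# Sketch: raise a column-stochastic grade-transition matrix to the fourth
# power. The matrix values are placeholders, not the real data; columns are
# "from" grades (A-F), rows are "to" grades, and each column sums to 1.
use strict;
use warnings;

my $P = [
    [0.45, 0.30, 0.15, 0.20, 0.15],    # to A
    [0.30, 0.35, 0.30, 0.30, 0.20],    # to B
    [0.15, 0.20, 0.30, 0.25, 0.25],    # to C
    [0.05, 0.09, 0.16, 0.15, 0.20],    # to D
    [0.05, 0.06, 0.09, 0.10, 0.20],    # to F
];

sub mat_mult {
    my ($A, $B) = @_;
    my $n = scalar @$A;
    my $C = [ map { [ (0) x $n ] } 1 .. $n ];
    for my $i (0 .. $n - 1) {
        for my $j (0 .. $n - 1) {
            $C->[$i][$j] += $A->[$i][$_] * $B->[$_][$j] for 0 .. $n - 1;
        }
    }
    return $C;
}

# P^4 models four courses in a row; as the power grows, the columns converge
# to the same vector -- the discipline's "asymptotic" grade distribution.
my $P2 = mat_mult($P, $P);
my $P4 = mat_mult($P2, $P2);

my @grades = qw(A B C D F);
for my $i (0 .. $#grades) {
    printf "to %s: %s\n", $grades[$i],
        join "  ", map { sprintf "%.3f", $P4->[$i][$_] } 0 .. $#grades;
}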

What happens is that the grade distributions converge and you can see the 'asymptotic' effect of the discipline grading scheme emerge, which is like a geometric average. This causes the matrix columns to converge to the same vector. I've taken those and differenced them against the overall transition matrix, and then put these "averages" side by side below in order to compare disciplines.


All of these disciplines push students away from the higher grades to the lower ones, with computer science accumulating Cs, and English and psychology accumulating Ds and Fs. 

Because you don't have the overall matrix to compare, here's a relative percent change over the 'background' rate of transition. The numbers below are no longer straight percentages of grade frequency (as in the table above). They are the percent change over the overall average rate for each cell. The most dramatic one is the psychology course's 68% increase in the production of D's over the 'background' transition rate. 


So if these subjects are all lowering grades, which ones are raising them? It wasn't hard to find one, but I won't name it. Someone might object. The table below shows the transitions on the right and the asymptotic behavior on the left. Everything is compared to the overall background rate.


You can see this discipline pushes grades from the bottom to the top, transitioning 4% more often to Bs over time, which is a 14% increase over the background rate. It's similarly disproportionate with the rate of Cs and Ds, giving 16% and 18% fewer than the background rate.

One could refine this type of analysis to look at majors courses only, or even individual instructors, although large sample sizes are obviously needed.

Beautiful Data Abstraction

I came across "Up and Down the Ladder of Abstraction" today on the Dataisbeautiful subReddit. I'll include one graphic to entice you to read it, but it speaks for itself. It's brilliant.


This same process could be applied to analyzing enrollment management, for example. Wheels are spinning...

Friday, June 08, 2012

Bad Reliability, Part Two

In the last article, I showed a numerical example of how to increase the accuracy of a test by splitting it in half and judging the sub-scores in combination. I'm sure there's a general theorem that can be derived from that, but I haven't looked for it yet. I find it strange that in my whole career in education, I've never heard of anyone doing this in practice.

I first came across the idea that there is a tension between validity and reliability in Assessment Essentials by Palomba and Banta, page 89:

An […] issue related to the reliability of performance-based assessment deals with the trade-off between reliability and validity. As the performance task increases in complexity and authenticity, which serves to increase validity, the lack of standardization serves to decrease reliability.

So the idea is that reliability is generally good up to the point where it interferes with validity. To analyze that more closely, we have to ask what we mean by validity.

Validity is often misunderstood to be an intrinsic property of a test or other assessment method. But 'validity' just means 'truth', and it really refers to statements made about the results from using the instrument in question. We might want to say "the test results show that students are proficient writers," for example. The problem with this is that we probably don't have a way (independent of the test) to see if the statement is true or not, so we can't actually check the validity. There are all sorts of hacks to try to get around this, like comparisons and statistical deconstructions, but I don't find any of these particularly convincing. See the Wikipedia page for more. Here's a nice image from that site that shows the usual conception of validity and reliability.

In the picture, validity is the same as accuracy, for example when shooting a rifle. Reliability would be precision.

This is an over-simplified picture of the situation that usually applies to educational testing, though. The conclusion of the previous article was that the situation depicted at top right is better (at least in some circumstances) than the one at bottom left, because we can discover useful information by repeated testing. There's no point in repeating a perfectly reliable test.

But there's another kind of 'bad reliability'.

Questions like "Can Tatiana write effectively?" can only be seriously considered if we establish what it would mean for that to be actually true. And because in education, we're supposed to be preparing students for life outside the ivy-covered walls, the answer has to be a real-world answer. It can't simply be correlations and factor analysis results from purely internal data.

The tension between trying to control for variation in order to understand 'ability' and the applicability of the results to situations where such controls are not present is nicely illuminated on the wiki page in this statement:
To get an experimental design you have to control for all interfering variables. That's why you often conduct your experiment in a laboratory setting. While gaining internal validity (excluding interfering variables by keeping them constant) you lose ecological or external validity because you establish an artificial lab setting. On the other hand with observational research you can't control for interfering variables (low internal validity) but you can measure in the natural (ecological) environment, at the place where behavior normally occurs. However, in doing so, you sacrifice internal validity.
Educational testing doesn't seem to concern itself overly much with external (ecological) validity, which is ironic given that the whole purpose of education is external performance. There are some really nice studies like the Berkeley Law study "Identification, Development, and Validation of Predictors for Successful Lawyering" by Shultz and Zedeck (2008), which found little external validity for either grades or standardized test results. It amazes me that our whole K-12 system has been turned into a standardized test apparatus without this sort of external validity check.

All tests are wrong all the time. It's only a matter of degree. To me, the only sensible sorts of questions that can be checked for validity are ones like this: "The probability of X happening is bounded by Y." This involves a minimum amount of theoretical construction in order to talk about probabilities, but avoids the reification fallacy of statements like "Tatiana has good writing ability."

Now the latter sort of statement can avoid a reification fallacy by converting it into a probabilistic assertion: "When others read Tatiana's writing, X proportion will rate it as good." This is now a statement about external reality that can be checked, and with a few assumptions, the results extrapolated into a probabilistic statement, which can itself be validated over time with more checking.

Imagine a perfectly reliable test. Given the same inputs it always produces the same outputs, meaning that all variation of any kind has been removed. Is this a good thing?

If we're measuring time or distance or energy or velocity, then perfect reliability is a great thing, and scientists spend a lot of energy perfecting their instruments to this end. But those are singular physical dimensions that correspond to external reality amazingly (some say unreasonably) well. The point is that, until you get to the quantum level where things get strange, physical quantities can be considered to have no intrinsic variation. Another way to say it is that all variation has to be accounted for in the model. If energy input doesn't equal energy output, then the difference had to go somewhere as heat or something.

It took a long time and a lot of work to come upon exactly the right way to model these dimensions and measure them. You can't just start with any old measurement and imagine that it can be done with perfect reliability:
Just because you provide the same input conditions gives you no right to expect the same outputs. These sorts of relationships in the real world are privileged and hard to find.
Unlike electrons, humans are intrinsically variable. So in order to squeeze out all the variability in human performance, we have to imagine idealizing them, and that automatically sacrifices reality for a convenient theory (like economists assuming that humans are perfectly rational).

When my daughter went off to take her (standardized) history test Monday, here are some of the factors that probably influenced her score: the review pages she looked at last, how much time she spent on each of them, the ones she chose to have me go over with her, how much sleep she got, the quality and quantity of the eggs I scrambled for her breakfast, the social issues and other stresses on her mind (like moving to a new city over the summer), excitement about her study-abroad trip next week, what stuck in her mind from classroom instruction, the quality of the assignments, her time invested in completing said assignments, and the unknowable effects of combinations of all these factors in her short and long-term memory. Add to this the total experience of having lived for almost fifteen years, the books she's read and culture she's been exposed to, the choice of vocabulary we use, etc., and we get a really complex picture, not much like a black box at all.

All of these factors are real, and affect performance on tests and whatever it is that tests are supposed to predict. One would think that educators ought not to just be interested in snapshots of idealized 'ability' but also the variability that comes with it.

Squeezing Reliability

Given that variability is intrinsic to most everything we look at, we can't just say that variability represents a threat to validity. Casinos make their profits based on controlled variation in games that give perfectly assessed unequivocal measurements. No one is going to claim that a dice roll is invalid because of the variability involved.

There is a real zeal in educational assessment to squeeze out all forms of variability in results, however, and the fact that variability is an important part of performance easily gets lost. I wrote about this in "The Philosophical Importance of Beans" a couple of years ago, after a disagreement on the ASSESS email list about whether or not there was a uniquely true assessment of the tastiness of green beans. The conversation there was similar to ones that go on about the use of rubrics, where the idea is that all raters are ideally supposed to agree on their ratings of commonly assessed work. I recommend the book Making the Grades: My Misadventures in the Standardized Testing Industry by Todd Farley for a detailed look at what it means to force such agreement.

What sort of rubric and rating system would be required to ensure perfectly reliable ratings on such things as "how good a movie is" or "whether P is a good public servant" or "the aesthetic value of this piece of art"?

All of these examples have intrinsic variability that cannot be removed without cost to validity. It's just a fact that people disagree about art and culture and performance of practically every kind. And this includes thinking and communications skills, which are commonly assessed in higher education.

We shouldn't assume that a test can be made more reliable without sacrificing validity. The phenomena where one can get nearly perfect reliability and retain all meaning in some sort of predictive model (like physics) are a very special sort of knowledge that was hard won. It's a property of the universe that we don't get to dictate.

Rather than trying to find these special relationships (which frankly may not exist in learning assessment), researchers seem to take it for granted that we are entitled to simply assume them. In this happy view of the world, variability is just a nuisance that impedes the Platonic perfection of 'true ability.'

There are practical implications to the idea of "valid variability," as the previous post demonstrated. More on that next time.

Wednesday, June 06, 2012

Bad Reliability

You usually hear that validity implies reliability, but sometimes reliability is bad.  Here's why.

Suppose a medical test exists for a certain disease, but it will only return a true positive result if there is a sufficient amount of antibody in the blood. Because the amount of the antibody changes over time, the test may return false negatives, saying that the subject does not have the disease when in fact they do. So the test does not reliably give the same answer every time. But this is actually an advantage because there is usefulness to repeating it. It would be bad if it gave you the same wrong answer reliably.

We can create a similar situation in educational testing. Suppose we want to know if students have mastered foreign language vocabulary, and have a list of 5,000 common words we want them to know. Suppose we give a test on 100 items. It's possible (but not likely) that a student could know 98% of the material and get a zero on the test.

If a student knows a fraction p of the 5000 word list, and the test items are randomly chosen, then we can compute the probability of overlap. Each item independently has probability p of being a word the student knows, and assuming there are no other kinds of errors, we can find a confidence interval around p using p̂, the fraction of items the student gets correct on the test:
p̂ ± z √( p̂(1 − p̂) / n )   (the normal approximation for a binomial proportion, courtesy of Wikipedia)
For a 95% confidence interval and n = 100, we have the error bounds shown below.
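As a rough check on those bounds, here is a small Perl sketch of the normal-approximation half-width at n = 100; the grid of p values is arbitrary.

#!/usr/bin/perl
# Sketch: 95% confidence half-width for a proportion on an n-item test,
# using the normal approximation  p-hat +/- z * sqrt( p-hat(1-p-hat)/n ).
use strict;
use warnings;

my $n = 100;
my $z = 1.96;    # two-sided 95%

for my $p_hat (0.5, 0.6, 0.7, 0.8, 0.9) {
    my $half_width = $z * sqrt( $p_hat * (1 - $p_hat) / $n );
    printf "p-hat = %.1f  =>  +/- %.3f\n", $p_hat, $half_width;
}
# At p-hat = 0.5 the half-width is about 0.098, i.e. roughly ten points on a
# 100-item test, which is where the "wrong by more than 10%" figure comes from.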


If a student knows half the list, then 5% of the time their score will be wrong by more than 10%. If we are testing a lot of students, this is not great news. We can increase the size of the test, of course. But what if we just gave a retest to any student who fails with 60% or less?

If it's the same test, with perfect reliability, the score shouldn't change. So that doesn't help. What if instead we gave the students a different test of randomly selected items? This reduces test reliability because it's, well, a different test. But it increases the overall chance of eliminating false negatives, which is one measure of validity. So: less reliability, more validity.

Here's a cleverer idea. What if we thought of the single 100 item test as two 50 item tests? Assume that we want to pass only students who know 60% or more of the words, and we accept a pass from either half of the test. How does that affect the accuracy of the results? I had to make some additional assumptions, in particular that the tested population had a mean ability of p=.7, with s=.1, distributed normally.

The results can be displayed visually on a ROC curve, shown below. If you want to download the spreadsheet and play with it, I put it here.

The single test method has a peak accuracy of 92% under these conditions, with an ideal cutoff score of 55%. This is a false positive rate of 34% and a true positive rate of 97%. The dual half test method has an accuracy of 93% with the cutoff score at 60%, which is more natural (since we want students to have mastered 60% of the material). This is a false positive rate of 18% and a true positive rate of 95%. The pass rate for the single test is 87% versus 83% for the dual test with the higher standard. The actual percentage who should pass under the stated conditions is 84%.
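If you would rather simulate than work with the spreadsheet, here is a Monte Carlo sketch under the stated assumptions; the cutoffs mirror the ones above, the trial count is arbitrary, and the results will only roughly track the spreadsheet's numbers.

#!/usr/bin/perl
# Monte Carlo sketch: one 100-item test with a 55% cutoff versus two 50-item
# halves with a 60% cutoff on either half. Ability p is drawn from N(0.7, 0.1),
# and a student "should" pass if p >= 0.6, as in the stated assumptions.
use strict;
use warnings;

my $trials = 100_000;
my ($single_correct, $dual_correct) = (0, 0);

sub rand_normal {    # Box-Muller transform
    my ($mean, $sd) = @_;
    my $u1 = rand() || 1e-12;
    my $u2 = rand();
    return $mean + $sd * sqrt( -2 * log($u1) ) * cos( 2 * 3.14159265358979 * $u2 );
}

sub known_items {    # how many of $n randomly chosen items the student knows
    my ($n, $p) = @_;
    my $hits = 0;
    for (1 .. $n) { $hits++ if rand() < $p }
    return $hits;
}

for (1 .. $trials) {
    my $p = rand_normal(0.7, 0.1);
    $p = 0 if $p < 0;
    $p = 1 if $p > 1;
    my $should_pass = ( $p >= 0.6 ) ? 1 : 0;

    my $half1 = known_items(50, $p);
    my $half2 = known_items(50, $p);

    my $single_pass = ( $half1 + $half2 >= 55 )        ? 1 : 0;   # 55% of 100 items
    my $dual_pass   = ( $half1 >= 30 || $half2 >= 30 ) ? 1 : 0;   # 60% of either half

    $single_correct++ if $single_pass == $should_pass;
    $dual_correct++   if $dual_pass   == $should_pass;
}

printf "single-test accuracy: %.3f\n", $single_correct / $trials;
printf "dual-test accuracy:   %.3f\n", $dual_correct / $trials;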

All in all, the dual-test method seems superior to the single test method. Given how many long tests students take, why aren't they generally scored this way? I'm certain someone has thought of this before, but for as long as I've been in education, I've never heard of split scoring.


Thursday, May 24, 2012

Making a Difference

You may have already heard about the NeverSeconds blog, or the 9-year-old who created it to rate her school's cafeteria food, and how, after just a few blog posts, the photos and descriptions got national and then worldwide attention. She got the food situation improved (with help from dad), and has created a way for other kids at other schools around the globe to become involved with school food. It's an amazing story. Here's one of her food reviews (source):
Today's meal was on the menu as Cheeseburger and ice cream/biscuit but as you can see I got an ice lolly. I prefer ice cream. I wish they had stuck to the menu. I did get 2 croquettes though only 3 pieces of cucumber when I said no thanks to the peas.  
Food-o-meter-7/10
Mouthfuls- eating and counting and chatting to friends is hard!
Courses- main/dessert
Health Rating- 2/10
Price- £2
Pieces of hair- 0!
This story has made the rounds framed as a human-interest bit, which of course it is. But for me there's a larger story.

This couldn't have happened without the Internet and cheap consumer technology. Imagine just a couple of decades ago what it would have required for a 9-year-old to photograph her lunch every day and share her musings with hundreds of thousands of other people. And now we don't even notice the technological miracle--that's not the story at all. The story is not that this schoolgirl is a prodigy either. It's that she cared about something enough to do something about it, and it made a difference.

This winds back to my point about the medieval mindset that still permeates much of our educational systems: the "prepare and certify" model (see this post for more). In treating students like they are machines on an assembly line, we overlook the fact that they are already capable of doing very cool things, as this 9-year-old demonstrates. Yes, we have to prepare them by helping them learn about the world in ways they wouldn't do accidentally. But we can also lead students to an even more important realization: they can already begin to change the world with the preparation they already have.

The young nutritionist created her project because of intrinsic motivation, not because of a school assignment. Now it stands on its own for anyone to look at and assess as an accomplishment. This is very different from a classroom assignment with purely extrinsic motivation, which results ultimately in a letter grade assessment that gets averaged into others, losing most of the evidence that the event ever happened. As with most 'preparation' activities, it would be entirely transient--a momentary hurdle to be overcome on the way to a very distant graduation from university and ultimate 'certification' in the form of a diploma. I hope the contrast between these is sufficiently stark that you may wonder about how we might do things differently. At the very least, it ought to generate some doubt that it really requires more than twenty years to prepare a human being to be useful in society.

Thursday, May 17, 2012

A Social Media Metric Arrives

In "The End of Preparation" I wrote that:
The method of assessing a portfolio is deferred to the final observer. You may be interested in someone else's opinion or you may not be. It's simply there to inspect. Once this is established, third parties will undoubtedly create a business out of rating portfolios for suitability for your business if you're too busy to do it yourself.
It turns out that third parties aren't waiting on portfolios. If you read the I-ACT article, you may recall that participation in professional social networks like mathoverflow.net are part of what goes in my ideal portfolio. These professional networks are not developed yet for every discipline, but there is a company using lowest-common-denominator social networks to rate your overall social impact. The site is Klout.com. Here's what they say:
People have always had the power to influence others, and that power is being democratized with new social media tools. Klout's mission is to provide insights into everyone's influence. We measure your influence based on your ability to drive action in social networks. We process this data on a daily basis to give you an updated Klout Score each morning. Here are a few of the actions we use to measure influence:
I didn't want to give them my Twitter or Facebook password so they could calculate my score. In any event, it can't be very large on their log scale from 1-100:
The average Klout Score is actually 20, not 50. As your Score increases, it becomes exponentially harder to increase your Klout. That's why you see so many 20s and not as many 90s!
I learned about Klout from Reddit (see comments there), which points to a Wired article "What Your Klout Score Really Means" by Seth Stevenson. The article describes the experience of a candidate for a VP position at a marketing agency:
The interviewer pulled up the web page for Klout.com—a service that purports to measure users’ online influence on a scale from 1 to 100—and angled the monitor so that [the candidate] could see the humbling result for himself: His score was 34. “He cut the interview short pretty soon after that,” [the candidate] says. Later he learned that he’d been eliminated as a candidate specifically because his Klout score was too low. “They hired a guy whose score was 67.”
At present, using such a crude instrument probably only damages the hiring process, and is awfully shortsighted. But this is just the beginning. For mathematics, you can already browse mathoverflow.net to see the reputation on this social site devoted to research-level mathematics. The user with the highest assigned reputation is Joel David Hamkins. Take a look at his page there and see how rich it is with information about his professional life.

Means to What End?

The title for this article comes from a 4/22 Commentary in The Chronicle entitled "Stop Telling Students to Study for Exams" by David Jaffee. Here's a bit of it:
If there is one student attitude that most all faculty bemoan, it is instrumentalism. This is the view that you go to college to get a degree to get a job to make money to be happy. Similarly, you take this course to meet this requirement, and you do coursework and read the material to pass the course to graduate to get the degree. Everything is a means to an end. Nothing is an end in itself. There is no higher purpose.
I put in bold the headline quote so you couldn't miss it. Instrumentalism is the idea that predicting cause and effect is more important than "understanding reality," and I'm not sure it's exactly the right concept for this argument. But the argument is still valid, and summed up in this ubiquitous practice:
On the one hand, we tell students to value learning for learning's sake; on the other, we tell students they'd better know this or that, or they'd better take notes, or they'd better read the book, because it will be on the next exam; if they don't do these things, they will pay a price in academic failure. This communicates to students that the process of intellectual inquiry, academic exploration, and acquiring knowledge is a purely instrumental activity—designed to ensure success on the next assessment.
This is the "prepare and certify" model that I dissected in "The End of Preparation." In theory, the preparation (the cause) enables students to be functional in graduate school, employment, entrepreneurship, performance, public service, or some other worthy human endeavor after graduation. The reason I don't think our prepare/certify model is instrumentalism is because it's rare for anyone to check this connection between the preparation and its ultimate impact. That's mainly because it's so hard to do. Yes, we get studies about how much an undergraduate degree is worth in terms of life wages, but that doesn't say anything causal about the education itself (correlation is not causation).

The whole article is worth a read. Maybe I'm saying that because it reaches the same conclusions I have:
Authentic assessments involve giving students opportunities to demonstrate their abilities in a real-world context. Ideally, student performance is assessed not on the ability to memorize or recite terms and definitions but the ability to use the repertoire of disciplinary tools—be they theories, concepts, or principles—to analyze and solve a realistic problem that they might face as practitioners in the field.
I have gone further and tried to show how we can do that. See "I-ACT: An Alternative to Prepare-and-Certify."

If you have some time to read it, there's a provocative article on the philosophy of science that is related to instrumentalism (my assessment), and does have a connection to the subject matter, albeit from the perspective of natural selection: "The Interface Theory of Perception: Natural Selection Drives True Perception To Swift Extinction" by Donald D. Hoffman. Here's the abstract:
A goal of perception is to estimate true properties of the world. A goal of categorization is to classify its structure. Aeons of evolution have shaped our senses to this end. These three assumptions motivate much work on human perception. I here argue, on evolutionary grounds, that all three are false. Instead, our perceptions constitute a species-specific user interface that guides behavior in a niche. Just as the icons of a PC's interface hide the complexity of the computer, so our perceptions usefully hide the complexity of the world, and guide adaptive behavior. This interface theory of perception offers a framework, motivated by evolution, to guide research in object categorization. This framework informs a new class of evolutionary games, called interface games, in which pithy perceptions often drive true perceptions to extinction.
I have added emphasis to the point I think connects to the current context: the way we think of the world, how this forms cause-effect models and the language we construct to process it, collectively form an "interface" that guides behavior in a niche, as the author puts it. If the 'niche' is negotiation of short-term hurdles using short-term memory and becoming skillful at doing minimal work to earn grades, that's a very different thing from being productive in the grandest way humans are capable of: through art and rhetoric, leadership and service--the effects we are actually hoping for when betassled students walk over the stage.

Wednesday, May 09, 2012

Memory Games

A while back I wrote "Memory as SLO," about the possibility of developing short term memory as an intentional learning outcome. I came across an article that has a particular program for doing that. It seems like this would be a relatively easy bit of research: randomly select some first-year students and have them go through a program based on the methods described.

Read it for yourself at NYT in "Can You Make Yourself Smarter?" Here's a quote:
In a 2008 study, Susanne Jaeggi and Martin Buschkuehl, now of the University of Maryland, found that young adults who practiced [this method] also showed improvement in a fundamental cognitive ability known as “fluid” intelligence: the capacity to solve novel problems, to learn, to reason, to see connections and to get to the bottom of things. The implication was that playing the game literally makes people smarter.

Wednesday, April 11, 2012

Pattern Matching


A friend gave me a desk calendar with daily puzzles. Usually the numerical ones are easy, but last Friday the one above came up, and I couldn't immediately solve it. If you want to try to solve it yourself, you may not want to read any further until you're done with it.

I assumed the 'game' in the puzzle was to use the four numbers in the corners to mathematically derive the one in the middle, although there was the possibility that there was a non-numerical 'trick' answer that relied on the way numerals are spelled or something. I stuck with it for a while, trying simple calculations, but couldn't find a suitable pattern that worked for all three examples. Then I remembered reading about Eureqa, from Cornell Creative Machines Lab--a genetic pattern finder that should be able to chew this problem up in no time. It's also free.

I labeled the input values x1, x2, x3, and x4 for top left, top right, bottom left, and bottom right, respectively, and the center (output) number became y. It's easy to enter this in the program like a spreadsheet:


The y column was originally 4, 5, 3, which are the numbers in the puzzle. The 8, 9, 3 that appear in the image are explained below.

I skipped the "Prepare Data" tab because I didn't need to smooth data or fill in missing values, etc. Things started to get interesting with the "Prepare Data" tab. The program guessed my target expression correctly. And here I made a mistake. You get to decide what sorts of operators the solution is allowed to use. This is a meta-problem, where you have to think like the test designer. Is it likely that the solution uses the hyperbolic tangent function? Probably not. So I picked the operations one learns about in grade school: addition, subtraction, multiplication, and division. Part of the screen is shown below.


Notice that I left "Constant" and "Integer Constant" unchecked, and hence unavailable for the program to use in a solution. I reasoned that the puzzle designer would not have included an arbitrary number, as in y=(x1+x2-x3-x4+15). It seemed inelegant and made the problem space much bigger to include even integer constants. But this was a case of premature optimization, as we will see. I chose the "Minimize the absolute error" metric of success and started the search.

This was my first time using the interface, but it only took a moment to figure out what I was looking at. The "fitness" is the error function, which is optimized at zero, meaning no errors in predicting y. The formula that it found immediately was y=x4. If you look at the puzzle again, you'll notice that the lower right number is the same as the center one. This is the simplest pattern that matches, and the automatic searcher has a bias for low-complexity (i.e. small) formulas. I was sure the problem designer intended all four input numbers to be used, so this couldn't be it.

So I tried obfuscation to trick the solver into using all four inputs. I did this by creating a function z and asking the solver to find a formula f so that f(x1,x2,x3,x4) = y + z(x1,x2,x3,x4). The idea was that this would rule out the simple y = x4 solution and force the solver to look for answers that used all four inputs. Then I could subtract off the z part and have my solution. For example, one function I tried was z = 2x1 - x2 - x3*x4. The input screenshot above shows one of these attempts, with the y column transformed in this way.

This yielded results. After using two different z functions, I got two new solutions. Here they are:
f(x1,x2,x3,x4) = x2+x3-x3*x3
f(x1,x2,x3,x4) = -2x1+x2+2x3x4-x3*x3+x4
These both work, and you can check them by plugging in the puzzle inputs to see that the calculations equal the number in the middle. But the first one above doesn't use x1 or x4, and although the second one uses everything, it seems too complex for a casual puzzle.

At this point I looked at the "official" answer. Here it is:
f(x1,x2,x3,x4) = (10*x1+x2)/(10*x3+x4)
As you can see, there are integer constants in the solution, which I had ruled out when I set up the problem. Although the formula looks opaque as I have written it, for a human it's natural to read two digits across as a single number, so that the 9 and 6 on the top of the first example become 96, which is formalized as 9*10 + 6. The bottom is 24. I had tried that trick myself, but not noticed that 96/24 = 4.
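As a sanity check, the official formula does reproduce the first example; a tiny Perl verification, using the corners 9, 6, 2, 4 in the x1..x4 labeling above (the other two examples' corners aren't reproduced here):

#!/usr/bin/perl
# Check the official solution against the first example (corners 9, 6, 2, 4
# in the x1..x4 labeling used above).
use strict;
use warnings;

sub official {
    my ($x1, $x2, $x3, $x4) = @_;
    return ( 10 * $x1 + $x2 ) / ( 10 * $x3 + $x4 );    # reads the digits as 96/24
}

print official(9, 6, 2, 4), "\n";    # prints 4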

I was naturally annoyed at myself for having ruled out the possibility of finding the official solution by unchecking "Integer Constant," so I tried again with constants enabled. This time I got
f(x1,x2,x3,x4) = 24-2*x1-x2-x4-x3*x4
This was after 6931 generations. It still isn't what I'm looking for. My transformation was making sure all the inputs were being used, but it was a crude hack that ended up making the solution more complex than it needed to be.

What I really needed was a way to modify either the target expression or the error metric so that an optimal solution has exactly one occurrence of each of x1 through x4. I couldn't figure out how to do that with the interface, and it may not be possible with this version of the software. So I tried another approach.

The puzzle is flawed in that it allows the y = x4 solution, so I created some more examples for the solver to chew on: 22/11 = 2 and 26/12 = 3. With the original puzzle values plus these two extras, the solver began to converge to solutions with smaller and smaller errors:


Unfortunately, integer-based problems like this just aren't amenable to gradient-based approaches. Unlike horseshoes and hand grenades, close isn't good enough. After seven minutes, the formulas were all variations on high powers of the inputs, overfitting the limited data to find a solution.

I tried 'seeding' the search with y=x1/x3. This is the same as y = (10*x1)/(10*x3), which is pretty close to the solution. After more than 20,000 generations that resulted in a perfect solution, but not the one I wanted:
y = x4 + -1/x4 + (x2 + x3 - x3*x3)/(x4*x4)
It did recognize the 'correct' solution when I fed that directly in. This is pictured below (third one down).


The 'fit' is the error, which is zero for the right answer. It immediately lost the integer constants, however, and went off looking for lower-complexity solutions, which do exist (see the ones listed earlier).

To be fair, Eureqa isn't really designed for solving this sort of problem. Finding patterns with real world data usually means accounting for noise and missing values, and looking for simple approximate relationships that might tell you something about underlying relationships. In this case, y = x4 is still a pretty good approximation for the examples I put in. It occurred to me that if this were real data, the solution I'd be looking for is y = x1/x3, which would normally be a reasonable approximation to the perfect solution. When I added a few more rows of example data, the solver found this immediately. So it works as intended, which is not necessarily for the purpose of solving arbitrary puzzles humans design for each other.

Update: There are a few comments about the above on /r/machinelearning, which you can find here.

Wednesday, March 28, 2012

Education News Dot Org

I discovered Education News this week. It seems like a comprehensive, smart source for ed news, and has a clean design to boot. Check it out at EducationNews.org. For bloggers, one of the nice things is that they allow linked backtracks in comments. This isn't true at all sites, and I suppose there is a downside to it from the publisher's perspective (it invites spam links, which have to be weeded out).

Note that the site isn't new. In fact it's as old as my daughter, which means that the editor should keep a shotgun by the door and hide the car keys.

Here's a direct link to the section on higher education.

Wednesday, February 29, 2012

Test Fail

I came across two articles this morning with "test" and "fail" in the title. They're both worth a look.

"Failed tests" at The University of Chicago Magazine. Quote:
Neal, a professor in economics and the Committee on Education, insists it’s a “logical impossibility” that standardized tests, as they’re most often administered, could assess both teachers and students without compromising teacher integrity, student learning, or both. “The idea is that we want faculty held accountable for what students learn, so the tool that we use to measure what students learn is the tool that we should use to hold faculty accountable,” Neal says. “It’s all rhetorically very pleasing, but it has nothing to do with the economics of how you design incentive systems.”
Next is "Standardized Tests That Fail" in Inside HigherEd this morning. Quote:
“We find that placement tests do not yield strong predictions of how students will perform in college,” the researchers wrote. “In contrast, high school GPAs are useful for predicting many aspects of students’ college performance.”  
You may want to contrast that with another article by a test score true believer (my term). This one has stuck in my craw for a long time because it's so bad, but I'll say no more about it. Anyone with basic critical thinking skills can figure out what's wrong with it. "'Academically Adrift': The News Gets Worse and Worse" in The Chronicle. I can't disagree with the conclusion of the article, however, so I'll quote that:
For those who are dissatisfied with the methods or findings of Academically Adrift, who chafe at the way it has been absorbed by the politicians and commentariat, there is only one recourse: Get started on research of your own. Higher education needs a much broader examination of how and whether it succeeds in educating students. Some of that research will doubtless become fodder for reckless criticism. But there's no turning back now.
[Update 3/1/2012] Here's one more: "The True Story of Pascale Mauclair" in EdWize. It's a horror story about the abuses of publishing standardized test results that are used to rate public school teachers. The bold in the quote is mine.
On Friday evening, New York Post reporters appeared at the door of the father of Pascale Mauclair, a sixth grade teacher at P.S. 11, the Kathryn Phelan School, which is located in the Woodside section of Queens. They told Mauclair’s father that his daughter was one of the worst teachers in New York City, based solely on the [Teacher Data Results] reports, and that they were looking to interview her.

Monday, February 06, 2012

Links on Learning

"The State of Science Standards 2012" maps out an analysis of pre-college science instruction in the United States.
Quote:
A majority of the states’ standards remain mediocre to awful. In fact, the average grade across all states is—once again—a thoroughly undistinguished C.
There are individual state reports at the site.


Next up is a high school student's ambitious "The education system is broken, and here's how to fix it." One of his complaints relates to the way analytical loading is done (see my previous post):
So, this causes students to go about the "textbook" skipping everything but the formulas, and then memorizing those. Then, when the test comes along, those who had time to memorize their formulas do excellent, and those that had something going on get low grades.
Then, as soon as they're done with the test, they put in all of their efforts into memorizing the next set of formulas and have nothing left from the last set that they memorized except "I love *whatever the topic is*, I got a 96% on that test".
 In a similar vein is a post from mathalicious.com: "Khan Academy: It's Different This Time." The author is critical of the eponymous video-based instruction site and its methods, claiming that:
Khan Academy may be one of the most dangerous phenomenon in education today. Not because of the site itself, but because of what it — or more appropriately, our obsession with it — says about how we as a nation view education, and what we’ve come to expect.
I think the author assumes too much about the implications of this. Video-based instruction is just a tool, which can be used correctly or abused.

A thread that runs through these three pieces is that intrinsic motivation is important. The report in the first link mentions the excitement that science generated during the Space Race, and how that's lacking today. The high school student in the second article wants to know why, not just how. And the Khan Academy critic is alarmed at a potential "view and spew" pedagogy (my term, not the author's).

Learning terms, rules, methods, facts, connections, and so on can be pretty dry. This constitutes what I'm calling the analytical load required to do something more interesting. Learning to play chords on a guitar is slow and painful, but then you get to play songs, which is fun.

I see tight, focused on-demand instruction like the Khan Academy as an essential resource for learning and reinforcing an analytical load. This can be augmented by additional material that motivates learners. There are plenty of ways to do that. Anything that looks like a story is good (history of science, for example). Applications that involve creativity are the ultimate objective.

All of the sources linked above are rightly critical of the prepare-and-certify model of education, which in practice turns into drill-and-test, with almost entirely external motivation. Teachers face a battle in winning back student enthusiasm against this machine. There's nothing wrong with being concerned about grades, but if that's all there is to it, students face a rude awakening after graduation.

More on that theme in an article at Common Dreams: "Wild Dreams: Anonymous, Arne Duncan, and High-Stakes Testing."

Finally, a link from Education Week on the subject of testing students who want to be teachers: "Analysis Raises Questions About Rigor of Teacher Tests." This is the meta-problem.

Saturday, February 04, 2012

I-ACT: An Alternative to Prepare-and-Certify

In "The End of Preparation" I gave an alternative to the factory-like "prepare and certify" philosophy evident in the current practice of formal education. The purpose of the present article is to develop a modest trial program to test the portfolio approach.

In order to have something specific to talk about, I'm going to outline a course that could fit into many curricula, and could be scaled in sophistication to meet the level of the student. It might be most at home in a general education program, or as a topics course in the sciences. Here's the description:
Rong! Mistakes in Scientific Thought
This course explores important conceptual mistakes in the history of scientific understanding. Even the most brilliant thinkers made bad assumptions, over-simplified, and occasionally took the wrong path. Major breakthroughs are as often related to getting rid of errant beliefs as to finding better ones. Students will learn about some of these milestones, and perhaps develop some modesty about the certainty of their own beliefs.
On the first day of class, the instructor can explain that the neologism "rong" comes from Wolfgang Pauli's reputed remark that a line of reasoning was "not even wrong." Since rong is literally not even "wrong," it fits. I will use it in a noble sense, not as disparagement. An idea is rong for some fundamental reason that, when understood, advances knowledge. The belief that the sun goes around the Earth is not just wrong, it's rong. The rongness may be conceptual (as in the case of geocentrism) or methodological (as with astrology). Both are important, and somewhat humbling to read about. Our forebears weren't stupid after all. As P. L. Seidel wrote in 1847: "[M]ethodological discoverers are very badly treated. Before their method is accepted it is treated like a cranky theory; after it is treated as a trivial commonplace." Louis Pasteur comes to mind.

A Prepare-and-Certify Approach
The normal way to teach a course is to find a textbook and other source materials, set up a syllabus with major events like reading deadlines and test dates, outline a grading scheme, and list office hours. Students would write papers, get feedback, and ultimately receive a course grade. Then most of this effort would be forgotten and lost to posterity. In theory, the experience would have incrementally added to the "preparation" of the student for some eventuality that happens after graduation (the event horizon of education). In practice, no one would ever know if this is true because there are no objective measures for it. (See "An Index for Test Accuracy" for what would be involved.)

Now that I have set the straw man in place, we can proceed to whack the silage out of him. It's clear that the course description cries out for a seminar-type approach, and that courses of this sort already exist. What I'll do below is enlarge the conception of a traditional seminar course. I need a name for this mutation, so let's call it I-ACT, which stands in for Analyze-Create-Publish-Interact (because I-ACT means something, and ACPI sounds like an economic index).

The I-ACT Approach

The role of the instructor is to help students pick good projects, and guide them through the steps Analyze-Create-Publish-Interact, which are described below in turn. But first a schematic, to break up this wall of text with a busy, colorful graphic.



1. Analysis


In order to produce new knowledge (that is, new to the student), one has to start somewhere. Academic disciplines comprise all sorts of knowledge, but here we are interested in whatever raw material can be turned into something new. In history, it might be original sources and a philosophy of history. In chemistry, it might be a certain kind of molecule and knowledge of basic chemistry. In common games, like chess, the starting place is an understanding of the rules and pieces.

This bundle sometimes includes a list of deductive rules that check for correctness. These would include accepted spellings of words (at a basic level), rules of logic, physics formulas, or any other deterministic method of turning one thing into another, which can be done correctly or incorrectly. A haiku has a particular structure, and a blues song has a certain scale and beat. A knight in chess can only move in a certain way.
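To make the idea of a mechanical correctness check concrete, here is a toy version of the chess example (a quick Python sketch of my own, not part of any course material): a function that says whether a proposed knight move is legal.

def legal_knight_move(src, dst):
    # A knight move is legal exactly when it spans one square along one
    # axis and two along the other (board edges and occupancy ignored).
    (r1, c1), (r2, c2) = src, dst
    return {abs(r1 - r2), abs(c1 - c2)} == {1, 2}

print(legal_knight_move((0, 0), (2, 1)))  # True: a genuine knight move
print(legal_knight_move((0, 0), (1, 1)))  # False: a diagonal step

The point is only that the rule is deterministic: any proposed move is either right or wrong, and checking it requires no judgment.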

The instructor assigns a problem (or the student is tasked to find one) that has an acceptable analytical load. We shouldn't expect kindergartners to solve systems of linear differential equations. The student doesn't need to be a master of this analytical domain, but it must be within reach. Resources include anything on the Internet plus the instructor, peers, and appropriate social networks.
As an example, I will use a real assignment I used for an undergraduate research project in math. The source is Proofs and Refutations by Imre Lakatos. One chapter in the book describes how the great mathematician Augustin-Louis Cauchy was rong about something, and how the error got noticed and fixed. Cauchy provided a proof that when an infinite sum of continuous functions converges to a new function, that new function is continuous as well. In this case, the student's analytical load includes basic techniques from mathematical analysis (delta-epsilon proofs) and familiarity with infinite series and functions. She should be able to read Cauchy's proof (probably with some difficulty, and needing help) and understand the issue. She should be able to create examples of the ingredients for the proof, such as series of continuous functions, and be able to test them and their sum for continuity.
An outline of Cauchy's errant proof, from page 132 of Proofs and Refutations.
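For readers who don't have the book at hand, the claim in modern notation (my paraphrase, not Cauchy's or Lakatos's wording) is:

\[
  f_n \text{ continuous for all } n \ \text{ and } \ s(x) = \sum_{n=1}^{\infty} f_n(x) \ \text{ convergent for every } x
  \quad \Longrightarrow \quad s \text{ is continuous.}
\]

The hidden assumption is that pointwise convergence is enough; it isn't. The eventual repair, associated with Seidel (the same Seidel quoted above) among others, is to require uniform convergence.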

The analysis box is never really complete. In any discipline there's always more to be learned, and part of the I-ACT learning process is to go back to the well to seek clarification, examples, related concepts, and so forth. Learning this self-help process is an important objective.

2. Creativity

The analytical load need not be huge. Games generally have a small set of rules to make them accessible, and in fact you don't need a lot of rules to be creative. In the creative step, we help the student work with the analytical tools in a trial-and-error exploration. This is only possible if there are wrong answers. Put another way: if everything the student could possibly produce is acceptable, there's nothing to be learned from the exercise. A pilot should know a good landing from a bad landing. A doctor should know a live patient from a dead one. A musician should know a major chord from a minor seventh. And so on. The student is likely to make mistakes here, which is where the instructor, peers, and social network can help.

Even in aesthetic subjects, we don't have to accept total relativism (and hence lost learning opportunities). In a photography class, instead of trying to figure out the exact artistic merits of a photo, one can examine technique. Is the subject in focus or not? Does the rule of thirds apply or not? Are the whites white or not?
To continue the example, the student was asked to find some series of continuous functions that converge to some new function, and then see if that new function was continuous. This took some work, and really exercised what she knew about functions, limits, and continuity. This strengthened her analytical skills, and her confidence increased so that she started to feel like she knew what she was doing. At this point, she was ready to tackle the question of why Cauchy was rong. Once that is discovered, it becomes a question of how to fix it. 
From Proofs and Refutations.
There are plenty of other creative exercises here, such as conjecture and proof of properties of uniformly convergent series of functions. Why don't Fourier series work? This one example of rongness can be a point of departure for many analysis topics.
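To make the Fourier question concrete, here is the kind of numerical experiment a student might run (a sketch of my own, not an assignment from the original project). Every partial sum of the sawtooth series sin(x) + sin(2x)/2 + sin(3x)/3 + ... is a continuous function, yet the limit jumps at x = 0, which is exactly what Cauchy's claimed theorem forbids.

import math

def partial_sum(x, N):
    # N-th partial sum of the sawtooth series sum_{n=1}^{N} sin(n*x)/n.
    # Each partial sum is a continuous (in fact smooth) function of x.
    return sum(math.sin(n * x) / n for n in range(1, N + 1))

# The limit tends to +pi/2 as x approaches 0 from the right and to -pi/2
# from the left, while the sum at x = 0 itself is 0: a jump discontinuity.
for x in (0.1, 0.01, -0.01, -0.1):
    print(x, partial_sum(x, 100_000))
print("pi/2 =", math.pi / 2)

Uniform convergence fails precisely near the jump, which is where Cauchy's argument breaks down; plotting a few partial sums also shows the Gibbs overshoot there.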
The exercise of checking steps in an argument is purely analytical, but creating a solution to a problem or finding other connections is not. The creative step should produce new knowledge for the student. Internet resources (like wolframalpha.com in the case of my example) can be used to find new connections, examples, and explanations.

3. Publication

Once a student has made some progress, it's time to write it up, or otherwise prepare the material for public display in electronic form, whether on an intranet or the open web. Here, 'public' should at least mean that the instructor can see it, but that alone is no advance over traditional delivery. Student peers in the same class or program, other faculty members at the same or other institutions, social networks, and the whole world wide web are possible audiences.

Publishing interesting questions or intermediate results can be as useful as a finished piece of work.

When I supervised the student project on uniform convergence, social networks didn't exist like they do now. Today I could encourage a student to use reddit.com/r/math or stackoverflow.com to pose questions or try out ideas. This is not certain to succeed--these communities have to be engaged, not just spammed with drive-by questions.

4. Interaction

Interaction is a natural consequence of publishing. Over the summer I came across a delightful paper from Scott Aaronson at MIT entitled "Why Philosophers Should Care About Computational Complexity." I found it through a social network I frequent that scans for interesting stuff like this. After Scott posted the draft of his paper, he received a number of comments and suggestions. This feedback resulted in new drafts that clarified his thinking and fixed problems. Here's a quote from his blog:
Thanks to everyone who offered useful feedback! I uploaded a slightly-revised version, adding a “note of humility” to the introduction, correcting the footnote about Cramer’s Conjecture, incorporating Gil Kalai’s point that an efficient program to pass the Turing Test could exist but be computationally intractable to find, adding some more references, and starting the statement of Valiant’s sample-size theorem with the word “Consider…” instead of “Fix…”
Then there's the meta-commentary from philosophers about the paper on reddit, which adds a perspective and some new references.



Interaction this rich can illuminate everything about the work, including analysis and creativity. It can critique or endorse, dismiss or expand scope.

Desired Outcomes

How is this an improvement over traditional classes or seminars?  For me, there are several answers:

  • Learning to rely on a self-help network of resources.
  • Public display of one's work can lead to intrinsic motivation that is greater than the extrinsic "I need to get a C in this course."
  • The above point is magnified by emphasizing that this work forms part of a life-work portfolio that will be useful for a very long time. In addition to certifications (eventually instead of certifications), students have authentic work to display.
  • Engagement in social networks and general audiences that care about the topic is a good long-term investment. It leverages one's own abilities and adds credibility to one's published works.
  • Students' work competes on merit, not on who they are or what institution they're from, and it can help them assess their own skill and knowledge.
  • By separating analytical techniques from creativity, we can prepare students for both. Creativity takes self-confidence and practice. This can be nurtured in a controlled environment.
  • All the tools and methods used can be applied outside of a college setting. Developing a skill set, creating something with it, publishing it online professionally, and generating feedback for improvement is a practical, real-world ability.

Next Steps


I am looking for a handful of science departments to try some of these ideas out. The course description above is one of many that could be used. Once the details are in order, I'll seek some external funding for travel, some implementation costs (setting up a portfolio system maybe), and money to run a small conference. If you're interested, click on my profile or link to my vita and email me.