Monday, February 06, 2012

Links on Learning

"The State of Science Standards 2012" maps out an analysis of pre-college science instruction in the United States.
Quote:
A majority of the states’ standards remain mediocre to awful. In fact, the average grade across all states is—once again—a thoroughly undistinguished C.
There are individual state reports at the site.


Next up is a high school student's ambitious "The education system is broken, and here's how to fix it." One of his complaints relates to the way analytical loading is done (see my previous post):
So, this causes students to go about the "textbook" skipping everything but the formulas, and then memorizing those. Then, when the test comes along, those who had time to memorize their formulas do excellent, and those that had something going on get low grades.
Then, as soon as they're done with the test, they put in all of their efforts into memorizing the next set of formulas and have nothing left from the last set that they memorized except "I love *whatever the topic is*, I got a 96% on that test".
In a similar vein is a post from mathalicious.com: "Khan Academy: It's Different This Time." The author is critical of the eponymous video-based instruction site and its methods, claiming that:
Khan Academy may be one of the most dangerous phenomenon in education today. Not because of the site itself, but because of what it — or more appropriately, our obsession with it — says about how we as a nation view education, and what we’ve come to expect.
I think the author assumes too much about the implications of this. Video-based instruction is just a tool, which can be used correctly or abused.

A thread that runs through these three pieces is that intrinsic motivation is important. The report in the first link mentions the excitement that science generated during the Space Race, and how that's lacking today. The high school math student in the second article wants to know why, not just how. And the critique of the Khan Academy is alarmed at a potential "view and spew" pedagogy (my term, not the author's).

Learning terms, rules, methods, facts, connections, and so on can be pretty dry. This constitutes what I'm calling the analytical load required to do something more interesting. Learning to play chords on a guitar is slow and painful, but then you get to play songs, which is fun.

I see tight, focused on-demand instruction like the Khan Academy as an essential resource for learning and reinforcing an analytical load. This can be augmented by additional material that motivates learners. There are plenty of ways to do that. Anything that looks like a story is good (history of science, for example). Applications that involve creativity are the ultimate objective.

All of the sources linked above are rightly critical of the prepare-and-certify model of education, which in practice turns into drill-and-test, with almost entirely external motivation. Teachers face a battle in winning back student enthusiasm against this machine. There's nothing wrong with being concerned about grades, but if that's all there is to it, students face a rude awakening after graduation.

More on that theme in an article at Common Dreams: "Wild Dreams: Anonymous, Arne Duncan, and High-Stakes Testing."

Finally, a link from Education Week on the subject of testing students who want to be teachers: "Analysis Raises Questions About Rigor of Teacher Tests." This is the meta-problem.

Saturday, February 04, 2012

I-ACT: An Alternative to Prepare-and-Certify

In "The End of Preparation" I gave an alternative to the factory-like "prepare and certify" philosophy evident in the current practice of formal education. The purpose of the present article is to develop a modest trial program to test the portfolio approach.

In order to have something specific to talk about, I'm going to outline a course that could fit into many curricula, and could be scaled in sophistication to meet the level of the student. It might be most at home in a general education program, or as a topics course in the sciences. Here's the description:
Rong! Mistakes in Scientific Thought
This course explores important conceptual mistakes in the history of scientific understanding. Even the most brilliant thinkers made bad assumptions, over-simplified, and took the wrong path occasionally. Major breakthroughs are as often related to getting rid of errant beliefs as finding better ones. Students will learn about some of these milestones, and perhaps develop some modesty about the certainty of their own beliefs.
On the first day of class, the instructor can explain that the neologism "rong" comes from Wolfgang Pauli's reputed remark that a line of reasoning was "not even wrong." Since rong is literally not even "wrong," it fits. I will use it in a noble sense, not as disparagement. An idea is rong for some fundamental reason that, when understood, advances knowledge. The belief that the sun goes around the Earth is not just wrong, it's rong. The rongness may be conceptual (as in the case of geocentrism) or methodological (as with astrology). Both are important, and somewhat humbling to read about. Our forebears weren't stupid after all. As P. L. Seidel wrote in 1847: "[M]ethodological discoverers are very badly treated. Before their method is accepted it is treated like a cranky theory; after it is treated as a trivial commonplace." Louis Pasteur comes to mind.

A Prepare-and-Certify Approach
The normal way to teach a course is to find a textbook and other source materials, set up a syllabus with major events like reading deadlines and test dates, outline a grading scheme, and list office hours. Students would write papers, get feedback, and ultimately receive a course grade. Then most of this effort would be forgotten and lost to posterity. In theory, the experience would have incrementally added to the "preparation" of the student for some eventuality that happens after graduation (the event horizon of education). In practice, no one would ever know if this is true because there are no objective measures for it. (See "An Index for Test Accuracy" to see what would be involved.)

Now that I have set the straw man in place, we can proceed to whack the silage out of him. It's clear that the course description cries out for a seminar-type approach, and that these sorts of courses already exist. What I'll do below is enlarge the conception of a traditional seminar course. I need a name for this mutation, so let's call it I-ACT, which stands in for Analyze-Create-Publish-Interact (because I-ACT means something, and ACPI sounds like an economic index).

The I-ACT Approach

The role of the instructor is to help students pick good projects, and guide them through the steps Analyze-Create-Publish-Interact, which are described below in turn. But first a schematic, to break up this wall of text with a busy, colorful graphic.



1. Analysis


In order to produce new knowledge (that is, new to the student), one has to start somewhere. Academic disciplines comprise all sorts of knowledge, but here we are interested in whatever raw material can be turned into something new. In history, it might be original sources and a philosophy of history. In chemistry, it might be a certain kind of molecule and knowledge of basic chemistry. In common games, like chess, the starting place is an understanding of the rules and pieces.

This bundle sometimes includes a list of deductive rules that check for correctness. These would include accepted spelling of words (at a basic level), rules of logic, physics formulas, or any other deterministic method of turning one thing into another, which can be done correctly or incorrectly. A haiku has a particular structure, and a blues song has a certain scale and beat. A knight in chess can only move in a certain way.

The instructor assigns a problem (or the student is tasked to find one) that has an acceptable analytical load. We shouldn't expect kindergartners to solve systems of linear differential equations. The student doesn't need to be a master of this analytical domain, but it must be within reach. Resources include anything on the Internet plus the instructor, peers, and appropriate social networks.
As an example, I will use a real assignment I used for an undergraduate research project in math. The source is Proofs and Refutations by Imre Lakatos. One chapter in the book describes how the great mathematician Augustin-Louis Cauchy was rong about something, and how it got noticed and fixed. Cauchy provided a proof  that when an infinite sum of continuous functions converges to a new function, then that new function would be continuous as well. In this case, the student's analytical load includes basic math analysis techniques (delta-epsilon proofs), and familiarity with infinite series and functions. She should be able to read Cauchy's proof (probably with some difficulty, and needing help), and understand the issue. She should be able to create examples of the ingredients for the proof, such as series of continuous functions, and be able to test them and their sum for continuity.
An outline of Cauchy's errant proof, from page 132 of Proofs and Refutations.

The analysis box is never really complete. In any discipline there's always more to be learned, and part of the  I-ACT learning process is to go back to the well to seek clarification, examples, related concepts, and so forth. Learning this self-help process is an important objective.

2. Creativity

The analytical load need not be huge. Games generally have a small set of rules to make them accessible, and in fact you don't need a lot of rules to be creative. In the creative step, we help the student work with the analytical tools in a trial-and-error exploration. This is only possible if there are wrong answers. Another way to say that is if everything that the student can possibly produce is just fine, there's nothing to be learned from the exercise. A pilot should know a good landing from a bad landing. A doctor should know a live patient from a dead one. A musician should know a major chord from a minor seventh. And so on. The student is likely to make mistakes in this, which is where the instructor, peers, and social network can help.

Even in aesthetic subjects, we don't have to accept total relativism (and hence lost learning opportunities). In a photography class, instead of trying to figure out the exact artistic merits of a photo, one can examine technique. Is the subject in focus or not? Does the rule of thirds apply or not? Are the whites white or not?
To continue the example, the student was asked to find some series of continuous functions that converge to some new function, and then see if that new function was continuous. This took some work, and really exercised what she knew about functions, limits, and continuity. This strengthened her analytical skills, and her confidence increased so that she started to feel like she knew what she was doing. At this point, she was ready to tackle the question of why Cauchy was rong. Once that is discovered, it becomes a question of how to fix it. 
From Proofs and Refutations.
There are plenty of other creative exercises here, such as conjecture and proof of properties of uniformly continuous functions. Why don't Fourier Series work? This one example of rongness can be a point of departure for many analysis topics.
The exercise of checking steps in an argument is purely analytical, but creating a solution to a problem or finding other connections is not. The creative step should produce new knowledge for the student. Internet resources (like wolframalpha.com in the case of my example) can be used to find new connections, examples, and explanations.
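The kind of trial-and-error exploration described in this example can also be tried numerically. Here is a minimal sketch in Python, assuming a standard textbook series that is not taken from the original student project: on the interval [0, 1], let the first term be x and the n-th term be x^n - x^(n-1). Every term is continuous, the partial sums telescope to x^N, and the pointwise limit is 0 for x < 1 but 1 at x = 1. The function names are hypothetical.

```python
def term(n, x):
    """n-th term of the assumed series: u_1 = x, u_n = x**n - x**(n-1) for n >= 2."""
    if n == 1:
        return x
    return x**n - x**(n - 1)


def partial_sum(N, x):
    """Sum of the first N terms; algebraically this telescopes to x**N."""
    return sum(term(n, x) for n in range(1, N + 1))


def limit_function(x):
    """Pointwise limit of the partial sums on [0, 1]; it jumps at x = 1."""
    return 1.0 if x == 1 else 0.0


def worst_point(N):
    """A point x_N in (0, 1) where the N-th partial sum is still about 1/2."""
    return 0.5 ** (1.0 / N)


if __name__ == "__main__":
    # Partial sums approach the (discontinuous) limit at each fixed x.
    for x in (0.5, 0.9, 0.99, 1.0):
        print(x, partial_sum(2000, x), limit_function(x))
    # But for every N there is a point where the partial sum is far from
    # the limit, so the convergence is not uniform.
    for N in (10, 100, 1000):
        x_N = worst_point(N)
        print(N, x_N, partial_sum(N, x_N))
```

The second loop is the interesting one: no matter how many terms are added, some point close to 1 is still about 0.5 away from the limit function. That is the gap that the notion of uniform convergence closes.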

3. Publication

Once a student has made some progress, it's time to write it up, or otherwise prepare the material for public display in electronic form on an intranet or worldwide. Here, 'public' should at least mean that the instructor can see it, but that's not an advance over traditional delivery. Student peers in the same class or program, other faculty members at the same or other institutions, social networks, and the whole World Wide Web are possible audiences.

Publishing interesting questions or intermediate results can be as useful as a finished piece of work.

When I supervised the student project on uniform convergence, social networks didn't exist like they do now. Now I could encourage a student to use reddit.com/r/math or stackoverflow.com to pose questions or try out ideas. This is not certain to succeed--these communities have to be engaged, not just spammed with drive-by questions.

4. Interaction

Interaction is a natural consequence of publishing. Over the summer I came across a delightful paper from Scott Aaronson at MIT entitled "Why Philosophers Should Care About Computational Complexity." I found it through a social network I frequent that scans for interesting stuff like this. After Scott posted the draft of his paper, he received a number of comments and suggestions. This feedback resulted in new drafts that clarified his thinking and fixed problems. Here's a quote from his blog:
Thanks to everyone who offered useful feedback! I uploaded a slightly-revised version, adding a “note of humility” to the introduction, correcting the footnote about Cramer’s Conjecture, incorporating Gil Kalai’s point that an efficient program to pass the Turing Test could exist but be computationally intractable to find, adding some more references, and starting the statement of Valiant’s sample-size theorem with the word “Consider…” instead of “Fix…”
Then there's the meta-commentary from philosophers about the paper on reddit, which adds a perspective and some new references.



Interaction this rich can illuminate everything about the work, including analysis and creativity. It can critique or endorse, dismiss or expand scope.

Desired Outcomes

How is this an improvement over traditional classes or seminars?  For me, there are several answers:

  • Learning to rely on a self-help network of resources.
  • Public display of one's work can lead to intrinsic motivation that is greater than the extrinsic "I need to get a C in this course."
  • The above point is magnified by emphasizing that this work forms part of a life-work portfolio that will be useful for a very long time. In addition to certifications (eventually instead of certifications), students have authentic work to display.
  • Engagement in social networks and general audiences that care about the topic is a good long-term investment. It leverages one's own abilities and adds credibility to one's published works.
  • Student work competes on merit, not on who they are or what institution they're from, and can help them assess their own skill and knowledge.
  • By separating analytical techniques from creativity, we can prepare students for both. Creativity takes self-confidence and practice. This can be nurtured in a controlled environment.
  • All the tools and methods used can be applied outside of a college setting. It's a practical real-world skill to develop a skill set, create something with it, publish it online professionally, and generate feedback for improvement. 

Next Steps


I am looking for a handful of science departments to try some of these ideas out. The course description above is one of many that could be used. Once the details are in order, I'll seek some external funding for travel, some implementation costs (setting up a portfolio system maybe), and money to run a small conference. If you're interested, click on my profile or link to my vita and email me.

Saturday, January 28, 2012

Assessing a QEP

On Wednesday, Guilford College hosted an NCICU meeting about SACSCOC accreditation. I had volunteered to do a very short introduction to my experience with the Quality Enhancement Plan (QEP) at Coker College, since I had seen the thing from inception to impact report. I got permission from Coker to release the report publicly, so here it is:


The whole fifth year report passed with no recommendations, and the letter said nice things about the QEP, so it's reasonable to assume that it's an acceptable exemplar to use in guiding your own report.

Assessment of the QEP program is an important part of the impact report, and this is a good place to record how that worked. The QEP at Coker was about improving writing effectiveness in students, and we tried several ways of assessing success. Only one of these really worked, so I will describe them in enough detail so you don't repeat my mistakes. Unless you just feel compelled to.

Portfolio Review.
I hand-built a web-based document repository (see "The Dropbox Idea" for details) to capture student writing. After enough samples were accumulated, I spent a whole day randomly sampling students in four categories: first year/fourth year vs day/evening. There were 30 of each, for 120 students. Then I sampled three writing samples from each to create a student portfolio. There was some back and forth because some students didn't have three samples at that point. I used a box cutter to redact student names, just like I imagine the CIA does. Each portfolio got an ID number that would allow me to look up who it was. The Composition coordinator created a rubric for rating the samples, and one Saturday we brought in faculty, administrators, adjuncts, and a high school English teacher to rate the portfolios. We spent a good part of the day applying the rubric to the papers, and many of the papers were rated three times. All were rated at least twice by different raters.
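The sampling step described above is a stratified random sample, and it can be automated. The following is a minimal sketch under stated assumptions, not the procedure actually used at Coker (which was done by hand): the record layout (`id`, `category`, `papers`) and the function name `sample_portfolios` are hypothetical. Students with fewer than three papers are filtered out up front, which avoids the back and forth mentioned above.

```python
import random


def sample_portfolios(students, per_category=30, papers_each=3, seed=None):
    """Draw a stratified random sample of student portfolios.

    students: list of dicts with hypothetical keys
        'id' (student identifier), 'category' (e.g. 'first-year/day'),
        and 'papers' (list of writing samples in the repository).
    Returns a dict mapping an anonymous portfolio ID to the sampled record.
    """
    rng = random.Random(seed)

    # Keep only students who have enough writing samples, grouped by category.
    eligible = {}
    for s in students:
        if len(s["papers"]) >= papers_each:
            eligible.setdefault(s["category"], []).append(s)

    # Sample the same number of students from each category.
    chosen = []
    for category in sorted(eligible):
        pool = eligible[category]
        chosen.extend(rng.sample(pool, min(per_category, len(pool))))

    # Shuffle so that portfolio IDs do not reveal the category.
    rng.shuffle(chosen)

    return {
        pid: {
            "student_id": s["id"],  # lookup key; withhold this from raters
            "category": s["category"],
            "papers": rng.sample(s["papers"], papers_each),
        }
        for pid, s in enumerate(chosen, start=1)
    }
```

With four categories and the default of 30 students each, this yields the 120 portfolios described, provided each category has at least 30 eligible students. Passing a fixed `seed` makes the draw reproducible.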

The results were disappointing. There were some faint indications of trends, but mostly it was noise, and not useful for steering a writing program. In retrospect, there were two conceptual problems. First, the papers we were looking at were not standardized. It's hard to compare a business plan to a short story. Second, the rubrics were not used in the assignments, but conjured later when we wanted to assess. For rubrics to be effective, it's essential that they be integrated as fully as possible into the construction of the assignment.

So this was a lot of work for a dud of a report, most of which is probably my fault.

Pre-post Test
One of the administrators decided we should apply a writing placement test, which we already had data for, as a measure of writing gain by giving it again as a post-test after students took the ENG 101 class. The assignment was to find and correct errors in sample sentences. The English instructors told us it wouldn't work and it didn't. More noise.

Discipline-Specific Rubrics
We did, in fact, learn something from the rubric fiasco. We allowed programs to create their own rubrics, which could be applied to assignments in the repository. So an instructor could look at a work, pull up the custom rubric, and rate it right there and then. Since the prof knew the assignment, this seemed like a way to get more meaningful results. I think this would have worked, but by the time we got all the footwork done, the QEP program was a couple of years under way. I left the college before it was possible to do a large-scale analysis of the results that were in the database. In summary: good idea, executed too late.

Direct Observation by Faculty
Back in 2001, when I got the job of being SACSCOC liaison, I got a copy of the brand new Principles and started reading. The more I read, the more I was terrified. And nothing frightened me more than CS 3.5.1, the standard on general education. I didn't know at the time that the standard said one thing, but everyone interpreted it in a completely different way (it was written as a minimum standard requirement, but everyone looked for continuous improvement). So I was one of those people you see at the annual meeting who look like they are on potent narcotics, drifting around with a dazed look at the enormousness of the challenge. (Note: I think they should hand out mood rings at the annual meeting so you can see how stressed someone is before you talk to them.)

In an act of desperation, I led an effort to create what we now call the Faculty Assessment of Core Skills (FACS), which is nothing more than subjective faculty ratings of liberal arts skills demonstrated by students in their classes. The skills included writing effectiveness. At the end of the semester, each instructor was supposed to give a subjective rating to each student taught for observed skills on the list. You can read all about this in the Assessing the Elephant manuscript, or on this blog, or in one of the three books I wrote chapters for on the subject.

Because we had started the FACS before the QEP, we had baseline data, plus data for every semester during the project's life. Thousands and thousands of data points about student writing abilities. When we started the FACS I didn't have much hope for it--it was a "Hail Mary" pass at CS 3.5.1. But as it turns out, it was exactly what we needed. We were able to show that FACS scores improved faster for students who had used the writing lab than for students who hadn't. Moreover, this effect was sensitive to the overall ability of the student, as judged by high school grades. See "Assessing Writing" for the details.

I have given many talks about the FACS over the years, and I get interesting reactions. One pair of psychologists seemed amazed that anything so blatantly subjective could be useful for anything at all, but they were very nice about it. When I post FACS results on the ASSESS-L listserv, you can hear the crickets chirping afterwards. I guess it doesn't seem dignified because it doesn't have a reductionist pedigree.

So I was shocked at the NCICU meeting, when SACSCOC Vice President Steve Sheeley said things like (my notes, probably not his exact words) "Professors' opinions as professionals are more important than standardized tests," and "Professors know what students are good at and what they are not good at."

The reason for my reaction is that when one hears official statements about assessment, it's almost always emphasized that it has to be suitably scientific. "Proven valid and reliable" is a standard formula, and certainly "measurable" figures (see "Measurement Smesurement" for my opinion on that). However it is stated, there isn't much room for something as touchy-feely as subjective opinions of course instructors. I do give good arguments for both validity and reliability in Assessing the Elephant, but FACS is never going to look like a psychometrician's version of assessment. So it was a shock and a very pleasant surprise to hear a note of common sense in the assessment symphony. I think when Steve made that remark, he assumed that this special knowledge professors acquire after working with students was simply inaccessible as assessment data. But it's not, and by now Coker has many thousands of data points over more than a decade to prove it. And it turned out to be the key to showing the QEP actually worked.

I have implemented the FACS at JCSU, and created a cool dashboard for it. I showed this off at the meeting, and you can download a sample of it here if you want. The real one is interactive so you can disaggregate the data down to the level you want to look at, even generating individual student reports for advisors. Setting up and running the FACS is trivial. It costs no money, takes no time, and you get rich data back that can be used for all kinds of things. Everyone should do this as a first, most basic, method of assessment.

Wednesday, January 25, 2012

Closed and Open Thinking

Most readers will know William of Occam's principle about not multiplying entities unnecessarily. It's commonly thought of as "the simplest explanation is the best explanation." I learned about a countervailing principle in Arora and Barak's Computational Complexity: A Modern Approach. It's even older than the venerable Mr. Occam, dating back to the Epicureans, and it states that we should not abandon any explanation that is consistent with the facts. I have mentioned this before, but I had an interesting thought at lunch today: what if this tension between efficiency and open-mindedness is at the heart of the Dunning-Kruger effect? In case you've missed that bit of news, here's the introduction from the Wikipedia entry:
The Dunning–Kruger effect is a cognitive bias in which unskilled people make poor decisions and reach erroneous conclusions, but their incompetence denies them the metacognitive ability to recognize their mistakes.[1] The unskilled therefore suffer from illusory superiority, rating their ability as above average, much higher than it actually is, while the highly skilled underrate their own abilities, suffering from illusory inferiority. 
Actual competence may weaken self-confidence, as competent individuals may falsely assume that others have an equivalent understanding. As Kruger and Dunning conclude, "the miscalibration of the incompetent stems from an error about the self, whereas the miscalibration of the highly competent stems from an error about others" (p. 1127).[2] The effect is about paradoxical defects in cognitive ability, both in oneself and as one compares oneself to others.
This just puts some research behind what Bertrand Russell is quoted as having said:
The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.
So what we have is two epistemologies, and we shouldn't be hasty to choose one as better than the other, despite the obvious bias of the quotes above.

Method 1 (Closed). Obtain a small amount of evidence, and create the most restrictive explanation that fits the facts. Subsequent facts that come to surface do not affect the conclusion.

William of Occam would probably sue me for defamation if he were around to read this. I have intentionally restated his principle in a very narrow sense in order to contrast it with:

Method 2 (Open). Continually gather information and create increasingly complex explanations that account for all the observations. Although the current explanation may be the simplest one that fits the facts, no explanation is ever final--all the others that are consistent with facts are kept in reserve.

I have given the methods intuitive names for convenience (closed vs open), not as prejudgments. The closed method will be the better one in situations where observations can be explained simply. This may be because the underlying cause and effect relationship is of low complexity, or perhaps that the variance in observed characteristics is small.  "All dogs have four legs" would be an example of the latter. "Stuff falls when you drop it" applies to the former.

The most basic structure of language is a verb applied to a noun, which is a model for the closed epistemology. "Birds fly," "Fire burns," and so on, are summaries of real world observations that can be arrived at accurately from just a few examples and without much error. It's an easy conjecture that these simple relationships became so integral to understanding that exceptions were met with challenge. Such as: "If an ostrich doesn't fly, then it can't be a bird." This is what school children encounter when they learn that a whale isn't a fish. The language we use rather gracelessly allows these exceptions in the form of conjunctive appendices, but this is clearly a hack. I will suggest below that a formal language is required to overcome that difficulty (for example, expressions of formal logic, which defines a consistent way of using "or" and "and," and allows unlimited nesting of exceptions, so that any true/false relationship can be expressed unambiguously).
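To make the contrast with "conjunctive appendices" concrete, here is a minimal sketch in Python of a rule written as an explicit Boolean expression. The function name and the choice of exceptions (ostriches and bats) are illustrative assumptions, not part of the argument above.

```python
def can_fly(is_bird, is_ostrich=False, is_bat=False):
    """'Birds fly, except ostriches; other animals don't, except bats.'

    The English sentence needs two appended exceptions. The Boolean
    expression states the same thing with an unambiguous use of
    'and', 'or', and 'not', and further exceptions can be nested
    inside the parentheses without changing the form of the rule.
    """
    return (is_bird and not is_ostrich) or (not is_bird and is_bat)


if __name__ == "__main__":
    print("robin:", can_fly(is_bird=True))
    print("ostrich:", can_fly(is_bird=True, is_ostrich=True))
    print("dog:", can_fly(is_bird=False))
    print("bat:", can_fly(is_bird=False, is_bat=True))
```

The closed rule "birds fly" is the special case in which both exception flags are left at their defaults.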

Quickly assembling a set of closed rules for a new environment seems like a good idea. It's a fast best-guess approach to finding useful cause and effect relationships.

Of course, the closed method is not suited to doing science. Kuhn's The Structure of Scientific Revolutions suggests that closed outlooks solidify at any level of complexity, and require some bashing to break up. An example would be the certainty (due to Aristotle) that celestial bodies move in perfect circles. This is like Gould's idea of "punctuated equilibrium" in biological evolution. I graphed the associated relationship between predictability and complexity recently in "Randomness and Prediction."

The question is when to use the open versus closed approach.  Historically, I think the closed approach may have had a blanket "explanation" in the form of mystical associations of cause and effect, which provides a putative low-complexity relationship. "Joe got struck by lightning because he displeased the weather god" has the appearance of an explanation, except that it's not actually predictive. It takes a dedicated effort to discover that fact, however. For that we need an open method.

The disadvantages of the open method make a long list. First, it's energy intensive--you have to continually be making observations, comparing what you see to what you think you should see (e.g. three-legged cat), and updating the ever-growing explanation. It also takes more energy to use or communicate the current explanation, and as soon as you do, it's out of date again.

These are not fatal flaws, but ones to be considered. For some phenomena, this is probably how we naturally reason, if in a limited way. For example, our memory and minds do something like Bayesian reasoning (updating the probability of an event based on how frequently we encounter it), although our on-board system has been shown to be deeply flawed (see Daniel Kahneman's recent book, for this and a lot more).
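The parenthetical description of Bayesian-style updating can be made concrete with a few lines of Python. This is a minimal sketch assuming a Beta-Bernoulli model with a uniform prior, chosen for illustration only; it is not a claim about how memory actually works, and the function name is hypothetical.

```python
from fractions import Fraction


def estimated_probability(observations, prior_yes=1, prior_no=1):
    """Estimate the probability of an event from repeated observations.

    observations: iterable of booleans (True = the event was encountered).
    prior_yes, prior_no: pseudo-counts encoding the starting belief;
        1 and 1 is the uniform prior (an assumption of this sketch).
    Returns the posterior mean as an exact fraction.
    """
    yes, no = prior_yes, prior_no
    for saw_event in observations:
        if saw_event:
            yes += 1
        else:
            no += 1
    return Fraction(yes, yes + no)


if __name__ == "__main__":
    # Each new observation nudges the estimate; nothing is ever final.
    seen = []
    for saw_event in [True, True, False, True]:
        seen.append(saw_event)
        print(seen, estimated_probability(seen))
```

Each observation shifts the estimate a little, and the shifts get smaller as evidence accumulates. This is the open method in miniature, with all the bookkeeping costs described above.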

Perhaps the open process needs a kind of empirical 'clean-up' to be really useful. Elegant explanations generally only work with clean data. That is, if you want to discover Newtonian mechanics, it's unlikely that you can do this with just your eyes and ears. When Galileo began measuring the "drop" times on an inclined plane, he was onto something.

In addition to a solid empirical methodology, an open method also needs a way to reduce the size of an explanation while retaining its predictive power. In my graphs in "Randomness and Prediction," I plotted predictability versus complexity, not size. It works like this.

Suppose I have an observed relationship that I have cataloged like this: (1,2), (2,4), (3,8), (4,16), where this might be thought of as a cause and effect. A one 'causes' a two, and so on. Because my empirical methods are sound, I trust that there's not too much error in the observed values. As the list grows by using the open method, I have a better and better 'explanation' of past events and a better and better predictor of future ones (fine print about the inductive hypothesis goes here...). But the list will become too unwieldy to remember, communicate, or use effectively, as the observations accumulate. What I need is a kind of data compression to reduce the list to a manageable size. If I do this correctly, the explanation doesn't change, nor does the complexity, but the size does. I can reduce it to effect = 2^cause if I have the idea of an exponential function. We might call this data reduction the creation of a formal theory.
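The compression step can be checked mechanically. Below is a minimal sketch in Python using the four observations from the paragraph above. The rival rule is an added illustration: a cubic polynomial constructed to pass through the same four points, showing that more than one compact explanation can be consistent with the facts. All function names are hypothetical.

```python
# The catalog of observed (cause, effect) pairs from the text.
observations = [(1, 2), (2, 4), (3, 8), (4, 16)]


def theory(cause):
    """The compressed explanation: effect = 2^cause."""
    return 2 ** cause


def rival(cause):
    """A cubic through the same four points (illustrative alternative)."""
    n = cause
    # (n-1)(n-2)(n-3) is a product of three consecutive integers,
    # so it is always divisible by 3 and the division is exact.
    return 2 + 2 * (n - 1) + (n - 1) * (n - 2) + (n - 1) * (n - 2) * (n - 3) // 3


def consistent(rule, data):
    """True if the rule reproduces every observed pair."""
    return all(rule(cause) == effect for cause, effect in data)


if __name__ == "__main__":
    print("theory fits:", consistent(theory, observations))
    print("rival fits: ", consistent(rival, observations))
    # The two rules agree on everything seen so far but disagree on
    # the next case, so one more observation can separate them.
    print("predictions for cause 5:", theory(5), rival(5))
```

Both rules are far shorter than a growing list, and both reproduce it exactly. They part ways at the fifth observation (32 versus 30), which is where the fine print about the inductive hypothesis comes in, and why the open method keeps the alternatives in reserve.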

Conclusions
I started by wondering whether the predicament of people who don't know things, and further don't know that they don't know them, could be attributed to one of the two epistemologies mentioned at the beginning. I think the argument above shows that it's possible that the two barriers of empiricism and abstract thinking needed to effectively use an open method are too formidable for a lot of people. For one thing, it's not hard to get by using closed systems, and it may require formal education in scientific method and meta-cognition to effectively use open systems.

One final note appropriate to the calendar in the US: it's a lot easier to communicate closed explanations than open ones. Even with data compression, "things fall" is less complex than Newton's laws. So in a debate made with sound bites from political candidates, the closed epistemology wins. It's easier, it's comfortable to the listener--the whole construct of English is built to 'hack' a closed way of thinking by adding a few contingencies ("Cats have four legs, but I once saw one with three.")--and the explanations take up less time to say. You have to expand "Drill!" into "Drill, baby, drill!" to make it bigger because the basic message can be summed up in one word, and that may seem too short for some audiences as a serious thought.

This is just another reason why we should be deliberate about teaching science and meta-cognition in school, not as alien ways of thinking that only people in white coats use at work, but as the mode of thinking that differentiates us from the other mammals, and might allow us someday to collectively make good decisions.

Sunday, January 22, 2012

Assorted Links

You can file www.sightmap.com under "novel data representation." It's a heat map overlay of Google Maps that shows the most popular spots for taking photos, using the upload site Panoramio as the source.

This could be good fodder for a student research project. The only disappointment for me was not being able to zoom all the way down to street level resolution.

There's a new journal for those interested in the intersection of empiricism and computer science, in the spirit of Wolfram's A New Kind of Science. EPJ.org's new "Data Science" title seeks to address these challenges:

  • how to extract meaningful data from systems with ever increasing complexity
  •  how to analyse them in a way that allows new insights
  •  how to generate data that is needed but not yet available
  •  how to find new empirical laws, or more fundamental theories, concerning how any natural or artificial (complex) systems work

Now I have one less excuse for not organizing my research notes into actual articles. While we're at it, here's a list of "Best Paper" awards in computer science.

    Game theory is a fascinating and powerful set of ideas. Ever notice at the baggage carousel in the airport how everyone crowds up as close as they can, which means no one can see anything? If everyone took three steps back, the whole group would benefit. Paradoxes like these are the subject matter of this field, which draws on mathematics and economics. There's a site that maps out the field in an easily accessible format. It's even easy to remember: GameTheory101.com.

    While browsing for study tips for my daughter Epsilon, I found Study Hacks, with this bit of non-cognitive wisdom, originally quoted from a Reddit discussion thread:

    The people who fail to graduate from MIT, fail because they come in, encounter problems that are harder than anything they’ve had to do before, and not knowing how to look for help or how to go about wrestling those problems, burn out. 
    The students who are successful, by contrast, look at that challenge, wrestle with feelings of inadequacy and stupidity, and then begin to take steps hiking that mountain, knowing that bruised pride is a small price to pay for getting to see the view from the top. They ask for help, they acknowledge their inadequacies. They don’t blame their lack of intelligence, they blame their lack of motivation.

    Check out this guy's portfolio as a case study.

    From University of Portland comes a fascinating case study "Why the Vasa Sunk: 10 Lessons Learned." From the introduction:

    Around 4:00 PM on August 10th, 1628 the warship Vasa set sail in Stockholm harbor on its maiden voyage as the newest ship in the Royal Swedish Navy.  After sailing about 1300 meters, a light gust of wind caused the Vasa to heel over on its side. Water poured in through the gun portals and the ship sank with a loss of 53 lives. 

    The rest is a case study in how not to manage a complex project. As Ashleigh Brilliant wrote, "It could be that the purpose of your life is only to serve as a warning to others."

    Finally, a more positive spin on leadership from The Atlantic: "Humble Leaders are More Liked and More Effective." Take it with a grain of salt (it's a small study), but be proud of your humility.

    Thursday, January 19, 2012

    An Index for Test Accuracy

    This post is an overdue follow-up to "Randomness and Prediction," which takes up the question of how we should judge the quality of a test. There are many kinds of tests, but for the moment I'm only interested in ones that are supposed to predict future performance. Since education is in the preparation business, the measure of success should be "did we prepare the student?" If that question can be answered satisfactorily with a yes or no, this feedback can be used to determine the accuracy of tests that are supposed to predict this outcome.

    As an example, I used the College Board's SAT benchmarks (pdf), in which a test taken during high school years is used to predict first year college grades. The benchmark study is interesting because it is one of the few examples of test-makers who actually check the accuracy of their instruments and report that information publicly. You can find my first thoughts on this in "SAT Error Rates." The source material mainly consists of Table 1a on page 3 of the College Board report:



    We can use this to see the power of the SAT to predict first year college grades at any cut-off score on the table. If we picked 1200, for example, we can see that 73% of the students we admit will have a first year grade average of 2.7 or above. In other words, a 73% true positive rate and a 27% false positive rate. Because we are helpfully given the number of samples in each bin (the N column), we can also calculate the false negative and true negative rates for the test. Just multiply N by the percentage of students with FGPA > 2.7 to find the number of students in that bin who were successful in their first year (by that definition), and subtract that from N to get the number who were not. The graph below shows this visually.

    The two graphs look roughly like normal distributions with means about 150 SAT points apart. This is all quite interesting, but for my purposes here I just want to pull one number from this: the total percentage of students with FGPA > 2.7, which we can get by summing up all the heights on the blue line and dividing by the total of all samples. This turns out to be 59%. 

    The College Board's benchmark has 65% accuracy. In other words:
    • If a student's SAT score exceeds the benchmark, there is a 65% chance they will have FGPA > 2.7
    • Of all students, 59% will have FGPA > 2.7
    The difference between these numbers is not large: .65 - .59 = 6%. Using the benchmark to select "winners", we can do six percent better than just randomly sampling. If all we care about is the percentage of "good" students we get, that's the end of the story. But there's another dimension: the rate of unfair rejections, or false negatives. 

    If we randomly sample whom we accept, then 59% of those we reject would have had FGPA > 2.7 (assuming this is the rate for the whole population). Since it's unfair to reject qualified candidates, we might call 1-.59 = 41% the fairness of the method of selection. Another name for fairness is the true negative rate. I plotted it against the accuracy (true positive rate) in the previous article. Here it is again. 



    The blue line is accuracy, and the red line is fairness. They meet at 65%. So we can see that although using the SAT benchmark is only six percent more accurate than random sampling, it is .65 - .41 = 24% more fair. How do we make sense of how good this is?

    One overall measure of test predictive power is the average rate of correct predictions, taking into account both true positives and true negatives. We might call that the "correctness rate" of the cut-off benchmark. Where the lines cross in the graph above, the rates for true positives and true negatives are both 65%, so the correctness rate is also 65%. In general, the formula for the correctness rate c at a given cut-off benchmark is:
    c = (number of actual positives that meet the benchmark + number of actual negatives that do not meet the benchmark) / (total number of all observations)
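Here is a rough sketch of how accuracy, fairness, and the correctness rate could be computed from a binned table like the one described above. The bin values are invented for illustration and are not the College Board's Table 1a. The names `BINS` and `rates_at_cutoff` are hypothetical, and the sketch assumes at least one bin falls on each side of the cut-off.

```python
# Hypothetical bins: (lowest score in bin, N students, fraction with FGPA > 2.7).
# These numbers are made up; they are NOT taken from the College Board report.
BINS = [
    (800, 100, 0.30),
    (1000, 200, 0.50),
    (1200, 150, 0.70),
    (1400, 50, 0.90),
]

def rates_at_cutoff(bins, cutoff):
    """Return (accuracy, fairness, correctness) when admitting scores >= cutoff."""
    true_pos = false_pos = true_neg = false_neg = 0.0
    for score, n, frac_success in bins:
        successes = n * frac_success   # students in this bin with FGPA > 2.7
        failures = n - successes       # the rest of the bin
        if score >= cutoff:            # predicted to succeed
            true_pos += successes
            false_pos += failures
        else:                          # predicted not to succeed
            false_neg += successes
            true_neg += failures
    total = true_pos + false_pos + true_neg + false_neg
    accuracy = true_pos / (true_pos + false_pos)   # share of admits who succeed
    fairness = true_neg / (true_neg + false_neg)   # share of rejects who would not have
    correctness = (true_pos + true_neg) / total    # the formula for c above
    return accuracy, fairness, correctness

accuracy, fairness, correctness = rates_at_cutoff(BINS, 1200)
print(accuracy, fairness, correctness)
```

Sweeping `cutoff` over every bin boundary would trace out the three curves on the graph.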
    Below is a graph that adds the correctness rate to the accuracy and fairness plots.


    The correctness rate potentially solves the problem of considering accuracy and fairness separately. It does not, however, give us an absolute measure to compare the quality of tests with. This is because the fraction of actual positives in the population can vary, making detection easier or more difficult. If we are interested in comparing different tests over different kinds of detection environments, we need something different. In the next section we will derive an index to try to address this problem.

    A Comparative Index

    In general, there is not a good way to turn results about predictability into results about complexity. However, using ideas from computational complexity, I stumbled upon a transformation that gives us another way to think about the predictive power of a test.

    In order to proceed, imagine an even better version of the test. In this fantasy, a proportion p of the test benchmark results come back marked with an asterisk. Imagine that this notation means that the result is known to be true. The unmarked ones have no guarantee--some will be correct and some not. In this way we imagine separating out the good and useful work of the test into the p group, whereas the rest is just random guessing.

    It's just like a multiple choice test. Some answers you know you know, and others you guess at. By working backwards we can find that "known true" fraction:
    Correctness rate = (fraction known correct) + (fraction not known correct)*(rate of correct responses with random sampling)
    Using the numbers from the SAT benchmark in the previous section gives us:
    .65 = p + (1- p) * .59
    p = (.65 - .59)/(1-.59)
    The fraction that would have to be "known true" is p = 14.6%. The advantage of this transformation is that we have a single number that is easy to visualize, and takes the context into account. If you wanted to explain it to someone, it would go like this:
    The SAT benchmark prediction is like having a perfect understanding of 14.6% of test-takers and guessing at the rest.
    The graphs below show the linear relationship between average test accuracy, the larger of the percent of positives or negatives in the population (the "guess rate"), and the index p--the equivalent proportion of "perfect understanding" outcomes.

    The "guess rate" is just the bigger of the fraction of negatives or positives in the population. If there are more positives, then without more information, you would guess that any randomly chosen outcome would be positive. If there are more negatives, the best guess (without any other information) is that the outcome would be negative. In formulas, we will call this guess rate "r." For the SAT example, the real positive rate is 59%, so r = .59. If the real positive rate had been 45%, we'd use r = 1 - .45 = 55%.

    As an example to illustrate the graph above, if the number of actual positives and negatives are evenly split at r = 50%, then a test that can predict with 80% correctness has the equivalent "perfect understanding" index of 60%. But if the proportion of positives is r = 70% instead of 50%, the index drops to 33%. It's reasonable to say that even though the correctness rate is the same, the first test is almost twice as good as the second one.

    Note that if the guess rate equals the test accuracy, the test explains exactly nothing, which is as it should be.

    Here's a general formula for computing the index p, which is the proportion of "perfect understanding" test results. The other two variables are c = the test's average correct classification rate, and r = the larger of the proportions of negative or positive actual outcomes. In the SAT example, 59% were successful according to the FGPA criterion, so r = .59. If it had been 45% successful, then we'd use r = 1 - .45 = 55%. Given these inputs, we have a simple formula for the index p:
    p = (c - r)/(1 - r)
    On the last graph, p is the height of the line, c is the bottom axis, and four values of r (guess rate) are given, one for each curve as noted on the legend.
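For anyone who wants to try the index on their own numbers, here is a small sketch of the formula. The function names `guess_rate` and `understanding_index` are mine and not part of any library.

```python
def guess_rate(positive_rate):
    """r: the larger of the positive and negative fractions in the population."""
    return max(positive_rate, 1.0 - positive_rate)

def understanding_index(c, r):
    """p = (c - r) / (1 - r), the equivalent 'perfect understanding' proportion.

    c is the correctness rate of the test and r is the guess rate.
    """
    if r >= 1.0:
        raise ValueError("r must be below 1; otherwise there is nothing to predict")
    return (c - r) / (1.0 - r)

# SAT benchmark example from the text: c = .65, 59% actual positives.
print(round(understanding_index(0.65, guess_rate(0.59)), 3))

# The two 80%-correct tests compared above, with r = 50% and r = 70%.
print(round(understanding_index(0.80, 0.50), 3))
print(round(understanding_index(0.80, 0.70), 3))
```

When c equals r, the function returns zero, matching the note above that such a test explains nothing.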

    (Note: edited 1/20/2012 for clarity)

    Friday, December 16, 2011

    Free Hypothesis-Generating Software

    In the last year there have been announcements of two free software packages that use machine learning techniques to mine data for relationships. The resulting mathematical formulas can be used to form hypotheses about the underlying phenomena (i.e. whatever the data represents).

    The first one I have mentioned before. It's Eureqa from Cornell, which uses symbolic regression. There is an example on the Eureqa site that poses this sample problem:
    This page describes an illustrative run of genetic programming in which the goal is to automatically create a computer program whose output is equal to the values of the quadratic polynomial x^2+x+1 in the range from –1 to +1. That is, the goal is to automatically create a computer program that matches certain numerical data. This process is sometimes called system identification or symbolic regression.
    The program proceeds as an evolutionary search. The graph pictured below is a schematic of the way the topology of the evolved "critters" is formed.
    A family tree of mathematical functions. (Image Source: geneticprogramming.com)
    There is a limitation to genetic programming that is also a threat to any intelligent endeavor: the problem may not be amenable to evolutionary strategies. There are some problems where the only way to solve them is exhaustive search. Only if the solution space is "smooth" in the sense that good solutions are "near" almost-good solutions is the genetic approach going to find solutions faster than exhaustive search. On a philosophical note, modern successes with physical sciences suggest that the universe is kind to us in this regard. The "unreasonable effectiveness" of mathematics (the title of an article by Eugene Wigner) in producing formulas that model real world physics is a hopeful sign that we may be able to decode the external environment so that we can predict it before it kills us. (The internal organization of complex systems is another matter, and there's not much success to look to there.) Note, however, that even here formulas have not really been evolutionary, but revolutionary. Newton's laws of motion are derivable from Einstein's relativity, but not vice versa. The "minor tweak" approach doesn't work very often, Einstein's Cosmological Constant notwithstanding.
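To make the "smoothness" point concrete, here is a toy sketch of the simplest evolutionary search I can think of: mutate the current best guess a little and keep the child only if it fits better. This is not Eureqa's algorithm, and it is not genetic programming over expression trees. It assumes we already know the answer has the form a*x^2 + b*x + c and searches only for the three coefficients. The names, step count, mutation size, and seed are all arbitrary choices of mine.

```python
import random

def target(x):
    """The quadratic from the quoted example: x^2 + x + 1."""
    return x * x + x + 1

# Sample points covering the range from -1 to +1.
XS = [i / 10 for i in range(-10, 11)]

def error(coeffs):
    """Sum of squared differences between a*x^2 + b*x + c and the target."""
    a, b, c = coeffs
    return sum((a * x * x + b * x + c - target(x)) ** 2 for x in XS)

def evolve(steps=2000, sigma=0.1, seed=0):
    """Mutate-and-keep-the-better search over the three coefficients."""
    rng = random.Random(seed)
    best = [0.0, 0.0, 0.0]
    best_err = error(best)
    for _ in range(steps):
        # A "minor tweak": nudge every coefficient by a small random amount.
        child = [w + rng.gauss(0.0, sigma) for w in best]
        child_err = error(child)
        if child_err < best_err:   # survival of the better fit
            best, best_err = child, child_err
    return best, best_err

best, best_err = evolve()
print(best, best_err)
```

The search only has a chance because nearby coefficients give nearby errors. If the error jumped around unpredictably between neighboring guesses, small mutations would be no better than random draws.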

    The second data miner is aptly called MINE, and comes from the Broad Institute of Harvard and MIT. You can read about it on their site broadinstitute.org. The actual program is hosted at exploredata.net, where you can download a java implementation with an R interface. Here's a description from the site:
    One way of beginning to explore a many-dimensional dataset is to calculate some measure of dependence for each pair of variables, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic used to measure dependence should have the following two heuristic properties. 
    Generality: with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships. 
    Equitability: the statistic should give similar scores to equally noisy relationships of different types. For instance, a linear relationship with an R^2 of 0.80 should receive approximately the same score as a sinusoidal relationship with an R^2 of 0.80.
    It's interesting that this is a generalized approach to my correlation mapper software, the difference being that I have only considered linear relationships. For survey data, it's probably not useful to look beyond linear relationships, but I look forward to trying the package out to see what pops up. It looks easy to install and run, and I can plug it into my Perl script to automatically produce output that complements my existing methods. A project for Christmas break, which is coming up fast.
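The "score every pair, then rank" strategy is easy to sketch for the linear-only case. The code below uses plain Pearson correlation as the dependence measure. It is not MINE's statistic and not my actual Perl script, and the toy columns are invented for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_pairs(columns):
    """Score every pair of columns by |r| and return them strongest first."""
    scored = []
    for a, b in combinations(sorted(columns), 2):
        scored.append((abs(pearson(columns[a], columns[b])), a, b))
    return sorted(scored, reverse=True)

# Hypothetical data: one linear relationship and one purely quadratic one.
data = {
    "x": [-2, -1, 0, 1, 2],
    "double_x": [-4, -2, 0, 2, 4],
    "x_squared": [4, 1, 0, 1, 4],
}

for score, a, b in rank_pairs(data):
    print(f"{a} vs {b}: {score:.2f}")
```

In this toy data the linear pair gets the top score, while x and x_squared score zero even though one completely determines the other. That blind spot is what the "generality" property in the description above is meant to remove.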

    Update: I came across an article at RealClimate.org that illustrates the danger of models without explanations. Providing a correlation between items, or a more sophisticated pattern based on Fourier analysis or the like, isn't a substitute for a credible explanatory mechanism. Take a look at the article and comments for more.

    By coincidence, I am reading Emanuel Derman's book Models.Behaving.Badly: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life. It has technical parts, which I find quite interesting, and more philosophical parts that leave me scratching my head. The last chapter, which I haven't read yet, advertises "How to cope with the inadequacies of models, via ethics and pragmatism." Stay tuned...

    Update 2: You can read technical information about MINE in this article and supplementary material.