Saturday, January 28, 2012

Assessing a QEP

On Wednesday, Guilford College hosted an NCICU meeting about SACSCOC accreditation. I had volunteered to give a very short introduction to my experience with the Quality Enhancement Plan (QEP) at Coker College, since I had seen the thing through from inception to impact report. I got permission from Coker to release the report publicly, so here it is:


The whole fifth-year report passed with no recommendations, and the letter said nice things about the QEP, so it's reasonable to treat it as an acceptable exemplar for guiding your own report.

Assessment of the QEP is an important part of the impact report, and this is a good place to record how that worked for us. The QEP at Coker was about improving students' writing effectiveness, and we tried several ways of assessing success. Only one of these really worked, so I will describe them in enough detail that you don't repeat my mistakes. Unless you just feel compelled to.

Portfolio Review.
I hand-built a web-based document repository (see "The Dropbox Idea" for details) to capture student writing. After enough samples had accumulated, I spent a whole day randomly sampling students in four categories: first year/fourth year crossed with day/evening. There were 30 in each category, for 120 students. Then I drew three writing samples from each to create a student portfolio. There was some back and forth because some students didn't have three samples at that point. I used a box cutter to redact student names, just like I imagine the CIA does. Each portfolio got an ID number that would allow me to look up whose it was. The Composition coordinator created a rubric for rating the samples, and one Saturday we brought in faculty, administrators, adjuncts, and a high school English teacher to rate the portfolios. We spent a good part of the day applying the rubric to the papers; many were rated three times, and all were rated at least twice by different raters.

The results were disappointing. There were some faint indications of trends, but mostly it was noise, and not useful for steering a writing program. In retrospect, there were two conceptual problems. First, the papers we were looking at were not standardized; it's hard to compare a business plan to a short story. Second, the rubrics were not used in the assignments but conjured up later, when we wanted to assess. For rubrics to be effective, they have to be integrated as closely as possible into the construction of the assignment.

So this was a lot of work for a dud of a report, most of which is probably my fault.

Pre-post Test
One of the administrators decided we should use a writing placement test, for which we already had data, as a measure of writing gain by administering it again as a post-test after students took the ENG 101 class. The task was to find and correct errors in sample sentences. The English instructors told us it wouldn't work, and it didn't. More noise.

Discipline-Specific Rubrics
We did, in fact, learn something from the rubric fiasco. We allowed programs to create their own rubrics, which could be applied to assignments in the repository. An instructor could look at a piece of work, pull up the custom rubric, and rate it right then and there. Since the instructor knew the assignment, this seemed like a way to get more meaningful results. I think it would have worked, but by the time we got all the footwork done, the QEP was a couple of years under way. I left the college before it was possible to do a large-scale analysis of the results in the database. In summary: good idea, executed too late.

Direct Observation by Faculty
Back in 2001, when I got the job of being SACSCOC liaison, I got a copy of the brand-new Principles and started reading. The more I read, the more terrified I became. And nothing frightened me more than CS 3.5.1, the standard on general education. I didn't know at the time that the standard said one thing but everyone interpreted it in a completely different way (it was written as a minimum standard requirement, but everyone looked for continuous improvement). So I was one of those people you see at the annual meeting who look like they are on potent narcotics, drifting around dazed by the enormousness of the challenge. (Note: I think they should hand out mood rings at the annual meeting so you can see how stressed someone is before you talk to them.)

In an act of desperation, I led an effort to create what we now call the Faculty Assessment of Core Skills (FACS), which is nothing more than subjective faculty ratings of the liberal arts skills students demonstrate in their classes. The skills included writing effectiveness. At the end of the semester, each instructor was supposed to give each student taught a subjective rating for the observed skills on the list. You can read all about this in the Assessing the Elephant manuscript, on this blog, or in one of the three books I wrote chapters for on the subject.

Because we had started the FACS before the QEP, we had baseline data, plus data for every semester of the project's life. Thousands and thousands of data points about student writing ability. When we started the FACS I didn't have much hope for it--it was a "Hail Mary" pass at CS 3.5.1. But as it turns out, it was exactly what we needed. We were able to show that FACS scores improved faster for students who had used the writing lab than for students who hadn't. Moreover, this effect was sensitive to the overall ability of the student, as judged by high school grades. See "Assessing Writing" for the details.

I have given many talks about the FACS over the years and get interesting reactions. One pair of psychologists seemed amazed that anything so blatantly subjective could be useful for anything at all, but they were very nice about it. When I post FACS results on the ASSESS-L listserv, you can hear the crickets chirping afterwards. I guess it doesn't seem dignified because it doesn't have a reductionist pedigree.

So I was shocked at the NCICU meeting, when SACSCOC Vice President Steve Sheeley said things like (my notes, probably not his exact words) "Professors' opinions as professionals are more important than standardized tests," and "Professors know what students are good at and what they are not good at."

The reason for my reaction is that when one hears official statements about assessment, it's almost always emphasized that the assessment has to be suitably scientific. "Proven valid and reliable" is a standard formula, and "measurable" certainly figures in (see "Measurement Smesurement" for my opinion on that). However it is stated, there isn't much room for something as touchy-feely as the subjective opinions of course instructors. I do give good arguments for both validity and reliability in Assessing the Elephant, but FACS is never going to look like a psychometrician's version of assessment. So it was a shock and a very pleasant surprise to hear a note of common sense in the assessment symphony. I think when Steve made that remark, he assumed that the special knowledge professors acquire from working with students was simply inaccessible as assessment data. But it's not, and by now Coker has many thousands of data points over more than a decade to prove it. That data turned out to be the key to showing the QEP actually worked.

I have implemented the FACS at JCSU, and created a cool dashboard for it. I showed this off at the meeting, and you can download a sample of it here if you want. The real one is interactive so you can disaggregate the data down to the level you want to look at, even generating individual student reports for advisors. Setting up and running the FACS is trivial. It costs no money, takes no time, and you get rich data back that can be used for all kinds of things. Everyone should do this as a first, most basic, method of assessment.

Wednesday, January 25, 2012

Closed and Open Thinking

Most readers will know William of Occam's principle about not multiplying entities unnecessarily. It's commonly thought of as "the simplest explanation is the best explanation." I learned about a countervailing principle in Arora and Barak's Computational Complexity: A Modern Approach. It's even older than the venerable Mr. Occam, dating back to the Epicureans, and it states that we should not abandon any explanation that is consistent with the facts. I have mentioned this before, but I had an interesting thought at lunch today: what if this tension between efficiency and open-mindedness is at the heart of the Dunning-Kruger effect? In case you've missed that bit of news, here's the introduction from the Wikipedia entry:
The Dunning–Kruger effect is a cognitive bias in which unskilled people make poor decisions and reach erroneous conclusions, but their incompetence denies them the metacognitive ability to recognize their mistakes.[1] The unskilled therefore suffer from illusory superiority, rating their ability as above average, much higher than it actually is, while the highly skilled underrate their own abilities, suffering from illusory inferiority. 
Actual competence may weaken self-confidence, as competent individuals may falsely assume that others have an equivalent understanding. As Kruger and Dunning conclude, "the miscalibration of the incompetent stems from an error about the self, whereas the miscalibration of the highly competent stems from an error about others" (p. 1127).[2] The effect is about paradoxical defects in cognitive ability, both in oneself and as one compares oneself to others.
This just puts some research behind what Bertrand Russell is quoted as having said:
The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt.
So what we have is two epistemologies, and we shouldn't be hasty to choose one as better than the other, despite the obvious bias of the quotes above.

Method 1 (Closed). Obtain a small amount of evidence, and create the most restrictive explanation that fits the facts. Subsequent facts that come to light do not affect the conclusion.

William of Occam would probably sue me for defamation if he were around to read this. I have intentionally restated his principle in a very narrow sense in order to contrast it with:

Method 2 (Open). Continually gather information and create increasingly complex explanations that account for all the observations. Although the current explanation may be the simplest one that fits the facts, no explanation is ever final--all the others that are consistent with facts are kept in reserve.

I have given the methods intuitive names for convenience (closed vs. open), not as prejudgments. The closed method will be the better one in situations where observations can be explained simply. This may be because the underlying cause-and-effect relationship is of low complexity, or because the variance in observed characteristics is small. "All dogs have four legs" would be an example of the latter; "Stuff falls when you drop it" applies to the former.

The most basic structure of language is a verb applied to a noun, which is a model for the closed epistemology. "Birds fly," "Fire burns," and so on are summaries of real-world observations that can be arrived at accurately from just a few examples and without much error. It's an easy conjecture that these simple relationships became so integral to understanding that exceptions were met with challenge, such as: "If an ostrich doesn't fly, then it can't be a bird." This is what school children encounter when they learn that a whale isn't a fish. The language we use rather gracelessly allows these exceptions in the form of conjunctive appendices, but this is clearly a hack. I will suggest below that a formal language is required to overcome that difficulty (for example, formal logic, which defines a consistent way of using "or" and "and" and allows unlimited nesting of exceptions, so that any true/false relationship can be expressed unambiguously).
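To make the nesting point concrete, here is a toy sketch in Python (my own illustration, not anything from a logic text): the "birds fly, except flightless birds" rule written as a boolean expression, where the exception composes without ambiguity and further exceptions could be nested the same way.

# Toy example: "birds fly, except flightless species" as a nested boolean rule.
def flies(is_bird, is_flightless_species):
    return is_bird and not is_flightless_species

print(flies(is_bird=True, is_flightless_species=False))  # sparrow: True
print(flies(is_bird=True, is_flightless_species=True))   # ostrich: False, but still a bird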

Quickly assembling a set of closed rules for a new environment seems like a good idea. It's a fast best-guess approach to finding useful cause and effect relationships.

Of course, the closed method is not suitable for doing science. Kuhn's The Structure of Scientific Revolutions suggests that closed outlooks solidify at any level of complexity and require some bashing to break up. An example would be the certainty (due to Aristotle) that celestial bodies move in perfect circles. This is like Gould's idea of "punctuated equilibrium" in biological evolution. I graphed the associated relationship between predictability and complexity recently in "Randomness and Prediction."

The question is when to use the open versus the closed approach. Historically, I think the closed approach may have had a blanket "explanation" in the form of mystical associations of cause and effect, which provide a putative low-complexity relationship. "Joe got struck by lightning because he displeased the weather god" has the appearance of an explanation, except that it's not actually predictive. It takes a dedicated effort to discover that fact, however, and for that we need an open method.

The disadvantages of the open method make a long list. First, it's energy intensive--you have to continually make observations, compare what you see to what you think you should see (e.g., a three-legged cat), and update the ever-growing explanation. It also takes more energy to use or communicate the current explanation, and as soon as you do, it's out of date again.

These are not fatal flaws, but they are worth considering. For some phenomena, this is probably how we naturally reason, if in a limited way. For example, our memories and minds do something like Bayesian reasoning (updating the probability of an event based on how frequently we encounter it), although our on-board system has been shown to be deeply flawed (see Daniel Kahneman's recent book for this and a lot more).
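As a minimal sketch of what that parenthetical means (my example, not Kahneman's), here is frequency-based Bayesian updating in Python: a Beta distribution over "how often do cats have four legs?" that shifts a little with each observation.

# Bayesian updating from observed frequencies (Beta-Bernoulli model).
a, b = 1.0, 1.0  # uniform prior over P(a cat has four legs); hypothetical starting point

observations = [1, 1, 1, 1, 0, 1, 1, 1]  # 1 = four legs, 0 = a three-legged cat

for x in observations:
    if x:
        a += 1
    else:
        b += 1
    print(f"after {a + b - 2:.0f} cats, estimated P(four legs) = {a / (a + b):.2f}")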

Perhaps the open process needs a kind of empirical 'clean-up' to be really useful. Elegant explanations generally only work with clean data. That is, if you want to discover Newtonian mechanics, it's unlikely that you can do this with just your eyes and ears. When Galileo began measuring the "drop" times on an inclined plane, he was onto something.

In addition to a solid empirical methodology, an open method also needs a way to reduce the size of an explanation while retaining its predictive power. In my graphs in "Randomness and Prediction," I plotted predictability versus complexity, not size. It works like this.

Suppose I have an observed relationship that I have cataloged like this: (1,2), (2,4), (3,8), (4,16), where each pair might be thought of as a cause and its effect: a one 'causes' a two, and so on. Because my empirical methods are sound, I trust that there's not too much error in the observed values. As the list grows under the open method, I have a better and better 'explanation' of past events and a better and better predictor of future ones (fine print about the inductive hypothesis goes here...). But as the observations accumulate, the list becomes too unwieldy to remember, communicate, or use effectively. What I need is a kind of data compression to reduce the list to a manageable size. If I do this correctly, the explanation doesn't change, nor does the complexity, but the size does. I can reduce the list to effect = 2^cause if I have the idea of an exponential function. We might call this data reduction the creation of a formal theory.
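Here is the same toy example as a minimal sketch in Python: the raw catalog and the compressed rule carry the same information, but the rule stays the same size no matter how many observations pile up, and it also ventures predictions.

# The open method's catalog: observed (cause, effect) pairs.
catalog = {1: 2, 2: 4, 3: 8, 4: 16}

# The "compressed" explanation: a rule that regenerates the catalog.
def effect(cause):
    return 2 ** cause

# The compression is lossless: the rule reproduces every observation...
assert all(effect(c) == e for c, e in catalog.items())

# ...and, unlike the catalog, it predicts unseen causes too.
print(effect(5))  # 32 (a prediction, with the usual inductive fine print)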

Conclusions
I started by wondering whether the phenomenon of people who don't know things, and further don't know that they don't know them, could be attributed to one of the two epistemologies mentioned at the beginning. I think the argument above shows it's possible that the two barriers to using an open method effectively--empiricism and abstract thinking--are too formidable for a lot of people. For one thing, it's not hard to get by using closed systems, and it may require formal education in scientific method and meta-cognition to use open systems effectively.

One final note, appropriate to the calendar in the US: it's a lot easier to communicate closed explanations than open ones. Even with data compression, "things fall" is less complex than Newton's laws. So in a debate made of sound bites from political candidates, the closed epistemology wins. It's easier, it's comfortable to the listener--the whole construct of English is built to 'hack' a closed way of thinking by adding a few contingencies ("Cats have four legs, but I once saw one with three.")--and the explanations take less time to say. You have to expand "Drill!" into "Drill, baby, drill!" to make it bigger, because the basic message can be summed up in one word, and that may seem too short to some audiences to count as a serious thought.

This is just another reason why we should be deliberate about teaching science and meta-cognition in school, not as alien ways of thinking that only people in white coats use at work, but as the mode of thinking that differentiates us from the other mammals, and might allow us someday to collectively make good decisions.

Sunday, January 22, 2012

Assorted Links

You can file www.sightmap.com under "novel data representation." It's a heat map overlay of Google Maps that shows the most popular spots for taking photos, using the upload site Panoramio as the source.

This could be good fodder for a student research project. The only disappointment for me was not being able to zoom all the way down to street level resolution.

There's a new journal for those interested in the intersection of empiricism and computer science, in the spirit of Wolfram's A New Kind of Science. EPJ.org's new "Data Science" title seeks to address these challenges:

  • how to extract meaningful data from systems with ever increasing complexity
  • how to analyse them in a way that allows new insights
  • how to generate data that is needed but not yet available
  • how to find new empirical laws, or more fundamental theories, concerning how any natural or artificial (complex) systems work

Now I have one less excuse for not organizing my research notes into actual articles. While we're at it, here's a list of "Best Paper" awards in computer science.

Game theory is a fascinating and powerful set of ideas. Ever notice at the baggage carousel in the airport how everyone crowds up as close as they can, which means no one can see anything? If everyone took three steps back, the whole group would benefit. Paradoxes like these are the subject matter of this branch of mathematics and economics. There's a site that maps out the field in an easily accessible format. It's even easy to remember: GameTheory101.com.

While browsing for study tips for my daughter Epsilon, I found Study Hacks, with this bit of non-cognitive wisdom, originally quoted from a Reddit discussion thread:

The people who fail to graduate from MIT, fail because they come in, encounter problems that are harder than anything they’ve had to do before, and not knowing how to look for help or how to go about wrestling those problems, burn out.
The students who are successful, by contrast, look at that challenge, wrestle with feelings of inadequacy and stupidity, and then begin to take steps hiking that mountain, knowing that bruised pride is a small price to pay for getting to see the view from the top. They ask for help, they acknowledge their inadequacies. They don’t blame their lack of intelligence, they blame their lack of motivation.

Check out this guy's portfolio as a case study.

From the University of Portland comes a fascinating case study, "Why the Vasa Sunk: 10 Lessons Learned." From the introduction:

Around 4:00 PM on August 10th, 1628 the warship Vasa set sail in Stockholm harbor on its maiden voyage as the newest ship in the Royal Swedish Navy. After sailing about 1300 meters, a light gust of wind caused the Vasa to heel over on its side. Water poured in through the gun portals and the ship sank with a loss of 53 lives.

The rest is a case study in how not to manage a complex project. As Ashleigh Brilliant wrote, "It could be that the purpose of your life is only to serve as a warning to others."

Finally, a more positive spin on leadership from The Atlantic: "Humble Leaders are More Liked and More Effective." Take it with a grain of salt (it's a small study), but be proud of your humility.

Thursday, January 19, 2012

An Index for Test Accuracy

This post is an overdue follow-up to "Randomness and Prediction," which takes up the question of how we should judge the quality of a test. There are many kinds of tests, but for the moment I'm only interested in ones that are supposed to predict future performance. Since education is in the preparation business, the measure of success should be "did we prepare the student?" If that question can be answered satisfactorily with a yes or no, this feedback can be used to determine the accuracy of tests that are supposed to predict this outcome.

As an example, I used the College Board's SAT benchmarks (pdf), in which a test taken during the high school years is used to predict first-year college grades. The benchmark study is interesting because it is one of the few examples of test-makers actually checking the accuracy of their instruments and reporting that information publicly. You can find my first thoughts on this in "SAT Error Rates." The source material mainly consists of Table 1a on page 3 of the College Board report:



We can use this to see the power of the SAT to predict first-year college grades at any cut-off score on the table. If we pick 1200, for example, we can see that 73% of the students we admit will have a first-year grade average of 2.7 or above. In other words, a 73% true positive rate and a 27% false positive rate. Because we are helpfully given the number of samples in each bin (the N column), we can also calculate the false positive and true negative rates for the test. Just multiply N by the percentage of students with FGPA > 2.7 to find the number of students in that bin who were successful in their first year (by that definition), and subtract that from N to get the number who were not. The graph below shows this visually.
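To show the bookkeeping concretely, here is a Python sketch. The table itself isn't reproduced in this post, so the rows below are placeholder numbers made up for illustration (only the 73% at 1200 loosely echoes the figure quoted above); nothing in the sketch should be read as the College Board's actual data.

# Bookkeeping described above, applied to rows shaped like Table 1a.
# PLACEHOLDER numbers for illustration only -- not the College Board's figures.
rows = [
    # (SAT score bin, N in bin, fraction of bin with FGPA > 2.7)
    (1000, 500, 0.45),
    (1100, 450, 0.58),
    (1200, 400, 0.73),
    (1300, 300, 0.82),
]

successes = {sat: round(n * frac) for sat, n, frac in rows}  # FGPA > 2.7 in each bin
failures = {sat: n - successes[sat] for sat, n, _ in rows}   # the rest of the bin

cutoff = 1200
admitted = [sat for sat, _, _ in rows if sat >= cutoff]
tp = sum(successes[sat] for sat in admitted)  # admitted and successful
fp = sum(failures[sat] for sat in admitted)   # admitted but not successful
print("success rate among those above the cut-off:", round(tp / (tp + fp), 2))

# Base rate of success over the whole (placeholder) table; the real table gives 59%.
total = sum(n for _, n, _ in rows)
print("overall success rate:", round(sum(successes.values()) / total, 2))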

The two graphs look roughly like normal distributions with means about 150 SAT points apart. This is all quite interesting, but for my purposes here I just want to pull one number from it: the total percentage of students with FGPA > 2.7, which we can get by summing up all the heights on the blue line and dividing by the total number of samples. This turns out to be 59%.

The College Board's benchmark has 65% accuracy. In other words:
• If a student's SAT score exceeds the benchmark, there is a 65% chance they will have FGPA > 2.7
• Of all students, 59% will have FGPA > 2.7
The difference between these numbers is not large: .65 - .59 = 6%. Using the benchmark to select "winners," we can do six percent better than just randomly sampling. If all we care about is the percentage of "good" students we get, that's the end of the story. But there's another dimension: the rate of unfair rejections, or false negatives.

If we randomly sample whom we accept, then 59% of those we reject would have had FGPA > 2.7 (assuming this is the rate for the whole population). Since it's unfair to reject qualified candidates, we might call 1 - .59 = 41% the fairness of the method of selection. Another name for fairness is the true negative rate. I plotted it against the accuracy (true positive rate) in the previous article. Here it is again.



The blue line is accuracy, and the red line is fairness. They meet at 65%. So we can see that although using the SAT benchmark is only six percent more accurate than random sampling, it is .65 - .41 = 24% more fair. How do we make sense of how good this is?

One overall measure of test predictive power is the average rate of correct predictions, taking into account both true positives and true negatives. We might call that the "correctness rate" of the cut-off benchmark. Where the lines cross in the graph above, the rates for true positives and true negatives are both 65%, so the correctness rate is also 65%. In general, the formula for the correctness rate c at a given cut-off benchmark is:
c = (number of actual positives that meet the benchmark + number of actual negatives that do not meet the benchmark) / (total number of all observations)
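In code, the correctness rate is just a count of correct calls on both sides of the cut-off, divided by everything. A minimal sketch, with hypothetical counts standing in for real data:

# Correctness rate c for a cut-off, from the four classification counts.
def correctness_rate(true_pos, false_pos, true_neg, false_neg):
    """Fraction of all cases the benchmark classifies correctly."""
    total = true_pos + false_pos + true_neg + false_neg
    return (true_pos + true_neg) / total

# Hypothetical counts chosen so both rates sit near the 65% crossing point.
print(correctness_rate(true_pos=650, false_pos=350, true_neg=650, false_neg=350))  # 0.65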
Below is a graph that adds the correctness rate to the accuracy and fairness plots.


The correctness rate potentially solves the problem of considering accuracy and fairness separately. It does not, however, give us an absolute measure for comparing the quality of tests. This is because the fraction of actual positives in the population can vary, making detection easier or more difficult. If we are interested in comparing different tests over different kinds of detection environments, we need something different. In the next section we will derive an index that tries to address this problem.

A Comparative Index

In general, there is not a good way to turn results about predictability into results about complexity (see "Randomness and Prediction"). However, using ideas from computational complexity, I stumbled upon a transformation that gives us another way to think about the predictive power of a test.

In order to proceed, imagine an even better version of the test. In this fantasy, a proportion p of the test benchmark results come back marked with an asterisk. Imagine that this notation means the result is known to be true. The unmarked ones have no guarantee--some will be correct and some not. In this way we imagine separating out the good and useful work of the test into the p group, whereas the rest is just random guessing.

It's just like a multiple choice test. Some answers you know you know, and others you guess at. By working backwards we can find that "known true" fraction:
Correctness rate = (fraction known correct) + (fraction not known correct) * (rate of correct responses with random sampling)
Using the numbers from the SAT benchmark in the previous section gives us:
.65 = p + (1 - p) * .59
p = (.65 - .59)/(1 - .59)
The fraction that would have to be "known true" is p = 14.6%. The advantage of this transformation is that we have a single number that is easy to visualize and that takes the context into account. If you wanted to explain it to someone, it would go like this:
The SAT benchmark prediction is like having a perfect understanding of 14.6% of test-takers and guessing at the rest.
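The back-calculation is one line of algebra. Here it is as a sketch in Python, using the two numbers from the text (65% correctness, 59% base rate of success):

# Solve  c = p + (1 - p) * r  for p, the "known true" fraction.
c = 0.65  # correctness rate of the SAT benchmark prediction (from the text)
r = 0.59  # rate of correct calls from blanket guessing "success"

p = (c - r) / (1 - r)
print(f"p = {p:.3f}")  # about 0.146, i.e. 14.6%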
The graphs below show the linear relationship between average test accuracy, the larger of the percent of positives or negatives in the population (the "guess rate"), and the index p--the equivalent proportion of "perfect understanding" outcomes.

    The "guess rate" is just the bigger of the fraction of negatives or positives in the population. If there are more positives, then without more information, you would guess than any randomly chosen outcome would be positive. If there are more negatives, the best guess (without any other information) is that the outcome would be negative. In formulas, we will call this guess rate "r." For the SAT example, the real positive rate is 59%, so r = .59. If the real positive rate had been 45%, we'd use r = 1 - .45 = 55%.

As an example to illustrate the graph above, if actual positives and negatives are evenly split at r = 50%, then a test that can predict with 80% correctness has an equivalent "perfect understanding" index of 60%. But if the proportion of positives is r = 70% instead of 50%, the index drops to 33%. It's reasonable to say that even though the correctness rate is the same, the first test is almost twice as good as the second one.

Note that if the guess rate equals the test accuracy, the test explains exactly nothing, which is as it should be.

Here's a general formula for computing the index p, the proportion of "perfect understanding" test results. The other two variables are c = the test's average correct classification rate, and r = the larger of the proportions of negative or positive actual outcomes. In the SAT example, 59% were successful according to the FGPA criterion, so r = .59. If it had been 45% successful, then we'd use r = 1 - .45 = .55. Given these inputs, we have a simple formula for the index p:
p = (c - r)/(1 - r)
On the last graph, p is the height of the line, c is the bottom axis, and four values of r (guess rate) are given, one for each curve as noted in the legend.
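Wrapped up as a small Python function and checked against the numbers already given in the post (the SAT benchmark, plus the 80%-correctness scenarios with 50/50 and 70/30 splits):

def perfect_understanding_index(c, r):
    """p = (c - r)/(1 - r): the fraction of cases a test would need to 'know
    for certain' to match its correctness rate c, given guess rate r."""
    return (c - r) / (1 - r)

print(perfect_understanding_index(c=0.65, r=0.59))  # SAT benchmark: about 0.146
print(perfect_understanding_index(c=0.80, r=0.50))  # 0.60
print(perfect_understanding_index(c=0.80, r=0.70))  # about 0.33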

(Note: edited 1/20/2012 for clarity)