Monday, November 26, 2012

Generating Curricular Nets

I recently developed some code to take student enrollment information and convert it into a visual map of the curriculum, showing how enrollments flow from one course to another. For example, you'd expect a lot of BIO 101 students to take BIO 102 within the next two semesters. In order to 'x-ray' course offerings, I have to set thresholds for displaying links: for example, a link might have to carry at least 30% of a course's enrollment in order to show up. There are many ways to add meta-data in the form of text and color, for example using the thickness of the graph edges (the connecting lines) to signify the magnitude of the flow. This is a directed graph, so it has arrows you can't see at the resolution I've provided. Other data includes the course name, enrollment statistics, and the college represented. The same technique can be used to isolate one part of the curriculum at a time to get more fine-grained graphs.
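For the curious, here's a minimal sketch of the flow-and-threshold step in Python using the networkx library, not my original code. The record layout, the names, and the 30% cutoff are illustrative assumptions.

```python
# Minimal sketch (not the original script): build a directed curricular net from
# enrollment records and keep only the links that carry enough of the flow.
# Assumes records shaped like (student_id, course, term_index) -- hypothetical layout.
from collections import defaultdict
import networkx as nx

def curricular_net(records, threshold=0.30, max_gap=2):
    by_student = defaultdict(list)
    for student, course, term in records:
        by_student[student].append((term, course))

    enrollment = defaultdict(int)   # total enrollment per course
    flow = defaultdict(int)         # students flowing from course_a to course_b

    for taken in by_student.values():
        for term_a, course_a in taken:
            enrollment[course_a] += 1
            for term_b, course_b in taken:
                if 0 < term_b - term_a <= max_gap:   # e.g., within the next two semesters
                    flow[(course_a, course_b)] += 1

    g = nx.DiGraph()
    for (a, b), n in flow.items():
        share = n / enrollment[a]
        if share >= threshold:             # the 'x-ray' threshold for displaying a link
            g.add_edge(a, b, weight=share) # weight can drive the edge thickness
    return g
```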

The graph below shows a whole institution's curriculum. The sciences, which are highly structured, clump together in the middle. Less strongly linked structures are visible as constellations around the center. I particularly like the dog shape at lower left. This sort of thing can be used to see where the log-jams are, and to compare what advisors think is happening to what actually is.


Friday, November 09, 2012

Application Trajectories

For any tuition-driven college, the run-up to the arrival of the fall incoming class can be an exciting time. There are ways to lessen that excitement, and one of the simplest is to track the S-curves associated with key enrollment indices. An example is shown below.


In this (made-up) example, historical accepted applications by application date are shown as they accumulated from some initial week. It's a good exercise to find two or three years of data to see how stable this curve is:

1. Get a table of all accepted applicants, showing the date of their acceptance.
2. Use Excel's weeknum(x) function, or something similar, to convert the dates into weeks, and normalize so that 1 = first week, etc. This first week doesn't have to correspond to the actual recruiting season. You just need a fixed point of comparison.
3. Accumulate these as a growing sum to create the S-curve by week.
4. Plot multiple years side-by-side.
5. If this is successful, the curves will be pretty close to multiples of one another. That is, they will have the same shape but perhaps different amplitudes. You can normalize them by dividing each by its total, so that each curve ends at 1 at the right of the S. This is your distribution curve. You may want to average the last two years' curves. (A sketch of this calculation appears after the list.)
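Here is a rough sketch of steps 1-5 in Python with pandas instead of Excel. The column name 'accept_date' and the week normalization are assumptions, so adjust them to your own data.

```python
# Rough sketch of steps 1-5 using pandas instead of Excel.
# Assumes a DataFrame with one row per accepted applicant and an 'accept_date' column.
import pandas as pd

def distribution_curve(accepts: pd.DataFrame) -> pd.Series:
    weeks = pd.to_datetime(accepts["accept_date"]).dt.isocalendar().week
    weeks = weeks - weeks.min() + 1             # normalize so the first week observed = 1
    counts = weeks.value_counts().sort_index()  # accepted applicants per week
    s_curve = counts.cumsum()                   # accumulate to get the S-curve
    return s_curve / s_curve.iloc[-1]           # divide by the total so it ends at 1

# Step 4: compute one curve per year and plot them side by side, e.g.
# curves = {year: distribution_curve(df) for year, df in accepts_by_year.items()}
```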

Once you have a historical distribution curve, you can multiply it by your admit goal to get the trajectory you hope to see during the current cycle. The graph above illustrates the case where the current numbers are on track to meet the goal. If the current numbers drift off the curve, you'll have lots of warning, and can plan for it.
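Continuing the assumptions above, a tiny sketch of that last step: multiply the distribution curve by the goal to get the target trajectory, and flag weeks where the actuals fall noticeably below it. The 10% tolerance is just a placeholder.

```python
# Sketch: target trajectory and a simple drift check (tolerance is a placeholder).
def target_trajectory(distribution, admit_goal):
    return distribution * admit_goal            # expected cumulative admits by week

def weeks_off_track(actual_cumulative, target, tolerance=0.10):
    """Weeks where actual cumulative admits fall more than `tolerance` below target."""
    aligned = target.loc[actual_cumulative.index]
    shortfall = (aligned - actual_cumulative) / aligned
    return list(shortfall[shortfall > tolerance].index)
```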

Note that this probably works with deposits and other indicators too. I've only ever used it with accepted applicants.



Sunday, November 04, 2012

Nominal Realities


‎"A poet's understanding of reality comes to him together with his verse, which always contains some element of anticipation of the future." -- N. Mandelstam

Last week I presented a paper, "Nominal Realities and the Subversion of Intelligence," at the Southern Comparative Literature Association's annual meeting in Las Vegas. It was a strange turn of events that led me (a mathematician by training) to such a gathering, but it was well worth it. The ideas were influenced by my work in complex systems and assessment as well (not unrelated).

Friday, November 02, 2012

Inter-rater Reliability

At the Assessment Institute this week, I saw presentations on three different ways to examine inter-rater reliability: a classical agreement coefficient (Krippendorff's alpha), a paired-difference t-test to detect bias over time, and a Rasch model. I'm hoping to talk the presenters into summarizing these approaches and their uses in future blog posts. In the meantime, I will describe below a fourth approach I developed in order to look at inter-rater reliability of (usually) rubric-based assessments.

I like non-parametric approaches because you can investigate the structure of a data set before you assume things about it. As a bonus you get pretty pictures to look at.

Even without inter-rater data, the frequency chart of how ratings are assigned can tell us interesting things. Ideally, all the ratings are used about equally often. (This is ideal for the descriptive power of the rubric, not necessarily for your learning outcomes goals!) This is for the same reasons that we want to create tests that give us a wide range of results instead of a pronounced ceiling effect, for example. 

If we do have inter-rater data, then we can calculate the frequency of exact matches or near-misses and compare those rates to what we would expect to see if we just sampled ratings randomly from the distribution (or alternatively from a uniform distribution). When shown visually, this can tell us a lot about how the rubric is performing.
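As a sketch of that comparison (my report script isn't shown here), the agreement rate expected under random assignment is just the sum of the squared rating frequencies; the paired-ratings input format below is an assumption.

```python
# Sketch: observed exact-match rate for double-rated work versus the rate expected
# if both raters drew independently from the observed frequency distribution.
from collections import Counter

def match_rates(pairs):
    """pairs: list of (rating_a, rating_b) tuples, one per doubly-rated subject."""
    observed = sum(a == b for a, b in pairs) / len(pairs)

    ratings = [r for pair in pairs for r in pair]
    total = len(ratings)
    freq = Counter(ratings)
    # chance that two independent draws agree = sum of squared frequencies
    expected = sum((n / total) ** 2 for n in freq.values())
    return observed, expected
```

In the first example below, this kind of comparison works out to roughly 52% observed agreement versus about 35% expected.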

I'll give some examples, but first a legend to help you decipher the graphs.


The graph on the left shows how the raters distributed scores. Although the scale goes from one to four, one was hardly used at all, which is not great: it means we effectively have a three point scale instead of a four point scale. So one way to improve the rubric is to try to get some more discrimination between lower and middle values. For the three ratings that are used, the match frequencies (yellow line on the left graph) are higher than what we'd expect from random assignment (orange dots). The difference is shown at the top of each graph bar for convenience. Altogether, the raters agreed about 52% of the time versus about 35% from random assignment. That's good--it means the ratings probably aren't just random numbers.

The graph on the right shows the distribution of differences between pairs of raters so that you can see how frequent 'near misses' are. Perfect inter-rater reliability would lead to a single spike at zero. Differences of one are left and right of center, and so on. The green bars are actual frequencies, and the orange line is the distribution we would see if the choices were independently random, drawn from the frequencies in the left graph. So you can also see in this graph that exact matches (at zero on the graph) are much more frequent than one would expect from, say, rolling dice.
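The right-hand graph can be computed the same way. Here is a sketch of both the observed distribution of differences and the baseline you'd see if the two ratings were independent draws from the same marginal frequencies, again assuming paired ratings as input.

```python
# Sketch: observed distribution of rater differences versus the distribution
# expected if the two ratings were independent draws from the same frequencies.
from collections import Counter

def difference_distributions(pairs):
    n = len(pairs)
    observed = {d: c / n for d, c in Counter(a - b for a, b in pairs).items()}

    ratings = [r for pair in pairs for r in pair]
    total = len(ratings)
    p = {r: c / total for r, c in Counter(ratings).items()}
    expected = {}
    for a in p:
        for b in p:                               # P(diff = a - b) = p[a] * p[b]
            expected[a - b] = expected.get(a - b, 0.0) + p[a] * p[b]
    return observed, expected
```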

Another example

Here we have a reasonably good distribution of scores, although they clump in the middle. The real problem is that raters aren't agreeing in the middle: the match frequencies there are actually slightly below random. The ends of the scale are in good shape, probably because it's easier to agree about extremes than to distinguish cases in the middle. When I interviewed the program director, he had already realized that the rubric was a problem, and the program was fixing it.

The same report also shows the distribution of scores for each student and each rater. The latter is helpful in finding raters who are biased relative to others. The graphs below show Rater 9's rating differences on three different SLOs. They seem weighted toward -1, meaning that he/she rates a point lower than other raters much of the time.



When I get a chance, I will turn the script that creates this report into a web interface so you can try it yourself.
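In the meantime, here is a rough sketch of the rater-bias check, under the assumption that the data come as (subject, rater, score) rows; that layout and the function name are mine, not the report's.

```python
# Sketch: average each rater's difference from co-raters on the same subjects.
# Negative values suggest a rater who scores low relative to others (like Rater 9).
from collections import defaultdict

def rater_bias(rows):
    """rows: iterable of (subject, rater, score) -- hypothetical layout."""
    by_subject = defaultdict(list)
    for subject, rater, score in rows:
        by_subject[subject].append((rater, score))

    diffs = defaultdict(list)
    for scored in by_subject.values():
        for rater, score in scored:
            others = [s for r, s in scored if r != rater]
            if others:
                diffs[rater].append(score - sum(others) / len(others))

    return {rater: sum(d) / len(d) for rater, d in diffs.items()}
```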

How Much Redundancy is Needed?

We can spend a lot of time doing multiple ratings--how much is enough? For the kind of analysis done above, it's easy to calculate the gain in adding redundancy. With n subjects who are each rated k times by different raters, there are C(k,2) pairs per subject to check for matches.

C(2,2) = 1 pair per subject

C(3,2) = 3 pairs per subject

C(4,2) = 6 pairs per subject

C(5,2) = 10 pairs per subject

So using one redundancy (two raters per subject) is okay, but you get disproportionately more power from adding one more rating (three times as many pairs to check for matches). Rule of thumb: three is great, four is super, and more than that is probably wasting resources.
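These counts come straight from the binomial coefficient, so checking the marginal gain of another rater takes only a couple of lines:

```python
# Pairs available per subject for k raters, and the marginal gain from each added rater.
from math import comb

for k in range(2, 7):
    print(f"{k} raters: {comb(k, 2)} pairs per subject (+{k - 1} from the last rater added)")
```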

Connection to Elections

Since it's November of an election year, I'll add an observation I made while writing the code. In thinking about how voters make up their minds, we might hypothesize that they are influenced by others who express their views. This could happen in many ways, but let me consider just two.

A voter may tend to take an opinion more seriously based on the frequency of its occurrence in general.

But perhaps raw frequencies are not as important as agreements. That is, hearing two people agree about an opinion is perhaps more convincing than the raw frequency with which the opinion occurs, and that distribution is different--it's the same one used to calculate the 'random' inter-rater dots and graphs above. To calculate it, we square the frequencies and renormalize: small frequencies become smaller and large ones larger. The graphs below illustrate an example with four opinions.


Opinion four is the largest, but still a sub-majority of all opinions. However, if you got everyone around the campfire and counted up the agreements between them, opinion four becomes much more visible--the number of these 'handshakes' is now greater than 50% of the total. If this influences other opinions, it would seem to lead to a feedback mechanism that accelerates adoption of opinion four.
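Here is a sketch of the 'handshake' re-weighting with made-up shares for the four opinions (the actual numbers behind the figure aren't reproduced here): squaring and renormalizing lifts the largest opinion from 40% of the voices to more than half of the agreements.

```python
# Sketch of the 'handshake' distribution: square the frequencies and renormalize.
def agreement_distribution(freqs):
    squared = [f * f for f in freqs]
    total = sum(squared)
    return [s / total for s in squared]

shares = [0.10, 0.20, 0.30, 0.40]         # hypothetical opinion frequencies
print(agreement_distribution(shares))      # the 0.40 opinion rises to about 0.53
```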

I looked for research on this, and found some papers on how juries function, but nothing directly related to this subject (although the results on how juries work are frightening). It would be interesting to know the relative importance of frequency versus 'handshake' agreement.