The internet has revolutionized intelligence. I've seen articles about how it's making us dumber, and I don't know if that's true, but it's certainly made me spoiled. In the old days, if I had a computer problem, I would just use brute trial and error, often giving up before finding a solution. Now, I just assume that someone else has already had the same problem and kindly posted the solution on a message board somewhere. So a few Google searches almost always solve the problem. Not this time.
This problem is a bothersome thing that comes up occasionally, but not often enough that I've taken action on it. It happens when I have a large data set to analyze and I want to see what's related to what. It's easy enough in SPSS to generate a correlation table with everything I want to know, but it's too much information. If there are 100 items on a survey, the correlation matrix is 100x100 = 10,000 cells. Half of them are repeats, but that's still a lot to look at. So I wanted a way to filter out all the results except the ones beyond a certain cut-off.
I poked around at scripting sites for SPSS, but couldn't find what I was looking for. The idea of writing code in a Basic-like language gives me hives too (don't get me wrong--I grew up on AppleSoft Basic, but somehow using it for this sort of thing just seems wrong).
So without further ado, here's the solution I found. I'm sure someone has a more elegant one, but this has the virtue of being simple.
How-to: Finding Significant Correlates
The task: take a set of numerical data (possibly with missing values) with column labels in a comma-separated file, and produce a list of which variables are correlated with which others, given a cut-off for the correlation coefficients. Usually we want the ones larger in magnitude than a certain value.
Note that some names are definable. I was using CIRP data, so I called my data set that. I'll put the names you can define in bold. Everything else is verbatim. Lines starting with a hash (#) are comments, which you don't need to enter--they're just there to explain what's going on.
Step One
Download R, the free stats package, if you don't have it already. Launch it to get the command prompt and run these commands (cribbed mostly from this site).
# choose a file for input data and name it something
cirp.data=read.csv(file.choose())
# attach the data frame so its columns can be referred to by name
attach(cirp.data)
# create a correlation matrix, using pairwise complete observation. other options can be found here
cirp.mat = cor(cirp.data, use = "pairwise.complete.obs")
# output this potentially huge table to a text file. Note that here you use forward slashes even in Windows
write.table(cirp.mat,"c:/cirpcor.txt")
Step Two
Download ActiveState Perl if you don't have it (that's for Windows). Run the following script to filter the table. You can change the file names and the threshold value as you like. [Edit: I had to replace the code below with an image because it wasn't rendering right. You can download the script here.]
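If you'd rather not deal with Perl at all, something like the following should do the same filtering entirely inside R. This is a rough sketch, not the original script: the 0.4 cut-off and the output filename are placeholders to change as you like.
# pick a cut-off and find the variable pairs whose correlation exceeds it in magnitude
threshold = 0.4
idx = which(abs(cirp.mat) > threshold & upper.tri(cirp.mat), arr.ind = TRUE)
# write one line per pair, in the same format as the output shown in Step Three
out = sprintf("%s <-> %s (%s)", rownames(cirp.mat)[idx[, 1]], colnames(cirp.mat)[idx[, 2]], cirp.mat[idx])
writeLines(out, "c:/cirpfiltered.txt")
Because upper.tri() keeps only the cells above the diagonal, each pair is listed once and the self-correlations on the diagonal are dropped.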
Step Three
Go find the output file you just created. It will look like this:
YRSTUDY2 <-> YRSTUDY1 (0.579148687526634)
YRSTUDY3 <-> SATV (0.434618737520563)
YRSTUDY3 <-> SATW (0.491389963307668)
DISAB2 <-> ACTCOMP (-0.513776993632538)
DISAB4 <-> SATV (0.540769639192817)
DISAB4 <-> SATM (0.468981872216475)
DISAB4 <-> DISAB1 (0.493333333333333)
The variable names are linked by the <-> symbol to show a correlation, and the correlation coefficient is shown in parentheses. If you want p-values, you'll have to compute those separately.
Step Four (optional)
Find a nice way to display the results. I am preparing for a board report, and used Prezi to create graphs of connections showing self-reported behaviors, attitudes, and beliefs of a Freshman class. Here's a bit of it. A way to improve this display would be to incorporate the frequency of responses as well as the connections between items, perhaps using font size or color. [Update: see my following post on this topic.]
Wednesday, September 28, 2011
SAT Error Rates
In "The Economics of Imperfect Tests" I explored the consequences of errors when making decisions with a test. By coincidence, the College Board came out with something very similar a few days later: their revised College and Career Readiness Benchmark [1]. This article and a related study [2] from 2007 give statistics that can illuminate the ideas in my prior post.
When testing for a given criterion, it's essential to be able to check how well the test is working. This is a nice thing about the SAT: since it predicts success in college, we can actually see how well it works. This new benchmark isn't intended for predicting individual student performance, but groups of them. It looks like a bid for the SAT to become a standardized assessment of how well states, school districts, and the like, are preparing students for college. One caveat is mentioned in [2] on page 24:
One limitation of the proposed SAT benchmark is that students intending to attend college are more likely to take the SAT and generally have stronger academic credentials than those not taking the exam. This effect is likely to be magnified in states where a low percentage of the student population take the exam, since SAT takers in those states are likely to be high achievers and are less representative of the total student population.
The solution there would be to mandate that everyone has to take the test.
As with the test for good/counterfeit coins in my prior post, the benchmark is based on a binary decision:
Logistic regression was used to set the SAT benchmarks, using as a criterion a 65 percent probability of obtaining an FYGPA of a B- or higher [...]
The idea is to find an SAT score that gives us statistical assurance that students above this threshold have a 65% probability of having a college GPA of 2.67 or better their first year of college. There are some complexities in the analysis, including the odd fact that this 65% figure includes students who do not enroll. Of the students who do enroll, table 4 on page 15 of [1] shows that of those who met the benchmark, 79% of them were 'good' students having FYGPA of B- or better (i.e. 2.67 or more). For the purposes of rating the quality of large groups of students, I suppose including non-enrolled students makes sense, but I will look at the benchmark from the perspective of trying to understand incoming student abilities or engineer the admissions stream, which means only being concerned with enrolled students.
Using the numbers in the two reports, I tried to find all the conditional probabilities needed to populate the tree diagram I used in the prior post to illustrate test quality. For example, I needed to calculate the proportion of "B- or better" students. I did this three ways, using data from tables in the article, and got 62% to within a percentage point all three times. The article [1] notes that this would be less than 50% if we include those who don't enroll, but that must be an estimate because a student obviously doesn't have a college GPA if they don't enroll.
Here are the results of my interpretation of the data given. It's easiest to derive the conditional probabilities this direction:
In the tree diagram above, 44% of students pass the benchmark, which I calculated from table 5 on page 16 of [1]. The conditional probabilities on the branches of the tree come from table 4 on the previous page. Note that there's a bit of rounding in both these displays.
Using Bayes Rule, it's easy to transform the tree to the form I used in the first post.
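To spell out one branch of that transformation: Bayes' Rule gives Pr[pass | B- or better] = Pr[B- or better | pass] × Pr[pass] / Pr[B- or better] = 0.79 × 0.44 / 0.62, or about 56%. In other words, only a bit more than half of the "B- or better" students clear the benchmark, which is where the 44% of good students rejected (mentioned below) comes from.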
The fraction of 'good' students comes out to 62%, which agrees closely with a calculation from the mean and standard deviation of sampled FYGPA on page 10 of [1], assuming a normal distribution (the tail of "C or worse" grades ends .315 standard deviations left of the mean). It also agrees with the data on page 16 of [1], recalling that high school GPAs are about half a point higher than college GPAs in the aggregate.
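As a quick check of that normal-distribution figure in R:
# area to the right of a point 0.315 standard deviations below the mean
pnorm(-0.315, lower.tail = FALSE)
This returns about 0.62, matching the proportion above.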
Assuming my reading of the data is right, the benchmark is classifying 35% + 28% = 63% of students correctly, doing a much better job with "C or worse" students than with "B- or better" students. Notice that if we don't use any test at all, and assume that everyone is a "B- or better" student, we'll be right 62% of the time, having a perfect record with the good students and zero accuracy with the others. Accepting only students who exceed the benchmark nets us 79% good students, a 17-point increase in performance due to the test, but it means rejecting a lot of good students unnecessarily (44% of them).
In the prior post I used the analogy of finding counterfeit coins using an imperfect test. If we use the numbers above, it would be a case where there is a lot of bad currency floating around (38% of it), and on average our transactions, using the test, would leave us with 79 cents on the dollar instead of 62. We have to subtract the cost of the test from this bonus, however. It's probably still worth using, but no one would call it an excellent test of counterfeit coins. Nearly half of all good coins are kicked back because they fail the test, which is pretty inefficient, and half the coins that are rejected are good.
We can create a performance curve using Table 1a from page 3 of [2]. The percentage of B- students is lower here, at about 50% near the benchmark, so I'm not sure how this relates to the numbers in [1] that were used to derive the tree diagrams. But the curves below should be self-consistent at least, since they all come from the same data set. They show the ability of the SAT to distinguish between the two types of students.
If we set the bar very high, we can be relatively sure that those who meet the threshold are good (B- or better) students, but this comes at a cost in false negatives as we saw before. The "sweet spot" seems to be at 1100, with a 65% rate of classifying both groups correctly. Using this criterion, the SAT is 15 points better than a random coin toss for predicting both good and poor academic performers.
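For reference, a curve like this is straightforward to build if you have student-level data rather than the published tables. Here is a rough R sketch; the data frame students and its columns sat and fygpa are hypothetical stand-ins, not the College Board's data.
# correct-classification rates for "B- or better" (good) and "C or worse" (poor) students
# across a range of SAT cut-offs
good = students$fygpa >= 2.67
cuts = seq(600, 1600, by = 50)
rates = sapply(cuts, function(cut) {
  pass = students$sat >= cut
  c(good_correct = mean(pass[good]), poor_correct = mean(!pass[!good]))
})
# plot both rates against the cut-off; the "sweet spot" is roughly where the curves cross
matplot(cuts, t(rates), type = "l", xlab = "SAT cut-off", ylab = "fraction classified correctly")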
It's clear that although the SAT has some statistically detectable merit as a screening test via this benchmark, it's not really very good at predicting college grades. As others have pointed out, this test has decades of development behind it, and may represent the best there is in standardized testing. Another fact makes the SAT (and ACT) unusual in the catalog of learning outcomes tests: we can check its predictive validity in order to ascertain error rates like the ones above.
Unlike the SAT, most assessments don't have a way to find error rates because there is no measurable outcome beyond the test itself: tests of "complex thinking skills" or "effective writing," for example. These are not designed to predict outcomes that have their own intrinsic scalar outputs like college GPA. They often use GPAs as correlates to make the case for validity (ironically sometimes simultaneously declaring that grades are not good assessments of anything), but what exactly is being assessed by the test is left to the imagination. This is a great situation for test-makers because there is ultimately no accountability for the test itself. Recall from my previous post that test makers can help their customers show increased performance in two ways: either by helping them improve the product so that the true positives increase (which is impossible if you can't test for a true positive), or by introducing changes that increase the number of positives without regard to whether they are true or not.
It's ironic that standardized tests of learning are somehow seen as leading to accountability when the tests themselves generally have no accountability for their accuracy.
Friday, September 09, 2011
What to Expect When You're Assessing
Along with Kaye Crook and Terri Flateby, I will be leading a one-day pre-institute workshop at the 2011 Assessment Institute in Indianapolis. This is a large national conference led by Trudy Banta and her team at IUPUI. It runs from October 31 to November 1, with the pre-institute workshops on October 30.
The description of our workshop "What to Expect When You're Assessing" is:
This workshop is intended for faculty and administrators who have responsibility for administering assessment activities at the program, department, or higher level. Through hands-on activities, participants will learn essential skills for supervision of the whole assessment cycle, including good reporting, tips for data analysis, avoiding assessment pitfalls, good practices with tools like rubrics and curriculum maps, as well as management approaches to get the best out of your team using calendars, policies, and institutional readiness assessment. The workshop is appropriate for those with little assessment experience as well as those who would like to further develop their existing practices to create sustainable and meaningful assessment programs.
The reason for offering the workshop is to help institutions grow their own expertise in leading assessment processes. Because gaining the trust of faculty and administrators within the organization is so important, it's a good strategy to find someone who already has that trust and teach them about assessment, rather than hiring an assessment expert from outside who then has to win everyone's trust.
I asked the ASSESS-L email list for their "Must-knows for a new assessment coordinator" (thanks to Katy Hill, Sean A McKitrick, and Rhonda A. Waskeiwicz for their responses). The results were interesting for a noticeable dearth of technical items, and an emphasis on political and personal skills, some of which actually de-emphasize technical knowledge, including:
- It's okay for things to not be perfect.
- One has to 'suspend disbelief' at times with regard to rigor
The lists are insightful, and have helped me think about the one-day program we're putting together. Roughly, it's about one half technical stuff:
- The basic idea of assessment loops
- Common terms and what they mean in practice
- How to write good reports
- Use of rubrics and curriculum maps
- Data analysis and presentation
and about one half political and management skills:
- Appreciative inquiry
- Responding to specific challenges (this was a topic on ASSESS-L too)
- Setting expectations
- Assessing institutional readiness
- Calendars
- Working with other groups on campus (e.g. faculty senate, center for teaching and learning)
- Administrative buy-in
- What software tools can do and what they can't do
I welcome comments or suggestions. More materials will be forthcoming.
Edit: In addition to the ASSESS-L archive, there is a wonderful site, Internet Resources for Higher Education Outcomes Assessment, hosted by the University of North Carolina and maintained by Ephraim Schechter, a familiar name in assessment circles. That page is almost always open in my browser, and it's an essential bookmark for anyone interested in assessment.
Monday, September 05, 2011
The Economics of Imperfect Tests
It's fascinating to me how attracted people are to rankings: colleges, sports teams, best cities to live in, most beautiful people, and so on can seemingly be put in an order. Of course, it's ridiculous if you stop to ask whether the quality in question could really be as simple as a one-dimensional scalar that can be measured with such precision. But that doesn't stop new lists from being generated. Rank order statistics (e.g. saying that Denver is number one and Charlotte is number two on the list) come with their own sort of confidence intervals, so that we really should be saying "City C is rank R plus or minus E, with probability P." Computing these confidence bounds is not easy, and I've never seen it done on one of these lists.
Leaving aside the issue of bounding error, the generation of the numbers themselves is highly questionable. Often, as with US News rankings of colleges, a bunch of statistics are microwaved together in Frankenstein's covered dish to create the master ranking number. You can read the FAQ on the US News rankings here. It seems that consumers of these reports are in such a hurry, or have such limited attention spans, that we can only consider one comparative index. That is, we can't simultaneously consider graduation rate in one column and net cost in another to make compromise decisions. Rather, all the variables have assigned weights (by experts, of course), and everything is cooked down into a mush.
A more substantive example is the use of SAT scores for making decisions about admissions to college (in combination with other factors). In conversations in higher ed circles, SATs are sometimes used as a proxy for the academic potential of a student. It's inarguable that although there is some slight predictive validity for, say, first year college grades, tests like these aren't very good as absolute indicators of ability. And so it would seem on the surface of it that the tests are over-valued in the market place. I've argued that this is an opportunity for colleges that want to investigate non-cognitive indicators, or other alternative ways of better valuing student potential.
But the question I've entertained for the last week is this: what is the economic effect of an imperfect test? I imagine some economist has dealt with this somewhere, but here's a simple approach.
How do we make decisions with imperfect information? We can apply the answer to any number of situations, including college admissions or learning outcomes assessment, but let me take a simpler application as an analogue. Suppose we operate in a market where there is a proportion $g$ of good money and counterfeit, or bad, money $1-g := b$. [You need javascript on to see the formulas properly.] We also have a test that can imperfectly distinguish between the two. I've sketched a diagram to show how well the test works below.
A perfect test would avoid two types of errors--false negatives and false positives. These may happen with different rates: in the diagram, a good coin tests "good" with probability $t$ and a bad coin tests "bad" with probability $f$, so $1-f$ is the false positive rate. Suppose an agent we call the Source goes to the market to exchange coins for goods or services with another agent, the Receiver. Assume that the Receiver only accepts coins that test "good." It's interesting to see what happens.
The fraction of coins the Source passes off as good (that is, that test "good") is $gt+b(1-f)$.
The fraction of genuinely good coins the Receiver gets is $gt$.
So the Source will obtain more goods and services in return than are warranted, an excess of $b(1-f)$. The inefficiency can be expressed as the ratio $\frac{b(1-f)}{gt+b(1-f)}$, which is also the conditional probability Pr[false positive|test = "true"]. (Pr means probability, and the vertical bar is read "given.")
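To make this concrete with some invented numbers: if $g=0.9$, $t=0.95$, and $f=0.8$, then the fraction of coins that test "good" is $0.9 \cdot 0.95 + 0.1 \cdot 0.2 = 0.875$, and the inefficiency is $0.02/0.875 \approx 2.3\%$. A few lines of R tell the same story by simulation (every number here is made up purely for illustration):
# simulate a market with 90% good coins and an imperfect test
set.seed(1)
n = 100000
g = 0.9; t = 0.95; f = 0.8
# each coin is good with probability g
good = runif(n) < g
# good coins test "good" with probability t; bad coins with probability 1-f
tests_good = ifelse(good, runif(n) < t, runif(n) < 1 - f)
mean(tests_good)             # fraction accepted: close to g*t + (1-g)*(1-f) = 0.875
1 - mean(good[tests_good])   # fraction of accepted coins that are bad: close to 0.023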
There are two factors in $b(1-f)$: the fraction of bad coins and the false positive rate. So the Source has an incentive to increase both. Increasing the fraction of bad coins is easy to understand. Increasing $1-f$ means trying to fool the test. Students who take SAT prep tests are manipulating this fraction, for example.
So we have mathematical proof that it's better to give than to receive!
In a real market, it might work the other direction too. The Receiver might try to fool the Source with a fraction of worthless goods or services in return. In that case, the best test wins, which should lead to an evolutionary advantage for good tests. In fact, we do see this in real markets, at least for easily-measured quantities like weight.
In many cases, the test only goes one direction. When you buy a house, for example, there's no issue about counterfeiting money since it all comes from the bank anyway. The only issue is whether your assessment of the value of the house is good. The seller (the Source) has economic incentive to fool your tests by hiding defects and exaggerating the value of upgrades, for example. It's interesting to me how poor the tests for real estate value seem to be.
In terms of college admissions or learning outcomes assessment, we don't hear much about testing inefficiency. The effects are readily apparent, however. Consider, for example, the idea of "teaching to the test" that crops up in conversations about standardized testing. If teachers receive economic or other benefit from delivering students who score above certain thresholds on a standardized test, then they are the Source, and the school system (or public) is the Receiver. It's somewhat nebulous what quality the tests are actually testing for, because there isn't any discussion that I can find about the inherent $b(1-f)$ inefficiency that must accompany any less-than-perfect test. For teachers, the supply of "currency" (their students) is fixed, and they don't have any incentive to keep back the "good" currency for themselves. It's a little different from the market scenario described above, but we can easily make the switch. The teachers are motivated to increase the number of tested positives, whether these are true or not. They would also find false negatives galling, since these cheat them out of goods they have delivered. As opposed to the exchange market, they want to increase both $gt$ and $b(1-f)$, not just the latter. They are presumed to have the ability to transmute badly-performing students into academic achievers (shifting the ratio to a higher $g$), and they can also try to fool the test by decreasing $f$. It is generally assumed that the ethical solution is to do the former.
A good case study in this instance is the New York City 2009 Regents exam results, as described in this Wall Street Journal article. The charge is made that the teachers manipulated test results to get higher "true" rates, and the evidence given clearly indicates this possibility. The graphs show that students are somehow getting pushed over the gap from not acceptable to acceptable, which is analogous to receiving a "good" result on the test. One of these is reproduced below.
Quoting from the article:
Mr. Rockoff, who reviewed the Regents data, said, "It looks like teachers are pushing kids over the edge. They are very reluctant to fail a kid who needs just one or two points to pass."
This could be construed as either teachers trying to fix false negatives or trying to create false positives. The judgments come down as if it were the latter. The argument that this effect is due to fiddling with $1-f$ instead of increasing $g$ is bolstered by this:
Mr. Rockoff points to the eighth-grade math scores in New York City for 2009, which aren't graded by the students' own teachers. There is no similar clustering at the break point for passing the test.
I find this very interesting because it is assumed that this is the normal situation--that if there were no $1-f$ "test spoofing" going on, then we should see a smooth curve with no bump. The implication is that teachers don't have a good understanding of how students will test--that is, despite the incentive to increase the skill levels of students so that they will convert from $b$ to $g$ (and have a better chance of testing "good"), they just don't know how to do it.
Consider an analogous situation on an assembly line for bags of grain. Your job is to make sure that each bag has no less than a kilo of grain, and you have a scale at your disposal to test. Your strategy would probably be to just top up the bags so they have a kilo of grain, and then move on to the next one. Mr. Rockoff (and probably most of us) assumes that this is not possible for educators. It's an admission that teaching and testing are not very well connected. Otherwise, we would expect to see teachers "topping off" the educational level of students to pass the test and then moving on to other students, to maximize the $gt$ part of their payoff. This quote shows that the school system administrators don't even believe this is possible:
After the audit, the state said it took a series of actions and plans to conduct annual "spike/cluster analysis of scores to identify schools with suspicious results."
It's ironic that on the one hand, this latent assumption questions the value of the tests themselves, and at the same time the system is built around their use. Other language in the article includes such expressions of certitude as this:
Michelle Costa, a high-school math teacher in New York City, said she often hears from friends who teach at other schools who [bump up scores on] tests, though she doesn't do it. "They are really doing the student a disservice since the student has so obviously not mastered the material," she said.
Missing the mark by a couple of points is equated to "obviously not mastering the material." There is no discussion about the inherent inefficiencies in the test, although there seems to be a review process that allows for some modification of scores (called scrubbing in the article).
In a situation like this, the most sensible approach for teachers is to try to affect $f$. Teaching students how to take standardized tests is one way of doing that. This critique of one of the tests makes fascinating reading. Here's a short quote:
Do not attempt to write an original essay. You don't have time. Points are awarded and subtracted on the basis of a formula. Write the five-paragraph essay, even though you will never again have a personal or professional occasion to use this format. It requires no comprehension of the text you are citing, and you can feel smart for having wasted no time reflecting on the literature selections.
We don't usually know if the tests are meaningful. If we did, we would know the ratio $g:b$ both before and after the educational process, and we would be able to tell what the efficiency of the test was. This is essentially a question of test validity, which seems to get short shrift. Maybe it seems like a technicality, and maybe the consumers of the tests don't really understand the problem, but it's essential. Imagine buying a test for counterfeit currency without some known standard against which to judge it!
In education, the gold standard is predictive validity: we don't really care whether or not Tatiana can multiply single digit numbers on a test because that's not going to happen in real life. We care about whether she can use multiplication in some real world situation like calculating taxes, and that she be able to actually do that when it's needed, not in some artificial testing environment. If we identified these outcomes, we could ascertain the efficiency of the test. The College Board publishes reports of this nature, relating test scores to first year college grades, for example. From these reports we can see that the efficiency of the test is quite low, and it's a good guess that most academic standardized tests are equally poor.
Yet the default assumption is often that the test is 100% efficient, that is $f=t=1$: we always perfectly distinguish $g$ from $b$.
The perspective from the commercial test makers is enlightening. If the teachers, faced with little way of relating teaching to testing, and no hope of relating that to (probably hypothetical) predicted real outcomes, choose to modify $1-f$ as a strategy, what is the likely motivation of the test makers?
The efficiency of the test is certainly a selling point. Given the usual vagueness about real predictable outcomes, test makers can perhaps sell the idea that their products really are 100% efficient (no false positives or negatives). An informed consumer would have a clear goal as to the real outcomes that are to be predicted, and demand a ROC curve to show how effective the test is. In some situations, like K-12 testing, we have the confused situation of teachers not having a direct relationship to the tests, which have no accountability to predict real outcomes. It's similar to trying to be the "best city to live in," by optimizing the formula that produces the rankings.
Since there is an assumed (but ironically unacknowledged) gap between teaching and testing, even the testing companies have no real incentive to try to improve teaching, improving the $g:b$ ratio. It's far easier for them to sell the means to fool their own tests. By teaching students how to optimize their time, and deal with complexities of the test itself, which very likely has nothing to do with either the material or any real predictable outcomes, test makers can sell to schools the means to increase the number of reported positives. They are selling ways to raise $t$ and $1-f$ without affecting $g$ at all. Of course, this is an economic advantage and doesn't come for free. Quoting again from the critique of one test:
[S]everal private, for-profit companies have already developed Regents-specific test-preparation courses for students who can afford their fees [...]
It's as if instead of trying to distinguish counterfeit coins from good ones, we all engage in trying to fool the test that imperfectly distinguishes between the two. That way we can pretend that there's more good money in circulation than there actually is.
Sunday, September 04, 2011
Understanding Assessment through Language
Over the summer I wrote a paper that explores the role of language in outcomes assessment: how it can help and how it can get in the way of understanding what's going on. This exercise clarified for me the long-term importance of eportfolios, and the risks inherent to abstract forms of assessment. I will look for a publisher eventually, but in the meantime I welcome your comments.
"Understanding Assessment through Language" (.5MB pdf)
"Understanding Assessment through Language" (.5MB pdf)
Saturday, September 03, 2011
The Most Important Problem in Higher Education
Arrow's Impossibility Theorem is a fascinating application of mathematics to social science. Arrow took up the problem of how to find a voting system that meets certain reasonable criteria, such as taking into account the preferences of more than one voter. He showed that no single system can meet all of the criteria simultaneously, because doing so creates a logical contradiction.
I have thought that it would be interesting to try the same trick with higher education. Informally, we might think of a list of best-case qualities we would wish for an educational system, such as universal access, a mechanism for cost deferral (like loans), public subsidization at a given level, and so on. In these lists, which I scribble on napkins at diners, the hardest quality to come to grips with is the idea of certification: a way of knowing that a particular student showed a certain level of accomplishment.
The idea of accomplishment is subtle. Here are at least three ways of looking at it:
- We might be interested in what a person can do in the future, as a prospective employer would. Can Tatiana synthesize organic compounds? In this sense, accomplishment has to do with the predictive validity of inducing future performance from past performance. Sports statistics embody this idea: a batting average in baseball, for example.
- Another effect of accomplishment is the usefulness of experience in being a consultant. If Tatiana swam the English Channel back in 1981, she might not be able to do it again today, but she could probably tell you some important things to know about it.
- A third effect is social standing. Accomplishment brings its own rewards in terms of access to more connected people, the media, and so on.
This last issue is confounded by the fact that accomplishments are often judged subjectively. Someone might think the screenplay I wrote is wonderful, but that would be a minority view. We can disagree about subjective assessments, so 'accomplishment' itself can be very fuzzy. Even in the case where some putative objective fact is presented (e.g. swimming the English Channel), there is room for debate about the significance of that accomplishment, and what it means in terms of the effects listed above. The common denominator is to keep track of evidence of the accomplishment itself (we would call it authentic assessment data in higher ed land), and let the endorsements and comments ebb and flow with the tides. So if Lincoln's Gettysburg Address receives poor reviews the day after, we still have it around 150 years later to review for ourselves. The historiography of accomplishments is almost as interesting as the things themselves.
Existing systems for tracking accomplishment are inadequate to the task. Let's look at some of them.
In higher education, we issue diplomas and certificates, sometimes with decorations like cum laude and an identified area of study. The reputation of the issuing institution is lent to the holder of the degree, and some subject areas are worth more than others. This is rather like a guild system, a gate-keeper approach. Some of the problems are:
- It doesn't allow for autodidacts who learn outside the system, and so demands a significant investment in time and money.
- There isn't much 'partial credit': completing 99% of requirements does not translate to 99% accomplishment since you don't get a diploma.
- It's too coarse-grained: we need more information than "John got an engineering degree from State U," or even transcripts.
- The lending of reputation from a school's name to the individual distorts the actual accomplishment of the individual.
- It's very expensive and time-consuming to become certified.
Letters of recommendation are another way we track accomplishment, and they have their own problems:
- There is no limit to how many letters one can write.
- There is incentive to be overly generous in letters you do write, or at least there is a disincentive to be negative.
- It's very difficult to compare letters from different endorsers since they are not standardized.
If we were interested only in predictive validity of statements like "Tatiana can speak French fluently," then the ideal case would be to have a perfect testing system to ascertain proficiency only in areas of interest. This is the Western Governors University approach, and it also exists in the form of professional board exams. However, these are coarse-grained hurdles that have to be overcome to get credit for a course or enter a profession, and the skills actually needed for a job may have only a passing resemblance to the test. For example, there are many, many types of engineering jobs, and the board exams cannot possibly cover the skill sets in that level of detail. This results in a rather arbitrary, and certainly inefficient, standard of accomplishment.
What are some requirements for an accomplishment tracking system?
Here's a starter list.
- Identity is valid. We have to be confident that the accomplishments weren't outsourced or bought on eBay and then the credit transferred to someone else.
- The system has to be open and transparent, so that we don't create a bottleneck that is just another gatekeeper.
- The system has to be self-correcting with respect to inflation of accomplishments and outright errors.
- Individuals have the final say over their own accomplishment profiles.
The system or systems should be web-based and easily accessed. There shouldn't be fees or artificial walls to viewing profiles. Of course, per number four above, profiles should be able to be restricted by their owners, like current social media sites sort of allow. It also implies that there is no negative information about an individual, only positive. This seems like a drawback, and I might be wrong about it, but I think allowing negative feedback (which necessarily allows others to say things about you publicly that you don't like) is fraught with problems.
So one possibility is a lightweight endorsement system that works something like the science pre-print site arxiv.org mashed up with Facebook. Well, like academia.edu, come to think of it.
- I do some work that I think shows off my skills and upload a record of it to the system under my profile. This would necessarily be electronic, but could consist of video or any other kind of media--not just papers.
- A portfolio-like central index keeps track of all my stuff, with pointers and meta-data about the works I have listed in my resume.
- The index system also allows for endorsements by other registered users. Everyone's ID is real, verified by credit cards and phone numbers or something (optionally a unique SSL certificate).
- Endorsements from experts would claim to authenticate that this is indeed my work, and say something appropriate about it that demonstrates a connection with the work itself.
- All of this is standardized to the extent possible, and if the user allows it, fed out to a data export.
On the surface, it seems like reciprocal recommendations would be a problem, as they tend to be in journals (citing each other's papers or including each other as co-authors). But with transparency, I suspect that clever people would create metrics that attempt to summarize the evidence and connections that exist in the body of work. By tracing links of recommendations to create a network of association, it ought to be possible to see how worthwhile they are. It would be far from perfect, of course. But the possibility exists for a real "market" of reputation, where one's own standing is linked to those whom one has endorsed. It would be like picking stocks. In this way, there is an incentive for high-standing individuals to endorse newcomers who show promise. It probably doesn't prevent fads from distorting the 'reputation market', but if there is a relationship to the real world, eventually it becomes self-correcting. (This doesn't apply to all fields, and that problem may be unsolvable there simply because the discipline is more or less defined by the fads that exist at that time.)
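As a back-of-the-envelope illustration, here is a toy R sketch that passes "reputation" along endorsement links, loosely in the spirit of PageRank. The names, the data, and the scoring rule are all invented; a real metric would need to account for out-degree, fraud, and much more.
# a toy endorsement network: each row means "from endorses to"
endorsements = data.frame(from = c("alice", "alice", "bob", "carol"),
                          to   = c("bob", "carol", "dora", "dora"),
                          stringsAsFactors = FALSE)
people = unique(c(endorsements$from, endorsements$to))
score = setNames(rep(1, length(people)), people)
# each round, a person receives 0.85 times the score of each endorser, plus a 0.15 baseline
for (i in 1:20) {
  received = tapply(score[endorsements$from], endorsements$to, sum)
  score = setNames(rep(0.15, length(people)), people)
  score[names(received)] = score[names(received)] + 0.85 * received
}
round(sort(score, decreasing = TRUE), 2)
In this tiny example the highest score goes to the person endorsed by people who are themselves endorsed, which is the flavor of self-correction described above.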
The beauty of a transparent system is that a company or university wanting to evaluate a person's record could either use an off-the-shelf metric provided by one of these hypothetical companies that would spring up, or they could just look at the evidence themselves. Or they could hire their own experts to look at the portfolios. There is a sliding scale between how quick the evaluation is and how customized it is, and the best solution for one circumstance may be different from another. I assume this would be followed by an interview, which could be very substantive and deal directly with the evidence of performance that pertains to the position.
A good test case would be teaching evaluation. I haven't seen a public teaching portfolio. Maybe I just haven't come across one, but it's never occurred to me to post mine either, until now. I have big binders with student projects I've supervised. Why not post the best of those? The creation of a teaching-record portal would be a real service to higher education to standardize and validate that important part of academia.
Although I talked about 'a system', it would almost have to be many systems, each associated with some professional endeavor. This still allows for a resume to link all of the pieces together if necessary. In education, learning outcomes assessments would be a necessary component.
The rewards for figuring out how to accurately and efficiently certify accomplishment are worth the trouble. Education systems could be revolutionized to be more flexible and transparent, less bureaucratic and guild-like. Someone who teaches herself would not be valued less than someone who learned at a high-priced university--the proof would be in the sweet dairy dessert dish, as the saying goes.