Wednesday, September 28, 2011

SAT Error Rates

In "The Economics of Imperfect Tests" I explored the consequences of errors when making decisions with a test. By coincidence, the College Board came out with something very similar a few days later: their revised College and Career Readiness Benchmark [1]. This article and a related study [2] from 2007 give statistics that can illuminate the ideas in my prior post.

When testing for a given criterion, it's essential to be able to check how well the test is working. This is a nice thing about the SAT: since it predicts success in college, we can actually see how well it works. This new benchmark isn't intended for predicting individual student performance, but groups of them. It looks like a bid for the SAT to become a standardized assessment of how well states, school districts, and the like, are preparing students for college. One caveat is mentioned in [2] on page 24:
One limitation of the proposed SAT benchmark is that students intending to attend college are more likely to take the SAT and generally have stronger academic credentials than those not taking the exam. This effect is likely to be magnified in states where a low percentage of the student population take the exam, since SAT takers in those states are likely to be high achievers and are less representative of the total student population.
The solution there would be to mandate that everyone take the test.

As with the test for good/counterfeit coins in my prior post, the benchmark is based on a binary decision:
Logistic regression was used to set the SAT benchmarks, using as a criterion a 65 percent probability of obtaining an FYGPA of a B- or higher [...]
The idea is to find an SAT score such that students at or above this threshold have a 65% probability of earning a college GPA of 2.67 or better in their first year. There are some complexities in the analysis, including the odd fact that this 65% figure includes students who do not enroll. Of the students who do enroll, table 4 on page 15 of [1] shows that 79% of those who met the benchmark were 'good' students with an FYGPA of B- or better (i.e. 2.67 or more). For the purposes of rating the quality of large groups of students, I suppose including non-enrolled students makes sense, but I will look at the benchmark from the perspective of trying to understand incoming student abilities or engineer the admissions stream, which means only being concerned with enrolled students.
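As an aside, here's a rough sketch of how a cutoff like this could be set with logistic regression. The scores, the GPA model, and the score scale below are synthetic placeholders of my own, not College Board data, and the actual procedure in [1] is more involved:

```python
# Minimal sketch of setting a benchmark via logistic regression.
# The scores and GPAs below are synthetic placeholders, not College Board data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sat = rng.uniform(400, 1600, size=5000)                      # hypothetical score scale
fygpa = 0.5 + 3.0 * (sat - 400) / 1200 + rng.normal(0, 0.7, size=sat.size)
success = (fygpa >= 2.67).astype(int)                        # 1 = FYGPA of B- or better

# Fit P(success | SAT) with a logistic model.
X = sm.add_constant(sat)
fit = sm.Logit(success, X).fit(disp=0)
b0, b1 = fit.params

# Invert the fitted curve to find the score where the predicted probability is 65%:
#   p = 1 / (1 + exp(-(b0 + b1*x)))   =>   x = (log(p/(1-p)) - b0) / b1
target = 0.65
benchmark = (np.log(target / (1 - target)) - b0) / b1
print(f"benchmark score for a {target:.0%} chance of a B- or better: {benchmark:.0f}")
```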

Using the numbers in the two reports, I tried to find all the conditional probabilities needed to populate the tree diagram I used in the prior post to illustrate test quality. For example, I needed to calculate the proportion of "B- or better" students. I did this three ways, using data from tables in the article, and got 62% to within a percentage point all three times. The article [1] notes that this would be less than 50% if we include those who don't enroll, but that must be an estimate because a student obviously doesn't have a college GPA if they don't enroll.

Here are the results of my interpretation of the data given. It's easiest to derive the conditional probabilities in this direction:
In the tree diagram above, 44% of students pass the benchmark, which I calculated from table 5 on page 16 of [1]. The conditional probabilities on the branches of the tree come from table 4 on the previous page. Note that there's a bit of rounding in both these displays. Using Bayes' Rule, it's easy to transform the tree to the form I used in the first post.
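For anyone who wants to redo the arithmetic, here's a minimal sketch of that inversion. The 49% share of 'B- or better' students among non-passers is my reading of the tree, not a figure quoted directly in [1]:

```python
# Inverting the tree with Bayes' rule, using the approximate figures quoted above
# (44% pass the benchmark; 79% of passers and ~49% of non-passers earn a B- or better).
p_pass = 0.44
p_good_given_pass = 0.79
p_good_given_fail = 0.49   # my estimate: about half of rejected students are 'good'

# Joint probabilities from the first tree:
p_good_and_pass = p_pass * p_good_given_pass             # ~0.35
p_good_and_fail = (1 - p_pass) * p_good_given_fail       # ~0.27

# Marginal: fraction of "B- or better" students
p_good = p_good_and_pass + p_good_and_fail               # ~0.62

# Bayes' rule gives the branches of the second tree:
p_pass_given_good = p_good_and_pass / p_good                           # ~0.56
p_pass_given_bad = ((1 - p_good_given_pass) * p_pass) / (1 - p_good)   # ~0.24

print(f"P(good) = {p_good:.2f}")
print(f"P(pass | good) = {p_pass_given_good:.2f}")
print(f"P(pass | bad)  = {p_pass_given_bad:.2f}")
```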
 

The fraction of 'good' students comes out to 62%, which agrees closely with a calculation from the mean and standard deviation of sampled FYGPA on page 10 of [1], assuming a normal distribution (the tail of "C or worse" grades ends .315 standard deviations left of the mean). It also agrees with the data on page 16 of [1], recalling that high school GPAs are about half a point higher than college GPAs in the aggregate.
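That tail calculation is easy to check directly (here with scipy; the 0.315 figure comes from the mean and standard deviation reported in [1]):

```python
# Quick check of the 62% figure: if FYGPA is roughly normal, the share of
# students at B- (2.67) or better is the area to the right of a point
# 0.315 standard deviations below the mean.
from scipy.stats import norm

p_good = 1 - norm.cdf(-0.315)   # equivalently norm.cdf(0.315)
print(f"{p_good:.3f}")          # ~0.624, i.e. about 62%
```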

Assuming my reading of the data is right, the benchmark classifies 35% + 28% = 63% of students correctly, doing a much better job with "C or worse" students than with "B- or better" students. Notice that if we use no test at all and assume everyone is a "B- or better" student, we'll be right 62% of the time, with a perfect record on the good students and zero accuracy on the others. Accepting only students who exceed the benchmark nets us 79% good students, a 17-percentage-point improvement due to the test, but it means rejecting a lot of good students unnecessarily (44% of them).
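Here is the same bookkeeping arranged as a 2x2 classification table, compared against the no-test baseline, using the rounded figures above:

```python
# Summarizing the benchmark as a 2x2 classification table, using the rounded
# numbers above (62% "B- or better" students, 44% pass, 79% of passers are good).
p_good, p_pass, p_good_given_pass = 0.62, 0.44, 0.79

true_pos  = p_pass * p_good_given_pass          # good and passed    ~0.35
false_pos = p_pass * (1 - p_good_given_pass)    # poor but passed    ~0.09
false_neg = p_good - true_pos                   # good but rejected  ~0.27
true_neg  = (1 - p_good) - false_pos            # poor and rejected  ~0.29

accuracy = true_pos + true_neg                  # ~0.63 to 0.64, depending on rounding
baseline = p_good                               # accept everyone: right 62% of the time
rejected_good = false_neg / p_good              # ~0.44 of good students turned away

print(f"accuracy = {accuracy:.2f}, baseline = {baseline:.2f}, "
      f"good students rejected = {rejected_good:.2f}")
```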

In the prior post I used the analogy of finding counterfeit coins with an imperfect test. With the numbers above, it would be a case where there is a lot of bad currency floating around (38% of it), and on average our transactions, using the test, would leave us with 79 cents on the dollar instead of 62. We have to subtract the cost of the test from this bonus, however. It's probably still worth using, but no one would call it an excellent test of counterfeit coins: nearly half of all good coins are kicked back because they fail the test, which is pretty inefficient, and about half of the rejected coins are actually good.

We can create a performance curve using Table 1a from page 3 of [2]. The percentage of "B- or better" students is lower here, at about 50% near the benchmark, so I'm not sure how this relates to the numbers in [1] that were used to derive the tree diagrams. But the curves below should at least be self-consistent, since they all come from the same data set. They show the ability of the SAT to distinguish between the two types of students.
If we set the bar very high, we can be relatively sure that those who meet the threshold are good (B- or better) students, but this comes at a cost in false negatives, as we saw before. The "sweet spot" seems to be at 1100, with a 65% rate of classifying both groups correctly. Using this criterion, the test is 15 percentage points better than a random coin toss at identifying both good and poor academic performers.
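The curve itself is just a threshold sweep. Here's a sketch of the calculation; the score distributions below are synthetic stand-ins for the Table 1a data in [2], so only the procedure, not the output, should be trusted:

```python
# Sketch of the performance-curve calculation: for each candidate cutoff, what
# fraction of "B- or better" students score above it, and what fraction of
# "C or worse" students score below it?  The data here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 10000
good = rng.random(n) < 0.5                       # roughly the 50/50 split near the benchmark in [2]
scores = np.where(good,
                  rng.normal(1150, 180, n),      # hypothetical score distribution, good students
                  rng.normal(1000, 180, n))      # hypothetical score distribution, poor students

for cutoff in range(800, 1401, 100):
    sensitivity = np.mean(scores[good] >= cutoff)      # good students correctly passed
    specificity = np.mean(scores[~good] < cutoff)      # poor students correctly rejected
    print(f"cutoff {cutoff}: pass good {sensitivity:.0%}, reject poor {specificity:.0%}")
```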

It's clear that although the SAT has some statistically detectable merit as a screening test via this benchmark, it's not really very good at predicting college grades. As others have pointed out, this test has decades of development behind it and may represent the best there is in standardized testing. Another fact makes the SAT (and ACT) unusual in the catalog of learning outcomes tests: we can check its predictive validity in order to ascertain error rates like the ones above.

Unlike the SAT, most assessments don't have a way to find error rates, because there is no measurable outcome beyond the test itself. Tests of "complex thinking skills," "effective writing," and the like, for example, are not designed to predict outcomes that have their own intrinsic scalar outputs like college GPA. They often use GPAs as correlates to make the case for validity (ironically, sometimes simultaneously declaring that grades are not good assessments of anything), but what exactly is being assessed by the test is left to the imagination. This is a great situation for test-makers, because there is ultimately no accountability for the test itself. Recall from my previous post that test makers can help their customers show increased performance in two ways: either by helping them improve the product so that the true positives increase (which is impossible if you can't test for a true positive), or by introducing changes that increase the number of positives without regard to whether they are true or not.

It's ironic that standardized tests of learning are somehow seen as leading to accountability when the tests themselves generally have no accountability for their accuracy.
