Wednesday, June 06, 2012

Bad Reliability

You usually hear that validity implies reliability, but sometimes reliability is bad.  Here's why.

Suppose a medical test exists for a certain disease, but it will only return a true positive result if there is a sufficient amount of antibody in the blood. Because the amount of the antibody changes over time, the test may return false negatives, saying that the subject does not have the disease when in fact they do. So the test does not reliably give the same answer every time. But this is actually an advantage, because it makes repeating the test useful. It would be bad if it reliably gave you the same wrong answer.

We can create a similar situation in educational testing. Suppose we want to know if students have mastered foreign language vocabulary, and have a list of 5,000 common words we want them to know. Suppose we give a test on 100 items. It's possible (but not likely) that a student could know 98% of the material and get a zero on the test.
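To put a number on "possible (but not likely)", here is a minimal sketch. It assumes the 100 test items are drawn uniformly at random without replacement from the 5,000-word list, and that "knows 98%" means exactly 4,900 words. The function name is just for illustration.

```python
from math import comb, log10

def prob_zero_score(list_size, words_known, test_size):
    """Chance that every randomly drawn test item is a word the student does not know."""
    unknown = list_size - words_known
    # Ways to fill the whole test from unknown words, over all possible tests.
    return comb(unknown, test_size) / comb(list_size, test_size)

if __name__ == "__main__":
    p = prob_zero_score(list_size=5000, words_known=4900, test_size=100)
    print(f"P(zero score) is about 10^{log10(p):.0f}")
```

There is exactly one 100-item test made up entirely of the student's 100 unknown words, so the chance is one over the number of possible tests. If the student knew even one more word, a zero would be impossible.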

If a student knows a fraction p of the 5,000-word list, and the test items are randomly chosen, then we can estimate how far the test score is likely to fall from p. Each test item is (approximately) an independent draw that the student knows with probability p, and assuming there are no other kinds of errors, we can find a confidence interval around p using p̂, the proportion of items the student gets correct on the test:
(Formula image, courtesy of Wikipedia.)
For a 95% confidence interval and n = 100, we have the error bounds shown below.
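A minimal sketch of those error bounds, assuming the interval meant here is the usual normal-approximation one, p̂ ± z·sqrt(p̂(1 − p̂)/n), with z ≈ 1.96 for 95% confidence. The function name `wald_margin` is my own label, not something from the original formula.

```python
import math

def wald_margin(p_hat, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

if __name__ == "__main__":
    n = 100  # number of test items
    for p_hat in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"p_hat = {p_hat:.1f}: +/- {wald_margin(p_hat, n):.3f}")
```

The margin is widest at p̂ = 0.5 and shrinks toward the extremes. Quadrupling the test length only halves it.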

If a student knows half the list, then about 5% of the time their score will be off by roughly 10 percentage points or more. If we are testing a lot of students, this is not great news. We can increase the size of the test, of course. But what if we just gave a retest to any student who fails with a score of 60% or less?

If it's the same test, with perfect reliability, the score shouldn't change. So that doesn't help. What if instead we gave the students a different test of randomly selected items? This reduces test reliability because it's, well, a different test. But it increases the overall chance of eliminating false negatives, which is one measure of validity. So less reliability, more validity.
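Here is a rough sketch of why the second, different test helps. It assumes each item is answered correctly with independent probability p, that the two tests are independent of each other, and that "fail" means 60 or fewer correct out of 100. The mastery levels in the loop are hypothetical examples.

```python
from math import comb

def prob_score_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): chance of k or fewer correct answers."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

if __name__ == "__main__":
    n, fail_at_or_below = 100, 60
    for p in (0.55, 0.65, 0.75):  # hypothetical true fractions of the list known
        fail_once = prob_score_at_most(fail_at_or_below, n, p)
        fail_twice = fail_once ** 2  # must fail two independent tests
        print(f"p = {p:.2f}: fail one test {fail_once:.3f}, fail both {fail_twice:.3f}")
```

Squaring the failure probability cuts false negatives for students just above the standard. It also gives students below the standard a second chance to pass, so false positives go up too.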

Here's a cleverer idea. What if we thought of the single 100 item test as two 50 item tests? Assume that we want to pass only students who know 60% or more of the words, and we accept a pass from either half of the test. How does that affect the accuracy of the results? I had to make some additional assumptions, in particular that the tested population had a mean ability of p=.7, with s=.1, distributed normally.

The results can be displayed visually on a ROC curve, shown below. If you want to download the spreadsheet and play with it, I put it here.

The single test method has a peak accuracy of 92% under these conditions, with an ideal cutoff score of 55%. This is a false positive rate of 34% and a true positive rate of 97%. The dual half test method has an accuracy of 93% with the cutoff score at 60%, which is more natural (since we want students to have mastered 60% of the material). This is a false positive rate of 18% and a true positive rate of 95%. The pass rate for the single test is 87% versus 83% for the dual test with the higher standard. The actual percentage who should pass under the stated conditions is 84%.
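For anyone who would rather simulate than use a spreadsheet, here is a minimal Monte Carlo sketch of the comparison. The assumptions are mine: abilities are drawn from a normal distribution (mean 0.7, s.d. 0.1) clipped to [0, 1], each item is answered correctly with independent probability p, and a student passes by scoring at or above the cutoff. These conventions may differ from the spreadsheet's, so the numbers need not match the ones above exactly.

```python
import random

def simulate_students(n_students=20000, half_items=50, mastery=0.60,
                      mean=0.7, sd=0.1, seed=0):
    """Return (is_master, first_half_correct, second_half_correct) per student."""
    rng = random.Random(seed)
    students = []
    for _ in range(n_students):
        # True fraction of the word list known, clipped to [0, 1] (assumption).
        p = min(1.0, max(0.0, rng.gauss(mean, sd)))
        first = sum(rng.random() < p for _ in range(half_items))
        second = sum(rng.random() < p for _ in range(half_items))
        students.append((p >= mastery, first, second))
    return students

def evaluate(students, passes):
    """Return (accuracy, true positive rate, false positive rate, pass rate)."""
    tp = fp = tn = fn = 0
    for is_master, first, second in students:
        passed = passes(first, second)
        if is_master and passed:
            tp += 1
        elif is_master:
            fn += 1
        elif passed:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return (tp + tn) / total, tp / (tp + fn), fp / (fp + tn), (tp + fp) / total

def single_rule(min_correct):
    """Pass if the combined score on both halves reaches min_correct items."""
    return lambda first, second: first + second >= min_correct

def dual_rule(min_correct_per_half):
    """Pass if either half on its own reaches min_correct_per_half items."""
    return lambda first, second: max(first, second) >= min_correct_per_half

if __name__ == "__main__":
    students = simulate_students()
    rules = (("single test, 55 of 100", single_rule(55)),
             ("dual halves, 30 of 50 on either", dual_rule(30)))
    for label, rule in rules:
        acc, tpr, fpr, pass_rate = evaluate(students, rule)
        print(f"{label}: accuracy {acc:.3f}, TPR {tpr:.3f}, "
              f"FPR {fpr:.3f}, pass rate {pass_rate:.3f}")
```

Sweeping the cutoffs in `single_rule` and `dual_rule` and plotting TPR against FPR gives the two ROC curves.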

All in all, the dual-test method seems superior to the single test method. Given how many long tests students take, why aren't they generally scored this way? I'm certain someone has thought of this before, but for as long as I've been in education, I've never heard of split scoring.


  1. Anonymous, 2:00 PM

    What about 10 10-item tests? It seems like more of a problem with the standard-setting than with the reliability of the test. I would guess your false-positive rate would begin to increase dramatically with such a method.

  2. Thanks for the comment. I'm sure there's a general result that would answer your question quickly, but I haven't had time to think about it. The spreadsheet is linked to play around with if you want. The comment about false positives is beside the point, I think, because what really matters is the shape of the ROC curve. We don't have to choose the same cut-off for each version of sub-testing--we can choose the best one that gives maximum accuracy (or whatever other metric we want).