Suppose a medical test exists for a certain disease, but it will only return a true positive result if there is a sufficient amount of antibody in the blood. Because the amount of the antibody changes over time, the test may return false negatives, saying that the subject does not have the disease when in fact they do. So the test does not reliably give the same answer every time. But this is actually an advantage, because it makes repeating the test worthwhile. It would be bad if it reliably gave you the same wrong answer.
We can create a similar situation in educational testing. Suppose we want to know if students have mastered foreign language vocabulary, and have a list of 5,000 common words we want them to know. Suppose we give a test of 100 items. It's possible (but not likely) that a student could know 98% of the material and get a zero on the test, because all 100 test items could happen to be the 100 words the student doesn't know.
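To see just how unlikely that zero is, here is a minimal sketch, assuming the 100 test items are drawn at random without replacement from the 5,000-word list. The function name `prob_zero_score` is made up for illustration.

```python
from math import comb

def prob_zero_score(list_size, known, test_items):
    """Probability that none of the randomly drawn test items is a known word."""
    unknown = list_size - known
    # Hypergeometric case: every test item must come from the unknown words.
    return comb(unknown, test_items) / comb(list_size, test_items)

# A student who knows 98% of 5,000 words (4,900 known, 100 unknown),
# tested on 100 randomly chosen items.
p_zero = prob_zero_score(5000, 4900, 100)
print(p_zero)
```

The result is vanishingly small but not zero, which is all the argument needs.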
If a student knows a fraction p of the 5,000-word list, and the test items are randomly chosen, then we can compute the probability of overlap. Each test item has a probability p of being a word the student knows, and assuming there are no other kinds of errors, we can find a confidence interval around p using p̂, the fraction of test items the student gets correct:
(Formula image courtesy of Wikipedia.)
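As a rough sketch of that calculation, assuming the interval intended is the usual normal-approximation (Wald) interval for a binomial proportion, p̂ ± z·sqrt(p̂(1 − p̂)/n), it could be computed like this. The function name `wald_interval` and the default z = 1.96 (roughly 95% confidence) are my assumptions, and the sketch ignores the small correction for sampling without replacement from a finite list.

```python
from math import sqrt

def wald_interval(correct, n_items, z=1.96):
    """Normal-approximation confidence interval for the fraction of words known."""
    p_hat = correct / n_items  # observed fraction correct on the test
    half_width = z * sqrt(p_hat * (1 - p_hat) / n_items)
    # Clamp to [0, 1], since a fraction cannot fall outside that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Example: a student gets 70 of 100 items correct.
low, high = wald_interval(70, 100)
print(round(low, 3), round(high, 3))
```

With 70 correct out of 100, the interval runs from about 0.61 to about 0.79, so a single 100-item test leaves a good deal of uncertainty about the true p.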
What if we test the student again? If it's the same test, with perfect reliability, the score shouldn't change. So that doesn't help. What if instead we gave the students a different test of randomly selected items? This reduces test reliability because it's, well, a different test. But it increases the overall chance of eliminating false negatives, which is one measure of validity. So less reliability, more validity.
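The effect of a second, different test on false negatives can be sketched with a binomial model. This assumes each item is answered correctly with independent probability p, a pass mark of 60 out of 100, and a hypothetical student who truly knows 65% of the words; none of these numbers come from real data.

```python
from math import comb

def prob_fail(p_known, n_items=100, pass_mark=60):
    """Probability of scoring below pass_mark on n_items independent items."""
    return sum(comb(n_items, k) * p_known**k * (1 - p_known)**(n_items - k)
               for k in range(pass_mark))

one_test = prob_fail(0.65)   # false negative on a single test
two_tests = one_test ** 2    # must fail both independently drawn tests
print(one_test, two_tests)
```

Squaring a probability less than one always shrinks it, so the chance that a student above the bar fails both tests is smaller than the chance of failing one. The same logic raises the chance that a student below the bar passes at least once.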
Here's a cleverer idea. What if we thought of the single 100-item test as two 50-item tests? Assume that we want to pass only students who know 60% or more of the words, and we accept a pass from either half of the test. How does that affect the accuracy of the results? To find out, I had to make some additional assumptions, in particular that the tested population had a mean ability of p = 0.7 with a standard deviation of s = 0.1, distributed normally.
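Here is one way the comparison could be set up in code. It is a sketch under the assumptions stated above plus a few of my own, not a reproduction of the spreadsheet: items are treated as independent with success probability p, the normal ability distribution is truncated to the range 0 to 1 and integrated on a grid, and the helper names `prob_pass` and `error_rates` are invented for illustration.

```python
from math import comb, exp

def prob_pass(p, n_items, pass_mark):
    """P(score >= pass_mark) when each item is answered correctly with probability p."""
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(pass_mark, n_items + 1))

def error_rates(pass_given_p, mean=0.7, sd=0.1, cutoff=0.6, steps=1000):
    """Return (false_negative_rate, false_positive_rate) over the ability population."""
    fn = fp = masters = non_masters = 0.0
    for i in range(steps + 1):
        p = i / steps
        # Unnormalized normal density; the grid truncates it to [0, 1].
        weight = exp(-0.5 * ((p - mean) / sd) ** 2)
        if p >= cutoff:
            masters += weight
            fn += weight * (1 - pass_given_p(p))   # should pass, but fails
        else:
            non_masters += weight
            fp += weight * pass_given_p(p)         # should fail, but passes
    return fn / masters, fp / non_masters

# Single test: at least 60 correct out of 100.
single = error_rates(lambda p: prob_pass(p, 100, 60))
# Split test: at least 30 correct out of 50 on either half.
split = error_rates(lambda p: 1 - (1 - prob_pass(p, 50, 30)) ** 2)
print("single (FN, FP):", single)
print("split  (FN, FP):", split)
```

Anyone who scores 60 or more overall must have scored 30 or more on at least one half, so at the same 60% pass mark the split rule passes everyone the single test passes and some others besides. That means fewer false negatives and more false positives. Comparing the two methods fairly therefore means varying the pass mark and looking at the whole trade-off curve.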
The results can be displayed visually on a ROC curve, shown below. If you want to download the spreadsheet and play with it, I put it here.
All in all, the dual-test method seems superior to the single-test method. Given how many long tests students take, why aren't they generally scored this way? I'm certain someone has thought of this before, but for as long as I've been in education, I've never heard of split scoring.