Sunday, December 12, 2021

Are you calculating GPA wrong?

Sorry for the clickbait headline, but it's accurate: the way we assign grade points (A = 4, B = 3, etc.) is arbitrary, which makes me wonder if it's sub-optimal. That statement--GPA being sub-optimal--is a little mind-bending. What does that even mean?

We certainly treat GPA like it means something. The statistic is used as a measure of Satisfactory Academic Progress (SAP), which has the weight of federal regulation behind it (34 CFR § 668.34). The regulation doesn't specify how GPA is to be calculated, however (or even that GPA must be the measure of progress).

Student grade averages are also used to determine financial aid and program admittance, and to create informal barriers to experiences like study abroad and internships. Employers may use GPA as a filter for screening applicants.

Such uses suggest a measurement study, where we work backwards from the uses of GPA to assess the predictive validity and other measurement properties. For example, if employers are using GPA > 3.0 as a screen, is that meaningful for students from a particular institution/program? Is it valid across institutions? Such studies are routinely done by institutional researchers, for example in using GPA to predict retention, but this probably doesn't often include alternative definitions of GPA.

I'll take a different approach here, to focus on a single measurement question that may seem abstruse: the existence of a latent variable.

Latent Traits

Imagine that GPA estimates some student trait on a numerical scale. I won't focus on what the scale measures, but rather how we expect those values to be distributed. If we imagine that we've tapped into a latent variable that is the result of the contributions of many factors, then it is reasonable to assume by the Central Limit Theorem that the distribution of values is normal (the bell curve common to statistics). 

With the assumption that GPA should be a normal distribution, we can ask the research question: what is the grade-point assignment that gives the best approximation of a normal GPA distribution?

Instead of assuming that a B is 3 grade points for GPA purposes, maybe the "best" value is 2.9 or 3.5, where "best" means the value that leads most closely to a normal distribution of values. 

The simplest way to induce grade point values is to use the ideas I wrote about here, here, and here.

Figure 1. Assuming a log-likelihood latent scale for course grades, and that the scale is normally distributed, red lines denote induced scalar values on the 0-4 grade point scale, where 4 = A, 3 = B, etc.

In this method, the lowest (F = 0) and highest (A = 4) values are fixed, and the intermediate ones are allowed to vary. The results are D = 0.8, C = 1.6, and B = 2.7. One can see in Figure 1 that the gap between the induced values (red lines) is larger between A and B than the others, suggesting the need for a "super-A." At my institution, the A+ grade is four grade points, just like a regular A. Imagine that we let A+ = 5 instead. Then the induced scale shifts the undecorated A to the left.
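The linked posts describe the induction method itself. As a rough illustration of the general idea, and not necessarily the calculation behind Figure 1, the sketch below assumes a standard normal latent scale, places cut points at the normal quantiles of the cumulative grade proportions, takes each grade's value to be the mean of the latent scale within its interval, and then rescales so that F = 0 and A = 4. The function name `induced_grade_points` and the grade shares are hypothetical.

```python
from statistics import NormalDist

def induced_grade_points(proportions, top=4.0):
    """Induce grade points from grade shares under a normal latent scale.

    proportions: shares ordered from the lowest grade to the highest,
    all strictly positive. This is an assumed method for illustration,
    not necessarily the one used for Figure 1.
    """
    nd = NormalDist()  # standard normal latent scale (assumption)
    total = sum(proportions)
    values = []
    cumulative = 0.0
    lower_z = float("-inf")
    for i, p in enumerate(proportions):
        mass = p / total
        cumulative += mass
        is_last = i == len(proportions) - 1
        upper_z = float("inf") if is_last else nd.inv_cdf(cumulative)
        # Mean of a standard normal truncated to (lower_z, upper_z):
        # (pdf(lower) - pdf(upper)) / probability mass of the interval.
        pdf_lo = 0.0 if lower_z == float("-inf") else nd.pdf(lower_z)
        pdf_hi = 0.0 if upper_z == float("inf") else nd.pdf(upper_z)
        values.append((pdf_lo - pdf_hi) / mass)
        lower_z = upper_z
    # Pin the lowest grade at 0 and the highest at `top`.
    lo, hi = values[0], values[-1]
    return [top * (v - lo) / (hi - lo) for v in values]

# Hypothetical shares of F, D, C, B, A grades (not the author's data).
shares = [0.04, 0.06, 0.15, 0.35, 0.40]
print([round(p, 2) for p in induced_grade_points(shares)])
```

With A-heavy shares like these, the induced B tends to land below 3 and the A-to-B gap tends to be the widest, which is the pattern described for Figure 1.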

Figure 2. A+ grades are assigned five points in this variation. The induced values are F = 0, D = 0.7, C = 1.5, B = 2.4, and A = 3.5. On a four-point scale it's equivalent to F = 0, D = 0.6, C = 1.2, B = 1.9, A = 2.8, and A+ = 4.
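The four-point equivalents in the Figure 2 caption follow from multiplying each five-point value by 4/5 and rounding to one decimal. A small check of that arithmetic (the helper name `rescale` is hypothetical):

```python
# Induced values on the five-point scale, from the Figure 2 caption.
five_point = {"F": 0.0, "D": 0.7, "C": 1.5, "B": 2.4, "A": 3.5, "A+": 5.0}

def rescale(weights, new_top=4.0):
    """Linearly rescale weights so the largest one equals new_top."""
    top = max(weights.values())
    return {grade: round(w * new_top / top, 1) for grade, w in weights.items()}

four_point = rescale(five_point)
print(four_point)  # matches the four-point values listed in the caption
```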

Notice the large gap in Figure 2 between A+ and A grades. The A+ version emphasizes the need for a grade to the right of A to better calibrate GPA. Alternatively, grading styles could be recalibrated on the usual scale by pushing the distribution leftward (more difficult-to-earn grades, making As less frequent).

Course Difficulty

There's a problem with the analysis above. The Central Limit Theorem that guarantees a normal distribution is fueled by independent samples, and the courses students take (hence grades) are not randomly selected in most cases. Some courses are more difficult to earn grades in than others, and students can pick their way through these in many different ways. We can't make the selections random, but we can try to account for the variance in course difficulty. There are multiple ways to do that, and I'll just describe one here.

For each student and each course section, count up the A-F grades assigned, and then compute the proportions of each grade type assigned for (1) each student, and (2) all the sections that student was in, combined.

Table 1. R output showing student statistics, one row per student. The S_ columns are frequencies of grades in the courses that student took, and the plain A-F grades are the proportions for that student.

In the first row of Table 1, that student took classes where 49% of the grades assigned were As, but the student's A-rate is 85%, far above expectations. By contrast, the third student's A-rate (37%) is close to the expected value for the courses (36%).

With the grade proportions calculated, any grade point assignment, like A = 4, B = 3.4, C = 2.5, D = 2, F = 0, can be used to calculate a GPA for the student and for the sections of courses they took. If we subtract these (student GPA - courses GPA), we get a relative GPA that attempts to subtract out course difficulty. Then we can try out various point assignments to see what the distribution of this relative GPA looks like.
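The relative GPA calculation can be sketched in a few lines. This is a minimal illustration, not the R code behind Table 1. The A-rates (85% for the student, 49% for the sections) come from the first row of Table 1, but the remaining proportions are made up, and the function names are hypothetical.

```python
GRADES = ["A", "B", "C", "D", "F"]

def gpa(proportions, weights):
    """Weighted average of grade points, given grade proportions."""
    total = sum(proportions[g] for g in GRADES)
    return sum(weights[g] * proportions[g] for g in GRADES) / total

def relative_gpa(student_props, section_props, weights):
    """Student GPA minus the GPA of all sections the student took."""
    return gpa(student_props, weights) - gpa(section_props, weights)

# Only the A-rates are from Table 1; the other shares are invented.
student = {"A": 0.85, "B": 0.10, "C": 0.05, "D": 0.00, "F": 0.00}
sections = {"A": 0.49, "B": 0.30, "C": 0.12, "D": 0.05, "F": 0.04}

usual = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
example = {"A": 4, "B": 3.4, "C": 2.5, "D": 2, "F": 0}  # weights from the text

print(round(relative_gpa(student, sections, usual), 3))
print(round(relative_gpa(student, sections, example), 3))
```

A positive value means the student earned better grades than was typical for the sections they were in.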
Claim 1: Any linear assignment of grade point weights results in the same distribution of GPA, up to location and scale.

I discovered this while trying out different weights. <edit> Here's the proof outline. Imagine that we start with the usual 0, 1, 2, 3, 4 weights, and they give us a distribution of GPAs. If we add a constant c to get weights c, 1 + c, 2 + c, ..., this will shift the distribution (to the right if c is positive), but won't change its shape. Therefore, we can assume without loss of generality that the F weight is zero. Now imagine multiplying all the weights by a positive constant c to get 0, c, 2c, ... . This will change the span (standard deviation) of the distribution, but won't change the shape after it's divided by its standard deviation. So any increasing linear transformation of the usual 0, 1, ... weights gives us the same type of distribution (normal or not). Therefore any weights \( w_i = a + bi \) with \( b > 0 \) preserve the distribution of GPAs after centering and scaling. A consequence is that we can always assume F = 0 and A = 4, and let the intermediate values vary without losing generality. </edit>
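Claim 1 is easy to check numerically. The sketch below uses simulated grade counts (not real student data) and plain GPA rather than relative GPA, which is also a linear function of the weights. It standardizes the GPAs under the usual weights, under an arbitrary increasing linear transformation, and under the non-linear A-F weights reported later in the post. The helper names are hypothetical.

```python
import random
from statistics import mean, pstdev

def standardized_gpas(grade_counts, weights):
    """GPA per student, centered and scaled to unit standard deviation."""
    gpas = [
        sum(w * c for w, c in zip(weights, counts)) / sum(counts)
        for counts in grade_counts
    ]
    center, spread = mean(gpas), pstdev(gpas)
    return [(g - center) / spread for g in gpas]

def max_gap(u, v):
    """Largest absolute difference between two equal-length lists."""
    return max(abs(a - b) for a, b in zip(u, v))

random.seed(0)
# Simulated counts of F, D, C, B, A grades for 200 hypothetical students.
students = [[random.randint(1, 10) for _ in range(5)] for _ in range(200)]

usual = [0, 1, 2, 3, 4]                   # F, D, C, B, A
linear = [1.5 + 0.7 * w for w in usual]   # w_i = a + b*i with b > 0
nonlinear = [0, 2.80, 3.04, 3.25, 4]      # optimal A-F weights from the post

base = standardized_gpas(students, usual)
print(max_gap(base, standardized_gpas(students, linear)))     # rounding error only
print(max_gap(base, standardized_gpas(students, nonlinear)))  # clearly nonzero
```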

Given Claim 1, pinning F = 0 and A = 4 should not diminish the generality of solutions, so that's what I did.

Figure 3. Q-Q plots for (left) the usual integer (A= 4, B = 3...) weights, (middle) the optimal scale with just A, B, C, D, and F, and (right) the optimal scale including A+.

To assess different weights, I computed the sum of the squared differences between the cumulative distribution of GPAs and that of the normal distribution at each sample point. An optimization engine then tries out different combinations of weights to find the optimal one. These can be visually assessed using a Q-Q plot, where the empirical distribution is plotted in black and the red line is the ideal distribution. The three versions plotted in Figure 3 show that the usual integer weights don't give terrible results. Shifting the weights to A = 4, B = 3.25, C = 3.04, D = 2.80, and F = 0 improves the middle of the distribution (OptimalA), but there's still some divergence at the upper end. That's where adding the A+ as a separate grade helps. That version (OptimalA+) is nearly perfectly normal. The weights for that one are shown in Figure 4.
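A minimal sketch of that kind of objective function follows. The details are assumptions: the GPAs are standardized first, the empirical cumulative distribution uses the plotting positions (i + 0.5)/n, and the function name `normality_loss` is hypothetical. The linked R code may differ on each of these points.

```python
import math
from statistics import NormalDist, mean, pstdev

def normality_loss(values):
    """Sum of squared gaps between the empirical CDF and the normal CDF."""
    center, spread = mean(values), pstdev(values)
    z = sorted((v - center) / spread for v in values)
    n = len(z)
    nd = NormalDist()
    # (i + 0.5) / n is the empirical CDF at the i-th sorted point (assumed).
    return sum(((i + 0.5) / n - nd.cdf(x)) ** 2 for i, x in enumerate(z))

# Two synthetic samples: evenly spaced normal quantiles, and a skewed version.
n = 200
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [math.exp(x) for x in normal_like]

print(normality_loss(normal_like))  # close to zero
print(normality_loss(skewed))       # much larger
```

An optimizer would call a function like this on the relative GPAs produced by each candidate set of weights and keep the weights with the smallest loss.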

Figure 4. A visualization of the extended grade weights to produce a near-perfect normal distribution of grade averages. The black line is the usual integer scale, with vertical displacements of the labels to show differences.

Given Claim 1, the improvement in the GPA distribution's shape to near-normality is the result of the non-linear displacements (grades higher or lower than the dark line in Figure 4) and the addition of the A+ as a separate grade.

The optimization engine is sensitive to the initial guess for the parameters, and can converge to local minima that are not very good. The initial parameters for the output in Figure 4 were D = C = B = A = 3 points.
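One common way to reduce that sensitivity is to run the optimizer from several starting points and keep the best result. The sketch below uses a simple pattern search as a stand-in for the optimization engine, which the post doesn't name, and a toy one-dimensional objective with two local minima instead of the real weight-fitting problem. All names here are hypothetical.

```python
def local_search(objective, start, step=0.5, tol=1e-6):
    """Pattern search: try +/- step on each coordinate, halve step when stuck."""
    x = list(start)
    best = objective(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x.copy()
                trial[i] += delta
                value = objective(trial)
                if value < best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2
    return x, best

def multi_start(objective, starts):
    """Run local_search from every start and return the best (x, value)."""
    return min((local_search(objective, s) for s in starts), key=lambda r: r[1])

def bumpy(x):
    """Toy objective: a shallow minimum near x = 1, a deeper one near x = -1."""
    return (x[0] ** 2 - 1) ** 2 + 0.3 * x[0]

print(local_search(bumpy, [2.0]))            # stuck in the shallow minimum
print(multi_start(bumpy, [[2.0], [-2.0]]))   # finds the deeper one
```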


The weights in Figure 4 suggest the need to weight A+ grades differently than A grades. The gaps between A, B, and C are nearly equal, and since these are the most numerous grades, this explains why the usual 0-to-4 scale has a nearly normal distribution. The results suggest that there's not much difference between C and D grades, however.

Results from the difficulty-adjusted weights in Figure 4 compare favorably to the latent-variable approach in Figure 2:

Difficulty-Adjusted Scale:

F = 0, D = 1.2, C = 1.3, B = 1.9, A = 2.6, and A+ = 4.

Latent Scale:

F = 0, D = 0.6, C = 1.2, B = 1.9, A = 2.8, and A+ = 4.

The only notable difference between the two is the weight for the D grade.

This analysis is predicated on the assumption that student abilities measured by GPA should be normally distributed. There are usually selection effects during the admissions process that could challenge the symmetric distribution theory. However, the similarity of the integer-weighted GPA to a normal distribution is close enough to think it's a reasonable assumption. Additionally, the fact that the optimal grade points end up in the right order (A > B, etc.) is a sign of validity.

Source Code

You can find the R code for the A+ version of the difficulty-adjusted weights on GitHub here.

Predictive Validity

[Added 12/17/2021] I tried the different GPA scales as a predictor of success, using admittance to graduate school as the measure, because it's a high bar. The outcome data comes from the National Student Clearinghouse.

Figure 5. Smoothed rates of graduate school admittance by GPA metric over about 8000 students sorted from low to high. GPA is the usual GPA, wGPA uses the usual weights after accounting for course difficulty, and wpGPA uses the custom difficulty-adjusted weights found in the Discussion.

The squiggles in the canonical GPA curve seen in Figure 5 are an artifact of having few unique outcomes. The shape of that red line is due to the interpolating function attempting to fit a polynomial through points, and should not be taken as a continuous response. I left it that way as a warning against over-interpreting such curves.

Both difficulty-adjusted GPAs generate more unique values and are more amenable to this sort of averaging. The two curves are virtually identical, suggesting that changing the weights to make the distribution near-normal doesn't affect their utility much. Both are superior to the unweighted GPA if only because we have finer distinctions.

The re-weighting of GPA is a cute example of optimization, but it isn't worth the time for predicting this outcome. I also found no difference in predicting undergraduate graduation rates.

The overall shape of the curves in Figure 5 suggests two selection effects based on GPA: one by the student, who chooses to apply to graduate school or not, and one by the graduate schools in admitting applicants. The relationship between GPA and graduate school attendance is approximately piece-wise linear with a break at around 25%. A potential second break near the top of the scale is interesting, suggesting that some of the most academically qualified students don't apply to graduate or professional school.
