Thursday, November 14, 2019

Grade Consistency within Subjects

Introduction

A few days ago I posted graphs comparing within-subject grade agreement to between-subject agreement. This article improves on that idea. Since grades are capped (for us) at A = 4 points and are skewed toward the high end, subjects that grant higher grades on average will generally show more agreement.
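The ceiling effect can be made concrete with a small sketch. For any set of grades bounded between 0 and 4, the population variance cannot exceed (4 - mean) * mean (the Bhatia-Davis inequality), so the room for disagreement shrinks as the average approaches the cap. The helper name max_grade_sd is made up for illustration, and the bound applies to the population standard deviation rather than the sample version that sd() computes.

```r
# Largest possible (population) standard deviation for grades bounded on
# [lo, hi] with a given mean: var <= (hi - mean) * (mean - lo).
# max_grade_sd is a hypothetical helper, not part of the analysis code.
max_grade_sd <- function(grade_mean, lo = 0, hi = 4) {
  sqrt((hi - grade_mean) * (grade_mean - lo))
}

# Upper bound on spread at a few subject averages
max_grade_sd(c(2.0, 3.0, 3.5, 3.8))
```

The bound is 2.0 at an average of 2.0 and falls to roughly 0.87 at an average of 3.8, which is why raw agreement flatters high-grading subjects.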

In other words, to really understand agreement, we have to take the grade average into account. One theoretical approach is to use a kappa-type statistic that compares actual agreement to chance agreement. But these statistics are problematic and hard to explain to non-experts. Instead, we'll just consider the issue two-dimensional and look at variation and average in the same picture. The necessary input data comprises at least: StudentID, Subject, and Points, for each course you want to include, where Points is (for us) a 0-4 mapping of letter grades A, B, ..., F with A = 4.0.
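For readers who want to try this on their own data, here is a minimal sketch of the expected input shape. The rows are invented, and the Grade column and point_map lookup are assumptions for illustration; a real registrar extract will likely need plus/minus grades and withdrawals handled as well.

```r
# Hypothetical toy data: one row per student per course.
grades <- data.frame(
  StudentID = c(1, 1, 1, 2, 2, 2, 3, 3),
  Subject   = c("BIO", "BIO", "CHM", "BIO", "BIO", "BIO", "CHM", "CHM"),
  Grade     = c("A", "B", "A", "C", "B", "B", "A", "F"),
  stringsAsFactors = FALSE
)

# Map letter grades to points with A = 4.0
# (assumed mapping; plus/minus grades are not handled here).
point_map <- c(A = 4, B = 3, C = 2, D = 1, F = 0)
grades$Points <- unname(point_map[grades$Grade])

str(grades)
```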

Implementation Code


Here's the function.
#' Calculate subject grade variation
#' @description Given a set of grade data, find the average variation in assigned grades 
#' to individual students who take multiple courses in a subject.  
#'
#' @param grades A dataframe with StudentID of the student, Subject code, 
#' e.g. "BIO", and Points = grade points on four-point scale
#' @param .min_classes The minimum number of classes a student needs in the subject 
#' before we include them in the statistics.
#' @return A dataframe with means and standard deviations by subject
#' @export
#'
subject_grade_stats <- function(grades, .min_classes = 1){
  
  grades %>%
    group_by(Subject, StudentID) %>%
    summarize(N = n(),
              GradeSD = sd(Points, na.rm = TRUE),
              GradeMean = mean(Points, na.rm = TRUE)) %>%
    filter(N >= .min_classes) %>%
    summarize(GradeSD = weighted.mean(GradeSD,N, na.rm = TRUE),
              GradeMean = weighted.mean(GradeMean,N, na.rm = TRUE),
              NGrades = sum(N),
              NStudents = n())
  
}

This takes advantage of the convenient way that dplyr groups data. The first summarize() drops the StudentID level of the grouping, and the second one acts on the result, which is still grouped by Subject.
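The grouping behavior can be checked on a small example. This is a sketch on invented data, and it assumes dplyr 1.0.0 or later, where the .groups argument lets you state the default "drop the last grouping level" behavior explicitly.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
toy <- data.frame(
  StudentID = c(1, 1, 1, 2, 2, 2),
  Subject   = c("BIO", "BIO", "BIO", "BIO", "BIO", "CHM"),
  Points    = c(4, 3, 4, 2, 3, 4)
)

# First summarize(): one row per (Subject, StudentID) pair.
# .groups = "drop_last" spells out the default (assumes dplyr >= 1.0.0).
per_student <- toy %>%
  group_by(Subject, StudentID) %>%
  summarize(N = n(), GradeSD = sd(Points), .groups = "drop_last")

group_vars(per_student)  # only Subject remains as a grouping variable

# Second summarize(): one row per Subject.
per_subject <- per_student %>%
  summarize(NStudents = n(), NGrades = sum(N))

per_subject
```

Note that a student with a single course in a subject gets an NA standard deviation from sd(); with na.rm = TRUE, weighted.mean() skips those rows along with their weights, which is one reason to set the minimum number of classes to at least two.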

We can map the output to a plot.

Results


For the blog, I'll anonymize the subjects. The code below filters the statistics data to programs with at least 50 students in the data set and then plots average standard deviation versus average grade assigned in the subject.
library(tidyverse) # or just dplyr and ggplot2
library(ggrepel)   # optional, for avoiding overlapped labels

# get the grade stats for students with >2 courses in a subject, filter N >= 50 students
grade_stats <- subject_grade_stats(grades, 3) %>% filter(NStudents >= 50)

# anonymize for blog: replace subject codes with integer labels
grade_stats$Subject <- as.factor(grade_stats$Subject) %>% as.integer()

# plot standard deviation vs mean and label the subject
ggplot(grade_stats, aes(x = GradeMean, y = GradeSD, label = Subject)) +
  geom_smooth(method = "lm", se = FALSE, color = "gray", linetype = "dashed") +
  geom_label_repel() + # optional: requires library(ggrepel) 
  theme_minimal()

That produces the plot below.


As expected, the variation within the grades, measured by the average standard deviation of grades assigned to an individual student, is negatively correlated with the average grade assigned for the subject. The regression trend line (unweighted) is plotted for reference.

Subjects above the reference line show more variation than expected, meaning that students who take multiple courses within the discipline (at least three in this graph) receive more dispersed grades after accounting for the overall average. This could be because the discipline comprises different kinds of learning, or it could be because the students who enroll in that subject have more variability in their grade-earning ability, or it could be that the instructors inherently don't agree about grading practices.
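One way to put numbers on "above the line" is to fit the same unweighted linear trend and look at the residuals. The sketch below uses invented subject-level values, not my data, and the Residual column name is an assumption for illustration.

```r
# Hypothetical subject-level statistics, for illustration only.
grade_stats <- data.frame(
  Subject   = 1:6,
  GradeMean = c(2.6, 2.9, 3.1, 3.3, 3.5, 3.7),
  GradeSD   = c(0.95, 0.70, 0.85, 0.60, 0.40, 0.45)
)

# Unweighted least-squares trend, the same kind of fit that
# geom_smooth(method = "lm") draws.
trend <- lm(GradeSD ~ GradeMean, data = grade_stats)

# Positive residual: more within-student variation than the trend predicts.
grade_stats$Residual <- unname(resid(trend))

# Subjects sorted from most to least excess variation
grade_stats[order(-grade_stats$Residual), ]
```

Sorting by the residual gives a simple ranking of subjects by excess variation, though with only a few dozen subjects the ranking should be read loosely.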

In my data, there is a tendency for foreign languages to have smaller-than-expected variation in grades assigned. The Art grade averages fall near the middle and show greater variation than the trend line, which is to be expected: art history is very different from design, which is very different from media skills like pottery or photography.

Two of the subjects, located at #8 and #20, have significantly different statistics than similar disciplines. One of these was the subject I identified in the prior post as an outlier.

Discussion

The assessment of grade reliability presented here is simple and (I think) easily explained. I like it better than my first attempt, although the graphs were prettier in that version. What's missing here is the additional validation obtained by comparing within-subject grades to between-subject grades. My method for doing that in the other analysis was to create all the combinations of classes between two subjects. For example, if a student took three BIO courses and two CHM courses, the number of unique combinations is an outer product: six of them in the example. This is probably not the best way to proceed, and something more like an ANOVA may be smarter.
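The combination count in that example can be sketched in a few lines of base R. The grade values are invented, and the absolute-difference column is only one assumed way to compare a pair; it is not necessarily what the earlier analysis computed.

```r
# Hypothetical grades for one student: three BIO courses and two CHM courses.
bio <- c(4, 3, 3)
chm <- c(2, 4)

# Every BIO grade paired with every CHM grade: 3 x 2 = 6 combinations.
pairs <- expand.grid(BIO = bio, CHM = chm)
nrow(pairs)

# One possible comparison per pair (an assumption): absolute grade difference.
pairs$Diff <- abs(pairs$BIO - pairs$CHM)
pairs
```

The number of pairs grows with the product of the course counts, so students with many courses in both subjects contribute disproportionately, which is part of why an ANOVA-style approach may be the better route.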
