Friday, November 08, 2019

Course Grades as Data

Introduction

To borrow a phrase from John Barth, within assessment circles there is a pernicious enthymeme that grades don't matter. Course grades are discounted outright as useful data about learning, or they are relegated to the purgatory of "indirect evidence." This ban is one of the data purity rules that also excludes surveys and really anything that is not:

  • tied to a specific piece of student work, and
  • classified by grading, rubric, etc. in an approved manner. 
This is a standardized testing approach that retains only part of the standardization: (1) common point-in-time student work, and (2) a similar-looking rating method. I say "similar-looking," because a real standardized approach would evaluate reliability to ensure that the ratings have some statistical stability. This is rarely done in assessment practice. If it were done, it would reveal that the reliability is quite low most of the time.

There are standard rhetorical objections to using grades, such as (1) lack of specificity in what kind of learning is being measured, (2) confounding of learning with, e.g., effort, and (3) low reliability due to non-standard grading practices. These are rhetorical objections, not empirical ones. The assessment community seems to be largely allergic to using data to support claims like this. The objection about reliability is particularly ironic, given the low reliability of the data in common use, like papers regraded with rubrics.

The purpose of this post is to explore some of the ways in which grades are useful in understanding learning, including how grades analysis can lead to improvements in programs and practices. 

Data Qualities

First, let's take an inventory of what we have. It's useful to compare the usual data from program assessments to course grades.
  • Assessment data
    • many small sets of unique data types
    • large variation in how the classifications of student work were made
    • large variation in the nominal encodings of the classifications (e.g. test scores, rubric ratings, etc.)
    • usually anonymous (not tied to student ID)
    • often gathered only in upper-level courses, maybe only in a capstone experience
  • Course grades
    • large historical set of records stretching back years
    • large variation in how classifications were made (different grading practices)
    • common scoring encodings: usually A,B,C, etc.
    • grades are linked to student IDs, so they can be studied in context
    • grades are captured during a student's entire history, including transfer-in credit
To put some numbers on this, a hypothetical college with 5000 students, where students take 10 courses per year on average, will generate 50,000 data points per year via course grades. The same college may have 100 academic programs, each with (optimistically) five stated learning outcomes. Nationally, the median program size (by graduates per year) is 11 after dropping the zeros, so this is rounding up. If all 11 are assessed on all five outcomes, we have annually*:
  • 100 x 5 = 500 individual data types that cannot realistically be aggregated for statistical purposes, generating
  • 5,500 data points per year, unlinked to other aspects of student demographics or histories.
So in the absolute best case, where every program assesses every outcome all the time, we still only have 10% of the data that grades are giving us, and moreover the individual pools of data from assessment are impossible to analyze comprehensively; we're stuck with small samples, analyzed in a hurry (we have 500 of them!), with poor controls on quality, if there are any at all. 
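
To make the back-of-envelope comparison concrete, here is the same arithmetic as a few lines of R. The counts are the hypothetical figures from above, not real institutional data.

    # Hypothetical figures from the text
    grade_records_per_year <- 5000 * 10        # students x average courses per year = 50,000
    assessment_data_types  <- 100 * 5          # programs x stated outcomes = 500
    assessment_records     <- assessment_data_types * 11  # x median graduates per program = 5,500
    assessment_records / grade_records_per_year            # about 0.11, i.e. roughly 10%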

The last bullet points in the parallel lists above are particularly important. Most assessment programs can't tell us anything about why students don't complete, because they are focused on the end product. Students who drop out along the way are--astonishingly--invisible to the usual assessment methods.

With this perspective, we'd be foolish to ignore grades as data. In fact, we should begin with grades since they are freely available. 

*this implies that about 1000/5000 students graduate per year, a reasonable figure for a selective institution. 

Research Questions

What kinds of questions might course grades answer? Here are a few.
  • What predicts academic success in gateway courses?
  • What is the effect of grades on persistence in a subject, e.g. becoming a major or not, graduating in that major or not?
  • How reliable are grades at the institution? Within each major?
  • How are grades related to other data on learning?
  • Can we detect broad learning types through analysis of grades? For example, can we distinguish humanities-type skills from math-type skills from others?
  • How well do grades predict student retention to graduation?
  • Are high/low grades associated with a sense of belonging at the institution?
I'll address most of these over the next few articles. Let me start at the beginning, with reliability. If grades have too much unexplained variance, then there isn't much we can do with them.

Reliability

Here's an easy check you can do at your institution. For the most recent three or four graduating classes, get:
  • each graduate's student ID and first year GPA (FYGPA), and
  • each graduate's student ID and GPA for courses taken after the first year (SYGPA).
Join these two by the ID to get columns (ID, FYGPA, SYGPA). Compute the correlation of the latter two columns. At my institution, it's .79. This is a pretty high number, indicating that there is stability over time in grade averages per student. It's a simple measure of reliability that's easy to explain to others. This is an important fact to know, because it implies that there is a student trait we might call Academic Ability, and that it's persistent over time. This might lead us to conclude that we should account for academic ability when we examine any type of learning data, like discipline assessment data. Even if we have good data for the program, if we don't take into account the types of students in that program, we won't get the full picture of what's going on.
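
Here's a minimal R sketch of that check. It assumes a data frame called grades with one row per student per course, and the column names (ID, Year, GradePoints, CreditHours) are hypothetical; substitute whatever your registrar's export actually uses.

    # Minimal sketch of the FYGPA vs. SYGPA reliability check.
    # Assumes `grades` has columns ID, Year (1 = first year), GradePoints, CreditHours.
    library(dplyr)
    library(tidyr)

    gpa_split <- grades %>%
      mutate(period = if_else(Year == 1, "FYGPA", "SYGPA")) %>%
      group_by(ID, period) %>%
      summarize(gpa = weighted.mean(GradePoints, CreditHours), .groups = "drop") %>%
      pivot_wider(names_from = period, values_from = gpa)

    # Pearson correlation between first-year and later GPA per graduate
    cor(gpa_split$FYGPA, gpa_split$SYGPA, use = "complete.obs")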

A more involved calculation lets us compare reliability within and between disciplines. Here we correlate grades grouped by subject code, e.g. Biology course grades with Chemistry course grades. We would expect to see patterns like these:
  • Grades within a single subject (e.g. Biology 101 to Biology 102) should correlate more strongly than grades across disciplines (e.g. Biology 101 to English 101).
  • Disciplines with more regimented subject matter, such as foreign languages and math, should show higher internal correlations than history and art, because of the less structured curricula in the latter (depending on local curriculum design, of course).
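
One way to compute both kinds of correlation in R is sketched below. This is not necessarily the exact calculation behind the figure that follows; it's a plausible variant that uses per-student subject averages for between-subject correlations and a crude random split-half for within-subject reliability. The column names (ID, Subject, GradePoints) are again hypothetical.

    library(dplyr)
    library(tidyr)

    # Between-subject: per-student mean grade in each subject, then pairwise
    # correlations across students who took both subjects.
    subj_means <- grades %>%
      group_by(ID, Subject) %>%
      summarize(mean_grade = mean(GradePoints), .groups = "drop") %>%
      pivot_wider(names_from = Subject, values_from = mean_grade)

    between_cor <- cor(select(subj_means, -ID), use = "pairwise.complete.obs")

    # Within-subject: randomly split each student's grades in a subject into
    # two halves and correlate the half-averages (a crude split-half measure).
    set.seed(1)
    within_cor <- grades %>%
      group_by(ID, Subject) %>%
      filter(n() >= 2) %>%
      mutate(half = sample(rep(1:2, length.out = n()))) %>%
      group_by(ID, Subject, half) %>%
      summarize(mean_grade = mean(GradePoints), .groups = "drop") %>%
      pivot_wider(names_from = half, values_from = mean_grade, names_prefix = "half_") %>%
      group_by(Subject) %>%
      summarize(r = cor(half_1, half_2, use = "complete.obs"))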

Figure 1. Internal (red) and between-subject (gray) correlations of course grades by subject. The subject names are redacted.


The original plot has markers on the horizontal axis to tell us which academic program is represented on that vertical line. I left those off for this post. The red dots in figure 1 show the internal reliability of course grades, measured by Pearson correlation. This is a fairly crude way to measure reliability, but straightforward to calculate. The gray dots are the correlations between that discipline and each of the other disciplines.

The general pattern is that, as expected, highly-structured curricula have higher internal reliability. Since the red dots appear mostly above the gray dots, we can conclude that within-discipline reliability is generally higher than between-discipline reliability. This in turn implies that different types of learning are occurring--something that I'll come back to in a moment.

The point circled in orange is an outlier. It is a subject with a highly-structured curriculum, but with grade reliability that appears too low. This is an example of how analysis of grades can scan the curriculum and home in on a possible improvement.

Means and Variances

Correlations between grades are constrained by the average grade assigned because of ceiling effects. A nice complement to the previous graph is one that shows means and standard deviations of grades within each subject. 
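
The inputs to such a graph are straightforward to compute. A sketch, again assuming a grades data frame with hypothetical columns Subject and GradePoints:

    library(dplyr)
    library(ggplot2)

    # Mean and standard deviation of grade points by subject
    subject_summary <- grades %>%
      group_by(Subject) %>%
      summarize(mean_grade = mean(GradePoints),
                sd_grade   = sd(GradePoints),
                n          = n())

    # Scatter of the two moments per subject
    ggplot(subject_summary, aes(x = mean_grade, y = sd_grade)) +
      geom_point() +
      labs(x = "Mean grade points", y = "SD of grade points")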

Figure 2. Subject mean grades and standard deviations, annotated with internal consistency relative to the average. 

The graph shows individual disciplines by their means and standard deviations, with the internal correlations plotted as a difference from the average. Notice that the subjects with negative numbers (lower than average internal consistency) tend to have higher grades. The outlier is identified in the orange box, indicating that part of the issue is that the grades are probably too high in comparison to similar disciplines.

Inducing Learning Outcomes

The correlations between subject grades can be visualized in a network graph. The graph you get depends on how the thresholds are set (low threshold = many connections). Generally, academic ability drives most of the variation in grades, but after we factor that out, discipline effects become visible.
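
Here's a sketch of that network construction in R using igraph, building on the between_cor matrix from the earlier sketch. The 0.5 threshold is arbitrary, chosen only to illustrate the thresholding step, and this version uses raw between-subject correlations; partialing out overall GPA first, as described above, would be a natural refinement. It is not necessarily the exact recipe behind the figure below.

    library(igraph)

    threshold <- 0.5                      # arbitrary; lower it to see more edges
    adj <- between_cor
    adj[is.na(adj)] <- 0                  # subjects with no overlapping students
    adj[abs(adj) < threshold] <- 0        # drop weak connections
    diag(adj) <- 0                        # no self-loops

    g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
    plot(g, edge.width = 2 * abs(E(g)$weight), vertex.label.cex = 0.8)
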
Figure 3. Visualization of course grade correlations between subjects. 


The network shows that mathematics and history arguably comprise two distinct skill sets that are generally useful to other disciplines. We could add Arts, which I left off here; it's not strongly connected with any of the disciplines shown.

This quick analysis produced a university-wide map of learning outcomes from course grade data. With this basis for understanding, we can take the analysis even further, which I'll show next time.


Summary


We're only getting started here, but it should already be obvious that course grade data, far from being irrelevant to understanding student learning, is an essential data source. The rhetorical devices that are used to justify the proscription on grades from accreditation reporting are just that: mere rhetoric. When we let the data speak for itself, the value is obvious.

I haven't posted the R code this time, because it's fairly long. Contact me if you want it.


Update: I found a pdf of Giles Goat-Boy and located the apposite quote (page 42 of the pdf):
Doubtless Max saw then as clearly as I did later the ruesome enthymeme hanging like an echo in his pause. 
So it's ruesome rather than pernicious, as I had remembered. Both work in the context of this article.
