Thursday, October 26, 2023

Assessment Institute 2023: Grades and Learning

I'm scheduled to give a talk on grade statistics on Monday, reviewing the work in the lead article of JAIE's special issue on grades. The editors not only accepted my deep dive into the statistics of course grades and their relationship to other measures of learning; they also recruited others in the assessment community to respond. I got to have the final word in a synthesis. 

That effort turned into a panel discussion at the Institute, which will follow my research presentation. The forum is hosted by AAC&U and JAIE and includes many of the authors who commented in the special issue. One reason for this post is to provide a stable hyperlink to the presentation slides, which are heavily annotated. The file is saved on researchgate.net.

If you want to cite the slides in APA format, you can use:

Eubanks, D. (2023, October 26). Grades and Learning. ResearchGate. https://www.researchgate.net/publication/374977797_Grades_and_Learning

The panel discussion will also have a few slides, but you'll have to get those from the conference website. 

Ancient Wisdom

As I've tracked down references in preparation for the conference, I've gone further back in time and would like to call out two books by Benjamin Bloom, whose name you'll recognize from taxonomy fame. 

The first of these was published more than sixty years ago.

Bloom, B. S., & Peters, F. R. (1961). The use of academic prediction scales for counseling and selecting college entrants. Crowell-Collier.
 
It starts off with a bang. 
 
The main thesis of this report is that there are three sources of variation in academic grades. One is the errors in human judgment of teachers about the quality of a student's academic achievement. Testers have over-emphasized this source of variation and have tended to view grades with great suspicion. Our work demonstrates that this source of variation is not as great as has been generally thought and that grade averages may have a reliability as high as +.85, which is not very different from the reliability figures for some of the best aptitude and achievement tests.
 
The treatment then goes on to describe the typical correlations between high school and college grades of around .6, which is nearly exactly what I see at my institution (it's slowly declining over time). The methods include grade transformations to improve predictiveness. That's still a useful topic, whether predicting college success or building regression models with learning data. 
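
To make the transformation idea concrete, here is a minimal sketch in Python (pandas) of one simple approach: standardizing high school grades within each sending school before correlating them with college GPA. The file and column names are hypothetical, and this is not Bloom's method, just an illustration of the kind of transformation that can improve predictiveness.

```python
import pandas as pd

# Hypothetical file with one row per entering student:
# hs_id (high school), hs_gpa, college_gpa
df = pd.read_csv("entrants.csv")

# Raw correlation between high school and college grades (around .6 in my data).
print("raw r:", df["hs_gpa"].corr(df["college_gpa"]))

# One simple transformation: standardize high school GPA within each
# sending school, so an A from a lenient school is weighted differently
# than an A from a strict one.
df["hs_gpa_z"] = df.groupby("hs_id")["hs_gpa"].transform(
    lambda g: (g - g.mean()) / g.std()
)
print("transformed r:", df["hs_gpa_z"].corr(df["college_gpa"]))
```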

The second book is:

Bloom, B. S. (1976). Human characteristics and school learning. McGraw-Hill.
 
 From page 1:

This is a book about a theory of school learning which attempts to explain individual differences in school learning as well as determine the ways in which such differences may be altered in the interest of the student, the school, and ultimately, the society.
 
The ideas here are more sophisticated than what we find in the advice books on assessment practice in that the focus is on "individual differences," not just group averages. This turns out to be key to understanding the data we generated and is reviewed in the Assessment Institute presentation. I wish I had read this book before having to reinvent the same ideas. That's partly my own limitation, because I don't have a background in education research. And I increasingly realize that assessment practice--at least all those program reports we write for accreditors--doesn't resemble research at all. Why is that?

Assessment Philosophies

 
There's an upcoming issue of Assessment Update (December 2023) that will feature several articles critiquing the report-writing focus that occupies so much of assessment practice. In it I'll describe my understanding of the history of that practice and why it fails to deliver on its promise. A big part of the reason is the disconnection between report-writing and the theory and practice of educational research. Most recently we can see it in the lack of impact that the Denning et al. work has had on assessment practice or accreditation reporting. That work should have caused a lot of soul-searching, but it's crickets. The conference program book doesn't mention Denning, and although there are several mentions of graduation rates, nothing directly links grades, graduation, and learning (other than my own presentation). When I've had occasion to talk to accreditation VPs, they seem not to have heard of it either. Nor was there a discussion on ASSESS-L. 

I recently came across Richard Feynman's quote that "I would rather have questions that can't be answered than answers that can't be questioned." Here are the two ways of looking at assessment that correspond to those cases.

1. Product Control by Management

I think the gap between research on learning and assessment practice comes from the founding metaphor behind the define-measure-improve idea that drives assessment reports. As I'll mention in the Update article, that formula for continual improvement is a top-down management idea (e.g., related to Six Sigma) that prioritizes control over understanding the world as it is. The impetus came from a federal recommendation in 1984 and turned into accreditation standards that cause universities to set up assessment offices that tell faculty how to do the work. There's a transparent flow from authority to action. Although educational research appears in the formula, it's not the priority, and in practice it's impossible to do actual research on every program and every learning outcome at a university.

So assessment has become a top-down phenomenon of product control (the product being learning outcomes), with the emphasis on control. This led to the proscriptions on grades and other "answers that can't be questioned."

2. Educational Research

Managing hundreds of SLO reports for accreditors is soul-destroying, and the work that many of us prefer to do is to (1) engage with faculty members on a more human level, leaning on their professional judgment in combination with ideas from published research, or (2) do original research on student learning and other outcomes.

The research strand has a rich history and a lot of progress to show for it. The assessment movement's early days, before it was co-opted by accreditation, led to the Scholarship of Teaching and Learning and a number of pedagogical advances that are still in use. 

Research often starts with preconceived ideas, theories, and models of reality that frame what we expect to find. But crucially, all of those ideas have to eventually bend to whatever we discover through empirical investigation. There are many questions that can't be answered at all, per Feynman. It's messy and difficult and time-consuming to build theories from data, test them, and slowly accumulate knowledge. It's bottom-up discovery, not top-down control.

What's new in the last few years is the rapid development of data science: the ease with which we can quickly analyze large data sets with sophisticated methods. 

The Gap

There's a culture gap between top-down quality control and bottom-up research. I think that explains why assessment practice seems so divorced from applicable literature and methods. I also don't think there's much of a future in SLO report-writing, since it's obvious by now that it's not controlling quality (see the slides for references besides Denning). 

This is a long way around to explaining why all those cautions about using grades have evolved. Grades are mostly out of the control of authorities, so it was necessary to create a new kind of grade that was under direct supervision (SLOs and assessments). It's telling that one of the objections to grades is that they might be okay if the work in the course is aligned to learning outcomes. In other words, if we put an approval process between grades and their use in assessment reports, it's okay--it's now subject to management control.

Older Posts on Grades

You might also be interested in these scintillating thoughts on grades:

Update 1

The day after I posted this, IHE ran an article on a study of placement: when should a student be placed in a developmental class instead of the college-level introductory class? You can find the working paper here. From the IHE review, here's the main result:

Meanwhile, students who otherwise would have been in developmental courses but were “bumped up” to college-level courses by the multiple-measures assessment had notably better outcomes than their peers, while those who otherwise would have been in college-level courses but were “bumped down” by the assessment fared worse.

Students put in college-level courses because of multiple measures were about nine percentage points more likely than similar students placed using the standard process to complete college-level math or English courses by the ninth term. Students bumped up to college-level English specifically were two percentage points more likely than their peers to earn a credential or transfer to a four-year university in that time period. In contrast, students bumped down to developmental courses were five to six percentage points less likely than their peers to complete college-level math or English.

Note that the outcome here is just course completion. No SLOs were mentioned. Yet the research seems to lead to plausible and significant benefits for student learning. I point this out because, at least in my region, this effort would not check the boxes of an accreditation report for the SLO standard. It fails the first checkbox: there are no learning outcomes!

If you want to read my thoughts on why defining learning outcomes is over-rated, see Learning Assessment: Choosing Goals.


Update 2

I had to race back from the conference to finish up work on an accreditation review committee, and I'm finally catching my breath from a busy week. Here are my impressions from the two sessions at the 2023 Assessment Institute.
 
 
The first session was my research talk, which was well attended and had great participation. I went through most of the slides to highlight empirical links between grades, learning, course rigor, student ability, and development paths in learning data. One question I got was about using the word "rigor" or "difficulty" to describe the regression coefficient for a class section, which is really just a positive or negative displacement from the average expected grade. In the literature there's a more neutral term, "lift," which is better. I have avoided using "lift" only because it's one more thing to explain, but I think that was an error. 
 
Lift is a better term because it doesn't come with the meanings we associate with rigor or difficulty. The question was about the potential difference between a course that's academically challenging, which might be good for learning engagement, and one that's simply poorly taught. I don't think we can tell the difference directly with the grade analysis I'm using (random effects models with intercepts for students and course sections). That's a great opportunity for future research--what additional data do we need, or is there some cleverer way of using grade data, like a value-added model by instructor? The Insler et al. piece referenced in my slides might be a starting point.
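
For readers who want to experiment with their own transcripts, here is a minimal sketch of the lift idea in Python (pandas). It is not the full crossed random-effects model from the slides, just a first-pass approximation that treats a section's lift as the average deviation of its grades from what its students earn elsewhere. The file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical transcript file: one row per (student, course section), with
# columns student_id, section_id, grade_points (0-4 scale).
df = pd.read_csv("transcript.csv")

# Rough ability proxy: each student's mean grade across all of their sections.
df["student_mean"] = df.groupby("student_id")["grade_points"].transform("mean")

# A section's lift is the average amount its grades sit above (positive) or
# below (negative) what its students typically earn elsewhere. This roughly
# approximates the section intercepts from a students-by-sections random
# effects model.
df["deviation"] = df["grade_points"] - df["student_mean"]
lift = df.groupby("section_id")["deviation"].mean().sort_values()

print(lift.head())  # largest negative lift: hardest-graded sections
print(lift.tail())  # largest positive lift: most leniently graded sections
```

A proper treatment would estimate student and section effects jointly and shrink the estimates toward zero for small sections, but the sketch conveys what lift measures: a displacement from the expected grade, with no judgment attached about why.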
 
A metaphor that nicely contrasts the research I summarized with the state of SLO compliance reports is horizontal versus vertical. The SLO reports are vertical in the sense that they chop up the curriculum into SLOs that are independently analyzed; there's no unifying model that explains learning across learning outcomes, and consequently few opportunities to find solutions that apply generally instead of per-SLO. So we get a lot of findings like "critical thinking scores are low, so we'll add more of those assignments," and not general improvements to teaching and learning that might affect many SLOs, students, and instructors at once. The latter is the horizontal approach that emerged from the research on grades. 

Specifically, it's clear that GPA correlates with learning outcomes across the curriculum, so it's more efficient to think about how students with different ability levels or levels of engagement navigate the curriculum than it is to look at one SLO at a time and hope to catch that nuance. Additionally, improving one instructor's teaching methods can affect a lot of SLOs and students. 

One of the questions got to that point--it was about how a rubric might be used to disambiguate aspects of an SLO, perhaps in contrast to the more general nature of what grades tell us. In the vertical approach, the assessment office would conceivably have to oversee the administration of dozens or hundreds of such rubrics--kind of what we are asked to do now. In the horizontal approach, we put the emphasis on supporting faculty to take advantage of new teaching methods in all their classes. This entails supporting faculty development and engaging faculty as partners rather than supervising them, which is contrary to the SLO report culture, but it's bound to be more effective. 

Some of the questions were important but too high level to really address from the results of the empirical studies, because of the layers of judgment and politics involved. For example, how can the empirical results be used to inform a well-structured curriculum? I think this is quite possible to do, but it partly depends on the culture in an academic program and its collective goals. In principle, we could aim to make all courses in a curriculum have the same lift (i.e., difficulty; see above), but that may not be what a department wants or needs. Perhaps we need some easier courses so that students can choose a schedule that's manageable. We're not having those conversations yet, but they are important. Should some majors be easier than others to permit pathways for lower-GPA students? If so, how do we ensure that these aren't implicitly discriminatory? It gets complicated fast, but if we don't have data-informed conversations, it's hard to get beyond preconceptions and rhetoric.

The panel forum was the culmination of several years of work on my part and, perhaps more importantly, of the vision and execution of the editors of JAIE. Mark was there to MC the panel, which largely comprised authors who responded to my lead article on grades with their own essays, found in the same issue. Kate couldn't make it to the panel, but she deserves thanks for making space for us on the AAC&U track at the conference. We didn't get a chance to ask her what she meant by "third rail" in the title. Certainly, if you have tried to use course grades as learning assessments in your accreditation reports, you run the risk of being shocked, so maybe that's it.

Peter Ewell was there! He's been very involved with the assessment movement since the beginning and has documented its developments in many articles. I got a chance to sit down with him the day before the panel and discuss that history. My particular interest was a 1984 NIE report that recommended that accreditors adopt the define-measure-improve formula that's still the basis for most SLO standards, nearly forty years later. My short analysis of the NIE report and its consequences will appear in the December 2023 issue of Assessment Update. Peter led that report's creation, so it was great to hear that I hadn't missed the mark in my exegesis. We also chatted about the grades issue, how the prejudice got started (an accident of history), and the ongoing obtuseness (my words) of the accrediting agencies. 

One of my three slides in my introductory remarks for the panel showed the overt SACSCOC "no grades" admonition found in official materials. I could have added another example, taken from a Q&A with one of the large accreditors (lightly edited):

Q: [A program report] listed all of the ways they were gathering data. There were 14 different ways in this table. All over the place, but course grades weren't in the list anywhere, and I know from working in this for a long time that there has been historically among the accreditors a kind of allergy to course grades. Do you think that was just an omission, or is there active discouragement to use course grades?

A: We spend a considerable amount of time training institutions around assessment, and looking at both direct and indirect measures, and grades would be an indirect measure. And so we encourage them to look at more direct ways of assessing student learning.

It's not hard to debunk the "indirect" idea--there's no intellectual content there, e.g. no statistical test for "directness." What it really means is "I don't like it" and "we say so."

I should note, as I did in the sessions, that this criticism of accreditor group-think is not intended to undermine the legitimacy of accreditation. I'm very much in favor of peer accreditation, and have served on eleven review teams myself. It works, but it could be better, and the easiest thing to improve is the SLO standard. Ironically, it means just taking a hard look at the evidence.

I won't try to summarize the panelists. There's a lot of overlap with the essays they wrote in the journal, so go read those. I'll just pull out one thread of conversation that emerged, which is the general relationship between regulators and reality. 

A large part of a regulator's portfolio (including accreditors, although they aren't technically regulators) is to exert power by creating reality. I think of this as "nominal" reality, which is an oxymoron, but an appropriate one. For example, a finding that an instructor is unqualified to teach a class effectively means that the instructor won't be able to teach that class anymore, and might lose his or her job. An institution that falls too short may see its accreditation in peril, which can lead to a confidence drop and loss of enrollment. These are real outcomes that stem from paperwork reviews.

The problem is that regulators can't regulate actual reality, like passing a law to "prove" a math theorem, or reducing the gravitational constant to lower the fraction of overweight people in the population. This no-fly zone includes statistical realities, which is where the trouble starts for SLO standards. Here's an illustration. A common SLO is "quantitative reasoning," which sounds good, so it gets put on the wish list for graduates. Who could be against more numerate students? The phrase has some claim to existence as a recognizable concept: using numbers to solve problems or answer questions. But that's far too general to be reliably measured, which is reality's way of indicating a problem. Any assessment has to be narrow in scope, and the generalization from a collection of narrow items to some larger skill set has to be validated. In this case, the statement is too broad to ever be valid, since many types of quantitative reasoning depend on domain knowledge that students can't be assumed to have. 

In short, just because we write down an SLO statement doesn't mean it corresponds to anything real. Empiricism usually works the other way around, by naming patterns we reliably find in nature. As a reductio ad absurdum, consider the SLO "students will do a good job." If they do a good job on everything they do, we don't really need any other SLOs, so we can focus on just this one--what an advance that would be for assessment practice! As a test, we could give students the task of sorting a random set of numbers, like 3, 7, 15, -4, 7, 25, 0, 1/2, and 4, and score them on that. Does that generalize to doing a good job on, say, conjugating German verbs? Probably not. If we called the SLO "sorting a small list of numbers" we would be okay, but "doing a good job" is a phantasm.

So you can see the problem. When regulators require wish lists that probably don't correspond to anything real, we get some kind of checkbox compliance, and worse: a culture devoted to the study of generalized nonsense.

Two Paths to Relevance

I am always impressed at these conferences by how nice the assessment community is. There is enormous potential energy there to do good in the world, and my thesis throughout these talks was that their work is impeded by outdated ideas forced on us by accreditation requirements. 

I don't think the current resource drain caused by the compliance culture is sustainable, and it behooves us to look for alternatives in order for assessment offices to stay relevant in a world of tighter budgets and ChatGPT-generated assessment reports. As I describe more fully in the upcoming Assessment Update, and as I mentioned at the conference, I think there are two paths forward. One is to do more data science on student learning and success, extending the type of work that IR offices do. I don't think we need a lot of these, assuming that we can generalize results. For example, if there's demand for sophisticated grade statistics like the ones I presented, software vendors will line up to sell them to you, and that's probably cheaper than hiring someone to build the software from scratch. 

The second strand is where we probably need the most effort, and that's to extend the successful work that assessment offices already do to engage with faculty members on improvement projects. If we abandon the SLO reporting drill, the reason for faculty resentment vanishes, and we can spend time reading the literature on pedagogy, becoming familiar with discipline standards (like ACTFL), and, instead of relying on dodgy data, engaging in a collaborative and trusting relationship with faculty, relying on whatever combination of organic data (like final exam scores, course grades, and rubrics where they already exist) and their professional judgment.

You might object that faculty are working with us now because of the SLO report requirements, and if those go away they may stop taking our calls. This is true, but it has administrative and regulatory implications. If accreditation standards take a horizontal approach and ask for evidence of teaching quality, that's a natural fit for a more faculty-development style of assessment office. And it will likely produce better results, since an institution can currently place no value on undergraduate education and still pass the SLO requirement with the right checkboxes.

But I don't claim to have all the answers. It's encouraging that the conversation has begun, and I think my final words resonated with the group, viz. that librarians and accountants set their own professional standards. After more than three decades of practice, isn't it time that assessment practitioners stopped deferring to accreditors to tell them what to do and set their own standards?