Thursday, October 26, 2023

Assessment Institute 2023: Grades and Learning

I'm scheduled to give a talk on grade statistics on Monday, reviewing the work in the lead article of JAIE's special edition on grades. The editors were supportive: they not only accepted my dive into the statistics of course grades and their relationship to other measures of learning, they also recruited others in the assessment community to respond. I got to have the final word in a synthesis. 

That effort turned into a panel discussion at the Institute, which will follow my research presentation. The forum is hosted by AAC&U and JAIE and comprises many of the authors who commented in the special edition. One reason for this post is to provide a stable hyperlink to the presentation slides, which are heavily annotated. The file is saved on researchgate.net.

If you want to cite the slides in APA format, you can use:

Eubanks, D. (2023, October 26). Grades and Learning. ResearchGate. https://www.researchgate.net/publication/374977797_Grades_and_Learning

The panel discussion will also have a few slides, but you'll have to get those from the conference website. 

Ancient Wisdom

As I've tracked down references in preparation for the conference, I've gone further back in time and would like to call out two books by Benjamin Bloom, whose name you'll recognize from taxonomy fame. 

The first of these was published more than sixty years ago.

Bloom, B. S., & Peters, F. R. (1961). The use of academic prediction scales for counseling and selecting college entrants. Crowell-Collier.
 
It starts off with a bang. 
 
The main thesis of this report is that there are three sources of variation in academic grades. One is the errors in human judgment of teachers about the quality of a student's academic achievement. Testers have over-emphasized this source of variation and have tended to view grades with great suspicion. Our work demonstrates that this source of variation is not as great as has been generally thought and that grade averages may have a reliability as high as +.85, which is not very different from the reliability figures for some of the best aptitude and achievement tests.
 
The treatment then goes on to describe the typical correlation of around .6 between high school and college grades, which is almost exactly what I see at my institution (though it's slowly declining over time). The methods include grade transformations to improve predictiveness. That's still a useful topic, whether predicting college success or building regression models with learning data. 

The second book is:

Bloom, B. S. (1976). Human characteristics and school learning. McGraw-Hill.
 
 From page 1:

This is a book about a theory of school learning which attempts to explain individual differences in school learning as well as determine the ways in which such differences may be altered in the interest of the student, the school, and ultimately, the society.
 
The ideas here are more sophisticated than what we find in the advice books on assessment practice, in that the focus is on "individual differences," not just group averages. This turns out to be the key to understanding the data we generated and is reviewed in the Assessment Institute presentation. I wish I had read this book before having to reinvent the same ideas. That's partly my own limitation, because I don't have a background in education research. And I increasingly realize that assessment practice--at least all those program reports we write for accreditors--doesn't resemble research at all. Why is that?

Assessment Philosophies

 
There's an upcoming edition of Assessment Update (December 2023) that will feature several articles critiquing the report-writing focus that consumes so much of assessment practice. In it I'll describe my understanding of the history of that practice and why it fails to deliver on its promise. A big part of that is the disconnection between report-writing and the theory and practice of educational research. Most recently we can see that in the non-impact the Denning et al. work has had on assessment practice or accreditation reporting. That work should have caused a lot of soul-searching, but it's crickets. The conference program book doesn't mention Denning, and although there are several mentions of graduation rates, nothing directly links grades, graduation, and learning (other than my own presentation). When I've had occasion to talk to accreditation VPs, they seem not to have heard of it either. Nor was there a discussion on ASSESS-L. 

I recently came across Richard Feynman's quote that "I would rather have questions that can't be answered than answers that can't be questioned." Here are the two ways of looking at assessment that correspond to those cases.

1. Product Control by Management

I think the gap between research on learning and assessment practice is due to the founding metaphor of the define-measure-improve idea that drives assessment reports. As I'll mention in the Update article, that formula for continual improvement is a top-down management idea (e.g. related to Six Sigma) that prioritizes control over understanding the world as it is. The impetus came from a federal recommendation in 1984 and became accreditation standards that cause universities to set up assessment offices that tell faculty how to do the work. There's a transparent flow from authority to action. Although educational research appears in the formula, it's not the priority, and in practice it's impossible to do actual research on every program and every learning outcome at a university.

So assessment has become a top-down phenomenon that treats learning outcomes as products, with the emphasis on control. This led to proscriptions against grades and other "answers that can't be questioned."

2. Educational Research

Managing hundreds of SLO reports for accreditors is soul-destroying, and the work that many of us prefer to do is (1) engage with faculty members on a more human level and lean on their professional judgment in combination with ideas from published research, or (2) do original research on student learning and other outcomes.

The research strand has a rich history and a lot of progress to show for it. The early assessment movement, before being co-opted by accreditation, led to the Scholarship of Teaching and Learning and a number of pedagogical advances that are still in use. 

Research often starts with preconceived ideas, theories, and models of reality that frame what we expect to find. But crucially, all of those ideas have to eventually bend to whatever we discover through empirical investigation. There are many questions that can't be answered at all, per Feynman. It's messy and difficult and time-consuming to build theories from data, test them, and slowly accumulate knowledge. It's bottom-up discovery, not top-down control.

What's new in the last few years is the rapid development of data science: the ease with which we can quickly analyze large data sets with sophisticated methods. 

The Gap

There's a culture gap between top-down quality control and bottom-up research. I think that explains why assessment practice seems so divorced from applicable literature and methods. I also don't think there's much of a future in SLO report-writing, since it's obvious by now that it's not controlling quality (see the slides for references besides Denning). 

This is a long way to go to explain why all those cautions about using grades have evolved. Grades are mostly out of the control of authorities, so it was necessary to create a new kind of grades that were under direct supervision (SLOs and assessments). It's telling that one common concession is that grades might be okay if the work in the course is aligned to learning outcomes. In other words, if we put an approval process between grades and their use in assessment reports, it's okay--it's now subject to management control.


Update 1

The day after I posted this, IHE ran an article on a study of placement: when should a student be placed in a developmental class instead of the college-level introductory class? You can find the working paper here. From the IHE review, here's the main result:

Meanwhile, students who otherwise would have been in developmental courses but were “bumped up” to college-level courses by the multiple-measures assessment had notably better outcomes than their peers, while those who otherwise would have been in college-level courses but were “bumped down” by the assessment fared worse.

Students put in college-level courses because of multiple measures were about nine percentage points more likely than similar students placed using the standard process to complete college-level math or English courses by the ninth term. Students bumped up to college-level English specifically were two percentage points more likely than their peers to earn a credential or transfer to a four-year university in that time period. In contrast, students bumped down to developmental courses were five to six percentage points less likely than their peers to complete college-level math or English.

Note that the outcome here is just course completion. No SLOs were mentioned. Yet the research seems to lead to plausible and significant benefits for student learning. I point this out because, at least in my region, this effort would not check the boxes for an accreditation report on the SLO standard. It fails the first checkbox: there are no learning outcomes!

If you want to read my thoughts on why defining learning outcomes is over-rated, see Learning Assessment: Choosing Goals.


Update 2

I had to race back from the conference to finish up work on an accreditation review committee, and I'm finally catching my breath from a busy week. Here are my impressions from the two sessions at the 2023 Assessment Institute.
 
 
The first session was my research talk, which was well attended and had great participation. I went through most of the slides to highlight empirical links between grades, learning, course rigor, student ability, and developmental paths in learning data. One question I got was about using the word "rigor" or "difficulty" to describe the regression coefficient for a class section, which is really just a positive or negative displacement from the average expected grade. In the literature there's a more neutral term, 'lift,' which is better. I have avoided using 'lift' only because it's one more thing to explain, but I think that was an error. 
 
The reason lift is better is that it doesn't come with the meanings we associate with rigor or difficulty. The question was about the potential difference between a course that's academically challenging, which might be good for learning engagement, versus one that's poorly taught. I don't think we can tell the difference directly with the grade analysis I'm using (random effects models with intercepts for students and course sections). That's a great opportunity for future research: what additional data do we need, or is there some cleverer way of using grade data, like a value-added model by instructor? The Insler et al. piece referenced in my slides might be a starting point.
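For readers who want to try this on their own grade data, here's a minimal sketch of the kind of crossed random effects model described above, in R with lme4. The simulated data frame just stands in for a real grade file; column names are my own.

library(lme4)

# Simulated stand-in for a real grade file: 200 students crossed with 40 sections
set.seed(1)
grades <- expand.grid(student_id = factor(1:200), section_id = factor(1:40))
grades <- grades[sample(nrow(grades), 2000), ]   # each student takes about 10 sections
ability <- rnorm(200, 0, 0.5)                    # simulated student effects
lift    <- rnorm(40, 0, 0.3)                     # simulated section effects ("lift")
grades$grade_points <- 3 + ability[grades$student_id] +
  lift[grades$section_id] + rnorm(nrow(grades), 0, 0.5)

# Crossed random intercepts for students and course sections
fit <- lmer(grade_points ~ 1 + (1 | student_id) + (1 | section_id), data = grades)

# The estimated section intercepts play the role of "lift": positive values mean
# grades in that section run higher than the same students earn elsewhere
head(ranef(fit)$section_id)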
 
A metaphor that nicely contrasts the research I summarized with the state of SLO compliance reports is horizontal versus vertical. The SLO reports are vertical in the sense that they chop up the curriculum into SLOs that are independently analyzed; there's no unifying model that explains learning across learning outcomes, and consequently few opportunities to find solutions that apply generally instead of per-SLO. So we get a lot of findings like "critical thinking scores are low, so we'll add more of those assignments," and not general improvements to teaching and learning that might affect many SLOs, students, and instructors at once. The latter is the horizontal approach that emerged from the research on grades. 

Specifically, it's clear that GPA correlates with learning outcomes across the curriculum, so it's more efficient to think about how students with different ability levels or levels of engagement navigate the curriculum than it is to look at one SLO at a time and hope to catch that nuance. Additionally, improving one instructor's teaching methods can affect a lot of SLOs and students. 

One of the questions got to that point--it was about how a rubric might be used to disambiguate aspects of an SLO, perhaps in contrast to the more general nature of what grades tell us. In the vertical approach, the assessment office would conceivably have to oversee the administration of dozens or hundreds of such rubrics--kind of what we are asked to do now. In the horizontal approach, we put the emphasis on supporting faculty to take advantage of new teaching methods in all their classes. This entails supporting faculty development and engaging faculty as partners rather than supervising them, which is contrary to the SLO report culture, but it's bound to be more effective. 

Some of the questions were important, but too high level to really address from the results of the empirical studies, because of the layers of judgment and politics that are involved. For example, how can the empirical results be used to inform a well-structured curriculum? I think this is quite possible to do, but it partly depends on the culture in an academic program and its collective goals. In principle, we could aim to make all courses in a curriculum have the same lift (i.e. difficulty; see above), but that may not be what a department wants or needs. Perhaps we need some easier courses so that students can choose a schedule that's manageable. We're not having those conversations yet, but they are important. Should some majors be easier than others to permit pathways for lower-GPA students? If so, how do we ensure that these aren't implicitly discriminatory? It gets complicated fast, but if we don't have data-informed conversations, it's hard to get beyond preconceptions and rhetoric.

This forum was the culmination of several years of work on my part and, perhaps more importantly, of the vision and execution of the editors of J. Assessment and IE. Mark was there to MC the panel, which largely comprised authors who responded to my lead article on grades with their own essays, found in the same issue. Kate couldn't make it to the panel, but she deserves thanks for making space for us on the AAC&U track at the conference. We didn't get a chance to ask her what she meant by "third rail" in the title. Certainly, if you have tried to use course grades as learning assessments in your accreditation reports, you run the risk of being shocked, so maybe that's it.

Peter Ewell was there! He's been very involved with the assessment movement since the beginning and has documented its developments in many articles. I got a chance to sit down with him the day before the panel and discuss that history. My particular interest was a 1984 NIE report that recommended that accreditors adopt the define-measure-improve formula that's still the basis for most SLO standards, nearly forty years later. My short analysis of the NIE report and its consequences will appear in the December 2023 edition of Assessment Update. Peter led that report's creation, so it was great to hear that I hadn't missed the mark in my exegesis. We also chatted about the grades issue, how the prejudice got started (an accident of history), and the ongoing obtuseness (my words) of the accrediting agencies. 

One of my three slides in my introductory remarks for the panel showed the overt SACSCOC "no grades" admonition found in official materials. I could have added another example, taken from a Q&A with one of the large accreditors (lightly edited):

Q: [A program report] listed all of the ways they were gathering data. There were 14 different ways in this table. All over the place, but course grades weren't in the list anywhere, and I know from working in this for a long time that there has been historically among the accreditors a kind of allergy to course grades. Do you think that was just an omission, or is there active discouragement to use course grades?

A: We spend a considerable amount of time training institutions around assessment, and looking at both direct and indirect measures, and grades would be an indirect measure. And so we encourage them to look at more direct ways of assessing student learning.

It's not hard to debunk the "indirect" idea--there's no intellectual content there, e.g. no statistical test for "directness." What it really means is "I don't like it" and "we say so."

I should note, as I did in the sessions, that this criticism of accreditor group-think is not intended to undermine the legitimacy of accreditation. I'm very much in favor of peer accreditation, and have served on eleven review teams myself. It works, but it could be better, and the easiest thing to improve is the SLO standard. Ironically, it means just taking a hard look at the evidence.

I won't try to summarize the panelists. There's a lot of overlap with the essays they wrote in the journal, so go read those. I'll just pull out one thread of conversation that emerged, which is the general relationship between regulators and reality. 

A large part of a regulator's portfolio (including accreditors, although they aren't technically regulators) is to exert power by creating reality. I think of this as "nominal" reality, which is an oxymoron, but an appropriate one. For example, a finding that an instructor is unqualified to teach a class means, in effect, that the instructor won't be able to teach that class anymore, and might lose his or her job. An institution that falls too short may see its accreditation in peril, which can lead to a drop in confidence and a loss of enrollment. These are real outcomes that stem from paperwork reviews.

The problem is that regulators can't regulate actual reality, like passing a law to "prove" a math theorem, or reducing the gravitational constant to reduce the fraction of overweight people in the population. This no-fly zone includes statistical realities, where the trouble starts for SLO standards. Here's an illustration. A common SLO is "quantitative reasoning," which sounds good, so it gets put on the wish list for graduates. Who could be against more numerate students? The phrase has some claim to existence as a recognizable concept: using numbers to solve problems or answer questions. But that's far too general to be reliably measured, which is reality's way of indicating a problem. Any assessment has to be narrow in scope, and the generalization of a collection of narrow items to some larger skill set has to be validated. In this case, the statement is too broad to ever be valid, since many types of quantitative reasoning depend on domain knowledge that students can't be assumed to have. 

In short, just because we write down an SLO statement doesn't mean it corresponds to anything real. Empiricism usually works the other way around, by naming patterns we reliably find in nature. As a reductio ad absurdum, consider the SLO "students will do a good job." If they do a good job on everything they do, we don't really need any other SLOs, so we can focus on just this one--what an advance that would be for assessment practice! As a test, we could give students the task of sorting a random set of numbers, like 3, 7, 15, -4, 7, 25, 0, 1/2, and 4, and score them on that. Does that generalize to doing a good job on, say, conjugating German verbs? Probably not. If we called the SLO "sorting a small list of numbers" we would be okay, but "doing a good job" is a phantasm.

So you can see the problem. When regulators require wish lists that probably don't correspond to anything real, we get some kind of checkbox compliance, and worse: a culture devoted to the study of generalized nonsense.

Two Paths to Relevance

I am always impressed at these conferences by how nice the assessment community is. There is enormous potential energy there to do good in the world, and my thesis throughout these talks was that their work is impeded by outdated ideas forced on us by accreditation requirements. 

I don't think the current resource drain caused by the compliance culture is sustainable, and it behooves us to look for alternatives in order for assessment offices to stay relevant in a world of tighter budgets and ChatGPT-generated assessment reports. As I describe more fully in the upcoming Assessment Update, and as I mentioned at the conference, I think there are two paths forward. One is to do more data science on student learning and success, extending the type of work that IR offices do. I don't think we need a lot of these operations, assuming that we can generalize results. For example, if there's a demand for sophisticated grade statistics like I presented, software vendors will line up to sell them to you, and it's probably cheaper than hiring someone to build the software from scratch. 

The second strand is where we probably need the most effort, and that's to extend the successful work that assessment offices do now to engage with faculty members on improvement projects. If we abandon the SLO reporting drill, the reason for faculty resentment vanishes, and we can spend time reading the literature on pedagogy, becoming familiar with discipline standards (like ACTFL), and, instead of relying on dodgy data, engaging in a collaborative and trusting relationship with faculty that draws on whatever combination of organic data (final exam scores, course grades, rubrics where they already exist) and professional judgment makes sense.

You might object that faculty are working with us now only because of the SLO report requirements, and if those go away they may stop taking our calls. This is true, but it has administrative and regulatory implications: if accreditation standards take a horizontal approach and ask for evidence of teaching quality, it's a natural fit for a more faculty-development style of assessment office. And it will likely produce better results, since an institution can currently place no value on undergraduate education and still pass the SLO requirement with the right checkboxes.

But I don't claim to have all the answers. It's encouraging that the conversation has begun, and I think my final words resonated with the group, viz. that librarians and accountants set their own professional standards. After more than three decades of practice, isn't it time that the assessment practitioners stopped deferring to accreditors to tell them what to do and set their own standards?

Saturday, July 29, 2023

Average Four-Year Degree Program Size

"How much data do you have?" is an inevitable question for program-level data analysis. For example, assessment reports that attempt understand student learning within an academic major program typically depend on final exams, papers, performance adjudication, or other information drawn from the seniors before they graduate: a reasonable point in time to assess the qualities of the students before they depart. Most accreditors require this kind of activity with a standard addressing the improvement of student learning, for example SACSCOC's 8.2a or HLC's 4b. 

The amount of data available for such projects depends on the number of graduating seniors. As an overall assessment of these amounts, I pulled the counts reported to IPEDS for 2017 through 2019 (pre-pandemic) for all four-year (bachelor's degree) programs. Each row of data comes with a disciplinary CIP code, a decimal-system index that describes a hierarchy of subject areas. For example, 27.01 is Mathematics and 27.05 is Statistics. Psychology majors start with 42. 

We have to decide what level of CIP code to count as a "program." The density plot in Figure 1 illustrates all three levels: CIP-2 is the most general, e.g. code 27 includes all of math and statistics and their specializations. 

There are a lot of zeros in the IPEDS data, implying that institutions are reporting that they have a program, but it has no graduates for that year.  In my experience, peer reviewers are reasonable about that, and will relax the expectation that all programs produce data-driven reports, but your results may vary. For purposes here, I'll assume the Reasonable Reviewer Hypothesis, and omit the zeros when calculating statistics like the medians in Figure 1.


 
Figure 1. IPEDS average number of graduates for four-year programs, 2017-19, counting first and second majors, grouped by CIP code resolution, with medians marked (ignoring size zero programs).
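Here's a sketch of that median calculation in R; the data frame ipeds and its columns are stand-ins for the IPEDS completions file, not the actual variable names.

library(dplyr)
library(tidyr)

# Hypothetical frame ipeds: one row per institution x CIP-6 program with the
# average number of graduates over 2017-19 in `grads` (placeholder names).
# ipeds <- readr::read_csv("ipeds_completions_2017_2019.csv")

program_medians <- ipeds |>
  mutate(cip2 = substr(cip6, 1, 2),            # e.g. "27" from "27.0501"
         cip4 = substr(cip6, 1, 5)) |>         # e.g. "27.05"
  pivot_longer(c(cip2, cip4, cip6), names_to = "level", values_to = "cip") |>
  group_by(unitid, level, cip) |>
  summarize(grads = sum(grads), .groups = "drop") |>
  filter(grads > 0) |>                         # Reasonable Reviewer Hypothesis: drop zeros
  group_by(level) |>
  summarize(median_size = median(grads))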
 
CIP-6 is the most specific code, and is the level usually associated with a major. The Department of Homeland Security has a list of CIP-6 codes that are considered STEM majors. For example, 42.0101 (General Psychology) is not STEM, but 42.2701 (Cognitive Psychology and Psycholinguistics) is STEM. The CIP-6 median size is nine graduates, and it's reasonable to expect that institutions identify major programs at this level. But to be conservative, we might imagine that some institutions can get away with assessment reports for logical groups of programs instead of each one individually. Taking that approach and combining all three CIP levels effectively assumes that there's a range of institutional practices, and it enlarges the sample sizes for assessment reports. Table 1 was calculated under that assumption.
 
Table 1. Cumulative distribution of average program sizes (all CIP levels combined, zeros omitted). 

Size            Percent of programs
less than 5     30%
less than 10    46%
less than 20    63%
less than 30    72%
less than 50    82%
less than 400   99%  

Half of programs (under the enlarged definition) have fewer than 12 graduates a year. Because learning assessment data is typically prone to error, a practical rule of thumb for a minimum sample size is N = 400, which begins to permit reliability and validity analysis. Only 1% of programs have enough graduates a year for that. 

A typical hedge against small sample sizes is to only look at the data every three years or so, in which case around half the programs would have at least 30 students in their sample, but only if they got data from every graduate, which often isn't the case. Any change coming from the analysis has a built-in lag of at least four years from the time the first of those students graduated. That's not very responsive, and it would only be worth the trouble if the change has a solid evidentiary basis and is significant enough to have a lasting and meaningful impact on teaching and learning. But a sample of 30 isn't going to be enough for a meaningful analysis either.

One solution for the assessment report data problem is to encourage institutions to research student learning more broadly--starting with all undergraduates, say--so that there's a useful amount of data available. The present situation faced by many institutions--reporting by academic program--guarantees that there won't be enough data available to do a serious analysis, even when there's time and expertise available to do so. 

The small sample sizes lead to imaginative reports. Here's a sketch of an assessment report I read some years ago. I've made minor modifications to hide the identity.

A four-year history program had graduated five students in the reporting period, and the two faculty members had designed a multiple-choice test as an assessment of the seniors' knowledge of the subject. Only three of the students took the exam. The exam scores indicated that there was a weakness in the history of Eastern civilizations, and the proposed remedy was to hire a third faculty member with a specialty in that area. 

This is the kind of thing that gets mass-produced, increasingly assisted by machines, in the name of assuring and improving the quality of higher education. It's not credible, and a big part of the problem is the expected scope of research, as the numbers above demonstrate. 

Why Size Matters

The amount of data we need for analysis depends on a number of factors. If we are to take the analytical aspirations of assessment standards seriously, we need to be able to detect significant changes between group averages. This might be two groups at different times, two sections of the same course, or the difference between actual scores and some aspirational benchmark. If we can't reduce the error of estimation to a reasonable amount, such discrimination is out of reach, and we may make decisions based on noise (randomness). Bacon & Stewart (2017) analyzed this situation in the context of business programs. The figure below is taken from their article. I recommend reading the whole piece.

Figure 2. Taken from Bacon & Stewart, showing minimum sample sizes needed to detect a change for various situations (alpha = .10). 

Although the authors are talking about business programs, the main idea--called a power analysis--is applicable to  assessment reporting generally. The factors included in the plot are the effect size of some change, the data quality (measured by reliability), the number of students assessed each year, and the number of years we wait to accumulate data before analyzing it. 

Suppose we've changed the curriculum and want the assessment data to tell us if it made a difference. If the data quality isn't up to that task, it also isn't good enough to tell us that there's a problem that needs to be fixed to begin with--it's the same method. Most effect sizes from program changes are small. The National Center for Education Evaluation has a guide for this. In their database of interventions, the average effect size is .16 (Cohen's d, the number of standard deviations the average measure changes), which is "small" in the chart. 

The reliability of some assessment data is high, like a good rubric with trained raters, but it's expensive to make, so it's a trade-off with sample size. Most assessment data will have a reliability of .5 or less, so the most common scenario is the top line on the graph. In that case, if we graduate 200 students per year, and all of them are assessed, then it's estimated to take four years to accumulate enough data to accurately detect a typical effect size (since alpha = .1, there's still a 10% chance we think there's a difference when there isn't). 
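To make the arithmetic concrete, here's a back-of-the-envelope version of that power calculation in R. It is not Bacon & Stewart's exact model, just the standard two-group setup with the effect size attenuated for unreliable measurement.

# A true effect of d = .16 measured with reliability .5 looks smaller in the data
true_d      <- 0.16
reliability <- 0.5
observed_d  <- true_d * sqrt(reliability)    # about .11

# Sample size per group to detect it at alpha = .10 with 80% power
power.t.test(delta = observed_d, sd = 1, sig.level = 0.10, power = 0.80)
# Roughly a thousand students per group, i.e. years of accumulated data even for
# a large program, and hopeless for one that graduates a dozen students a year.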

With a median program size of 12, you can see that this project is hopeless: there's no way to gather enough data under typical conditions. Because accreditation requirements force the work to proceed, programs have to make decisions based on randomness, or at least pretend to. Or risk a demerit from the peer reviewer for a lack of "continuous improvement."  

Consequences 

The pretense of measuring learning in statistically impossible cases is a malady that afflicts most academic programs in the US because of the way accreditation standards are interpreted by peer reviewers. This varies, of course, and you may be lucky enough that this doesn't apply. But for most programs, the options are few. One is to cynically play along, gather some "data" and "find a problem" and "solve the problem." Since peer reviewers don't care about data quantity or quality (else the whole thing falls apart), it's just a matter of writing stuff down. Nowadays, ChatGPT can help with that. 

Another approach is to take the work seriously and just work around the bad data by relying on subjective judgment instead. After all, the accumulated knowledge of the teaching faculty is way more actionable than the official "measures" that ostensibly must be used. The fact that it's really a consensus-based approach instead of a science project must be concealed in the report, because the standards are adjudicated on the qualities of the system, not the results. And the main requirement of this "culture of assessment" is that it relies on data, no matter how useless it is. In that sense, it's faith-based.

You may occasionally be in the position of having enough data and enough time and expertise to do research on it. Unfortunately, there's no guarantee that this will lead to improvements (a significant fraction of the NCEE samples have a negative effect), but you may eventually develop a general model of learning that can help students in all programs. Note that good research can reduce the number of samples needed by attributing score variance to factors other than an intervention, e.g. student GPA prior to the intervention. This requires a regression modeling approach that I rarely see in assessment reports, which is a lost opportunity.
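Here's a small simulated example of that covariate-adjustment idea. The numbers and column names are invented, but the point is that controlling for prior GPA shrinks the standard error on the intervention effect, which is the same thing as needing fewer students.

set.seed(2)
n <- 120
seniors <- data.frame(
  prior_gpa = rnorm(n, mean = 3, sd = 0.5),
  treated   = rep(0:1, each = n / 2)           # hypothetical curriculum change
)
seniors$score <- 0.16 * seniors$treated + 1.2 * seniors$prior_gpa + rnorm(n, 0, 0.6)

fit_raw <- lm(score ~ treated, data = seniors)              # intervention only
fit_adj <- lm(score ~ treated + prior_gpa, data = seniors)  # plus prior GPA

# The adjusted model reports a noticeably smaller standard error on `treated`
summary(fit_raw)$coefficients["treated", ]
summary(fit_adj)$coefficients["treated", ]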
 

References

Bacon, D. R., & Stewart, K. A. (2017). Why assessment will never work at many business schools: A call for better utilization of pedagogical research. Journal of Management Education, 41(2), 181-200.

Tuesday, July 11, 2023

Driveway Math

In 2015 I bought a house on the slope of Paris Mountain in Greenville, South Carolina. A flat spot for the house was bulldozed out of the hillside, creating a steep hill behind the house and leaving a slope from the floor of the garage down to the street, so the driveway in between is slanted at about 15 degrees. That turns out to be a lot of degrees. I first noticed that this could be a problem when I drove up from Florida in my 2005 Camry and, after an exhausting day on I-95, finally pulled into the garage. CRUNCH went the bottom of the car as it ground over the peak caused by the driveway slant.

Figure 1. Cross section of the driveway and garage floor.

I learned my lesson and bought an SUV, but being limited in what car one can own by some bothersome geometry is annoying. So I began investigating what modification to the driveway would allow more options. There's currently a small bevel at the join between the slanted and flat concrete, and the idea is to expand that in both directions. How wide would the bevel need to be to allow a Prius to get up the driveway without bottoming out? Or a Chevy Bolt?

The important car dimensions here are the wheelbase (distance between axles) and ground clearance. These are relatively easy to find on the internet, and bulk downloads exist (for a price). One useful data point was that my wife's 2010 Honda Civic just barely touched the concrete, giving me a borderline case to calibrate from. Initially, I just computed the ratio of wheelbase to clearance and used that to estimate which cars would make it into the garage without the nasty crunch. Anything with a ratio above about 15.7 was going to be a problem. However, this greatly limits the cars that will work, and that's become an issue lately. We'd like to have the option of getting a small used EV to just drive in town, to complement the Rav4 we share now. 
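The screening rule amounts to a one-liner in R. The specs below are approximate and just for illustration; the 15.7 threshold is the one calibrated by the Civic.

# Quick screen: wheelbase divided by ground clearance, compared to the ~15.7
# threshold set by the Civic that just barely touched. Specs are approximate.
fits_garage <- function(wheelbase_in, clearance_in, threshold = 15.7) {
  wheelbase_in / clearance_in <= threshold
}

fits_garage(106.3, 6.8)   # roughly the Civic: right at the limit
fits_garage(106.3, 5.3)   # a low-slung hybrid: no
fits_garage(105.9, 8.4)   # a small SUV: yes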

There are two kinds of solutions I've considered. One is remaking the driveway to increase the bevel, and the other is to install a rubber bump in the garage to lift the front of the car as the lowest point passes over the peak. 

Given the physical dimensions of the car, the driveway, and the proposed modifications, it's straightforward, but fussy, to do the geometry to calculate the minimum clearance at each point in the driveway as the car comes into the garage. I built a Shiny app to do that, which you can download on Github.  (Use at your own risk: no warranties are implied.)

 
One thing I learned from tweaking the knobs is that a step-up in the garage needs to be wide to be effective. In the bottom graph above you can see the discontinuity where the front wheel rolls over the bump pictured at the top. In reality this would be sloped, but I left it as a discrete jump so it's more visible in the graph. 
 
I used the Rav4 to test the model against reality with carefully placed markers.

 The photo shows a 2" marker placed about 14" behind the edge of the garage slab. By moving the marker until it just brushes the bottom of the car I can get a sense of how well the model conforms to reality. One thing I learned is that the sloped driveway is curved a bit to make a slight hump--it's not perfectly flat, so the results deviate a bit from the model in favor of slightly more clearance. 

Another calculation was to find where the plane of the flat garage slab and the plane of the inclined driveway meet.

Although the driveway isn't perfectly flat, the inclined plane intersects with the horizontal about two inches from the existing edge. This is due to an existing bevel where the two meet. That meeting point is the center of the bevel in the simulation. So it's currently about 2", but because of the slight hump in the driveway it functions more like 3". I didn't try to model the hump explicitly, but that would be the next step in making the app more accurate.

All of this is somewhat approximate, and there's no certainty without actually driving a car into the garage, but this analysis has given me a good idea of how much wider our selection of cars can be if we increase the bevel to about 12" and possibly add a 1" ledge (a secured rubber mat, probably) inside.

Tuesday, June 06, 2023

Why Discount Rates Only Go Up

The annual NACUBO report on tuition discounts was covered in Inside Higher Ed back in April, including a figure showing historical rates. (More recently this was covered in the Economist.)

The discount rate is the fraction of posted tuition that is "given back" by the institution in financial aid awards. Usually these are unfunded awards, meaning there was never any money involved. For a primer on this idea, see Paul Tough's account in the New York Times.  

In private higher ed, where financial aid leveraging is ubiquitous, high discount rates are seen as a sign of failure (not at the elites, which are not as dependent on tuition revenue). Over my three decades in the business, I've seen a lot of hand-wringing about discount rates. Despite all the angst, the causes of discount rates do not seem to be well understood. High discounts are generally seen as good for students, e.g. "institutions are devoting a lot of their own resources to make education more affordable relative to the tuition price" per a NACUBO official quoted in the IHE article. But that benefit is seen as a drag on institutional finances, e.g. "[...] a challenge for many private colleges, since tuition and fee revenue is a large share of their overall funding [...]", from the same source.

Both of these sentiments miss the point. Tuition discounts go up because they mathematically must go up given the business model of private higher ed. Talking about discount rates from year to year is largely a waste of time. Considering the long term effects, on the other hand, should be a concern. 

Discounting Freshmen

The most important thing to know is that it's primarily the first-year students who get the discount. Like in any other service industry where there's competition for new customers, it's harder to attract new ones than to retain existing ones. This is why streaming services, newspapers, and so on offer discounts for an initial period, after which the rates increase. These businesses can't survive on the revenue they would get at the discounted rate, so the increase is necessary. Colleges have a similar situation: because of competition, the average net tuition revenue they can collect from freshmen is lower than what they need per student to make budget. The difference between the two lines in Figure 1 suggests that private colleges need an overall discount rate, averaged over all students, about five percentage points lower than the freshman discount.

Colleges aren't in the kind of business where it's culturally acceptable to overtly offer a discount for the first year to new customers, like a Netflix trial offer. Instead, we raise the price on returning students. If we didn't raise tuition, everyone would pay the low-low introductory rate, and we'd go broke. But raising tuition involves somewhat complicated math, because rate hikes compound over time. A sophomore will have seen one tuition increase, a junior two, and a senior three. These increases are generally not offset with increased financial aid, since that would defeat the purpose of bringing net revenue per student up to the level we need.
 

Raising Tuition

So how much do we need to raise tuition every year to increase the net revenue from returning students so that the total average net revenue per student is high enough to support operations? First, it's useful to notice that if enrollment patterns are stable, the tuition increase needed to maintain net revenue can be a fixed amount, rather than a percentage of current tuition. To specify the enrollment pattern, let's call \( p_1 \) the fraction of undergraduates who are first-year students, \( p_2 \) the fraction who are second-year students, and so on. Average net revenue per student, \( N_a \), is related to the average net revenue per freshman, \( N_f \), by the formula

$$ N_a =  N_f + (p_2 + 2p_3 + 3p_4)T' $$

where \( T' \) is the constant amount we raise tuition each year. The formula assumes that no matter what the nominal tuition rate is, the amount of net revenue per freshman stays constant due to market forces. The returning classes pay more because the market forces (mostly) no longer apply. The second year students see one tuition increase, the third year students two, and the seniors have seen three increases. This is a simplification that doesn't consider transfers in, students who remain longer than four years, and so on.
 
Note that the fractions \(  p_1 \) , \(  p_2 \) , \(  p_3 \) , and \(  p_4 \) sum to one (100%) and describe a student population that relates to admissions standards and completion rates. A large first year attrition will enlarge  \(  p_1 \) and put more pressure on tuition increases, since there are relatively fewer returners.
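As a quick worked example of the formula, with made-up numbers: if the market sets \( N_f \) and the budget requires \( N_a \), the constant annual increase \( T' \) falls right out.

# Back out the constant dollar increase T' needed to hit a target average net
# revenue. The numbers are made up for illustration.
N_f <- 20000                     # net revenue per freshman, set by the market
N_a <- 22000                     # average net revenue per student we need
p   <- c(.30, .25, .23, .22)     # class proportions, first through fourth year

T_prime <- (N_a - N_f) / (p[2] + 2 * p[3] + 3 * p[4])
T_prime                          # about $1,460 added to tuition each year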
 
So under strong competition for freshmen, we'd expect to see that tuition increases are approximately linear (as opposed to exponential from percentage compounding).  

NACUBO Data

Let's see what we can learn from the NACUBO numbers in Figure 1. It's useful to begin by converting discount rates to "realized rates," or net tuition revenue divided by gross tuition. If tuition is $30,000 and the total average discount rate D is 33%, then the average net revenue per student is $20,000 and the realized rate R is 20,000/30,000 = 67%, or 100% - discount rate. With that idea, we can calculate the ratio of realized rates from the chart. The ratio \(r\) of the realized rates is the same as the ratio of average net revenue, i.e.
 
$$ r  = \frac{R_a}{R_f} = \frac{\frac{N_a}{G}} {\frac{N_f}{G}}= \frac{N_a}{N_f} $$
where the a subscript means all students and f means only freshmen. 
 
In a competitive market for freshmen, we'd expect \(r\) to stay greater than one. In 2015, this ratio was (100 - 43)/(100 - 48) = 1.096. In 2022 it was (100 - 50.9)/(100 - 56.2) = 1.121. The ratio has been remarkably stable over the ten-year period shown in Figure 1, with an average of 1.11. So over the last decade, the average undergraduate paid about 11% more than the average freshman. We can use that information to find out how much we need to raise tuition each year to make budget.
 
Figure 2. NACUBO and IPEDS realized revenue ratios are nearly constant at about 1.1 in recent years.
 
For convenience, I'm jumping ahead a little to include IPEDS ratios on the same chart in Figure 2. The IPEDS data is discussed in the next section. 
 
The equation for \( N_a \) in the previous section assumes a constant tuition increase each year, but since we have the ratio of \( N_a / N_f \) here instead of the difference \(  N_a - N_f \), it's convenient to fudge a little on the constant tuition increase in order to make the math work. This entails using a percentage increase on net revenue each year instead of a fixed amount. 

Assume that for each dollar in average net tuition revenue we get from a new student, we actually need r dollars, averaged over all undergraduates. The NACUBO ratios suggest that r = 1.11 is reasonable. Because the increase happens each year, returning students see one increase as second-year students, two as third-year students, and three as fourth-years. We have

$$r = p_1 + p_2(1+t') + p_3(1+t')^2 + p_4(1+t')^3$$

where \( t' \) is the fraction of freshman net revenue per student that we raise price by, \( t' = T' / N_f \). 

A college with a high retention rate might have class proportions of .3, .25, .23, and .22 (ignoring fifth-year and beyond). In that case, solving for r = 1.11 gives t' = 0.076, which means that we need to increase the net tuition revenue for returners by about 7.6%  each year. This is a fixed fractional multiplier to \( N_f \) that becomes a tuition increase each year if there is no additional discounting for returners. For context, if discount rates are already in the 50% range, then that 7.6% of extra revenue can be generated by a 7.6/2 = 3.8% tuition hike that gets passed on to returners without additional discounting. That calculation is complicated by the fact that students with the leanest aid packages are usually the first to leave, so the tuition increase may need to be a little higher to accommodate.
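If you want to check that number, the equation above solves numerically in a couple of lines of R.

# Solve r = p1 + p2(1+t') + p3(1+t')^2 + p4(1+t')^3 for t'
p <- c(.30, .25, .23, .22)
r <- 1.11

f <- function(t) p[1] + p[2] * (1 + t) + p[3] * (1 + t)^2 + p[4] * (1 + t)^3 - r
uniroot(f, c(0, 0.5))$root   # about 0.076, i.e. 7.6% of N_f added each year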

This calculation shows how to maintain a constant flow of net revenue, and any increases to the budget would incur larger tuition hikes. If we must raise tuition just to keep the same amount of net revenue coming in, then by definition discount rates must increase. 

In this scenario where we raise tuition rates fractionally every year to get r = 1.11, fourth-year students are paying about 25% more than freshmen \( (1.076^3 - 1) \). In practice, this could be lower if the average net revenue paid by freshmen increased each year. However, the historical NACUBO data shows that 11% gap being steady over time, so even if freshmen are paying more, the budget must be growing at the same rate.
 

IPEDS Data

The NACUBO data can't be exactly reproduced using IPEDS data, but we can get close. The finance data includes net tuition revenue, posted tuition amounts, and number of undergraduates, from which we can estimate \( N_a \) as net tuition revenue divided by the number of undergraduates, and hence the discount rate by \( D = 1 - N_a/T \).
 
I selected private colleges with at least 1000 undergraduates and with a complete history of the data I needed, then filtered to those that were heavily undergraduate, so as to avoid bias from graduate school tuition. This left 223 institutions. I'm not sure how well this overlaps with the NACUBO set.
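A rough sketch of that calculation follows; the column names are placeholders for the relevant IPEDS finance and enrollment fields, not the actual variable names.

library(dplyr)

# Placeholder columns standing in for the IPEDS finance and enrollment fields
# ipeds_finance <- readr::read_csv("ipeds_finance_panel.csv")

discounts <- ipeds_finance |>
  filter(control == "private", undergrads >= 1000, mostly_undergrad) |>
  mutate(N_a = net_tuition_revenue / undergrads,   # average net revenue per student
         D   = 1 - N_a / posted_tuition)           # estimated discount rate

discounts |>
  group_by(year) |>
  summarize(median_tuition  = median(posted_tuition),
            median_discount = median(D))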

If the model assumptions are correct, we'd expect tuition increases to look linear over time, and a divergence between net revenue and posted tuition that shows up as increasing discount rates.

Figure 3. IPEDS history of selected private colleges, showing average tuition (blue) with annual percentage tuition increase labeled and average net revenue per student (red) with discount rates labeled. Figures are median values over 223 four-year undergraduate institutions. 

Average tuition rates in Figure 3 are about linear over time, as would be expected from the model above, and the annual percentage increases in tuition decline over time as a consequence.
 
This historical data shows the average tuition in 2007 at around $23,000, and large annual increases (6.5% the first year), while net tuition revenue per student was about $15,000 with a discount rate of 37%. For revenue maintenance, we need linear tuition growth, which is what we see here (blue). Over the eleven years, average tuition increased by $1,225 per year. These increases generated more than steady-state net revenue, as shown in the red line tracking net revenue per student: from 2006 to 2020, it grew from $15k per undergraduate to $20k, a 33% increase. The cumulative inflation rate over that period was about 30%, so in real dollars, net revenue per student was nearly flat. This aligns with what we saw in Figure 2: freshmen pay more over time, but what the college needs to operate increases at about the same rate, so that tuition increases stay relatively constant instead of declining.

Freshman Discount

Under the model assumptions, if we know the annual increase in price and the average net revenue over all undergraduates--both of which IPEDS provides--we can estimate the freshman discount using 

$$ N_f =  N_a - (p_2 + 2p_3 + 3p_4)T' $$
 
I estimated the class proportions using available data, but the estimate isn't very good and needs some work. Fortunately, the result isn't that sensitive to the distribution. Once we have the net revenue rates, the discount rates can be calculated.
 
Figure 4. Comparative discount rates between NACUBO and IPEDS
 
The discount rates in Figure 4 track well, but there's about a 6% difference in total rates and a slightly larger gap for the freshman rates. This is a big enough discrepancy to demand explanation. A proper analysis needs a hierarchical model with better empirical estimates of the class proportions. I'll eventually post the code and data for this so you can try it out for yourself. In the meantime, email me if you're interested, at deubanks.office@gmail.com. 

Discussion

It's not a good look for higher education to advertise higher and higher prices in order to maintain, in some sectors, a subsistence level of revenue. Headlines talk about the posted tuition rates "skyrocketing" at the same time that small colleges are closing up shop.
 
Tuition leveraging, where institutional aid is used for differential pricing, is a flexible tool that can be used for good purposes. However, it seems unfair to have large gaps between what freshmen pay and what seniors pay, escalating the financial pressure each year just to stay in college. I doubt that this was an intended result of the leveraging idea, but it must be a common effect. 

The leveraging model, combined with competition for first-year students, causes an ineluctable rise in tuition and discount rates. In the long run, this business model isn't sustainable, because posted tuition rates will become ridiculously high. Colleges will have to reset their nominal rates eventually (and take a short-term revenue hit), or find some other way to manage first-year discounts. There are creative solutions to this problem, but they require re-imagining net revenue streams. Some colleges have tried flat pricing (no discounting), but that's not going to work if you have to compete for students, because it doesn't address the first-year discount; just jettisoning aid leveraging won't solve the main problem.
 
Tuition increases should be decided in dollar amounts, not percentages, with estimated impacts on each returning class, and there should be sensitivity to how consistent these increases are over time. A per-class retention analysis can inform the decision. Some students may need financial escape valves, which might claw back some of the increase for returners. Of course, if a college can cut spending enough to make do with the average net revenue generated per freshman, it doesn't have to raise tuition at all.

Update: There's a 9/15/2023 article in Inside Higher Ed about cutting tuition that's relevant here: link.

Try It Yourself

I built a simulation so you can try out various scenarios. You can find it on ShinyApps.io. This is a free account, so the monthly usage allotment may dry up. Email me if you want the code to run locally. When it launches, you should see this:

 
Figure 5. Screenshot of discount simulator
 
You can adjust the tuition level and the two figures for average net revenue per student, all in thousands of dollars. To begin with, tuition is at 40k and net revenue is 20k for all students as well as freshmen (a 50% discount). This implies there's no price pressure to discount freshmen more than other students, which isn't realistic for most colleges. But in that happy situation, there's no need to raise tuition just to keep the same revenue, and hence the lines are flat. If you adjust the net revenue per undergraduate slider up to 22k, this reflects a need for 10% more revenue on average than we can get from the freshman rate. That sets off a cascade of tuition increases each year and corresponding discount rate increases. The new graphs will show a future trajectory where the same amount of total net tuition revenue is coming in, but now reflecting market forces to discount freshmen. You can also play around with the class proportions, e.g. to reflect a school with lower retention rates.

Saturday, April 29, 2023

Why are Graduation Rates Increasing?

This post is the first of a series on student achievement. 

The National Center for Education Statistics (NCES) summarizes graduation rates for first-time, full-time undergraduates here. Overall rates have increased from 34% to 47% between start years of 1996 and 2014, an average increase of about .7 percentage points per year. Why is this?

Educated Guesses

"Why" questions are a trap, like a free trial subscription to the Wall Street Journal. See The Book of Why for an accessible discussion of the history of causal thinking and details of path-diagram method that Judea Pearl developed. One problem with finding causal explanations is that it's too easy to imagine a cause from some effect we'd like to explain. In logic this is called abductive reasoning, and historically this has led to a lot of problems, like "the barn burned down because Ahab used witchcraft." However, abduction is useful for formulating guesses about causes, as long as we're willing to be wrong about all of them. In that spirit, here are some possible reasons for the graduation rate increases that are empirically testable.

  1. Colleges are making it easier to graduate because:
    1. Requirements are more lenient
    2. Student support has increased
       
  2. Student population characteristics are changing in favor of those more likely to graduate

  3. Student-institution matching has improved

The first of these is straightforward to understand. The second suggests a "selection effect" of the type that tripped up the 1983 report A Nation at Risk (see chapter 7 of Cathy O'Neil's book). Suppose the group of students who are least likely to graduate college become discouraged over time, perhaps because of rising costs and diminishing public confidence, so on average a smaller fraction of them choose to attend college each year. That self-selection out of the applicant pools would be expected to result in higher graduation rates.

The third hypothesis is even more subtle. On average, students are applying to ever more colleges each year, casting the net more widely as it were. This suggests that the academic and financial match between students and institutions might be improving--a kind of free market hypothesis. If so, this effect might be discernible in the data as a connection between higher application rates and higher retention or graduation rates.

But let's start with the first hypothesis on the list, because there has already been some good work done to ask the question. 

Grade Inflation

I first saw this research mentioned in Inside Higher Ed: The Grade-Inflation Completion Connection. The published paper is here, and the citation is

Denning, J., Eide, E., Mumford, K., Patterson, R., & Warnick, M. (2022). Why have college completion rates increased? American Economic Journal: Applied Economics, 14(3), 1–29. https://doi.org/10.1257/app.202005

The abstract reads

We document that college completion rates have increased since the 1990s, after declining in the 1970s and 1980s. We find that most of the increase in graduation rates can be explained by grade inflation and that other factors, such as changing student characteristics and institutional resources, play little or no role. This is because GPA strongly predicts graduation, and GPAs have been rising since the 1990s. This finding holds in national survey data and in records from nine large public universities. We also find that at a public liberal arts college grades increased, holding performance on identical exams fixed.

The paper draws on student-level data from national surveys (among other sources), but coverage is limited to the period 1988-2002 for some data and up to 2010 for other data. 
 
Note that the abstract signals that the authors have considered selection effects and increasing student support (two of the hypothesized causes on my list), but find the most evidence for GPA increase as the largest identifiable cause of increased graduation rates. 
 
In the discussion, the authors point to other factors not on my list (emphasis added):

We discuss relevant trends that could affect college graduation, such as the college wage premium, enrollment, student preparation, study time, employment during college, price, state support for higher education, and initial college attended. The trends in these variables would predict declining college graduation rates almost uniformly. 

In other words, given these trends, we should expect graduation rates to be decreasing. Instead, "changes in first-year GPA account for 95 percent of the change in graduation rates," and these changes in GPA can't be accounted for by student characteristics like better preparation or college test scores (i.e., measured learning isn't increasing along with grades). In common parlance, grade inflation in the first year of college is making it easier to progress and graduate. Additionally, the largest effect is seen at the low end of the GPA range:

In each sample, and in each cohort, the change in the probability of graduation is largest for GPAs between 1.0 and 2.5. That is, improvements in GPAs in that range correlate with meaningful increases in graduation, whereas GPAs above or below that range do not change the probability of graduation as much.

An analysis of (quasi-) standardized tests shows that scores didn't increase along with GPA, which supports the grade inflation conclusion. 

The authors published an overview for a general audience in The Chronicle as "The Grade Inflation Conversation We're Not Having." They explain the hypothesized mechanism for the graduation rate increases:

It is easy to see how a focus on graduation could affect grade inflation. Imagine that a college wants to increase its graduation rate. An institution under such pressure has several options at hand. It could maintain grading standards and help students via tutoring or other student-success programs. It could change who is admitted (if it is a selective school). These are costly changes to make, and any particular program may not work.

Alternatively, the college could relax its grading standards and suggest or accept that what used to be D-level work is now worth a C. Relaxing grading standards has the advantage of providing no direct cost to the university, the professor, or the student.

This stark appraisal of the economics of higher ed says it all: there's a market for degrees, not so much for learning. Although there are important caveats, this idea echoes Bryan Caplan's tune.

Recent Graduation Rates

The period covered in the study ends over twenty years ago, but NCES data shows that graduation rates have continued to increase. To try to understand the trend in the context of Denning et al., I used IPEDS data on four-year undergraduate institutions and compared a selectivity measure to four-year graduation rates.

In the figure, AppsPer is applicants per enrolled student, where higher values mean more demand for four-year undergraduate programs at a university. Since students increasingly apply to more and more schools, I binned this value within each year to keep it comparable over time, using quintiles, so AppsPer 5 comprises the institutions with the most applicants per enrolled student in that year. Larger applicant pools allow colleges to be more selective. 
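
For readers who want to reproduce this kind of binning, here is a minimal sketch in Python. The file name and column names (applicants, enrolled, grad_rate_4yr) are placeholders for whatever your IPEDS extract uses, not the official IPEDS variable names.

    import pandas as pd

    # Hypothetical IPEDS extract of four-year institutions; column names are placeholders.
    ipeds = pd.read_csv("ipeds_four_year.csv")  # unitid, year, applicants, enrolled, grad_rate_4yr

    # Applicants per enrolled student: higher values mean more demand and more room to be selective.
    ipeds["AppsPer"] = ipeds["applicants"] / ipeds["enrolled"]

    # Bin within each year so the general rise in application counts doesn't shift the scale over time.
    # Bin 5 holds the institutions with the most applicants per enrolled student in that year.
    ipeds["AppsPer_bin"] = ipeds.groupby("year")["AppsPer"].transform(
        lambda x: pd.qcut(x, 5, labels=False, duplicates="drop") + 1
    )

    # Mean four-year graduation rate by year and selectivity bin -- one line per bin when plotted.
    trend = ipeds.groupby(["year", "AppsPer_bin"])["grad_rate_4yr"].mean().unstack()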

As expected, more selective schools have higher graduation rates across the time period. While all selectivity ranges saw increased graduation rates, this is most dramatic in the least-selective institutions. The pattern is similar if we use standardized test scores instead of applicants per student. These patterns are consistent with the idea that the first year of college is easier to pass for the least-prepared students. That can't be the whole story, but it seems likely to be part of it. And it seems likely that the grade effects found in the paper have continued, and perhaps have accelerated.

Conclusions

I take from this study a couple of useful points. The first year of college is really important. We already know that first-year retention is important, and at my university the most important predictor of retention is grades. That relationship is non-linear, and after covid it shows a discontinuity that penalizes low grades even more. We surmise that the pandemic interfered with high school preparation in math and writing, which affects the least-prepared students most.
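
As an illustration of what "non-linear with a post-covid discontinuity" might look like in code, here is a hedged sketch, not our actual model; the data file and column names (retained, gpa, cohort_year) are made up for the example.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical student-level file; column names are placeholders.
    students = pd.read_csv("first_year_students.csv")  # retained (0/1), gpa, cohort_year
    students["post_covid"] = (students["cohort_year"] >= 2020).astype(int)

    # Logistic regression of retention on first-year GPA, with a squared term for curvature
    # and an interaction to let the GPA effect shift after the pandemic.
    fit = smf.logit("retained ~ gpa + I(gpa**2) + post_covid + gpa:post_covid",
                    data=students).fit()
    print(fit.summary())

A meaningful interaction term would be one way for the post-pandemic penalty on low grades to show up in a model like this.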

This study shows how grading and learning can diverge, and there is a tension between two things we care about:
  1. Less grading rigor increases retention and graduation
  2. More grading rigor is associated with more learning

This tension puts access and accountability on a collision course. Before jumping to conclusions, however, I recommend reading at least the conclusion section of the Denning et al. paper. The authors are careful to outline the implications; for example, they stop short of an overall judgment that the GPA trends lead to bad outcomes for students. I'll discuss a few other current research articles to dig into these questions in the next few posts.

Friday, March 17, 2023

Do this, Don't do that

Sign, sign
Everywhere a sign
Blockin' out the scenery
Breakin' my mind
Do this, don't do that
Can't you read the sign?
 
-"Signs" by Les Emmerson (Five Man Electrical Band)
 
 
A February 2023 Ezra Klein podcast pokes at his audience with the title "How Liberals — Yes, Liberals — Are Hobbling Government." He interviews Nick Bagley, a law professor at the University of Michigan, who has fascinating ideas about government regulation. Really. 

The whole piece is great, but I want to extract a simple idea, that regulations can get in the way of regulation. That is, we can have too many rules for them to be effective in performing their advertised purpose, like maintaining a high quality educational system, or ensuring clean air and water. Bagley notes that small-government-leaning lawmakers can restrict the power of government by creating so many regulations that the functions of an office grind to a halt. But even when lawmakers have good intentions, the effect can be the same. The obvious question is how much regulation is enough?

 

Rules, Thin and Thick

I heard the podcast after having read Lorraine Daston's Rules: A Short History of What We Live By. If you want to get up to speed, you can read reviews in The Wall Street Journal, Law & Liberty, and The New Yorker. A key idea is Daston's conception of a "thick" rule, which can be succinct because it relies on good judgment in enforcement (e.g. the golden rule). Exceptions and rare conditions can be handled according to the characteristics of a given case. A "thin" rule, by contrast, is very specific and meant to be followed absolutely (e.g. always stop at a stop sign). She observes that, coinciding with the scientific revolution and its absolute "laws of nature," thin rules have increasingly become the norm.
 
Thin rules appeal to objectivity and fairness ("justice is blind"), whereas thick rules depend on the wisdom and benevolence of rulers. We can see immediately that a population suspicious of its government would tilt toward thin rules. However, governments cannot be run on thin rules exclusively, as I'll discuss below. 

For thin rules to work, we have to have a reliable means of knowing whether a rule is being followed or not. Games like chess have rules that pertain to a small number of piece types (pawns, etc.) and a checkered board. The piece types are recognizable with respect to their roles, and the rules are unambiguously defined for every board position. This is quite different from a baseball umpire calling balls and strikes, where the rules are just as clear, but their adjudication is somewhat subjective. Chess rules are effectively thinner than baseball's, not because of the way they are written but because game states like checkmate can be identified with near-perfect reliability, while a called strike cannot. We may hear fans yelling "the ump is blind," but there's no equivalent for chess.

The first conclusion, then, is that the effectiveness of regulation depends on the specificity of rules and how reliably we can determine the state of compliance. There are two possible ways we might proceed: (1) make rules thinner and thinner until we reach acceptable reliability, or (2) accept that some rules will be relatively thick. Unfortunately, the first option has severe problems.

 

Reliability 

Imagine that some case of potential rule violation is being adjudicated, and we give this task to several independent groups who all use the same evidence for their respective deliberations. If they reach similar conclusions over many such cases, we'd conclude that the process is reliable. This can be formalized in a number of ways, including intuitive measures of "rater agreement." This is a subject I've done work on: see here and here.  A more measurement-based approach is to compare the variation in the objects being observed to the variation in instrumentation. A speedometer that wiggles around while the car is moving at constant speed exhibits "noise" or "error," whence we get ideas like signal-to-noise ratios, systematic errors (when they are biased in one direction), and the educational measurement definition of reliability as the ratio of "good" variance to total variance. 
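
To make the variance-ratio idea concrete, here is a toy simulation (all numbers invented) showing reliability as the share of total variance that comes from real differences between the cases being rated rather than from rater noise.

    import numpy as np

    rng = np.random.default_rng(0)

    # 200 cases, each scored by 3 independent raters; signal and noise levels are made up.
    n_cases, n_raters = 200, 3
    true_quality = rng.normal(0.0, 1.0, n_cases)              # real differences between cases
    noise = rng.normal(0.0, 0.6, (n_cases, n_raters))         # rater error
    ratings = true_quality[:, None] + noise

    signal_var = true_quality.var()
    noise_var = noise.var()
    reliability = signal_var / (signal_var + noise_var)       # "good" variance over total variance
    print(round(reliability, 2))                              # roughly 1 / (1 + 0.36), about 0.74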

In adjudicating rules, we might roughly divide reliability issues into perception and decision-making. The first concerns the quality of the information available, and the second refers to the potential arbitrariness of decisions (inter-rater agreement again). 

Reliability may be associated with fairness, but too much reliability can be harmful, as with mandatory sentencing guidelines that leave no room for a judge's discretion. The illusion of reliability can also be harmful. For example, processes and paperwork can be verified as existing, but the reliable existence of reports, say, does not mean that those reports are meaningful signs of following rules. It is part of the human condition, I think, to equate process reliability with process validity, but it's a bad assumption. This relates to the educational measurement idea of consequential validity, which we can simplify here to "does it really matter?"

 

Deterministic Rules

The thinnest possible rules are purely logical, like computer programs. That's fortunate, because a vast amount of research has been done on their limits. One of the foundational results, due to Alan Turing (1937), is the Halting Problem, which can be extended to show that many of the questions we would like to ask about a computer program can't be answered in general, such as "will it ever crash?"

Suppose you move to a new state, and the power company won't take your check without an in-state driver's license. But the DMV won't give you a driver's license without proof that you live in the state, proof being a power bill sent to your address. This is one way for a deterministic process to freeze up (in computing terms, a deadlock caused by a circular dependency). There's no general way to analyze a system of rules to eliminate such problems. Software companies have systems for spotting and fixing problems as they arise (debugging), which means continual adjustment to the computer code. Libraries of known-good code are used as building blocks. But this is an uncertain human process, and not guaranteed to succeed.
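
To see how a circular rule set freezes up, here is a toy model of the power-company/DMV standoff as a dependency graph with a cycle; the requirement names are purely illustrative.

    # Toy dependency graph; each requirement lists what must already be satisfied first.
    requires = {
        "power_account": {"drivers_license"},
        "drivers_license": {"proof_of_residence"},
        "proof_of_residence": {"power_account"},   # a power bill is the accepted proof
    }

    def can_satisfy(goal, in_progress=frozenset()):
        """Return True if the goal is reachable; a cycle means it never is."""
        if goal in in_progress:                    # already trying to satisfy this goal: deadlock
            return False
        return all(can_satisfy(dep, in_progress | {goal})
                   for dep in requires.get(goal, set()))

    print(can_satisfy("drivers_license"))          # False -- the rules freeze up, as described above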

In practice, software engineering is a lot of planning followed by finding and fixing problems. There are different philosophies of design, as described in The Problem with Software: Why Smart Engineers Write Bad Code. The author concludes that "[...] software developers seem to do everything in their power to make even the easy parts harder by wasting an inordinate amount of time on reinvention and inefficient approaches. A lot of mistakes are in fundamental areas that should be understood by now [...]" (see the ACM review).

All of this is to say that the problems of regulation are likely to multiply the thinner we make the rules. The more they resemble computer programming, the more the resulting system becomes bewilderingly complex and impossible to maintain. When my university's finance and human resources processes moved to an intricate online system, some issues that could previously have been resolved informally suddenly had to be handled within a system strictly defined by computer programs; a simple phone call to clear up confusion became a software "escalation event," searching for a fix at a high enough level of abstraction to cover the case. Thin rules multiply because the world is complex, constantly demanding new exceptions.

 

Theories of Regulation

The properties of regulation have not gone unnoticed by the academy. There are organizations devoted to the study of regulations, for example, this one, supported by the University of Florida. Here's a snip from that source:
Normative theories of regulation generally conclude that regulators should encourage competition where feasible, minimize the costs of information asymmetries by obtaining information and providing operators with incentives to improve their performance, provide for price structures that improve economic efficiency, and establish regulatory processes that provide for regulation under the law and independence, transparency, predictability, legitimacy, and credibility for the regulatory system.
The role of competition is to take burdens off the regulator by delegating adjudication to the "real world," which allows a regulator to focus on creating a level playing field, so that the "natural laws" of competition apply equally to all, like an umpire in a baseball game.

The same source describes the role of the human regulator, concluding with the advice that
A regulator should carefully map crucial relationships, know their natures, and build a strong regulatory agency.  The regulator should also stir and steer, but always with humility, knowing that by stirring the pot the regulator is surfacing problems that others might think the regulator should leave alone, and that by steering the regulator is providing direction that policymakers and lawmakers properly see as theirs to provide, but which they cannot provide because of their limited information and knowledge.
This does not sound like computer programming at all, and it's clear that this conception of leadership leans on professional judgment and the desire to do good. 
 
The thinnest rules I saw mentioned at this source concerned price controls, where rules and metrics mix. And there's an indirect reference to reliability in "Regulation is predictable if regulatory decisions are consistent over time so that stakeholders are able to anticipate how the regulator will resolve issues." Such consistency depends on (1) common understanding of rules, (2) reliable state identification, and (3) reliable adjudication. The middle part of that sandwich is likely to be a problem, because words seem concrete when we write them, but whatever bits of reality correspond to those words may be ill-defined.

 

Rules for Humans

Rules constraining human behavior are not like computer programs, because logic gates don't have internal motivations. As soon as motivation enters the picture, all this changes, including with computers. One of the first electromechanical computers famously attracted a moth into its warm innards, causing errors in the output. To "debug" a computer was already a term of art, but this made it a literal process. I assume that the glowing vacuum tubes of the subsequent generation of computing machines attracted even more of the furry beasts, but Google has let me down there; it wants to sell me moth balls and vacuum cleaners.

One way that thin rules can be subverted is through motivated nominal adherence, where the regulation is followed to the letter but not in spirit. Consider the case of VW's diesel emissions. As described in Car & Driver:

Volkswagen installed emissions software on more than a half-million diesel cars in the U.S.—and roughly 10.5 million more worldwide—that allows them to sense the unique parameters of an emissions drive cycle set by the Environmental Protection Agency. [...]
 
In the test mode, the cars are fully compliant with all federal emissions levels. But when driving normally, the computer switches to a separate mode—significantly changing the fuel pressure, injection timing, exhaust-gas recirculation [permitting nitrogen-oxide emissions] up to 40 times higher than the federal limit.

The pattern of nominal compliance with rules, combined with self-serving side effects, is a very old one, and it is the founding principle of classical Cynicism, where Diogenes was instructed by the oracle to "debase the coin of the realm." See my extended thoughts on the role of Cynicism in higher education here, and see Louisa Shea's excellent book The Cynic Enlightenment.

Nominal compliance is related to the idea of cheating, where the rules appear to be followed but are either being secretly broken or are being followed in a way that subverts their larger purpose. There's an online forum devoted to stories of the latter case (https://www.reddit.com/r/MaliciousCompliance/). One of my favorite stories, however, comes from Richard Goodwin's memoir Remembering America. While serving in the army, he helped prepare for an inspection of the post and discovered that there was one more two-and-a-half-ton truck in the motor pool than was listed on the manifest. The extra truck created quite a panic until Goodwin came up with the solution: at night, a trusted crew dug a large pit and buried the truck, so it would never be seen by the inspectors and the count would be correct.
 
The point here is two-fold. First, no thin regulation will capture the overall intent of the rule: there's a gap between the "letter of the law" and the "spirit of the law," as we say. This gap can only be perceived and addressed by a thicker adjudication than the formal (thin) one. If this thick perception did not exist, we would never detect cheating or subversion. 
 
A nice example of these layers of perception comes from Arlie Russell Hochschild's book Strangers In Their Own Land. Louisiana (finally) outlawed driving with unsealed containers of alcohol in 2004, but compliance can be nominal:
At a Caribbean Hut in Lake Charles, a satisfied customer reported ordering a 32-ounce Long Island Iced Tea with a few extra shots, a piece of Scotch tape placed over the straw hole--so it was "sealed"--and drove on (p. 67).
Second, rule enforcement in the face of adverse motivation is an evolving challenge. Arnold Kling (substack) wrote about this in a blog post, "The Chess Game of Financial Regulation":
It turns out that financial regulation is not like a math problem, which can be solved once and stays solved. Instead, financial regulation is like a chess game, in which moves and counter-moves proceed continually, eventually changing the board in ways that players have not anticipated.
Effective regulation, then, is complex and entails second- and third-order thinking about how the rules will change behavior in ways that subvert their intent. Prohibition in the United States is an example of a massive regulatory failure caused by such side effects.

 

Thick Rules

Not all rules can be thin, because not all rules can be reliably adjudicated. Facebook's moderation efforts, as documented in Steven Levy's very readable Facebook: The Inside Story, provide several examples. Here's one: "lactivists" demonstrated at the company's headquarters to push for photos of breastfeeding moms to be exempt from the no-nudity moderation rules. Daston's book has sometimes-hilarious examples of attempts to use thin rules for thick purposes, such as French sumptuary laws (e.g. "beaked shoes cannot have points longer than a finger's width"). Another of Arnold Kling's posts relates thin/thick rules (he doesn't use those terms; he talks about formal versus informal rules) to the sociology of small and large groups.  

Daston's book describes the adjudication of thick rules as "casuistry," a process of looking at a complex case from all sides, considering rules in the context of exceptions to them. Like "Cynicism," the word casuistry has lost its potency; both are victims of modernity's regression to the mean (in German there are two words for the former to mark the change: Kynismus and Zynismus, see here). It's more useful to take both words in their original senses, where they are complementary in my reading:
  • Casuistry: the art of making judgments for thick rules by considering all angles
  • Cynicism: the art of revealing a deeper reality by subverting conventions.
Plato defined humans as "featherless bipeds," whereupon Diogenes tossed a plucked chicken at his feet. Regulators assure us that banks are sound; depositors withdraw their money and demonstrate that "soundness" is just an appeal to have faith. The Ivy League uses SAT scores as proof of ability; rich parents pay others to take the test for their kids.  

Casuistry easily handles Cynical acts in adjudicating thick rules. We can see through the Varsity Blues deceptions and know that they are immoral. Thin rules are where Cynical attacks present the most challenges. 
 
Our university accreditor requires that we list all teaching faculty along with the courses they teach and their qualifications for doing so. There are thin rules to handle most of the cases: having a terminal degree in biology means you can teach any undergraduate biology class. For exceptions, the review team must use casuistry (no one calls it that). A yoga teacher for a physical education class probably doesn't have an advanced degree in yoga, for example. I think this combination of thick and thin rules works well in this case. The main vulnerability is to a Cynical attack that omits or misrepresents the evidence. If institutions regularly engaged in such deception, the accreditation process would devolve to paperwork reliability without validity. 

 

Conclusions

My first conclusion is that rules should only be as thin as they need to be, to avoid the measurement costs that come with the required reliability (thin, low-reliability rules are worse than useless). One way to audit an existing set of rules, then, is to pose the "how do we know?" challenge and examine how reliably it can be answered. In the case of the emissions requirement, I assume the actual measurements were highly reliable, but they failed to generalize to the performance of cars on the road, creating a gap between perceived reliability (in the lab) and actual meaningfulness.

The reliability calculation is confounded by motivation, which leads me to the second conclusion: regulation will be most effective when the motivation to subvert it is minimized. For example, a law requiring taxes to be included in all retail prices would make everyone's life easier, and as long as everyone has to do it, retailers have no motivation to subvert the rule. Similarly with the German rule (I'm told) that rotates which gas stations are allowed to be open on Sunday, so the staff get a break on the weekend and, on average, no station loses revenue.

My intent in working through these ideas is to apply them to higher education policy at the levels I engage with, including accreditation. I think there are shared goals in the regulatory triad (states, accreditors, federal government), which suggests that a joint discovery of the appropriate thickness of rules will be productive. Shared goals and collaborative rule-making can ideally result in the amount of trust necessary to permit casuistry for thick rules when they are called for. That is, everyone must agree that in some cases it is acceptable to give up some reliability in order to achieve depth and reasonableness. This may or may not be achievable, but at least these ideas provide a vocabulary to describe what actually does happen. 

This article was lightly edited on 5/21/2023, and I added the sealed container example.