Friday, August 16, 2024

A Canticle for Bloom*

Introduction

Stephen Jay Gould promoted the idea of non-overlapping magisteria, or ways of knowing the world that can be separated into mutually exclusive domains, where each "holds the appropriate tools for meaningful discourse and resolution." The tension Gould was trying to resolve was between religion and science:

Science tries to document the factual character of the natural world, and to develop theories that coordinate and explain these facts. Religion, on the other hand, operates in the equally important, but utterly different, realm of human purposes, meanings, and values—subjects that the factual domain of science might illuminate, but can never resolve. -- Stephen Jay Gould from Rocks of Ages

I'll call these "ways of knowing" or WOKs, which seems more down to earth than "magisteria." Each WOK contains cross-checks on knowledge that are particular to the domain.  Scientific questions are judged by scientific standards. Personal choices are based on experience and usually don't have "correct" answers but degrees of validation.

Should you give a friend a loan? That's a lived experience question. Are your car's spark plugs failing to ignite? That's a science question. Should you take a recommended drug despite severe side effects? That's somewhere in between. 

There's a third WOK we need to talk about: ideology. Steven Mintz recently provided a crisp definition in his blog at Inside Higher Ed:

Ideologies simplify, clean up and package reality into something easily consumable, palatable and appealing to a mass audience. In doing so, ideologues discard the messy, complex and often unpleasant aspects of reality, presenting only what fits neatly within their framework. Ideology, thus, distorts reality by filtering out anything that does not conform to its narrative.

Ideologies are particularly powerful when associated with a utopia: consider how Marxism evolved into an intellectual justification for Stalin's USSR, or how Lysenkoism enforced ideology on biological research.

Here's a diagram of our three WOKs with what we might find in the overlaps.

Galileo was interested in the physical reality of the cosmos (among other things), at the center of the diagram, creating new methods for WOK 2. But the correct way of speaking about the universe needed to adhere to church doctrine (ideology, WOK 3). This is the "correctness" overlap, only some of which corresponded to reality (geocentrism did not). See Steven Shapin's The Scientific Revolution for a nuanced narrative; it's not as simple as the usual telling of Galileo vs the Church. Jennifer Michael Hecht would rightfully insist on putting "ritual" in the overlap between ideology and lived experience, and argue that the ritual can fulfill a social and personal need. Ideology isn't bad; we need it. It can just overlap in odd ways with the other WOKs.

SLO Assessment 

Yesterday I participated in a conversation with a small group of experienced assessment directors as part of an ongoing project to fix accreditation standards. The discussion echoed a theme I heard last year when I interviewed a dozen peer reviewers from various accreditors: there's value in the formal kind of assessment that gathers data and runs statistics on it, but success more often comes from simply getting faculty together to talk about learning goals. We also talked about accreditation standards. I suggest that there are three important WOKs in assessment:
  1. Professional judgment and collaboration
  2. Educational measurement and inferential statistics
  3. Adjudication of accreditation policy
If we find that first-generation students have abysmal pass rates in math (WOK 2), this will affect conversations about pedagogy, support services, course prerequisites, and so forth that would happen in a department meeting (WOK 1). Conversely, if the math faculty all agree that Calculus 1 isn't adequately preparing students for Calculus 2 based on classroom experiences (WOK 1), it might prompt a more formal analysis of grades and test scores (WOK 2).

Perhaps in a perfect world, assessment offices would operate within these two WOKs, with a large faculty support role to facilitate conversations and share knowledge (WOK 1) and a separate function to gather and warehouse data, do research, and connect with the wider research community to bring ideas back (WOK 2). It's not clear that universities would fund such an outfit, however. Assessment offices are expensive and only exist because of accreditation requirements, to get the reports done (WOK 3).

 

The Third Circle

Any bureaucracy draws from a kind of ideology, at least implicitly. Paperwork and procedure serve to "simplify, clean up and package reality into something easily consumable," in Mintz's formulation. Behind the paperwork is a purpose: the DMV's goal is safer roads. The EPA's is  a clean environment. The validation of WOK 3 uses a formalized classification of the world (e.g. driver's test) to assign cases to policy distinctions (driver's license granted or denied).
 
Accreditation requirements provide the motivation to run assessment operations, but they also impose a particular ideology. I described its origins and effects in "Assessment standards are broken." In short, the ideology can be abbreviated as "define-measure-improve" and is a version of "scientific management," descendants of Taylor and Drucker and others, of which Six Sigma is a variation. There is an intended overlap with WOK 2, almost subsuming the scientific WOK within the ideology, with the goal "we're going to require you to use science to improve the state of education."
 
Robert Birnbaum catalogs variations of this idea, calling the phenomenon 
 
a paradox of complexity and simplicity. Its central ideas may appear brilliantly original. Yet at the same time they are so commonsensical as to make us wonder why we had not thought of them ourselves, and so obviously reasonable as to defy disagreement. 
-- Management Fads in Higher Education, p. 5
 
The result is that the internal validation of knowledge in WOK 3, which is done by trained peer review teams, assumes the preeminence of WOK 2 (scientific knowledge). 
 
Any new idea has to compete with existing ones, and faculty tradition was the natural enemy of the define-measure-improve protocol, despite the obvious overlap with what teachers do on the job, and what they want to accomplish. In the Sturm und Drang of 1983's A Nation at Risk, educators took a lot of heat (a theme in US politics). In higher education, teaching work had to be mapped from existing practices, seen as inefficient, to ones that aligned with the scientific management principles.
  • Defining learning

    • Old: choose textbooks, write syllabus, approve curriculum, create tests or other assessments

    • New: write statements of student learning objectives, often in a hierarchy of course, program, institution

  • Measurement

    • Old: grade tests, writing samples, performances, etc. Build a shared sense of acceptability via faculty consensus and constant exposure to students, assign summative course grades

    • New: use only a few approved methods, including specific assignments, papers associated with rubrics (no grades!). Learning is seen as distinct from more general student success. Emphasis on outputs instead of inputs.

  • Improvement

    • Old: Professional growth in teaching practice, department or institution level consensus on change (WOK 1), using data summaries like grade or test averages or pass rates to identify needs for improvement (WOK 2).

    • New: Averages or frequencies of approved data sources to find deficiencies, then imagine a way to remedy them.

The requirements to write reports using the new methods lobotomized WOK 1 for faculty. To comply with the reporting requirements they had to start over using approved replacement methods in order to be allowed to "know" how their students were doing. Naturally they resented this. Hated it, even, and have produced a genre of articles complaining about assessment. We still deal with the effects.
 
It's worth repeating that what assessment directors say works best is WOK 1--what the scientific approach was intended to replace--and because that's what actually works, the accreditation reviews have relaxed over time to allow more room for WOK 1. Your situation depends on your accreditor, but it's still an awkward fit because of the need to perform the other rituals (defining and gathering data in the approved way) in order to validate WOK 1, which doesn't really need that extra work to function.
 
As I described in "Assessing for Student Success," the science project falls apart immediately, because what students learn in a college curriculum is a lot of detailed topics with interconnections. I estimated several hundred topics (SLOs if you like) in a math curriculum. These can't be described, let alone measured, within the parallel framework the accreditors created.
 
It's worth noting that in the industrial setting, where these ideas were formed, it is possible to define and measure everything important, and have real-time data from instruments on an assembly line. That doesn't translate well to education, where there's little standardization.

The accreditation requirements strengthen the main thing that works (WOK 1) by creating more opportunities for faculty to talk about student learning, especially when facilitated by a good assessment director. But the requirements also diminish the effectiveness of faculty work by heaping on artificial demands in the name of science. This poor attempt to mandate WOK 2 fails the validity checks within WOK 2: the samples are too small and too noisy, and the causal models used are too simple.
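
For the statistically inclined, here's a rough illustration of the sample-size problem. The effect size and group sizes below are assumptions invented for this sketch (they're not from any particular report); the point is only how little detection power a section-sized sample has.

```python
# A rough power calculation for an assessment-sized sample.
# Effect size, group size, and alpha are illustrative assumptions only.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3    # a modest real improvement, in standard-deviation units
alpha = 0.05         # conventional significance threshold
n_per_group = 25     # e.g., one course section before vs. after a change

power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                       ratio=1.0, alpha=alpha)
n_needed = analysis.solve_power(effect_size=effect_size, power=0.8, alpha=alpha)

print(f"Chance of detecting the effect with {n_per_group} per group: {power:.0%}")
print(f"Students per group needed for 80% power: {n_needed:.0f}")
```

With these invented inputs, the chance of detecting the effect is under 20%, and roughly 175 students per group would be needed to reach the conventional 80% power. A "no difference found" result from a typical department report therefore says almost nothing.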

This collision between science and policy gets resolved ad baculum: you will be beaten with the stick of non-compliance until you at least pretend to believe that rubrics are always valid measures. In short, the SLO accreditation requirements co-opt the authority of science, but replace scientific standards with ideologically-correct ones. This has created a self-sustaining culture of compliance, abetted by consultants, vendors, and peer review training that maintains this closed garden of bureaucracy as science.

 

A Litmus Test

A few years ago, I concluded that the best way to illustrate how accreditation standards drove us into an epistemological ditch was to spotlight their allergy to course grades. In the age of big data, the idea that we'd arbitrarily throw out millions of data points covering the whole history of students at our institution is absurd. It only makes sense within the accreditation bubble, and so it highlights the difference between the scientific claims of accreditors and the reality. To heighten the contradictions, so to speak. You can find my summary of research on grades and learning here. Starting on page 23 you'll find a list of standard objections to using course grades as data about student learning.
 
I won't rehash that material here. It suffices to quote an article that cites Bloom (the taxonomy guy) from 1976, eight years before define-measure-improve kicked off:
 
Perhaps the most productive use of GPA is as a covariate. GPA has the potential to explain nearly half the variance in education research models (Bloom, 1976), thus shedding light on the variance explained by other variables of interest, such as changes in course or curriculum design.  
 
-- Bacon, D. R., & Bean, B. (2006). GPA in Research Studies: An Invaluable but Neglected Opportunity. Journal of Marketing Education, 28(1), 35–42.
 
 
When I have the opportunity, I ask accreditors what they think of course grades as data. This is a litmus test for self-reflection. So far they've all failed it. The wise ones don't want to come out too strongly against grades, because it implies they don't think transcripts are meaningful. They generally hedge, calling grades "indirect measures." But there's no test for directness in the vocabulary of WOK 2. You won't find a way to create p-values on directness in the manual on educational measurement, because the idea isn't statistical. We already have a robust vocabulary on reliability and validity, and there's no need to confuse the issue with "directness." 

When pushed on that point, one accreditor representative cited an anecdote about how a student didn't feel like the grade reflected learning. This would be ironic if the scientific claims of the define-measure-improve protocol were serious about the science. After dismissing faculty consensus as opinion, reciting "the plural of anecdote isn't data," and distributing buttons at conferences that read "show me the data," a high priest of the order can simply use an anecdote to dismiss the whole research record on course grades. The remarkable thing is that this doesn't seem to cause any cognitive dissonance. 

The point of this illustration is that the overlaps among the three WOKs can't presently withstand the light of public exposure. The accreditors will look ridiculous. We need to fix this before that happens, to create a more sensible overlap of the WOKs.

There's More

This article is long enough, and I'm going to stop here. I have not discussed the effective use of WOK 2 (educational measurement and inferential statistics) as a successful assessment tool. That may be addressed in a future post. Suffice it to say that the requirements of a good research project (e.g. a large data sample, tests of reliability) aren't feasible in 99% of department-level accreditation reports. From a practical point of view, it's a lot of extra work to do a real research project to get a tiny amount of credit, if any at all (none at all if you research retention instead of learning). The irony is that WOK 3 assumes that it contains WOK 2, when in fact the intersection is nearly empty.
 

Tuesday, June 11, 2024

Why the Student/Faculty Ratio is a Bad Metric

The student/faculty ratio, which represents on average how many students there are for each faculty member, is a common metric of educational quality. The ratio shows up in the Common Data Set (CDS) and college guides, presumably so prospective students can compare colleges. 

The standard way to count students and faculty for the CDS calculation is to convert each headcount into a smaller number of artificial standardized units. That's because some students may take a single class, while others take ten or more in an academic year, and faculty teaching loads similarly vary. This is usually done by converting each population into Full Time Equivalent (FTE) units. The idea is that a part-time student isn't the same as a full-time student, so we might count, say, three part-time students as equal to one full-time student. This is the CDS approach, which we can write as FTE = FT + PT/3.

The part-time conversion formula is ad hoc, chosen for convenience instead of meaningfulness. It treats all part-timers (students or faculty) as averaging a third of a full load, when that fraction will vary across institutions and over time. Additionally, "full time" is defined by policy, and this also varies by institution. A full-time faculty load at a research institution is probably less than the load at a teaching college. The full-time definition for students is usually a range of values, e.g. a full load is anywhere from 12 to 20 credits. There's a big difference between an average of 12 credits and an average of 16 or 18, taken over the whole student population, because it directly impacts how many course sections need to be taught. So the student FTE works okay as a measure of revenue (students paying tuition for a full load), but not as a demand measure (how many classes do we need to teach).
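
To see the arithmetic in one place, here's the CDS-style calculation in Python with invented headcounts; the only thing borrowed from the CDS convention is the FT + PT/3 formula.

```python
# CDS-style FTE arithmetic with invented headcounts (illustration only).
def cds_fte(full_time: float, part_time: float) -> float:
    """Common Data Set convention: each part-timer counts as a third of a full-timer."""
    return full_time + part_time / 3

student_fte = cds_fte(full_time=4000, part_time=1200)   # 4400.0
faculty_fte = cds_fte(full_time=250, part_time=90)      # 280.0

print(f"Student FTE: {student_fte:.0f}")
print(f"Faculty FTE: {faculty_fte:.0f}")
print(f"FTE student/faculty ratio: {student_fte / faculty_fte:.1f} to 1")
```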

Similarly, faculty members who get release time from teaching, or who conversely teach overloads, may be "full time" for contractual purposes even though that label doesn't reflect their classroom presence. There are complicated adjustments in the CDS definition, which refers to the AAUP definition of full-time faculty. For example, a faculty member on leave for research (e.g. sabbatical) still counts even though they are not in the classroom, but if a replacement is hired, that replacement should not be counted. A certain amount of judgment is required to decide whether a hire is a replacement or not, resulting in "house rules" for counting faculty.

These effects combine to erode the meaningfulness of an FTE-based student/faculty ratio. But we might take a step back and ask what the ratio is intended to measure in the first place.

Deriving the Raw Ratio

If we focus on the student classroom experience, we might think of a student in a class as the basic unit for counting. How many of these are there? If there are \(N_s\) students, and on average each student has a class load of \(L_s\) each academic year, then there are a total of \(N_s L_s\) units of "student-classes." That's how many of these experiential units were consumed.
 
From the faculty perspective--the production side--if there are \(N_f\) faculty members, with an average teaching load of \(L_f\) classes, and average class size of \(A\), then the total student-classes is those three numbers multiplied together. Since student-classes taught (produced) must equal student-classes taken (consumed), these can be set against each other as
 
$$ N_s L_s = N_f L_f A $$ 

Using these raw counts (not FTEs) of students and faculty, the ratio is then 

 $$ \frac{N_s}{ N_f} = A \frac{L_f }{ L_s} $$ 

The average loads for faculty and students are largely determined by policy. For example, if six classes per year is the contractual load for a full-time faculty member, and ten courses per year is necessary to complete a bachelor's degree in four years, then the load fraction is 6/10. Because of part-timers and overloads, the measured load averages won't be exact, but they should be close to that and relatively stable over time for an institution--as long as policies don't change. 
 
The raw student/faculty ratio thus measures average class size times an index of institutional policy, the load ratio \(L_f / L_s\), which folds in the prevalence of part-timers and overloads. This load ratio won't be the same between institutions, so including it as a factor is not appropriate if we want a comparable index. If we drop the load ratio from the right side, we have average class size, which is a comparable index of student experience. It's crude--a distribution would be better--but as a single metric it's not terrible.
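
Here's a quick numerical check of that identity in Python, with invented numbers. Any consistent set of counts will do; the point is only that the raw ratio equals average class size times the load ratio.

```python
# Numerical check of  N_s / N_f = A * (L_f / L_s)  using invented numbers.
N_f = 200     # faculty headcount
L_f = 5.1     # average classes taught per faculty member per year
A   = 24.0    # average class size
L_s = 9.0     # average classes taken per student per year

# Student-classes produced must equal student-classes consumed,
# so the student headcount is implied by the other three numbers.
N_s = N_f * L_f * A / L_s

raw_ratio = N_s / N_f
predicted = A * L_f / L_s

print(f"Raw student/faculty ratio: {raw_ratio:.2f}")
print(f"A * (L_f / L_s):           {predicted:.2f}")   # identical by construction
```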
 

The FTE Ratio

 
With that understanding, we can now see what the FTE ratio is all about. Suppose we create a "true" FTE calculation for faculty by dividing the total number of classes taught by the policy's specification for a full load. So if 3000 courses are taught in an academic year, and the faculty handbook says the load for a full-time faculty member is six courses, then the FTE faculty is 3000/6 = 500. We can do a similar calculation with students, e.g. using twelve credits as a full-time student load.

In the terms defined earlier, the number of classes taught is \(N_f L_f\), which we need to divide by the faculty policy-defined load \(P_f\) to get \( \text{FTE}_f = N_f L_f / P_f \). Similarly for students, \( \text{FTE}_s= N_s L_s/ P_s \). In each case, the \( L/P\) fraction is expressing the measured average load as compared to the policy load. If there are a lot of adjuncts teaching, the average load per faculty member might be 5.1, whereas the full-time load is 6. Putting this together we have
 
$$ \frac{\text{FTE}_s}{\text{FTE}_f} = \frac{N_s L_s}{N_f L_f} \cdot \frac{P_f}{P_s} $$ 

The derivation in the previous section shows that 
 
 $$ A = \frac{N_s L_s}{N_f L_f} $$ 
 
so the FTE ratio is: 

$$ \frac{\text{FTE}_s}{\text{FTE}_f} = A \frac{P_f}{P_s} $$ 

The difference between the raw ratio and the FTE ratio is that the former uses empirical average loads for students and faculty, whereas the FTE version uses policy-defined loads. In both cases, these vary by institution.

In the actual CDS calculations, the house rules and approximations, like counting each part-timer as a third of a full-timer, will add error, so you probably won't get exactly the formula above.
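
Continuing the invented numbers from the earlier sketch, here's how the raw ratio and an idealized FTE ratio (ignoring the house rules) come apart when empirical loads differ from policy loads.

```python
# Raw ratio vs. an idealized FTE ratio, continuing the invented example above.
N_f, L_f, A, L_s = 200, 5.1, 24.0, 9.0   # headcount, loads, class size (invented)
P_f, P_s = 6.0, 10.0                     # policy-defined full loads (invented)

N_s = N_f * L_f * A / L_s                # implied student headcount

fte_f = N_f * L_f / P_f                  # classes taught divided by policy load
fte_s = N_s * L_s / P_s                  # classes taken divided by policy load

print(f"Raw ratio   N_s/N_f     = {N_s / N_f:.1f}")     # A * L_f / L_s = 13.6
print(f"FTE ratio   FTE_s/FTE_f = {fte_s / fte_f:.1f}") # A * P_f / P_s = 14.4
print(f"Average class size  A   = {A:.1f}")
```

With these assumptions the two ratios differ by nearly a full point even though nothing about the classroom experience changed; only the assumed loads moved.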

Discussion

The student/faculty ratio conflates two types of educational quality. Average class size crudely evaluates the quality of in-class instruction, as a measure of accessibility to faculty while teaching. The load ratio (empirical or policy-based) assesses how much time faculty have to devote to each class, as well as how much students must spread themselves around to cover the required load. These are related to classroom experience, but they are different dimensions. I suggest we adopt a rule of "one measure per dimension," in which case we could describe classroom quality with (1) average class size, (2) average teaching load, and (3) average student load. Attempting to combine all that information into one metric just creates a mess.

I've never seen the above derivation before, and until I did it I didn't know what was going into the ratio calculation. I suspect that most producers and consumers of the metric don't really understand it, and probably misuse it. For many purposes, average class size is a convenient summary measure of student experience that also represents operational efficiency: financially, it represents both revenue and cost on the same scale. It can be aggregated or disaggregated to whatever level of analysis you care to do. And it's easily understood and communicated.

Bottom line: use average class size instead of student/faculty ratio.