Saturday, July 29, 2023

Average Four-Year Degree Program Size

"How much data do you have?" is an inevitable question for program-level data analysis. For example, assessment reports that attempt understand student learning within an academic major program typically depend on final exams, papers, performance adjudication, or other information drawn from the seniors before they graduate: a reasonable point in time to assess the qualities of the students before they depart. Most accreditors require this kind of activity with a standard addressing the improvement of student learning, for example SACSCOC's 8.2a or HLC's 4b. 

The amount of data available for such projects depends on the number of graduating seniors. To get an overall sense of these amounts, I pulled the counts reported to IPEDS for 2017 through 2019 (pre-pandemic) for all four-year (bachelor's degree) programs. Each row of data comes with a disciplinary CIP code, a decimal-system index that describes a hierarchy of subject areas. For example, 27.01 is Mathematics and 27.05 is Statistics, while Psychology majors start with 42.
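
To make the hierarchy concrete, here is a minimal R sketch of how CIP-6 codes roll up to CIP-4 and CIP-2 by truncating the code string. The example codes are real; in the IPEDS completions files the code lives in a CIPCODE column, and it pays to treat it as text so the trailing zeros survive.

    # Minimal sketch: CIP codes roll up by truncating the "xx.yyzz" string.
    cip6 <- c("27.0101", "27.0501", "42.0101", "42.2701")  # Math, Statistics, Psych, Cognitive Psych

    cip4 <- substr(cip6, 1, 5)  # "27.01", "27.05", "42.01", "42.27"
    cip2 <- substr(cip6, 1, 2)  # "27",    "27",    "42",    "42"

    data.frame(cip6, cip4, cip2)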

We have to decide what level of CIP code to count as a "program." The density plot in Figure 1 illustrates all three levels of resolution: CIP-2 is the most general (e.g. code 27 includes all of math and statistics and their specializations), CIP-4 narrows to a subject area like 27.05 (Statistics), and CIP-6 pins down the specialization.

There are a lot of zeros in the IPEDS data, meaning that an institution reports having a program that produced no graduates that year. In my experience, peer reviewers are reasonable about that, and will relax the expectation that all programs produce data-driven reports, but your results may vary. For purposes here, I'll assume the Reasonable Reviewer Hypothesis, and omit the zeros when calculating statistics like the medians in Figure 1.
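
The medians in Figure 1 come from a calculation along these lines. This is a hedged sketch rather than the original script: it assumes a data frame with one row per institution and CIP-6 code, where avg_grads holds the average of the 2017-2019 counts (the column names unitid, cip6, and avg_grads are mine).

    # Toy stand-in for the IPEDS-derived data (made-up IDs and counts).
    grads <- data.frame(
      unitid    = c(1001, 1001, 1002, 1002),
      cip6      = c("27.0101", "27.0501", "27.0101", "42.0101"),
      avg_grads = c(12, 0, 7, 45)
    )

    library(dplyr)

    # Median program size at a given CIP resolution, dropping zero-graduate
    # programs per the Reasonable Reviewer Hypothesis.
    median_size <- function(grads, level_chars) {
      grads |>
        mutate(cip = substr(cip6, 1, level_chars)) |>
        group_by(unitid, cip) |>
        summarize(n = sum(avg_grads), .groups = "drop") |>
        filter(n > 0) |>
        summarize(median_size = median(n))
    }

    median_size(grads, 2)  # CIP-2
    median_size(grads, 5)  # CIP-4
    median_size(grads, 7)  # CIP-6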


 
Figure 1. IPEDS average number of graduates for four-year programs, 2017-19, counting first and second majors, grouped by CIP code resolution, with medians marked (ignoring size zero programs).
 
CIP-6 is the most specific code, and is the level usually associated with a major. The Department of Homeland Security has a list of CIP-6 codes that are considered STEM majors; for example, 42.0101 (General Psychology) is not STEM, but 42.2701 (Cognitive Psychology and Psycholinguistics) is. The CIP-6 median size is nine graduates, and it's reasonable to expect that institutions identify major programs at this level. But to be conservative, we might imagine that some institutions can get away with assessment reports for logical groups of programs instead of each one individually. Taking that approach and combining all three CIP levels effectively assumes a range of institutional practices, and it enlarges the sample sizes for assessment reports. Table 1 was calculated under that assumption.
 
Table 1. Cumulative distribution of average program sizes at selected cutoffs.

Average program size   Percent of programs
fewer than 5           30%
fewer than 10          46%
fewer than 20          63%
fewer than 30          72%
fewer than 50          82%
fewer than 400         99%

Half of programs (under the enlarged definition) have fewer than 12 graduates a year. Because learning assessment data is typically prone to error, a practical rule of thumb for a minimum sample size is N = 400, which begins to permit reliability and validity analysis. Only 1% of programs have enough graduates a year for that. 
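
Given a vector of the nonzero program sizes, the table and the median are one-liners in R. The object sizes is assumed to hold the average annual graduate count for every institution-by-CIP program with at least one graduate.

    # Cumulative share of programs under each Table 1 cutoff, plus the median.
    cutoffs <- c(5, 10, 20, 30, 50, 400)
    setNames(sapply(cutoffs, function(k) mean(sizes < k)), cutoffs)
    median(sizes)   # about 12 in the IPEDS pull described above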

A typical hedge against small sample sizes is to look at the data only every three years or so, in which case around half the programs would have at least 30 in their sample, but only if they got data from every graduate, which often isn't the case. Any change coming from the analysis has a built-in lag of at least four years from the time the first of those students graduated. That's not very responsive, and it would only be worth the trouble if the change has a solid evidentiary basis and is significant enough to have a lasting and meaningful impact on teaching and learning. But a sample of 30 isn't going to be enough to establish that evidentiary basis either.
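
As a rough check on that last claim, base R's power.t.test will estimate the sample needed just to detect a conventionally "small" effect (0.2 standard deviations) against a fixed benchmark, even at the lenient alpha of .10 that comes up below.

    # Sample size to detect a small effect (d = 0.2) against a fixed benchmark,
    # one-sample t-test, alpha = .10 (two-sided), 80% power.
    power.t.test(delta = 0.2, sd = 1, sig.level = 0.10, power = 0.80,
                 type = "one.sample", alternative = "two.sided")
    # n comes out to roughly 155 -- several times a three-year sample of 30.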

One solution for the assessment report data problem is to encourage institutions to research student learning more broadly--starting with all undergraduates, say--so that there's a useful amount of data available. The present situation faced by many institutions--reporting by academic program--guarantees that there won't be enough data available to do a serious analysis, even when there's time and expertise available to do so. 

The small sample sizes lead to imaginative reports. Here's a sketch of an assessment report I read some years ago. I've made minor modifications to hide the identity.

A four-year history program had graduated five students in the reporting period, and the two faculty members had designed a multiple-choice test as an assessment of the seniors' knowledge of the subject. Only three of the students took the exam. The exam scores indicated that there was a weakness in the history of Eastern civilizations, and the proposed remedy was to hire a third faculty member with a specialty in that area. 

This is the kind of thing that gets mass-produced, increasingly assisted by machines, in the name of assuring and improving the quality of higher education. It's not credible, and a big part of the problem is the expected scope of research, as the numbers above demonstrate. 

Why Size Matters

The amount of data we need for analysis depends on a number of factors. If we are to take the analytical aspirations of assessment standards seriously, we need to be able to detect significant changes between group averages. This might be two groups at different times, two sections of the same course, or the difference between actual scores and some aspirational benchmark. If we can't reduce the error of estimation to a reasonable amount, such discrimination is out of reach, and we may make decisions based on noise (randomness). Bacon & Stewart (2017) analyzed this situation in the context of business programs. The figure below is taken from their article. I recommend reading the whole piece.

Figure 2. Taken from Bacon & Stewart, showing minimum sample sizes needed to detect a change for various situations (alpha = .10). 

Although the authors are talking about business programs, the main idea--called a power analysis--is applicable to assessment reporting generally. The factors included in the plot are the effect size of some change, the data quality (measured by reliability), the number of students assessed each year, and the number of years we wait to accumulate data before analyzing it. 

Suppose we've changed the curriculum and want the assessment data to tell us if it made a difference. If the data quality isn't up to that task, it also isn't good enough to tell us there was a problem to fix in the first place--it's the same method. Most effect sizes from program changes are small. The National Center for Education Evaluation has a guide for this. In their database of interventions, the average effect size is .16 (Cohen's d, the number of standard deviations by which the average measure changes), which is "small" in the chart. 
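
To make that concrete, here's a small R illustration with made-up numbers: the pooled-SD version of Cohen's d for a two-group comparison, and what a .16 effect amounts to on a familiar scale.

    # Cohen's d for a two-group comparison: mean difference in pooled-SD units
    # (equal group sizes assumed for the simple pooling below).
    cohens_d <- function(before, after) {
      sd_pooled <- sqrt((var(before) + var(after)) / 2)
      (mean(after) - mean(before)) / sd_pooled
    }

    # On a hypothetical 0-100 rubric with an SD of 10 points (made-up numbers),
    # the average intervention effect of d = .16 is a shift of
    0.16 * 10   # 1.6 points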

The reliability of some assessment data is high, like a good rubric with trained raters, but that kind of data is expensive to produce, so there's a trade-off with sample size. Most assessment data will have a reliability of .5 or less, so the most common scenario is the top line on the graph. In that case, if we graduate 200 students per year and all of them are assessed, it's estimated to take four years to accumulate enough data to accurately detect a typical effect size (and since alpha = .1, there's still a 10% chance we conclude there's a difference when there isn't). 
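
Here's a back-of-the-envelope version of that estimate. It isn't Bacon & Stewart's exact model, just the standard two-sample power calculation with the effect size attenuated for measurement error (observed d is roughly the true d times the square root of the reliability).

    # Pre/post comparison of program means: true effect d = 0.16, outcome
    # reliability 0.5, alpha = .10 (one-sided), 80% power.
    d_true      <- 0.16
    reliability <- 0.5
    d_observed  <- d_true * sqrt(reliability)   # measurement error shrinks the detectable effect

    power.t.test(delta = d_observed, sd = 1, sig.level = 0.10, power = 0.80,
                 type = "two.sample", alternative = "one.sided")
    # Roughly 700 students per group -- the same ballpark as several years of
    # assessing 200 graduates a year on each side of a change.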

With a median program size of 12, you can see that this project is hopeless: there's no way to gather enough data under typical conditions. Because accreditation requirements force the work to proceed, programs have to make decisions based on randomness, or at least pretend to. Or risk a demerit from the peer reviewer for a lack of "continuous improvement."  

Consequences 

The pretense of measuring learning in statistically impossible cases is a malady that afflicts most academic programs in the US because of the way accreditation standards are interpreted by peer reviewers. This varies, of course, and you may be lucky enough that this doesn't apply. But for most programs, the options are few. One is to cynically play along, gather some "data" and "find a problem" and "solve the problem." Since peer reviewers don't care about data quantity or quality (else the whole thing falls apart), it's just a matter of writing stuff down. Nowadays, ChatGPT can help with that. 

Another approach is to take the work seriously and just work around the bad data by relying on subjective judgment instead. After all, the accumulated knowledge of the teaching faculty is way more actionable than the official "measures" that ostensibly must be used. The fact that it's really a consensus-based approach instead of a science project must be concealed in the report, because the standards are adjudicated on the qualities of the system, not the results. And the main requirement of this "culture of assessment" is that it relies on data, no matter how useless it is. In that sense, it's faith-based.

You may occasionally be in the position of having enough data and enough time and expertise to do research on it. Unfortunately, there's no guarantee that this will lead to improvements (a significant fraction of the NCEE samples have a negative effect), but you may eventually develop a general model of learning that can help students in all programs. Note that good research can reduce the sample size needed by attributing score variance to factors other than the intervention, e.g. student GPA prior to the intervention. This requires a regression modeling approach that I rarely see in assessment reports, which is a lost opportunity.
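
A minimal sketch of what that looks like, with a made-up data frame seniors (columns score, intervention, and prior_gpa): the covariate soaks up score variance unrelated to the curriculum change, which shrinks the standard error on the intervention term relative to a plain two-group comparison.

    # Compare the intervention coefficient's standard error with and without
    # the prior-GPA covariate (made-up column names).
    fit_plain    <- lm(score ~ intervention,             data = seniors)
    fit_adjusted <- lm(score ~ intervention + prior_gpa, data = seniors)

    summary(fit_plain)      # intervention std. error without the covariate...
    summary(fit_adjusted)   # ...usually noticeably smaller with it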
 

References

Bacon, D. R., & Stewart, K. A. (2017). Why assessment will never work at many business schools: A call for better utilization of pedagogical research. Journal of Management Education, 41(2), 181-200.

Tuesday, July 11, 2023

Driveway Math

In 2015 I bought a house on the slope of Paris Mountain in Greenville, South Carolina. A flat spot for the house was bulldozed out of the hillside, leaving a steep bank behind the house and a drop from the garage floor down to the street, so the driveway in between is slanted at about 15 degrees. That turns out to be a lot of degrees. I first noticed that this could be a problem when I drove up from Florida in my 2005 Camry and, after an exhausting day on I-95, finally pulled into the garage. CRUNCH went the bottom of the car as it ground over the peak caused by the driveway slant.

Figure 1. Cross section of the driveway and garage floor.

I learned my lesson and bought an SUV, but being limited in what car one can own by some bothersome geometry is annoying. So I began investigating what modification to the driveway would allow more options. There's currently a small bevel at the join between the slanted and flat concrete, and the idea is to expand that in both directions. How wide would the bevel need to be to allow a Prius to get up the driveway without bottoming out? Or a Chevy Bolt?

The important car dimensions here are the wheelbase (distance between axles) and the ground clearance. These are relatively easy to find on the internet, and bulk downloads exist (for a price). One useful data point was that my wife's 2010 Honda Civic just barely touched the concrete, giving me an empirical threshold to work from. Initially, I just computed the ratio of wheelbase to clearance and used that to estimate which cars would make it into the garage without the nasty crunch. Anything with a ratio greater than about 15.7 was going to be a problem. However, this greatly limits the cars that will work, and that's become an issue lately. We'd like to have the option of getting a small used EV to just drive in town, to complement the Rav4 we share now. 
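
For what it's worth, that cutoff is consistent with simple geometry. With the wheels straddling a sharp 15-degree crest and no bevel, the low point of the undercarriage dips about (wheelbase/2) * tan(7.5 degrees) below the crest, so a car scrapes roughly when wheelbase/clearance exceeds 2/tan(7.5 degrees), which is about 15.2; the small existing bevel nudges the practical cutoff up to the Civic's ~15.7. A rough R check (the Civic numbers are approximate and vary by trim):

    # Breakover check for a sharp crest between a 15-degree driveway and a flat slab.
    scrapes <- function(wheelbase_in, clearance_in, slope_deg = 15) {
      dip <- (wheelbase_in / 2) * tan((slope_deg / 2) * pi / 180)  # undercarriage dip at the crest
      clearance_in < dip
    }

    2 / tan(7.5 * pi / 180)   # critical wheelbase/clearance ratio, about 15.2
    scrapes(106.3, 6.8)       # approximate 2010 Civic: right at the edge (the existing bevel is what saves it)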

There are two kinds of solutions I've considered. One is remaking the driveway to increase the bevel, and the other is to install a rubber bump in the garage to lift the front of the car as the lowest point passes over the peak. 

Given the physical dimensions of the car, the driveway, and the proposed modifications, it's straightforward, but fussy, to do the geometry to calculate the minimum clearance at each point in the driveway as the car comes into the garage. I built a Shiny app to do that, which you can download from GitHub. (Use at your own risk: no warranties are implied.)

 
One thing I learned from tweaking the knobs is that a step-up in the garage needs to be wide to be effective. In the bottom graph above you can see the discontinuity where the front wheel rolls over the bump pictured at the top. In reality this would be sloped, but I left it as a discrete jump so it's more visible in the graph. 
 
I used the Rav4 to test the model against reality with carefully placed markers.

 The photo shows a 2" marker placed about 14" behind the edge of the garage slab. By moving the marker until it just brushes the bottom of the car I can get a sense of how well the model conforms to reality. One thing I learned is that the sloped driveway is curved a bit to make a slight hump--it's not perfectly flat, so the results deviate a bit from the model in favor of slightly more clearance. 

Another calculation was to find where the plane of the flat garage slab and the plane of the inclined driveway meet.

Although the driveway isn't perfectly flat, the inclined plane intersects the horizontal one about two inches from the existing edge, which is due to the existing bevel where the two meet. That intersection point is the center of the bevel in the simulation, so the current bevel is effectively about 2", though the slight hump in the driveway makes it function more like 3". I didn't try to model the hump explicitly, but that would be the next step in making the app more accurate.

All of this is somewhat approximate, and there's no certainty without actually driving a car into the garage, but this analysis has given me a good idea of how much wider our selection of cars can be if we increase the bevel to about 12" and possibly add a 1" ledge (a secured rubber mat, probably) inside.