"How much data do you have?" is an inevitable question for program-level data analysis. For example, assessment reports that attempt understand student learning within an academic major program typically depend on final exams, papers, performance adjudication, or other information drawn from the seniors before they graduate: a reasonable point in time to assess the qualities of the students before they depart. Most accreditors require this kind of activity with a standard addressing the improvement of student learning, for example SACSCOC's 8.2a or HLC's 4b.
The amount of data available for such projects depends on the number of graduating seniors. To get an overall picture of these numbers, I pulled the counts reported to IPEDS for 2017 through 2019 (pre-pandemic) for all four-year (bachelor's degree) programs. Each row of data comes with a disciplinary CIP code, a decimal-system index that describes a hierarchy of subject areas. For example, 27.01 is Mathematics and 27.05 is Statistics, while Psychology majors start with 42.
We have to decide what level of CIP code to count as a "program." The density plot in Figure 1 illustrates all three levels: CIP-2 is the most general, e.g. code 27 includes all of math and statistics and their specializations.
There are a lot of zeros in the IPEDS data, implying that institutions report having a program that produced no graduates in a given year. In my experience, peer reviewers are reasonable about that and will relax the expectation that all programs produce data-driven reports, but your results may vary. For purposes here, I'll assume the Reasonable Reviewer Hypothesis and omit the zeros when calculating statistics like the medians in Figure 1.
Figure 1. IPEDS average number of graduates for four-year programs, 2017-19, counting first and second majors, grouped by CIP code resolution, with medians marked (ignoring size zero programs).
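For readers who want to reproduce that roll-up, here's a minimal pandas sketch of the calculation behind the medians. The column names (unitid, cipcode, grads) and the toy rows are placeholders I've invented for illustration, not the actual IPEDS file layout.

```python
import pandas as pd

# Toy rows standing in for the real IPEDS completions extract (hypothetical layout)
ipeds = pd.DataFrame({
    "unitid":  [1001, 1001, 1001, 1002, 1002],
    "cipcode": ["27.0101", "27.0501", "42.0101", "27.0101", "42.2701"],
    "grads":   [12, 0, 35, 4, 9],   # average graduates per year, 2017-19
})

def program_sizes(df: pd.DataFrame, level: int) -> pd.Series:
    """Sum graduates per institution at a CIP resolution of 2, 4, or 6 digits."""
    width = {2: 2, 4: 5, 6: 7}[level]      # characters to keep, counting the decimal point
    cip = df["cipcode"].str[:width]        # e.g. "27.0501" -> "27", "27.05", or "27.0501"
    return df.groupby([df["unitid"], cip])["grads"].sum()

for level in (2, 4, 6):
    sizes = program_sizes(ipeds, level)
    nonzero = sizes[sizes > 0]             # Reasonable Reviewer Hypothesis: drop zero-size programs
    print(f"CIP-{level} median program size: {nonzero.median()}")
```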
CIP-6 is the most specific code, and is the level usually associated with a major. The Department of Homeland Security maintains a list of CIP-6 codes that are considered STEM majors. For example, 42.0101 (General Psychology) is not STEM, but 42.2701 (Cognitive Psychology and Psycholinguistics) is STEM. The CIP-6 median size is nine graduates, and it's reasonable to expect that institutions identify major programs at this level. To be conservative, though, we might imagine that some institutions can get away with assessment reports for logical groups of programs instead of each one individually. Taking that approach and combining all three CIP levels effectively assumes a range of institutional practices, and it enlarges the sample sizes for assessment reports. Table 1 was calculated under that assumption.
Table 1. Distribution of average program sizes with selected minimums.
| Size | Percent of programs |
|---|---|
| less than 5 | 30% |
| less than 10 | 46% |
| less than 20 | 63% |
| less than 30 | 72% |
| less than 50 | 82% |
| less than 400 | 99% |
Half of programs (under the enlarged definition) have fewer than 12 graduates a year. Because learning assessment data is typically prone to error, a practical rule of thumb for a minimum sample size is N = 400, which begins to permit reliability and validity analysis. Only 1% of programs have enough graduates a year for that.
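Table 1 is just the empirical cumulative distribution of those program sizes evaluated at a few cutoffs. Here's a sketch of the calculation, where `sizes` stands in for the combined vector of nonzero average program sizes (the values shown are made up so the snippet runs):

```python
import numpy as np

sizes = np.array([3, 7, 12, 18, 25, 44, 120, 450])  # placeholder program sizes

for cutoff in (5, 10, 20, 30, 50, 400):
    pct = (sizes < cutoff).mean() * 100              # fraction of programs below the cutoff
    print(f"less than {cutoff}: {pct:.0f}%")

print("median program size:", np.median(sizes))
```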
A typical hedge against small sample sizes is to only look at the data every three years or so. In that case around half the programs would have at least 30 students in their sample, but only if they got data from every graduate, which often isn't the case. Any change coming from the analysis then has a built-in lag of at least four years from the time the first of those students graduated. That's not very responsive, and it would only be worth the trouble if the change had a solid evidentiary basis and was significant enough to have a lasting, meaningful impact on teaching and learning. But a sample of 30 isn't going to be enough for a serious project either.
One solution for the assessment report data problem is to encourage institutions to research student learning more broadly--starting with all undergraduates, say--so that there's a useful amount of data available. The present situation faced by many institutions--reporting by academic program--guarantees that there won't be enough data available to do a serious analysis, even when there's time and expertise available to do so.
The small sample sizes lead to imaginative reports. Here's a sketch of an assessment report I read some years ago. I've made minor modifications to hide the identity.
A four-year history program had graduated five students in the reporting period, and the two faculty members had designed a multiple-choice test as an assessment of the seniors' knowledge of the subject. Only three of the students took the exam. The exam scores indicated that there was a weakness in the history of Eastern civilizations, and the proposed remedy was to hire a third faculty member with a specialty in that area.
This is the kind of thing that gets mass-produced, increasingly assisted by machines, in the name of assuring and improving the quality of higher education. It's not credible, and a big part of the problem is the expected scope of research, as the numbers above demonstrate.
Why Size Matters
The amount of data we need for analysis depends on a number of factors. If we are to take the analytical aspirations of assessment standards seriously, we need to be able to detect significant changes between group averages. This might be two groups at different times, two sections of the same course, or the difference between actual scores and some aspirational benchmark. If we can't reduce the error of estimation to a reasonable amount, such discrimination is out of reach, and we may make decisions based on noise (randomness). Bacon & Stewart (2017) analyzed this situation in the context of business programs. The figure below is taken from their article. I recommend reading the whole piece.
Figure 2. Taken from Bacon & Stewart, showing minimum sample sizes needed to detect a change for various situations (alpha = .10).
Although the authors are talking about business programs, the main idea--called a power analysis--is applicable to assessment reporting generally. The factors included in the plot are the effect size of some change, the data quality (measured by reliability), the number of students assessed each year, and the number of years we wait to accumulate data before analyzing it.
Suppose we've changed the curriculum and want the assessment data to tell us whether it made a difference. If the data quality isn't up to that task, it also isn't good enough to tell us there was a problem that needed fixing in the first place--it's the same measurement method either way. Most effect sizes from program changes are small. The National Center for Education Evaluation has a guide for this. In their database of interventions, the average effect size is .16 (Cohen's d, the number of standard deviations the average measure changes), which is "small" in the chart.
The reliability of some assessment data is high, like a good rubric with trained raters, but that's expensive to produce, so there's a trade-off with sample size. Most assessment data will have a reliability of .5 or less, so the most common scenario is the top line on the graph. In that case, if we graduate 200 students per year and all of them are assessed, it's estimated to take four years to accumulate enough data to detect a typical effect size with reasonable confidence (and since alpha = .1, there's still a 10% chance of concluding there's a difference when there isn't).
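To make that logic concrete, here is a rough power calculation in the same spirit. It is not an attempt to reproduce Bacon & Stewart's exact curves--the answer depends on design assumptions like one versus two cohorts and the target power, which I've had to pick myself--but it shows the mechanics: measurement error attenuates the observable effect, so an underlying effect of d standard deviations measured with reliability r shows up as roughly d times the square root of r in the scores.

```python
from math import sqrt, ceil
from statsmodels.stats.power import TTestIndPower

true_d      = 0.16                        # typical intervention effect (NCEE average)
reliability = 0.50                        # typical for locally built assessments
observed_d  = true_d * sqrt(reliability)  # classical attenuation: d * sqrt(reliability)

# Students needed in each of two cohorts (e.g. before and after a curriculum
# change) to detect the attenuated effect with 80% power at alpha = .10.
n_per_group = TTestIndPower().solve_power(
    effect_size=observed_d, alpha=0.10, power=0.80, alternative="two-sided"
)
print(ceil(n_per_group))  # on the order of a thousand students per cohort
```

However you set the dials, the required sample is orders of magnitude beyond the dozen graduates a typical program produces.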
With a median program size of 12, you can see that this project is hopeless: there's no way to gather enough data under typical conditions. Because accreditation requirements force the work to proceed, programs have to make decisions based on randomness, or at least pretend to. Or risk a demerit from the peer reviewer for a lack of "continuous improvement."
Consequences
The pretense of measuring learning in statistically impossible cases is a malady that afflicts most academic programs in the US because of the way accreditation standards are interpreted by peer reviewers. This varies, of course, and you may be lucky enough that this doesn't apply. But for most programs, the options are few. One is to cynically play along, gather some "data" and "find a problem" and "solve the problem." Since peer reviewers don't care about data quantity or quality (else the whole thing falls apart), it's just a matter of writing stuff down. Nowadays, ChatGPT can help with that.
Another approach is to take the work seriously and just work around the bad data by relying on subjective judgment instead. After all, the accumulated knowledge of the teaching faculty is way more actionable than the official "measures" that ostensibly must be used. The fact that it's really a consensus-based approach instead of a science project must be concealed in the report, because the standards are adjudicated on the qualities of the system, not the results. And the main requirement of this "culture of assessment" is that it relies on data, no matter how useless it is. In that sense, it's faith-based.
You may occasionally be in the position of having enough data, time, and expertise to do real research on it. Unfortunately, there's no guarantee that this will lead to improvements (a significant fraction of the NCEE samples have a negative effect), but you may eventually develop a general model of learning that can help students in all programs. Note that good research can reduce the sample size needed by attributing score variance to factors other than the intervention, e.g. student GPA prior to the intervention. This requires a regression modeling approach that I rarely see in assessment reports, which is a lost opportunity.
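Here's a small simulated illustration of that last point. All the numbers are invented: prior GPA explains much of the score variance, so including it as a covariate shrinks the standard error of the estimated program effect, which buys the same precision with fewer students.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 400                                    # hypothetical pooled sample
gpa   = rng.normal(3.0, 0.5, n)            # prior GPA (covariate)
group = rng.integers(0, 2, n)              # 0 = old curriculum, 1 = new
score = 0.8 * gpa + 0.16 * group + rng.normal(0, 0.5, n)  # small true effect

naive    = sm.OLS(score, sm.add_constant(np.column_stack([group]))).fit()
adjusted = sm.OLS(score, sm.add_constant(np.column_stack([group, gpa]))).fit()

print("SE of program effect, scores only:    ", round(naive.bse[1], 3))
print("SE of program effect, with prior GPA: ", round(adjusted.bse[1], 3))
```

With these made-up numbers, the unadjusted analysis would need roughly 60% more students to match the adjusted one's precision, which is exactly the kind of leverage small programs need.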
References
Bacon, D. R., & Stewart, K. A. (2017). Why assessment will never work at many business schools: A call for better utilization of pedagogical research. Journal of Management Education, 41(2), 181-200.