Friday, November 19, 2021

Learning Assessment: Choosing Goals

Introduction

I recently received a copy of a new book on assessment, one that was highlighted at this year's big assessment conference:

Fulcher, K. H., & Prendergast, C. (2021). Improving Student Learning at Scale: A How-to Guide for Higher Education. Stylus Publishing, LLC.
 
This is not a review of that book; I just want to highlight an important idea I came across on pages 60-63 of the paperback edition. The authors contrast two methods, described as deductive or inductive, for selecting the learning goals that form the basis for program reporting and (ideally) improvement. Here's how they name the two methods, along with my suggested associations (objective, subjective) in parentheses:
  • Deductive (Objective): "A learning area is targeted for improvement because assessment data indicate that students are not performing as expected." (p. 60).

  • Inductive (Subjective): "[P]otential areas for improvement are identified through the experiences of the faculty (or the students)." (p. 60).

Although it's not in the book, I suggested the objective/subjective associations because objectivity is often seen as superior to subjectivity in measurement and decision-making. See, for example, Peter Ewell's NILOA Occasional Paper "Assessment, Accountability, and Improvement: Revisiting the Tension."

In the early days of the assessment movement, campus assessment practices were consciously separated from what went on in the classroom. This separation helped increase the credibility of the generated evidence because, as “objective” data-gathering approaches, these assessments were free from contamination by the subject they were examining. (p. 19)
The desire to be free from human bias is related to assessment's roots as a management method--the same family tree as Six Sigma, which helped Jack Welch's General Electric build refrigerators more efficiently. It's related to the positivist movement's emphasis on definitions and strict classifications, and further back to the scientific revolution. However, the history of human beings "objectively measuring" one another includes dark and tragic episodes, as recounted in part in the 2020 president's address to the National Council on Measurement in Education (NCME). From the abstract:

Reasons for distrust of educational measurement include hypocritical practices that conflict with our professional standards, a biased and selected presentation of the history of testing, and inattention to social problems associated with educational measurement.
While the methods of science may be touted as objective, the uses of those methods are never unbiased as long as humans are involved.
 
A subtler critique of objective/deductive decision-making comes from artificial intelligence research. Deduction requires rules to follow, whereas induction depends on educated guesses. For a fascinating demonstration of these two approaches, see Anna Rudolf's analysis of two chess-playing programs. The contest pitted an objective/deductive engine (Stockfish), which used rules and point values to evaluate positions, against an inductive/subjective program (DeepMind's AlphaZero), which was given only the rules of the game and no human strategic knowledge. It played by sense of smell. And won. The video highlights AlphaZero's "inductive leaps": moves that look ill-advised by the usual rule-based understanding of play but turn out to be a winning strategy.

So putative objectivity can be challenged in at least two ways, viz. that human involvement means it's not really objective, and that objective methods aren't necessarily better than subjective ones in solving problems beyond a certain complexity level. 

In their book, Fulcher and Prendergast describe another compelling reason to consider an inductive/subjective approach: practicality.

Learning Goals

Accreditation requirements for assessment reporting insist that academic programs declare their learning goals up front. Measurement and improvement stem from these definitions, which is why some assessment advocates philosophize about verbs. My thesis here is that programs could do more productive work if accreditors didn't force them to pre-declare goals as a prerequisite for an acceptable report. I am not claiming that it is a bad idea to think about learning objectives and describe them in advance; rather that if we only take that approach and ignore the rich subjective experience of faculty outside those lines, we unnecessarily diminish our understanding and ability to act. Accreditation rules are too restrictive.

For a faculty member, however, this pre-declaration of goals in approved language is where cognitive dissonance begins. The curriculum's aims are already amply described in course descriptions, syllabi, and textbooks. A table of contents in an introductory text will list dozens of topics, many worth at least one lecture to cover. In some disciplines, like math, these topics are associated with problem sets (i.e., detailed assessments). The pedagogy and assessments in these textbooks have evolved with experience; we don't teach calculus anything like what's found in Newton's Principia, and for good reason. Teachers develop subjective judgments about what material is likely to be difficult for students, and learn tricks to help them through it. This is an "AlphaZero" method: inductive, depending on human neural nets to assemble and weigh data rather than relying on preset rules like "if average student scores are below 70%, I will take action."

In a 2021 AALHE Intersection paper, I counted learning goals in a few textbooks and estimated that there are at least 500 for a typical program. A first calculus course contains learning goals like "computing limits with L'Hôpital's rule" and "differentiating polynomials." By contrast, a typical assessment report for accreditation asks a program to identify around five learning goals. That's a factor-of-100 difference, and we should forgive a faculty member for incredulity at this point: can five categories meaningfully cover those 500 goals? Even in Six Sigma applications, imagine how many measurements it takes to specify a refrigerator. It's a lot more than five. And each of those measures, like bolt size and thread width, is specific to a particular state of assembly, where it can be tested--just like usual classroom assessments via assignments and grading. You don't want to build a whole "cohort" of refrigerators and then find out that the screws holding them together are all the wrong size.

This cognitive dissonance between formal systems and common sense is familiar. Anytime you've filled out a form and found that the checkboxes and fill-ins don't make sense in your case, that's a collision between the deductive/objective and inductive/subjective worlds. We live primarily in the latter, reasoning without decision rules. We follow the rules of the road--which lane to drive in, which signals and signs to obey--but the rules don't tell us where to go or why. The most significant decisions in life are subjective/inductive.

Examples

Some of the following examples are my interpretations of selected assessment work drawn from the literature, conferences, and practice.

Interview Skills

In a 2018 RPA article, Keston Fulcher (co-author of the book cited above) and colleagues documented a program improvement in Computer Information Systems.

Lending, D., Fulcher, K. H., Ezell, J. D., May, J. L., & Dillon, T. W. (2018). Example of a program-level learning improvement report. Research & Practice in Assessment, 13, 34-50. [link]

The same story is told less formally in a collection of assessment stories here. It's short and worth reading, but here's the apposite part--a faculty member's realization based on recent direct observation of student work:

We realized that nowhere in our curriculum did we actively teach the skill of interviewing. Nor did we assess our students’ ability to perform an effective [requirements elicitation] interview, apart from two embedded exam questions in their final semester.

So although there was an official learning outcome intended to cover interview skills, it was assessed in a way that hadn't revealed that students weren't doing it well. This illustrates the flaw in assuming that a few general goals can adequately cover a curriculum in detail. But faculty are immersed in those details, and good teachers will notice and act on such information. So in this case the deductive machinery was all in place, including an assessment mechanism, but it failed where the inductive method worked.

An important point illustrated here, and mentioned in the book, is that having faculty enthusiasm for a project is conducive to making changes.

Discussion-Based Assessment

At the 2021 online assessment meeting put on by the University of Florida (Tim Brophy and colleagues), I saw Will Miller's presentation. From the abstract (my underlining):

Jacksonville University decided to design and implement a discussion-based approach to assessment for the 2020-2021 academic year. This decision—which was made to minimize the bureaucratic feeling of sending out templates and providing deadlines—has led to enhanced faculty and staff understanding of assessment without the stress of prolonged back and forth reviews and evaluations. [...]

[T]he discussion-based approach has led to expressions of cathartic relief. Faculty and staff in this model are provided with immediate feedback, an opportunity to reflect on their accomplishments in this unparalleled time in higher education, and to hear affirmations of value and contribution. Ultimately, we have been able to collect information that is both deeper and wider than in previous assessment cycles while gaining countless insights into the efforts of the campus community to ensure success in the face of adversity. 

An example I remember from the talk is that the Dance faculty faced special challenges teaching classes virtually, as one might imagine they would. They noticed that students were getting injured at alarming rates when practicing on their own. So they addressed the problem. This didn't happen because they had pre-declared a learning outcome about dance injury, and then created benchmark measures and so on, but because faculty were paying attention.

Notice the indications in the abstract (which I underlined) that the inductive/subjective approach provides more timely and useful information and is more natural than the formal goal-setting. 

Teaching Statistics

In Spring 2021 I taught an introductory statistics class to working adults as an online-only section. It was my first completely online (synchronous) class, and although I'd taught the material before, I had to rethink everything for the new format. 

A key concept in statistics is that estimates like averages and proportions are associated with error, and estimating that error is an important learning goal. Typically error bounds like confidence intervals are obtained using a look-up table of values, often found in a textbook's appendix. You can see one here, and a portion is reproduced below.

[Image: excerpt from a table of t critical values]
It's hard enough to teach students to use these tables and the rules that go with them, even when I can walk around the room and point to the relevant sections. It seemed like a poor use of our online time to try to replicate that. Instead, I built an app that students could use to find the critical values by moving sliders around to represent the problem parameters.
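
The app itself isn't reproduced in this post, but purely as an illustration, here is a minimal sketch of what such a slider-based explorer might look like, using ipywidgets, scipy, and matplotlib in a notebook; the parameter ranges and labels are my assumptions, not the actual app.

```python
# Hypothetical sketch of a slider-based critical-value explorer (not the original app).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from ipywidgets import interact, IntSlider, FloatSlider

def explore(df=10, t_cut=2.0):
    # Plot the t distribution and shade the central region |t| <= t_cut
    x = np.linspace(-4, 4, 400)
    y = stats.t.pdf(x, df)
    plt.figure(figsize=(6, 3))
    plt.plot(x, y)
    plt.fill_between(x, y, where=np.abs(x) <= t_cut, alpha=0.3)
    captured = stats.t.cdf(t_cut, df) - stats.t.cdf(-t_cut, df)
    plt.title(f"df = {df}: |t| <= {t_cut:.2f} captures {captured:.1%} of the area")
    plt.show()

interact(explore,
         df=IntSlider(min=1, max=60, value=10, description="deg. freedom"),
         t_cut=FloatSlider(min=0.5, max=4.0, step=0.05, value=2.0, description="critical t"))
```

The idea is that students move the sliders until the shaded area matches the desired confidence level (say, 95%) and then read the critical value off the slider--trial and error instead of a lookup rule.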

The results of this switch were immediately gratifying. Students grasped the concepts more quickly, got the answer correct more often (in my subjective judgment), and some of them even developed an intuition for what the critical value would be without looking it up. I was so happy with the results that I shared the app with others at the university. 
 
At the end of a course, instructors are invited to submit assessments of students.  There was more than one preset learning goal that might apply in this case. I've abbreviated the descriptions to focus on the topic of confidence intervals, p-values, and t-scores.
  •  Analytical reasoning. [...] explain the assumptions of a model, express a model mathematically, illustrate a model graphically, [...], and appropriately select from among different models. 
  • Quantitative methods. [...] articulate a testable hypothesis, use appropriate quantitative methods to empirically test a hypothesis, interpret the statistical significance of test statistics and estimates [...] 
  • Formal reasoning. (General education) [...] mastery of rigorous techniques of formal reasoning. [...] the mathematical interpretation of ideas and phenomena; [...] the symbolic representation of quantification [...]
Even without formal assessments, like comparing before-and-after averages, it was clear that students had learned more about this topic, but those gains could easily be counted in all of the learning goals bulleted above. There is no chance that someone analyzing the summative data from those assessments could work backwards to figure out what was going on in detail. Conversely, if those summative scores were lacking, there would be no way to track it down to the issue of finding critical values.

The inability of general learning assessments to tell us much that is useful about detailed work is why we see so many assessment reports that propose to "add more critical thinking exercises to the syllabus" or something equally anodyne, just to get the peer reviewer to check the box.

Student Success

There are many examples of deductive/objective assessments that do work. The formal approach works exactly when statistical and measurement theory say it will: with sufficient samples of reliable data, analytical expertise, and a reasonable hypothesis about cause and effect.

Good research in student learning falls in this category, but it is difficult to do, and not likely to occur in program-level accreditation reports. Examples using grades, retention, graduation rates, license exam pass rates, and so on, are more common. One example is Georgia State University's success over time with graduation rates. 

Unfortunately, the most voluminous and reliable data available to understand learning is course grades, which are generally banned by accreditors as a primary data source. So the most fruitful lines of inquiry for deductive/objective methods are denied.

Discussion

In the stats example, students learned better when I switched from a rules-based "look up the value in a table" method to one where they had to interact with the probability distribution, monitor changes in shapes and values, and home in on the parameters needed--an inductive/subjective method. It was subjective because the app had limited precision, so students had to decide when a value was "good enough" for the application. It was inductive because finding the critical value was done by trial and error, moving sliders around.
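
For contrast, the deductive counterpart is a single rule-following computation--the same rule a printed table encodes. A minimal sketch with scipy, using a hypothetical sample size and confidence level:

```python
# Deductive version: compute the two-tailed t critical value directly.
from scipy import stats

confidence = 0.95
n = 25                       # hypothetical sample size
alpha = 1 - confidence

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(f"{confidence:.0%} two-tailed critical value, df = {n - 1}: {t_crit:.3f}")  # about 2.064
```

The answer is exact and rule-driven; the slider version trades that precision for a feel of how the distribution, the degrees of freedom, and the confidence level interact.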

Teaching students rules-as-knowledge has limited benefits even in math. It's better to build intuition and understanding of interacting components. This is just as true for an academic program trying to understand student learning. Intuition, experience, professional judgment, and the great volume of observation that informs these cannot be dismissed as "merely subjective" without losing most of the actionable information available.

This is not to say that pre-declaring that we want our students to be able to write at a college level is a useless activity; it's just that such formulas fail to account for 99% of what happens in education, or else cover it at such dilution that it can't be used for assessment of the curriculum in a meaningful way. This conflation of declaring goals with assessing them was bemoaned by Peter Ewell in a 2016 article: "One place where the SLO movement did go off the rails, though, was allowing SLOs to be so closely identified with assessment." (Here SLO = student learning outcome.)

Defenders of learning goal statements as the foundation for assessment seem to dismiss most of what happens in college classrooms as mere "objectives" or "stuff on the syllabus," without recognizing that such material comprises the bulk of a college education. The only justification is that accreditors require it (proof by we-say-so), continuing the doom loop of peer review, consultants, and software contracts in a largely vain attempt at relevance. Ewell, in the same article, identifies the SLO-assessment problem as central to these accreditation issues (third paragraph from the end).

An example of a practical learning goals statement for a physics program can be found on UC Berkeley's site, where program goals are divided into knowledge, skills, and attitudes. This is a nice overview that is suitable to inform students and help faculty reach consensus about curriculum and student outcomes after graduation. Such "ground rules" for the program can also help the faculty identify what should change. That might be related to student learning as perceived by faculty, but it could also be in response to changes in the physics discipline or the job market.

Such general goals as "broad knowledge of classical mechanics" have limited use in informing data-gathering activities for the reasons already mentioned: most learning goals are very specific and best dealt with in the context of courses. An exception would be the goal of cultivating particular behaviors over time, like "successfully pursue career objectives." That requires a new kind of data collection: (1) when are students aware of their career options? (2) when do they start taking actions, like seeking a mentor, selecting graduate programs, or asking for letters? (3) how are these behaviors associated with the social capital students arrive with (e.g., are first-generation students less likely to engage with careers early)? It's a significant amount of work to take such a project seriously.
 
One type of longitudinal study is quite easy: when courses have prerequisites, we can compare grades in the first class to grades in the second. If the second mechanics course shows low grades despite the same students having earned high grades in the first course, there might be a problem. As noted, such analysis is off limits for assessment reports because of accreditors' predisposition against grades as data.
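
A minimal sketch of that comparison, with a hypothetical grade table and placeholder course names (not real data):

```python
# Compare grades in a prerequisite course to grades in the follow-on course.
import pandas as pd

grades = pd.DataFrame({
    "student_id":   [1, 2, 3, 1, 2, 3],
    "course":       ["PHYS101", "PHYS101", "PHYS101", "PHYS201", "PHYS201", "PHYS201"],
    "grade_points": [3.7, 3.3, 4.0, 2.0, 2.3, 2.7],   # 4-point scale
})

# One row per student, one column per course; keep students who completed both.
paired = grades.pivot(index="student_id", columns="course", values="grade_points").dropna()

print(paired.mean())                                       # average grade in each course
print((paired["PHYS201"] - paired["PHYS101"]).describe())  # change from prerequisite to follow-on
```

A large, consistent drop from the first course to the second for the same students is the kind of signal worth investigating.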

Conclusions

The analysis here, stemming from Fulcher & Prendergast's observations, is compelling because it explains facts that are otherwise puzzling. Why is it that after all these years, the assessment reporting process doesn't seem to have produced much, and continues to be locked in a perpetual start-over mode? Part of the explanation is surely that pre-specifying learning goals and then attempting to measure performance prior to any action is not practical in most cases.

If assessment offices focused on helping teachers improve their attention to specific learning goals and their in-class assessments, those teachers would take away new habits that benefit students across many learning goals. This is more efficient than the one-at-a-time rules-based approach, and--as we've seen--more likely to result in meaningful change.
 
 
