Friday, November 19, 2021

Learning Assessment: Choosing Goals

Introduction

I recently received a copy of a new book on assessment, one that was highlighted at this year's big assessment conference:

Fulcher, K. H., & Prendergast, C. (2021). Improving Student Learning at Scale: A How-to Guide for Higher Education. Stylus Publishing, LLC.
 
This is not a review of that book; I just want to highlight an important idea I came across, from pages 60-63 of the paperback edition. The authors contrast two methods, described as deductive or inductive, for selecting learning goals that form the basis for program reporting and (ideally) improvement. Here's how they name the two methods, along with my suggested associations (objective, subjective) in italics:
  • Deductive (Objective): "A learning area is targeted for improvement because assessment data indicate that students are not performing as expected." (p. 60).

  • Inductive (Subjective): "[P]otential areas for improvement are identified through the experiences of the faculty (or the students)." (p. 60).

Although it's not in the book, I suggested the associations to objective/subjective because objectivity is often seen as superior to subjectivity in measurement and decision-making. See, for example, Peter Ewell's NILOA Occasional Paper "Assessment, Accountability, and Improvement: Revisiting the Tension."

In the early days of the assessment movement, campus assessment practices were consciously separated from what went on in the classroom. This separation helped increase the credibility of the generated evidence because, as “objective” data-gathering approaches, these assessments were free from contamination by the subject they were examining. (p. 19)
The desire to be free from human bias is related to assessment's roots as a management method--the same family tree as Six-Sigma, which helped Jack Welch's General Electric build refrigerators more efficiently. It's related to the positivist movement's emphasis on definitions and strict classifications, and further back to the scientific revolution. However, the history of human beings "objectively measuring" one another includes dark and tragic episodes, as recounted in part in the 2020 president's address to the National Council on Measurement in Education (NCME). From the abstract:

Reasons for distrust of educational measurement include hypocritical practices that conflict with our professional standards, a biased and selected presentation of the history of testing, and inattention to social problems associated with educational measurement.
While the methods of science may be touted as objective, the uses of those methods are never unbiased as long as humans are involved.
 
A subtler critique of objective/deductive decision-making comes from artificial intelligence research. Deduction requires rules to follow, whereas induction depends on educated guesses. For a fascinating demonstration of these two approaches, see Anna Rudolf's analysis of two chess-playing programs. The contest pitted an objective/deductive algorithm (Stockfish), which used rules and point values to evaluate positions, against an inductive/subjective program (DeepMind's AlphaZero) that was given only the bare rules of the game--no human-crafted evaluation heuristics--and learned entirely from self-play. It played by sense of smell. And won. The video highlights AlphaZero's "inductive leaps": moves that looked ill-advised by the usual rule-based understanding of play but turned out to be part of a winning strategy.

So putative objectivity can be challenged in at least two ways, viz. that human involvement means it's not really objective, and that objective methods aren't necessarily better than subjective ones in solving problems beyond a certain complexity level. 

In their book, Fulcher and Prendergast describe another compelling reason to consider an inductive/subjective approach: practicality.

Learning Goals

Accreditation requirements for assessment reporting insist that academic programs declare their learning goals up front. Measurement and improvement stem from these definitions, which is why some assessment advocates philosophize about verbs. My thesis here is that programs could do more productive work if the accreditors didn't force them to pre-declare the goals as a prerequisite to an acceptable report. I am not claiming that it is a bad idea to think about learning objectives and describe them in advance; rather that if we only take that approach and ignore the rich subjective experience of faculty outside those lines, we unnecessarily diminish our understanding and ability to act. Accreditation rules are too restrictive.

For a faculty member, however, this pre-declaration of goals in approved language is where cognitive dissonance begins. The curriculum's aims are already amply described in course descriptions, syllabi, and textbooks. A table of contents in an introductory text will list dozens of topics, many worth at least one lecture to cover. In some disciplines, like math, these topics are associated with problem sets (i.e. detailed assessments). The pedagogy and assessments in these textbooks have evolved with experience; we don't teach calculus anything like what's found in Newton's Principia, and for good reason. Teachers develop subjective judgments about what material is likely to be difficult for students, and learn tricks to help them through it. This is an "AlphaZero" method: inductive, depending on human neural nets to assemble and weigh data rather than relying on preset rules like "if average student scores are below 70%, I will take action."

In a 2021 AALHE Intersection paper, I counted learning goals in a few textbooks and estimated that there are at least 500 for a typical program. A first calculus course contains learning goals like "computing limits with L'Hôpital's rule" and "differentiating polynomials." By contrast, a typical assessment report for accreditation asks a program to identify around five learning goals. That's a factor of 100 difference, and we should forgive a faculty member for incredulity at this point: can five categories meaningfully cover those 500 goals? Even in Six-Sigma applications, imagine how many measurements it takes to specify a refrigerator. It's a lot more than five. And each of those measures, like bolt size and thread width, is specific to a particular stage of assembly, where it can be tested--just like the usual classroom assessments via assignments and grading. You don't want to build a whole "cohort" of refrigerators and then find out that the screws holding them together are all the wrong size.

This cognitive dissonance between formal systems and common sense is familiar. Anytime you've filled out a form and found that the checkboxes and fill-ins don't make sense in your case, that's a collision between the deductive/objective and inductive/subjective worlds. We live primarily in the latter, reasoning without decision rules. We follow the rules of the road--which lane to drive in, which signals and signs to obey--but the rules don't tell us where to go or why. The most significant decisions in life are subjective/inductive.

Examples

Some of the following examples are my interpretations of selected assessment work drawn from the literature, conferences, and practice.

Interview Skills

In a 2018 RPA article, Keston Fulcher (co-author of the book cited above) and colleagues documented a program improvement in Computer Information Systems.

Lending, D., Fulcher, K. H., Ezell, J. D., May, J. L., & Dillon, T. W. (2018). Example of a program-level learning improvement report. Research & Practice in Assessment, 13, 34-50 [link]

The same story is told less formally in a collection of assessment stories here. It's short and worth reading, but here's the apposite part--a faculty member's realization based on recent direct observation of student work:

We realized that nowhere in our curriculum did we actively teach the skill of interviewing. Nor did we assess our students’ ability to perform an effective [requirements elicitation] interview, apart from two embedded exam questions in their final semester.

So although there was an official learning outcome intended to cover this important interview skill, it was assessed in a way that never surfaced the important information that students weren't doing it well. This illustrates the flaw in assuming that a few general goals can adequately cover a curriculum in detail. But faculty are immersed in these details, and good teachers will notice and act on such information. So in this case, the deductive machinery was all in place, including an assessment mechanism, but it failed where the inductive method worked.

An important point illustrated here, and mentioned in the book, is that having faculty enthusiasm for a project is conducive to making changes.

Discussion-Based Assessment

At the 2021 online assessment meeting put on by the University of Florida (Tim Brophy and colleagues), I saw Will Miller's presentation. From the abstract (my underlining):

Jacksonville University decided to design and implement a discussion-based approach to assessment for the 2020-2021 academic year. This decision—which was made to minimize the bureaucratic feeling of sending out templates and providing deadlines—has led to enhanced faculty and staff understanding of assessment without the stress of prolonged back and forth reviews and evaluations. [...]

[T]he discussion-based approach has led to expressions of cathartic relief. Faculty and staff in this model are provided with immediate feedback, an opportunity to reflect on their accomplishments in this unparalleled time in higher education, and to hear affirmations of value and contribution. Ultimately, we have been able to collect information that is both deeper and wider than in previous assessment cycles while gaining countless insights into the efforts of the campus community to ensure success in the face of adversity. 

An example I remember from the talk is that the Dance faculty faced special challenges teaching classes virtually, as one might imagine they would. They noticed that students were getting injured at alarming rates when practicing on their own. So they addressed the problem. This didn't happen because they had pre-declared a learning outcome about dance injury, and then created benchmark measures and so on, but because faculty were paying attention.

Notice the indications in the abstract (which I underlined) that the inductive/subjective approach provides more timely and useful information and is more natural than the formal goal-setting. 

Teaching Statistics

In Spring 2021 I taught an introductory statistics class to working adults as an online-only section. It was my first completely online (synchronous) class, and although I'd taught the material before, I had to rethink everything for the new format. 

A key concept in statistics is that estimates like averages and proportions are associated with error, and estimating that error is an important learning goal. Typically error bounds like confidence intervals are obtained using a look-up table of values, often found in a textbook's appendix. You can see one here, and a portion is reproduced below.
 
It's hard enough to teach students to use these tables and the rules that go with them, even when I can walk around the room and point to the relevant sections. It seemed like a poor use of our online time to try to replicate that. Instead, I built an app that students could use to find the critical values by moving sliders around to represent the problem parameters.
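The app itself isn't reproduced here, but the calculation behind a row of a t-table is standard, and a minimal R sketch (using base R's qt() and made-up sample data) shows what the sliders were standing in for:

# Critical t value for a two-sided 95% confidence interval with n = 25 observations.
# A printed look-up table indexes this by degrees of freedom (n - 1) and confidence level.
n      <- 25
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 1)   # about 2.06, versus 1.96 for the normal approximation

# The confidence interval for the mean then comes from the sample statistics.
x <- rnorm(n, mean = 100, sd = 15)        # made-up data for illustration
mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)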

The results of this switch were immediately gratifying. Students grasped the concepts more quickly, got the answer correct more often (in my subjective judgment), and some of them even developed an intuition for what the critical value would be without looking it up. I was so happy with the results that I shared the app with others at the university. 
 
At the end of a course, instructors are invited to submit assessments of students.  There was more than one preset learning goal that might apply in this case. I've abbreviated the descriptions to focus on the topic of confidence intervals, p-values, and t-scores.
  •  Analytical reasoning. [...] explain the assumptions of a model, express a model mathematically, illustrate a model graphically, [...], and appropriately select from among different models. 
  • Quantitative methods. [...] articulate a testable hypothesis, use appropriate quantitative methods to empirically test a hypothesis, interpret the statistical significance of test statistics and estimates [...] 
  • Formal reasoning. (General education) [...]  mastery of rigorous techniques of formal reasoning. [...] the mathematical interpretation of ideas and phenomena; [...] the symbolic representation of quantification [...]
Even without formal assessments, like comparing before-and-after averages, it was clear that students had learned more about this topic, but those gains could plausibly be credited to any of the learning goals bulleted above. There is no chance that someone analyzing the summative data from those assessments could work backwards to figure out what was going on in detail. Conversely, if those summative scores were lacking, there would be no way to track the problem down to the issue of finding critical values.

The inability of general learning assessments to tell us much that is useful about detailed work is why we see so many assessment reports that promise to "add more critical thinking exercises to the syllabus" or something equally anodyne, just to get the peer reviewer to check the box.

Student Success

There are many examples of deductive/objective assessments that do work. The formal approach works exactly when statistical and measurement theory say it will work, with sufficient samples of reliable data, analytical expertise, and where we can reasonably hypothesize cause-and-effect. 

Good research in student learning falls in this category, but it is difficult to do, and not likely to occur in program-level accreditation reports. Examples using grades, retention, graduation rates, license exam pass rates, and so on, are more common. One example is Georgia State University's success over time with graduation rates. 

Unfortunately, the most voluminous and reliable data available for understanding learning are course grades, which are generally banned by accreditors as a primary data source. So the most fruitful lines of inquiry for deductive/objective methods are denied.

Discussion

In the stats example, students learned better when I switched from a rules-based "look up the value on a table" method to one where they had to interact with the probability distribution, monitor changes in shapes and values, and narrow in on the parameters needed--an inductive/subjective method. It was subjective because the app has limited precision, so students had to decide when a value was "good enough" for the application. It was inductive because finding the critical value was done by trial and error, moving sliders around.

Teaching students rules-as-knowledge has limited benefits even in math. It's better to build intuition and understanding of interacting components. This is just as true for an academic program trying to understand student learning. Intuition, experience, professional judgment, and the great volume of observation that informs these cannot be dismissed as "merely subjective" without losing most of the actionable information available.

This is not to say that pre-declaring that we want our students to be able to write at a college level is a useless activity; it's just that such formulas fail to account for 99% of what happens in education, or else cover it at such dilution that it can't be used for assessment of the curriculum in a meaningful way. This conflation of declaring goals with assessing them was bemoaned by Peter Ewell in a 2016 article: "One place where the SLO movement did go off the rails, though, was allowing SLOs to be so closely identified with assessment." (Here SLO = student learning outcome.)

Defenders of learning goal statements as the foundation for assessment seem to dismiss most of what happens in college classrooms as mere "objectives" or "stuff on the syllabus," without recognizing that such material comprises the bulk of a college education. The only justification is that accreditors require it (proof by we-say-so), continuing the doom-loop of peer review, consultants, and software contracts in a largely vain attempt at relevance. Ewell, in the same article, identifies the SLO-assessment problem as central to these accreditation issues (third paragraph from the end).

An example of a practical learning goals statement for a physics program can be found at UC Berkeley's site, where program goals are divided into knowledge, skills, and attitudes. This is a nice overview that is suitable to inform students and help faculty reach consensus about curriculum and student outcomes after graduation. Such "ground rules" for the program can also help the faculty identify what should change. This might be related to student learning as perceived by faculty, but it could also be in response to changes in the physics discipline or the job market.

Such general goals as "broad knowledge of classical mechanics" have limited use in informing data-gathering activities for the reasons already mentioned: most learning goals are very specific and best dealt with in the context of courses. An exception would be the goal of cultivating particular behaviors over time, like "successfully pursue career objectives." That requires a new kind of data collection: (1) when do students become aware of their career options? (2) when do they start taking actions, like seeking a mentor, selecting graduate programs, or asking for letters? (3) how are these behaviors associated with the social capital students arrive with (e.g. are first-generation students less likely to engage with careers early)? It's a significant amount of work to take such a project seriously.
 
One type of longitudinal study is quite easy: when courses have prerequisites we can compare the grades in the first class to the grades in the second. If the second mechanics course shows low grades despite the same students having high grades in the first course, there might be a problem. As noted, such analysis is off limits for assessment reports because of accreditors' predispositions against grades as data.
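As a sketch of how easy that comparison can be, suppose we have a long-format grade table called grades with hypothetical columns StudentID, Course, and GradePoints (all names invented for illustration):

library(dplyr)
library(tidyr)

# Keep the prerequisite and the follow-on course (made-up course codes),
# then put each student's two grades side by side.
prereq_check <- grades %>%
  filter(Course %in% c("PHY-101", "PHY-201")) %>%
  group_by(StudentID, Course) %>%
  summarize(GradePoints = mean(GradePoints), .groups = "drop") %>%
  pivot_wider(names_from = Course, values_from = GradePoints) %>%
  filter(!is.na(`PHY-101`), !is.na(`PHY-201`))

# A large average drop-off or a weak correlation flags a possible problem.
with(prereq_check, c(mean(`PHY-201` - `PHY-101`), cor(`PHY-101`, `PHY-201`)))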

Conclusions

The analysis here, stemming from Fulcher & Prendergast's observations, is compelling because it explains facts that are otherwise puzzling. Why is it that after all these years, the assessment reporting process doesn't seem to have produced much, and continues to be locked in a perpetual start-over mode? Part of the explanation is surely that pre-specifying learning goals and then attempting to measure performance prior to any action is not practical in most cases.

If assessment offices focused on helping teachers improve their attention to specific learning goals and their in-class assessments, those teachers would take away new habits that benefit students across many learning goals. This is more efficient than the one-at-a-time rules-based approach, and--as we've seen--more likely to result in meaningful change.
 
 

Sunday, November 14, 2021

Time to Graduation

How long does it take to complete a bachelor's degree? Furman's four-year graduation rate runs around 75% and the six-year rate is about 81%, so take a guess what the average time to graduation is.

The answer is an average of 3.8 years from start to graduation, a rate that has been steady for years.  Surprised? I was, because the math doesn't seem to add up. Wouldn't it have to be more than four years?

Here's the code I used to calculate time to graduation.

library(dplyr)
library(lubridate)

grads <- grads %>%
  mutate(GradDate  = ymd(GradDate),                                 # parse graduation date
         StartDate = ymd(paste0(Cohort, "/8/20")),                  # approximate start: Aug 20 of cohort year
         Time      = as.numeric((GradDate - StartDate) / 365.25))   # elapsed time in years

This relies on the lubridate library to convert date strings in year-month-day form (like "2021-05-07") into a date format and to perform the difference calculation in the last line. Subtracting two dates gives the difference in days, which I divided by 365.25 to get years. I approximated the actual start dates by August 20 of the year they enrolled, which won't be off by more than a week (less than 1% of four years).

The first thing to notice is that August 2016 to May 2020 is three months short of four years, so a "four-year graduation rate" is really more like a 3.75-year graduation rate for the majority of our graduates. Additionally, students who take longer than four years typically do not take longer than five, so a six-year rate is more like a 4.75-year rate in practice.

Using just those two facts we could estimate the time to graduation with 

$$ \frac{3.75(.75) + 4.75(.81 - .75)}{.81} = 3.82 $$

But there are a few three-year graduates too, which brings the average down to 3.8. 
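The same back-of-the-envelope arithmetic in R, using the approximate rates quoted above:

# Weighted average: four-year grads take ~3.75 years, five/six-year grads ~4.75 years,
# normalized by the six-year graduation rate.
grad4 <- 0.75
grad6 <- 0.81
(3.75 * grad4 + 4.75 * (grad6 - grad4)) / grad6   # about 3.82 years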

IPEDS has counts of four-, five-, and six-year graduates, and the code in the next section can extract them from the CSV files.

Figure 0. Cumulative graduation rates by quintile of four-year rate.

Just eyeballing the figure shows that lower four-year rates are associated with longer average times to graduation.

Estimating Time to Graduation

As argued in "Most college students don't graduate in four years, so college and the government count six years as 'success'", time to graduation is a big deal because it (1) costs money to attend college, and (2) limits the income of the students. It's hard to start a career before finishing the degree, and those extra years are lost earnings and lost experience in the job market.
 
Pundits are already calling curtains for liberal arts colleges, and part of the problem is cost. The benefits of small classes and broad learning goals are a hard sell when a private college costs ten to twenty thousand dollars more per year. But that doesn't factor in the time to graduation! If it takes a year or two longer at a "cheaper" college, that cost needs to be averaged in to make a fair comparison.

The College Scorecard uses a 200% time to graduation (!), meaning eight years for a bachelor's degree. Fortunately, the IPEDS database has more detailed information, which we can retrieve from the grYYYY tables.

library(readr)
library(dplyr)
library(tidyr)

# GRTYPE codes for bachelor's degrees: 8 = adjusted bachelor's-seeking cohort;
# 13/14/15 = completed in 4 years or less, in 5 years, and in 6 years, respectively.
grad_time <- read_csv("gr2019.csv") %>%
  filter(GRTYPE %in% c(8, 13, 14, 15)) %>%
  select(UNITID,
         GRTYPE,
         N = GRTOTLT) %>%                    # GRTOTLT = total student count for the row
  spread(GRTYPE, N, sep = "_") %>%           # one column per GRTYPE code
  mutate(Grad4    = GRTYPE_13 / GRTYPE_8,
         Grad5    = GRTYPE_14 / GRTYPE_8,
         Grad6    = GRTYPE_15 / GRTYPE_8,
         GradTime = (Grad4*3.75 + Grad5*4.75 + Grad6*5.75)/(Grad4 + Grad5 + Grad6)) %>%
  na.omit()


This code snippet reads a local copy of the 2019 IPEDS graduation file (in reality, my code hits a data warehouse instead and averages over three years), and grabs the information about undergraduates who finish in four years or less, five years, or six years. A GradTime estimate follows from that.
 
Figure 1. Estimated time to graduation by institution predicted by four-year graduation rates.

We can estimate the average time to graduation from the four-year graduation rate. The regression line is shown in the figure. It's a cubic polynomial in Grad4 and has an R-squared of .81.
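My plotting code isn't shown here, but a cubic fit like the one in the figure can be sketched from the grad_time frame built above:

# Cubic polynomial regression of estimated time to graduation on the four-year rate.
fit <- lm(GradTime ~ poly(Grad4, 3), data = grad_time)
summary(fit)$r.squared                       # about .81 on the data I used

# Predicted curve, e.g. for overlaying on a scatterplot.
curve_df <- data.frame(Grad4 = seq(0, 1, by = 0.01))
curve_df$GradTime_hat <- predict(fit, newdata = curve_df)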

The black line can be used as a cost-scaling factor in addition to a risk assessment. The lower the graduation rate, the riskier it is to enroll because of drop-outs, and on average the longer it will take to graduate as well. I called out two colleges for illustration. St. John's College in Annapolis has a 67% four-year grad rate in the IPEDS survey, and according to collegeresults.org, has a net cost of about $30K/year. Youngstown State University, in Ohio, has a net cost on the Scorecard of about $12K/yr, and a four-year graduation rate of about 19%.

Note that the y-axis isn't zeroed, which makes the gap seem larger than it is. The time-to-graduation difference between these two schools represents about an 18% extra cost for YSU graduates due to the extra time to graduation, so a comparative yearly cost of $14K instead of $12K. This doesn't count opportunity cost--the lost earnings from being in a job earlier--so it's a conservative estimate.

Career Lag

The analysis so far is unsatisfactory because it doesn't consider the fates of students who do not graduate. I'll illustrate how this might be done. Students who do not graduate may transfer to another college and finish there, or may drop out. Either way, it sets them behind because of lost time and credits. For the sake of argument, assume that the resulting career lag penalty for dropping out of the first school attended is ten years. A normal lag is 3.75 years--the time a first-time full-time freshman takes to earn a bachelor's degree if all goes as planned. For non-graduates, this hypothetical extra lag is intended to reflect the earnings penalty for not getting the diploma, or having to wait because of transfer mechanics. It's a guess--don't take it too seriously.

The formula for career lag is then

CareerLag = (Grad4*3.75 + Grad5*4.75 + Grad6*5.75 + (1-Grad4-Grad5-Grad6)*nongrad_career_lag) # nongrad_career_lag = 10

We can run the same analysis as before to get the graph below.

Figure 2. Hypothesized career lag by institution predicted by four-year graduation rates.
 

This model has an R-squared of .91; you can see it's pretty linear. Now the gap between the two highlighted schools is more than 30%. We will get different values if we pick different non-grad penalties.

Discussion

I'm not that familiar with the labor economics work in higher ed. Bryan Caplan's book has a nice primer, but I don't recall that much about time to graduation. A quick google turns up several related articles.

My scan of these suggests that the situation is complicated (e.g. are students working while in college, explaining the delay to graduation?), but that taking longer to graduate is associated with lower earnings, and that this may be partly due to signalling to the labor market. Students who take longer to graduate may be seen as less capable.

In 2020, Sara Vanovac and I published a validity analysis of student writing assessments and discovered a Matthew effect wherein students with the highest levels of assessed writing ability were also seen to develop the most quickly. We went looking for other examples of this type of divergence and found one in Raj Chetty et al.'s work on equality of opportunity.

Here's figure 8 from our paper, showing an inter-generational divergence in earnings.

Figure 3. Left: selectivity of the universities children attend is associated with the incomes of their parents. Right: incomes of children who graduate are linked to the selectivity of the institution.

The figure uses the mammoth data set Chetty scoured from tax records to suggest that lower incomes persist despite college opportunity partly because of the effect of college selectivity (which includes the type of students they select). The incomes for those attending non-selective institutions barely exceed parent incomes--for those who graduate. This may be partly due to longer graduation times. The figure doesn't take into account the reality that lower selectivity entails lower graduation rates as well, so the true picture on the right is more divergent than the one shown.

Conclusions

We can add time to graduation to the list of student success measures we track, like retention and graduation rates. I've started using four-year rates for bachelor's programs, instead of the more common six-year rates, as a proxy for institutional quality (outcomes for students).

It may be that delaying graduation is beneficial to a student, for example working to pay for college in a job that benefits their career. There's no way to know without studying the matter at your institutions: what types of students leave before graduation and what types of students take longer to graduate? Are these the same groups? With data from the National Student Clearinghouse we can calculate time to graduation for transfer-outs (and transfer-ins).

Resources

The code I used to make the first two figures is here. You can modify it to read CSV files downloaded from IPEDS if you don't have those tables in a database.

Updates

After the initial post, I went back and added Figure 0. The github code is updated too.

Friday, November 05, 2021

Kuder-Richardson formula 20

My academia.edu account emails me links to articles they think I'll like, and the recommendations are usually pretty good. A couple of weeks ago I came across a paper on the reliability of rubric ratings for critical thinking that way:

Saxton, E., Belanger, S., & Becker, W. (2012). The Critical Thinking Analytic Rubric (CTAR): Investigating intra-rater and inter-rater reliability of a scoring mechanism for critical thinking performance assessments. Assessing Writing, 17(4), 251-270. [link]

Rater agreement is a topic I've been interested in for a while, and the reliability of rubric ratings is important to the credibility of assessment work. I've worked with variance measures like intra-class correlation, and agreement statistics like the Fleiss kappa, but I don't recall seeing Cronbach's alpha used as a rater agreement statistic before. It usually comes up in assessing test items or survey components. 

Here's the original reference.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334. [link]

In reading this paper, it's apparent that Cronbach was attempting to resolve a thorny problem in testing theory: how to estimate the reliability of a standardized test. Intuitively, reliability is the tendency for the test results to remain constant when we change unimportant details, like swapping out items for similar ones. More formally, reliability is intended to estimate how good the test is at measuring a test-taker's "true score."

The alpha coefficient is a generalization of a previous statistic called the Kuder-Richardson formula 20, which Cronbach notes is a mouthful and will never catch on. He was right!

Variance and Covariance

The alpha statistic is a variance measure, specifically a ratio of covariance to variance. It's a cute idea. Imagine that we have several test items concerning German noun declension (the endings of words based on gender, case, and number). Which is correct:
  1. Das Auto ist rot.
  2. Der Auto ist rot.
  3. Den Auto ist rot.

 And so on. If we believe that there is a cohesive body of knowledge comprising German noun declension (some memorization and some rules), then we might imagine that the several test items on this subject might tell us about a test-taker's general ability. But if so, we would expect some consistency in the correct-response patterns. A knowledgeable test taker is likely to get them all correct, for example. On the other hand, for a set of questions that mixed grammar with math and Russian history, we would probably not assume that such correlations exist. 

As a simple example, imagine that we have three test items we believe should tap into the same learned ability, and the items are scaled so that each of them has variance one. Then the covariance matrix is the same as the correlation matrix:

$$  \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{bmatrix} $$

The alpha statistic is based on the ratio of the sum of the off-diagonal covariances to the sum of the whole matrix, asking how much of the total variance is covariance. The higher that ratio is, then--to a point--the more confidence we have that the items are in the same domain. Note that if the items are completely unrelated (correlations all zero), we'd have a ratio of zero, an indication of no inter-item reliability.

Scaling

One immediate problem with this idea is that it depends on the number of items. Suppose that instead of three items we have \(n\), so the matrix is \(n \times n\), comprising \(n\) diagonal elements and \(n^2 - n \) covariances. Suppose that these are all one. Then the ratio of the sum of covariance to the total is 

$$ \frac{n^2 - n}{n^2} = \frac{n - 1}{n}. $$

Therefore, if we want to keep the scale of the alpha statistic between zero and one, we have to scale by \( \frac{n}{n-1} \). Awkward. It suggests that the statistic isn't just telling us about item consistency, but also about how many items we have. In fact, we can increase alpha just by adding more items. Suppose all the item correlations are .7, still assuming variances are one. Then 

$$ \alpha = \frac{n}{n - 1} \frac{.7 (n^2 - n)}{.7 (n^2 - n) + n } = \frac{.7n}{.7n + .3} $$

which asymptotically approaches one as \( n \) grows large. Since we'd like to think of item correlation as the big deal here, it's not ideal that with fixed correlations, the reliability measure depends on how many items there are. This phenomenon can be linked to the Spearman-Brown prediction formula, but I didn't track that reference down.
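To see how quickly the item count inflates alpha, we can evaluate that last expression for a few values of \( n \) (a quick check, still assuming every inter-item correlation is .7 and variances are one):

# Alpha from the simplified formula .7n / (.7n + .3).
n <- c(2, 3, 10, 50)
round(0.7 * n / (0.7 * n + 0.3), 3)
# 0.824 0.875 0.959 0.992 -- "reliability" improves just by adding items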

This dependency on the number of items is manageable for tests and surveys as long as we keep it in mind, but it's more problematic for rater agreement.

It's fairly obvious that if all the items have the same variance, we can just divide and get the correlation matrix, so alpha will not change in this case. But what if items have different variances? Suppose we have correlations of .7 as before, but items 1, 2, and 3 have standard deviations of 1, 10, and 100, respectively.  Then the covariance matrix is

$$  \begin{bmatrix} 1 & 7 & 70 \\ 7 & 100 & 700 \\ 70 & 700 & 10000 \end{bmatrix} $$

The same-variance version has alpha = .875, but the mixed-variance version has alpha = .2, so the variance of individual items matters. This presents another headache for the user of this statistic, and perhaps we should try to ensure that variances are close to the same in practice. Hey, what does the wikipedia page advise?
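Those two numbers are easy to check in R by computing alpha directly from a covariance matrix with the ratio-of-sums formula described above; cronbach_alpha() below is a small helper defined here, not a library function:

# Cronbach's alpha from a covariance matrix C:
# alpha = (n / (n - 1)) * (off-diagonal sum) / (total sum).
cronbach_alpha <- function(C) {
  n <- nrow(C)
  n / (n - 1) * (sum(C) - sum(diag(C))) / sum(C)
}

rho <- 0.7
R <- matrix(rho, 3, 3); diag(R) <- 1     # three items, correlations .7, variances 1
cronbach_alpha(R)                        # 0.875

sds <- c(1, 10, 100)                     # same correlations, mixed scales
C <- diag(sds) %*% R %*% diag(sds)
cronbach_alpha(C)                        # 0.2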

Assumptions

I've treated the alpha statistic as a math object, but its roots are in standardized testing. Cronbach's paper addresses a problem extant at that time: how to estimate the reliability of a test, or at least of the items on a test that are intended to measure the same skill. From my browsing of that paper it seems that one popular method at the time was to split the items in half and correlate scores from one half with the other. This is unsatisfactory, however, because the result depends on how we split the test. Cronbach's paper shows that alpha is the average over all such choices, so it standardizes the measure.

However, there's more to this story, because the wiki page describes assumptions about the test item statistics that should be satisfied before alpha is a credible measure. The strictest assumption is called "tau equivalence," in which "data have equal covariances, but their variances may have different values." Note that this will always be true if we only have two items (or raters, as in the critical thinking paper), but generally I would think this is a stretch. 

Never mind the implausibility, however. Does tau-equivalence fix the problem identified in the previous section? It seems unreasonable that the reliability of a test should change if I simply change the scores for items. Suppose I've got a survey with 1-5 Likert-type scales, and I compute alpha. I don't like the result, so I arbitrarily change one of the scales to 2-10 by doubling the values, and get a better alpha. That adjustment is obviously not desirable in a measure of reliability. But it's possible without preconditions on the covariance matrix. Does tau-equivalence prevent such shenanigans?
 
For a discussion of the problems caused by the equal-covariance assumption and others see 

Trizano-Hermosilla, I., & Alvarado, J. M. (2016). Best alternatives to Cronbach's alpha reliability in realistic conditions: congeneric and asymmetrical measurements. Frontiers in Psychology, 7, 769. [link]

The authors make the point that alpha keeps getting used, long after more reasonable methods have emerged. I think this is particularly true for rater reliability, which seems like an odd use for alpha.
 

Two Dimensions

The 2x2 case is useful, since it's automatically tau-equivalent (there is only one covariance element, so it's equal to itself). Suppose we have item standard deviations of \( \sigma_1, \sigma_2 \) and a correlation of \( \rho_{12} \). Then the covariance will remain the same if we scale one of the items proportionally by a constant \( \beta \) and the other inversely so. In detail, we have

$$  \begin{bmatrix} \beta^2 \sigma_1^2 & \rho_{12} \sigma_1 \sigma_2 \\ \rho_{12} \sigma_1 \sigma_2 & \frac{ \sigma_2^2}{\beta^2} \end{bmatrix} $$
 
Then 
 
$$ \alpha = 2  \frac{2 \rho_{12} \sigma_1 \sigma_2}{2 \rho_{12} \sigma_1 \sigma_2 +  \beta^2 \sigma_1^2 + \frac{ \sigma_2^2}{\beta^2} } $$

The two out front is that fudge factor we have to include to keep the values in range. Consider everything fixed except for \( \beta \), so the extreme values of alpha will occur when the denominator is largest or smallest. Taking the derivative of the denominator, setting to zero, and solving gives

$$ \beta = \pm \sqrt{\frac{\sigma_2}{\sigma_1}} $$
 
Since the negative solution doesn't make sense in this context, we can see with a little more effort that alpha is maximized when the two variances are each equal to \( \sigma_1 \sigma_2 \) so that we have
 
$$  \begin{bmatrix} \sigma_1 \sigma_2  & \rho_{12} \sigma_1 \sigma_2 \\ \rho_{12} \sigma_1 \sigma_2 & \sigma_1 \sigma_2 \end{bmatrix}. $$
 
At that point we can divide the whole thing by  \( \sigma_1 \sigma_2 \), which won't change alpha, to get the correlation matrix
 
 
$$  \begin{bmatrix} 1  & \rho  \\ \rho  & 1 \end{bmatrix} $$
 
So for a 2x2 case, the largest alpha occurs when the variances are equal, and we can just consider the correlation. My guess is that this is true more generally, and that a convexity argument, e.g. via Jensen's Inequality, could show it. But that's just a guess.
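A quick numeric check of the 2x2 claim, sweeping \( \beta \) with \( \sigma_1 = 1, \sigma_2 = 10, \rho_{12} = .7 \) and reusing the cronbach_alpha() helper defined earlier:

# Alpha for the 2x2 covariance matrix as beta varies; the covariance stays fixed.
alpha_2x2 <- function(beta, s1 = 1, s2 = 10, rho = 0.7) {
  C <- matrix(c(beta^2 * s1^2, rho * s1 * s2,
                rho * s1 * s2, s2^2 / beta^2), 2, 2)
  cronbach_alpha(C)
}

betas <- seq(0.5, 6, by = 0.1)
betas[which.max(sapply(betas, alpha_2x2))]   # near sqrt(10) ~ 3.16, as predicted
alpha_2x2(sqrt(10))                          # 2 * .7 / 1.7 ~ 0.824, the maximum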
 
This fact for the 2x2 case seems to be an argument for scaling the variances to one before starting the computation. However, that isn't the usual approach when using alpha as an internal-consistency reliability statistic, as far as I can tell.

Rater Reliability

For two observers who assign numerical ratings to the same sources, the correlation between ratings is a natural statistic to assess reliability with. Note that the correlation calculation rescales the respective variances to one, which will maximize alpha as noted above.
 
In fact, calculating such correlations is the aim of the intra-class correlation (ICC), which you can read more about here:
 
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 

The simplest version of ICC works by partitioning variance into between-student and within-student components, with the former being of most importance: we want to be able to distinguish cases, and the within-student variation is seen as error or noise due to imperfect measurement (rater disagreement in this case). The simple version of the ICC, which Shrout calls ICC(1,1), is equivalent to the correlation between scores for the same student. See here for a derivation of that.
 
With that we can see that for the 2x2 case using variances = 1, the correlation \( \rho \), which is also the ICC, is related to Cronbach's alpha via

 
$$ \alpha = 2  \frac{ \rho } { \rho  + 1 } $$
 
You can graph this relationship on Google to see the trade-off: an alpha level of about .7 is equivalent to a within-student correlation of about .54.

Implications

I occasionally use the alpha statistic when assessing survey items. I first correlate all the discrete-scale responses by respondent, dropping blanks on a case-by-case basis (in R it's cor(x, use = "pairwise.complete.obs")). Then I collect the top five or so items that correlate and calculate alpha for those. Given the analysis above, I'll start using the correlation matrix instead of the covariance matrix for the alpha calculation, in order to standardize the metric across scales. This matters because our item responses can vary quite a bit in standard deviation.
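Here's a sketch of that workflow; the survey_items data frame (one row per respondent, one Likert-scale column per item) is hypothetical, and it reuses the cronbach_alpha() helper from the previous post's section above on a correlation submatrix:

# Pairwise-complete correlations between items.
R <- cor(survey_items, use = "pairwise.complete.obs")

# Keep the five items that correlate most strongly with the others,
# ranked by average absolute correlation.
avg_cor <- rowMeans(abs(R))
top     <- names(sort(avg_cor, decreasing = TRUE))[1:5]

# Alpha from the correlation (not covariance) matrix, so item scales don't matter.
cronbach_alpha(R[top, top])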

In the paper cited about critical thinking, the authors find alphas > .7 and cite the prevailing wisdom that this is good enough reliability. I tracked down that .7 thing one time, and it's just arbitrary--like the .05 p-value ledge for "significance." Not only is it arbitrary, but the same value (.7) shows up for other statistics as a minimum threshold. For example, correlations. 

The meaninglessness of that .7 thing can be seen from the relationship above. If a .7 ICC is required for good-enough reliability, that equates to a .82 alpha, not .7 (I just played around with the graph, but you could invert the formula to compute it exactly). See the contradiction? Moreover, the square root of the ICC can also be interpreted as a correlation, which makes the .7 threshold even more of a non sequitur.
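For the record, inverting the 2x2 relationship gives \( \rho = \alpha / (2 - \alpha) \), so the exact values are easy to compute:

alpha_from_icc <- function(rho)   2 * rho / (rho + 1)
icc_from_alpha <- function(alpha) alpha / (2 - alpha)

icc_from_alpha(0.7)   # about 0.54: a "good enough" alpha is a mediocre correlation
alpha_from_icc(0.7)   # about 0.82: a .7 ICC corresponds to a much higher alpha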
 
If you took the German mini-quiz, then the correct answer is "Das Auto ist rot." Unless you live in Cologne, in which case it's "Der Auto." Or so I'm told by a native who has a reliability of .7.
 

Edits

I just came across this article, which is relevant.
 

Despite its popularity, [alpha] is not well understood; John and Soto (2007) call it the misunderstood giant of psychological research.
[...]

This dependence on the number of items in a scale and the degree to which they covary also means that [alpha] does not indicate validity, even though it is commonly used to do so in even flagship journals [...]
There's a discussion of an item independence assumption and much more.