Higher Ed/: Transforming Ordinal Scales

Introduction

A few days ago, I listed problems with using rubric scores as data to understand learning. One of these problems is how to interpret an ordinal scale for purposes of doing statistics. For example, if we have a rating system that resembles "poor" to "excellent" on a 5-point scale, it's a usual practice to just equate "poor" = 1, ..., "excellent" = 5, and compute an average based on that assignment.

The Liddell & Kruschke paper I cited gives examples of how this simple approach goes awry. From the abstract:

We surveyed all articles in the Journal of Personality and Social Psychology (JPSP), Psychological Science (PS), and the Journal of Experimental Psychology: General (JEP:G) that mentioned the term “Likert,” and found that 100% of the articles that analyzed ordinal data did so using a metric model. We present novel evidence that analyzing ordinal data as if they were metric can systematically lead to errors.

Those examples are probably uses of Likert-type survey response scales, and the issue pertains to both survey responses and ordinal scale rubric ratings.

In this article I'll give the results of my weekend work on trying to understand the issue and build R functions to easily transform the 1-2-3 scale into a latent scale.

Intuition

If you use Amazon.com for online shopping, you've seen the 5-star product rating system, which shows a distribution of responses. Here's an example, for this product.

Notice that compared to a mound-shaped distribution, we have excess 5-star and 1-star ratings (i.e. the distribution is bimodal, more like a beta distribution than a normal curve).

One interpretation of the results is that the ratings on the end (5- and 1- star) include more within-rating variation than the other response levels. In that interpretation, some raters thought the product was "good" and gave it 5 stars. Another rater thought the product was truly extraordinary, and would have given it a 6 if possible, but the scale only goes to 5.

Another way to say this is that we can imagine that the limitations of the scale are "censoring" the data by clipping the high and low ends. A natural question is: what's the true average value of the five-star rating if we imagine extending the scale? At first glance, it seems impossible to determine any such thing from the data we have. Latent scales are an attempt to provide a reasonable answer.
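The censoring idea can be illustrated with a small simulation. This is only a sketch: the mean, standard deviation, and cut-points of the simulated opinions are made-up values chosen for illustration, not estimates from the Amazon data.

```r
# Simulate continuous "true" opinions and force them onto a 1-5 star scale.
# The mean, sd, and breaks are illustrative assumptions only.
set.seed(1)
latent <- rnorm(10000, mean = 3.5, sd = 2)

# Everything below 1.5 is recorded as 1 star, everything above 4.5 as 5 stars.
stars <- cut(latent, breaks = c(-Inf, 1.5, 2.5, 3.5, 4.5, Inf), labels = FALSE)

# Average latent opinion within each recorded star rating
latent_means <- tapply(latent, stars, mean)
print(round(latent_means, 2))
```

In this setup the average latent opinion behind a recorded 5 lies above 5, and the average behind a recorded 1 lies below 1, because the end categories absorb everything beyond the scale.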

Latent Scales

As with any statistical model, there are assumptions required to get started. These assumptions always need to be carefully considered to see if they fit your use case. Here I'll assume that when raters (or survey respondents) respond on the ordinal scale, there is a continuous scale at work in the background. For a rater looking at student writing samples, this means that there's a scale from very bad to most excellent that smoothly moves from one quality to the next. A practical test of this would be to see if there are any two papers that raters think are exactly the same quality.

In practice, we have the problem that these ratings are multi-dimensional, so a more realistic assumption is that there are multiple correlated continuous scales in the background. But that's beyond the scope of what we can address here.

The second assumption is that the latent scale gets translated into discrete response categories at breakpoints: that there's some threshold where we flip from "good" to "very good." Perhaps we feel that this choice is a little difficult--that would be another sign that the scale is inherently continuous, because at the boundary between the two ratings, there really is no difference.

Finally, we have to assume something about how values of the scale are distributed in an infinitely large population of ratings (or survey responses). We often choose a normal distribution, but the logistic distribution is another choice.
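To get a feel for how much the distributional choice matters, here is a minimal sketch comparing the thresholds implied by a standard normal and a standard logistic distribution. The cumulative proportions are the rounded Amazon values worked out under Calculating the Transformation; the data frame and its column names are just for illustration.

```r
# Cumulative proportions for the 1|2, 2|3, 3|4, and 4|5 thresholds
cum_props <- c(0.21, 0.29, 0.36, 0.47)

# Quantile functions turn cumulative proportions into threshold locations
thresholds <- data.frame(
  cum_prop = cum_props,
  normal   = qnorm(cum_props),   # standard normal, sd = 1
  logistic = qlogis(cum_props)   # standard logistic, sd = pi / sqrt(3)
)
print(round(thresholds, 2))
```

The raw numbers differ mostly because the standard logistic distribution is more spread out than the standard normal, so the two sets of thresholds only become comparable after they are anchored to a common scale.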

Visualizing the Transformed Scale

To illustrate the latent scale idea, I'll use the ratings data from Amazon.com, shown above. The proportions have been mapped to a normal distribution, and the scale values transformed from the original 1-5.

The color breaks on the normal curve show the thresholds between rating values 1-5. The area in each region (i.e. its idealized probability) is the same as the proportion of responses for that value--the same proportions as from the Amazon.com screenshot above. This is also shown as the height of the red "lollipop", so that the rightmost one is at .52 = 52%, which you can see on the distribution in the first image. The horizontal location of the lollipops shows the transformed value of the scale response.

The assumption of a latent scale, combined with the assumption that the scale values follow a normal density, determines the cut-points on the distribution that logically must mark the boundaries between each response value (e.g. 1-5 stars). Each segment's area under the curve corresponds to one response and equals that response's probability. These are sorted left to right, so the dark blue area is the one-star review, with 21% of the area.

One of the assumptions for this model was that the break between 4 and 5 happens at 4.5, which you can see from the graph. We can make other assumptions about where the breaks should be, which result in somewhat different conclusions.

The most noticeable effects are that the 1 and 5 have been pushed out away from their nominal values. In this model they sit at the median values for their section of the distribution--the halfway point where there is just as much probability on the left as on the right. Intuitively, this is accounting for the likelihood that raters at the top and bottom need more "scale room" to tell us what they really think, and consequently the "true" value of the rating average for a five-star rating is quite a bit larger than 5.

Notice that the 3-star region (the middle one) gets squished so that it's not even one unit wide after the transformation. The model assumes that fewer responses imply less variation is needed on the latent scale to cover that case, and assigns a smaller range accordingly.
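The widths of the interior regions can be checked directly. This is a sketch that assumes the cut-point method described under Calculating the Transformation, with the 1|2 and 4|5 cut-points pinned at 1.5 and 4.5; the variable names are mine.

```r
# Rating counts for 1 to 5 stars, as in the Amazon example
counts <- c(21, 8, 7, 11, 52)
cum_props <- (cumsum(counts) / sum(counts))[1:4]

# z-scores of the four cut-points on a standard normal curve
z <- qnorm(cum_props)

# Straight line through (z[1], 1.5) and (z[4], 4.5)
cuts <- 1.5 + (4.5 - 1.5) * (z - z[1]) / (z[4] - z[1])

# Width of the 2-, 3-, and 4-star regions on the transformed scale
widths <- diff(cuts)
names(widths) <- c("2-star", "3-star", "4-star")
print(round(widths, 2))
```

The three widths always add up to 3 under this anchoring, so a region with few responses gives up room to its neighbors.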

Calculating the Transformation

There are multiple methods for calculating the latent scale. I'll describe the most common here and follow up later with more details.

Step 1. Take the frequencies of the responses 1-5 and make a cumulative version starting from the left. This gives [.21, .21 + .08, .21 + .08 + .07, .21 + .08 + .07 + .11] = [0.21, 0.29, 0.36, 0.47]. The fifth sum is always 1, so we can leave it off.
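Step 1 as a minimal R sketch, starting from the ratings vector in the Code section. Those 99 ratings give proportions that differ slightly from the rounded percentages, for example 21/99 rather than exactly .21.

```r
# The Amazon ratings, reconstructed from the percentages
ratings <- c(rep(1, 21), rep(2, 8), rep(3, 7), rep(4, 11), rep(5, 52))

# Proportion of responses at each star level
counts <- as.vector(table(ratings))
props <- counts / sum(counts)

# Cumulative proportions from the left; the last one is always 1, so drop it
cum_props <- cumsum(props)
cum_props <- cum_props[-length(cum_props)]
print(round(cum_props, 2))
```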

Step 2. Find the corresponding z-scores on the standard N(0,1) cumulative distribution S-curve that match the cumulative frequencies in step 1. In this case, it's $ z = [-0.80, -0.54, -0.35, -0.06] $. For example, -.80 is the cut point between the 1- and 2-star ratings: the left tail of the normal density curve below it has area = .21.
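In R, step 2 is a single call to `qnorm`, the inverse of the normal cumulative distribution function. The sketch below assumes the cumulative proportions are taken from the 99 ratings in the Code section; starting from the rounded percentages instead shifts the z-scores slightly.

```r
# Cumulative proportions for the four cut-points (out of 99 ratings)
cum_props <- c(21, 29, 36, 47) / 99

# z-scores whose left-tail area matches each cumulative proportion
z <- qnorm(cum_props)
print(round(z, 2))

# Sanity check: pnorm() maps the z-scores back to the areas
print(round(pnorm(z), 2))
```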

Step 3. Fit a line to use the z-scores from step 2 to map to transformed scale values. There are various ways to do that, depending on how we want to anchor the new latent scale relative to the original. One is to assume that the cut-point between 1 and 2 occurs at 1.5 on the new transformed scale, and that the cut-point between 4 and 5 occurs at 4.5. In this case, 1.5 and 4.5 will match on both the original scale and the new one. Here's the formula:

$$ L(z) = \frac{4.5 - 1.5}{-.06 - (-.80)}(z - (-.80)) + 1.5 $$

This produces these values for the cut-points:

z	original	transformed
-0.80	1.5	1.50
-0.54	2.5	2.54
-0.35	3.5	3.34
-0.06	4.5	4.50

Again, we can see that the anchor points 1.5 and 4.5 match on both scales. This is not the only way to create the line that defines the latent scale.
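The line from step 3 can be sketched as a small function. The name `latent_line` and its arguments are hypothetical, chosen for this illustration; they are not the functions from my GitHub code. The z-scores are computed from the 99 ratings without rounding.

```r
# z-scores of the four cut-points (steps 1 and 2)
z <- qnorm(c(21, 29, 36, 47) / 99)

# Line through (lowest z cut-point, low) and (highest z cut-point, high)
latent_line <- function(z_value, z_cuts, low = 1.5, high = 4.5) {
  z_low <- z_cuts[1]
  z_high <- z_cuts[length(z_cuts)]
  slope <- (high - low) / (z_high - z_low)
  slope * (z_value - z_low) + low
}

# Rebuild the cut-point table
cut_table <- data.frame(
  z           = round(z, 2),
  original    = c(1.5, 2.5, 3.5, 4.5),
  transformed = round(latent_line(z, z), 2)
)
print(cut_table)
```

Changing `low` and `high`, or picking different anchor cut-points, gives the other possible lines mentioned above.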

Discussion

The latent scale transformation suggests that the difference between a 3-star and a 4-star rating is not nearly as great as the difference between a 4-star and a 5-star rating for this product. Similarly, a 1-star review is worse than we might expect if we just assume that the nominal distances between ratings are real.
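These distances can be checked with a short sketch. It assumes that every rating, not just the 1 and the 5, is placed at the median of its region, and that the cut-points 1.5 and 4.5 are the anchors. The variable names are illustrative, and the numbers may differ from what other anchoring choices would give.

```r
# Rating counts for 1 to 5 stars
counts <- c(21, 8, 7, 11, 52)
cum <- cumsum(counts) / sum(counts)

# Lower and upper cumulative probability of each rating's region
lower <- c(0, cum[1:4])
upper <- cum

# z-scores of the cut-points and of each region's median
z_cuts <- qnorm(cum[1:4])
z_mid <- qnorm((lower + upper) / 2)

# Same line as in step 3: cut-points 1|2 and 4|5 map to 1.5 and 4.5
slope <- (4.5 - 1.5) / (z_cuts[4] - z_cuts[1])
latent_values <- 1.5 + slope * (z_mid - z_cuts[1])

# Distances between adjacent ratings on the latent scale
gaps <- diff(latent_values)
names(gaps) <- c("1 to 2", "2 to 3", "3 to 4", "4 to 5")
print(round(latent_values, 2))
print(round(gaps, 2))
```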

This technique is applicable to a lot of the data we use in institutional research, including surveys of students, course evaluations, and rubric ratings.

The method illustrated here isn't magic. There isn't a lot of information to go on in this simple case of mapping from only response frequencies. It gets more interesting when we have other explanatory variables involved, e.g. to find the average difference between two groups on the latent scale. More on that anon. Before using a latent scale map, be sure that the assumptions make sense. For example, if a survey item asks for salary ranges, you already have a scale.

Code

You can find the code to reproduce the statistics and graphs used here on my GitHub site. Use at your own risk--this is a work in progress.

You should be able to reproduce the Amazon example with:

ratings <- c(rep(1,21), rep(2,8), rep(3,7), rep(4,11), rep(5,52))

lstats <- latent_stats(ratings, method = "cuts")
plot_latent_x(lstats, lollipop = TRUE)

References

Liddell, T. M., & Kruschke, J. K. (2018). Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328-348.


Tuesday, November 26, 2019
