Wednesday, November 27, 2019

Variations on Latent Scales

Introduction

The last article showed how to take ordinal scale data, like Likert-type responses on a survey, and map them to a "true" scale hypothesized to exist behind the numbers. If the assumptions make sense for the data, this can lead to better estimates of averages, for example. 

In this article, I'll compare some different ways to calculate the latent scale. As with most math exercises, the devil is in the details. 

The basic idea behind all the models is that we choose:
  • a probability distribution, assumed to represent the spread-out-ness of the underlying scale
  • a way to make a linear map between the distribution's x-axis (the raw latent scale) and the original scale (the latent scale mapped back to familiar values like 1 = strongly disagree).
I will focus on the normal distribution in most of this article. That leaves us with the simple-seeming problem of how to draw a line. This step is really optional; the real work is done by mapping item frequencies to the distribution we chose. But the resulting latent scale values will come from the x-axis of that distribution. For the normal curve that means z-scores, so instead of 1 = strongly disagree, we might have -2 = strongly disagree, and 1.5 = strongly agree. In this example, those are the z-scores corresponding to the way the distribution gets chopped up. 

In IR, we already have enough trouble just explaining that the question "how many students do we have" has about a dozen answers. Trying to explain that survey results can have negative numbers will not help our weekly productivity. So it makes sense to map the distribution's native values back into a scale we recognize. 

Variations on a Line

The starting point is the proportions of responses. In the Amazon review example from last time, the proportions of 1-star to 5-star reviews were [.21, .08, .07, .11, .52]. These proportions determine the cut-points on the continuous latent scale: the places where one ordinal value jumps to the next. Theoretically, if we looked at the happiest of the 21% of reviewers who assigned a single star and added just a little more product satisfaction, their review would jump to two stars as part of the 8%. When we divide up the N(0,1) distribution (normal with mean zero and standard deviation one) using the cumulative rating proportions, we get z-score cut-points [-.8, -.54, -.35, -.06]. When we create the linear map, these are our x-values on the x/y plot.
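
As a quick check, the cut-points can be computed directly with the normal quantile function. This is just a sketch using the rounded proportions above, so the computed values can differ from the quoted ones in the last digit or two:

props <- c(.21, .08, .07, .11, .52)             # 1-star to 5-star proportions
cuts  <- qnorm(cumsum(props)[-length(props)])   # z-scores at the cumulative proportions, dropping the final 1.0
round(cuts, 2)                                  # roughly -0.81, -0.55, -0.36, -0.08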

The y-values--the outputs for the mapping--are the ordinal scale values. We could use anything we want here. Instead of mapping back to a 1-5 range, we could use 1-100, for example. But since the original scale is probably familiar to our audience, it's easiest to stick with that.

Here are some ways I found to do this mapping.

Variation 1: Use first and last cut-point

A line only needs two points for its definition, so we can use the first and last cut-points for the two x-values. The y-values are a bit of a problem, but the common solution is to map the left cut-point to the smallest ordinal value plus one half. That is, we assume that the cut-point between 1 = strongly disagree and 2 = disagree happens at 1.5. Similarly, the cut-point between 4 = agree and 5 = strongly agree happens at 4.5 in this Likert-scale example.

Using the cut-points in this way gives the equation from Step 3 in the prior post:

$$ L(z) = \frac{4.5 - 1.5}{-.06 - (-.80)}(z - (-.80)) + 1.5 $$
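
For reference, here is a minimal sketch of that map in R, reusing the cuts vector computed above (the variable names are mine, not the github code's):

slope_cuts <- (4.5 - 1.5) / (cuts[length(cuts)] - cuts[1])  # rise over run between the outer cut-points
map_cuts   <- function(z) slope_cuts * (z - cuts[1]) + 1.5  # the first cut-point maps to 1.5, the last to 4.5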

This creates a mapped scale where 1.5 and 4.5 (on a 5-point scale) are fixed as cut-points. See the tick-marks on the graph below.


The horizontal position of the lollipops on these graphs is the median value of the corresponding segment of the normal distribution. So for the left-most region of the distribution above (dark blue), the red line is situated so that the dark blue area is the same on its left as on its right. This pushes all the points toward the peak (mode) of the distribution.

To reproduce this in the code from github, use method = "cuts" or method = "probit". They use different computational methods, but end up in the same place. 

Variation 2: Use median values

I don't like assuming that the division between 1 and 2 happens at 1.5 on the latent scale; it seems crude and unnecessary. An alternative is to find the median value of the latent scale for each ordinal response value and use those instead of the cut-points.

We can find the median values using the proportions with the following code, where I assume x is the vector of original ratings.


xtab  <- table(x)                         # counts for each ordinal value
K     <- length(xtab)                     # number of scale points
xprop <- unname(prop.table(xtab))         # response proportions
xcum  <- cumsum(xprop)[1:(K - 1)]         # cumulative proportions, dropping the final 1.0
medians <- qnorm((c(xcum, 1) + c(0, xcum)) / 2)   # median z-score within each segment
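
To finish the map for this variation, the first and last medians can be anchored at 1 and K. This is a sketch with my own variable names, not the github implementation:

slope_med <- (K - 1) / (medians[K] - medians[1])            # span 1..K across the outer medians
map_med   <- function(z) slope_med * (z - medians[1]) + 1   # the first median maps to 1, the last to K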

And here's the resulting distribution, with markers for the new latent scale map. 



Notice that in this case the 1 and 5 remain fixed in place, so we never get values less than 1 or more than 5. This is very convenient, and I prefer it to fixing the scale at 1.5 and 4.5.

Use method = "median" to reproduce this curve.

Variation 3: Use all the median points

So far we have either used the first and last cut-point (variation 1) or the first and last median value (variation 2) to define the line. But it seems inelegant to ignore all but the first and last points when creating the linear map. Why not fit a line through all the points with a regression?

If we apply this idea to Amazon ratings, using the median value within each range to represent all the points at that scale value, we get another map. In doing that, I weighted the points by their response proportions.
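
Here is a sketch of how I read this variation, fitting a weighted least-squares line through all K (median, ordinal value) pairs. The variable names are mine, and the github code may do this differently:

y   <- seq_len(K)                     # ordinal values 1..K
w   <- as.numeric(xprop)              # weight each point by its response proportion
fit <- lm(y ~ medians, weights = w)   # weighted least squares through all K points
map_ols <- function(z) coef(fit)[1] + coef(fit)[2] * z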

Use method = "ols".

Variation 4: Use all the points

But why stop there? We can imagine that the true values of the ratings are spread out just like the distribution, and create as many points as we want at the various locations along the x-axis (z-scores). Then use weighted regression on those. 
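
One way to sketch that idea (an assumption on my part, not necessarily how ols2 is implemented): take a fine grid of z-scores, label each one with the ordinal category it falls into using the cut-points from earlier, and weight the regression by the normal density:

zz   <- seq(-4, 4, by = 0.01)               # fine grid of z-scores
kk   <- findInterval(zz, cuts) + 1          # ordinal category (1..K) for each grid point
fit2 <- lm(kk ~ zz, weights = dnorm(zz))    # weight each grid point by the normal density
map_ols2 <- function(z) coef(fit2)[1] + coef(fit2)[2] * z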

Use method = "ols2".

Comparison of Methods

Rather than include graphs for the last two variations, I'll just put everything on one graph.



The x-axis here is our starting point--the 1-5 ordinal scale (e.g. Amazon star ratings or a Likert-type scale). For each of these values, the transformed latent scale moves the original value 1-5 up or down, depending on the proportions, the distribution we use, and the mapping method. 

The y-axis on the graph is the displacement from the original value after the transformation. Anything above the dashed line means the transformation increased the original value, and anything below the line means it decreased it.
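
Concretely, a displacement like this can be computed by applying a map to each category's median z-score and subtracting the original value. For example, with the median-based map sketched earlier:

mapped       <- map_med(medians)      # latent values for the original categories 1..K
displacement <- mapped - seq_len(K)   # positive = pushed up, negative = pulled down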

The zigzag shape shows that all of the methods but one compress the internal range relative to the endpoints. The exception is the ols2 variation, which just squishes everything together.

The "cuts" and "probit" lines overlap. That variation pushes the bottom and top ratings away from the middle by subtracting more than 1 on the left and adding more than 2 on the right. The total range of the new scale is quite a bit larger than the original 1-5 range: more like 0 - 7. By contrast, the 2-3-4 values on the original scale are essentially unchanged.

The green line represents variation 2, using the median values of the first and last ordinal values as anchor points. Notice that the displacement is zero on both ends, forcing the changes to stay within the original 1-5 range. It pushes the 2 away from the 1 by adding about .25 to it, and pulls the 3 and 4 back, contracting the middle values in, away from the ends.

The regression methods put more weight on the middle of the distribution, which can have dramatically different effects depending on the original distribution. This is most evident in the ols2 method here, which simulates putting weight on all the points on the distribution. The effect is to diminish the importance of the extreme values. The regression methods should be considered experimental. 

The logistic method (gold) closely tracks the probit/cuts methods. The only difference between the two is that it uses a logistic distribution instead of a normal distribution, and the "z-score" estimates of the cut points come from the MASS package in R.

Discussion

Of the methods surveyed here, I prefer the "median" variation, which keeps the end points of the scale in place. One way I use the latent scale is to allow users to switch back and forth between the naive average of the ordinal scales (probably wrong, but familiar) and the latent version. If the scale range is the same, the comparison is easier. At some point I'll write about our survey data warehouse and its data-mining interface, to show an example.
