
Posts

Added Variable Graphs

Introduction
A common difficulty in regression modeling is figuring out which input variables to include. One tool at our disposal is the added-variable plot. Given an existing model, the added-variable plot lets us visualize the improvement we would get by adding one more variable of our choosing. This idea is described nicely in A Modern Approach to Regression with R by Simon J. Sheather, pages 162-166.
A function for displaying added-variable plots can be found in R's car package, among other places. That package is the support software for An R Companion to Applied Regression, and has other useful functions for regression modeling.
Simulated Data
To illustrate the added-variable graphs, I'll use simulated data. Simulations are useful because, when we build the relationship between variables by hand, we know the answer before we start. Then we can check the regression model's estimates against that known answer.
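As a minimal sketch of the idea, here is a base-R example that builds an added-variable plot by hand from simulated data. The variable names (x1, x2, y), the coefficients, and the sample size are hypothetical choices made for illustration. The sketch relies on a standard result: the slope of the line in an added-variable plot equals the coefficient the new variable would receive in the enlarged model.

```r
# Hypothetical simulation: y depends on x1 and x2, and x2 is correlated with x1.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

# The existing model uses x1 only; the candidate variable to add is x2.
res_y  <- resid(lm(y ~ x1))    # part of y not explained by x1
res_x2 <- resid(lm(x2 ~ x1))   # part of x2 not explained by x1

# The added-variable plot is the scatterplot of these two sets of residuals.
plot(res_x2, res_y, xlab = "x2 | x1", ylab = "y | x1")
av_fit <- lm(res_y ~ res_x2)
abline(av_fit)

# The fitted slope should match the x2 coefficient in the full model.
full_fit   <- lm(y ~ x1 + x2)
av_slope   <- unname(coef(av_fit)["res_x2"])
full_slope <- unname(coef(full_fit)["x2"])
all.equal(av_slope, full_slope)
```

If the car package is installed, car::avPlots(full_fit) draws the same kind of plot for each predictor in a fitted model.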

library(tidyverse)
library(broom)
add_names <- function(x…
Recent posts

Causal Inference

Introduction
We all develop an intuitive sense of cause and effect in order to navigate the world. It turns out that formalizing exactly what "cause" means is complicated, and I think it's fair to say that there is not complete consensus. A few years ago, I played around with the simplest possible version of cause/effect to produce "Causal Interfaces," a project I'm still working on when I have time. For background I did a fair amount of reading in the philosophy of causality, but that wasn't very helpful. The most interesting work done in the field comes from Judea Pearl, and in particular I recommend his recent The Book of Why, which is a non-technical introduction to the key concepts.
One important question for this area of research is: how much can we say about causality from observational data? That is, without doing an experiment?
Causal Graphs
The primary conceptual aid in both philosophy and in Pearl's work is a diagram with circles and arr…

Finding Latent Mean Scores

Introduction
The use of ordinal responses as scalar values was on my list of problems with rubric data. I showed how to estimate an underlying "latent" scale for ordinal data here and elaborated on the methods here. These methods cast the response values into a normal (or logistic, etc.) distribution, from which we can get the latent mean. However, by itself this isn't very interesting, because we usually want to compare the means of groups within the data. Do seniors score higher than freshmen?
This post shows how to make such comparisons by combining the latent scale model with a matrix of predictor variables to create a unified model from which to pull means.
For example, if we wanted to compare senior scores to freshmen scores, it wouldn't do to separate the data sets and calculate the mean for each; it's much better to keep them in one data set and use the class status as an explanatory variable within a common model. This idea can then be extended to c…
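A minimal sketch of the "one common model" idea, using an ordered probit fit with polr() from the MASS package. This is an illustrative stand-in and not necessarily the model this post builds. The data are simulated, and the names (senior, score, ratings), the cut points, and the 0.5 shift are all hypothetical.

```r
library(MASS)  # provides polr(); MASS ships with standard R installations

# Hypothetical data: a latent normal score, shifted up by 0.5 for seniors,
# observed only as an ordinal rating from 1 to 5.
set.seed(1)
n      <- 2000
senior <- rep(c(0, 1), each = n / 2)
latent <- 0.5 * senior + rnorm(n)
score  <- cut(latent,
              breaks = c(-Inf, -1, -0.3, 0.4, 1.2, Inf),
              labels = 1:5,
              ordered_result = TRUE)
ratings <- data.frame(score = score, senior = senior)

# One model for everyone, with class status as an explanatory variable.
fit <- polr(score ~ senior, data = ratings, method = "probit")

# Estimated difference in latent means (seniors minus freshmen), in units
# of the latent standard deviation, and the estimated category cut points.
senior_shift <- unname(coef(fit)["senior"])
cutpoints    <- unname(fit$zeta)
senior_shift
cutpoints
```

Because both groups share the same cut points, the coefficient on senior is directly interpretable as a shift on the common latent scale, which is what separate per-group fits would not guarantee.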

Transparent Pricing in Higher Ed

If you haven't read it yet, I recommend that you pick up a copy of Paul Tough's book The Years that Matter Most: How College Makes or Breaks Us. In particular, the section starting on page 182 of the hardcover is a revealing picture of how college finances affect recruiting. You can find the same (or very similar) content in this New York Times article.

In short, most private colleges need to balance academic admissions requirements with what are essentially financial admissions requirements. The latter are needed to ensure sufficient revenue to make a budget. Public institutions are not immune either, since for many of them their revenue also depends heavily on tuition. Academic goals and financial goals for recruitment vary widely from one institution to the next, depending on market position, endowment, and other factors. This leads to a lot of variation in the actual price a given student might pay at different institutions.

Every college now has a net tuition calculator,…

Variations on Latent Scales

Introduction
The last article showed how to take ordinal scale data, like Likert-type responses on a survey, and map them to a "true" scale hypothesized to exist behind the numbers. If the assumptions make sense for the data, this can lead to better estimates of averages, for example.
In this article, I'll compare some different ways to calculate the latent scale. As with most math exercises, the devil is in the details. 
The basic idea behind all the models is that we choose two things: a probability distribution, assumed to represent the spread-out-ness of the underlying scale, and a way to make a linear map between the distribution's x-axis (the raw latent scale) and the original scale (the latent scale mapped back to familiar values like 1 = strongly disagree). I will focus on the normal distribution in most of this article. That leaves us with the simple-seeming problem of how to draw a line. This step is really optional; the real work is done by mapping item frequencies to the…
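As a rough sketch of those two choices under a standard normal distribution: the response counts below are hypothetical, and the linear map shown (sending each threshold to the half-point between adjacent response values) is only one possible way to draw the line, not necessarily one of the variants compared in this article.

```r
# Hypothetical counts of responses 1..5 on a Likert-type item.
counts <- c(`1` = 12, `2` = 30, `3` = 58, `4` = 70, `5` = 30)

# Choice 1: a standard normal latent distribution. The cumulative
# proportions of responses give the thresholds between categories.
cum_prop   <- cumsum(counts) / sum(counts)
thresholds <- qnorm(cum_prop[-length(cum_prop)])  # drop the final 1.0

# Choice 2: a linear map from the latent x-axis back to the original scale.
# Here the four thresholds are regressed onto the boundaries 1.5, ..., 4.5.
boundaries <- seq(1.5, 4.5, by = 1)
line_fit   <- lm(boundaries ~ thresholds)

# The latent mean is 0 on the raw latent scale, so its image on the
# original scale is the intercept of the fitted line.
latent_mean_original <- unname(coef(line_fit)[1])
thresholds
latent_mean_original
```

The thresholds depend only on the item frequencies and the chosen distribution; the fitted line just relabels the latent axis in familiar units.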

Transforming Ordinal Scales

Introduction
A few days ago, I listed problems with using rubric scores as data to understand learning. One of these problems is how to interpret an ordinal scale for purposes of doing statistics. For example, if we have a rating system that resembles "poor" to "excellent" on a 5-point scale, it's a usual practice to just equate "poor" = 1, ..., "excellent" = 5, and compute an average based on that assignment.

The Liddell & Kruschke paper I cited gives examples of how this simple approach goes awry. From the abstract:
We surveyed all articles in the Journal of Personality and Social Psychology (JPSP), Psychological Science (PS), and the Journal of Experimental Psychology: General (JEP:G) that mentioned the term “Likert,” and found that 100% of the articles that analyzed ordinal data did so using a metric model. We present novel evidence that analyzing ordinal data as if they were metric can systematically lead to errors.
Those examples a…