Monday, December 09, 2019

Added Variable Graphs

Introduction

A common difficulty in regression modeling is figuring out which input variables should be included. One tool at our disposal is the added-variable plot. Given an existing model, the added-variable plot lets us visualize the improvement we would get by adding one more variable of our choosing. This idea is described nicely in A Modern Approach to Regression with R by Simon J. Sheather, pages 162-166.

A function for displaying added-variable plots can be found in R's car package, among other places. That package is the support software for An R Companion to Applied Regression, and it has other useful functions for regression modeling.

Simulated Data

To illustrate added-variable graphs, I'll use simulated data. Simulations are useful because when we build the relationships between variables by hand, we know the answer before we start, so we can check what the regression model recovers against that known answer.


library(tidyverse)
library(broom)

add_names <- function(x, .names){
  colnames(x) <- .names
  return(x)
}

# create three variables that have correlations
cm <- matrix( c( 1, .5, .6,
                 .5, 1, .3,
                 .6,.3,  1),
              nrow = 3, 
              ncol = 3)

sim_data <- MASS::mvrnorm(n = 1000, mu = c(0,0,0), Sigma = cm) %>% 
  scale() %>% 
  add_names(c("y","x","z")) %>% 
  as_tibble()

# check the variance properties
var(sim_data) %>% round(2)

    y    x    z
y 1.00 0.49 0.64
x 0.49 1.00 0.32
z 0.64 0.32 1.00

This code creates three variables: y (the output) and x, z (two inputs). They each have mean zero and variance 1, and, as you can see from the covariance matrix at the bottom, they are correlated with each other. Because of the scaling, the correlation coefficients between variables are the same as the covariances, so cor(x,y) = .49 in the sample data. The specified value is .5, but there's a bit of sampling error.

A Combined Model

One problem that bedevils regression analysis is that inputs can be correlated with each other. The explanatory power of each input in isolation may be quite good, but when we put them together in one model they provide much of the same information about the output, so the coefficients can look weird. A coefficient may even have the "wrong" sign. This is not a mistake; the fitted coefficients always minimize squared error. Rather, it's a consequence of the correlational structure of the inputs, which confounds our intuition.


m_xyz <- lm(y ~ x + z, sim_data)
summary(m_xyz)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.226e-17  2.250e-02    0.00        1    
x           3.145e-01  2.377e-02   13.23   <2e-16 ***
z           5.362e-01  2.377e-02   22.56   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7115 on 997 degrees of freedom
Multiple R-squared:  0.4948, Adjusted R-squared:  0.4938 
F-statistic: 488.3 on 2 and 997 DF,  p-value: < 2.2e-16

Because the means are all zero, the intercept is zero. For a simple two-variable model (e.g. y predicted by x alone), the coefficient is given by the formula \( \beta = r_{xy} s_y / s_x \), so in our case the individual beta would be cor(sim_data$x, sim_data$y) * sd(sim_data$y) / sd(sim_data$x) = .488. But the x coefficient is less than that here, because we've added z as an explanatory variable, and z is correlated with x.
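As a quick check, using only the simulated data defined above, the one-input slope can be computed both ways:

# Slope of y ~ x two ways: the regression fit and the formula beta = r * s_y / s_x
coef(lm(y ~ x, sim_data))["x"]
with(sim_data, cor(x, y) * sd(y) / sd(x))   # same value, about 0.49 here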

Added-Variable Plots

With the two-input model, we can now ask for added-variable plots from the car package with 

car::avPlots(m_xyz)


The graph on the left starts with the model y ~ z and shows what adding x would contribute, taking us to y ~ x + z. The graph on the right does the reverse, starting with y ~ x and adding z. The slope of each line is the same as the regression coefficient the given variable will get in the combined model, so we want to see non-zero slopes. In this case, we can see that both inputs (x and z) are useful in explaining y.
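To see what avPlots is doing, here's a sketch (my own object names) that rebuilds the left-hand panel by hand from the simulated data above: plot the residuals of y ~ z against the residuals of x ~ z.

# Added-variable plot for x, built by hand: the part of y not explained by z
# plotted against the part of x not explained by z
av_x <- tibble(
  x_resid = resid(lm(x ~ z, sim_data)),
  y_resid = resid(lm(y ~ z, sim_data))
)

ggplot(av_x, aes(x_resid, y_resid)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE)

# The slope of the residual-on-residual regression reproduces the x coefficient
coef(lm(y_resid ~ x_resid, data = av_x))["x_resid"]

The slope matches the x coefficient in the combined model (about 0.31), which is why a non-zero slope in an added-variable plot signals that the variable is worth adding.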

Model Outputs

Since we are comparing several models, I'll show the summary results using the broom package.


# Models
m_xy <- lm(y ~ x, sim_data)
m_xz <- lm(z ~ x, sim_data)
m_zy <- lm(y ~ z, sim_data)
m_xyz <- lm(y ~ x + z, sim_data)

# Assemble summary statistics
m_stats <- map_df(list(m_xy,m_xz,m_zy,m_xyz),glance)
m_stats$Model <- c("m_xy","m_xz","m_zy","m_xyz")
m_stats$r <- sqrt(m_stats$r.squared)
m_stats %>% select(Model,r, r.squared)
The magic happens in the map_df() line, which uses the broom package's glance function to generate summary statistics for each model and purrr's map_df to bind them into one data frame.


  Model     r r.squared
  <chr> <dbl>     <dbl>
1 m_xy  0.487     0.237
2 m_xz  0.321     0.103
3 m_zy  0.637     0.406
4 m_xyz 0.703     0.495

I added the r column, which is the correlation between variables. You can check the covariance matrix above and verify that these are the same values found there. For example, m_xy predicts y from x and has r = .487, which corresponds to the second element of the first row of the covariance matrix (rounded to .49).
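One caveat: for the two-input model m_xyz, r is the multiple correlation rather than a pairwise correlation; it equals the correlation between y and the model's fitted values. A quick check, reusing objects defined above:

# sqrt(R^2) for the combined model is the correlation of y with its fitted values
sqrt(glance(m_xyz)$r.squared)
cor(sim_data$y, fitted(m_xyz))   # same value, about 0.70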

Discussion

There's a lot more to say about these graphs. The mathematical motivation is found in Sheather's book. In the next article I'll unpack the geometry behind the graphs and show why they still don't entirely solve our problem with collinearity among predictors. Still, added-variable graphs are a nice tool to have in the kit.

Tuesday, December 03, 2019

Causal Inference

Introduction

We all develop an intuitive sense of cause and effect in order to navigate the world. It turns out that formalizing exactly what "cause" means is complicated, and I think it's fair to say that there is not complete consensus. A few years ago, I played around with the simplest possible version of cause/effect to produce "Causal Interfaces," a project I'm still working on when I have time. For background I did a fair amount of reading in the philosophy of causality, but that wasn't very helpful. The most interesting work done in the field comes from Judea Pearl, and in particular I recommend his recent The Book of Why, which is a non-technical introduction to the key concepts.

One important question for this area of research is: how much can we say about causality from observational data? That is, without doing an experiment?

Causal Graphs

The primary conceptual aid in both philosophy and in Pearl's work is a diagram with circles and arrows--called a "graph" in math, despite the confusion this can cause. In particular, these are Directed Acyclic Graphs (DAGs), where acyclic means the arrows can't be followed to arrive back at your starting point (no cycles). This encompasses a wide variety of problems we are interested in, including confounding scenarios.

If you want a quick primer on this subject, take a look at Fabian Dablander's "An introduction to causal inference." He does an excellent job at describing a hierarchy of knowledge about systems: look, do, imagine, which we could rephrase as observe, experiment, model. 

Application

In the comments to Fabian's article there's a reference to an R package that calculates generalized (i.e. non-linear) correlation coefficients, which looks interesting but complicated. I found another R package called pcalg, which directly addresses the graph operations that are integral to Pearl's theory.

The image below comes from page 17 of the linked documentation. It's the result of simulating data with known properties, and then asking the algorithm to uncover the properties.

From the pcalg documentation: left is the true model, right is the one discovered by the causal analysis

The diagram shows an example of the algorithm inferring the most likely causal structure when given only observational data. We can see that the connections between variables are recovered correctly, but not all of the arrow directions are discoverable from the data set (e.g. between 2 and 3).
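To give a flavor of the workflow, here is a minimal sketch using pcalg's pc() function. The simulated chain x -> m -> y and the object names are my own, not the documentation's example.

library(pcalg)

# Simulate a simple causal chain: x -> m -> y
set.seed(1)
n <- 1000
x <- rnorm(n)
m <- 0.8 * x + rnorm(n)
y <- 0.8 * m + rnorm(n)
d <- data.frame(x, m, y)

# Run the PC algorithm with a Gaussian conditional-independence test
suff_stat <- list(C = cor(d), n = n)
pc_fit <- pc(suffStat  = suff_stat,
             indepTest = gaussCItest,
             alpha     = 0.01,
             labels    = colnames(d))
pc_fit   # plot(pc_fit) draws the estimated graph if Rgraphviz is installed

On a pure chain like this, the algorithm recovers the skeleton x - m - y but cannot orient the edges, because a chain and a fork imply the same conditional independencies. That is the same limitation visible between nodes 2 and 3 in the figure above.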

This is fascinating stuff, and I look forward to trying it out on, say, retention analysis. 

Monday, December 02, 2019

Finding Latent Mean Scores

Introduction

The use of the ordinal responses as scalar values was on my list of problems with rubric data. I showed how to estimate an underlying "latent" scale for ordinal data here and elaborated on the methods here. These methods cast the response values into a normal (or logistic, etc.) distribution, from which we can get the latent mean. However, by itself this isn't very interesting, because we usually want to compare the means of groups within the data. Do seniors score higher than freshmen?

This post shows how to make such comparisons by combining the latent scale model with a matrix of predictor variables to create a unified model from which to pull means.

For example, if we wanted to compare senior scores to freshmen scores, it wouldn't do to separate the data sets and calculate the mean for each; it's much better to keep them in one data set and use the class status as an explanatory variable within a common model. This idea can then be extended to create hierarchical or mixed-effects models.

Writing Scores

Each fall and spring semester, our Assessment Office asks faculty for evaluations of student writing, if they have had an opportunity to observe it. The details of this data collection and the reliability and validity characteristics we've uncovered so far can be found here, here, and here.

The resulting scores range from 0 = below minimum standards for college work, to 4 = writing at the level we expect of our graduates. Notice that this addresses problem 2 in my earlier post by intentionally pinning the scale to progress. The most stellar freshman is very unlikely to receive a 4.

The output below is based on about 17,000 data points gathered over four years.

A Proportional Odds Model

Our prior research has shown that cumulative GPA is important to understanding the scores, and we hope that time is too, because we want to understand student development. We'll measure time as the number of fall/spring terms a student has been at college (TermNum). Finally, we want to understand the relationship between general academic ability (GPA) and growth, so we'll include an interaction term between grades and time.

As it turns out, the coefficient on the time variable by itself was not statistically distinguishable from zero, so I'll just show the final model here, with GPA and GPA:TermNum as inputs.

One way to induce a latent scale is to use a proportional odds model (here with a probit link), which can be fit in R using the MASS package's polr function.


# Fit the model, keep the fitted object, and print the summary shown below
m_probit <- MASS::polr(as.factor(Rating) ~ GPA + GPA:TermNum, data = dasl, method = "probit")
summary(m_probit)

Model Output

The results look like this.


Call:
MASS::polr(formula = as.factor(Rating) ~ GPA + GPA:TermNum, data = dasl, 
    method = "probit")

Coefficients:
             Value Std. Error t value
GPA         0.7130   0.015964   44.66
GPA:TermNum 0.0788   0.001188   66.31

Intercepts:
    Value   Std. Error t value
0|1  0.9424  0.0488    19.3201
1|2  2.2105  0.0485    45.5799
2|3  3.5963  0.0513    70.0575
3|4  4.7040  0.0542    86.8262

Residual Deviance: 40453.60 
AIC: 40465.60 

We can see from the standard errors that both coefficients are convincingly non-zero. Unlike ordinary least squares, where we usually get a single intercept, here we get four of them: one for each cut-point on the scale. So the 0|1 intercept is the dividing line between ratings of 0 and ratings of 1. The scale's units are z-scores (the x-axis of the standard normal density function).

These z-score cut-points get shifted for each student based on GPA and the GPA-by-TermNum interaction, according to the coefficients for those terms.
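In symbols, using the coefficients above, the model says \( P(\text{Rating} \le k) = \Phi(\zeta_k - \eta) \), where \( \zeta_k \) is the intercept for cut-point \( k \), \( \Phi \) is the standard normal CDF, and \( \eta = 0.713 \cdot \text{GPA} + 0.079 \cdot \text{GPA} \cdot \text{TermNum} \) is the student-specific shift.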

Simulating Output

Once we have the model, we can plug values in to see what the modeled distribution looks like.  The one below is for 4.0-GPA students at term 8, which is normally the last term before graduation.


We can see here that even the best graduating seniors are not all getting 4s in this model. Over half of them do, however, and almost all receive either a 3 or a 4, though there's a small chance of a 2 showing up.

For 3.0 students, the curve is shifted to the left, which we get by plugging in GPA = 3 and TermNum = 8.

There's significantly less chance of a 4, and even a small chance that a 1 will be assigned.
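To make the "plug values in" step concrete, here is a sketch of the calculation behind these two distributions, using the coefficients and cut-points printed in the model output (the helper names are mine; predict(m_probit, newdata, type = "probs") gives the same numbers directly from the fitted object).

# Category probabilities implied by the probit model, computed by hand
eta  <- function(gpa, term) 0.7130 * gpa + 0.0788 * gpa * term  # linear predictor
zeta <- c(0.9424, 2.2105, 3.5963, 4.7040)                       # cut-points 0|1 ... 3|4

rating_probs <- function(gpa, term) {
  cum <- c(pnorm(zeta - eta(gpa, term)), 1)   # P(Rating <= k), k = 0..4
  setNames(diff(c(0, cum)), 0:4)              # P(Rating = k)
}

round(rating_probs(4, 8), 2)  # mostly 3s and 4s, with a small chance of a 2
round(rating_probs(3, 8), 2)  # shifted left; a 1 becomes possible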

Comparing Means

Notice in each of the graphs above that the average latent scale value can be estimated from the picture: find the mode (highest point), and look for the x-value at the bottom. For the GPA 4.0 students, it's something like 3.75, and for the 3.0 GPA students, it's about 3. We can extend this idea to look at a cross-section of GPAs for all eight terms.


Unlike ordinary least squares with the ordinal scores treated as scalar, which would produce straight lines, the results here curve. We can see the ceiling effect of the scale's upper limit on the highest-GPA students, whose trajectory bends away from a straight line.

The effect of the interaction term between grades and time is what causes the "fan" shape here, where the distance between student groups on the right side of the graph is larger than on the left. This suggests that a Matthew Effect is happening with writing. 

Note that most of our students are 3.0 or higher in GPA, so there aren't many students on those lower two trajectories. Still, the results here are significant in thinking about how college students learn academic writing skills.
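As a rough way to reproduce this cross-section graph, here is a sketch that assumes the plotted quantity is the model-implied expected rating, \( \sum_k k \, P(\text{Rating} = k) \); that assumption, and the particular GPA values in the grid, are mine. It reuses the rating_probs() helper sketched above.

library(ggplot2)

# Expected rating as a function of GPA and term
expected_rating <- function(gpa, term) sum(0:4 * rating_probs(gpa, term))

grid <- expand.grid(GPA = c(2.0, 2.5, 3.0, 3.5, 4.0), TermNum = 1:8)  # illustrative GPA values
grid$Expected <- mapply(expected_rating, grid$GPA, grid$TermNum)

ggplot(grid, aes(TermNum, Expected, color = factor(GPA))) +
  geom_line() +
  labs(color = "GPA", y = "Model-implied expected rating")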

Discussion

With easily accessible software tools like R and the MASS package, it's straightforward to create models of ordinal data, check their fit, and make comparisons between groups. The challenge to using these tools is having a conceptual understanding of what's going on (which is the purpose of these articles) and having enough comfort with R (or another package) to do the work efficiently. If you're turned off by the cryptic-looking syntax you've seen, I recommend Hadley Wickham's book R for Data Science as a modern way to use R that takes advantage of accessible language features--after a bit of practice it's easy to read and write.