Thursday, November 07, 2019

Variable Selection for Predictive Models

Not so long ago, if you had a list of potential explanatory variables \(X\) to use to predict some output \(Y\), you'd ask the software to start somewhere and iteratively build regression models to test combinations of variables. Often, the coefficients of \(X\) that had small p-values were kept and the others discarded. This step-wise approach has detractors, and there are better methods available now, recommended by people who should know. In particular, I recommend Regression Modeling Strategies by Frank Harrell. In this post I'll demonstrate one of the recommended replacements for step-wise variable selection.

Scanning \(X\)

In institutional research we often have dozens or hundreds of potential explanatory variables. For example, using The Freshman Survey to predict first-year attrition gives us over a hundred data points per responder. I like to scan them quickly using AUC to see which ones are interesting, and I built a tool to do that. You can find Survey Prospector on GitHub here, with links showing how to use it.

For the Titanic data, I'll just loop through the variables and compute the AUC of each, sorted in descending order. To do this, the Sex variable needs to be converted to a number, so I turned it into Male, a dummy variable (1 = male, 0 = female), and added an interaction PclassM = passenger class times male status.
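Here's a minimal sketch of that scan, assuming the pROC package for its auc() helper and assuming pass already holds the Titanic passenger data (the same data frame used in the R code at the end of the post):

library(tidyverse)
library(pROC)   # assumed here just for auc()

# keep the numeric columns, adding the dummy and the interaction
scan_df <- pass %>% 
  mutate(Male = (Sex == "male") + 0,
         PclassM = Pclass * Male) %>% 
  select_if(is.numeric)

# AUC of each column as a predictor of Survived, largest first
aucs <- sapply(names(scan_df), function(v) 
  as.numeric(auc(scan_df$Survived, scan_df[[v]])))
sort(round(aucs, 2), decreasing = TRUE)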

Variable                    AUC
Survived                    1.00
PclassM                     0.80
Male                        0.77
Fare                        0.69
Pclass                      0.68
Parents/Children Aboard     0.56
Siblings/Spouses Aboard     0.54
Age                         0.52

The first entry just shows that the output variable Survived can predict itself perfectly. Surprisingly, Age isn't helpful at all, with an AUC barely above the guessing rate of .50.

This table is helpful in orienting us to which are the useful inputs, but this approach has two blind spots:
  • The AUC is agnostic about the shape of the relationship between the explanatory and predicted variables; it's not a measure of model quality. We don't know whether the relationship is linear, quadratic, etc. All we know is that it's an ordered relationship: when AUC > .5, higher values of the predictor (lower, if the scale is reversed) mean higher probabilities of survival on average.
  • The list of AUCs doesn't tell us anything about the correlational structure of the predictors. There's a tab in Survey Prospector for that, or we can run a correlation table to help; a one-line version follows this list.
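Given the scan_df built in the sketch above, the correlation table is one line:

# pairwise correlations among the candidate predictors
round(cor(scan_df, use = "pairwise.complete.obs"), 2)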

Model Selection

We now have a sense of which variables are important. If the list were really long, we could weed out the ones that aren't likely to be useful, at least for most purposes in higher education. It's possible that a variable with a low AUC has an important interaction with some other variable, but I haven't come across a case like that in practice.

This leaves us with the problem of deciding which of the "good" inputs to include in the model. Step-wise regression was supposed to solve this, but it has well-known problems. With seven inputs, there are \(2^7 = 128\) possible models. One method is just to check all of them and pick the best by some metric. This is perilous, because we'll probably over-fit the data.
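To make the combinatorial point concrete, here is a sketch of that brute-force search, assuming a data frame model_df built from pass with the seven predictors renamed as below; it uses AIC purely as an illustrative metric, and the winner is exactly the kind of over-fit pick the next approach avoids:

# same transformation as the glmnet code at the end of the post
model_df <- pass %>% 
  mutate(Male = (Sex == "male") + 0,
         PclassM = Pclass * Male) %>% 
  select(Survived, Pclass, Male, PclassM, Age, Fare,
         Sibs = `Siblings/Spouses Aboard`, Child = `Parents/Children Aboard`)

vars <- setdiff(names(model_df), "Survived")

# all 2^7 - 1 non-empty subsets of the predictors (the 128th is the empty model)
subsets <- unlist(lapply(seq_along(vars), function(k) 
  combn(vars, k, simplify = FALSE)), recursive = FALSE)

# fit a logistic regression for each subset and record its AIC
aics <- sapply(subsets, function(v) 
  AIC(glm(reformulate(v, "Survived"), data = model_df, family = binomial)))

subsets[[which.min(aics)]]   # the "best" subset by AIC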

The approach I'll demonstrate here uses a penalization term on the regression equation. It's similar to Lagrange multipliers, if you're familiar with that optimization technique. Every coefficient of \(X\) has to earn its keep, enforced by an artificial "error term" that adds in a multiple of the coefficient magnitudes. To make the total error decrease, the optimizer tries to keep the coefficients small (starting at zero). There are variations, including LASSO and ridge regression; they differ only in how the extra error term is calculated. I'll show off LASSO here.
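In symbols (a sketch, writing \(\ell(\beta)\) for the usual logistic log-likelihood and \(\lambda\) for the penalty weight), LASSO minimizes

\[ -\,\ell(\beta) \;+\; \lambda \sum_j |\beta_j|, \]

while ridge regression uses \(\lambda \sum_j \beta_j^2\) instead. At \(\lambda = 0\) this is ordinary logistic regression; as \(\lambda\) grows, coefficients are pushed toward zero, and the absolute-value penalty can set them to exactly zero, which is what does the variable selection.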

Model error as number of variables increases (read right to left)



Reading the graph above from right to left shows the effect of adding more variables to the model. The idea is to find a good trade-off between the number of variables (fewer is better) and the error (smaller is better). The dashed line at 4 variables, with log(Lambda) close to -4, is a good compromise. The lambda is the penalty term that adds the extra error to be optimized away. It decreases as we move to the left, until we end up at an ordinary regression with no extra error.
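Rather than eyeballing the dashed lines, cv.glmnet reports the two conventional choices directly. A small sketch, using the same x, y, and settings as the R code below:

cv_fit <- cv.glmnet(x, y, alpha = 1, family = "binomial")
log(cv_fit$lambda.min)   # lambda with the minimum cross-validated error
log(cv_fit$lambda.1se)   # largest lambda within one standard error of that minimum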
Model coefficients as number of variables increases (read right to left)


The corresponding graph of variable coefficients tells us which variables are important for each model we could choose from the menu. With four inputs (degrees of freedom, one of which is the regression constant), at log(Lambda) of about -4, it looks like the variables included are Male, Pclass, and Sibs, but we can retrieve the coefficients and see exactly:


We can leave off Fare, since its coefficient is so close to zero. Age is also not very important.

The model's ROC curve is plotted below.

ROC curve for the LASSO model with log(lambda) = -4

R Code


library(tidyverse)
library(glmnet)
library(plotmo)
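
# The Titanic data is assumed to already be loaded into a data frame called pass,
# e.g. pass <- read_csv("titanic.csv")   (the file name here is illustrative)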

x <- pass %>% 
  select(Pclass, Sex, Age, Fare,
         Sibs = `Siblings/Spouses Aboard`, Child = `Parents/Children Aboard`) %>% 
  mutate(Male = (Sex == "male") + 0,
         PclassM = Pclass * Male) %>% 
  select(-Sex) %>% 
  as.matrix()

y <- pass$Survived

# glmnet(x,y) has inputs in matrix x and output in y
my_fit <- glmnet(x, y, alpha=1, family="binomial")

# graph it
plot_glmnet(my_fit, xvar="lambda", label = 10)

# print the coefficients
coef(my_fit, s = exp(-4))

# cross validation (using the same binomial family as the fit above)
plot(cv.glmnet(x, y, alpha=1, family="binomial"))

# predictions on the link (log-odds) scale for the chosen lambda
p <- predict(my_fit, x, s = exp(-4))

# we have to do the inverse transform for the logit
p <- 1/(1 + exp(-p)) 
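
The code above stops at the predicted probabilities. A short addition, assuming the pROC package, to draw the ROC curve shown earlier and report its AUC:

library(pROC)

# ROC curve of the LASSO predictions against the observed outcomes
roc_obj <- roc(y, as.numeric(p))
plot(roc_obj)
auc(roc_obj)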

