Scanning \(X\)
In institutional research we often have dozens or hundreds of potential explanatory variables. For example, using The Freshman Survey to predict first-year attrition gives us over a hundred data points per respondent. I like to scan them quickly using AUC to see which ones are interesting, and I built a tool to do that: you can find Survey Prospector on GitHub, along with links showing how to use it.
For the Titanic data, I'll just loop through the variables, compute the AUC for each, and sort the results (a sketch of this step appears after the table). To do this, the Sex variable needs to be converted to a number, so I turned it into Male, a dummy variable (1 = male, 0 = female), and added an interaction PclassM = passenger class times male status.
| Variable | AUC |
|---|---|
| Survived | 1.00 |
| PclassM | 0.80 |
| Male | 0.77 |
| Fare | 0.69 |
| Pclass | 0.68 |
| Parents/Children Aboard | 0.56 |
| Siblings/Spouses Aboard | 0.54 |
| Age | 0.52 |
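Here's a minimal sketch of that scan, assuming the Titanic data is loaded into a data frame called pass (the same one used in the R Code section below) and using the pROC package, which the original analysis doesn't show, to compute each AUC:

library(tidyverse)
library(pROC)
# build the numeric predictors, then compute one AUC per column
scan <- pass %>%
  mutate(Male = (Sex == "male") + 0,
         PclassM = Pclass * Male) %>%
  select(Survived, PclassM, Male, Fare, Pclass,
         `Parents/Children Aboard`, `Siblings/Spouses Aboard`, Age)
# pROC's default direction = "auto" handles reversed scales, so the
# AUCs generally come out at or above 0.5, as in the table
aucs <- sapply(names(scan), function(v)
  as.numeric(auc(scan$Survived, scan[[v]], quiet = TRUE)))
sort(round(aucs, 2), decreasing = TRUE)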
This table is helpful in orienting us to the useful inputs, but the approach has two blind spots:
- The AUC is agnostic about the shape of the relationship between the explanatory and predicted variables; it's not a measure of model quality. We don't know if the relationship is linear, quadratic, etc. All we know is that it's an ordered relationship: when AUC > .5, higher values of the predictor (lower, if the scale is reversed) mean higher probabilities of survival on average, measured in the sense already described, where we have to distinguish between those two cases.
- The list of AUCs doesn't tell us anything about the correlational structure of the predictors. There's a tab in Survey Prospector for that, or we can run a correlation table to help. I'll skip that step here.
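If you do want a quick look, one line does it (a sketch; x here is the predictor matrix built in the R Code section below):

# pairwise correlations among the predictors, rounded for readability
round(cor(x), 2)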
Model Selection
We have a sense of which variables are important. If the list were really long, we could now weed out the ones that aren't likely to be useful--at least for most purposes in higher education. It's possible that a variable with a low AUC happens to have an important interaction with some other variable, but I haven't come across a case like that in practice.
This leaves us with the problem of which of the "good" inputs to include in the model. Step-wise regression was supposed to solve this, but it turns out to have problems of its own. With seven inputs, there are \(2^7 = 128\) possible models (counting the intercept-only one). One method is just to check all of them and pick the best by some model-selection metric (a sketch appears below). This is perilous, because we'll probably over-fit the data.
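For illustration only, since the point is that this is risky, here's a sketch of that exhaustive check, assuming the x matrix and y vector from the R Code section below and using AIC as the metric (my choice for the example, not necessarily the right one):

# enumerate all non-empty subsets of the seven predictors and fit a
# logistic regression for each (127 models; the intercept-only model
# makes 128)
df <- as.data.frame(x)
df$Survived <- y
vars <- setdiff(names(df), "Survived")
subsets <- unlist(lapply(seq_along(vars),
                         function(k) combn(vars, k, simplify = FALSE)),
                  recursive = FALSE)
aics <- sapply(subsets, function(s)
  AIC(glm(reformulate(s, response = "Survived"),
          data = df, family = binomial)))
subsets[[which.min(aics)]]   # "best" subset by AIC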
The approach I'll demonstrate here uses a penalization term on the regression equation. It's similar to Lagrange multipliers, if you're familiar with that optimization technique. Every coefficient of \(X\) has to earn its keep, enforced by adding an artificial "error term" that is a multiple of the coefficients' magnitudes. To make the total error decrease, the optimizer tries to keep the coefficients small (starting at zero). There are variations, including LASSO and ridge regression, which differ only in how the extra error term is calculated; I'll show off LASSO here.
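In symbols (my notation, not from the original), the LASSO version of the logistic regression objective that glmnet minimizes looks roughly like

\[
\min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right)\right] \;+\; \lambda \sum_{j=1}^{p} |\beta_j|,
\]

where the last term is the artificial "error term": each non-zero \(\beta_j\) adds to the total error, scaled by \(\lambda\). Ridge regression replaces \(|\beta_j|\) with \(\beta_j^2\).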
Model coefficients as the number of variables increases (read right to left)
The corresponding graph of variable coefficients tells us which variables are important for each model we could choose from the menu. With four degrees of freedom (one of which is the regression constant), at log(lambda) of about -4, it looks like the variables included are Male, Pclass, and Sibs, but we can retrieve the coefficients (see the R code below) and check exactly which are non-zero.
We can leave off Fare, since its coefficient is so close to zero. Age is also not very important.
The model's ROC curve is plotted below.
ROC curve for the LASSO model with log(lambda) = -4
R Code
library(tidyverse)
library(glmnet)
library(plotmo)
x <- pass %>%
  select(Pclass, Sex, Age, Fare,
         Sibs = `Siblings/Spouses Aboard`,
         Child = `Parents/Children Aboard`) %>%
  mutate(Male = (Sex == "male") + 0,  # dummy code Sex: 1 = male, 0 = female
         PclassM = Pclass * Male) %>% # interaction: class x male
  select(-Sex) %>%                    # drop the original character column
  as.matrix()
y <- pass$Survived
# glmnet(x,y) has inputs in matrix x and output in y
my_fit <- glmnet(x, y, alpha=1, family="binomial")
# graph it
plot_glmnet(my_fit, xvar="lambda", label = 10)
# print the coefficients
coef(my_fit, s = exp(-4))
# cross validation to choose lambda (use the binomial family to match the fit)
plot(cv.glmnet(x, y, alpha=1, family="binomial"))
# predictions on the link (log-odds) scale at the chosen lambda
p <- predict(my_fit, x, s = exp(-4))
# invert the logit to get probabilities (equivalently, use type = "response")
p <- 1/(1 + exp(-p))
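The ROC curve plotted above isn't produced by the code shown here; one way to do it is with the pROC package (an assumption on my part):

# ROC curve and AUC for the fitted probabilities
library(pROC)
roc_fit <- roc(y, as.vector(p), quiet = TRUE)
plot(roc_fit)
auc(roc_fit)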