In the last few articles I explored the meaning of the AUC metric for predictor performance. Let's take it for a test drive with a model of Titanic survival that involves more than passenger class.
Using R's generalized linear model function, glm, we can create a binomial (logistic) predictor of survival with one line:
logistic_model <- glm(Survived ~ Pclass*Sex + Age, family = "binomial", data = pass)
This model assumes that passenger class and sex of the passenger will interact, and that age is important. It's like "women and children first, starting with the wealthy."
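To see which terms the formula `Pclass*Sex + Age` actually asks for, here is a small sketch that expands it with `model.matrix`. The two-row `toy` data frame is made up purely for illustration and is not the passenger data; it assumes `Pclass` is numeric and `Sex` is a factor with levels female and male.

```r
# Hypothetical two-row data frame, only used to expand the formula
toy <- data.frame(
  Pclass = c(1, 3),
  Sex    = factor(c("female", "male")),
  Age    = c(30, 40)
)

# Pclass*Sex is shorthand for Pclass + Sex + Pclass:Sex
design <- model.matrix(~ Pclass * Sex + Age, data = toy)
term_names <- colnames(design)
print(term_names)
```

Each column of the design matrix gets one coefficient in the fitted model, so the interaction shows up as its own term alongside the main effects.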
Here's the coefficient table from the fitted model.
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | 7.7418 | 0.9256 | 8.36 | 0.0000 |
Pclass | -2.3054 | 0.3089 | -7.46 | 0.0000 |
Sexmale | -6.0872 | 0.8900 | -6.84 | 0.0000 |
Age | -0.0355 | 0.0074 | -4.82 | 0.0000 |
Pclass:Sexmale | 1.3975 | 0.3260 | 4.29 | 0.0000 |
This is an old-fashioned way to examine model parameters, by looking at p-values. In that mode of thinking, a p-value near zero is taken as evidence that the coefficient in question isn't really zero. Here every p-value rounds to 0.0000, so by that standard all the coefficients matter. In particular, the interaction between class and sex turned out to be important. You can see that Sexmale has a large negative coefficient, meaning men were at greater risk. To a lesser extent, a higher passenger class number (1, 2, 3) increases risk as well. Putting those two things together, we actually take back a little of the risk for the men as the class number increases, since the interaction Pclass:Sexmale is positive: for men, each step from class 1 to 2 to 3 changes the log-odds of survival by -2.3054 + 1.3975 = -0.9079 instead of -2.3054.
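To make the coefficients concrete, here is a rough sketch that plugs the rounded estimates from the table above into the logistic formula by hand. The helper name `survival_prob` and the 0/1 coding of `male` are my own choices for illustration; in practice `predict(logistic_model, type = "response")` does this for you.

```r
# Estimates copied from the coefficient table (rounded to four decimals)
coefs <- c(intercept = 7.7418, pclass = -2.3054, male = -6.0872,
           age = -0.0355, pclass_male = 1.3975)

# Hypothetical helper: male is 1 for men, 0 for women
survival_prob <- function(pclass, male, age) {
  log_odds <- coefs[["intercept"]] +
    coefs[["pclass"]] * pclass +
    coefs[["male"]] * male +
    coefs[["age"]] * age +
    coefs[["pclass_male"]] * pclass * male
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

p_first_class_girl <- survival_prob(pclass = 1, male = 0, age = 2)
p_third_class_girl <- survival_prob(pclass = 3, male = 0, age = 2)
print(c(p_first_class_girl, p_third_class_girl))
```

With these rounded coefficients, a two-year-old girl in first class comes out at roughly 0.995 and the same girl in third class at roughly 0.68, so class alone moves the modeled odds a great deal.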
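Figure: ROC graph for the predictive model
The shape of the ROC graph shows that we can easily identify about half the survivors using this model: the curve climbs almost straight up to a true positive rate of about 0.5 before bending to the right, so the highest-scoring passengers are nearly all survivors.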
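For readers who want to reproduce a curve like this without extra packages, here is a minimal base-R sketch. The function names `roc_points` and `auc_rank` are hypothetical, the six scores are toy values rather than Titanic data, and the curve function does not treat tied scores specially.

```r
# ROC points: sort by score (highest first) and accumulate hits and misses
roc_points <- function(score, survived) {
  ord <- order(score, decreasing = TRUE)
  hit <- survived[ord] == 1
  data.frame(fpr = cumsum(!hit) / sum(!hit),
             tpr = cumsum(hit) / sum(hit))
}

# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(score, survived) {
  r  <- rank(score)
  n1 <- sum(survived == 1)
  n0 <- sum(survived == 0)
  (sum(r[survived == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 survivors, 3 non-survivors
score    <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)
survived <- c(1, 1, 0, 1, 0, 0)
roc <- roc_points(score, survived)
auc <- auc_rank(score, survived)
print(roc)
print(auc)
```

On the real data, the same functions could be called with `fitted.values(logistic_model)` and `pass$Survived`, and `plot(roc$fpr, roc$tpr, type = "s")` would draw the curve.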
When we only considered passenger class, we made a matrix showing the conditional probabilities of class 1,2,3 given the survival state. In the current model, the predictor has too many unique values to make a table like that, but we can make a graph instead. Imagine turning the table on its side and plotting the two values (survived or not) in different colors.
Figure: The x-axis shows the modeled probability of survival, with blue being actual survivors.
At the far right, the probability of survival per the model is very high, near 100%, and the survivors there (blue) far outnumber the non-survivors (the pink overlap). At the left end of the graph, the predicted probability of survival is near zero, and there's a large bulge of non-survivors there. This separation is a sign of a good predictor. If the blue and pink areas overlapped heavily across the whole range, the predictor wouldn't be distinguishing cases very well.
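One way to draw a comparable picture is to overlay a density curve for each outcome. This is only a sketch with simulated scores (the `rbeta` draws are invented stand-ins, not model output), and it may not match how the original graph was made; on the real data you would use the model's fitted probabilities and `pass$Survived` instead.

```r
# Simulated stand-ins for modeled probabilities and outcomes
set.seed(1)
survived <- rep(c(0, 1), each = 200)
prob <- c(rbeta(200, 2, 5),   # non-survivors: scores pile up near 0
          rbeta(200, 5, 2))   # survivors: scores pile up near 1

# One density estimate per outcome, restricted to the [0, 1] range
d_perished <- density(prob[survived == 0], from = 0, to = 1)
d_survived <- density(prob[survived == 1], from = 0, to = 1)

plot(d_perished, col = "pink3", lwd = 2, xlim = c(0, 1),
     ylim = c(0, max(d_perished$y, d_survived$y)),
     main = "Modeled probability by outcome",
     xlab = "Modeled probability of survival")
lines(d_survived, col = "blue", lwd = 2)
legend("top", legend = c("Perished", "Survived"),
       col = c("pink3", "blue"), lwd = 2)
```

The less the two curves overlap, the better the scores separate the outcomes, which is the same information the AUC condenses into a single number.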
The Story
So far all of this is just rows of data arranged in different ways, but there were real people behind each of them. What if we ask this question: who should have survived, but did not?
In the last graph, we can see that there are a few passengers with high probabilities of survival (far right), who perished. Who are they?
Here's the query.
library(dplyr) # provides %>%, filter, arrange, and select
pass$Probability <- fitted.values(logistic_model) # append the computed probability of survival to the original data
pass %>% # use passenger data with probabilities appended
filter(Survived == 0) %>% # filter to only non-survivors
arrange(desc(Probability)) %>% # sort by probability of survival in descending order
select(Name, Sex, Age, Pclass, Probability) %>% # pick the most useful fields to look at
head(10) # get the first ten of them
And the output.
Name | Sex | Age | Pclass | Probability |
---|---|---|---|---|
Miss. Helen Loraine Allison | female | 2 | 1 | 0.9953459 |
Mrs. Hudson J C (Bessie Waldo Daniels) Allison | female | 25 | 1 | 0.9895226 |
Miss. Ann Elizabeth Isham | female | 50 | 1 | 0.9749029 |
Miss. Henriette Yrois | female | 24 | 2 | 0.9070513 |
Mrs. William (Anna Sylfven) Lahtinen | female | 26 | 2 | 0.9008834 |
Mrs. William John Robert (Dorothy Ann Wonnacott) Turpin | female | 27 | 2 | 0.8976647 |
Miss. Annie Clemmer Funk | female | 38 | 2 | 0.8557757 |
Mrs. Ernest Courtenay (Lilian Hughes) Carter | female | 44 | 2 | 0.8274153 |
Mrs. (Mary) Mack | female | 57 | 2 | 0.7512785 |
Miss. Ellis Anna Maria Andersson | female | 2 | 3 | 0.6801742 |
Notice that the first two names on the list, with the highest probabilities, share the same last name. Since the Titanic is so famous, there are whole sections of the Internet devoted to information about passengers. Here's the heart-breaking story of Bess Allison.
This example vividly shows the uncanny insights that can be obtained through careful data analysis.
Since I called the p-value approach "old-fashioned" earlier, next time I'll describe the new-fashioned version.