Tuesday, November 05, 2019

Stories from Data


In the last few articles I explored the meaning of the AUC metric for predictor performance. Let's take it for a test drive with a model of Titanic survival that involves more than passenger class.

Using R's generalized linear model function, glm, we can create a binomial predictor of survival with one line:

logistic_model <- glm(Survived ~ Pclass*Sex + Age, family = "binomial", data = pass)

This model assumes that passenger class and sex of the passenger will interact, and that age is important. It's like "women and children first, starting with the wealthy."
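For readers less used to R's formula syntax, Pclass*Sex is shorthand for both main effects plus their interaction. Here is a quick check using only base R; the variable names model_formula and term_labels are just for illustration.

```r
# Expand the model formula to see which terms R will actually fit.
model_formula <- Survived ~ Pclass*Sex + Age
term_labels <- attr(terms(model_formula), "term.labels")
print(term_labels)  # main effects plus the Pclass:Sex interaction
```

Because Sex is categorical, with female as the reference level, the interaction term shows up in the fitted model as Pclass:Sexmale.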

Here's the output.

Term            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       7.7418      0.9256     8.36    0.0000
Pclass           -2.3054      0.3089    -7.46    0.0000
Sexmale          -6.0872      0.8900    -6.84    0.0000
Age              -0.0355      0.0074    -4.82    0.0000
Pclass:Sexmale    1.3975      0.3260     4.29    0.0000

This is an old-fashioned way to examine model parameters, by looking at p-values. In that mode of thinking, a p-value near zero means that the actual coefficient in question probably isn't zero. In this case, it says that all the coefficients are non-zero. In particular, the interaction between class and sex turned out to be important. You can see that Sexmale has a large negative coefficient, meaning men were at greater risk. To a lesser extent, moving down through passenger class 1, 2, 3 increases risk as well. Putting those two things together, we actually take back a little of that added risk for the men as the passenger class number increases, since the interaction Pclass:Sexmale is positive.
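To make the coefficients concrete, here is a small sketch that plugs them into the logistic formula by hand. The coefficient values are copied from the summary table above (so they are rounded), and survival_prob is a hypothetical helper name, not part of the original analysis.

```r
# Coefficients copied from the summary table above (rounded to four decimals).
coefs <- c(intercept = 7.7418, pclass = -2.3054, male = -6.0872,
           age = -0.0355, pclass_male = 1.3975)

# Hypothetical helper: modeled survival probability for one passenger.
# male is 1 for men and 0 for women.
survival_prob <- function(pclass, male, age) {
  log_odds <- coefs[["intercept"]] +
    coefs[["pclass"]] * pclass +
    coefs[["male"]] * male +
    coefs[["age"]] * age +
    coefs[["pclass_male"]] * pclass * male
  plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
}

# Change in log-odds for each step from class 1 to 2 to 3, by sex.
slope_women <- coefs[["pclass"]]
slope_men <- coefs[["pclass"]] + coefs[["pclass_male"]]

survival_prob(pclass = 1, male = 0, age = 25)  # first-class woman, age 25
survival_prob(pclass = 3, male = 1, age = 25)  # third-class man, age 25
```

With these rounded values, a 25-year-old woman in first class comes out at about 0.99 and a 25-year-old man in third class at roughly 0.12. The per-class slope for men (about -0.91) is much flatter than the one for women (about -2.31), which is the "taking back" effect of the positive interaction.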


ROC graph for the predictive model


The shape of the ROC graph shows that we can easily pick out about half the survivors using this model with very few false alarms, since the curve zooms up to a true positive rate of about .5 before turning to the right.
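For anyone who wants to see how such a curve is built, here is a minimal base R sketch. It uses simulated scores rather than the Titanic data, and roc_points and auc_rank are hypothetical helper names. The AUC is computed with the rank (Mann-Whitney) formula, which is one of several equivalent ways to get it.

```r
set.seed(1)  # synthetic data only, not the Titanic passengers

# Simulate a binary outcome and a score that is informative but noisy.
n <- 500
outcome <- rbinom(n, size = 1, prob = 0.4)
score <- plogis(rnorm(n, mean = ifelse(outcome == 1, 1, -1), sd = 1.5))

# Hypothetical helper: ROC points from sweeping a threshold over the scores.
roc_points <- function(score, outcome) {
  ord <- order(score, decreasing = TRUE)
  hits <- outcome[ord]
  data.frame(
    fpr = c(0, cumsum(hits == 0) / sum(hits == 0)),
    tpr = c(0, cumsum(hits == 1) / sum(hits == 1))
  )
}

# AUC via the rank formula: the chance that a random positive case
# gets a higher score than a random negative case.
auc_rank <- function(score, outcome) {
  n_pos <- sum(outcome == 1)
  n_neg <- sum(outcome == 0)
  (sum(rank(score)[outcome == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

roc <- roc_points(score, outcome)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a predictor no better than chance
auc <- auc_rank(score, outcome)
```

A curve that climbs steeply along the left edge before bending right, as described above, means the highest-scoring cases are almost all true positives.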

When we only considered passenger class, we made a matrix showing the conditional probabilities of class 1,2,3 given the survival state. In the current model, the predictor has too many unique values to make a table like that, but we can make a graph instead. Imagine turning the table on its side and plotting the two values (survived or not) in different colors.

The x-axis shows the modeled probability of survival, with blue being actual survivors.


At the far right, the probability of survival per the model is very high, near 100%, and actual survivors (blue) greatly outnumber non-survivors (pink overlap) there. At the left end of the graph, the predicted probability of survival is near zero, and there's a large bulge of non-survivors there. This separation is a sign of a good predictor. If the blue and pink areas overlapped heavily, the predictor wouldn't be distinguishing cases very well.
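A graph of this kind can be sketched in base R with two overlaid histograms. The example below is an assumption-laden stand-in: it fits the same model formula to simulated passengers (the data frame fake and its made-up "true" coefficients are invented for illustration), and the original graph may well have been drawn with a different plotting tool.

```r
set.seed(2)  # synthetic stand-in data, not the Titanic passengers

# Fake passengers with the same column names the model formula expects.
n <- 400
fake <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 70))
)
# Made-up coefficients used only to generate plausible outcomes.
true_log_odds <- 4 - 1 * fake$Pclass - 2.5 * (fake$Sex == "male") - 0.02 * fake$Age
fake$Survived <- rbinom(n, size = 1, prob = plogis(true_log_odds))

fit <- glm(Survived ~ Pclass*Sex + Age, family = "binomial", data = fake)
fake$Probability <- fitted.values(fit)

# Overlaid histograms of modeled probability, split by actual outcome.
breaks <- seq(0, 1, by = 0.05)
hist(fake$Probability[fake$Survived == 0], breaks = breaks,
     col = rgb(1, 0.6, 0.7, 0.6), xlab = "Modeled probability of survival",
     main = "Non-survivors (pink) vs. survivors (blue)")
hist(fake$Probability[fake$Survived == 1], breaks = breaks,
     col = rgb(0.3, 0.5, 1, 0.6), add = TRUE)
```

The semi-transparent fill colors make the overlap region visible, which is the part of the picture that tells you how often the model confuses the two groups.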

The Story

So far all of this is just rows of data arranged in different ways, but there were real people behind each of them. What if we ask this question: who should have survived, but did not?

In the last graph, we can see that there are a few passengers with high probabilities of survival (far right), who perished. Who are they?

Here's the query.

library(dplyr)                                    # provides %>%, filter, arrange, and select
pass$Probability <- fitted.values(logistic_model) # append the computed probability of survival to the original data
pass %>%                                          # use passenger data with probabilities appended
  filter(Survived == 0) %>%                       # filter to only non-survivors
  arrange(desc(Probability)) %>%                  # sort by probability of survival in descending order
  select(Name, Sex, Age, Pclass, Probability) %>% # pick the most useful fields to look at
  head(10)                                        # get the first ten of them

And the output.

Name                                                     Sex     Age  Pclass  Probability
Miss. Helen Loraine Allison                              female    2       1  0.9953459
Mrs. Hudson J C (Bessie Waldo Daniels) Allison           female   25       1  0.9895226
Miss. Ann Elizabeth Isham                                female   50       1  0.9749029
Miss. Henriette Yrois                                    female   24       2  0.9070513
Mrs. William (Anna Sylfven) Lahtinen                     female   26       2  0.9008834
Mrs. William John Robert (Dorothy Ann Wonnacott) Turpin  female   27       2  0.8976647
Miss. Annie Clemmer Funk                                 female   38       2  0.8557757
Mrs. Ernest Courtenay (Lilian Hughes) Carter             female   44       2  0.8274153
Mrs. (Mary) Mack                                         female   57       2  0.7512785
Miss. Ellis Anna Maria Andersson                         female    2       3  0.6801742

Notice that the first two names on the list, with the highest probabilities, share a last name. Since the Titanic is so famous, there are whole sections of the Internet devoted to information about passengers. Here's the heart-breaking story of Bess Allison.

This example vividly shows the uncanny insights that can be obtained through careful data analysis.

Since I called the p-value approach "old-fashioned" earlier, next time I'll describe the new-fashioned version.
