At www.tylervigen.com you can find graphs of pairs of variables that are correlated over time but aren't plausibly causally related, for example the divorce rate in Maine versus margarine consumption. On his blog, David R. MacIver argues that coincidences like these are inevitable in large data sets. He's right, but there's a more fundamental problem with "correlation doesn't imply causation."
Causality is widely discussed by researchers, and Judea Pearl has given a historical perspective on it. Correlation is a statistic computed from paired data samples that assesses how close to linear the relationship between them is.
Causation is one-directional. If A causes B, we don't normally assume that B causes A too. The latter implication doesn't make sense because we insist on A preceding B. Correlation, however, is symmetrical--it can't distinguish between these two cases. A causing B and B causing A give the same numerical answer. In fact, we can think of the correlation coefficient as an average causal index over A => B and B => A [1, pp. 15-16].
What we should really say is that "implication doesn't imply causation," meaning that if our data supports A => B, this doesn't necessarily mean that A causes B. If we observe people often putting on socks and then shoes (Socks => Shoes), it doesn't mean that putting on socks causes putting on shoes. The unknown causes ?? => Socks and ?? => Shoes may be related somehow, or it may just be a coincidence. (We can mostly rule out coincidence with experimentation.)
Everyone knows that even if A and B are highly correlated, that doesn't necessarily identify a causal relationship between the two, but it's even worse than that. A and B can have a correlation close to zero, and A can still cause B. So correlation doesn't work in either direction.
Example: Suppose that S1 and S2 control a light bulb L, and are wired in parallel, so that closing either switch causes the light to be on. An experimenter who is unaware of S2 is randomly flipping S1 to see what happens. Unfortunately for her, S2 is closed 99% of the time, so that L is almost always on. During the remaining 1%, S1 perfectly controls L as an on/off interface. The correct conclusion is that closing S1 causes L to be on, but the correlation between the two is small. By contrast, the implication [S1 closed => L is on] is always true. Note that this is different from [S1 open => L is off]. The combination of the two is called an interface in [1], and methods are given to generate separate coefficients of causality.
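To put rough numbers on the switch example, here is a minimal Python sketch. The trial table is an assumption made for illustration: 200 made-up trials, half with S1 open and half with S1 closed, and S2 open in exactly one trial out of every hundred. The helper names are hypothetical, and this is not the method from [1]; it only computes an ordinary Pearson correlation and checks the two implications directly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Assumed trial table: 100 trials with S1 open (0), 100 with S1 closed (1).
# Within each group S2 is open (0) in exactly one trial, i.e. closed 99% of the time.
trials = [(s1, 0 if i == 0 else 1) for s1 in (0, 1) for i in range(100)]
s1_vals = [s1 for s1, _ in trials]

# Parallel wiring: the light is on if either switch is closed.
light = [int(s1 or s2) for s1, s2 in trials]

r = pearson(s1_vals, light)

# [S1 closed => L is on]: check every trial where S1 is closed.
closed_implies_on = all(l == 1 for s, l in zip(s1_vals, light) if s == 1)

# [S1 open => L is off]: fraction of S1-open trials where the light is off.
open_rows = [l for s, l in zip(s1_vals, light) if s == 0]
open_implies_off_rate = open_rows.count(0) / len(open_rows)

print(f"correlation(S1, L)        = {r:.3f}")
print(f"[S1 closed => L on] holds = {closed_implies_on}")
print(f"[S1 open => L off] rate   = {open_implies_off_rate:.2f}")
```

With this table the correlation comes out at roughly 0.07, even though [S1 closed => L is on] holds in every trial; the converse [S1 open => L is off] holds in only 1% of the S1-open trials, which is what drags the correlation down.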
This masking is very common. Your snow tires may work really well on snow, but if you live in Florida, you're not going to see much evidence of it. Because correlation is blind to the difference between [A => B] and [~A => ~B], it is an average indicator over the whole interface. It's heavily weighted by the conclusion that ~A does not imply ~B, and therefore the statistic doesn't accurately signal a causal connection.
One last problem with correlation I'll mention: it's not transitive the way we want causality to be. If A causes B and B causes C, we'd like to be able to reach some conclusion about A indirectly causing C. It's easy to produce examples of A and B having positive correlation and the same with B and C, but A and C have zero correlation.
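One such example can be sketched in a few lines of Python. The data here are made up for illustration: A and C each take the values -1 and 1 in every combination, and B is defined as their sum. This only illustrates the failure of transitivity for correlation; it is not meant as a model of a causal chain.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Assumed toy data: all four combinations of A and C, so they are uncorrelated.
A = [-1, -1, 1, 1]
C = [-1, 1, -1, 1]
# B depends on both A and C.
B = [a + c for a, c in zip(A, C)]

r_ab = pearson(A, B)
r_bc = pearson(B, C)
r_ac = pearson(A, C)

print(f"corr(A, B) = {r_ab:.3f}")
print(f"corr(B, C) = {r_bc:.3f}")
print(f"corr(A, C) = {r_ac:.3f}")
```

Here A and B are positively correlated (about 0.71), as are B and C, while the correlation between A and C is exactly zero.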
Tomorrow I'll resume the "A Cynical Argument for the Liberal Arts" series with part seven.
[1] Eubanks, D.A. "Causal Interfaces," arXiv:1404.4884 [cs.AI]