## Thursday, May 15, 2014

### Why "Correlation doesn't imply Causation" isn't very sophisticated

At www.tylervigen.com you can find graphs of two variables that are correlated over time, but aren't plausibly causal. For example, the divorce rate in Maine versus margarine consumption. On his blog, David R. MacIver argues that coincidences like these are inevitable in large data sets. He's right, but there's a more fundamental problem with "correlation doesn't imply causation."

Causality is widely discussed among researchers, and Judea Pearl gives a historical perspective here. Correlation, by contrast, is a statistic computed from paired data samples that assesses how linear the relationship between the two variables is.

Causation is one-directional. If A causes B, we don't normally assume that B causes A too. The latter implication doesn't make sense because we insist on A preceding B. Correlation, however, is symmetrical--it can't distinguish between these two cases. A causing B and B causing A give the same numerical answer. In fact, we can think of the correlation coefficient as an average causal index over A => B and B => A [1, pp. 15-16].
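To make the symmetry concrete, here's a minimal sketch in Python (the data values are made up for illustration): the Pearson coefficient computed from A to B is exactly the same number as from B to A, so the statistic carries no information about direction.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative made-up data: B is roughly 2*A plus a little noise.
A = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [2.1, 3.9, 6.2, 8.0, 9.9]

print(pearson(A, B) == pearson(B, A))  # True: the formula is symmetric in xs and ys
```

Swapping the arguments only swaps the factors in each product, so the result is identical whichever variable we imagine as the cause.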

What we should really say is that "implication doesn't imply causation," meaning that if our data supports A => B, this doesn't necessarily mean that A causes B. If we observe people often putting on socks and then shoes (Socks => Shoes), that doesn't mean the relationship is causal. The causes ?? => socks and ?? => shoes may be related somehow, or it may just be a coincidence. (We can mostly rule out coincidence with experimentation.)

Everyone knows that a high correlation between A and B doesn't necessarily identify a causal relationship between the two, but it's even worse than that. A and B can have a correlation close to zero, and A can still cause B. So correlation fails in both directions.

Example: Suppose that S1 and S2 control a light bulb L, and are wired in parallel, so that closing either switch causes the light to be on. An experimenter who is unaware of S2 is randomly flipping S1 to see what happens. Unfortunately for her, S2 is closed 99% of the time, so that L is almost always on. During the remaining 1%, S1 perfectly controls L as an on/off interface. The correct conclusion is that closing S1 causes L to be on, but the correlation between the two is small. By contrast, the implication [S1 closed => L is on] is always true. Note that this is different from [S1 open => L is off]. The combination of the two is called an interface in [1], and methods are given to generate separate coefficients of causality.
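The switch example is easy to simulate. Here's a rough sketch in Python, assuming the wiring described above (L is on iff S1 or S2 is closed) and the stated probabilities: the correlation between S1 and L comes out small, yet the implication [S1 closed => L is on] never fails.

```python
import random

random.seed(0)
n = 100_000

s1 = [random.random() < 0.5 for _ in range(n)]    # experimenter flips S1 at random
s2 = [random.random() < 0.99 for _ in range(n)]   # hidden switch, closed 99% of the time
light = [a or b for a, b in zip(s1, s2)]          # parallel wiring: either switch lights L

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson(s1, light))                       # small, despite the real causal link
print(all(l for s, l in zip(s1, light) if s))   # True: S1 closed => L on, every time
```

The masking by S2 drives the correlation down toward zero even though S1's causal control of L is perfect whenever S2 happens to be open.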

This masking is very common. Your snow tires may work really well on snow, but if you live in Florida, you're not going to see much evidence of it. Because correlation is blind to the difference between [A => B] and [~A => ~B], it is an average indicator over the whole interface. Here it's heavily weighted by the fact that ~A does not imply ~B, so the statistic doesn't accurately signal the causal connection.

One last problem with correlation I'll mention: it's not transitive the way we want causality to be. If A causes B and B causes C, we'd like to be able to reach some conclusion about A indirectly causing C. But it's easy to produce examples where A and B have positive correlation, and so do B and C, yet A and C have zero correlation.
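One concrete construction (made-up data, not from the post): let B be the sum of two unrelated sequences A and C. Then B is positively correlated with each of its parts, while A and C have correlation exactly zero.

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two sequences chosen so their sample correlation is exactly zero,
# with B defined as their sum.
A = [1, -1, 1, -1]
C = [1, 1, -1, -1]
B = [a + c for a, c in zip(A, C)]

print(pearson(A, B))  # positive
print(pearson(B, C))  # positive
print(pearson(A, C))  # exactly 0.0
```

So even a chain of solid positive correlations licenses no conclusion at the two ends, which is exactly the kind of transitivity we'd want a causal indicator to have.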

Tomorrow I'll resume the "A Cynical Argument for the Liberal Arts" series with part seven.

[1] Eubanks, D.A. "Causal Interfaces," arXiv:1404.4884 [cs.AI]