There is a viral website being passed around. Harvard Law student Tyler Vigen made an online tool that lets you choose a time series and then choose another time series that is highly correlated with it. This has inspired many people to make graphs of time-varying variables that have little to do with each other but are still highly correlated.
For example, the amount of money spent on pets and the number of people who died by falling down the stairs may have little to do with each other, yet there is an almost perfect correlation of r=0.97.
The media has covered the popularity of Vigen’s website through a variety of angles. Most of them suggest that we shouldn’t be too quick to trust graphs and that, once again, correlation is not causation.
I agree fully that we should be critical of charts and graphs, that we should be skeptical of their assumptions, and that we should look into their hidden meanings. There’s nothing wrong with that sentiment of skepticism.
I also agree that, on the most basic and technical level, correlation is not causation. What I think is less commonly understood is the fact that correlation is actually evidence for causation, since causation correlates with correlation.
But there are deeper and more nuanced truths about these types of graphs that people tend to gloss over. They have to do with the fact that these data are time series. Time series data have certain unique properties. Correlating time series is not the same as correlating non-time-varying data.
Why do we find spurious correlations?
The first and obvious thing to note is that time series can be misleading because they appear to be authoritative in their sample size. The graph above shows data from 2000 to 2009, which seems like a long time, but is actually only 10 data points. A correlation merely describes how closely two variables move together in their deviations from their respective means; it doesn’t begin to tell us how much confidence to put in whether the association really exists.
A better way to do this would be to use a statistical test for your correlation value or to run a regression. If your regression coefficient is statistically significant, you might be onto something.
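As a rough illustration of what testing a correlation can look like with only ten points, here is a minimal sketch of a permutation test using only the Python standard library. The two series are invented for illustration and are not Vigen’s data, and the function names are my own. Note that shuffling assumes the observations are exchangeable, which is exactly the assumption that time series tend to violate.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point ties count as "at least as large".
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical ten-point series, invented purely for illustration.
x = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 8.8, 10.3]
y = [2.3, 2.0, 3.9, 3.6, 5.5, 5.2, 6.1, 8.4, 7.9, 9.6]
print("r =", round(pearson_r(x, y), 3), "p =", permutation_p_value(x, y))
```

A small p-value here only says the association is unlikely under random shuffling; it says nothing about whether the shuffling assumption was reasonable in the first place.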
But with time series data, that could still be wrong.
You see, time series are special; realizations of a variable are not independent and can evolve according to a specific process across time. There’s a very peculiar process that is known to generate spurious correlations.
It’s called a random walk. It’s literally like walking around in random directions over time.
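The most basic form of this data generating process can be described as follows:
$x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is a draw from a normal distribution with mean zero.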
In other words, we can make a random walk by choosing some initial value of x. The next period’s value will be the current period’s value added to a realization of a value from a normal distribution centered at zero. I repeat this to get my entire series. Since the values from the normal distribution are random, I can’t predict which way I’m going to go; hence, the random walk.
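The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the original post; the function name `random_walk` and its parameters are my own choices, and the default disturbance is a standard normal.

```python
import random


def random_walk(n_periods, x0=0.0, sigma=1.0, seed=None):
    """Return n_periods values, each equal to the previous value plus normal noise."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n_periods - 1):
        # Next value = current value + a draw from N(0, sigma^2).
        xs.append(xs[-1] + rng.gauss(0.0, sigma))
    return xs


walk = random_walk(100, seed=42)
print(walk[:5])
```

Changing the seed gives a completely different path, which is the point: nothing about the past tells you which way the next step goes.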
This is an example of a random walk, with initial value at zero and disturbances from a standard normal:
Long story short, it is hard (impossible) to predict which way the series will go. Here’s the million dollar question:
What happens when we look at the correlation of two completely unrelated random walks? For this exercise, I simulate two completely separate random walks, each 100 time periods long. I calculate the correlation between them and save that number. I do this 10,000 times and plot the distribution of the correlations. If the two series are truly unrelated to each other, then we might expect their correlations to be zero, so the distribution should be tightly centered around zero and go nowhere near 1 or -1. However, the distribution from the simulation looks like this:
It isn’t tight around zero at all! In fact, it is pretty common to get values over 0.7 or under -0.7, and extremely common to get values over 0.5 or under -0.5. A huge number of these things are spurious correlations, and we made them by merely relating two things that are most definitely not related.
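For anyone who wants to reproduce the experiment, here is a minimal sketch using only the Python standard library. It assumes standard normal disturbances and a starting value of zero; the function names are mine, and the example runs 2,000 simulations to stay quick (pass `n_sims=10_000` to match the exercise described above).

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(n_periods, rng):
    """Random walk starting at zero with standard normal steps."""
    xs = [0.0]
    for _ in range(n_periods - 1):
        xs.append(xs[-1] + rng.gauss(0.0, 1.0))
    return xs


def simulate_correlations(n_sims=10_000, n_periods=100, seed=0):
    """Correlate many pairs of independently generated random walks."""
    rng = random.Random(seed)
    return [
        pearson_r(random_walk(n_periods, rng), random_walk(n_periods, rng))
        for _ in range(n_sims)
    ]


def share_beyond(corrs, threshold):
    """Fraction of correlations whose absolute value exceeds the threshold."""
    return sum(abs(r) > threshold for r in corrs) / len(corrs)


corrs = simulate_correlations(n_sims=2000)
print("share with |r| > 0.5:", share_beyond(corrs, 0.5))
print("share with |r| > 0.7:", share_beyond(corrs, 0.7))
```

The two shares printed at the end are a crude numerical stand-in for the histogram: if unrelated series really produced correlations near zero, both would be close to nothing.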
Why does this trick work? Because a correlation is the covariance divided by the product of the two standard deviations. However, the variance of a random walk is effectively infinite: it keeps growing with every period, because if you walk around randomly, there’s no pressure to return to where you started from. Dividing by an infinite value is taboo by math standards, so the correlation you calculate with completely separate random walks does not make sense.
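To make the variance argument a little more concrete, here is a short worked version. It assumes the disturbances are independent with a common variance $\sigma^2$ and that the starting value $x_0$ is fixed; the symbol $\sigma^2$ is my notation.

```latex
% Sample correlation: covariance scaled by the two standard deviations.
\[
r_{xy} = \frac{\widehat{\operatorname{Cov}}(x, y)}{\hat{\sigma}_x \, \hat{\sigma}_y}
\]

% A random walk is its starting value plus the sum of all past disturbances,
% so its variance grows in proportion to the number of periods elapsed.
\[
x_t = x_0 + \sum_{s=1}^{t} \varepsilon_s
\quad\Longrightarrow\quad
\operatorname{Var}(x_t) = t\,\sigma^2 \;\to\; \infty \quad \text{as } t \to \infty
\]
```

Because the variance depends on $t$, there is no single population variance for the sample standard deviations to estimate, which is one way to read the claim that the denominator is not well defined.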
It gets even worse. There are things called random walks with drift and random walks with trend. The former has been commonly used to describe prices of things (like stocks), which are unpredictable in the short run but predictable in the long run. The positive “drift” term basically means you can make money in the long run. For example, the S&P 500 looks like a random walk with positive drift.
But the drift term, represented by a constant alpha, can be anything, positive or negative. It is just added to the series in every time period.
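Written out, a random walk with drift would presumably look like the following. This is my reconstruction from the description above (a constant alpha added every period), with $x_0$ as an assumed fixed starting value.

```latex
% Random walk with drift: a constant alpha is added in every period.
\[
x_t = \alpha + x_{t-1} + \varepsilon_t
\]

% Substituting repeatedly shows the drift accumulating into a straight-line
% trend, on top of the usual sum of random disturbances.
\[
x_t = x_0 + \alpha\, t + \sum_{s=1}^{t} \varepsilon_s
\]
```

The second line shows why the sign of alpha matters so much: over many periods the $\alpha t$ term sets the overall direction of the series.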
So the next exercise is similar to the last one. Again, I make two completely separate random walks with the same parameters, but this time, I choose an alpha value for each walk by taking a draw from a normal distribution with standard deviation 0.5 before I simulate it. The distribution of the 10,000 correlations is described below.
That’s right. It gets a lot worse. Not only are the values far from zero, but getting strong correlations near 1 and -1 is actually more common than not. This is because the drift term is chosen beforehand; if the two drift terms are of the same sign, then the two series tend to go the same direction. If the two drift terms are of opposite sign, the two series tend to go opposite directions. This phenomenon generates strong correlations, yet they are spurious correlations.
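Here is a minimal sketch of the drift experiment, again with only the Python standard library. I am assuming the normal distribution for alpha is centered at zero (the text only gives its standard deviation), standard normal disturbances, and a starting value of zero; the function names are mine. It runs 2,000 simulations to stay quick.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk_with_drift(n_periods, alpha, rng):
    """Random walk from zero where the constant alpha is added every period."""
    xs = [0.0]
    for _ in range(n_periods - 1):
        xs.append(alpha + xs[-1] + rng.gauss(0.0, 1.0))
    return xs


def simulate_drift_correlations(n_sims=10_000, n_periods=100, alpha_sd=0.5, seed=0):
    """Correlate pairs of independent drifting walks, each with its own drift."""
    rng = random.Random(seed)
    corrs = []
    for _ in range(n_sims):
        # Assumption: drift terms are drawn from N(0, alpha_sd^2), one per walk.
        alpha_1 = rng.gauss(0.0, alpha_sd)
        alpha_2 = rng.gauss(0.0, alpha_sd)
        walk_1 = random_walk_with_drift(n_periods, alpha_1, rng)
        walk_2 = random_walk_with_drift(n_periods, alpha_2, rng)
        corrs.append(pearson_r(walk_1, walk_2))
    return corrs


def share_beyond(corrs, threshold):
    """Fraction of correlations whose absolute value exceeds the threshold."""
    return sum(abs(r) > threshold for r in corrs) / len(corrs)


drift_corrs = simulate_drift_correlations(n_sims=2000)
print("share with |r| > 0.7:", share_beyond(drift_corrs, 0.7))
```

Fixing the two alphas by hand, say one strongly positive and one strongly negative, is a quick way to see the sign effect directly.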
So what’s the solution to this?
First of all, we want to get rid of the infinite variance that is associated with the random walk. What happens when we make the series “mean revert” back to a constant? It turns out that, with the following specification, the variance is no longer infinite:
$x_t = 0.5\,x_{t-1} + \varepsilon_t$
This is because the 0.5 multiplicative factor in front of the lagged values of x “pulls” the series towards zero at each time period. Specifically, it halves the previous period’s value.
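A short derivation shows why the pull towards zero keeps the variance finite. It assumes the disturbances are independent with variance $\sigma^2$ (my notation) and that the series has run long enough for its variance to settle at a constant value.

```latex
% Take variances of both sides of x_t = 0.5 x_{t-1} + eps_t; the disturbance
% is independent of the lagged value, so the variances simply add.
\[
\operatorname{Var}(x_t) = 0.5^2 \operatorname{Var}(x_{t-1}) + \sigma^2
\]

% If the variance has settled, Var(x_t) = Var(x_{t-1}) = V, which gives
\[
V = 0.25\,V + \sigma^2
\quad\Longrightarrow\quad
V = \frac{\sigma^2}{1 - 0.25} = \frac{4}{3}\,\sigma^2
\]
```

With a coefficient of exactly 1 (the random walk) the denominator $1 - 1$ would be zero, which is the same infinite-variance problem in another guise.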
As it turns out, just doing this makes the distribution of the correlations much closer to what we want. In fact, a lot of the spurious correlations between two unconnected time series simply go away.
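The same simulation can be rerun with the mean-reverting process to check this. Below is a minimal standard-library sketch, assuming standard normal disturbances and a starting value of zero; the function names and the `phi` parameter are mine, with `phi=0.5` matching the multiplicative factor above.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_reverting_series(n_periods, rng, phi=0.5):
    """Each value is phi times the previous value plus standard normal noise."""
    xs = [0.0]
    for _ in range(n_periods - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, 1.0))
    return xs


def simulate_mean_reverting_correlations(n_sims=10_000, n_periods=100, seed=0):
    """Correlate many pairs of independently generated mean-reverting series."""
    rng = random.Random(seed)
    return [
        pearson_r(
            mean_reverting_series(n_periods, rng),
            mean_reverting_series(n_periods, rng),
        )
        for _ in range(n_sims)
    ]


def share_beyond(corrs, threshold):
    """Fraction of correlations whose absolute value exceeds the threshold."""
    return sum(abs(r) > threshold for r in corrs) / len(corrs)


mr_corrs = simulate_mean_reverting_correlations(n_sims=2000)
print("share with |r| > 0.5:", share_beyond(mr_corrs, 0.5))
print("largest |r| seen:", max(abs(r) for r in mr_corrs))
```

Setting `phi=1.0` in `mean_reverting_series` turns the process back into a plain random walk, so the two cases can be compared with the same code.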
So the bottom line is that it only makes sense to calculate correlations of time series that look a lot like the mean-reverting process described above, rather than the random walk and drifting processes. The problem with “money spent on pets” and “people who fall down the stairs” is that they may both be random walks with drift. (I’m not so sure about random walks with deaths, but there’s definitely drift.) The same reasoning goes for a lot of the series on Vigen’s website.
If we suspect drift and/or a random walk, the way to solve this problem is to first-difference the series. In other words, we generate a whole new series made of the difference between the current period’s value and last period’s value. In our random walk with drift process, this cancels out the lagged x values, leaving the constant alpha plus the current realization of epsilon. The constant has no effect on a correlation, and the epsilon is random. Correlating the two differenced series from two different random walks with drift will give us a correlation very close to zero. On the other hand, if the two series are connected in some way, e.g., if the epsilons are truly correlated, then we should be able to detect it.
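Here is a minimal sketch of first-differencing in action, using only the Python standard library. Each differenced value of a random walk with drift is the drift constant plus that period’s random shock, and a constant shift does not change a correlation. The sketch assumes zero-mean normal draws for the drift terms and standard normal shocks; the `rho` parameter, which sets how correlated the two series’ shocks are, and all function names are my own.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def first_difference(xs):
    """Current value minus previous value, for every period after the first."""
    return [curr - prev for prev, curr in zip(xs, xs[1:])]


def drifting_pair(n_periods, rng, rho=0.0, alpha_sd=0.5):
    """Two random walks with drift whose shocks have correlation rho."""
    alpha_1 = rng.gauss(0.0, alpha_sd)
    alpha_2 = rng.gauss(0.0, alpha_sd)
    xs, ys = [0.0], [0.0]
    for _ in range(n_periods - 1):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        eps_1 = z1
        # Mix in z1 so that eps_1 and eps_2 have correlation rho.
        eps_2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2
        xs.append(alpha_1 + xs[-1] + eps_1)
        ys.append(alpha_2 + ys[-1] + eps_2)
    return xs, ys


def differenced_correlations(n_sims=1000, n_periods=100, rho=0.0, seed=0):
    """Correlations between first-differenced pairs of drifting walks."""
    rng = random.Random(seed)
    corrs = []
    for _ in range(n_sims):
        xs, ys = drifting_pair(n_periods, rng, rho=rho)
        corrs.append(pearson_r(first_difference(xs), first_difference(ys)))
    return corrs


unrelated = differenced_correlations(rho=0.0)
related = differenced_correlations(rho=0.8)
print("unrelated, share with |r| > 0.5:", sum(abs(r) > 0.5 for r in unrelated) / len(unrelated))
print("related, average r:", sum(related) / len(related))
```

The `rho=0.8` case stands in for two series that are connected through their shocks: the link should still show up after differencing, while the unrelated pairs should mostly stay near zero.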
In short, there’s a lot going on with time series. Don’t believe every graph you see, but don’t dismiss the importance of correlation in statistics either.