Spurious Correlations With Time Series: What We Can Learn

There is a viral website being passed around. Harvard Law student Tyler Vigen made an online tool that allows one to choose a time series and to choose another time series that it is potentially highly correlated with. This has inspired many people to make graphs of time-varying variables that have little to do with each other but are still highly correlated.

For example, the amount of money spent on pets and the number of people who died by falling down the stairs may have little to do with each other, yet there is almost a perfect correlation of r=0.97.

The media has covered the popularity of Vigen’s website through a variety of angles. Most of them suggest that we shouldn’t be too quick to trust graphs and that, once again, correlation is not causation.

I agree fully that we should be critical of charts and graphs, that we should be skeptical of their assumptions, and that we should look into their hidden meanings. There’s nothing wrong with that sentiment of skepticism.

I also agree that, on the most basic and technical level, correlation is not causation. What I think is less commonly understood is the fact that correlation is actually evidence for causation, since causation correlates with correlation.

But there are deeper and more nuanced truths about these types of graphs that people tend to gloss over. They have to do with the fact that these are time series data. Time series data have unique properties, and correlating time series is not the same as correlating non-time-varying data.

Why do we find spurious correlations?

The first and most obvious thing to note is that time series can be misleading because they appear authoritative in their sample size. The graph above shows data from 2000 to 2009, which seems like a long time but is actually only 10 data points. A correlation merely describes how closely two variables move together around their respective means; it doesn’t even begin to tell us how much confidence we should place in whether the association really exists.

A better way to do this would be to use a statistical test for your correlation value or to run a regression. If your regression coefficient is statistically significant, you might be onto something.
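As a rough sketch of what such a test looks like, here is the textbook t-statistic for a sample correlation. It assumes independent, roughly normal observations, which is exactly the assumption that time series tend to violate. The function name `correlation_t_stat` is made up for this illustration; the inputs are the r=0.97 and 10 data points mentioned earlier.

```python
import math

def correlation_t_stat(r, n):
    """t-statistic for the null hypothesis of zero correlation,
    given a sample correlation r computed from n observations."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Values quoted in the text: r = 0.97 from 10 yearly observations.
t_stat = correlation_t_stat(0.97, 10)
print(f"t = {t_stat:.2f} on {10 - 2} degrees of freedom")
```

A t-statistic of about 11 on 8 degrees of freedom would normally count as overwhelming evidence, so the naive test would happily sign off on the pets-and-stairs correlation.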

But with time series data, that could still be wrong.

You see, time series are special: realizations of a variable are not independent and can evolve according to a specific process across time. There’s a very peculiar process that is known to generate spurious correlations.

It’s called a random walk. It’s literally like walking around in random directions over time.

The most basic form of this data generating process can be described as the following:

$x_{t} = x_{t-1} + \epsilon_t$

In other words, we can make a random walk by choosing some initial value of x. The next period’s value will be the current period’s value added to a realization of a value from a normal distribution centered at zero. I repeat this to get my entire series. Since the values from the normal distribution are random, I can’t predict which way I’m going to go; hence, the random walk.

This is an example of a random walk, with initial value at zero and disturbances from a standard normal:
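For readers who want to generate one themselves, here is a minimal sketch in Python with NumPy. The function name `random_walk` and the seed are my own choices for illustration, not anything from the original simulation.

```python
import numpy as np

def random_walk(n_periods, x0=0.0, rng=None):
    """Return x_1..x_n where x_t = x_{t-1} + eps_t and eps_t ~ N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    shocks = rng.standard_normal(n_periods)
    # The cumulative sum of the shocks is the running position of the walk.
    return x0 + np.cumsum(shocks)

# Initial value at zero, disturbances from a standard normal.
walk = random_walk(100, rng=np.random.default_rng(0))
print(walk[:5])
```

Changing the seed gives a completely different path, which is the whole point.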

Long story short, it is hard (impossible) to predict which way the series will go. Here’s the million dollar question:

What happens when we look at the correlation of two completely unrelated random walks? For this exercise, I simulate two completely separate random walks, each 100 time periods long. I calculate the correlation between them and save that number. I do this 10,000 times and plot the distribution of the correlations. If the two series are truly unrelated, then we might expect their correlations to be zero, so the distribution should be tightly centered around zero and never get anywhere close to 1 or -1. However, the distribution from the simulation looks like this:

It isn’t tight around zero at all! In fact, it is pretty common to get values over 0.7 or under -0.7, and extremely common to get values over 0.5 or under -0.5. A huge number of these are spurious correlations, and we made them by merely relating two things that are most definitely not related.
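The exercise above can be sketched in a few lines of NumPy. This is a reconstruction under my own assumptions (standard normal shocks, both walks starting at zero, an arbitrary seed), not the original simulation code, and `rowwise_corr` is a helper written for this sketch.

```python
import numpy as np

def rowwise_corr(x, y):
    """Pearson correlation between matching rows of two 2-D arrays."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt(
        (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1)
    )

rng = np.random.default_rng(42)
n_sims, n_periods = 10_000, 100

# Each row is one independent random walk of 100 periods.
x = np.cumsum(rng.standard_normal((n_sims, n_periods)), axis=1)
y = np.cumsum(rng.standard_normal((n_sims, n_periods)), axis=1)
corrs = rowwise_corr(x, y)

share_over_half = np.mean(np.abs(corrs) > 0.5)
share_over_07 = np.mean(np.abs(corrs) > 0.7)
print(f"|r| > 0.5: {share_over_half:.1%}   |r| > 0.7: {share_over_07:.1%}")
```

Passing `corrs` to a histogram function should reproduce the wide, flat shape of the distribution.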

Why does this trick work? Because a correlation is a covariance divided by the product of two standard deviations. However, the variance of a random walk grows without bound over time; if you walk around randomly, there’s no pressure to return to where you started from. In the limit the variance is infinite, and dividing by an infinite value is taboo by math standards. The correlation you calculate with completely separate random walks does not make sense.
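One way to see the variance claim, as a sketch that assumes the shocks are independent with a common finite variance $\sigma^2$ (a symbol I am introducing here) and that the starting value $x_0$ is fixed, is to unroll the recursion:

$x_{t} = x_{0} + \sum_{s=1}^{t} \epsilon_s$

Because the shocks are independent, their variances add up:

$\mathrm{Var}(x_{t}) = t\sigma^2$

The variance grows in proportion to t and never settles at a finite value, so there is no fixed population variance for a sample correlation to estimate.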

It gets even worse. There are things called random walks with drift and random walks with trend. The former has been commonly used to describe prices of things (like stocks), which are unpredictable in the short run but predictable in the long run. The positive “drift” term basically means you can make money in the long-run. For example, the S&P 500 looks like a random walk with positive drift.

But the drift term, represented by a constant alpha, can be anything, positive or negative. It is just added to the series in every time period.

$x_{t} = \alpha + x_{t-1} + \epsilon_t$

So the next exercise is similar to the last one. Again, I make two completely separate random walks with the same parameters, but this time, I choose an alpha value by taking a draw from a normal distribution with standard deviation 0.5 before I simulate each random walk. The distribution of the 10,000 correlations is shown below.

That’s right. It gets a lot worse. Not only are the values far from zero, but getting strong correlations near 1 and -1 is actually more common than not. This is because the drift term is chosen beforehand; if the two drift terms are of the same sign, then the two series tend to go the same direction. If the two drift terms are of opposite sign, the two series tend to go opposite directions. This phenomenon generates strong correlations, yet they are spurious correlations.
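Here is a hedged sketch of the drift experiment. I am assuming the drift is drawn from a normal distribution centered at zero (only the standard deviation of 0.5 is stated above) and that both walks start at zero; `rowwise_corr` and `share_strong` are names invented for this example.

```python
import numpy as np

def rowwise_corr(x, y):
    """Pearson correlation between matching rows of two 2-D arrays."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt(
        (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1)
    )

def share_strong(drift_sd, n_sims=10_000, n_periods=100, seed=0):
    """Share of independent walk pairs whose |correlation| exceeds 0.5."""
    rng = np.random.default_rng(seed)
    walks = []
    for _ in range(2):
        # One drift per walk, drawn before the walk is simulated
        # (mean zero is an assumption of this sketch).
        alpha = drift_sd * rng.standard_normal((n_sims, 1))
        shocks = rng.standard_normal((n_sims, n_periods))
        # x_t = alpha + x_{t-1} + eps_t with x_0 = 0.
        walks.append(np.cumsum(alpha + shocks, axis=1))
    corrs = rowwise_corr(walks[0], walks[1])
    return np.mean(np.abs(corrs) > 0.5)

share_no_drift = share_strong(drift_sd=0.0)
share_with_drift = share_strong(drift_sd=0.5)
print(f"no drift: {share_no_drift:.1%}   with drift: {share_with_drift:.1%}")
```

Setting `drift_sd=0.0` recovers the plain random walk case, so the two numbers can be compared directly.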

So what’s the solution to this?

First of all, we want to get rid of the infinite variance that is associated with the random walk. What happens when we make the series “mean revert” back to a constant? It turns out, if we have the following specification, the variance is no longer infinite.

$x_{t} = 0.5x_{t-1} + \epsilon_t$

This is because the 0.5 multiplicative factor in front of the lagged values of x “pulls” the series towards zero at each time period. Specifically, it halves the previous period’s value.

As it turns out, just doing this makes the distribution of the correlations much closer to what we want. In fact, a lot of the spurious correlations between two unconnected time series simply go away.
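To check this claim, the same experiment can be rerun with the 0.5 factor in place. The sketch below is my own reconstruction with assumed settings (standard normal shocks, series starting at zero, 10,000 pairs of 100 periods); the helper names are hypothetical.

```python
import numpy as np

def rowwise_corr(x, y):
    """Pearson correlation between matching rows of two 2-D arrays."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt(
        (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1)
    )

def simulate_paths(phi, n_sims, n_periods, rng):
    """Simulate x_t = phi * x_{t-1} + eps_t for many series at once."""
    shocks = rng.standard_normal((n_sims, n_periods))
    x = np.empty_like(shocks)
    x[:, 0] = shocks[:, 0]  # first value, starting from x_0 = 0
    for t in range(1, n_periods):
        x[:, t] = phi * x[:, t - 1] + shocks[:, t]
    return x

def share_strong(phi, n_sims=10_000, n_periods=100, seed=1):
    """Share of independent pairs whose |correlation| exceeds 0.5."""
    rng = np.random.default_rng(seed)
    x = simulate_paths(phi, n_sims, n_periods, rng)
    y = simulate_paths(phi, n_sims, n_periods, rng)
    return np.mean(np.abs(rowwise_corr(x, y)) > 0.5)

share_mean_reverting = share_strong(phi=0.5)
share_random_walk = share_strong(phi=1.0)
print(f"phi=0.5: {share_mean_reverting:.2%}   phi=1.0: {share_random_walk:.2%}")
```

With `phi=1.0` the process is the plain random walk again, which makes the contrast with the mean-reverting case easy to see.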

So the bottom line is that it only makes sense to calculate correlations of time series that look a lot like the mean-reverting process described above, rather than the random walk and drifting processes before it. The problem with “money spent on pets” and “people who fall down the stairs” is that they may both be random walks with drift. (I’m not so sure about random walks with deaths, but there’s definitely drift.) The same reasoning goes for a lot of the series on Vigen’s website.

If we suspect drift and/or a random walk, the way to solve this problem is to first-difference the series. In other words, we generate a whole new series made of the difference between the current period’s value and the last period’s value. In our random walk with drift process, this cancels out the lagged x values, leaving us with the constant alpha plus the current realization of epsilon, which is random. Since adding a constant does not change a correlation, correlating the two differenced series from two different random walks with drift will give us a correlation very close to zero. On the other hand, if the two series are connected in some way (e.g., if the epsilons are truly correlated), then we should be able to detect it.
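A sketch of the first-differencing fix follows, with assumed settings throughout: mean-zero drifts with standard deviation 0.5, standard normal shocks, and a hypothetical shock correlation `rho` of 0.6 for the "truly connected" case. None of these numbers beyond the 0.5 come from the original exercise.

```python
import numpy as np

def rowwise_corr(x, y):
    """Pearson correlation between matching rows of two 2-D arrays."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / np.sqrt(
        (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1)
    )

rng = np.random.default_rng(3)
n_sims, n_periods, drift_sd, rho = 10_000, 100, 0.5, 0.6

def drift_walk(shocks):
    """Random walk with a per-series drift, built from the given shocks."""
    alpha = drift_sd * rng.standard_normal((shocks.shape[0], 1))
    return np.cumsum(alpha + shocks, axis=1)

# Case 1: two unrelated random walks with drift.
eps_x = rng.standard_normal((n_sims, n_periods))
eps_y = rng.standard_normal((n_sims, n_periods))
x, y = drift_walk(eps_x), drift_walk(eps_y)
level_corrs = rowwise_corr(x, y)
diff_corrs = rowwise_corr(np.diff(x, axis=1), np.diff(y, axis=1))

# Case 2: a third walk whose shocks are correlated with x's shocks.
noise = rng.standard_normal((n_sims, n_periods))
eps_z = rho * eps_x + np.sqrt(1 - rho ** 2) * noise
z = drift_walk(eps_z)
linked_diff_corrs = rowwise_corr(np.diff(x, axis=1), np.diff(z, axis=1))

print("levels, share |r| > 0.5:     ", np.mean(np.abs(level_corrs) > 0.5))
print("differences, share |r| > 0.5:", np.mean(np.abs(diff_corrs) > 0.5))
print("linked differences, mean r:  ", linked_diff_corrs.mean())
```

The differenced correlations for the unrelated pair should cluster near zero, while the linked pair should come out close to the `rho` that was built in.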

In short, there’s a lot going on with time series. Don’t believe every graph you see, but don’t dismiss the importance of correlation in statistics either.

Happy Carl Sagan Day 2013

There are so many videos of Carl Sagan’s inspirational and life-changing narration about the Pale Blue Dot. Here are two of my favorites.

The first is a recently released crowd-sourced video from the skeptic and nonbelieving community. Carl Sagan certainly has a way to bring people from all walks of life together, and that makes me so happy.

The second is a breathtakingly beautiful production from the Sagan Series.

This is Water: One of the Best Commencement Speeches Ever

So much of our lives revolve around monotony and dead time. But it’s not about you. David Foster Wallace reminds us to consider the richness of our surroundings and to understand the value of a true education.

What is your goal for the relationship between humanity and religion in the future?

First of all, I want to state what my goals definitely are not.

I don’t want to stop you from enjoying a transcendent experience at church or synagogue on the weekends. I don’t want the world to stop reading religious texts or to extinguish every last trace of the Bible or Koran from material existence (What other kind of existence is there anyways?).

(How atheists really want to spread their views. Not.)

I don’t even necessarily want people to stop identifying as Christians or Muslims or whatever, even though much can be said about what identities we should hold primary, and whether those considerations put primacy on individual choice, or whether society has a role in directing people’s energies towards a more harmonious, tolerant whole.

And there are many different reasons why my goals wouldn’t matter to you even if you are religious. If you think of religion in a vague, academic sense as Durkheim did (in the categorization of the sacred and the profane), then pretty much anything is religion, and you’ll still have it. If you think of God as something that goes beyond theism in the Paul Tillich sense (God Above God, faith as ultimate concern), then your religious beliefs are solid (I think) even if all my goals become reality.

And yes, I understand that for many traditional theists, my demands are outrageous, even offensive. So here it goes.

1) All religions have to lose their supernatural claims.

This means a full embrace of science. If you think Jesus literally rose from the dead, that’s not going to cut it. If you think maybe, just maybe there was a real Garden of Eden, that’s probably not going to pass either. You can’t believe that prayer actually does something. (I’m not against people praying though, oddly.) The reason is that we’re tired of humanity having to spend energy to fight claims that seem really trivial at first, but somehow cause parents to watch their children die in front of them, make people think that creationism is actually worth teaching in the schools, or take the existence of heaven and hell way too seriously. Removing actual supernaturalism from religion makes adherents more likely to not descend into a state of extreme denial of reality.

2) Religion has to give way to secular ethics.

I’m willing to compromise on this, but not by much. You can use religion to explain why you feel passionately about a moral issue, and why this religious motivation inspires you to be an activist. But humanity shouldn’t take religious arguments for why something is right or wrong seriously. We should stop fooling ourselves into thinking that debates over whether homosexuality or slavery is right or wrong belong in the sphere of people arguing about the proper exegesis of the Bible. We have much better ways to settle issues like that. It’s called the body of secular ethics (notably utilitarianism) in modern philosophy.

3) Religion has to give way to secular politics.

It’s similar to ethics. But no quote I’ve found has been as good as this one:

4) We should promote general religion over specific religion.

I think there’s a need and a good that specific religions provide. Just as pro-lifers say they want to promote a “culture of life”, I think there’s a case to be made that promotion of civil religion, the kind that emphasizes common values and universal humanistic truths, is important. How I want this civil religion to play out I have not fully decided, but I don’t see why it can’t take many forms to promote the spiritual and psychological health of people in general. One need not have the fantastical visions of Alain de Botton to find a practical way to keep the good parts of religion or to recognize that there’s something valuable in promoting those good parts in a reasonable, *socialized* manner.

Example

I believe there was once a commenter on this blog who asked that if I wasn’t content with Christians holding the position that “homosexuality is a sin and therefore gay people must be celibate,” what he could possibly do without leaving Christianity. My answer was that Christianity needed to change, drastically, just as it had done so in the past with all the schisms and scandals in the Church. I also noted that there are many LGBTQ-affirming Christians out there, and that it wasn’t a practical impossibility to change one’s position.

Of course, I pretty much got yelled at online for suggesting that Christians fundamentally change the Word of God, or at least the orthodox interpretation of such revelations.

Unfortunately, that’s kind of my demand. And if you’re not happy with it, then I’m afraid you’re not going to enjoy the work that this secular movement is going to do to fundamentally change the culture and ethos of this country.

Depiction of Asian Americans in the media

This post will be fairly brief because this topic has been discussed a lot. All you have to do is seek out all the sources that already exist!

I remember when I was in elementary school, I went with a group of Asian friends to a parade on the South Side because we were paid to walk alongside a float in promotion of a city project (it was the repair of the Dan Ryan Expressway). It was arguably one of the most awkward and most terrifying experiences of my life. It was pretty obvious that we were the only *different* ethnicity there. But that wouldn’t have been a problem if it hadn’t been for the kids around us insisting that we act like Bruce Lee and that we fight them with karate or something. When you’re surrounded by kids that seem to want to fight you, you tend to remember it. We tried to walk away and forget it.

Of course, feeling terror is not interesting in and of itself. There are many times when people are afraid, and there have been many people who have had much worse experiences than I did.

The incident had me thinking a lot about why people form the conception of others in the way they do, and whether it’s the media or something else.

(Martial arts is really cool. Fighting and all. But there’s a philosophy of discipline and hard work behind it.)

The funny thing is that I rather like Bruce Lee. He brought the martial art of an entire country to America and the world, and popularized respect for it. More importantly, he was as American and human as any of us. He went to college here and studied drama and philosophy. He married an American and was actually an American citizen. His movies, though very imperfect, showed many ethical and philosophical sides to martial arts and human existence in the characters that he played. Bruce Lee himself was a physical trainer, a filmmaker, an artist, and a poet. He was also, somewhat in an irrelevant way, an atheist.

The question is how did all of this richness and depth get all lost and reduced to the image of merely fighting multiple people on the street?

This is not to defend the media or anything like that. Even Bruce Lee left Hollywood in the 1970’s because he felt he had a lack of opportunity there due to discrimination.

As articles like this suggest, there is still a lot more to be done in the media to change the status quo, even though not all stereotypes are media driven.

From New York Knicks basketball star Jeremy Lin to Priscilla Chan, wife of Facebook founder Mark Zuckerberg, the mainstream media usually portray Asian-Americans as wealthy, well-educated and foreign. The dominant cultural narrative routinely ignores working and middle class Asian-Americans, people of various nationalities who struggle with the same socioeconomic conditions as do other Americans.

Despite shortcomings, mainstream media are rarely criticized for the way they depict Asian-Americans, even though the lack of depth in the coverage is stunning.

Yes, the media sucks. But we have to do our part in not selectively remembering what’s actually depicted either. Sometimes there’s more richness and depth in people, if not in the media, than you realize. Doing otherwise is just confirmation bias acting on our stereotypes.

And by the way, Happy Asian-Pacific Heritage Month!

Whether non-subjective value might be possible, and what value would mean in that case.

I admit that one of the most annoying things about philosophical discourse is this assumption that there exists this sacred, unbreakable distinction between the “subjective” and the “objective”. Whether we are talking about things like beauty, morality, value, or even probability, there is a tendency to think of the subjective as that realm where it is merely opinion or feelings, fleeting as they may be, and the objective as that realm which is eternally and cosmically really really true no matter what you think or say.

I think these poor definitions (or maybe misconceptions of the definitions) put unnecessary restrictions on our thinking, especially if all we want is a coherent and satisfying framework for believing that the things we care about have value even when we aren’t here. If, on the other hand, you aren’t satisfied with anything but a cosmic, utterly transcendent, nearly magical idea of “value”, then I can’t help you.

As a utilitarian, I like to think of value as determined by, well, utility. But doesn’t utility depend on beings to be in existence, and what if those beings aren’t there anymore? Isn’t value completely subjective?

Let’s do a thought experiment. Suppose we got to explore a new planet in some far away solar system, and we discovered that the inhabitants of this planet had all disappeared or died, leaving their valuables like their cars and dishwashers in their houses. Are we to say that these artifacts have no value (apart from artifact value) even though we don’t know how to drive alien cars and their dishwashers are worse than ours? Does the nonexistence of the creatures affect the value?

I think we can safely say that the value of these objects does depend on how they once related to the subjective lives of those around them. If there were no conscious creatures ever, the Universe would just be a barren place, and the idea of value just wouldn’t have coherent meaning.

I think value is not merely what we think or feel. When I say a car (or something abstract) has value, I don’t mean that I like cars and you should too. I mean that I recognize that conscious creatures do (or once did) like cars, and that this object-person relationship that emerges from the consideration of utility in others is something I recognize. It’s the difference between saying, “I like chocolate ice cream” and “I observe that chocolate ice cream has increased the utility of many people and thus I recognize that it has value, especially given that I think there’s something objectively real in chocolate ice cream (its sugary, creamy awesomeness in the form of certain chemical arrangements).”

After all, the ability to go beyond one’s own mind and recognize others is the basis of science, morality, and the typical ways Bayesians converge towards truth, or at least agreement.

So yes, I think value is subjective, but it isn’t as subjective as you subjectively think.

Does the gender wage gap exist and why?

It’s curious that I get economic questions a lot. But let’s roll.

Does the gender wage gap exist? I think it does. My priors are that it does because we know that in experimental settings, employment discrimination is very real. But what does the observed evidence in the labor market tell us?

First of all, good interpretation of statistical evidence requires that we evaluate not just individual studies and papers, but the entire literature. Reviews of this literature suggest that “there is considerable agreement that gender wage discrimination exists”.

The parsimonious “let’s control for observables” approaches have yielded mixed results. Most of the wage gap disappears, but some significant difference is left behind. That difference has been the subject of many arguments on both sides. But let me suggest a different way to think about the wage data (or any kind of data).

For many things in life, the fact that you can observe something is information in and of itself. The fact that for a specific individual, a wage is offered and accepted (and then by random chance recorded in population surveys) is telling. Surveys generally do a great job of randomly selecting a sample from the population, but the market does not do a good job of randomly choosing who works at what wage, or whether certain people work at all. Selection bias is at work, whether you like it or not. The only thing I want to convince you of is that the existence of selection bias is something to really consider when thinking about “controlling for other factors”.

Selection, its effects and more specifically how to correct for them, is the area of research that got James Heckman his Nobel Prize in Economics in 2000. Most interestingly, it has been used extensively to study the determinants of wages.

How exactly does selection work in our setting? Let me draw you a few pictures! But I don’t have my awesome graphic design software installed, so I’ll have to do with MS Paint.

If controlled wages for females (sorry everyone, the U.S. surveys don’t code for third gender or anything like that) are lower, and if we want to think about selection into the sample, then we have to ask, “what would make a person work?”

We all have an intuition that there’s a wage low enough where we would choose not to work. We call it the reservation wage. There are many reasons to think why this wage is nontrivial. Maybe spousal income might be a good substitution for individual income, so we choose to work only if we have to or we’re paid a lot for it. Maybe you believe that the evil welfare state is causing many low-wage workers to rely on unemployment benefits and food stamps because welfare is supposedly better than working at some low wage.

The effect of this reservation wage, if it were a strict thing, would mask more of the lower end of the lower distribution than the lower end of the higher distribution. Take a look at the graph, and imagine all the points under the black horizontal line are not observed. In reality, it would mean that many women choose not to work, which is empirically true because female labor force participation is not all that high.

When there’s selection on the lower end of the spectrum, it makes the slope of the line flatter, which means it makes the estimate of the gender wage gap smaller than it actually is.

An important point is that the graph is a little exaggerated. Specifically, the reservation wages are different for everyone. So there’s no clear black line you can draw on the graph where all the points below that line are unobserved. Instead, we should think of the appearance of a point in the data as a probability–a probability that increases as the wage gets higher because you are more likely to work!
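To make the mechanism concrete, here is a toy simulation in which everyone has their own reservation wage. Every number in it is hypothetical and chosen only for illustration (log-wage offers with means 3.0 and 2.8, reservation wages centered at 2.6); nothing is estimated from real data, and it is a sketch of the selection story rather than any method from the literature.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000  # simulated people per group (hypothetical)

# Hypothetical log-wage offers: same spread, group B's mean is 0.2 lower.
offers_a = rng.normal(3.0, 0.5, n)
offers_b = rng.normal(2.8, 0.5, n)

# Everyone draws their own reservation wage from the same distribution,
# so there is no single cutoff line.
reserve_a = rng.normal(2.6, 0.3, n)
reserve_b = rng.normal(2.6, 0.3, n)

# A wage is only observed if the offer beats the reservation wage,
# so higher offers are more likely to show up in the data.
works_a = offers_a > reserve_a
works_b = offers_b > reserve_b

true_gap = offers_a.mean() - offers_b.mean()
observed_gap = offers_a[works_a].mean() - offers_b[works_b].mean()

print(f"participation: A {works_a.mean():.1%}, B {works_b.mean():.1%}")
print(f"gap in offers: {true_gap:.3f}   gap among workers: {observed_gap:.3f}")
```

In this setup the lower-paid group participates less, and the gap measured among workers comes out smaller than the gap in the underlying offers, which is the flattening described above.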

I’m still trying to find a really good paper specifically on the U.S. gender wage gap and selection bias correction (across all the different ways to correct for it), and it’s not been going well. There’s this paper using data from Colombia that suggests “that self selection into the labor force is crucial for gender gaps: if all women participated in the labor force, the observed gap would be roughly 50% larger at all quantiles.” Of course, we need to review the entire literature.