The fact that the universe is complex shouldn’t surprise anyone, yet one of the things I find surprising is how vehemently people try to argue that it’s not. I have already written a blog post about correlation vs. causation, but I’d like to talk about something a little broader than that. Correlations try to demonstrate the relationship between one variable and another. It would be sensible, for example, to try to correlate the amount of gun ownership with gun deaths. I don’t think anybody would dispute that there might be a relationship. But even if we do find a correlation, what does it mean? What is a good correlation, or a bad correlation?
So what can we expect out of a correlation? First we might ask, “Can any two things ever be perfectly correlated, even if they are linked by an exact mathematical relationship?” In practice the answer is no. Quite simply, for data to be perfectly correlated would require perfect measurements. Since every measurement has some error, the correlation we calculate will always be less than perfect. The more difficult the measurement, the less likely we are to get a good correlation at all. In the social sciences, measurements may depend on surveys, qualitative and/or subjective observations, and complex sampling techniques. All of this affects the correlation we calculate. The strength of a correlation ranges from 0 (no correlation) to 1 (perfect correlation). How high a correlation has to be before we can be confident of a relationship really depends on the quality of the data and the nature of the problem, but as a rough rule of thumb anything above 0.5 is generally treated as significant. Correlations can also be negative, down to -1. A negative correlation implies that as one variable increases the other decreases, and those are important relationships as well.
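To make the point about measurement error concrete, here is a minimal sketch in Python using NumPy. The relationship y = 2x + 1, the noise levels, and all variable names are assumptions I picked purely for illustration; they don’t come from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# An exact linear relationship: y is fully determined by x.
x_true = rng.normal(0.0, 1.0, size=1000)
y_true = 2.0 * x_true + 1.0

# Hypothetical measurement error added to both variables (assumed noise levels).
x_measured = x_true + rng.normal(0.0, 0.5, size=x_true.size)
y_measured = y_true + rng.normal(0.0, 1.0, size=y_true.size)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_exact = np.corrcoef(x_true, y_true)[0, 1]
r_measured = np.corrcoef(x_measured, y_measured)[0, 1]

print(f"r with perfect measurements: {r_exact:.3f}")
print(f"r with noisy measurements:   {r_measured:.3f}")
```

The underlying relationship is identical in both cases. Only the quality of the measurements differs, and that alone is enough to pull the calculated correlation noticeably below 1.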
Now, what happens when we try to correlate things that have no relationship to each other? It turns out we can get all sorts of results. The graphic to the right is often cited to show the difference between correlation and causation. Two variables may appear to be correlated but have no relationship at all, even if the results are repeatable. Correlations turn out to be even a little trickier than that. Some types of data naturally vary above and below their average. A couple of examples would be blood sugar levels and atmospheric pressure. Now let’s say we correlated those two things. Let’s say I have two years of daily air pressure values, and I have sampled my blood sugar every day for 30 straight days. The smart thing to do is select the atmospheric pressure values for the days on which I took a sample and correlate the two. But what might I find? Well, it could be that during those 30 days a slow-moving high pressure system moved in toward the end of the period, meaning that in general pressure increased. And perhaps my diet was not so good: I ate infrequently or irregularly, or had a few big meals one day and only light snacks the next, so my blood sugar fluctuated. I might find that there is a negative correlation between pressure and blood sugar. But what would happen if I did the experiment for another 30-day period? Depending on the specific conditions, I might find a positive correlation. What I’d find over time, after many of these repeated tests, is that there is no correlation: the average of all the correlations I calculate will be essentially zero. This speaks to the importance of having a larger sample size and of checking whether results are repeatable for the same experiment. Had I taken two years’ worth of pressure data and two years’ worth of blood sugar data, I would have found essentially zero correlation.
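The repeated 30-day experiment can be sketched as a small simulation. This is only an illustration under assumptions of my own: the two series are generated independently, each day partly “remembers” the previous day (a simple autoregressive model with a made-up persistence of 0.8), and the function names are hypothetical. Real pressure and blood sugar data would of course behave differently in the details.

```python
import numpy as np

def fluctuating_series(n_days, rng, persistence=0.8):
    """Series that wanders around zero; each day partly remembers the day before."""
    shocks = rng.normal(0.0, 1.0, size=n_days)
    series = np.empty(n_days)
    series[0] = shocks[0]
    for day in range(1, n_days):
        series[day] = persistence * series[day - 1] + shocks[day]
    return series

def repeated_experiments(n_experiments, n_days, rng):
    """Pearson's r for many independent experiments on two unrelated series."""
    results = np.empty(n_experiments)
    for i in range(n_experiments):
        pressure = fluctuating_series(n_days, rng)
        blood_sugar = fluctuating_series(n_days, rng)  # generated independently
        results[i] = np.corrcoef(pressure, blood_sugar)[0, 1]
    return results

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
r_values = repeated_experiments(n_experiments=2000, n_days=30, rng=rng)

print(f"most negative r: {r_values.min():.2f}")
print(f"most positive r: {r_values.max():.2f}")
print(f"average r:       {r_values.mean():.3f}")
```

Individual 30-day experiments can land well on either side of zero, while the average over all of them sits close to zero, which is the pattern described above.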
Furthermore, even if I did find some correlation, my results would not be repeatable over many studies. The media often exaggerates findings from experiments that try to link, say, certain foods or cell phones to cancer. Variables that naturally fluctuate may show a correlation by chance in one study, while further experiments reveal none. But the media usually just picks up on the one study that found a correlation.
To further illustrate this point, take a look at the following two images. The one on the left is a large sample of raw data from an attempt to correlate two variables. If these two variables were correlated, we would see the data points approximate a line. Maybe a straight one, maybe a curved one, but definitely some sort of trend would be apparent. You are right if you think that, whatever these two variables are, they have no relationship to each other. Next to it is the same data set, except that some of the data points are missing. Now it looks like there might be a correlation between these two variables. Sure, in this case I have selectively chosen which points to eliminate, but it is possible that a random sample of the full data set could have ended up with these same points. Thus it is possible to think we’ve found a correlation when there isn’t one. Once again, the more data we take and the more we repeat the experiment, the clearer it becomes that there is no correlation.
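The effect of dropping data points can be reproduced with made-up numbers. The sketch below is an assumption-laden illustration, not the data behind the images: it generates two independent variables and then, just as I did by hand, deliberately keeps only the points that happen to lie near the line y = x. The cutoff of 0.5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

# Two variables generated independently, so there is no real relationship.
x = rng.normal(0.0, 1.0, size=5000)
y = rng.normal(0.0, 1.0, size=5000)
r_full = np.corrcoef(x, y)[0, 1]

# Deliberately keep only the points close to the line y = x (arbitrary cutoff).
keep = np.abs(x - y) < 0.5
r_subset = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"points kept: {keep.sum()} of {x.size}")
print(f"r for the full data set:  {r_full:.3f}")
print(f"r for the chosen subset:  {r_subset:.3f}")
```

The full data set shows essentially no correlation, while the hand-picked subset shows a strong one, even though nothing about the underlying variables has changed.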
In the next part we will investigate how we hypothesize about correlations, and how complex most relationships are.