The Nature of Correlation – Part II

Since we now understand some basics about correlation, I think it would be interesting to ask, “How do we initially hypothesize a connection between variables?” Consider something simple like Newton’s second law, which says that the net force acting on an object equals its mass times its acceleration. How might such a relationship have been devised initially? Well, this is something that can be easily observed. You push an object like a cart and it accelerates. If you push it harder, it accelerates faster. If the cart is full of objects it becomes heavier and requires more force to get the same acceleration. Such observations are the basis for an experiment which can show us the nature of the relationship. Our experiment might end up being only mostly successful, as we might be confused about why the required force changes depending on the surface across which we move the object. Until we understand friction we might not be fully aware of the other forces that are working against us when we make our measurements. Newton was able to account for the frictional force as well, but the fact that the Earth is turning on its axis revealed that Newton’s second law in its simple form is incomplete on Earth and holds only in a reference frame that is not accelerating or rotating. Newton’s description of force came first, and it was only through further experimentation and testing that we came to understand the limitations associated with his relationship between force, mass and acceleration. So the basis for a correlation comes through a lot of trial and error after some initial observations. Proper application of the scientific method (adequate sampling in particular) along with result repeatability can demonstrate the correlation between two variables.

Finally, it’s important to understand the complexity of the relationships that exist. Just as force depends on both mass and acceleration, most things in this world are not as simple as one cause and one effect. Most things are complex systems in which one variable may change as a result of several other variables. The global average surface temperature is a function of the amount of solar radiation hitting the Earth, the concentration and location of greenhouse gases in the atmosphere, the amount of geothermal energy released at the surface, and the amount of energy released through radioactivity. The last two tend to be fairly small. Solar radiation is the most important factor, followed by the greenhouse effect. So if we want to look at how CO2 varies with global temperature, we are never going to get a perfect correlation, but we are going to see a correlation. And if we understand the role the sun plays in heating, we can separate the part of the heating that is due to changes in solar radiation from the part that is due to changes in greenhouse gases. Just like with our example of force, we can determine whether a given force comes from a light object accelerating quickly or a heavy object accelerating slowly. Our knowledge of the relationship allows us to make that determination.

In social sciences the variables impacting a system can be numerous. As I’ve argued before, gun control is an extremely complex issue. The number of gun deaths depends on the types of laws we have, the number of guns available, the quality of mental health care, attitudes towards mental health care, income inequality, education (both general education and education about the use of guns), the role of fear mongering by the media and politicians, cultural attitudes towards violence and death, and probably more than that. With these types of issues it’s easy to point to all the other things that could be causes in order to argue that changing one variable isn’t going to have an impact. But neither side is completely valid here, because the argument should really be about which factors are more important and which are less important. Just because one variable is more important, however, doesn’t eliminate all other variables from having an influence. Just as in a coalition government the dominant party can lose power if the other parties combined outweigh it in votes, a dominant variable in a complex system may be outweighed by the combined importance of the other variables. For instance, when it comes to the greenhouse effect, carbon dioxide is not the only important greenhouse gas. Numerous greenhouse gases like CFCs, nitrous oxide, ozone, and methane are released as pollutants, and if industrialization continues at its current pace, the combined impact of those other gases may become similar to the impact of CO2, even though individually those gases have a very small effect. It is important to understand all the variables that are involved and address them, especially when harm is being caused to people, because even a small variable that we can fix might reduce that harm.

As the complexity of a system increases, the direct correlation between any one variable and another generally decreases. A correlation of 0.2 might be significant if there are many variables all impacting the state of a system, especially if all those variables are of similar importance. People like to keep relationships simple, but by doing so they fail to solve problems that are usually far more complex. This is also why complex systems are some of the easiest for those who don’t really understand them to use to mislead others. Climate change is a great example. A change to the climate system depends on many factors, which makes it easy for someone to emphasize one part to make their argument, like the oft-used “Carbon dioxide is necessary for plants to grow, how can more be a bad thing?” This ignores the role carbon dioxide plays in the greenhouse effect, ocean acidification, and what happens when plants decompose.
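To see why a correlation as low as 0.2 can still reflect a real influence, here is a small simulation sketch. It rests on assumptions that are mine and not taken from any real data set: 25 hypothetical factors that are independent, equally variable, and simply add up to produce the outcome. Under exactly those assumptions the expected correlation between the outcome and any single factor is 1 divided by the square root of the number of factors, which for 25 factors is 0.2.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is reproducible
n_factors = 25           # hypothetical number of equally important causes
n_samples = 5000         # hypothetical number of observations

# Each row holds one observation of all 25 independent factors.
factors = [[rng.gauss(0.0, 1.0) for _ in range(n_factors)]
           for _ in range(n_samples)]

# The outcome is the plain sum of the factors, so every factor matters equally.
outcome = [sum(row) for row in factors]

# Correlate the outcome with just one of its 25 genuine causes.
first_factor = [row[0] for row in factors]
r_single = pearson(first_factor, outcome)
r_theory = 1.0 / math.sqrt(n_factors)

print(f"measured r = {r_single:.3f}, theoretical r = {r_theory:.3f}")
```

Every factor here is a true cause of the outcome, yet none of them on its own correlates strongly with it. Real systems are rarely this tidy, so treat the numbers as an illustration only.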

Happiness is often brought about by the simple things in life, but it is also important to remember that there are lots of things happening out there that aren’t so simple. We are a part of a complex universe. Part of why we continue to survive better is that we continue to break down the complexities of the universe into things that we can understand. Also remember that just because things are complex doesn’t mean that there aren’t those who truly understand the problem, and that with patience and effort you can too if you choose.

The Nature of Correlations – Part I


The fact that the universe is complex shouldn’t surprise anyone, yet one of the things I find surprising is how vehemently people try to argue that it’s not. I have already written a blog post about correlation vs. causation, but I’d like to talk about something a little broader than that. Correlations try to demonstrate the relationship between one variable and another. It would be a sensible decision to try and correlate the amount of gun ownership with gun deaths. I don’t think anybody would dispute that there might be a relationship. And even if we do find a correlation, what does it mean? What is a good correlation, or a bad correlation?

So what can we expect out of a correlation? First we might ask, “Can any two things ever be perfectly correlated, even if we have a mathematical relationship?” The answer is no. Quite simply, for data to be perfectly correlated would require perfect measurements. Since every measurement has some error, there will always be a less than perfect correlation. The more difficult the measurement, the less likely we are to get a good correlation at all. In social sciences, measurements may depend on a survey, qualitative and/or subjective observations, and complex sampling techniques. This will all impact the correlation we calculate. The strength of a correlation ranges from 0 (no correlation) to 1 (perfect correlation), and it depends on the quality of the data and the nature of the problem to really determine what correlation is high enough for us to be confident of a relationship, but generally anything above 0.5 is significant. Correlations can also be negative, down to -1. A negative correlation implies that as one variable increases, the other decreases, and those are important relationships as well.
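The point about measurement error can be sketched with a quick simulation. Everything in it is a made-up illustration: an exact relationship y = 2x (think of the force on a hypothetical 2 kg cart at different accelerations), and Gaussian measurement error of a size I chose arbitrarily. The underlying law is perfect in every case; only the quality of the measurements changes.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical exact law: y is always exactly twice x.
true_x = [rng.uniform(0.0, 10.0) for _ in range(2000)]
true_y = [2.0 * x for x in true_x]


def measured_correlation(noise_sd):
    """Correlation after adding Gaussian measurement error to both variables."""
    noisy_x = [x + rng.gauss(0.0, noise_sd) for x in true_x]
    noisy_y = [y + rng.gauss(0.0, noise_sd) for y in true_y]
    return pearson(noisy_x, noisy_y)


r_exact = pearson(true_x, true_y)          # error-free measurements
r_small_error = measured_correlation(0.5)  # careful measurements
r_large_error = measured_correlation(5.0)  # very rough measurements

print(f"no error:    r = {r_exact:.3f}")
print(f"small error: r = {r_small_error:.3f}")
print(f"large error: r = {r_large_error:.3f}")
```

The correlation drops as the measurement error grows even though the relationship itself never changes, which is why hard-to-measure quantities tend to give weaker correlations.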


Now, what happens when we try to correlate things that have no relationship to each other? It turns out we can get all sorts of results. The graphic to the right is often cited to show the difference between correlation and causation. Two variables may appear to be correlated but have no relationship at all, even if the results are repeatable. Correlations turn out to be even a little trickier than that. Some types of data show natural variability, rising above and falling below an average. A couple of examples would be blood sugar level and atmospheric pressure. Now let’s say we correlated those two things. Let’s say I have two years of daily air pressure values, and I have sampled my blood sugar every day for 30 straight days. The smart thing to do is select the atmospheric pressure values that correspond to the days on which I took a sample and correlate them. But what might I find? Well, it could be that in those 30 days a slow-moving high pressure system moved in as we approached the end of the period, meaning that in general pressure increased. And if my diet was not so good, or I ate infrequently or irregularly, or had a few big meals one day and then light snacking the next, my blood sugar would fluctuate. I might find that there is a negative correlation between pressure and blood sugar. But what would happen if I did the experiment for another 30-day period? Depending on the specific conditions I might find a positive correlation. What I’d find over time, after doing many of these repeated tests, is that there is no correlation. The average of all the correlation calculations I perform will be essentially zero. This speaks to the importance of having a larger sample size and of checking whether results are repeatable for the same experiment. Had I taken two years’ worth of pressure data and two years’ worth of blood sugar data, I would have found essentially zero correlation.
Furthermore, even if I did find some correlation, my results would not be repeatable over many studies. The media often exaggerates findings from experiments that try to link, say, certain foods or cell phones to cancer. Variables that naturally fluctuate may randomly show a correlation in one study, while further experiments reveal none. But the media usually just picks up on the one study that found a correlation.
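The 30-day thought experiment can be played out in a simulation. This is only a sketch with invented numbers: "pressure" and "blood sugar" are stand-ins generated as two completely independent series that each drift slowly from day to day, and the persistence value of 0.8 and the 200 repeated windows are my own assumptions, not measurements.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drifting_series(n, persistence, rng):
    """Series in which each day's value partly carries over from the day before."""
    values = [0.0]
    for _ in range(n - 1):
        values.append(persistence * values[-1] + rng.gauss(0.0, 1.0))
    return values


rng = random.Random(7)  # fixed seed so the run is reproducible
window = 30             # days per experiment
n_windows = 200         # number of repeated 30-day experiments
total_days = window * n_windows

# Two unrelated quantities: neither series ever looks at the other.
pressure = drifting_series(total_days, 0.8, rng)
blood_sugar = drifting_series(total_days, 0.8, rng)

# Correlate the two series separately inside each 30-day window.
window_rs = [
    pearson(pressure[start:start + window], blood_sugar[start:start + window])
    for start in range(0, total_days, window)
]

mean_r = sum(window_rs) / len(window_rs)  # average over all experiments
strongest = max(window_rs, key=abs)       # the most "newsworthy" single result
full_r = pearson(pressure, blood_sugar)   # all of the data at once

print(f"strongest single-window r = {strongest:.2f}")
print(f"average of window r values = {mean_r:.3f}")
print(f"r over the whole record = {full_r:.3f}")
```

Individual windows can show sizeable positive or negative correlations purely by chance, while the average over many windows and the correlation over the whole record both sit close to zero. Reporting only the strongest window is the simulated equivalent of picking up on the one study that found something.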

To further illustrate this point, take a look at the following two images. The one on the left is a large sample of raw data trying to correlate two variables. If these two variables were correlated we would see the data points approximate a line. Maybe a straight one, maybe a curved one, but definitely some sort of trend would be apparent. You are right if you think that whatever these two variables are, they have no relationship to each other. Next to it is the same data set except that some of the data points are missing. Now it looks like there might be a correlation between these two variables. Sure, I have selectively chosen in this case which ones to eliminate, but it is possible that a random selection of the data points on the left could leave these same points. Thus it is possible that we might think we’ve found a correlation when we haven’t. Once again, the more data we take and the more we try to repeat the experiment, the more clearly we would show no correlation.

[Images: the full data set (left) and the same data set with selected points removed (right)]
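The same effect can be reproduced with random numbers. In this sketch the data are entirely invented: 1000 points whose two coordinates are drawn independently, so there is no relationship by construction. The rule for which points to keep (those lying near the diagonal) is my own arbitrary stand-in for selectively eliminating points.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(3)  # fixed seed so the run is reproducible

# Full data set: x and y are drawn independently, so they are unrelated.
points = [(rng.random(), rng.random()) for _ in range(1000)]

# Selective elimination: keep only the points that happen to lie near the
# diagonal y = x and throw the rest away.
kept = [(x, y) for x, y in points if abs(x - y) < 0.2]

r_full = pearson([x for x, _ in points], [y for _, y in points])
r_kept = pearson([x for x, _ in kept], [y for _, y in kept])

print(f"all {len(points)} points: r = {r_full:.3f}")
print(f"{len(kept)} selected points: r = {r_kept:.3f}")
```

The full set shows essentially no correlation, while the hand-picked subset shows a strong one. A purely random subset is very unlikely to look this extreme, but small samples can drift in this direction by chance, which is the reason for collecting more data and repeating the experiment.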

In the next part we will investigate how we hypothesize about correlations, and how complex most relationships are.