What does it all ‘mean’?

I was listening to an interview with Charles Murray recently.  For those of you who aren’t familiar with him, he is the person who wrote what became a controversial book, The Bell Curve.  Through some careful research he showed that white people had a higher average IQ than black people, and a lower average IQ than Asians.  He was labeled a Nazi, a racist, and many other things for this research.  Now I don’t want to get into that debate; I think that as a scholar his methodology was sound, but his book doesn’t answer the reason for such differences.  Is it genetic?  Is it environmental?  I am sure it is more often the latter.  The more important question is, what is the value of such research?  Is it really doing any good, or does it just feed the bias of a racist while angering others?  If you’d like to listen to the interview, I think you will find him cordial, but the interview itself isn’t really the main thing I want to talk about, although the thing I do want to talk about comes up several times throughout it: what differences really mean.

How should we perceive differences in IQ, or any quality for that matter?  There are certain issues that have become very taboo in our society, and things become emotionally charged quickly when one tries to talk about them.  These include differences between people of different races, different religions, different genders, and different sexual orientations.  A discussion about any statistical differences between populations along those lines usually doesn’t end well for the person trying to bring them up.  It’s possible that there is little value in discussing these differences, but I thought a short post to really visualize things from a statistical perspective was worthwhile, because I think people often don’t view the statistics properly when these issues come up.  And that’s true not only for these “hot button” issues, but for a lot of issues in which scientists discuss differences between populations.

For many studies, particularly in the social and biological sciences, you will find that the data are distributed.  For any two variables whose relationship you are trying to establish, the outcomes will range across a particular set of values.  For instance, if we were trying to determine how depression influences someone’s eating habits, even in a perfect experiment we would likely find that most people eat more to comfort themselves.  Perhaps a large majority would take in, say, 50% more calories than they normally would, but a small minority would take in fewer calories, perhaps 20% less, and another small minority would take in twice as many.  This is called a frequency distribution: we plot the range of outcomes against the number of times each outcome occurs.  There are several types of distributions.  There are skewed distributions, bimodal distributions, and then there is the normal distribution.  This is generally the most common one and the easiest to say something about statistically.  As our sample size increases, the distribution of outcomes for two variables that really are related should get closer and closer to a normal distribution.  I realize that I am simplifying here, but my goal is not to get deep into statistical theory, but simply to illustrate why differences between populations might be more or less meaningful.
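To make that concrete, here is a minimal sketch in Python (using numpy and matplotlib) that builds a frequency distribution for the hypothetical depression-and-eating example.  Every number in it is invented purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data: percent change in calorie intake for a hypothetical
# sample of depressed subjects. The +50% center and the spread are
# made-up numbers chosen only to mirror the example in the text.
rng = np.random.default_rng(seed=42)
calorie_change = rng.normal(loc=50, scale=35, size=1000)

# A frequency distribution: the range of outcomes plotted against
# how many times each outcome occurs.
plt.hist(calorie_change, bins=40)
plt.xlabel("Change in calorie intake (%)")
plt.ylabel("Frequency")
plt.title("A roughly normal frequency distribution")
plt.show()
```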

In a normal distribution the most common value (the mode of the distribution) is also the mean, and it sits at the middle and tallest part of the curve.  First let’s ask how useful this is to begin with.  Because in a normal distribution the mean is also the median, half of the people lie above it and half lie below.  So when we look at the means of two different populations we might see an overlap, as illustrated here:

Despite the different averages, we can see that much of the two populations spans the same range of values.  The source for this graph discusses the meaning of overlapping means in more detail. A more specific example is here:

This graph comes from an interesting discussion about differences between populations of men and women.  In this example we can see that the average heights of women and men are different, but of course no one would say that any given man will be taller than any given woman. What this means is that when we are talking about people, there is very little we can assume a priori on meeting any individual member of a group.  We can only say how things are on average, and then decide whether anything should be done about it, or can be done about it, if we want those averages to be the same.
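To put a rough number on that overlap, we can ask how often a randomly chosen man is actually taller than a randomly chosen woman.  The means and standard deviations below are assumed, approximate figures, not measurements from the graph:

```python
from math import sqrt
from scipy.stats import norm

# Rough illustrative parameters in cm; real values vary by population.
mean_men, sd_men = 175.0, 7.0
mean_women, sd_women = 162.0, 6.5

# If the two heights are independent and normal, their difference is
# normal with mean (175 - 162) and variance (7^2 + 6.5^2).
z = (mean_men - mean_women) / sqrt(sd_men**2 + sd_women**2)
print(f"P(random man taller than random woman) = {norm.cdf(z):.2f}")
# Prints roughly 0.91: despite the clearly different averages, about
# one random pairing in eleven has the woman as tall or taller.
```

Even with clearly different averages, about one random pairing in eleven goes the “wrong” way, which is exactly why the average tells you so little about the individual in front of you.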

Averages are talked about far more than perhaps they should be.  While an average is a good summary of data, frequently the devil is in the details, and we can say little that is concrete from averages alone.  Rarely do researchers themselves focus so narrowly on the statistical analysis they do, but I think much gets lost when a journalist tries to report on the findings.  The average, being the easiest statistic to understand, is the easiest to report on, and that’s when people start making assumptions about what the data are actually saying.  Read an actual paper and you will find all sorts of other statistics discussed.  Averages are all too common, though.  We get them in school, in sports statistics, in the news.  But an average has to be put in the context of the entire set of data.  Let’s not define people by an average.  The variance within the population is equally relevant.
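Here is a quick sketch of why the variance matters as much as the mean.  The two groups below are invented and deliberately share the same average:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two invented groups with the same average but very different spread.
group_a = rng.normal(loc=100, scale=5, size=10_000)   # tightly clustered
group_b = rng.normal(loc=100, scale=25, size=10_000)  # widely scattered

for name, g in (("A", group_a), ("B", group_b)):
    print(f"Group {name}: mean={g.mean():6.1f}  std={g.std():5.1f}  "
          f"min={g.min():6.1f}  max={g.max():6.1f}")
# Both means come out near 100, but a random member of group B can sit
# very far from 100; the average alone hides that completely.
```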

People are often easily fooled by statistics because they don’t understand them adequately.  Statistics also deals with probabilities, something we are terrible at on an intuitive level.  If you are interested in gaining a better understanding of basic statistics, I found this website to be quite helpful.   I believe that by understanding statistics better we can have more meaningful reactions to the findings of data analysis, and thus more meaningful discussions about what we can really conclude from those data.

The Nature of Correlations – Part I


The fact that the universe is complex shouldn’t surprise anyone, yet one of the things I find surprising is how vehemently people try to argue that it’s not. I have already written a blog post about correlation vs. causation, but I’d like to talk about something a little broader than that. Correlations try to demonstrate the relationship between one variable and another. It would be a sensible decision, for instance, to try to correlate the amount of gun ownership with gun deaths; I don’t think anybody would argue that there couldn’t be a relationship there. But even if we do find a correlation, what does it mean? What is a good correlation, or a bad one?

So what can we expect out of a correlation? First we might ask, “Can any two things ever be perfectly correlated, even if there is an exact mathematical relationship between them?” The answer is no. Quite simply, for data to be perfectly correlated would require perfect measurements. Since every measurement always has some error, there will always be a less-than-perfect correlation. The more difficult the measurement, the less likely we are to get a good correlation at all. In the social sciences, measurements may depend on surveys, qualitative and/or subjective observations, and complex sampling techniques. All of this will impact the correlation we calculate. Correlation coefficients range in magnitude from 0 (no correlation) to 1 (perfect correlation), and it depends on the quality of the data and the nature of the problem what correlation is high enough to convince us of a relationship, but generally anything above 0.5 is considered substantial. Correlations can also be negative: a negative correlation implies that as one variable increases, the other decreases, and those are important relationships as well.
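A small simulation makes the measurement-error point concrete. The underlying relationship below is exact (y = 2x + 1), and the noise levels are arbitrary choices of mine; watch the correlation coefficient fall as the measurement error grows:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=500)

# The underlying relationship is exact: y = 2x + 1. We then add
# measurement error of increasing size and watch the correlation fall.
for noise_sd in (0.0, 2.0, 10.0):
    y = 2 * x + 1 + rng.normal(0, noise_sd, size=x.size)
    r = np.corrcoef(x, y)[0, 1]
    print(f"measurement error sd = {noise_sd:4.1f}  ->  r = {r:.3f}")
# Even a perfect mathematical relationship yields r < 1 once the
# measurements themselves are noisy.
```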

From http://www.venganza.org

Now, what happens when we try to correlate things that have no relationship to each other? It turns out we can get all sorts of results. The often-cited graphic to the right is frequently used to show the difference between correlation and causation.  Two variables may appear to be correlated but have no relationship at all, even if the results are repeatable.  Correlations turn out to be even a little trickier than that. Some types of data show natural variability, rising above and dipping below their average. A couple of examples would be blood sugar and atmospheric pressure. Now let’s say we correlated those two things. Say I have two years of daily air pressure values, and I have sampled my blood sugar every day for 30 straight days. The smart thing to do is select the atmospheric pressure values that correspond to the days I took the samples and correlate them. But what might I find?

Well, it could be that in those 30 days a slow-moving high pressure system moved in as we approached the end of the period, meaning that in general pressure increased. And perhaps my diet was not so good, or I ate infrequently, irregularly, or had a few big meals one day and then light snacking the next, so my blood sugar fluctuated. I might find a negative correlation between pressure and blood sugar. But what would happen if I did the experiment for another 30-day period? Depending on the specific conditions I might find a positive correlation. What I’d find over time, after doing many of these repeated tests, is that there is no correlation: the average of all the correlation calculations I perform will be essentially zero. This speaks to the importance of having a larger sample size and of whether results are repeatable for the same experiment. Had I taken two years’ worth of pressure data and two years’ worth of blood sugar data, I would have found zero correlation. Furthermore, even if I did find some correlation, my results would not be repeatable over many studies. The media often exaggerates findings from experiments that try to link, say, certain foods or cell phones to cancer. Variables that naturally fluctuate may randomly show a correlation, but further experiments reveal none. The media usually just picks up on the one study that found a correlation.
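That thought experiment is easy to simulate. The sketch below generates two slowly wandering but completely independent series as crude stand-ins for pressure and blood sugar (the mean-reverting model and its parameters are my assumptions, not real data) and then correlates 30-day chunks of them:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def wandering_series(n, phi=0.9):
    """A mean-reverting series that drifts slowly, a crude stand-in for
    things like air pressure or blood sugar: each day's value strongly
    resembles the previous day's, plus some noise."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

days = 730  # two years of daily values
pressure = wandering_series(days)
blood_sugar = wandering_series(days)  # generated independently: no real link

# Correlate consecutive 30-day chunks, the way a short study would.
window_rs = [np.corrcoef(pressure[i:i + 30], blood_sugar[i:i + 30])[0, 1]
             for i in range(0, days - 30, 30)]
print(f"30-day windows: r from {min(window_rs):+.2f} to {max(window_rs):+.2f}, "
      f"mean {np.mean(window_rs):+.2f}")
print(f"full two years: r = {np.corrcoef(pressure, blood_sugar)[0, 1]:+.2f}")
# Short windows routinely show strong positive *and* negative
# correlations by chance; averaged over many windows, and over the
# full record, the correlation collapses toward zero.
```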

To further illustrate this point, take a look at the following two images.  The one on the left is a large sample of raw data, an attempt to correlate two variables.  If these two variables were correlated we would see the data points approximate a line: maybe a straight one, maybe a curved one, but definitely some sort of trend would be apparent.  You are right if you think that, whatever these two variables are, they have no relationship to each other.  Next to it is the same data set, except that some of the data points are missing.  Now it looks like there might be a correlation between the two variables.  Sure, in this case I have selectively chosen which points to eliminate, but it is possible that a random selection of data points could have produced this same subset.  Thus it is possible to think we’ve found a correlation when we haven’t.  Once again, the more data we take and the more we try to repeat the experiment, the more clearly we would show no correlation.

[Figure: the full scatter plot of two unrelated variables (left), and the same data with points removed so that an apparent trend emerges (right)]
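Here is a simulated version of those two images. The full data set is pure noise; deleting the right points manufactures a “correlation.” The deletion rule below is deliberately cherry-picked, mirroring the selective removal described above:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Two genuinely unrelated variables: a featureless cloud of points.
x = rng.uniform(0, 10, size=200)
y = rng.uniform(0, 10, size=200)
print(f"full sample (n = 200):  r = {np.corrcoef(x, y)[0, 1]:+.2f}")

# Cherry-pick: keep only the points that happen to fall near the
# diagonal, as if those were the only data we had collected.
keep = np.abs(y - x) < 2
print(f"subsample (n = {keep.sum()}):  "
      f"r = {np.corrcoef(x[keep], y[keep])[0, 1]:+.2f}")
# The subsample shows a strong correlation that the full data set
# flatly contradicts; more data and replication expose the illusion.
```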

In the next part we will investigate how we hypothesize about correlations, and how complex most relationships are.