I was listening to an interview with Charles Murray recently. For those of you who aren’t familiar with him, he co-wrote, with Richard Herrnstein, what became a controversial book, The Bell Curve. Through some careful research he showed that, on average, white people scored higher on IQ tests than black people and lower than Asians. He was labeled a Nazi and a racist, and many other things, for this research. Now I don’t want to get into this debate. I think that as a scholar his methodology was sound, but what his book doesn’t answer is the reason for such differences. Is it genetic? Is it environmental? I am sure it is more often the latter. The more important question is, what is the value of such research? Is it really doing any good, or does it just feed the bias of a racist while angering others? If you’d like to listen to the interview, I think you will find him cordial, but it’s not really the main thing I want to talk about. What I do want to talk about is mentioned several times throughout the interview, and that is what differences really mean. How should we perceive differences in IQ, or in any quality for that matter? There are certain issues that have become very taboo in our society, and things become emotionally charged quickly when one tries to talk about them. These include differences between people of different races, religions, genders, and sexual orientations. A discussion about any statistical differences between populations along those lines usually doesn’t end well for the person trying to bring them up. It’s possible that there is little value in discussing these differences, but I thought a short post to really visualize things from a statistical perspective was important, because I think people often don’t view the statistics properly when these issues come up. And that’s true not only for these “hot button” issues, but for a lot of issues in which scientists discuss differences between populations.

In many studies, particularly in the social and biological sciences, you will find that the data are spread over a range of values rather than landing on a single number. For any two variables whose relationship you are trying to pin down, the outcomes will vary from person to person. For instance, if we were trying to determine how depression influences someone’s eating habits, even in a perfect experiment we would likely find that most people eat more to comfort themselves. Perhaps a large majority would take in, say, 50% more calories than they normally would. But a small minority would take in fewer calories, perhaps 20% fewer, and another small minority would take in twice as many. When we plot the range of outcomes against the number of times each outcome occurs, we get what is called a frequency distribution. There are several types of distributions. There are skewed distributions, bimodal distributions, and then there is the normal distribution. This is generally the most common one and the easiest to say something about statistically. Traits that are shaped by many small, independent influences tend to follow something close to a normal distribution, and the shape becomes clearer as our sample size increases. I realize that I am simplifying here, but my goal is not to get deep into statistical theory, but simply to illustrate why differences between populations might be more or less meaningful.
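
To make the idea of a frequency distribution concrete, here is a minimal Python sketch. The data are simulated, not taken from any real study: the mean of +50% and the standard deviation of 30 are assumptions chosen only to echo the eating example, and `frequency_distribution` is a hypothetical helper name.

```python
import random
from collections import Counter

def frequency_distribution(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    # Floor each value to the left edge of its bin, then tally the edges.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Simulated (made-up) data: percent change in calorie intake for 1,000 people,
# drawn from a normal distribution with mean +50 and standard deviation 30.
rng = random.Random(0)
changes = [rng.gauss(50, 30) for _ in range(1000)]

bin_width = 20
dist = frequency_distribution(changes, bin_width)

# Print a rough text histogram: one '#' per 10 people in the bin.
for left_edge, count in dist.items():
    bar = "#" * (count // 10)
    print(f"{left_edge:6.0f}% to {left_edge + bin_width:4.0f}%: {bar}")
```

The printed bars should pile up around the middle and thin out toward both ends, which is the bell shape described in this section.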

In a normal distribution the most common occurrence (the mode of the distribution) is the mean, and it is the value you get at the middle and tallest part of the curve. First, let’s ask how useful this is to begin with. Because the curve is symmetric, the mean is also the middle value (the median): half of the people lie above it and half lie below. So when we look at the means of two different populations we might see an overlap as illustrated here:

Despite the different averages, we can see that much of the two populations spans the same range of values. The source for this graph discusses the meaning of overlapping distributions in more detail. A more specific example is here:

This graph comes from an interesting discussion about differences between populations of men and women. In this example we can see that the average heights of women and men are different, but of course no one would say that any given man will be taller than any given woman. What this means is that, if we are talking about people, there is very little we can assume a priori when meeting any individual member of a group. We can only say that this is how things are on average, and then decide whether anything should be done, or can be done, about it if we want those averages to be the same.
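
The height example can be put into rough numbers. The sketch below assumes both populations are normally distributed and uses illustrative means and standard deviations that I picked for the example; they are not read off the graph. It uses `NormalDist` from Python’s standard `statistics` module, and `prob_first_exceeds_second` is a hypothetical helper name.

```python
from math import hypot
from statistics import NormalDist

def prob_first_exceeds_second(a, b):
    """P(X > Y) for independent X ~ a and Y ~ b, both normal."""
    # X - Y is itself normal, with mean a.mean - b.mean and
    # standard deviation sqrt(a.stdev**2 + b.stdev**2).
    difference = NormalDist(a.mean - b.mean, hypot(a.stdev, b.stdev))
    return 1 - difference.cdf(0)

# Illustrative assumptions only: heights in centimetres.
women = NormalDist(mu=163, sigma=6.5)
men = NormalDist(mu=176, sigma=7.0)

p_woman_taller = prob_first_exceeds_second(women, men)
shared_area = women.overlap(men)  # area common to both curves, between 0 and 1

print(f"Chance a random woman is taller than a random man: {p_woman_taller:.1%}")
print(f"Overlap between the two curves: {shared_area:.1%}")
```

Even with a clear gap between the averages, the chance of a randomly chosen woman being taller than a randomly chosen man is well above zero under these assumptions, which is the point about not judging an individual by a group average.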

Averages are talked about far more than perhaps they should be. While an average is a good summary of data, frequently the devil is in the details, and we can say little that is concrete from averages alone. Researchers themselves rarely focus so narrowly in the statistical analysis they do, but I think much gets lost when a journalist tries to report on the findings. The average, being the easiest statistic to understand, is also the easiest to report on, and that’s when people start making assumptions about what the data are actually saying. Read an actual paper and you will find all sorts of other statistics discussed. Averages are all too common though. We get them in school, and they are reported in sports statistics and in the news. But an average has to be put in the context of the entire set of data. Let’s not define people by an average. Equally relevant is the variance within the population.
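
A tiny sketch shows why the average alone can mislead. The two groups below are made-up numbers, chosen so that they share the same mean while differing greatly in spread.

```python
from statistics import mean, pstdev

# Made-up scores: both groups average 50, but B is far more spread out.
group_a = [48, 49, 50, 51, 52]
group_b = [10, 30, 50, 70, 90]

for name, scores in (("A", group_a), ("B", group_b)):
    print(
        f"Group {name}: mean = {mean(scores)}, "
        f"standard deviation = {pstdev(scores):.1f}, "
        f"range = {min(scores)} to {max(scores)}"
    )
```

Reporting only the mean would make these two groups look identical, while the standard deviation and range show how different they are.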

People are often easily fooled by statistics because they don’t understand them adequately. Statistics also deals with probabilities, something we are terrible at on an intuitive level. If you are interested in having a better understanding of basic statistics, I found this website to be quite helpful. I believe that by having a better understanding of statistics we can have more meaningful reactions to the findings of data analysis, and thus have more meaningful discussions about what we can really conclude from those data.