**Researchers must be wary of the common mistakes of correlation analysis when drawing conclusions about the nature of their data.**

By Vladica M. Velickovic | July 31, 2013

Statistics are the basis of scientific data analysis, and with the flood of data coming from new genomics technologies, biostatistics has truly become an inseparable part of modern science. Nevertheless, a fundamental statistical technique—correlation analysis, which measures the relationship between two variables—is often employed incorrectly, leading to erroneous conclusions about the true nature of the relationship between the studied phenomena.

The primary task of correlation analysis is to test for a relationship, or agreement, between two variables of interest (say, smoking and the incidence of lung cancer). Furthermore, provided that the study was carried out on a sufficiently large sample, the degree of correlation between the observed phenomena can be roughly assessed, quantified as the linear correlation coefficient.

This coefficient must then be interpreted and critically analyzed, as correlation analysis does not aim to explain the nature of the quantitative agreement, in other words, the causal relationship between the two variables. In addition to assuming causality, researchers commonly fall victim to two other misconceptions: inferring the nature of individuals from group-level findings, and assuming that a correlation of zero implies independence. Each of these errors can lead to erroneous conclusions.
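As a concrete illustration of the coefficient discussed above, here is a minimal Python sketch (using numpy; the data are invented purely for illustration, not taken from any study) that computes Pearson's linear correlation coefficient for two variables:

```python
import numpy as np

# Invented illustrative data (not from any study): daily hours of exercise
# and resting heart rate for seven hypothetical subjects.
exercise = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
heart_rate = np.array([78.0, 74.0, 72.0, 69.0, 66.0, 64.0, 61.0])

# Pearson's r: the covariance of the two variables, normalized by the
# product of their standard deviations. Ranges from -1 to +1.
r = np.corrcoef(exercise, heart_rate)[0, 1]
print(round(r, 3))  # strongly negative, close to -1
```

A value near +1 or -1 indicates strong linear agreement; everything that follows in this article is about why even a value near 1 must still be interpreted with care.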

**Misconception #1: Correlation implies causality**

Every scientist knows that “correlation does not imply causation.” Indeed, two variables may incidentally show the same tendency of quantitative variability without any logical or natural relationship between them. Alternatively, two variables may trend together because they are influenced by the same confounding factors, which drive the changes in both. Nevertheless, the inappropriate assumption of causality remains the biggest source of error in interpreting the results of correlation analysis.

In 2008, for example, the *Journal of Pediatrics* published a study in which the authors concluded that eating breakfast can solve the problem of teenage obesity, based simply on the fact that teenagers who eat breakfast are less likely to be obese. Although the correlation the authors found is consistent with a causal link, it is unlikely that eating breakfast by itself prevents teenage obesity. More likely, a common cause, such as poverty, lies behind both phenomena, with no direct relationship between them.

Similar misinterpretations of the correlation coefficient are common in the epidemiological literature. One group of researchers, for example, found a correlation between women taking combined hormone replacement therapy (HRT) and a lower-than-average incidence of coronary heart disease (CHD), and concluded that HRT lowered the risk of CHD. Randomized controlled trials later found the contrary: HRT increased the risk of CHD. The lower-than-average incidence of CHD turned out to reflect the higher average socioeconomic status of the women taking HRT, not the therapy itself.
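The confounding mechanism behind cases like the HRT example can be demonstrated with a toy simulation. The sketch below (hypothetical variables only, not the actual HRT data) generates an "exposure" and an "outcome" that are both driven by a shared factor such as socioeconomic status; the raw correlation between them is strong even though neither causes the other, and it vanishes once the confounder is adjusted for:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder (e.g., socioeconomic status) drives both variables.
ses = rng.normal(size=n)
exposure = ses + rng.normal(scale=0.5, size=n)   # neither variable
outcome = -ses + rng.normal(scale=0.5, size=n)   # causes the other

# The raw correlation looks strongly negative purely via the confounder.
raw_r = np.corrcoef(exposure, outcome)[0, 1]

def residual(y, x):
    """Part of y left over after removing the linear effect of x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# After partialling out the confounder, the association disappears.
adj_r = np.corrcoef(residual(exposure, ses), residual(outcome, ses))[0, 1]
print(round(raw_r, 2), round(adj_r, 2))
```

This partial-correlation adjustment only works, of course, when the confounder has been measured; in the published studies discussed above, it had not been accounted for.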

Studies containing this type of error are published even in leading biomedical journals. For example, a 1999 *Nature* study found a strong association between myopia, or near-sightedness, and night-time ambient light exposure during sleep in children. The authors concluded that it would be prudent for infants and young children to sleep at night without artificial lighting in the bedroom. A later study refuted these findings and reported that, in this case, the cause of myopia was genetic, not environmental, as many of the study participants’ parents also suffered from the condition.

Of course, the fact that “correlation does not imply causation” should not lead to the diametrically opposite conclusion that correlation can never point to causality. A correlation, especially a high linear correlation coefficient, may well signal the existence of a causal relationship, but confirming that relationship requires systematic examination.

**Misconception #2: Individuals follow the group**

It is not always possible to make inferences about the nature of individuals from information about the group to which those individuals belong. Many researchers do make such assumptions, however, thereby falling victim to the ecological inference fallacy.

One example of the ecological inference fallacy is a 2012 paper in the *New England Journal of Medicine*: the study author found a close and significant linear correlation between chocolate consumption per capita and the number of Nobel laureates per 10 million persons across a total of 23 countries. On the basis of this finding, he concluded that chocolate consumption enhances cognitive function. But without accurate data at the individual level, it is impossible to draw such a conclusion: it was unknown whether, or how much, the Nobel laureates themselves consumed chocolate.
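The ecological fallacy can be made vivid with a small simulation. In this invented setup (hypothetical groups, not the chocolate data), one group-level factor drives the average of both variables in each group, so the group aggregates correlate almost perfectly, yet for any individual within a group the two variables are generated independently:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 20 groups (e.g., countries). Each group's mean level
# of x and y is set by a single group-level factor, but within a group the
# two variables are drawn independently for each of 50 individuals.
group_level = rng.normal(size=20)
x_groups = [m + rng.normal(scale=0.2, size=50) for m in group_level]
y_groups = [m + rng.normal(scale=0.2, size=50) for m in group_level]

# Correlation between the group aggregates is very strong...
x_means = np.array([g.mean() for g in x_groups])
y_means = np.array([g.mean() for g in y_groups])
agg_r = np.corrcoef(x_means, y_means)[0, 1]

# ...but at the individual level, within groups, there is essentially none.
x_within = np.concatenate([g - g.mean() for g in x_groups])
y_within = np.concatenate([g - g.mean() for g in y_groups])
within_r = np.corrcoef(x_within, y_within)[0, 1]
print(round(agg_r, 2), round(within_r, 2))
```

An aggregate correlation near 1 alongside an individual-level correlation near 0 is exactly the pattern that makes country-level findings unsafe to project onto individuals.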

**Misconception #3: A correlation of zero implies independence**

Based on the previous two examples, it is clear that a high linear correlation coefficient is not by itself sufficient to establish a relationship between two variables. Conversely, a correlation coefficient of zero does not mean that the variables are independent. That is because the correlation coefficient measures linear association only. A U-shaped, non-monotonic relationship, such as the dose-response relationship in steroid hormone receptor-mediated gene expression, may have a correlation of zero.
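The zero-correlation-despite-total-dependence case is easy to verify numerically. In the sketch below, y is completely determined by x through a U-shaped (quadratic) relationship sampled symmetrically around zero, and the linear correlation coefficient still comes out as essentially zero:

```python
import numpy as np

# Deterministic U-shaped relationship: y is fully determined by x,
# yet the *linear* correlation coefficient is essentially zero because
# positive and negative x contribute equal and opposite covariance.
x = np.linspace(-3, 3, 101)   # symmetric around zero
y = x ** 2                    # perfect non-linear dependence

r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-8)
```

This is why a near-zero Pearson coefficient should prompt a look at a scatter plot, or at a rank-based or non-linear association measure, before declaring two variables unrelated.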

**Conclusion**

The proper and correct use of biostatistical methods requires not only adequate training in biostatistics but also continuing education in the field. In that regard, trained biostatisticians should be involved in research from the very beginning, not brought in after the measurements, observations, or experiments are completed.

*This blog originally appeared at TheScientist, authored by Vladica M. Velickovic*
