Monday, March 02, 2009

Stephen Ziliak and Deirdre McCloskey: The Cult of Statistical Significance

Stephen Ziliak and Deirdre McCloskey's The Cult of Statistical Significance is a poorly argued rant about what appears to be an important topic in the pursuit of scientific knowledge. Ziliak and McCloskey argue that many of the statistical sciences have been using the wrong metric to determine whether the results of experiments are interesting and relevant. They report on a few detailed reviews of articles in top journals in economics, psychology, and other fields to show that the problem they describe is real and pervasive. Unfortunately, they are much more interested in casting aspersions on the work and influence of Ronald Fisher and building up his colleague William Gosset than in explaining how to apply their preferred approach, which they never actually do. Amid the rant, they do manage to make the defects of Fisher's approach clear, though it's tedious reading.

The basic story is that Fisher argued that the main point of science is establishing what we know, and to that end, the important result of any scientific experiment is a clear statement of whether the results are statistically significant. According to Fisher, that tells you what confidence you should have that the results would be repeated if you ran the experiment again. Ziliak and McCloskey want you to understand that a result can be statistically significant but practically useless. And there are worse cases, where statistical significance and Fisher's approach lead scientists to hide more relevant results, or, worse, to conclude that a proposal was ineffective when the data show that a large effect might be present but the experiment failed to show that it was certain. Ziliak and McCloskey want scientists to primarily report the size of the effects they find, and their confidence in the result. To Ziliak and McCloskey, a large effect discovered in noisy data is far more important than a small effect in very clear data. They point out that with a large enough sample, every effect will be statistically significant. (Though they don't explain this point in any detail, nor give any numbers on what "large enough" means. I have an intuitive feeling for why this might be true, but this was just one of many points that weren't presented clearly.)
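The sample-size point can be made concrete with a little arithmetic. What follows is a minimal sketch of my own, not anything taken from the book: it assumes a one-sample z-test with a known standard deviation, and it assumes the observed effect is exactly a hypothetical 0.01 standard deviations at every sample size. The names two_sided_p_value, TINY_EFFECT and SAMPLE_SIZES are invented for the illustration.

```python
import math


def two_sided_p_value(effect_size, n):
    """Two-sided p-value of a z-test when the observed mean is
    effect_size standard deviations away from zero, with n samples."""
    z = effect_size * math.sqrt(n)          # test statistic grows with sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical effect: one hundredth of a standard deviation (an assumption).
TINY_EFFECT = 0.01
SAMPLE_SIZES = [100, 10_000, 100_000, 1_000_000]

# The same tiny effect, evaluated at ever larger sample sizes.
P_VALUES = {n: two_sided_p_value(TINY_EFFECT, n) for n in SAMPLE_SIZES}

for n, p in P_VALUES.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n:>9,}  p = {p:.3g}  ({verdict} at .05)")
```

Under these assumptions the effect is nowhere near significant with a hundred samples and overwhelmingly significant with a million, although its size never changes. How large "large enough" is depends entirely on the size of the effect and the noise in the data.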

They describe a few stories in detail to show the consequences for public policy. Vioxx was approved, they claim, because the tests of statistical significance allowed the scientists to fudge their results sufficiently to hide the deleterious effects. (It's not clear why this should be blamed on statistical significance rather than corruption.) They also present a case in which a study of unemployment insurance in Illinois found a large effect ($4.29 in benefit for every dollar spent), but gave the Fisherian conclusion: not just that the result wasn't statistically significant, but that there was no effect. It turned out that a careful review of the data showed that the program had a statistically significant benefit-cost ratio of $7.07 for white women, but the overall benefit-cost ratio was not statistically significant because the $4.29 was only statistically significant at the .12 level, while .05 or less is required by Fisher's followers.

Ziliak and McCloskey demonstrate that they're on the right side of the epistemological debate by supporting the use of Bayes' Law in describing scientific results, but beyond one example, they don't explain how a scientific paper should use it in presenting results. The use of Fisher's approach gives a clear guide: describe some hypotheses, perform some tests, and finally analyze the results to show which relationships are significant. With Bayes, the reasoning, approach and explanation are more complicated, but Ziliak and McCloskey don't say how to do it. Of the 29 references to Bayesian Theory in the index, 24 have descriptions like "Feynman advocates ...", or "Orthodox Fisherians oppose ...". There aren't any examples of how one might write a conclusion to a paper and show Bayesian reasoning, even though they pervasively give examples of analogous Fisherian reasoning that they find unacceptable.
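For readers who haven't seen Bayes' Law applied to a significance test, here is a small sketch of the textbook calculation. It is my own illustration, not the authors' method, and it simplifies by treating an effect as either present or absent. The prior, power and threshold values are all hypothetical, and the function name posterior_effect_is_present is made up for the example.

```python
def posterior_effect_is_present(prior, power, alpha):
    """Bayes' Law: P(effect present | test came out significant).

    prior: probability, before the experiment, that the effect is present
    power: P(significant result | effect present)
    alpha: P(significant result | no effect), the significance threshold
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# All three numbers below are assumptions chosen for illustration.
PRIOR = 0.10   # one in ten hypotheses of this kind pans out
POWER = 0.80   # the experiment detects a present effect 80% of the time
ALPHA = 0.05   # the conventional .05 threshold

POSTERIOR = posterior_effect_is_present(PRIOR, POWER, ALPHA)
print(f"P(effect present | significant) = {POSTERIOR:.2f}")
```

With these made-up inputs, a "significant" result leaves the probability that the effect is present at 0.64, well short of the 95% that a casual reading of the .05 threshold suggests. A paper reasoning this way would have to state and defend its prior, which is part of why the presentation is harder than Fisher's recipe.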

Another question that Ziliak and McCloskey argue is important (but don't explain adequately), and that statistical significance hides, is how much various treatments or alternative policy approaches might cost. Fisher's approach allows authors to publish that some proposal would have a statistically significant effect on a societal problem or the course of a disease and not mention that the cost is exorbitant and the effect small (though likely). Ziliak and McCloskey argue that journal editors should require authors to publish the magnitude of any effects and a comparison of costs and benefits. According to the reviews they've done and others they cite, it's common in top journals to omit this level of detail and to focus on whether experimental results are significantly different from zero.

Another of the authors' pet peeves is "testing for difference from zero". They claim that it's common for papers to report results as "statistically different from zero" when they're barely so. They use the epithet "sign testing" for this case. The lack of attention to the size of an effect that significance testing allows means that papers get published showing that some treatments have a positive effect on a problem, even when the effect is barely different from that of a placebo. And there are enough scientists performing enough experiments today that many treatments with no real effect will reach this level of significance purely by chance.
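The "purely by chance" arithmetic is easy to check. This is a rough sketch of my own, assuming independent experiments that each test a treatment with no effect at all against the conventional .05 threshold. The choice of 20 experiments is arbitrary, and the function names are invented for the illustration.

```python
ALPHA = 0.05  # conventional significance threshold


def expected_false_positives(num_experiments, alpha=ALPHA):
    """Average number of no-effect experiments that come out 'significant'."""
    return num_experiments * alpha


def chance_of_any_false_positive(num_experiments, alpha=ALPHA):
    """Probability that at least one of the independent no-effect
    experiments comes out 'significant'."""
    return 1.0 - (1.0 - alpha) ** num_experiments


NUM_EXPERIMENTS = 20  # hypothetical number of labs testing useless treatments
EXPECTED = expected_false_positives(NUM_EXPERIMENTS)
AT_LEAST_ONE = chance_of_any_false_positive(NUM_EXPERIMENTS)

print(f"expected false positives: {EXPECTED:.1f}")
print(f"chance of at least one:   {AT_LEAST_ONE:.2f}")
```

So among twenty experiments on treatments that do nothing, one is expected to clear the .05 bar, and the chance that at least one does is roughly 64%. If only the "significant" results get written up, the published record looks better than the treatments are.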

Overall, the book spends far too much time on personalities and politics. Even when the discussion is substantive, too much effort goes into why the standard approach is mistaken and far too little into how to do science right, or why their preferred approaches would actually lead to better science.

For the layperson trying to follow the progress of science, and occasionally to dip into the literature to make a decision about what treatment to recommend to a family member or what supplements would best enhance longevity or health, the point is that scientific papers have to be read more carefully. Ziliak and McCloskey argue that editors, even of prestigious journals, are using the wrong metrics in choosing what papers to accept, and often pressure authors to present their results in formats that aren't useful for this purpose.

When reading papers, concentrate on the size and the costs of the effects being described. Significance can be relevant, but the fact that a paper appeared in a major publication doesn't mean that the effects being described are important or useful. Don't be surprised if the most-cited papers in some area don't actually present the circumstances in which an intervention would be useful. Don't assume that all "significant" effects are relevant or strong.


Algosome said...

This isn't really hard, but it's amazing how many people don't get it. It's a two-step process: Statistical significance is an indicator of whether an effect is real, while statistical power is an indicator of potential importance. With a small sample, you may find a big effect purely by accident. If a result fails the reality criterion, you should ignore it regardless.

Unfortunately the publish-or-perish paradigm of academic advancement requires researchers to promote their results no matter how small or useless they actually are. This turns out to be effectively in collusion with the marketplace that wants to promote products even if they are totally worthless. Nobody is motivated to take the second step and differentiate products by effectiveness except the poor consumer who is not in a position to know that there is better information to be obtained, and who is usually not educated enough to understand what it would mean even if it were made available.

Caveat emptor.

Anonymous said...

The previous commentator says that "Statistical significance is an indicator of whether the effect is real." That is simply not true.

Statistical significance merely says that the probability of sampling error is low. But that is far from implying that the effect is real. A lot of other errors, such as omitted variable bias or measurement error, can keep the effect from being real. There is no alternative but to look at the size to guess if the effect is real.

Unknown said...


We're glad to get support from Algosome (and even from the original review, which is strangely harsh considering that he agrees with us: with such friends, who needs enemies?). But we beg to disagree on one important point. It's not true that statistical significance tells that an effect is real. Realness is not an on/off characteristic in a science: in math and philosophy, yes; not in history or physics. In a science we always want to know How Big. In some contexts an insignificant coefficient can be important, and very commonly a significant one (the correlation of GDP with ice cream fat content) is not important. Bigness is (you can tell from its name) quantitative. It's never qualitative, to be decided by some characteristic inhering in the number itself, independent of an exercised human judgment. And Algosome would probably agree that to always set p = .05 is therefore irrational. But Algosome is quite right that then there's a second step (having gotten a regression coefficient), which is a judgment for the particular scientific or policy purpose. It is to ask, How Big is Big?


Deirdre McCloskey