In the December 13 edition of the New Yorker, Jonah Lehrer describes a worrisome observation about science: it seems that
all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. [The Truth Wears Off]
He calls it the “decline effect.” The idea is that much of the early, positive evidence for some phenomenon is in fact a statistical anomaly. Lehrer observes that, as more studies are done, the size of many effects (pharmaceutical effectiveness, alleged psychic powers) shrinks.
At first blush this may be expected. As more studies are done, early results reveal themselves to be statistical fluctuations, and the true value comes out. So the surprise is that results researchers felt had been tested enough had not really been. This is the first problem for people who would like to believe in scientific results: (1) we’re not good enough at determining when a result is certain.
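To make the "true value comes out" intuition concrete, here is a minimal simulation sketch. Everything in it is my own assumption for illustration, not anything from Lehrer's article: the small true effect (0.1), the study size (25), the unit-variance noise, and the rule that only early estimates clearing a rough significance threshold get noticed. The point is only that results selected for being striking will, on average, shrink when repeated.

```python
import random
import statistics


def run_study(true_effect, n, rng):
    """Mean of n noisy measurements (unit-variance noise) around true_effect."""
    return statistics.fmean([rng.gauss(true_effect, 1.0) for _ in range(n)])


def simulate_decline(true_effect=0.1, n=25, n_labs=2000, seed=0):
    """Compare 'noticed' first results with their replications.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Estimate needed to clear z > 1.96 when the noise has unit variance.
    threshold = 1.96 / n ** 0.5
    first, replication = [], []
    for _ in range(n_labs):
        estimate = run_study(true_effect, n, rng)
        if estimate > threshold:  # only striking early results get attention
            first.append(estimate)
            # The same lab repeats the study with fresh data.
            replication.append(run_study(true_effect, n, rng))
    return statistics.fmean(first), statistics.fmean(replication)


if __name__ == "__main__":
    first_mean, replication_mean = simulate_decline()
    print(f"mean of noticed first results: {first_mean:.3f}")
    print(f"mean of their replications:    {replication_mean:.3f}")
```

Under these assumptions the replications cluster near the true effect while the noticed first results sit well above it, which is the decline effect in its most innocent form.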
This could point to technical problems. Perhaps our standards of statistical significance are bad. They’re certainly arbitrary: why is a p-value of 0.05 considered significant in many sciences (meaning that, even when there is no real effect, you have a 5% chance of seeing one)? Why not 0.01, or 0.000001? Or perhaps experimental scientists don’t get enough training in statistics. Lehrer quotes epidemiologist John Ioannidis on the scope of the problem. Ioannidis published a paper in PLoS Medicine called “Why Most Published Research Findings Are False” (open access here) discussing these issues. Many (most?) studies are poorly designed, such that “for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias.”
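As a rough check on what the 0.05 threshold buys you, the sketch below simulates studies of an effect that does not exist and counts how often they cross a significance cutoff anyway. The study size, the number of studies, and the use of a simple z-test with known unit variance are my assumptions for illustration; real studies are messier.

```python
import random
import statistics


def false_positive_rate(z_cutoff, n=30, n_studies=5000, seed=1):
    """Fraction of null studies whose |z| exceeds z_cutoff.

    Every study measures pure noise (true effect = 0, unit variance),
    so any 'significant' result here is a false positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = statistics.fmean(sample) * n ** 0.5  # z-score of the sample mean
        if abs(z) > z_cutoff:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    # 1.96 and 2.576 are the usual two-sided cutoffs for p < 0.05 and p < 0.01.
    print("rate at p < 0.05:", false_positive_rate(1.96))
    print("rate at p < 0.01:", false_positive_rate(2.576))
```

A stricter cutoff lowers the false positive rate, but it does nothing about the design and incentive problems Ioannidis describes.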
More than simply statistical analysis is in question here. Study design is a cultural and structural feature of scientific fields. Ioannidis summarizes the factors he sees at play:
In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance.
We can get an intuitive grasp on how much these non-statistical factors matter by asking what we would expect if the problem were wholly due to bad statistics (or bad luck). If the first 10 studies about some effect, say the effectiveness of a drug, had results that were statistical fluctuations, we would expect them to be roughly equally divided between positive (the drug helps) results and negative (the drug harms) results. But that’s not what Lehrer finds when he talks to scientists. Most often, the early results are positive ones.
This is usually explained as publication bias: journals want to publish positive findings, so that’s what gets published. Scientists also self-select null results out by not submitting them in the first place. Following Ioannidis’s work, we can ask about monetary conflicts of interest as well. It is well known that pharmaceutical companies actively suppress studies that they commission but that find null results. Philosopher and sociologist Sergio Sismondo, just up the 401 from me at Queen’s University, has done some excellent work detailing the current research situation in biomedicine (papers here). We can call this problem (2): the structure of modern science biases results.
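The contrast between pure chance and a publication filter can be sketched in a few lines. This is a toy model under my own assumptions: a drug with no effect at all, 30 patients per study, and a hypothetical filter that publishes only results that are both favourable and significant. No real journal works this mechanically.

```python
import random
import statistics


def simulate_literature(n_studies=4000, n=30, seed=2):
    """Simulate studies of a drug that truly does nothing.

    Returns (share of all studies with a positive estimate,
             number of studies passing the hypothetical publication filter,
             mean estimate among the published studies).
    """
    rng = random.Random(seed)
    all_estimates, published = [], []
    for _ in range(n_studies):
        estimate = statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        all_estimates.append(estimate)
        # Hypothetical filter: only favourable, significant results appear.
        if estimate * n ** 0.5 > 1.96:
            published.append(estimate)
    share_positive = sum(e > 0 for e in all_estimates) / len(all_estimates)
    return share_positive, len(published), statistics.fmean(published)


if __name__ == "__main__":
    share, n_published, published_mean = simulate_literature()
    print(f"share of all studies that look positive: {share:.2f}")
    print(f"studies that get published: {n_published}")
    print(f"mean effect in the published record: {published_mean:.2f}")
```

Across all studies the results split about evenly between positive and negative, as chance alone would predict. The published record, by construction, contains only positive results for a drug that does nothing.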
But there is another, more personal source of bias that runs through Lehrer’s article. Scientists want to find positive results. They may choose the statistical method that creates the best result, not the method best suited to their data; they may improperly dismiss “outlying” data; or they may unconsciously see what they want to see in the experiment itself. Lehrer and Ioannidis refer to this as “selective reporting,” but attention to the limits of scientists’ perception dates back (at least) to nineteenth-century German astronomer Friedrich Bessel. Simon Schaffer has a classic paper on the astronomer’s “personal equation” here.
And these issues have not left physics. Peter Galison has detailed how “scanning girls” were used in particle physicists’ bubble chamber experiments in the 1960s. They scanned the thousands of photographs the bubble chambers produced to identify possibly interesting physics events. Eventually computers were designed to correct the errors of identification the scanners made. At Berkeley, physicists used computer simulations to create “fake” event pictures to test how well their programs worked. And at an even higher level, the physicists created fake histograms (the form in which particle physics results are presented in journals) full of statistical fluctuation. If they couldn’t pick out the real data, a candidate for the detection of a new particle, from the crowd of 99 fakes, they couldn’t trust their result (p. 397). The lengths to which these scientists went to address this version of selective reporting speak to how important it is, and how difficult it is to deal with. This is a third problem: (3) people see what they want to see.
To recap, there are three problems for people who want to believe in the results of science here:
(1) we’re not good enough at determining when a result is certain;
(2) the structure of modern science biases results; and
(3) people see what they want to see.
Lehrer got a number of letters in response to his article, which he links to in a follow-up post here. I won’t recap his discussion, but in the end he restates his original conclusion:
The larger point, though, is that there is nothing inherently mysterious about why the scientific process occasionally fails or the decline effect occurs. As Jonathan Schooler, one of the scientists featured in the article told me, “I’m convinced that we can use the tools of science to figure this”—the decline effect—“out. First, though, we have to admit that we’ve got a problem.”
It is inevitable, I suppose, that one of the letter writers challenged Lehrer for writing something critical of science when there are creationists running amok, and so he ends on a note that justifies his writing as the first step on the road to science’s recovery. I don’t think we need to heed scaremongering about creationists when writing about science, and I certainly don’t adhere to the AA model of criticism (“I’m a scientist, and I believe in naive objectivity…”). This is especially true when one of the major lessons one can draw from Lehrer’s piece is that there is a problem with the scientific method (such as it is). It is reassuring, I suppose, that the scientists Lehrer quotes think that science will solve science’s problems here, but we need some reasons for optimism.
The decline effect is a philosophical problem … and I’ll have some more thoughts on this in part 2, to come.



I wonder how Kuhn would deal with the notion that normal science has to deal with both statistical outliers, paradigm breaking anomalies and the inherent decline of certainty over time as a result of the three points you raised.