The Centre for Evidence-Based Medicine develops, promotes and disseminates better evidence for healthcare.

May 23, 2018

*‘When did we start thinking that a single study was enough to prove a scientific hypothesis?’*

Richard Stevens, Director of the MSc in EBHC Medical Statistics

There are many things I learnt as a statistics student that I don’t teach as a medical statistics lecturer. One example is the “Central Limit Theorem”. EBHC students don’t need to know the formal conditions and mathematical proof I learnt as a maths student. But it does help to know how to communicate the result in plain English: when your study has a large sample size, you don’t need a perfect normal distribution to get valid confidence intervals and *p*-values.
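The plain-English version of the result can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the outcome is drawn from a strongly skewed exponential distribution with a true mean of 1, the interval is the usual mean plus or minus 1.96 standard errors, and the function name `ci_coverage`, the sample sizes and the number of simulated studies are all hypothetical choices rather than anything taken from the course.

```python
import random
import statistics


def ci_coverage(sample_size, n_studies=2000, true_mean=1.0, seed=1):
    """Share of simulated studies whose 95% interval contains the true mean.

    Assumption for illustration: outcomes follow an exponential
    distribution, which is strongly skewed and far from normal.
    """
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_studies):
        # expovariate takes the rate, which is 1 / mean
        sample = [rng.expovariate(1.0 / true_mean) for _ in range(sample_size)]
        mean = statistics.mean(sample)
        std_error = statistics.stdev(sample) / sample_size ** 0.5
        if mean - 1.96 * std_error <= true_mean <= mean + 1.96 * std_error:
            covered += 1
    return covered / n_studies


small = ci_coverage(sample_size=5)
large = ci_coverage(sample_size=200)
print(f"coverage with n=5:   {small:.3f}")
print(f"coverage with n=200: {large:.3f}")
```

With the large sample the coverage should land close to the nominal 95% even though the data are nowhere near normal, while the very small sample should fall noticeably short.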

However, one thing we do emphasize is the correct interpretation of *p*-values. Not all students need to know how each *p*-value is calculated, in an age when computers do the number crunching. But they do need to understand that the *p*-value for the result of a trial is the chance of getting a result “like this” (example: a difference at least this big between drug and placebo), if the null hypothesis (example: drug has no effect) is true. If the *p*-value is small (less than 5%) we have found what is referred to as a “statistically significant” effect.
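One way to make this definition concrete is a permutation test, sketched below in Python. This is an illustration, not the way trial software usually computes a *p*-value: if the drug has no effect, the labels "drug" and "placebo" are arbitrary, so we can shuffle them many times and count how often the shuffled difference is at least as big as the one observed. The function name `permutation_p_value` and the outcome values are hypothetical.

```python
import random


def permutation_p_value(drug, placebo, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Under the null hypothesis (drug has no effect) the group labels are
    interchangeable, so we relabel at random and count how often the
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_drug = len(drug)
    n_placebo = len(placebo)
    observed = abs(sum(drug) / n_drug - sum(placebo) / n_placebo)
    pooled = list(drug) + list(placebo)
    at_least_as_big = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        fake_drug, fake_placebo = pooled[:n_drug], pooled[n_drug:]
        diff = abs(sum(fake_drug) / n_drug - sum(fake_placebo) / n_placebo)
        # small tolerance so that ties are not lost to floating-point rounding
        if diff >= observed - 1e-12:
            at_least_as_big += 1
    return at_least_as_big / n_shuffles


# Hypothetical outcome measurements, invented for this example.
drug_group = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 5.4, 5.7]
placebo_group = [4.9, 5.0, 4.6, 5.2, 4.7, 5.1, 4.8, 5.3]
print(f"estimated p-value: {permutation_p_value(drug_group, placebo_group):.4f}")
```

Note what the number answers: how often a result "like this" turns up when the null hypothesis is true. It says nothing directly about how likely the null hypothesis is.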

We also insist that our students understand this distinction: the *p*-value is the probability of getting a result like this, assuming the null hypothesis is true; it is *not* the same thing as the probability that the null hypothesis is true, given that we’ve seen a result like this. When students in the medical school look at me as if this is a meaningless distinction, I say: did those two statements sound similar? They are as different as telling you that half of all Welsh people are women, and telling you that half of all women are Welsh people. Students are usually willing to accept that these two sentences are very different, even if they sound closely related. But I wonder if they understand why we statisticians place such an emphasis on the difference?
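The size of the gap between the two Welsh statements is easy to work out. The figures in this Python sketch are deliberately round, illustrative numbers, not official population statistics; only the order of magnitude matters.

```python
# Illustrative round numbers only, NOT official population statistics.
welsh_people = 3_000_000
welsh_women = 1_500_000
women_worldwide = 4_000_000_000

# "Half of all Welsh people are women"
p_woman_given_welsh = welsh_women / welsh_people

# "Half of all women are Welsh people"? Nowhere near.
p_welsh_given_woman = welsh_women / women_worldwide

print(f"P(woman | Welsh) = {p_woman_given_welsh:.4f}")
print(f"P(Welsh | woman) = {p_welsh_given_woman:.6f}")
```

Both probabilities are built from the same group of people (Welsh women), but they are divided by very different totals, which is why swapping the condition changes the answer so dramatically.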

The recent buzz in scientific journals about an alleged “replication crisis” shows how widespread this misunderstanding is. The medical publishing world seems to be very surprised that studies that achieve “statistical significance” (defining significance with a 5% threshold and using 95% confidence intervals) can’t be replicated much of the time. Did we think that, because we use a 5% threshold for statistical significance and 95% confidence intervals, 95% of studies with positive findings should be successfully replicated? Or, to put it another way – did we think that because half of all Welsh people are women, it follows that half of all women will turn out to be Welsh?

I have never seen this explained better than by Professor Alexander Bird of King’s College London. In a talk in Oxford for our EBHC seminar series, he explained with beautiful clarity that we should not expect 95% of studies to replicate, just because we are using 95% confidence intervals and a threshold of 5% for our *p*-values. He also demonstrated that this is not an issue of statistical power, a measure of whether the sample size is ‘big enough’. Alexander’s talk describes in part what David Colquhoun termed the false discovery rate: that is, the chance that a study is wrong when it states a ‘statistically significant’ discovery.
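The arithmetic behind this can be sketched in a few lines of Python. The numbers are illustrative assumptions and not necessarily those used in the talk: suppose only 10% of the hypotheses we test describe real effects (`prior`), studies have 80% power, and the significance threshold is 5%. The function names are hypothetical, and "replicates" here simply means that an identical repeat study is also statistically significant.

```python
def false_discovery_rate(prior, power, alpha=0.05):
    """Share of 'statistically significant' results for which the null is true.

    prior: assumed share of tested hypotheses that are real effects
    power: chance a study detects a real effect
    alpha: significance threshold
    """
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)


def replication_probability(prior, power, alpha=0.05):
    """Chance that an identical repeat of a significant study is also significant."""
    fdr = false_discovery_rate(prior, power, alpha)
    return (1 - fdr) * power + fdr * alpha


# Assumed for illustration: 1 in 10 tested hypotheses is a real effect.
for power in (0.8, 1.0):
    fdr = false_discovery_rate(prior=0.1, power=power)
    rep = replication_probability(prior=0.1, power=power)
    print(f"power={power:.1f}: false discovery rate={fdr:.2f}, replication={rep:.2f}")
```

Under these assumptions roughly a third of significant findings are false, and only about half would be expected to replicate. Even with perfect power the expected replication rate stays well below 95%, which is consistent with the point that power alone is not the issue.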

Alexander concluded with three possible approaches to tackling the “replication crisis”. My preferred solution is to stop thinking of it as a crisis! When did we start thinking that a single study was enough to prove a scientific hypothesis? When did we forget the importance of confirmatory studies? Confirmatory studies are fundamental to science, for the very reason that replication of the original result is far from guaranteed.

I’m delighted that when Alexander spoke to our staff and students in April, he gave us permission to record his presentation for our audio podcast series. Firstly, this talk is essential listening for anyone who takes the “replication crisis” seriously. Secondly, it’s a reminder that good statistical understanding is essential for science. Finally, this elegant explanation (from a philosopher, not a mathematician) is a perfect demonstration that good statistical understanding is not the exclusive territory of professional statisticians.

1. Prof Alexander Bird speaking to the Evidence-Based Health Care programme in Oxford, April 2018.

*Want to learn how we teach statistics and other key topics in Evidence-Based Health Care? Then join us at our annual teaching course, 10 – 13 September 2018. More details here.*

I can’t agree that it isn’t a crisis. It’s bad and it is discrediting science as a whole.

You say “Don’t ditch p-values: understand them”. The problem with that is that when you understand p values you realise that they don’t answer a relevant question.

Incidentally, it was unfortunate that I used the term “false discovery rate” in the 2014 paper to which you refer. I now call this quantity the false positive risk: see

http://rsos.royalsocietypublishing.org/content/4/12/171085

and

https://arxiv.org/abs/1802.04888

Incidentally, I think that the problem is worse than Bird says. He uses the p-less-than interpretation; I think that the p-equals interpretation is appropriate. This distinction is discussed in section 3 of the 2017 paper (first of the links, above).
