The Centre for Evidence-Based Medicine develops, promotes and disseminates better evidence for healthcare.

September 21, 2018

**Looking for the truth about sample size in clinical trials.**

“The average power of a clinical trial is only 9%, according to a recent analysis.”

Richard Stevens, Course Director, M.Sc. in EBHC Medical Statistics

I always keep an eye out for papers by Paul Glasziou; this month, that led me to a remarkable paper by Herm Lamberink and colleagues, looking at sample size and statistical power (whether the sample size is big enough) in clinical trials from 1975 to 2014. The scope of this study is epic. The authors used resources from the Cochrane Collaboration to study over 100,000 trials – aiming for every trial that has appeared in a Cochrane review from the relevant time range.

The scale of the project is impressive, and an example of the kind of sweeping, comprehensive meta-research that has become possible in an era of large databases and plentiful computing power. But what really caught my eye were the findings. The authors estimate that the average power of a clinical trial is only 9%.

When teaching statistics to clinicians on our M.Sc. courses, we describe statistical power as “the chance that a clinical study succeeds in finding an effect”, or more strictly, the chance that it finds an effect, if we are right about the effect. In the Study Design course we emphasize that we can only estimate power; and we can only estimate the power while making an assumption about the effect. So, for example, if we design an intervention study to have 85% power, when we unpack the jargon we mean that *if* we have correctly guessed the true effect of the intervention, *then* there is an 85% chance that the study will succeed in demonstrating it. That means that there is still a 15% chance that the study will “fail”, ending with an inconclusive result, even if we are right about the effect size.
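The arithmetic behind a statement like “85% power, if we have guessed the effect correctly” can be made concrete. Below is a minimal sketch of a power calculation for a two-arm comparison of means, using the standard normal approximation; the effect size and standard deviation are assumed values for illustration, not taken from any particular trial:

```python
from statistics import NormalDist

def power_two_sample(n_per_arm, effect, sd, alpha=0.05):
    """Approximate power of a two-arm trial comparing means.

    Normal approximation: power = Phi(effect / SE - z_crit), where SE is
    the standard error of the difference in means and z_crit is the
    two-sided critical value at significance level alpha.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)        # e.g. 1.96 for alpha = 0.05
    se = sd * (2.0 / n_per_arm) ** 0.5        # SE of the difference in means
    return nd.cdf(effect / se - z_crit)

# Illustrative only: 100 patients per arm, assumed true effect of 0.4 SD units.
print(power_two_sample(100, effect=0.4, sd=1.0))   # roughly 80% power
```

Note that the answer depends entirely on the assumed `effect`: halve it and the same 100 patients per arm give far less than 80% power, which is exactly the sense in which we can only *estimate* power.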

The method of Lamberink and colleagues was to wait until a systematic review had been conducted on each intervention. Then, taking the result of each systematic review to be the true effect size, they looked back at the individual trials in each review to ask: was the sample size big enough? Did the sample size, in that trial, give a high chance (power) that the study would succeed? (There are great pitfalls in carrying out a power calculation retrospectively, but this method is safe enough.)
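The retrospective step can be sketched as follows: take the review’s pooled effect as the truth, then recompute each trial’s power at its actual sample size. The sketch below uses made-up trial sizes and a made-up pooled effect – nothing here is taken from the paper – purely to show why small trials come out with very low power under a modest pooled effect:

```python
from statistics import NormalDist, median

def power_two_sample(n_per_arm, effect, sd, alpha=0.05):
    """Normal-approximation power for a two-arm comparison of means."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sd * (2.0 / n_per_arm) ** 0.5
    return nd.cdf(effect / se - z_crit)

# Hypothetical review: pooled effect of 0.2 SD units; invented arm sizes.
pooled_effect = 0.2
trial_sizes = [15, 40, 60, 120, 400]
powers = [power_two_sample(n, pooled_effect, sd=1.0) for n in trial_sizes]
print([round(p, 2) for p in powers])
print(round(median(powers), 2))   # well below the conventional 80% target
```

With a small assumed true effect, only the largest of these hypothetical trials reaches conventional power, which is the pattern the paper reports at scale.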

When designing a study we usually aim for a power of 80%, 90% or even 95%. Of course we want the chance that a study succeeds to be high, but the benefits of larger sample sizes have to be traded off against cost and practical considerations. If Lamberink and colleagues are correct that the average power is only 9%, then most studies are doomed to “failure” – or at least, doomed to an inconclusive result. The estimate from their secondary analysis is 20%, which I happen to think is more relevant – see the paper for details – but that is still very low.
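The design-stage trade-off mentioned above is just the power calculation run in reverse: fix a target power and solve for the sample size. A minimal sketch under the same normal approximation, again with illustrative numbers rather than values from any real trial:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect, sd, power=0.80, alpha=0.05):
    """Approximate sample size per arm for a two-arm comparison of means.

    Standard normal-approximation formula:
        n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    Illustrative sketch, not a substitute for a proper design calculation.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = nd.inv_cdf(power)            # quantile for the target power
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)

# Assumed effect of 0.4 SD units: the price of extra power is visible directly.
print(n_per_arm(0.4, 1.0, power=0.80))   # 80% power
print(n_per_arm(0.4, 1.0, power=0.95))   # 95% power needs many more patients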

Should we be worried that so many trials have such low power? After a discussion with my colleagues in the statistics team here, I’m not as worried as the authors. Lamberink et al. argue that, ideally, every trial should be interpretable in isolation. Perhaps that’s the ideal, but I don’t think it’s achievable. My colleagues from the trials unit point out that it is perfectly reasonable for phase I and phase II clinical trials to have low power for many outcomes, while still being an essential step towards safely designing the definitive trial. Similarly, my own experience is that it is often a good use of resources to conduct a pilot trial or a feasibility study, in a modest number of patients, before asking a funder to commit to the huge cost of a fully powered trial. It is also possible that some studies are well powered for one clinical outcome but not for another, which may be the outcome in the review. The authors allude to this briefly in the paper, but are not convinced. I think it is entirely possible that a study has been designed with adequate power for a continuous outcome such as blood pressure, but not for a “clinical event” outcome such as heart attack or stroke. Of course, this example also illustrates that continuous outcomes are often less directly relevant to patients than hard clinical events.

Another summary of the data in the Lamberink paper is that 12% of studies (one in eight; taken from row 2 of Table 1) have power above 80%. Perhaps, in the light of my colleagues’ comments about early-phase studies, pilot studies and secondary endpoints, that isn’t so bad.

*Richard Stevens is director of the M.Sc. in EBHC Medical Statistics, Oxford’s part-time M.Sc. in statistics for people in full-time clinical practice.*

While there’s nothing wrong with a low-powered study per se, there is clearly an enormous disconnect between the measured power of trials and the frequency with which positive results are being reported. If we accept an environment of trials with 9% power analysed at 95% confidence, we guarantee that a large portion of reported significant outcomes are in fact spurious. This is compounded by issues with the regulatory approval process at the FDA, whereby drugs come to market based on interim and subgroup analyses of surrogate outcomes. Mathematically these trials may be sound, but in the context of our current systems this is extremely hazardous.
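The arithmetic behind that worry can be made explicit with the standard positive-predictive-value calculation: if only 9% of truly effective interventions reach significance, while 5% of ineffective ones do so by chance, the proportion of significant results that are genuine depends on the prior probability that a tested hypothesis is true. The sketch below assumes, purely for illustration, that half of tested hypotheses are true – a value not taken from the paper or the comment:

```python
def positive_predictive_value(power, alpha, prior):
    """Fraction of statistically significant results reflecting a real effect.

    prior is the assumed probability that a tested effect is real;
    true positives come from real effects detected at the given power,
    false positives from null effects crossing the alpha threshold.
    """
    true_pos = prior * power
    false_pos = (1 - prior) * alpha
    return true_pos / (true_pos + false_pos)

# With 9% average power, alpha = 0.05, and an assumed 50% prior:
ppv = positive_predictive_value(power=0.09, alpha=0.05, prior=0.5)
print(round(ppv, 2))   # well short of certainty: a third-plus are spurious
```

Even under this fairly generous prior, roughly a third of significant findings would be false positives; with a less favourable prior, the spurious fraction grows further, which is the point the comment is making.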