How to laugh in the face of propensity
October 20, 2016
Richard Stevens, Director MSc EBHC Medical Statistics
Sometimes colleagues ask me whether they should use “propensity scores” in their next study. I’m far from an expert in propensity score methods. I’ve very little experience in using them because I’ve never yet seen a reason to use them over older methods – except that they’re fashionable!
I ran a quick search in Scopus (see Figure 1) to see just how fast their popularity is growing.
We already have at least two ways to tackle confounding in observational studies. One of them is matching; the other is adjustment. I’m going to assume here that you are familiar with one or the other, if not both. Propensity scores are a relatively new approach: they date back to the 1980s but are increasingly high-profile in the age of big data.
Figure 1. Propensity scores by year: search terms “propensity score” in title, abstract or keywords.
Suppose we want to study whether metformin (a drug for diabetes) reduces the risk of cancer. We know that in studies using big data, people with diabetes who use metformin have lower rates of cancer than those who do not use it, but this could be attributable to confounders: age, obesity, duration and severity of diabetes, co-morbid illnesses and so on. To overcome this we could match on potential confounders; or we could adjust for potential confounders in a statistical model; or we could build a propensity score.
The idea is that for each person in the data we measure their “propensity” for being prescribed metformin. We use the potential confounders to build a statistical model that assigns, to each person, a number called their propensity score: the people with high scores are those most likely to be prescribed metformin (whether because of the length of their diabetes, or because of their body mass index, and/or some other reasons) and the people with low scores are those unlikely to be prescribed metformin (for whatever reasons). There are then several ways that a propensity score can be used to try to overcome confounding.
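To make the idea concrete, here is a minimal sketch of how such a score might be built. It assumes a logistic regression model (a common choice, but not the only one) fitted by plain gradient ascent, and it runs on an entirely made-up cohort: the covariates `bmi` and `duration`, the coefficients used to simulate prescribing, and the function names are all illustrative assumptions, not taken from any real study.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(weights, x):
    # weights[0] is the intercept; the rest pair up with the covariates.
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))


def fit_logistic(rows, treated, steps=300, lr=1.0):
    """Fit P(treated | covariates) by gradient ascent on the log-likelihood."""
    n, k = len(rows), len(rows[0])
    weights = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, t in zip(rows, treated):
            err = t - sigmoid(linear_predictor(weights, x))
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights


def propensity_scores(rows, weights):
    # One number per person: the modelled probability of being prescribed the drug.
    return [sigmoid(linear_predictor(weights, x)) for x in rows]


# Hypothetical cohort: standardised BMI and diabetes duration for 400 people.
# The "true" prescribing model below is invented purely for this illustration.
rng = random.Random(0)
rows, treated = [], []
for _ in range(400):
    bmi = rng.gauss(0, 1)
    duration = rng.gauss(0, 1)
    p_prescribed = sigmoid(-0.2 + 1.0 * bmi + 0.8 * duration)
    rows.append((bmi, duration))
    treated.append(1 if rng.random() < p_prescribed else 0)

weights = fit_logistic(rows, treated)
scores = propensity_scores(rows, weights)

user_scores = [s for s, t in zip(scores, treated) if t == 1]
nonuser_scores = [s for s, t in zip(scores, treated) if t == 0]
mean_users = sum(user_scores) / len(user_scores)
mean_nonusers = sum(nonuser_scores) / len(nonuser_scores)
print("mean propensity, users:    ", round(mean_users, 3))
print("mean propensity, non-users:", round(mean_nonusers, 3))
```

In a real analysis you would fit the model with a statistics package rather than by hand; the point here is only that the score is a fitted probability of treatment, computed from the potential confounders and not from the outcome.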
We could, for example, do a matched study by matching on propensity score. For each metformin user with a high propensity score, find a non-metformin-user who has a similarly high propensity score (there won’t be as many, but there are usually some you can find with high propensity but still no metformin). For each metformin user with a low propensity score (there won’t be so many, but you’ll still find a few), match them to a non-metformin-user who has a similarly low propensity score. Now we can study the effect of metformin by comparing the metformin users, not to all non-metformin-users, but to the non-metformin-users with similar propensity. If you still find a lower rate of cancer in the metformin users, it’s more likely to be a real effect of metformin, rather than a side-effect of other differences between the users and the non-users. When I think about it, this doesn’t seem to be very different from the matched studies that have always been done, except that now the method of matching is by propensity scoring.
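The matching step can be sketched in a few lines. This is one simple variant among many: greedy one-to-one nearest-neighbour matching without replacement, with a "caliper" (a maximum allowed difference in score). The function name, the caliper of 0.05 and the seven propensity scores are hypothetical values chosen for illustration.

```python
def match_on_propensity(scores, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity score, without replacement.

    Returns (user_index, non_user_index) pairs. A user is left unmatched if no
    remaining non-user has a score within `caliper` of theirs.
    """
    users = [i for i, t in enumerate(treated) if t == 1]
    available = {i for i, t in enumerate(treated) if t == 0}
    pairs = []
    for u in users:
        if not available:
            break
        # Closest remaining non-user; ties are broken by the lower index.
        best = min(available, key=lambda c: (abs(scores[c] - scores[u]), c))
        if abs(scores[best] - scores[u]) <= caliper:
            pairs.append((u, best))
            available.remove(best)
    return pairs


# Hypothetical propensity scores: three metformin users, four non-users.
scores = [0.90, 0.85, 0.30, 0.88, 0.33, 0.60, 0.10]
treated = [1, 1, 1, 0, 0, 0, 0]

pairs = match_on_propensity(scores, treated)
print(pairs)  # [(0, 3), (2, 4)]
```

Notice that the user with score 0.85 ends up unmatched: the only non-user with a similarly high score has already been paired. Anyone left unmatched drops out of the comparison, which is one reason the choice of caliper and matching order matters in practice.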
Alternatively, you can enter the propensity score into a statistical model for the relationship between metformin and cancer. When I first heard about this method I was surprised, but it turns out there are some carefully proven mathematical theorems that show this is just as reasonable as entering the confounders into the model individually.
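For readers who want the formal statement, the result usually cited here is due to Rosenbaum and Rubin (1983). The notation below is an assumption introduced for this sketch and is not used elsewhere in this article: T is the treatment indicator (metformin or not), X the measured confounders, Y(0) and Y(1) the outcomes a person would have without and with treatment, and e(X) the propensity score.

```latex
% Propensity score: probability of treatment given the measured confounders
e(x) = \Pr(T = 1 \mid X = x)

% Balancing property: among people with the same score,
% the confounders are distributed alike in treated and untreated
X \perp\!\!\!\perp T \mid e(X)

% If adjusting for X is enough, then adjusting for e(X) is enough
\bigl(Y(0), Y(1)\bigr) \perp\!\!\!\perp T \mid X
\ \text{ and } \ 0 < e(X) < 1
\quad\Longrightarrow\quad
\bigl(Y(0), Y(1)\bigr) \perp\!\!\!\perp T \mid e(X)
```

The premise matters: the theorem says the score is as good as the measured confounders it was built from, and no better. It offers no protection against confounders that were never measured.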
There are also more radical ways to use propensity scores (such as weighting methods). Some propensity score methods attract so much enthusiasm that I’ve even seen investigators claiming to be creating a “pseudo-randomized” study. Am I being too cynical if I think that’s a rather grand phrase to use, in a study in which no randomisation has actually taken place?
In my reading on propensity scores so far, I’ve seen many authors (here, for example, and here and here) make a theoretical argument that propensity score methods are superior to traditional ways of matching or adjusting. But I’m getting too old and cynical to believe theoretical arguments by themselves: I’d really like to see an example, a real clinical study, in which propensity matching is achieving something that old-fashioned matching does not.
In the case of metformin, a recent propensity score study found – just like the study using traditional matching – that cancer risk is lower in metformin users. But the randomized trial evidence, as far as it goes (it certainly has limitations, not least wide confidence intervals) shows no convincing evidence that metformin has any effect.
In my next article I will discuss some studies that have explicitly compared propensity score methods to traditional methods, and my search for a study that will persuade me to follow the fashion for propensity scores.
Richard Stevens is Course Director of the MSc in EBHC Medical Statistics, Oxford’s part-time course in medical statistics for clinicians and other healthcare providers.