How to laugh in the face of propensity (part two)
November 3, 2016
A/Prof Richard Stevens
In my previous blog post I discussed the booming popularity of propensity score methods, and the claims made for them by enthusiasts. On the other hand, I also commented on their similarities to older ways to control for confounding. I mentioned that in an example that interests me – metformin and risk of cancer – a propensity score study reached similar conclusions to other observational studies.
So are propensity score methods really any better than existing methods such as matching and multivariate adjustment? Shah and colleagues decided to find out by systematic review. They found 43 studies that directly compared propensity score methods to long-standing methods such as matching and adjustment. I don’t need to discuss the results in detail here: it’s enough to say that when they wrote the paper, they gave it the title “Propensity score methods gave similar results to traditional regression modeling in observational studies”.
Shah’s paper was published in 2005. Propensity scores have continued to soar in popularity since then (ten times as many publications in 2015 as in 2005, according to my quick search). I haven’t read them all (two thousand published already this year!) but I’m on the look-out for one that will convince me that propensity scores can achieve something the classic methods can’t do. I said in the last blog post that I won’t settle for a theoretical argument, because whatever the theoretical advantages of propensity scores, in practice they give – as it says in the title of Shah’s paper – “similar results”.
One of the rare exceptions is a study of treatment for ischaemic stroke. The authors decided to investigate a possible increased risk of death when patients with ischaemic stroke are treated with tissue plasminogen activator (t-PA). This adverse effect on mortality had been seen in observational studies but not in randomised trials, so they decided to try analysing an observational study of 212 treated patients and 6,057 untreated patients with a variety of methods including propensity scores.
First, they tried adjusting for confounders in a traditional statistical model (multivariate logistic regression). According to this method, the odds of death are roughly doubled by t-PA treatment (adjusted odds ratio 1.9, 95% confidence interval 1.2 to 3.1). Then they tried addressing confounding with propensity score methods. According to one propensity score method, the effect is very much larger, with odds of death more than ten times as large in the treatment arm (odds ratio 10.8 with confidence interval from 2.5 to 47). According to another propensity score method, the effect is so small it may not be there at all (odds ratio 1.1 with 95% confidence interval 0.7 to 1.8). Other propensity score methods gave a variety of results in between these extremes. (If you’d like to read more, you’ll find some excellent discussion in Kurth’s paper about the way these different results arise and how they should be interpreted.)
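For readers less used to odds ratios, here is a minimal sketch of how an odds ratio and its 95% confidence interval are calculated in the simplest, unadjusted case: a two-by-two table of treatment against death. The function name and the counts are invented for illustration; they are not the data from the stroke study, and the adjusted and propensity-based estimates quoted above come from more elaborate models than this.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald (Woolf) confidence interval.

    a = treated who died,   b = treated who survived
    c = untreated who died, d = untreated who survived
    z = 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio: large when any cell is small.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, invented for illustration only.
or_, lo, hi = odds_ratio_with_ci(a=20, b=80, c=100, d=800)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An interval that excludes 1 suggests the odds really do differ between the groups; an interval that includes 1 means the data are compatible with no effect at all.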
So, I have now found an example in which propensity score methods give substantially different results to traditional methods. Unfortunately, in this study the propensity score methods also give wildly different results from each other! That really doesn’t convince me that I should be leaving my old-fashioned, tried-and-trusted statistical adjustment methods. I’m feeling left behind, though – thousands of authors so excited about propensity scores; why is it only me that can’t see it?
I challenged my colleagues to find a paper that would change my mind. They sent me a promising study in which matching for propensity scores clearly gives a better answer than adjusting by traditional statistical models. It’s only a simulation study, but it does seem to demonstrate that stratifying by propensity score gives different results – on average, more conservative estimates of effect – than traditional multivariate adjustment.
But does that mean that in this example propensity scores are better than traditional methods – or does that mean that stratification (a sort of broad-brush version of matching) is better than adjustment? All the propensity score fans will think me a terrible cynic, but – surely another interpretation of their results would be that in this example matching succeeds where adjustment fails.
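To make "stratifying by propensity score" concrete, here is a small sketch in Python. It is not the simulation study described above: every number, name and function in it is an assumption made up for illustration. It simulates patients for whom treatment has no true effect on death, but two confounders raise both the chance of treatment and the chance of death. It then compares a crude odds ratio with a Mantel-Haenszel odds ratio pooled across five equal-sized propensity score strata. For simplicity it uses the true propensity score from the simulation; in a real analysis the score would have to be estimated, for example by logistic regression.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def simulate(n, rng):
    """Simulate n patients; treatment has NO true effect on death.

    Confounders x1 and x2 raise both the probability of treatment and the
    probability of death, so a crude comparison is biased.
    """
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        propensity = sigmoid(0.8 * x1 + 0.8 * x2)  # true P(treated | x1, x2)
        treated = rng.random() < propensity
        died = rng.random() < sigmoid(-1.0 + 0.8 * x1 + 0.8 * x2)
        rows.append((propensity, treated, died))
    return rows


def table(rows):
    """Counts: treated died, treated survived, untreated died, untreated survived."""
    a = sum(1 for _, t, y in rows if t and y)
    b = sum(1 for _, t, y in rows if t and not y)
    c = sum(1 for _, t, y in rows if not t and y)
    d = sum(1 for _, t, y in rows if not t and not y)
    return a, b, c, d


def crude_or(rows):
    a, b, c, d = table(rows)
    return (a * d) / (b * c)


def stratified_or(rows, n_strata=5):
    """Mantel-Haenszel odds ratio across equal-sized propensity score strata."""
    ordered = sorted(rows, key=lambda r: r[0])
    size = len(ordered) // n_strata
    num = den = 0.0
    for k in range(n_strata):
        # The last stratum absorbs any remainder.
        end = (k + 1) * size if k < n_strata - 1 else len(ordered)
        stratum = ordered[k * size:end]
        a, b, c, d = table(stratum)
        num += a * d / len(stratum)
        den += b * c / len(stratum)
    return num / den


rng = random.Random(2016)  # fixed seed so the run is repeatable
rows = simulate(20000, rng)
crude = crude_or(rows)
stratified = stratified_or(rows)
print(f"crude OR = {crude:.2f}, stratified OR = {stratified:.2f}")
```

Because the true odds ratio here is 1, the stratified estimate should land much closer to 1 than the crude one, though some residual confounding remains within each broad stratum. Note that the sketch says nothing about whether the credit belongs to the propensity score or to the stratification: stratifying on the confounders directly, where that is feasible, would be the old-fashioned equivalent.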
There’s an intriguingly worded reference in the Discussion to additional, unpublished results for adjustment by propensity scores. I’ve written to the authors, who’ve promised to look for those unpublished results for me. They did the study nearly ten years ago so I can’t expect them to find the results in a hurry. If the unpublished results show that adjustment for propensity agrees with matching for propensity, then that at last is what I’m looking for – a clear win for propensity scores when my old-fashioned methods fail. At last I’ll be able to shrug off my unfashionable scepticism and add my voice to the happy throng of propensity score enthusiasts. As I write, I’m waiting to hear back. I’ll let you know.
Richard Stevens is Course Director of the MSc in EBHC Medical Statistics. If you’ve found a study where propensity scores make the difference, he’d be glad to hear from you on Twitter at @ebhcmedstats or by email at firstname.lastname@example.org.