Tip for data extraction for meta-analysis – 26

February 20, 2020

What if you’re missing a standard deviation and only a similar summary statistic is given?

Kathy Taylor

Previously, I highlighted a list of ways in which, when extracting data for meta-analysis of continuous outcomes, you might find that a summary statistic that you want is missing. In my last post I covered the 3rd way (a similar summary statistic is reported, but it’s not the statistical measure that I want) and focused on missing means. In this post I’ll show you what you can do with missing standard deviations (SDs).

Instead of the SD, another measure of dispersion may be reported: the standard error (SE), confidence interval (CI), interquartile range (IQR) or range. The SD describes how measurements of participants naturally differ (which says something about the population), whilst the SE describes how accurately the mean has been estimated (which says something about a study). Sometimes it’s not clear whether the reported statistic is the SE or the SD, and comparing its value with the established SEs or SDs of other studies may help you decide.

The Cochrane Handbook (section 6.5.2.2) divides the equations for calculating SDs into those for group means (when you want the SD of a mean value for the intervention group or the control group) and those for differences in means (when you want the SD of a difference in means between the intervention and control groups). In this post I deal with SDs of group means and I will look at SDs of differences in means and other effect measures in a future post.

Calculating SDs from SEs:

Obtaining SDs from SEs is very simple:

SD = SE × √n

where n is the sample size of the group.
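A minimal Python sketch of this conversion, for illustration only. The function name `sd_from_se` is an assumption, and the numbers are the SE of 0.12 and group size of 81 used in the metformin example later in this post.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """Convert the standard error of a group mean into a standard deviation."""
    return se * math.sqrt(n)


# SE of 0.12 reported for a group of 81 participants
print(round(sd_from_se(0.12, 81), 2))  # 1.08
```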

Calculating SDs from confidence intervals:

A 95% confidence interval is expressed in terms of the SE and gives the range in which we are 95% sure that the true (population) mean lies. For data that are normally distributed, the confidence interval will be symmetric about the mean and therefore,

SD = √n × (upper limit – lower limit) / 3.92

For a 90% confidence interval, divide by 3.29, and for a 99% confidence interval, divide by 5.15. These divisors are derived from the standard normal distribution. If the sample size is small (<60 in each group), the divisors should be replaced by slightly larger numbers, derived from the t-distribution. Tables for these two distributions are given at the end of this post.
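As a rough sketch, the large-sample calculation could look like this in Python. The divisor 3.92 is 2 × 1.96 for a 95% interval; the function name `sd_from_ci` and the example interval are hypothetical, and the sketch does not cover the small-sample case, where the divisors would come from the t-distribution.

```python
import math

# Large-sample divisors from the standard normal distribution
DIVISORS = {90: 3.29, 95: 3.92, 99: 5.15}


def sd_from_ci(lower: float, upper: float, n: int, level: int = 95) -> float:
    """Estimate the SD of a group from the confidence interval of its mean."""
    se = (upper - lower) / DIVISORS[level]  # width of the interval -> SE
    return se * math.sqrt(n)                # SE -> SD


# Hypothetical group: 95% CI of 7.1 to 7.9 around the mean, n = 100
print(round(sd_from_ci(7.1, 7.9, 100), 2))  # 2.04
```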

Calculating SDs from IQRs:

The Cochrane Handbook states that for normally distributed data, you can estimate

SD ≈ IQR / 1.35
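A short Python sketch of the IQR/1.35 rule, for illustration only. The function name `sd_from_iqr` is an assumption; the IQR widths of 101 and 58 are the values for Tan et al in the table later in this post.

```python
def sd_from_iqr(iqr: float) -> float:
    """Estimate the SD from the width of the interquartile range,
    assuming the data are approximately normally distributed."""
    return iqr / 1.35


print(round(sd_from_iqr(101), 2))  # 74.81
print(round(sd_from_iqr(58), 2))   # 42.96
```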

Calculating SDs from other summary statistics:

There are a number of ways of calculating the SD from the range, but they are not generally recommended by the Cochrane Handbook because the range is so unstable: it is determined by extreme values rather than providing an average measure of variation.

A common approach is to estimate

SD = range / 4

Walter and Yao provide a table of conversion factors (f) according to the sample size to estimate
SD = f × range
Their table suggests that the common formula only applies to a sample size of around 25 (f=0.254).
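The two range-based estimates can be sketched in Python as below. This is illustrative only: the function names are assumptions, the factor f has to be looked up by hand in Walter and Yao's table, and the numbers are those of the Bojestig et al low-dose group from the table later in this post (minimum 10, maximum 1450, n = 16, f = 0.283).

```python
def sd_from_range_common(minimum: float, maximum: float) -> float:
    """Common rule of thumb: SD is roughly a quarter of the range."""
    return (maximum - minimum) / 4


def sd_from_range_walter_yao(minimum: float, maximum: float, f: float) -> float:
    """Walter and Yao: SD = f x range, with f taken from their table
    for the relevant sample size."""
    return f * (maximum - minimum)


print(sd_from_range_common(10, 1450))                       # 360.0
print(round(sd_from_range_walter_yao(10, 1450, 0.283), 2))  # 407.52
```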

Other methods estimate the SD from equations that combine several other statistics. These equations have been evaluated by simulation but not empirically, so the Cochrane Handbook (section 6.5.2.6) does not recommend them “as a general rule”. These estimates could still be used, with the studies removed in a sensitivity analysis.

Hozo et al provide an estimate of the SD using the range with the median and sample size

which they simplify for large n to

Bland provides an estimate based on the range and interquartile range with the mean and sample size:

Where

Wan et al estimate the SD from the range with the median and sample size:

They estimate the SD from the range, interquartile range, median and sample size,

and from the interquartile range and sample size (for large sample sizes)

Where

Φ⁻¹(z) is the inverse function of Φ(z) (the cumulative distribution function of the standard normal distribution). Φ⁻¹(z) is also the upper zth percentile of the standard normal distribution. It can be calculated using the R software command ‘qnorm(z)’.
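For readers working in Python rather than R, `statistics.NormalDist().inv_cdf(z)` plays the role of `qnorm(z)`. The sketch below is an assumption-laden illustration, not a definitive implementation: it uses the range-based and IQR-based estimators in the form I understand them to be published by Wan et al, SD ≈ range / (2Φ⁻¹((n – 0.375)/(n + 0.25))) and SD ≈ IQR / (2Φ⁻¹((0.75n – 0.125)/(n + 0.25))), so check the formulas against the paper before relying on them. The function names are hypothetical. The inputs are taken from the table later in this post.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1


def wan_sd_from_range(minimum: float, maximum: float, n: int) -> float:
    """Range-based estimate (assumed form of the Wan et al estimator)."""
    z = STD_NORMAL.inv_cdf((n - 0.375) / (n + 0.25))
    return (maximum - minimum) / (2 * z)


def wan_sd_from_iqr(iqr: float, n: int) -> float:
    """IQR-based estimate (assumed form of the Wan et al estimator)."""
    z = STD_NORMAL.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return iqr / (2 * z)


# Bojestig et al, low dose: min 10, max 1450, n = 16 (table lists 407.05)
print(wan_sd_from_range(10, 1450, 16))
# Tan et al, intervention: IQR 101, n = 40 (table lists 77.66)
print(wan_sd_from_iqr(101, 40))
```

Both calls land very close to the "Wan et al 1" and "Wan et al 2" values listed in the table for these groups, which is some reassurance that the assumed formulas match the ones used there.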

Examples of studies with missing data

Let me show you some examples from studies of people with diabetes which were included in systematic reviews carried out by our group.

A study by Chaisson et al 2001 reported the effect of metformin on change from baseline of HbA1c in terms of mean and SE.
For the intervention group
SD = 0.12√81 = 1.08%
For the control group
SD = 0.12√82 = 1.09%

Kemal et al reported the effects of rosiglitazone on plasma glucose and other laboratory variables at 6 months in terms of median and range.

Three studies came from one review, for which we extracted data on the effects of renin-angiotensin-aldosterone system inhibitors on albumin excretion rates. Tan et al reported the effects of losartan at 6 months in terms of the median and interquartile range (IQR). Bojestig et al reported the effects of ramipril at 2 years in terms of median and range, and Tong et al reported the effects of fosinopril, also at 2 years, in terms of median and interquartile range. The table shows the SD calculations using the different equations that I have shown above. Albumin excretion is measured in µg/min for all studies. For Tong et al, I converted the data from mg/24 hours, using the conversion factor that I showed previously (no.5).

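Table. Estimating standard deviations

DATA

| Statistic | Tan et al, Intervention | Tan et al, Control | Bojestig et al, Intervention (low dose) | Bojestig et al, Intervention (high dose) | Bojestig et al, Control | Tong et al, Intervention | Tong et al, Control | Kemal et al, Intervention | Kemal et al, Control |
|---|---|---|---|---|---|---|---|---|---|
| n | 40 | 40 | 16 | 17 | 18 | 18 | 20 | 11 | 17 |
| median | 79 | 55 | 81 | 94 | 96 | 894 | 243 | 2.71 | 2.64 |
| lower IQR | | | | | | 103 | 107 | | |
| upper IQR | | | | | | 3318 | 1836 | | |
| IQR | 101 | 58 | | | | 3215 | 1729 | | |
| min | | | 10 | 23 | 48 | | | | |
| max | | | 1450 | 1112 | 308 | | | | |
| Range | | | 1440 | 1089 | 260 | | | 2.38 | 1.55 |
| f | | | 0.283 | 0.279 | 0.275 | | | 0.315 | 0.279 |

SD ESTIMATIONS

| Equation | Tan et al, Intervention | Tan et al, Control | Bojestig et al, Intervention (low dose) | Bojestig et al, Intervention (high dose) | Bojestig et al, Control | Tong et al, Intervention | Tong et al, Control | Kemal et al, Intervention | Kemal et al, Control |
|---|---|---|---|---|---|---|---|---|---|
| Common | | | 360.00 | 272.25 | 65.00 | | | 0.60 | 0.39 |
| Walter & Yao | | | 407.52 | 303.83 | 71.50 | | | 0.75 | 0.43 |
| Wan et al 1 | | | 407.05 | 272.25 | 65.00 | | | 0.75 | 0.43 |
| Wan et al 2 | 77.66 | 44.60 | | | | 2586.33 | 1379.48 | | |
| Cochrane | 74.81 | 42.96 | | | | 2381.17 | 1280.86 | | |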

Common approach – range/4; Cochrane Handbook – IQR/1.35

For the data from Tan et al, the equations of Wan et al and the Cochrane Handbook produce similar results, which suggests that the distribution of the data was not highly skewed, as the latter equation is based on the assumption that the data are normally distributed. A similar point could be made for Tong et al. For the data reported by Kemal et al, the equations of Wan et al and Walter and Yao produced identical results to 2 decimal places, but the simple common approach underestimated the SDs. Applying the equations to the data of Bojestig et al shows how wide ranges can produce unstable results.

Another strategy which I will cover in my next post is dealing with missing SDs by imputation. Which SD should you use? Take an average, use the lowest value or highest value, or try them all? I will address these questions in a future post on sensitivity analysis.

Here’s a tip…

You can derive estimates of standard deviations from other reported summary data, but be aware of the assumptions underlying your estimates.

In my next post, I’ll focus on some other examples of the 4th way in which a summary statistic that you want may be missing: neither the summary statistic you want nor a similar statistic is reported.

Where did the equations come from?

(You can skip this if you are only interested in carrying out the calculations)

Calculating SDs from SEs

The standard error of the mean (SEM, which is often abbreviated to SE) is the standard deviation of the means of multiple samples:

SE = σ/√n

Where
n = sample size
σ = population standard deviation

The SE can be estimated from a single sample using the observed sample standard deviation, s:

SE ≈ s/√n

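Let x1, x2, x3, …, xn be n independent observations from a population with mean µ and standard deviation σ (and variance σ²). Because the observations are independent, the variance of their sum is the sum of their variances, so the variance of the sample mean is

Var((x1 + x2 + … + xn)/n) = (1/n²)(Var(x1) + Var(x2) + … + Var(xn)) = nσ²/n² = σ²/n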

This used the result
Var(aX) = a² Var(X)
which comes from
Var(X) = E((X – μ)²) where μ = E(X)
Var(X) = E(X²) – 2E(X)μ + μ²
Var(X) = E(X²) – 2μ² + μ²
Var(X) = E(X²) – μ²
Var(X) = E(X²) – (E(X))²
Therefore,
Var(aX) = E((aX)²) – (E(aX))²
Var(aX) = a²E(X²) – a²(E(X))² = a²Var(X)

Returning to the variance of the sample mean, σ²/n, and taking the square root,

SE = σ/√n

Rearranging, SD = SE√n
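The relationship SE = σ/√n can also be checked by simulation. The Python sketch below is illustrative only: the population mean of 50, the σ of 10, the sample size of 25 and the number of simulated samples are all arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

sigma = 10.0      # population SD (assumed)
n = 25            # size of each sample (assumed)
n_samples = 5000  # number of simulated samples (assumed)

# Draw many samples and record the mean of each one
sample_means = [
    statistics.fmean([random.gauss(50.0, sigma) for _ in range(n)])
    for _ in range(n_samples)
]

empirical_se = statistics.stdev(sample_means)  # SD of the sample means
theoretical_se = sigma / math.sqrt(n)          # sigma / sqrt(n) = 2.0

print(empirical_se, theoretical_se)
```

The two printed values should agree closely, and multiplying either by √n recovers an SD close to the assumed σ.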

Calculating SDs from confidence intervals:

Call the upper and lower limits of the 95% confidence interval upperCI and lowerCI. A symmetric confidence interval means that

upperCI = mean + 1.96SE
lowerCI = mean – 1.96SE

1.96 is the Z value taken from the standard normal distribution table with the area in each tail of (1-0.95)/2=0.025 and therefore, using the one-sided table (Figure 1, red), the shaded area is 1 – 0.025 = 0.975.

Figure 1. Standard normal distribution table (p=0.95,0.975,0.995)

Subtracting the equation for lowerCI from the equation for upperCI,

(2×1.96)SE = upperCI – lowerCI

Rearranging,

SE = (upperCI – lowerCI) / 3.92

Similarly, for a 90% confidence interval, the area in each tail is (1-0.90)/2=0.05 and the shaded area corresponding to a one-sided standard normal distribution table is (1-0.05)=0.95. The corresponding z value is 1.645 (Figure 1, green).

2 × 1.645 = 3.29 and therefore,

SE = (upperCI – lowerCI) / 3.29

For a 99% confidence interval, the area in each tail is (1-0.99)/2=0.005 and the shaded area corresponding to a one-sided standard normal distribution table is (1-0.005)=0.995. The corresponding z value is 2.575 (Figure 1, blue).

2 × 2.575 = 5.15 and therefore,

SE = (upperCI – lowerCI) / 5.15
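The divisors 3.29, 3.92 and 5.15 can be reproduced without a printed table. A minimal Python sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the function name `ci_divisor` is hypothetical.

```python
from statistics import NormalDist


def ci_divisor(level: float) -> float:
    """Width of a confidence interval in units of SE (normal distribution)."""
    tail = (1 - level) / 2                    # area in each tail
    z = NormalDist().inv_cdf(1 - tail)        # one-sided z value
    return 2 * z


for level in (0.90, 0.95, 0.99):
    print(level, round(ci_divisor(level), 2))
# 0.9 3.29
# 0.95 3.92
# 0.99 5.15
```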

Calculating SDs from IQRs:

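From a standard normal distribution table (Figure 2), the Z value for shaded area 0.75 (upper quartile) is approximately 0.67 (0.6745 to 4 decimal places). The upper quartile is therefore about 0.67 SDs above the mean and, by symmetry, the lower quartile is about 0.67 SDs below it, so

IQR = 2 × 0.6745 × SD ≈ 1.35 SD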
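The factor 1.35 can be checked in the same way, here as a short Python sketch using `statistics.NormalDist` (Python 3.8 or later); the variable names are only illustrative.

```python
from statistics import NormalDist

# z value that leaves 75% of the standard normal distribution below it
z_upper_quartile = NormalDist().inv_cdf(0.75)

# The IQR spans from -z to +z, so its width in SD units is 2z
iqr_in_sds = 2 * z_upper_quartile

print(round(z_upper_quartile, 4))  # 0.6745
print(round(iqr_in_sds, 2))        # 1.35
```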

 

Figure 2. Standard normal distribution table (p=0.75)

Calculating SDs from other summary statistics:

Walter and Yao provide information about the sources of their table of conversion factors. Hozo et al, Bland and Wan et al all provide detailed derivations of their equations in their papers. Wan et al also provide an online spreadsheet to calculate and compare their estimates. The common estimate of the SD as ¼ of the range comes from the fact that in normally distributed data, approximately 95% of values lie within 2 standard deviations either side of the mean (Figure 3). The shaded area in the one-sided standard normal table is 1 – 0.0228 = 0.9772 (Figure 4).

Figure 3. Probability of being within ±2SD of the mean for data normally distributed

Figure 4. Standard normal distribution table (p=0.9772)

So ignoring the 4.56% in the tails, the range is estimated as

range = 4SD

The estimate of the SD then follows

SD = range / 4
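The proportion of a normal distribution lying within 2 SDs of the mean can be confirmed with a short Python sketch (again assuming `statistics.NormalDist` from Python 3.8 or later; the variable names are illustrative).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Probability of falling between 2 SDs below and 2 SDs above the mean
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)
in_tails = 1 - within_2sd

print(round(within_2sd, 4))  # 0.9545
print(round(in_tails, 4))    # 0.0455
```

The unrounded tail area is about 4.55%; doubling the table value of 0.0228 gives the 4.56% quoted above.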

Dr Kathy Taylor teaches data extraction in Meta-analysis. This is a short course that is also available as part of our MSc in Evidence-Based Health Care, MSc in EBHC Medical Statistics, and MSc in EBHC Systematic Reviews.

Follow me on Twitter @dataextips for updates on my blog, related news, and to find out about other examples of statistics being made more broadly accessible.

A full directory of blog posts can be found at  https://www.cebm.net/2014/06/data-extraction-in-meta-analysis/

