The Centre for Evidence-Based Medicine develops, promotes and disseminates better evidence for healthcare.

February 20, 2020

*Kathy Taylor*

**Previously**, I highlighted a list of ways where, when extracting data for meta-analysis of continuous outcomes, you might find that a summary statistic that you want is missing. **In my last post I gave the 3rd way** – **a similar summary statistic is reported, but it’s not the statistical measure that I want** and I focused on missing means. In this post I’ll show you what you can do with missing standard deviations (SDs).

Instead of the SD, another measure of dispersion may be reported, either the standard error (SE), confidence interval (CI), interquartile range (IQR) or range. The SD describes how measurements of participants naturally differ (which is saying something about the population) whilst the SE describes how accurately the mean has been estimated (which is saying something about a study). Sometimes it's not clear whether the reported statistic is the SE or the SD, and comparing its value with the established SEs or SDs of other studies may help you decide.

The **Cochrane Handbook (6.5.2.2.)** divides the equations for calculating SDs into those for group means (when you want the SD of a mean value for the intervention group or the control group) and difference in means (when you want the SD of a difference in means between the intervention and control groups). In this post I deal with SDs of group means and I will look at SDs of difference in means and other effect measures in a future post.

*Calculating SDs from SEs:*

Obtaining SDs from SEs is very simple. With n the sample size of the group,

*SD=SE√n*
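As a minimal sketch of this conversion (Python is an assumption here, and the function name `sd_from_se` is made up for illustration), the calculation is a single multiplication. The numbers in the usage line are the SE and sample size from the Chaisson et al example later in this post.

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """Convert the standard error of a group mean to a standard deviation."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return se * math.sqrt(n)


# SE = 0.12 with n = 81 (intervention group in the Chaisson et al example)
print(round(sd_from_se(0.12, 81), 2))  # 1.08
```

Note that n is the sample size of the group the SE belongs to, not the total across both groups.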

*Calculating SDs from confidence intervals:*

A 95% confidence interval is expressed in terms of the SE and gives the range in which we are 95% sure that the population mean lies. For data that is normally distributed, the confidence interval will be symmetric about the mean, with a width of 2 × 1.96 = 3.92 SEs, and therefore,

*SD = √n × (upperCI − lowerCI) / 3.92*

For a 90% confidence interval, divide by 3.29, and for a 99% confidence interval, divide by 5.15. These divisors are derived from the standard normal distribution. If the sample size is small (<60 in each group), the divisors should be replaced by slightly larger numbers, derived from the t-distribution. Tables for these two distributions are given at the end of this post.
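The confidence interval conversion might be sketched as follows. The function name and the example limits are hypothetical, and the divisors are the normal-distribution values quoted above (3.29, 3.92, 5.15), so the sketch assumes a reasonably large sample; for small groups the divisor would need to come from the t-distribution instead.

```python
import math

# Divisors quoted in the text: twice the standard normal z value.
CI_DIVISORS = {90: 3.29, 95: 3.92, 99: 5.15}


def sd_from_ci(lower: float, upper: float, n: int, level: int = 95) -> float:
    """Estimate the SD of a group mean from its confidence interval.

    Assumes normal-distribution divisors, i.e. a large enough sample.
    """
    if upper < lower:
        raise ValueError("upper limit must not be below the lower limit")
    if level not in CI_DIVISORS:
        raise ValueError("only 90, 95 and 99 percent intervals are supported")
    se = (upper - lower) / CI_DIVISORS[level]
    return se * math.sqrt(n)


# Hypothetical numbers, not taken from any study in this post.
print(round(sd_from_ci(lower=7.1, upper=7.9, n=100), 2))  # 2.04
```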

*Calculating SDs from IQRs:*

The **Cochrane Handbook** states that for normally distributed data, you can estimate

*SD = IQR / 1.35*
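A minimal sketch of the IQR rule (SD ≈ IQR/1.35, the Cochrane Handbook formula noted under the table further down). The function name is hypothetical, and the result is only as good as the normality assumption.

```python
def sd_from_iqr(iqr: float) -> float:
    """Approximate the SD from the interquartile range, assuming normality."""
    if iqr < 0:
        raise ValueError("the interquartile range cannot be negative")
    return iqr / 1.35


# IQR of 101 (intervention group of Tan et al in the table below)
print(round(sd_from_iqr(101), 2))  # 74.81
```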

*Calculating SDs from other summary statistics:*

There are a number of ways of calculating the SD from the range, but they are not generally recommended by the **Cochrane Handbook** because the range is so unstable: it is determined by extreme values rather than providing an average measure of variation.

A common approach is to estimate

*SD = range / 4*

**Walter and Yao** provide a table of conversion factors (f) according to the sample size to estimate

*SD = f × range*

Their table suggests that the common formula (f = 0.25) only applies to a sample size of around 25 (f = 0.254).
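The two range-based estimates can be compared with a short sketch. The function names are hypothetical, and the lookup holds only the conversion factors that appear in this post (the sample sizes used in the table below, plus n = 25); any other sample size needs Walter and Yao's full table.

```python
# Conversion factors f quoted in this post; a subset of Walter and Yao's table.
WALTER_YAO_F = {11: 0.315, 16: 0.283, 17: 0.279, 18: 0.275, 25: 0.254}


def sd_from_range_common(minimum: float, maximum: float) -> float:
    """Common approach: SD is roughly a quarter of the range."""
    return (maximum - minimum) / 4


def sd_from_range_walter_yao(minimum: float, maximum: float, n: int) -> float:
    """Walter and Yao: SD = f x range, with f depending on the sample size."""
    if n not in WALTER_YAO_F:
        raise ValueError(f"no conversion factor stored for n = {n}")
    return WALTER_YAO_F[n] * (maximum - minimum)


# Bojestig et al, low-dose group: min 10, max 1450, n = 16
print(sd_from_range_common(10, 1450))                    # 360.0
print(round(sd_from_range_walter_yao(10, 1450, 16), 2))  # 407.52
```

For small samples f is larger than 0.25, so the common approach gives the smaller SD of the two.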

Other methods estimate the SD from equations involving several other statistics. These equations have been evaluated by simulation but not empirically, so the **Cochrane Handbook (section 6.5.2.6)** does not recommend them “as a general rule”, but these estimates could still be used and the studies removed in a sensitivity analysis.

**Hozo et al** provide an estimate of the SD using the range with the median and sample size

which they simplify for large n to

**Bland** provides an estimate based on the range and interquartile range with the mean and sample size:

Where

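**Wan et al** estimate the SD from the range with the median and sample size:

*SD ≈ (b − a) / (2Φ^{-1}((n − 0.375)/(n + 0.25)))*

They estimate the SD from the range, interquartile range, median and sample size,

*SD ≈ (b − a) / (4Φ^{-1}((n − 0.375)/(n + 0.25))) + (q_{3} − q_{1}) / (4Φ^{-1}((0.75n − 0.125)/(n + 0.25)))*

and from the interquartile range and sample size (for large sample sizes)

*SD ≈ (q_{3} − q_{1}) / (2Φ^{-1}((0.75n − 0.125)/(n + 0.25)))*

Where

*a* and *b* are the minimum and maximum, *q_{1}* and *q_{3}* are the lower and upper quartiles, and *n* is the sample size.

Φ^{-1}(*z*) is the inverse function of Φ(*z*) (the cumulative distribution function of the standard normal distribution). Φ^{-1}(*z*) is also the upper zth percentile of the standard normal distribution. It can be calculated using the R software command ‘qnorm(z)’.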
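The Wan et al estimates can be sketched in Python, where `statistics.NormalDist().inv_cdf` plays the role of R's `qnorm`. The function names are hypothetical, and the formulas are the large-sample approximations from Wan et al's paper, written in terms of the width of the range and the width of the IQR; treat this as an illustration under those assumptions rather than a replacement for their spreadsheet.

```python
from statistics import NormalDist

_qnorm = NormalDist().inv_cdf  # inverse standard normal CDF, like R's qnorm


def wan_sd_from_range(range_width: float, n: int) -> float:
    """SD from the range (max - min) and the sample size."""
    return range_width / (2 * _qnorm((n - 0.375) / (n + 0.25)))


def wan_sd_from_iqr(iqr: float, n: int) -> float:
    """SD from the interquartile range (q3 - q1) and the sample size."""
    return iqr / (2 * _qnorm((0.75 * n - 0.125) / (n + 0.25)))


def wan_sd_from_range_and_iqr(range_width: float, iqr: float, n: int) -> float:
    """SD from both the range and the IQR: the average of the two estimates."""
    return 0.5 * (wan_sd_from_range(range_width, n) + wan_sd_from_iqr(iqr, n))


# Bojestig et al, low-dose group: range 1450 - 10 = 1440, n = 16
print(round(wan_sd_from_range(1440, 16), 2))
# Tan et al, intervention group: IQR = 101, n = 40
print(round(wan_sd_from_iqr(101, 40), 2))
```

The two printed values should land close to the 407.05 and 77.66 reported for those groups in the table below.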

**Examples of studies with missing data**

Let me show you some examples from studies of people with diabetes which were included in systematic reviews carried out by our group.

A study by **Chaisson et al 2001** reported the effect of metformin on change from baseline of HbA1c in terms of mean and SE.

For the intervention group

*SD *= 0.12√81 = 1.08%

For the control group

*SD* = 0.12√82 = 1.09%

**Kemal et al** reported the effects of rosiglitazone on plasma glucose and other laboratory variables at 6 months in terms of median and range.

Three studies from one review, in which we extracted data on the effects of renin-angiotensin-aldosterone system inhibitors on albumin excretion rates, also lacked SDs. **Tan et al** reported the effects of losartan at 6 months in terms of the median and interquartile range (IQR), **Bojestig et al** reported the effects of ramipril at 2 years in terms of median and range, and **Tong et al** reported the effects of fosinopril, also at 2 years, in terms of median and IQR. The table below shows the SD calculations using the different equations that I have shown above. Albumin excretion is measured in µg/min for all studies. For Tong et al, I converted the data from mg/24 hours, using the conversion factor that I showed **previously (no.5)**.

Table. Estimating standard deviations

| Statistic | Tan et al: Intervention | Tan et al: Control | Bojestig et al: Intervention (low dose) | Bojestig et al: Intervention (high dose) | Bojestig et al: Control | Tong et al: Intervention | Tong et al: Control | Kemal et al: Intervention | Kemal et al: Control |
|---|---|---|---|---|---|---|---|---|---|
| **DATA** | | | | | | | | | |
| n | 40 | 40 | 16 | 17 | 18 | 18 | 20 | 11 | 17 |
| Median | 79 | 55 | 81 | 94 | 96 | 894 | 243 | 2.71 | 2.64 |
| Lower quartile | | | | | | 103 | 107 | | |
| Upper quartile | | | | | | 3318 | 1836 | | |
| IQR | 101 | 58 | | | | 3215 | 1729 | | |
| Min | | | 10 | 23 | 48 | | | | |
| Max | | | 1450 | 1112 | 308 | | | | |
| Range | | | 1440 | 1089 | 260 | | | 2.38 | 1.55 |
| f | | | 0.283 | 0.279 | 0.275 | | | 0.315 | 0.279 |
| **SD ESTIMATIONS** | | | | | | | | | |
| Common | | | 360.00 | 272.25 | 65.00 | | | 0.60 | 0.39 |
| Walter & Yao | | | 407.52 | 303.83 | 71.50 | | | 0.75 | 0.43 |
| Wan et al 1 | | | 407.05 | 272.25 | 65.00 | | | 0.75 | 0.43 |
| Wan et al 2 | 77.66 | 44.60 | | | | 2586.33 | 1379.48 | | |
| Cochrane | 74.81 | 42.96 | | | | 2381.17 | 1280.86 | | |

Common approach: range/4; Walter & Yao: f × range; Wan et al 1: from the range and sample size; Wan et al 2: from the IQR and sample size; Cochrane Handbook: IQR/1.35

For the data from Tan et al, the equations of Wan et al and the Cochrane Handbook produce similar results, which suggests that the distribution of the data was not highly skewed, as the latter equation is based on the assumption that the data are normally distributed. A similar point could be made for Tong et al. For the data reported by Kemal et al, the equations of Wan et al and Walter and Yao produced identical results to 2 decimal places, but the simple common approach underestimated the SDs. Applying the equations to the data of Bojestig et al shows how wide ranges can produce unstable results.

Another strategy which I will cover in my next post is dealing with missing SDs by imputation. Which SD should you use? Take an average, use the lowest value or highest value, or try them all? I will address these questions in a future post on sensitivity analysis.

Here’s a tip…

You can derive estimates of standard deviations from other reported summary data, but be aware of the assumptions underlying your estimates.

In my next post, I’ll focus on some other examples of the **4th way** of how a summary statistic that you want may be missing for some cases: **neither the summary statistic you want, nor a similar statistic are reported.**

(You can skip this if you are only interested in carrying out the calculations)

*Calculating SDs from SEs*

The standard error of the mean (SEM, which is often abbreviated to SE) is the standard deviation of the means of multiple samples:

*SE = σ/√n*

Where

n = sample size

σ = population standard deviation

The SE can be estimated from a single sample using the observed sample standard deviation, s:

*SE ≈ s/√n*

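Let x_{1}, x_{2}, x_{3}….x_{n} be n independent observations from a population with mean µ and standard deviation σ (and variance σ^{2}). The variance of the sample mean x̄ is

*Var(x̄) = Var((x_{1} + x_{2} + … + x_{n})/n) = (1/n^{2}) × [Var(x_{1}) + Var(x_{2}) + … + Var(x_{n})] = nσ^{2}/n^{2} = σ^{2}/n*

This used the fact that the variances of independent observations add, and the result

*Var(aX) = a^{2} Var(X)*

which comes from

*Var(aX) = E[(aX − aµ)^{2}] = a^{2} E[(X − µ)^{2}] = a^{2} Var(X)*

Therefore,

*SE = √(σ^{2}/n) = σ/√n*

Returning to

*SE ≈ s/√n*

where s is the SD of the sample, and rearranging,

*SD = SE√n*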
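The result SE = σ/√n can be checked by simulation. The sketch below is illustrative only (the function name, seed, and parameter values are arbitrary choices): it draws many samples of size n from a normal population and measures the standard deviation of their means.

```python
import random
import statistics


def sd_of_sample_means(n: int, sigma: float, replicates: int = 5000,
                       seed: int = 1) -> float:
    """Empirical SD of the means of many samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


sigma, n = 2.0, 25
print("theory:    ", sigma / n ** 0.5)  # 0.4
print("simulation:", round(sd_of_sample_means(n, sigma), 3))
```

With a few thousand replicates the simulated value should sit within a percent or two of σ/√n.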

*Calculating SDs from confidence intervals:*

Call the upper and lower limits of the 95% confidence interval upperCI and lowerCI. A symmetric confidence interval means that

*upperCI = mean + 1.96SE*

*lowerCI = mean – 1.96SE*

1.96 is the Z value taken from the standard normal distribution table with the area in each tail of (1-0.95)/2=0.025 and therefore, using the one-sided table (Figure 1, red), the shaded area is 1 – 0.025 = 0.975.

Figure 1. Standard normal distribution table (p=0.95,0.975,0.995)

**As shown before**, rearranging the equations for upperCI and lowerCI

*(2×1.96)SE = upperCI – lowerCI*

Rearranging,

*SE = (upperCI − lowerCI) / 3.92*

and, since SD = SE√n,

*SD = √n × (upperCI − lowerCI) / 3.92*

Similarly, for a 90% confidence interval, the area in each tail is (1-0.90)/2=0.05 and the shaded area corresponding to a one-sided standard normal distribution table is (1-0.05)=0.95. The corresponding z value is 1.645 (Figure 1, green).

2 × 1.645 = 3.29 and therefore,

*SD = √n × (upperCI − lowerCI) / 3.29*

For a 99% confidence interval, the area in each tail is (1-0.99)/2=0.005 and the shaded area corresponding to a one-sided standard normal distribution table is (1-0.005)=0.995. The corresponding z value is 2.575 (Figure 1, blue).

2 × 2.575 = 5.15 and therefore,

*SD = √n × (upperCI − lowerCI) / 5.15*
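Rather than reading the z values from a table, the three divisors can be recovered from the inverse of the standard normal distribution. This is a small sketch using Python's standard library; the function name is hypothetical.

```python
from statistics import NormalDist


def ci_divisor(level: float) -> float:
    """Width of a two-sided normal confidence interval, in SE units."""
    tail = (1 - level) / 2                 # area in each tail
    z = NormalDist().inv_cdf(1 - tail)     # one-sided table lookup
    return 2 * z


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}", round(ci_divisor(level), 2))
# 90% 3.29
# 95% 3.92
# 99% 5.15
```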

*Calculating SDs from IQRs:*

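From a standard normal distribution table (Figure 2), the Z value for shaded area 0.75 (upper quartile) is approximately 0.67 (0.6745 to 4 decimal places). The upper quartile is therefore about 0.67 SDs above the mean, and the lower quartile the same distance below it, so

*IQR* = 2 × 0.6745 × *SD* ≈ 1.35 *SD*

Rearranging,

*SD* ≈ *IQR* / 1.35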

Figure 2. Standard normal distribution table (p=0.75)
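The 1.35 divisor can be checked the same way as the confidence interval divisors. This short sketch (the variable name is arbitrary) looks up the z value for the upper quartile and doubles it.

```python
from statistics import NormalDist

# z value with 75% of the standard normal distribution below it
z75 = NormalDist().inv_cdf(0.75)

print(round(z75, 4))      # 0.6745
print(round(2 * z75, 3))  # 1.349, which rounds to the 1.35 used above
```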

*Calculating SDs from other summary statistics:*

Walter and Yao provide information about the sources of their table of conversion factors. Hozo et al, Bland and Wan et al all provide detailed derivations of their equations in their papers. Wan et al also provide an online spreadsheet to calculate and compare their estimates. The common estimate of the SD as ¼ of the range comes from the fact that in normally distributed data, approximately 95% of values lie within 2 standard deviations either side of the mean (Figure 3). The area in each tail beyond 2 SDs is 0.0228, so the shaded area in the one-sided standard normal table is 1 − 0.0228 = 0.9772 (Figure 4).

Figure 3. Probability of being within ±2SD of the mean for data normally distributed

Figure 4. Standard normal distribution table (p=0.9772)

So ignoring the 4.56% in the tails, the range is estimated as

*range = 4 × SD*

The estimate of the SD then follows:

*SD = range / 4*
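Why the factor of 4 only suits some sample sizes can be seen in a small simulation. The sketch below (function name, seed and replicate count are arbitrary choices) draws normal samples with SD 1 and reports the average range, which is then the range expressed in SD units.

```python
import random
import statistics


def mean_range_in_sd_units(n: int, replicates: int = 4000, seed: int = 1) -> float:
    """Average range of normal samples of size n, drawn with SD = 1."""
    rng = random.Random(seed)
    ranges = []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ranges.append(max(sample) - min(sample))
    return statistics.fmean(ranges)


for n in (10, 25, 100):
    print(n, round(mean_range_in_sd_units(n), 2))
```

The averages should come out at roughly 3.1, 3.9 and 5.0 SDs, so dividing the range by 4 is only about right near n = 25, which agrees with the conversion factor f = 0.254 (about 1/3.9) that Walter and Yao give for that sample size.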

*Dr Kathy Taylor teaches data extraction in Meta-analysis. This is a short course that is also available as part of our MSc in Evidence-Based Health Care, MSc in EBHC Medical Statistics, and MSc in EBHC Systematic Reviews.*

Follow me on Twitter **@dataextips** for updates on my blog, related news, and to find out about other examples of statistics being made more broadly accessible.

A full directory of blog posts can be found at ** https://www.cebm.net/2014/06/data-extraction-in-meta-analysis/**
