# Tip for data extraction for meta-analysis – 26

February 20, 2020

## What if you’re missing a standard deviation and only a similar summary statistic is given?

Kathy Taylor

Previously, I highlighted a list of ways in which, when extracting data for meta-analysis of continuous outcomes, you might find that a summary statistic that you want is missing. In my last post I covered the 3rd way (a similar summary statistic is reported, but it’s not the statistical measure that I want) and focused on missing means. In this post I’ll show you what you can do with missing standard deviations (SDs).

Instead of the SD, another measure of dispersion may be reported, either the standard error (SE), confidence interval (CI), interquartile range (IQR) or range. The SD describes how measurements of participants naturally differ (which is saying something about the population) whilst the SE describes how accurately the mean has been estimated (which is saying something about a study). Sometimes it’s not clear whether the reported statistic is the SE or the SD, so comparing its value with the SEs or SDs reported by other studies may help you decide.

The Cochrane Handbook (6.5.2.2.) divides the equations for calculating SDs into those for group means (when you want the SD of a mean value for the intervention group or the control group) and difference in means (when you want the SD of a difference in means between the intervention and control groups). In this post I deal with SDs of group means and I will look at SDs of difference in means and other effect measures in a future post.

Calculating SDs from SEs:

Obtaining SDs from SEs is very simple:

SD=SE√n
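As a minimal sketch of this conversion in Python (the function name `sd_from_se` is only illustrative, and the values are example inputs):

```python
import math


def sd_from_se(se: float, n: int) -> float:
    """Convert the standard error of a group mean into a standard deviation."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return se * math.sqrt(n)


# Example inputs: SE = 0.12 in a group of 81 participants
print(round(sd_from_se(0.12, 81), 2))  # 1.08
```

The only inputs are the reported SE and the number of participants in that group, so it is worth checking that n refers to the group and not to the whole study.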

Calculating SDs from confidence intervals:

A 95% confidence interval is expressed in terms of the SE and gives the range in which we are 95% sure that the true (population) mean lies. For data that are normally distributed, the confidence interval will be symmetric about the mean and therefore,

SD = √n × (upperCI – lowerCI)/3.92

where upperCI and lowerCI are the upper and lower limits of the confidence interval. For a 90% confidence interval, divide by 3.29, and for a 99% confidence interval, divide by 5.15. These divisors are derived from the standard normal distribution. If the sample size is small (<60 in each group), the divisors should be replaced by slightly larger numbers, derived from the t-distribution. Tables for these two distributions are given at the end of this post.
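The calculation can be sketched in Python as follows. The function names and the example interval are hypothetical, and the sketch uses only the standard normal divisors, so it does not make the t-distribution adjustment for small samples.

```python
import math
from statistics import NormalDist


def ci_divisor(level: float = 0.95) -> float:
    """Width of a two-sided normal-theory CI in units of the SE (about 3.92 for 95%)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z


def sd_from_ci(lower: float, upper: float, n: int, level: float = 0.95) -> float:
    """Estimate the SD of a group mean from its confidence limits (large-sample case)."""
    se = (upper - lower) / ci_divisor(level)
    return se * math.sqrt(n)


# The three divisors quoted in the text
for level in (0.90, 0.95, 0.99):
    print(level, round(ci_divisor(level), 2))

# Hypothetical example: 95% CI of 4.0 to 6.0 around a mean from 100 participants
print(round(sd_from_ci(4.0, 6.0, 100), 2))
```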

Calculating SDs from IQRs:

The Cochrane Handbook states that for normally distributed data, you can estimate

SD = IQR/1.35

Calculating SDs from other summary statistics:

There are a number of ways of calculating the SD from the range but they are not generally recommended by the Cochrane Handbook because the range is so unstable, as it is determined by extreme values rather than providing an average measure of variation.

A common approach is to estimate
SD = range/4
Walter and Yao provide a table of conversion factors (f) according to the sample size to estimate
SD = f × range
Their table suggests that the common formula only applies to a sample size of around 25 (f=0.254).
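Both range-based estimates are one-line calculations. In the sketch below the function names are illustrative, and the conversion factor f has to be looked up in Walter and Yao's table for the relevant sample size, because the table is not reproduced here.

```python
def sd_from_range_common(minimum: float, maximum: float) -> float:
    """Common approach: SD is roughly a quarter of the range."""
    return (maximum - minimum) / 4


def sd_from_range_factor(minimum: float, maximum: float, f: float) -> float:
    """Walter and Yao approach: SD = f x range, with f taken from their table."""
    return f * (maximum - minimum)


# Example inputs: minimum 10, maximum 1450, and f = 0.283 for a group of 16
print(round(sd_from_range_common(10, 1450), 2))
print(round(sd_from_range_factor(10, 1450, 0.283), 2))
```

Running both on the same data gives a quick sense of how sensitive the SD is to the choice of method.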

Other methods estimate the SD by equations of several other statistics. These equations have been evaluated by simulation but not empirically, so the Cochrane Handbook (section 6.5.2.6) does not recommend them “as a general rule”, but these estimates could still be used and the studies removed in a sensitivity analysis.

Hozo et al provide an estimate of the SD using the range with the median and sample size, which they simplify for large n.

Bland provides an estimate based on the range and interquartile range with the mean and sample size.

Wan et al estimate the SD from the range with the median and sample size. They also estimate the SD from the range, interquartile range, median and sample size, and from the interquartile range and sample size (for large sample sizes). Their equations use Φ⁻¹(z), where

Φ⁻¹(z) is the inverse function of Φ(z) (the cumulative distribution function of the standard normal distribution). Φ⁻¹(z) is also the upper zth percentile of the standard normal distribution. It can be calculated using the R software command ‘qnorm(z)’.
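For readers working outside R, Python's standard library offers `NormalDist().inv_cdf(z)` as an equivalent of `qnorm(z)`. The sketch below applies it to the range-based and IQR-based estimates as I understand them from Wan et al's paper. Treat the constants (0.375, 0.25, 0.125) and the function names as assumptions to be checked against the original paper before use.

```python
from statistics import NormalDist


def qnorm(p: float) -> float:
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(p)


def wan_sd_from_range(minimum: float, maximum: float, n: int) -> float:
    """Range-based estimate: range / (2 * qnorm((n - 0.375) / (n + 0.25)))."""
    return (maximum - minimum) / (2 * qnorm((n - 0.375) / (n + 0.25)))


def wan_sd_from_iqr(q1: float, q3: float, n: int) -> float:
    """IQR-based estimate: IQR / (2 * qnorm((0.75n - 0.125) / (n + 0.25)))."""
    return (q3 - q1) / (2 * qnorm((0.75 * n - 0.125) / (n + 0.25)))


def wan_sd_from_range_and_iqr(minimum, q1, q3, maximum, n):
    """Estimate using both range and IQR: the average of the two estimates above."""
    return 0.5 * (wan_sd_from_range(minimum, maximum, n) + wan_sd_from_iqr(q1, q3, n))


# Example inputs: range 10 to 1450 with n = 16, and an IQR of 101 with n = 40
print(round(wan_sd_from_range(10, 1450, 16), 1))
print(round(wan_sd_from_iqr(0, 101, 40), 1))
```

Unlike the fixed divisors 4 and 1.35, these divisors change with the sample size, which is why n is a required argument.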

Examples of studies with missing data

Let me show you some examples from studies of people with diabetes which were included in systematic reviews carried out by our group.

A study by Chaisson et al 2001 reported the effect of metformin on change from baseline of HbA1c in terms of mean and SE.
For the intervention group
SD = 0.12√81 = 1.08%
For the control group
SD = 0.12√82 = 1.09%

Kemal et al reported the effects of rosiglitazone on plasma glucose and other laboratory variables at 6 months in terms of median and range.

Three studies from one review where we extracted data on the effects of renin-angiotensin-aldosterone system inhibitors on albumin excretion rates were Tan et al, who reported the effects of losartan at 6 months in terms of the median and interquartile range (IQR); Bojestig et al, who reported the effects of ramipril at 2 years in terms of median and range; and Tong et al, who reported the effects of fosinopril, also at 2 years, in terms of median and IQR. The table shows the SD calculations using the different equations that I have shown above. Albumin excretion is measured in µg/min for all studies. For Tong et al, I converted the data from mg/24 hours, using the conversion factor that I showed previously (no.5).

Table. Estimating standard deviations

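DATA

| Statistic | Tan et al: Intervention | Tan et al: Control | Bojestig et al: Intervention (low dose) | Bojestig et al: Intervention (high dose) | Bojestig et al: Control | Tong et al: Intervention | Tong et al: Control | Kemal et al: Intervention | Kemal et al: Control |
|---|---|---|---|---|---|---|---|---|---|
| n | 40 | 40 | 16 | 17 | 18 | 18 | 20 | 11 | 17 |
| median | 79 | 55 | 81 | 94 | 96 | 894 | 243 | 2.71 | 2.64 |
| lower IQR | | | | | | 103 | 107 | | |
| upper IQR | | | | | | 3318 | 1836 | | |
| IQR | 101 | 58 | | | | 3215 | 1729 | | |
| min | | | 10 | 23 | 48 | | | | |
| max | | | 1450 | 1112 | 308 | | | | |
| Range | | | 1440 | 1089 | 260 | | | 2.38 | 1.55 |
| f | | | 0.283 | 0.279 | 0.275 | | | 0.315 | 0.279 |

SD ESTIMATIONS

| Equation | Tan et al: Intervention | Tan et al: Control | Bojestig et al: Intervention (low dose) | Bojestig et al: Intervention (high dose) | Bojestig et al: Control | Tong et al: Intervention | Tong et al: Control | Kemal et al: Intervention | Kemal et al: Control |
|---|---|---|---|---|---|---|---|---|---|
| Common | | | 360.00 | 272.25 | 65.00 | | | 0.60 | 0.39 |
| Walter & Yao | | | 407.52 | 303.83 | 71.50 | | | 0.75 | 0.43 |
| Wan et al 1 | | | 407.05 | 272.25 | 65.00 | | | 0.75 | 0.43 |
| Wan et al 2 | 77.66 | 44.60 | | | | 2586.33 | 1379.48 | | |
| Cochrane | 74.81 | 42.96 | | | | 2381.17 | 1280.86 | | |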

Common approach – range/4; Cochrane Handbook – IQR/1.35

For the data from Tan et al, the equations of Wan et al and the Cochrane Handbook produce similar results, which suggests that the distribution of the data was not highly skewed, as the latter equation is based on the assumption that the data are normally distributed. A similar point could be made for Tong et al. For the data reported by Kemal et al, the equations of Wan et al and Walter and Yao produced identical results to 2 decimal places, but the simple common approach underestimated the SDs. Applying the equations to the data of Bojestig et al shows how wide ranges can produce unstable results.

Another strategy which I will cover in my next post is dealing with missing SDs by imputation. Which SD should you use? Take an average, use the lowest value or highest value, or try them all? I will address these questions in a future post on sensitivity analysis.

Here’s a tip…

You can derive estimates of standard deviations from other reported summary data, but be aware of the assumptions underlying your estimates.

In my next post, I’ll focus on some other examples of the 4th way in which a summary statistic that you want may be missing: neither the summary statistic you want, nor a similar statistic, is reported.

### Where did the equations come from?

(You can skip this if you are only interested in carrying out the calculations)

Calculating SDs from SEs

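The standard error of the mean (SEM, which is often abbreviated to SE) is the standard deviation of the means of multiple samples:

SE = σ/√n

Where
n = sample size
σ = population standard deviation

The SE can be estimated from a single sample using the observed sample standard deviation, s:

SE ≈ s/√n

Let x₁, x₂, x₃, …, xₙ be n independent observations from a population with mean µ and standard deviation σ (and variance σ²). Because the observations are independent, the variance of their sum is the sum of their variances, so the variance of the sample mean x̄ is

Var(x̄) = Var((x₁ + x₂ + … + xₙ)/n) = (1/n²)(Var(x₁) + Var(x₂) + … + Var(xₙ)) = nσ²/n² = σ²/n

This used the result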
Var(aX) = a² Var(X)
which comes from
Var(X) = E((X – μ)²) where μ = E(X)
Var(X) = E(X²) – 2E(X)μ + μ²
Var(X) = E(X²) – 2μ² + μ²
Var(X) = E(X²) – μ²
Var(X) = E(X²) – (E(X))²
Therefore,
Var(aX) = E((aX)²) – (E(aX))²
Var(aX) = a²E(X²) – a²(E(X))² = a²Var(X)

Returning to

Var(x̄) = σ²/n

and taking the square root gives SE = σ/√n. Rearranging,

SD = SE√n
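The relationship SE = σ/√n can also be checked by simulation. The sketch below is purely illustrative: the population mean, σ, sample size and number of repeated samples are arbitrary choices, not values from any study.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

sigma = 10.0        # population SD (arbitrary choice)
n = 25              # size of each sample
n_samples = 5000    # number of repeated samples

# Draw many samples of size n and record the mean of each one
means = [
    statistics.fmean([random.gauss(50.0, sigma) for _ in range(n)])
    for _ in range(n_samples)
]

observed_se = statistics.stdev(means)   # SD of the sample means
expected_se = sigma / n ** 0.5          # sigma / sqrt(n) = 2.0

print(round(observed_se, 2), expected_se)
```

The SD of the simulated means should sit close to σ/√n, and multiplying it by √n should recover a value close to σ.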

Calculating SDs from confidence intervals:

Call the upper and lower limits of the 95% confidence interval upperCI and lowerCI. A symmetric confidence interval means that

upperCI = mean + 1.96SE
lowerCI = mean – 1.96SE

1.96 is the Z value taken from the standard normal distribution table with the area in each tail of (1-0.95)/2=0.025 and therefore, using the one-sided table (Figure 1, red), the shaded area is 1 – 0.025 = 0.975.

Figure 1. Standard normal distribution table (p=0.95,0.975,0.995)

Subtracting the equation for lowerCI from the equation for upperCI gives

(2×1.96)SE = upperCI – lowerCI

Rearranging,

SE = (upperCI – lowerCI)/3.92

and, since SD = SE√n,

SD = √n × (upperCI – lowerCI)/3.92

Similarly, for a 90% confidence interval, the area in each tail is (1-0.90)/2=0.05 and the shaded area corresponding to a one-sided standard normal distribution table is (1-0.05)=0.95. The corresponding z value is 1.645 (Figure 1, green).

2 × 1.645 = 3.29 and therefore,

SD = √n × (upperCI – lowerCI)/3.29

For a 99% confidence interval, the area in each tail is (1-0.99)/2=0.005 and the shaded area corresponding to a one-sided standard normal distribution table is (1-0.005)=0.995. The corresponding z value is 2.575 (Figure 1, blue).

2 × 2.575 = 5.15 and therefore,

SD = √n × (upperCI – lowerCI)/5.15

Calculating SDs from IQRs:

From a standard normal distribution table (Figure 2), the Z value for shaded area 0.75 (upper quartile) is approximately 0.67 (0.6745 to 4 decimal places). The upper quartile is 0.67 SDs above the mean and, by symmetry, the lower quartile is 0.67 SDs below it, so

IQR = 2 × 0.6745 × SD ≈ 1.35 SD

and therefore SD ≈ IQR/1.35.

Figure 2. Standard normal distribution table (p=0.75)
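The 1.35 divisor can be reproduced numerically. In this sketch the function name is illustrative, and the default divisor is the rounded value 1.35 so that results match hand calculations.

```python
from statistics import NormalDist

# Distance between the quartiles of a standard normal distribution, in SD units
exact_divisor = 2 * NormalDist().inv_cdf(0.75)
print(round(exact_divisor, 3))  # 1.349


def sd_from_iqr(iqr: float, divisor: float = 1.35) -> float:
    """Estimate the SD from the interquartile range, assuming normally distributed data."""
    return iqr / divisor


# Example input: an IQR of 101
print(round(sd_from_iqr(101), 2))  # 74.81
```

Passing `exact_divisor` instead of the default gives a slightly larger SD, and the difference is negligible next to the uncertainty introduced by assuming normality.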

Calculating SDs from other summary statistics:

Walter and Yao provide information about the sources of their table of conversion factors. Hozo et al, Bland and Wan et al all provide detailed derivations of their equations in their papers. Wan et al also provide an online spreadsheet to calculate and compare their estimates. The common estimate of the SD as ¼ of the range comes from the fact that in normally distributed data, approximately 95% of values lie within 2 standard deviations either side of the mean (Figure 3). The shaded area in the one-sided standard normal table is 1-0.0228=0.9772 (Figure 4).

Figure 3. Probability of being within ±2SD of the mean for data normally distributed

Figure 4. Standard normal distribution table (p=0.9772)

So ignoring the 4.56% in the tails, the range is estimated as

range = 4SD

The estimate of the SD then follows:

SD = range/4

Dr Kathy Taylor teaches data extraction in Meta-analysis. This is a short course that is also available as part of our MSc in Evidence-Based Health Care, MSc in EBHC Medical Statistics, and MSc in EBHC Systematic Reviews.
