Research Figures Demystified: Bland-Altman Plot

While working your shift in a small community ED, you overhear that EMS is on the way with a five-year-old child in respiratory distress after eating a peanut butter sandwich. Anticipating anaphylactic shock, you and the senior resident begin planning the course of action. The resident asks, “How much do you think a five-year-old weighs?” While you fumble for your Broselow tape, a nurse seated near you confidently responds, “That’s easy, just count your fingers! One, three, five. Ten, fifteen, twenty! The child weighs approximately twenty kilograms!”

Sure enough, your Broselow tape confirms the approximate weight of a five-year-old is 20 kilograms. But how the heck did she do that? Is this method of estimation accurate across all age groups? You decide to seek evidence.

To your surprise, you find a paper by Young et al. from March 2014 published in the American Journal of Emergency Medicine titled, “Finger counting: an alternative method for estimating pediatric weights”. The authors of this paper conclude that “the finger counting method is an acceptable alternative to the Broselow method for weight estimation in children aged 1 to 9 years.” As you dig a bit deeper, you find that the results are presented in the form of an alien graph that you’ve never heard of before: the Bland-Altman Plot [1].

Background

In the past, correlation coefficients (r) were commonly used to report the level of agreement between different measures. However, many statisticians criticized this strategy as misleading, because two methods can be highly correlated and still disagree systematically. In a landmark paper published in The Lancet in 1986, J. Martin Bland and Douglas Altman introduced their solution to the medical community, and thus the “Bland-Altman Plot” was born [2].

The Bland-Altman Plot (BAP) provides a visual representation of the level of agreement between two different measures. This comparison is important because “comparison of a new measurement technique with an established one is often needed to see whether they agree sufficiently for the new to replace the old” [2]. Often in medicine, new tests are introduced to replace accurate existing ones for reasons such as lower cost, quicker results, or a better safety profile.

Components of the Bland-Altman Plot

Essentially, the BAP is a derivative of the rudimentary scatter plot. As in a scatter plot, each point carries two pieces of information. In a traditional scatter plot, each subject's value from one method is plotted against that same subject's value from the other method. The purpose of this is to visualize the relationship between the two sets of values. If the values increase together (positive correlation), the points trend upward from left to right. If one of the values decreases as the other increases (negative correlation), the points trend downward from left to right. Only when the two methods agree perfectly do the points fall exactly on the 45 degree line of identity (y = x).
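To see why correlation alone says little about agreement, consider a minimal sketch with made-up numbers. The readings and the `pearson_r` helper below are hypothetical and written only for illustration: the "new" method always reports exactly double the "old" one, so the two are perfectly correlated even though they never agree.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical readings: the new method always reports double the old one.
old_method = [1.0, 2.0, 3.0, 4.0, 5.0]
new_method = [2.0 * value for value in old_method]

r = pearson_r(old_method, new_method)
differences = [new - old for new, old in zip(new_method, old_method)]

print(f"r = {r:.2f}")                   # correlation between the methods
print(f"differences = {differences}")   # per-subject disagreement
```

The correlation is perfect, yet the disagreement grows with every reading. A plot built on the differences themselves makes that kind of problem visible.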

Although scatter plots are an excellent way of checking correlation between two outcomes, they are an insensitive method of assessing the agreement between two measures. What makes a BAP unique is that, instead of plotting the values against each other, we plot calculated measures. On the x-axis, the mean of the two measurements for each subject is plotted ((M1_i + M2_i) / 2). On the y-axis, the difference between the two measurements is plotted (M1_i – M2_i). After the points are plotted, the mean difference across all N subjects is calculated and represented as a horizontal line ((Diff_1 + Diff_2 + … + Diff_N) / N). The closer this horizontal line is to zero, the smaller the systematic difference between the two methods. This becomes clearer with the practice cases below.
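The plotted quantities can be sketched in a few lines of Python. The paired readings and the function name `bland_altman_points` are assumptions made up for this illustration; they do not come from any study.

```python
from statistics import mean

# Hypothetical paired readings, one pair per subject (illustration only).
new_test = [1.2, 2.9, 4.1, 5.6, 7.8]
old_test = [1.0, 3.0, 4.0, 6.0, 8.0]

def bland_altman_points(m1, m2):
    """Return the x and y coordinates of each point on the plot."""
    x = [(a + b) / 2 for a, b in zip(m1, m2)]  # mean of each pair
    y = [a - b for a, b in zip(m1, m2)]        # difference of each pair (m1 - m2)
    return x, y

pair_means, pair_diffs = bland_altman_points(new_test, old_test)

# The central horizontal line: the mean of all per-subject differences.
bias = mean(pair_diffs)
print(f"mean difference = {bias:.2f}")
```

Note that the order of subtraction (here, new minus old) determines the sign of the mean difference, so it is worth checking which way a given paper subtracts before interpreting its plot.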

After calculating the mean difference (the bias), the limits of agreement are calculated as the mean difference plus and minus 1.96 times the standard deviation of the differences (Mean difference +/- 1.96 * SD(differences)) and plotted. They are represented by the outermost horizontal lines on the plot. If the differences are roughly normally distributed, about 95% of them fall between these two lines. The most important component to consider when interpreting a BAP is the spread of these limits. The wider the spread, the less precise the agreement between measures. There is no one defined value of “too wide”. Instead, the clinician must decide what level of error is acceptable.
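As a rough sketch, the limits of agreement can be computed with Python's standard library, assuming the conventional definition of mean difference +/- 1.96 times the sample standard deviation of the differences. The data and the function names are hypothetical. The optional `plot_bland_altman` helper assumes matplotlib is installed and is only one of many ways to draw the figure.

```python
from statistics import mean, stdev

def limits_of_agreement(diffs, z=1.96):
    """Return (lower limit, bias, upper limit) for a list of paired differences."""
    bias = mean(diffs)              # mean difference between the two methods
    half_width = z * stdev(diffs)   # z times the sample SD of the differences
    return bias - half_width, bias, bias + half_width

def plot_bland_altman(pair_means, pair_diffs):
    """Draw the plot; matplotlib is imported lazily so the rest runs without it."""
    import matplotlib.pyplot as plt
    lower, bias, upper = limits_of_agreement(pair_diffs)
    fig, ax = plt.subplots()
    ax.scatter(pair_means, pair_diffs)
    ax.axhline(bias, linestyle="-")     # mean difference line
    ax.axhline(lower, linestyle="--")   # lower limit of agreement
    ax.axhline(upper, linestyle="--")   # upper limit of agreement
    ax.set_xlabel("Mean of the two measurements")
    ax.set_ylabel("Difference (new - old)")
    plt.show()

# Hypothetical differences (new test minus old test), for illustration only.
example_diffs = [0.2, -0.1, 0.1, -0.4, -0.2]
lower, bias, upper = limits_of_agreement(example_diffs)
print(f"bias = {bias:.2f}, limits of agreement = {lower:.2f} to {upper:.2f}")
```

Whether the resulting interval is narrow enough remains a clinical judgment; the code only reports the numbers.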

Practice Cases

The following examples utilize random data generated for purposes of instruction. Consider tests X, Y, and Z to be three different novel methods of estimating serum lactate for patients in septic shock. Three separate “studies” were performed comparing these novel methods against the current standard serum lactate measurement.

Case 1

[su_spoiler title=”Interpretation” style=”fancy” icon=”chevron-circle”]

Does the new test tend to underestimate or overestimate the old test? Very slight underestimation (Old test was subtracted from new test – be careful here)
Is the distance between the upper and lower margins large enough to be clinically important? NO: variability appears to only be +/- 0.3.
Is the variability around the mean constant? YES
GOOD AGREEMENT!

[/su_spoiler]

Case 2

[su_spoiler title=”Interpretation” style=”fancy” icon=”chevron-circle”]

Does the new test tend to underestimate or overestimate the old test? Very slight overestimation
Is the distance between the upper and lower margins large enough to be clinically important? YES: variability appears to be +/- 4.5.
BAD AGREEMENT!

[/su_spoiler]

Case 3

[su_spoiler title=”Interpretation” style=”fancy” icon=”chevron-circle”]

Does the new test tend to underestimate or overestimate the old test? Very slight overestimation
Is the distance between the upper and lower margins large enough to be clinically important? POSSIBLY: Variability appears to be +/- 2.3 (might be acceptable if the test is non-invasive and/or cheap).
Is the variability around the mean constant? NO
Is there an observable trend? YES: It appears that agreement is noticeably worse above a serum lactate of approximately 5.0.
Could this trend become a clinically important problem in practice? YES
BAD AGREEMENT!

[/su_spoiler]
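The question "Is the variability around the mean constant?" is usually answered by eye, but a crude numerical check is possible. The sketch below is one possible approach, not a standard test: it assumes hypothetical plot coordinates and a hypothetical helper, `spread_by_range`, that compares the standard deviation of the differences on either side of a chosen cutoff on the x-axis.

```python
from statistics import stdev

def spread_by_range(pair_means, pair_diffs, cutoff):
    """Sample SD of the differences below, and at or above, a cutoff on the x-axis.

    Each side needs at least two points for stdev to be defined.
    """
    low = [d for m, d in zip(pair_means, pair_diffs) if m < cutoff]
    high = [d for m, d in zip(pair_means, pair_diffs) if m >= cutoff]
    return stdev(low), stdev(high)

# Hypothetical plot coordinates in which disagreement widens at higher values.
pair_means = [1.0, 2.0, 3.0, 4.0, 5.5, 6.5, 7.5, 8.5]
pair_diffs = [0.1, -0.1, 0.2, -0.2, 1.5, -1.8, 2.2, -2.0]

sd_low, sd_high = spread_by_range(pair_means, pair_diffs, cutoff=5.0)
print(f"SD below cutoff = {sd_low:.2f}, SD at or above cutoff = {sd_high:.2f}")
```

A much larger spread on one side of the cutoff suggests that a single pair of limits of agreement does not describe the whole measurement range well, which is the pattern described in Case 3.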

Scenario Wrap Up

Now back to the case…

After hearing your nurse’s trick of the trade in pediatric weight estimation, you look more closely at the paper by Young et al. [1]. The authors conclude that pediatric weights can be estimated acceptably using the “finger counting method”. They report their findings in the form of a Bland-Altman plot, comparing the estimated weights with measured weights from a total of 207 subjects (23 per age group).

Previously unpublished figure used with permission from Tim Young

How would I interpret this?

Does finger counting tend to underestimate or overestimate weights predicted by the Broselow Tape? Very slight underestimation (This time, the NEW test was subtracted from the OLD test)
Is the distance between the upper and lower margins large enough to be clinically important? NO
Is the variability around the mean constant? NO
Is there an observable trend? YES: there appears to be a trend toward underestimation of pediatric weights as the estimated weights approach 30-35 kg (~9-11 years).
Could this trend become a clinically important problem in practice? NO
GOOD AGREEMENT!

Take-Home Points

Bland-Altman plots are an incredibly simple and intuitive way of visually assessing the agreement between two different measures.
The most important considerations when interpreting a Bland-Altman plot are:
- Where does the mean difference line fall?
- How large is the spread of the error margins?
- Is the spread large enough to be clinically important?

Additional Reading

Dewitte et al. Application of the Bland-Altman plot for interpretation of method-comparison studies: a critical investigation of its practice. Clin Chem 2002; 48(5): 799-802. [open access]
Bland M. How can I decide the sample size for a study of agreement between two methods of measurement? [website, accessed May 20, 2014]
GraphPad website. Interpretation of Results: Bland Altman. [website, accessed May 26, 2014]
Young, Timothy. Finger Counting Method for Pediatric Weight Estimation.

Derek Sifford

Premedical student, Wayne State University
Critical Care Paramedic, serving the Henry Ford Health System (Detroit, MI)
ALiEM CTO
Founder, www.hurtregistry.org