Keith R. Eberhardt, William F. Guthrie
Statistical Engineering Division, ITL
In interlaboratory comparisons, two or more laboratories measure
the same artifact to compare the relative biases of their measurement
processes. One summary of interest is the pairwise difference, say X1-X2,
between two laboratories' results, along with a confidence interval for the
true difference. Since the labs have unequal variances, the confidence interval
is usually computed by the Welch-Satterthwaite procedure, which approximates
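the distribution of the pivot statistic
(X1 - X2 - (mu1 - mu2)) / sqrt(u1^2 + u2^2),
where mu1 - mu2 is the true difference and u1, u2 are the estimated standard
errors of X1 and X2, by a Student-t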
distribution with effective degrees of freedom defined as a particular function
of the data.
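As a minimal sketch of that "particular function of the data", the standard Welch-Satterthwaite formula can be written as below. The names effective_df, u1, u2, df1 and df2 are illustrative assumptions: u1 and u2 stand for the estimated standard errors of X1 and X2, and df1 and df2 for their degrees of freedom.

```python
def effective_df(u1, df1, u2, df2):
    """Welch-Satterthwaite effective degrees of freedom for X1 - X2.

    u1, u2   : estimated standard errors of the two results (assumed names)
    df1, df2 : degrees of freedom behind u1 and u2
    """
    combined_variance = u1**2 + u2**2
    return combined_variance**2 / (u1**4 / df1 + u2**4 / df2)


# Hypothetical inputs: equal standard errors, degrees of freedom 1 and 4.
print(effective_df(1.0, 1, 1.0, 4))  # 3.2
```

The result always lies between the smaller of df1 and df2 and their sum, so the difference can carry more effective degrees of freedom than its least well-determined component.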
In the course of analyzing the data for a major international interlaboratory
comparison, an awkward and counterintuitive property of the Welch-Satterthwaite
procedure was observed. Namely, the 95% confidence interval for a between-lab
difference, X1-X2, can be narrower than the corresponding 95%
interval for one of the component results, say X1.
Using the symbol U to denote the half-width of a confidence interval, this condition is
U1-2<U1. This occurs when X1 has low degrees of freedom
(say 1 or 2), and therefore a large Student-t multiplier for 95%
confidence, while the effective degrees of freedom obtained from the
Welch-Satterthwaite approximation is larger.
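A rough numerical illustration of this condition follows. It is only a sketch under stated assumptions: the inputs u1 = u2 = 1 with 1 and 4 degrees of freedom are hypothetical, the helper names t_cdf, t_quantile, effective_df and half_widths are invented for the example, and the Student-t quantile is obtained by numerical integration so that only the Python standard library is needed.

```python
import math


def t_cdf(x, df, steps=200):
    """Student-t CDF at x for df >= 1 (non-integer df allowed)."""
    # With t = sqrt(df) * tan(phi) the density integrates to
    # c * integral of cos(phi)**(df - 1), evaluated here by Simpson's rule.
    theta = math.atan(abs(x) / math.sqrt(df))
    h = theta / steps
    total = 1.0 + math.cos(theta) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    half = c * total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half


def t_quantile(p, df):
    """Upper quantile (p > 0.5) of Student-t, by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # expand the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def effective_df(u1, df1, u2, df2):
    """Welch-Satterthwaite effective degrees of freedom."""
    return (u1**2 + u2**2) ** 2 / (u1**4 / df1 + u2**4 / df2)


def half_widths(u1, df1, u2, df2, level=0.95):
    """Return (U1, U1-2): half-widths for X1 alone and for X1 - X2."""
    p = 0.5 + level / 2
    U1 = t_quantile(p, df1) * u1
    nu = effective_df(u1, df1, u2, df2)
    U12 = t_quantile(p, nu) * math.sqrt(u1**2 + u2**2)
    return U1, U12


U1, U12 = half_widths(1.0, 1, 1.0, 4)  # hypothetical inputs
print(f"U1 = {U1:.2f}, U1-2 = {U12:.2f}, U1-2 < U1: {U12 < U1}")
```

With these inputs the multiplier for X1 alone is the Student-t value for 1 degree of freedom, about 12.7, while the difference uses an effective 3.2 degrees of freedom and therefore a much smaller multiplier, which more than offsets the larger combined standard error.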
The typical reaction to this situation is to suspect the Welch-Satterthwaite
procedure of failing to achieve the nominal 95% confidence level. However,
this is not the correct explanation. In fact, situations exist where all
three of the confidence intervals involved, for X1, X2, and for X1-X2,
achieve the stated 95% level of confidence, yet
U1-2<U1.
Figure 16 illustrates a simulation study in which 10,000
sets of confidence intervals were computed for a situation where
and the degrees of freedom were 1 and 4, for labs 1 and
2, respectively. These results show that for this situation, the coverages for
all three intervals achieve the desired 95% confidence level. In the
simulation, the counterintuitive condition
U1-2<U1 occurs most of the
time (in 86% of the simulations), as shown by the preponderance of points
plotted below the diagonal in the figure. Even more surprising is that the
(conditional) coverage of the interval for the difference X1-X2 gets
worse when that interval is wider than the corresponding interval for X1
alone. The simulation shows that the conditional coverage of the intervals for
X1-X2 is only 91% when
U1-2>U1, the condition that agrees with
intuition, as compared to 96.2% when the counterintuitive condition holds.
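A simulation of this general kind can be sketched as follows. This is not the authors' code. It assumes normally distributed results, true means of zero for both labs, and equal true standard errors (se1 = se2 = 1), which is an assumption made only for illustration. The function names are hypothetical, and the output need not reproduce the percentages quoted above.

```python
import math
import random


def t_cdf(x, df, steps=200):
    """Student-t CDF at x for df >= 1, by Simpson's rule after t = sqrt(df)*tan(phi)."""
    theta = math.atan(abs(x) / math.sqrt(df))
    h = theta / steps
    total = 1.0 + math.cos(theta) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)
    half = c * total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half


def t_quantile(p, df):
    """Upper quantile (p > 0.5) of Student-t, by bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if t_cdf(mid, df) < p else (lo, mid)
    return 0.5 * (lo + hi)


def simulate(n_sims=10000, se1=1.0, se2=1.0, df1=1, df2=4, seed=1):
    """Estimate coverages and how often U1-2 < U1 (integer df1, df2 only)."""
    rng = random.Random(seed)
    t1, t2 = t_quantile(0.975, df1), t_quantile(0.975, df2)
    cov1 = cov2 = cov12 = n_narrow = cov_narrow = cov_wide = 0
    for _ in range(n_sims):
        # Lab results with true mean 0, so the true difference is also 0.
        x1, x2 = rng.gauss(0.0, se1), rng.gauss(0.0, se2)
        # Estimated variances: true variance times chi-square(df) / df.
        v1 = se1**2 * sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df1)) / df1
        v2 = se2**2 * sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df2)) / df2
        u1, u2, u12 = math.sqrt(v1), math.sqrt(v2), math.sqrt(v1 + v2)
        nu = (v1 + v2) ** 2 / (v1**2 / df1 + v2**2 / df2)
        cov1 += abs(x1) <= t1 * u1
        cov2 += abs(x2) <= t2 * u2
        # |x1 - x2| <= t_nu * u12 is equivalent to cdf(|x1 - x2| / u12) <= 0.975.
        covered = t_cdf(abs(x1 - x2) / u12, nu) <= 0.975
        # U1-2 < U1 is equivalent to t_nu < t1 * u1 / u12, i.e. cdf(...) > 0.975.
        narrow = t_cdf(t1 * u1 / u12, nu) > 0.975
        cov12 += covered
        n_narrow += narrow
        if narrow:
            cov_narrow += covered
        else:
            cov_wide += covered
    n_wide = n_sims - n_narrow
    return {
        "cover_lab1": cov1 / n_sims,
        "cover_lab2": cov2 / n_sims,
        "cover_diff": cov12 / n_sims,
        "frac_narrower": n_narrow / n_sims,
        "cover_diff_when_narrower": cov_narrow / n_narrow if n_narrow else None,
        "cover_diff_when_wider": cov_wide / n_wide if n_wide else None,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value}")
```

The two comparisons inside the loop are written in terms of the Student-t CDF rather than its quantile, which avoids a root search for every simulated data set. To plot half-widths as in the figure, U1-2 could instead be computed explicitly as t_quantile(0.975, nu) * u12, at a higher running cost.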
The Bayesian approach to this problem leads to using the Behrens-Fisher
distribution to obtain a 95% uncertainty interval for the between-lab difference.
Since it can be shown that the Behrens-Fisher distribution does not exhibit the
counterintuitive property described above, this fact may help convince
physical scientists to make more use of Bayesian methods.
Figure 16: Comparison of half-widths of 95% confidence intervals for one mean, U1,
and for the difference of two means, U1-2.
Results for which the interval for X1-X2
fails to cover the true value are shown in red.
The preponderance of points
below the diagonal illustrates that, for the situation studied, the most
common outcome yields the counterintuitive condition that
U1-2<U1.
Further, the location of the red points in the
figure illustrates that the interval for the difference
is relatively more likely to fail (to cover the true value) when the outcome
falls above the diagonal, i.e. when the uncertainties are more consistent with intuition.