Or why “The error bars overlap” is a meaningless statement.
In my work as an astrophysicist I have encountered quite a bit of confusion on how to quantify the level of disagreement between two or more measurements with error bars, or even more complex multi-dimensional confidence volumes. This is even more true this term when I am teaching an undergraduate lab course. Thinking a little about this myself, I realized that this is not something I have ever seen written up. At least not in any concise form. So I decided to create this little document, not only to educate others, but also to organize my own thoughts. I put an emphasis on developing the concepts and I make no attempt at rigour.
Suppose we have three measurements of the gravitational acceleration on the Earth’s surface, all with their own error bars (or measurement uncertainties, if you prefer). These are shown in the plot below and the error bars indicate their $1\sigma$ standard deviation.
For the first two of these measurements we usually say that they agree within their errors with the known true value. The third data point is at $(9.89\pm0.03)\,\mathrm{m\,s^{-2}}$. We call this a $2.7\sigma$ disagreement with the known value of $9.81\,\mathrm{m\,s}^{-2}$. This is easy to do because the true value is known so accurately that we can treat it as error free. But what about the mutual agreement between the three measurements? They all have errors. How do we decide whether they are in conflict and how significant this conflict is? Visually it is still easy for measurements 2 and 3. The data point of measurement 3 is within the upper error bar of measurement 2. We can say these two measurements agree within their errors, even though point 3 is a mildly significant outlier from the canonical value of $g$.
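The $2.7\sigma$ quoted for measurement 3 is just the offset from the reference value divided by the error bar. A minimal sketch of that arithmetic, where the function name `sigma_deviation` is my own choice for illustration:

```python
def sigma_deviation(value, sigma, reference):
    """Offset of a measurement from an error-free reference, in units of sigma."""
    return abs(value - reference) / sigma


# Measurement 3 against the canonical value of g (both in m/s^2).
n_sigma = sigma_deviation(9.89, 0.03, 9.81)
print(f"{n_sigma:.1f} sigma")  # prints "2.7 sigma"
```

This only works because the reference value carries no uncertainty of its own.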
What about measurements 1 and 2? For such cases we often find phrases like “the estimates agree with overlapping $1\sigma$ error bars” or “they agree within their $1\sigma$ error bars” in the astronomical literature. What is the first phrase supposed to mean and is the second one even correct? We use the $\sigma$ notation as a way of assigning probabilities to discrepancies: for Gaussian errors, a deviation of $1\sigma$ or more arises from random scatter alone in 31.7% of all cases, while a deviation of $2\sigma$ or more happens randomly in only 4.5% of all cases. Statements like the ones above do not allow us to deduce such probabilities. If the $1\sigma$ error bars of two measurements overlap but mutually exclude the other value, what is the probability that these two data points actually agree or disagree?
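For reference, the percentages attached to the $\sigma$ notation follow from the two-sided tail of a Gaussian distribution. A small sketch, assuming Gaussian errors (the function name is mine):

```python
import math


def two_sided_tail_probability(n_sigma):
    """Probability that a Gaussian variable lies at least n_sigma from its mean."""
    return math.erfc(n_sigma / math.sqrt(2.0))


for n in (1.0, 2.0, 3.0):
    print(f"{n:.0f} sigma: {100 * two_sided_tail_probability(n):.2f}%")
```

For $1\sigma$ and $2\sigma$ this gives about 31.73% and 4.55%, consistent with the numbers quoted above.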
Deciding whether a single data point with errors agrees with an error-free value was easy. We just checked whether the difference between the measured and the true value is smaller than the measurement uncertainty. This description already holds the key to the procedure for determining whether measurements 1 and 2 are compatible. If two measurements agree, their difference must be compatible with zero. So we just compute the difference of measurements 1 and 2, $\Delta_{12} = g_1 - g_2$, and check whether the error bar of the difference includes zero, where here and in the following the subscripts denote the measurement number.
What is the error bar of the difference $\Delta$? Simple, we just use propagation of uncertainties: $$ \sigma_{\Delta_{12}}^2 = \sigma_1^2 \left(\frac{\partial \Delta}{\partial g_1}\right)^2 + \sigma_2^2 \left(\frac{\partial \Delta}{\partial g_2}\right)^2 + 2 \sigma_{12}^2 \frac{\partial \Delta}{\partial g_1} \frac{\partial \Delta}{\partial g_2} $$ The last term in the sum may be unfamiliar and many introductory texts simply omit it. It takes into account possible correlations between measurements 1 and 2. For now we simply assume that $\sigma_{12}$ is zero. Then we have $$ \sigma_{\Delta_{12}} = \sqrt{\sigma_1^2 + \sigma_2^2}\;. $$
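The propagation formula can be sketched in a few lines of Python. Because the partial derivatives of $g_1 - g_2$ are $+1$ and $-1$, the covariance term enters with a minus sign; the argument `cov12` plays the role of the covariance term (written $\sigma_{12}^2$ in the formula above) and defaults to zero. Apart from measurement 3, the central values and error bars are not stated in the text, so the numbers for measurements 1 and 2 below are assumptions that I picked for illustration.

```python
import math


def difference_with_uncertainty(x1, sigma1, x2, sigma2, cov12=0.0):
    """Return (x1 - x2, uncertainty of the difference).

    d(x1 - x2)/dx1 = +1 and d(x1 - x2)/dx2 = -1, so the covariance
    term is subtracted. cov12 = 0 means uncorrelated measurements.
    """
    variance = sigma1**2 + sigma2**2 - 2.0 * cov12
    if variance < 0:
        raise ValueError("covariance is too large for the given error bars")
    return x1 - x2, math.sqrt(variance)


# (value, 1-sigma error) in m/s^2. Only measurement 3 is quoted in the
# text; the entries for 1 and 2 are ASSUMED values for illustration.
measurements = {1: (9.75, 0.07), 2: (9.86, 0.06), 3: (9.89, 0.03)}

for a, b in [(1, 2), (1, 3), (2, 3)]:
    delta, sigma = difference_with_uncertainty(*measurements[a], *measurements[b])
    print(f"Delta_{a}{b} = {delta:+.2f} +/- {sigma:.2f}")
```

A positive covariance shrinks the error bar of the difference, so ignoring a real correlation can hide a disagreement.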
The results for our measurements are $\Delta_{12} = (-0.11 \pm 0.09)\,\mathrm{m\,s^{-2}}$, $\Delta_{13} = (-0.14 \pm 0.08)\,\mathrm{m\,s^{-2}}$, and $\Delta_{23} = (-0.03 \pm 0.07)\,\mathrm{m\,s^{-2}}$, or in graphical form:
As expected we find that measurements 2 and 3 agree. After all, one point is contained within the error bar of the other. Also as expected we see that measurements 1 and 3 are incompatible at the $1\sigma$ level. Their error bars do not even overlap. Now the interesting case: the difference between data points 1 and 2 also shows a disagreement at the $1\sigma$ level. This is the case where the error bars overlap but the data points are mutually outside each other’s uncertainty. Can we conclude now that in such cases a significant disagreement exists between data points?
Unfortunately not. Consider another data point in our experiment:
Again the error bars of points 1 and 4 overlap without containing each other’s central value. Yet the $1\sigma$ error bar of the difference includes zero, $\Delta_{14} = (-0.10 \pm 0.11)\,\mathrm{m\,s^{-2}}$:
What we learn from this is that a statement à la “the error bars overlap” does not allow us to judge at what level two measurements agree or disagree. In a setting in which we often have formal requirements on what we call significant and what not, such statements are thus not helpful, and it would be better to retire them in favor of more quantitative expressions.
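To make the contrast concrete, here is a rough sketch that puts the visual overlap classification next to the significance of the difference. It assumes uncorrelated Gaussian errors. The function names and the central values and error bars of measurements 1, 2, and 4 are my assumptions for illustration; only measurement 3 is quoted in the text.

```python
import math

# (value, 1-sigma error) in m/s^2. Entries 1, 2 and 4 are ASSUMED
# illustrative values; only measurement 3 is quoted in the text.
measurements = {1: (9.75, 0.07), 2: (9.86, 0.06), 3: (9.89, 0.03), 4: (9.85, 0.08)}


def overlap_status(a, b):
    """Classify a pair of measurements by how their 1-sigma bars relate visually."""
    (xa, sa), (xb, sb) = a, b
    gap = abs(xa - xb)
    if gap > sa + sb:
        return "bars do not overlap"
    if gap <= sa or gap <= sb:
        return "a bar contains the other value"
    return "bars overlap, values mutually excluded"


def n_sigma_difference(a, b):
    """Significance of the difference, assuming uncorrelated errors."""
    (xa, sa), (xb, sb) = a, b
    return abs(xa - xb) / math.sqrt(sa**2 + sb**2)


for i, j in [(1, 2), (1, 3), (2, 3), (1, 4)]:
    a, b = measurements[i], measurements[j]
    n = n_sigma_difference(a, b)
    p = math.erfc(n / math.sqrt(2.0))  # two-sided Gaussian tail probability
    print(f"{i}-{j}: {overlap_status(a, b):40s} {n:.2f} sigma, p = {p:.2f}")
```

With these values the pairs 1-2 and 1-4 land in the same visual category, yet their differences fall on opposite sides of $1\sigma$. Quoting the number of $\sigma$ of the difference, or the corresponding probability, avoids that ambiguity.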