We provide datasets with certified values for the posterior mean and standard deviation, to assess the accuracy of Markov chain Monte Carlo (MCMC) calculations in statistical software. Computational inaccuracy in MCMC has at least 5 sources:
Truncation error is the representation error that arises when decimal numbers are stored inexactly in binary under IEEE standard arithmetic. Once digits have been truncated they cannot be recovered; their effect can at best be held constant, and at worst it propagates into larger errors through subsequent computations.
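As a small illustration of this representation error, the following Python sketch compares a decimal literal with the binary double that is actually stored for it. The helper name `representation_error` and the sample literals (including the stiff-looking value) are made up for this example and are not taken from the datasets.

```python
from decimal import Decimal

def representation_error(literal: str) -> Decimal:
    """Stored binary double minus the decimal value written in `literal`."""
    stored = Decimal(float(literal))   # exact value of the nearest double
    intended = Decimal(literal)        # exact decimal value as written
    # The subtraction is rounded to the default Decimal precision (28 digits).
    return stored - intended

# 0.5 is a power of two and is stored exactly; the other literals are not.
for literal in ("0.5", "0.1", "10000000000000.2"):
    print(literal, representation_error(literal))
```

A nonzero result means that the digits beyond the stored precision were lost before any statistical computation began.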
Cancellation error occurs when analyzing data that have low relative variation; that is, data with a high level of "stiffness". In "Assessing the Accuracy of ANOVA Calculations in Statistical Software" (Computational Statistics & Data Analysis 8 (1989), pp. 325-332), Simon and Lesage noted that as the number of constant leading digits in a dataset increases and the data grow more nearly constant (i.e., as the stiffness increases), accurate computation of standard deviations becomes increasingly difficult. The same holds for similarly computed summary statistics, such as the autocorrelation coefficient. In both cases the computation subtracts from each data value a mean that is very close to it, leaving behind the trailing mantissa digits of each value, which are the ones most likely to have been misrepresented. In other words, accuracy is difficult to maintain when a very small true difference is computed as the difference of two large numbers.
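The effect can be seen with a single subtraction. The sketch below, in Python, adds a small offset to a large base value and then subtracts the base again; the function name and the numbers are illustrative assumptions, not values from the datasets.

```python
def recovered_offset(base: float, offset: float) -> float:
    """Add a small offset to a large base, then subtract the base again."""
    return (base + offset) - base

# Near 1.0 a double carries about 16 significant digits of the offset.
print(recovered_offset(1.0, 0.3))

# Near 1e15 adjacent doubles are 0.125 apart, so 0.3 is stored as 0.25
# and the subtraction exposes that error in full.
print(recovered_offset(1e15, 0.3))
```

In the second call the recovered difference is off by about 17 percent, even though each stored number is accurate to roughly 16 significant digits.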
Accumulation error (also noted by Simon and Lesage) is the error that grows in direct proportion to the total number of arithmetic operations, which here is in turn proportional to the number of observations. The accumulation of many small errors makes accurate computation difficult.
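A minimal Python sketch of accumulation error follows. It sums the same values with a plain running total and with `math.fsum` from the standard library, which tracks the lost low-order bits. An explicit loop is used because the built-in `sum` in recent Python versions already applies a compensated algorithm to floats; the function name `naive_sum` is an assumption made for this example.

```python
import math

def naive_sum(values):
    """Plain left-to-right running total; each addition rounds once."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1] * 10
print(naive_sum(values))   # rounding errors from each addition accumulate
print(math.fsum(values))   # correctly rounded sum of the stored values
```

The two printed results differ in the last digit for only ten additions; the discrepancy generally grows as more terms are added.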
Digital computers generate random numbers using deterministic recursive mathematical rules, so the resulting sequences are predictable and are called pseudo-random numbers. The quality of a pseudo-random number generator with respect to true randomness varies and depends on:
Simulation error for MCMC can come both from the randomness in the simulated values, and from the fact that the observations in the simulated Markov chain are not independently distributed, even asymptotically. Furthermore, there is also the "burn-in" error due to using a large finite number to approximate infinity.
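To illustrate why dependence between successive draws matters, the Python sketch below uses a first-order autoregressive sequence as a stand-in for MCMC output (an assumption; it is not a real sampler and not the procedure used to certify the datasets). It discards an initial burn-in segment and compares the naive standard error of the mean, which treats the draws as independent, with a batch-means estimate. All function names, the seed, and the tuning values are hypothetical.

```python
import random
import statistics

def ar1_chain(n, rho, seed=0):
    """Toy correlated sequence x[t] = rho * x[t-1] + standard normal noise."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain

def naive_se(chain):
    """Standard error of the mean, assuming independent draws."""
    return statistics.stdev(chain) / len(chain) ** 0.5

def batch_means_se(chain, n_batches=20):
    """Standard error of the mean estimated from means of consecutive batches."""
    size = len(chain) // n_batches
    means = [statistics.fmean(chain[i * size:(i + 1) * size])
             for i in range(n_batches)]
    return statistics.stdev(means) / n_batches ** 0.5

burn_in = 1000
chain = ar1_chain(21000, rho=0.9)[burn_in:]   # drop the initial segment
print(naive_se(chain), batch_means_se(chain))
```

With strong positive autocorrelation the naive figure is noticeably smaller than the batch-means figure, so treating the draws as independent overstates the accuracy of the posterior mean.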
Levels of Difficulty:
We use generated datasets to examine computational accuracy at different stiffness levels. Following the benchmark work of Simon and Lesage (1989), our generated datasets have the number of constant leading digits set to 9, 10, 11, 12, 13, or 14, with 11 observations in each of the six datasets.
Datasets are ordered by level of difficulty (lower, average, and higher) according to their stiffness, that is, the number of constant leading digits. This ordering is meant only as rough guidance for the user; producing correct results on a dataset of higher difficulty does not imply that your software will correctly solve all datasets of average or even lower difficulty. Of the six datasets, two are of lower difficulty, two of average difficulty, and two of higher difficulty, corresponding to 9 and 10, 11 and 12, and 13 and 14 constant leading digits, respectively.
Possible Computational Accuracy Improvements:
To improve computational accuracy for a dataset, one remedial measure is to carry out the MCMC computation with the constant leading digits subtracted from the data, and then add the leading digits back to the MCMC result to restore the original scale of the data before performing the final statistical analysis. Another improvement is to make sure that the sample standard deviation is computed by the formula that first computes deviations about the mean and then squares and sums them, as opposed to the old desk calculator formula of a generation ago, which involves the (computationally unstable) difference of two large numbers: the sum of squares of the raw (uncentered) data and n times the squared sample mean.
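Both remedies can be sketched in Python on a plain sample variance. The data below are made up for illustration (eleven values with a constant leading part of 1e9) and are not one of the certified datasets; the function names are likewise assumptions. The sketch covers only the summary-statistic step, not an MCMC run.

```python
def two_pass_variance(data):
    """Sample variance from deviations about the mean (stable form)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def desk_calculator_variance(data):
    """Sample variance as (sum of squares - n * mean^2) / (n - 1) (unstable)."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x * x for x in data) - n * mean * mean) / (n - 1)

base = 1e9                                    # hypothetical constant leading part
data = [base + k for k in range(1, 12)]       # 11 made-up observations
shifted = [x - base for x in data]            # leading digits subtracted

print(two_pass_variance(data))                # exact answer is 11.0
print(desk_calculator_variance(data))         # ruined by cancellation
print(desk_calculator_variance(shifted))      # accurate again after shifting
print(base + sum(shifted) / len(shifted))     # mean restored to original scale
```

The variance and standard deviation are unchanged by the shift, so only location summaries such as the mean need the leading part added back.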
As noted in the General Background Information, producing correct results for all datasets in this collection does not imply that your software will do the same for your own particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software.
We plan to update this collection of datasets in the future, and welcome your feedback on specific datasets to include, and on other ways to improve this web service.