Your document treats, at least in part, the same problems I had in mind when I promised Carroll Croarkin to write a contribution. You have touched upon some essential problems and your approaches seem interesting, but we also disagree on how to solve the problems. I have found several points which I consider questionable or obvious mistakes, and you will probably have corresponding objections to my document, which I will send to you as soon as I have finished a first draft. However, I usually consider disagreements a good starting point for improvements and I look forward to creative discussions.
As a background to some of my comments I have given some views on the top-down approach for evaluation of measurement uncertainty (a more detailed presentation appears in my contribution mentioned above).
For evaluation of the uncertainty according to a top-down approach a realistic error component model is needed.
Let us assume that measurements are performed batch-wise and each batch (in the following called run) is calibrated by using some reference materials (called calibrators).
In the ideal situation, when the measured objects (samples) behave in the same way as the calibrators in the sense that a change in the measurement conditions has the same effect on both the measured object and the calibrators, we should have the following components of variation:
If a measured object and the calibrators don't behave in the same way, there will also be a variation due to, for instance, changes in reagents and environmental conditions. These factors are often time-dependent and may be collected in a time component.
The factors causing a variation with time within a laboratory often vary more between laboratories and, thus, there will also be a variation between laboratories.
Different behavior in the measurement system between a measured object and the calibrators is also a cause for systematic error. But as there may be a difference between a measured object and the calibrators there may also be differences in behavior between different objects. Thus, a systematic error is related to a specific object or type of objects.
According to the discussion above a measurement result may be partitioned into the following components
where
y measurement result of an object
Y the true value
b(μ) a common systematic error that is constant or only dependent on the level μ of the measurand
d a systematic error that is specific for the measured object
blab effect of a certain laboratory
btime the component of a long-range variation within a laboratory
brun random calibration error for a run
e random error within a run.
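Written out, the partition described by the list above would presumably take an additive form. The sketch below is an assumption inferred from the component definitions and from the statement that the variances are added; the subscripted symbols stand for blab, btime and brun, and μ denotes the level of the measurand.

```latex
y = Y + b(\mu) + d + b_{\mathrm{lab}} + b_{\mathrm{time}} + b_{\mathrm{run}} + e
```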
The variance of y is obtained by adding the variances of the relevant random errors and correction errors (for systematic errors). The sample-specific systematic errors can often be treated as random errors. The corresponding standard deviation of y is called the standard uncertainty.
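A minimal sketch of this addition of variances, assuming that the components are independent (no covariance terms). The component names and the numerical standard deviations are invented for illustration only.

```python
import math

def standard_uncertainty(component_sds):
    """Combine independent error components by adding their variances."""
    return math.sqrt(sum(sd ** 2 for sd in component_sds.values()))

# Hypothetical standard deviations, in the units of the measurand.
components = {
    "correction_common": 0.10,  # uncertainty of the correction for the common systematic error
    "sample_specific": 0.20,    # sample-specific systematic error, treated as random
    "lab": 0.30,                # between-laboratory component
    "time": 0.15,               # long-range variation within a laboratory
    "run": 0.10,                # random calibration error for a run
    "within_run": 0.05,         # random error within a run
}

u_y = standard_uncertainty(components)
print(f"standard uncertainty of y: {u_y:.3f}")
```

Which components are relevant depends on the conditions under which the result is used; a component that does not vary under those conditions would be left out of the dictionary.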
It should be noted that it is often difficult or impossible to take a representative sample from the target population, i.e. the population of measurements for which we want to estimate the bias and variance. The sampled population usually has a smaller variance than the target population.
Page 3, line 2
x1, x2, ... are input quantities and not parameters. A parameter is a measurable characteristic of the universe. The expected value μ and the standard deviation σ are examples of parameters.
Page 3, line 6 from the bottom
Why not use the same subscript for sw and its estimate sr?
Page 4, line 6 from the bottom
As far as I can see there is only one random error component within laboratories, i.e. all measurements within a laboratory are assumed to be independent. This is not a realistic assumption; compare my model above.
Are there any sample-related error components in your general model?
Page 6, line 1
What is the uncertainty of DB used for? If the laboratory bias Bl is estimated directly, why should we then be interested in the uncertainty of DB? As far as I can see it is not used in the inequality on line 5.
Page 6, line 2
Under which conditions are the replicate measurements performed?
Page 6, line 7 from the top and from the bottom
What are n and sw in the formula? If sw is an estimate of a standard deviation, it shouldn't be mixed with true standard deviations.
Page 6, line 9 from the bottom
I can't see why the laboratory bias Dl (earlier denoted Bl) should now be < 2sD, while in case i) it was the bias with respect to the study population. In both cases I would like to see the error models used for the uncertainty evaluations expressed more explicitly.
Page 6, line 5 from the bottom
On line 2 sl is a standard deviation within a laboratory, and here sl is a between-laboratory standard deviation. I think you should use less confusing notation.
Page 7, lines 4-9
Extremely confusing! What is the combined uncertainty supposed to be an uncertainty of? A systematic error or a measurement result? What happens if the test gives a significant result? Furthermore, the paired test is dubious. If a set of samples is compared by two methods, the systematic differences may depend on the samples. These systematic differences can often be partitioned into two components: one common component expressed as a function of the measurand and one component which is specific to the individual samples. To hide these possibilities in a mean difference is a mistake of a type that confirms the scurrilous portrait of a statistician as a person who says that when you have one foot in an oven and the other in a freezer you will on average feel fine. The standard deviation of the paired differences may mainly reflect a variation in the systematic differences between the samples. Moreover, if the measurements with each method are performed in one run, we are only comparing one run from each method. The number of independent observations from the methods is definitely not the number of samples measured.
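The oven-and-freezer point can be illustrated with a small sketch. The data are invented, and random measurement error is deliberately set to zero so that every difference is a purely sample-specific systematic difference.

```python
import statistics

# Hypothetical results for six samples measured by two methods.
# No random error is simulated: each difference is systematic and sample-specific.
method_a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
method_b = [12.0, 18.0, 32.0, 38.0, 52.0, 58.0]

diffs = [b - a for a, b in zip(method_a, method_b)]

mean_diff = statistics.mean(diffs)  # what the paired test looks at
sd_diff = statistics.stdev(diffs)   # reflects only the sample-specific differences here

print("differences:", diffs)
print("mean difference:", mean_diff)
print("standard deviation of differences:", round(sd_diff, 3))
```

In this constructed case the mean difference is zero, so a paired test reports no bias, although every single sample differs by two units between the methods. The standard deviation of the differences is not a measure of random error at all.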
Page 7, Example from EURACHEM Guide
Is only one test item used? If yes, under which conditions are the replicated measurements performed?
Is it reasonable to pool the standard deviations from two different methods? Are there any reasons to believe that the two methods should have the same standard deviation?
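For reference, a sketch of the usual pooled estimate, which weights each variance by its degrees of freedom. The numbers are invented and the function name is hypothetical; the formula presupposes that both methods share one true standard deviation, which is exactly the assumption questioned here.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation, weighting each variance by its degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

# Hypothetical case: two methods with clearly different precision, five replicates each.
s_pooled = pooled_sd(1.0, 5, 3.0, 5)
print(round(s_pooled, 3))
```

With standard deviations of 1 and 3 the pooled value lies between them and describes neither method.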
Page 8, scenario b)
Are the measurements performed on the same or on different occasions? The systematic differences may not be the same for all test items; see the comments on page 7, lines 4-9.
Is a correction for the bias introduced in the measurement result? In that case it does not matter whether the bias is significant or not. If there are differences between laboratories and a laboratory bias is the same for all test items (which seems to be assumed) a t-test should in fact give a significant result if the number of test items is sufficiently large. It is not reasonable to assume laboratory biases in the model and then to say that the bias of a specific laboratory is not under control if it turns out to be significantly different from 0.
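The dependence on the number of test items can be sketched with the one-sample t-statistic. The sketch assumes a fixed non-zero laboratory bias, an observed mean equal to that bias, and a fixed standard deviation of the individual differences; all numbers are invented.

```python
import math

def t_statistic(mean_bias, sd, n):
    """One-sample t-statistic for a mean bias estimated from n test items."""
    return mean_bias / (sd / math.sqrt(n))

bias, sd = 0.5, 2.0  # hypothetical laboratory bias and standard deviation
t_values = {n: t_statistic(bias, sd, n) for n in (4, 16, 64, 256)}

for n, t in t_values.items():
    print(n, t)
```

The statistic grows in proportion to the square root of n, so any non-zero bias eventually becomes significant when enough test items are measured.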
Again, I want an error model expressed explicitly. Whether the uncertainty on the last line in the paragraph is relevant depends on whether a correction is performed or not.
Page 8, NOTE 2
Can you give an interpretation of the concept target range?
To which error component is the uncertainty mentioned in the note assigned?
Page 9, lines 1-2
I don't understand the sentence.
Page 12, line 2 from the bottom
Type I and type II errors refer to hypothesis testing and not to confidence intervals.
Page 15, Case 1
If no correction is performed (= correction with 0), an a priori distribution for the bias should be used for evaluation of the uncertainty.
Page 15, line 2 from the bottom
Explain the equation [1].
Page 17, clause 1
In order to eliminate the contribution to a systematic error from an influence quantity, one must also estimate the target value corresponding to no error contribution. As both this target value and the slope c are estimated from experimental results, they are probably uncertain and contribute to the total uncertainty. Often these values depend on unknown influence quantities belonging to the test item and, thus, they must be estimated for each test item separately. In practice this is usually an almost impossible task.
Page 17, clause 2
I think my procedure for comparison of methods (see my contribution Evaluation of Measurement Uncertainty with special attention to chemical methods) is preferable, as it partitions the systematic errors into a common component, depending only on the level of the measurand, and a sample-related component.