4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.4. How can I tell if a model fits my data?

4.4.4.6. How can I test whether any significant terms are missing or misspecified in the functional part of the model?

Statistical Tests Can Augment Ambiguous Residual Plots

Although the residual plots discussed on pages 4.4.4.1 and
4.4.4.3 will often indicate whether any important variables are missing or
misspecified in the functional part of the model, a statistical test of the hypothesis that the model is
sufficient may be helpful if the plots leave any doubt. Although it may seem tempting to use this type of
statistical test in place of residual plots since it apparently assesses the fit of the model objectively,
no single test can provide the rich feedback to the user that a graphical analysis of the residuals can
provide. Furthermore, while model completeness is one of the most important aspects of model adequacy, this
type of test does not address other important aspects of model quality. In statistical jargon, this type of
test for model adequacy is usually called a "lack-of-fit" test.

General Strategy

The most common strategy used to test for model adequacy is to compare the amount of random variation in the
residuals from the data used to fit the model with an estimate of the random variation in the process using
data that are independent of the model. If these two estimates of the random variation are similar, that
indicates that no significant terms are likely to be missing from the model. If the model-dependent estimate
of the random variation is larger than the model-independent estimate, then significant terms probably are
missing or misspecified in the functional part of the model.

Testing Model Adequacy Requires Replicate Measurements

The need for a model-independent estimate of the random variation means that replicate measurements made
under identical experimental conditions are required to carry out a lack-of-fit test. If no replicate
measurements are available, then there will not be any baseline estimate of the random process variation to
compare with the results from the model. This is the main reason that the use of replication is emphasized
in experimental design.

Data Used to Fit Model Can Be Partitioned to Compute Lack-of-Fit Statistic

Although it might seem like two sets of data would be needed to carry out the lack-of-fit test using the
strategy described above, one set of data to fit the model and compute the residual
standard deviation and the other to compute the model-independent estimate of the random variation, that
is usually not necessary. In most regression applications, the same data used to fit the model can also be
used to carry out the lack-of-fit test, as long as the necessary replicate measurements are available.
In these cases, the lack-of-fit statistic is computed by partitioning the residual standard deviation into
two independent estimators of the random variation in the process. One estimator depends on the model and the
sample means of the replicated sets of data ($\hat{\sigma}_m$),
while the other estimator is a pooled standard deviation based on the variation
observed in each set of replicated measurements ($\hat{\sigma}_r$).
The squares of these two estimators of the random variation are often
called the "mean square for lack-of-fit" and the "mean square for pure error," respectively,
in statistics texts. The notation $\hat{\sigma}_m^2$ and $\hat{\sigma}_r^2$
is used here instead to emphasize the fact that, if the model fits the data, these quantities
should both be good estimators of $\sigma^2$.

Estimating $\sigma^2$ Using Replicate Measurements

The model-independent estimator of $\sigma^2$ is computed using the formula

$$\hat{\sigma}_r^2 = \frac{\sum_{i=1}^{n_u} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n - n_u}$$

with $n$ denoting the sample size of the data set used to fit the model,
$n_u$ the number of unique combinations of predictor variable levels,
$n_i$ the number of replicated observations at the $i^{th}$ combination of
predictor variable levels, the $y_{ij}$
the regression responses indexed by their predictor variable levels and
number of replicate measurements, and
$\bar{y}_i$ the mean of the responses at the $i^{th}$
combination of predictor variable levels. Notice that the formula for $\hat{\sigma}_r^2$
depends only on the data and not on the functional part of the model. This shows that
$\hat{\sigma}_r^2$ will be a good estimator of $\sigma^2$,
regardless of whether the model is a complete description of the process or not.
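As a concrete sketch of this computation, the pure-error estimate can be evaluated directly from replicated measurements. The data values below are hypothetical, invented purely for illustration; only the Python standard library's `statistics` module is used.

```python
from statistics import mean

# Hypothetical replicated data: each inner list holds the responses
# measured at one unique combination of predictor variable levels.
replicates = [
    [2.1, 2.3, 2.2],   # combination 1 (n_1 = 3 replicates)
    [3.9, 4.1],        # combination 2 (n_2 = 2 replicates)
    [6.0, 5.8, 6.1],   # combination 3 (n_3 = 3 replicates)
]

n = sum(len(r) for r in replicates)    # total sample size
n_u = len(replicates)                  # number of unique combinations

# Pooled sum of squared deviations of each response about its group mean
ss_pe = sum((y - mean(r)) ** 2 for r in replicates for y in r)

# Model-independent estimate of sigma^2 ("mean square for pure error")
sigma2_r = ss_pe / (n - n_u)
```

With these invented numbers, $n = 8$ and $n_u = 3$, so the estimate pools $n - n_u = 5$ degrees of freedom across the three groups; no model is needed at any point.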

Estimating $\sigma^2$ Using the Model

Unlike the formula for $\hat{\sigma}_r^2$,
the formula for $\hat{\sigma}_m^2$,

$$\hat{\sigma}_m^2 = \frac{\sum_{i=1}^{n_u} n_i \left(\bar{y}_i - f(\vec{x}_i;\hat{\vec{\beta}})\right)^2}{n_u - p}$$

(with $p$ denoting the number of unknown parameters in the model) does
depend on the functional part of the model. If the model were correct, the value of the function would
be a good estimate of the mean value of the response for every combination of predictor variable values.
When the function provides good estimates of the mean response at the $i^{th}$ combination, then
$\hat{\sigma}_m^2$ should be close in value to $\hat{\sigma}_r^2$
and should also be a good estimate of $\sigma^2$.
If, on the other hand, the function is missing any important terms (within the range of the data),
or if any terms are misspecified, then the function will provide a poor estimate of the mean response
for some combinations of the predictors and $\hat{\sigma}_m^2$
will tend to be greater than $\hat{\sigma}_r^2$.
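The model-dependent estimate can be sketched in code as well. Here a two-parameter straight line is fit by ordinary least squares to hypothetical replicated data; the data values and the choice of model are illustrative assumptions, not taken from the text.

```python
from statistics import mean

# Hypothetical replicated data: predictor level -> replicate responses
data = {1.0: [2.1, 2.3, 2.2], 2.0: [3.9, 4.1], 3.0: [6.0, 5.8, 6.1]}
p = 2   # parameters in the assumed straight-line model y = b0 + b1*x

# Flatten the data and fit the line by ordinary least squares
xs = [x for x, ys in data.items() for _ in ys]
ys_all = [y for ys in data.values() for y in ys]
x_bar, y_bar = mean(xs), mean(ys_all)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys_all)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

n_u = len(data)   # number of unique predictor combinations

# Lack-of-fit sum of squares: replicate means vs. fitted values
ss_lof = sum(len(ys) * (mean(ys) - (b0 + b1 * x)) ** 2
             for x, ys in data.items())
sigma2_m = ss_lof / (n_u - p)
```

Since $n_u = 3$ and $p = 2$ in this toy example, the estimate carries only $n_u - p = 1$ degree of freedom; real designs would normally have more unique predictor combinations.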

Carrying Out the Test for Lack-of-Fit

Combining the ideas presented in the previous two paragraphs, following the general strategy outlined
above, the adequacy of the functional part of the model
can be assessed by comparing the values of
$\hat{\sigma}_m^2$ and $\hat{\sigma}_r^2$.
If $\hat{\sigma}_m^2 > \hat{\sigma}_r^2$,
then one or more important terms may be missing or misspecified in the functional
part of the model. Because of the random error in the data, however, we know that
$\hat{\sigma}_m^2$ will sometimes be larger than $\hat{\sigma}_r^2$
even when the model is adequate. To make sure that the hypothesis that the model is adequate
is not rejected by chance, it is necessary to understand how much greater than
$\hat{\sigma}_r^2$ the value of $\hat{\sigma}_m^2$
might typically be when the model does fit the data. Then the
hypothesis can be rejected only when $\hat{\sigma}_m^2$
is significantly greater than $\hat{\sigma}_r^2$.

When the model does fit the data, it turns out that the ratio

$$L = \frac{\hat{\sigma}_m^2}{\hat{\sigma}_r^2}$$

follows an F distribution. Knowing the probability distribution
that describes the behavior of the statistic, $L$,
we can control the probability of rejecting the hypothesis that the model is adequate in
cases when the model actually is adequate. Rejecting
the hypothesis that the model is adequate only when $L$
is greater than an upper-tail cut-off value from the F distribution with a user-specified
probability of wrongly rejecting the hypothesis
gives us a precise, objective, probabilistic definition of when
$\hat{\sigma}_m^2$ is significantly greater than $\hat{\sigma}_r^2$.
The user-specified probability used to obtain the cut-off value from the F distribution is
called the "significance level" of the test. The
significance level for most statistical tests is denoted by $\alpha$.
The most commonly used value for the significance level is $\alpha = 0.05$,
which means that the hypothesis of
an adequate model will only be rejected in 5 % of tests for which the model really is adequate. Cut-off values
can be computed using most statistical software or from tables
of the F distribution. In addition to needing the significance level to obtain the cut-off value, the F
distribution is indexed by the degrees of freedom associated with each of the two estimators of $\sigma^2$.
$\hat{\sigma}_m^2$, which appears in the numerator of $L$,
has $n_u - p$ degrees of freedom. $\hat{\sigma}_r^2$,
which appears in the denominator of $L$,
has $n - n_u$ degrees of freedom.
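The procedure described above can be sketched end to end. The data below are hypothetical, the model is an assumed straight line, and the cut-off 6.61 is the tabulated upper 5 % point of the F distribution with 1 and 5 degrees of freedom, matching this particular example's $n_u - p = 1$ and $n - n_u = 5$.

```python
from statistics import mean

# Hypothetical replicated data: predictor level -> replicate responses
data = {1.0: [2.1, 2.3, 2.2], 2.0: [3.9, 4.1], 3.0: [6.0, 5.8, 6.1]}
p = 2   # parameters in the assumed straight-line model

n = sum(len(ys) for ys in data.values())
n_u = len(data)

# Ordinary least-squares straight-line fit to all observations
xs = [x for x, ys in data.items() for _ in ys]
ys_all = [y for ys in data.values() for y in ys]
x_bar, y_bar = mean(xs), mean(ys_all)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys_all)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# Two independent estimates of sigma^2
ss_pe = sum((y - mean(ys)) ** 2 for ys in data.values() for y in ys)
ss_lof = sum(len(ys) * (mean(ys) - (b0 + b1 * x)) ** 2
             for x, ys in data.items())
sigma2_r = ss_pe / (n - n_u)    # pure error, n - n_u degrees of freedom
sigma2_m = ss_lof / (n_u - p)   # lack of fit, n_u - p degrees of freedom

L = sigma2_m / sigma2_r         # lack-of-fit test statistic

# Upper 5 % point of F(n_u - p, n - n_u) = F(1, 5), from an F table
cutoff = 6.61
print(f"L = {L:.3f}; reject model adequacy: {L > cutoff}")
```

For these invented data $L$ falls well below the cut-off, so the straight line would not be rejected. With `scipy` available, the cut-off could instead be computed as `scipy.stats.f.ppf(0.95, 1, 5)` rather than read from a table.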

Alternative Formula for $\hat{\sigma}_m^2$

Although the formula given above more clearly shows the nature of $\hat{\sigma}_m^2$,
the numerically equivalent formula below is easier to use in computations:

$$\hat{\sigma}_m^2 = \frac{(n - p)\,\hat{\sigma}^2 - (n - n_u)\,\hat{\sigma}_r^2}{n_u - p}$$

with $\hat{\sigma}$ denoting the residual standard deviation from the fit of the full model.
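The numerical equivalence rests on the fact that the residual sum of squares from the full fit partitions exactly into a pure-error component and a lack-of-fit component. This can be checked directly on hypothetical data (illustrative values and an assumed straight-line model; standard library only):

```python
from statistics import mean

# Hypothetical replicated data and an assumed straight-line model (p = 2)
data = {1.0: [2.1, 2.3, 2.2], 2.0: [3.9, 4.1], 3.0: [6.0, 5.8, 6.1]}
p = 2

n = sum(len(ys) for ys in data.values())
n_u = len(data)
xs = [x for x, ys in data.items() for _ in ys]
ys_all = [y for ys in data.values() for y in ys]
x_bar, y_bar = mean(xs), mean(ys_all)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys_all)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# Residual mean square (squared residual standard deviation) from the fit
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys_all))
sigma2_hat = sse / (n - p)

ss_pe = sum((y - mean(ys)) ** 2 for ys in data.values() for y in ys)
sigma2_r = ss_pe / (n - n_u)

# Direct definition of the lack-of-fit mean square
ss_lof = sum(len(ys) * (mean(ys) - (b0 + b1 * x)) ** 2
             for x, ys in data.items())
sigma2_m_direct = ss_lof / (n_u - p)

# Alternative formula, built from the partition SSE = SS_pe + SS_lof
sigma2_m_alt = ((n - p) * sigma2_hat - (n - n_u) * sigma2_r) / (n_u - p)

print(abs(sigma2_m_direct - sigma2_m_alt))   # agrees up to rounding error
```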