| Statistical Tests Can Augment Ambiguous Residual Plots | Although the residual plots discussed on pages 4.4.4.1 and 4.4.4.3 will often indicate whether any important variables are missing or misspecified in the functional part of the model, a statistical test of the hypothesis that the model is sufficient may be helpful if the plots leave any doubt. Although it may seem tempting to use this type of statistical test in place of residual plots since it apparently assesses the fit of the model objectively, no single test can provide the rich feedback to the user that a graphical analysis of the residuals can provide. Furthermore, while model completeness is one of the most important aspects of model adequacy, this type of test does not address other important aspects of model quality. In statistical jargon, this type of test for model adequacy is usually called a "lack-of-fit" test. | ||
| General Strategy | The most common strategy used to test for model adequacy is to compare the amount of random variation in the residuals from the data used to fit the model with an estimate of the random variation in the process using data that are independent of the model. If these two estimates of the random variation are similar, that indicates that no significant terms are likely to be missing from the model. If the model-dependent estimate of the random variation is larger than the model-independent estimate, then significant terms probably are missing or misspecified in the functional part of the model. | ||
| Testing Model Adequacy Requires Replicate Measurements | The need for a model-independent estimate of the random variation means that replicate measurements made under identical experimental conditions are required to carry out a lack-of-fit test. If no replicate measurements are available, then there will not be any baseline estimate of the random process variation to compare with the results from the model. This is the main reason that the use of replication is emphasized in experimental design. | ||
| Data Used to Fit Model Can Be Partitioned to Compute Lack-of-Fit Statistic |
Although it might seem like two sets of data would be needed to carry out the lack-of-fit test using the
strategy described above, one set of data to fit the model and compute the residual
standard deviation and the other to compute the model-independent estimate of the random variation, that
is usually not necessary. In most regression applications, the same data used to fit the model can also be
used to carry out the lack-of-fit test, as long as the necessary replicate measurements are available.
In these cases, the lack-of-fit statistic is computed by partitioning the residual standard deviation into
two independent estimators of the random variation in the process. One estimator depends on the model and the
sample means of the replicated sets of data ($\hat{\sigma}_m$), while the other estimator is a
pooled standard deviation based on the variation observed in each set of replicated measurements
($\hat{\sigma}_r$). The squares of these two estimators of the random variation are often
called the "mean square for lack-of-fit" and the "mean square for pure error," respectively, in statistics
texts. The notation $\hat{\sigma}_m$ and $\hat{\sigma}_r$ is used here instead
to emphasize the fact that, if the model fits the data, these quantities should both be good estimators of
$\sigma$.
|
||
Estimating $\sigma$ Using Replicate Measurements
|
The model-independent estimator of $\sigma$ is computed using the formula
$$\hat{\sigma}_r = \sqrt{\frac{1}{n - n_u} \sum_{i=1}^{n_u} \sum_{j=1}^{n_i} \left[ y_{ij} - \bar{y}_i \right]^2}$$
where $n$ denotes the sample size of the data set used to fit the model,
$n_u$ is the number of unique combinations of predictor variable levels,
$n_i$ is the number of replicated observations at the $i$th combination of
predictor variable levels, the $y_{ij}$ are the regression responses
indexed by their predictor variable levels and number of replicate measurements, and
$\bar{y}_i$ is the mean of the responses at the $i$th
combination of predictor variable levels. Notice that the formula for $\hat{\sigma}_r$
depends only on the data and not on the functional part of the model. This shows that
$\hat{\sigma}_r$ will be a good estimator of $\sigma$, regardless of whether
the model is a complete description of the process or not.
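
As a rough illustration of the pooled, model-independent estimate ($\hat{\sigma}_r$, the square root of the pooled within-group sum of squares divided by $n - n_u$), the sketch below uses only the Python standard library. The data values, the `groups` dictionary layout, and the function name `pure_error_sd` are hypothetical and chosen for illustration; they do not come from the handbook.

```python
from math import sqrt

# Hypothetical replicated data: predictor level -> responses observed at that level.
groups = {
    1.0: [1.1, 0.9, 1.0],
    2.0: [4.2, 3.8],
    3.0: [9.1, 8.9, 9.0],
    4.0: [16.3, 15.7],
}


def pure_error_sd(groups):
    """Pooled estimate of sigma that uses only the replicates, not the model."""
    n = sum(len(ys) for ys in groups.values())  # total sample size
    n_u = len(groups)                           # unique predictor combinations
    if n <= n_u:
        raise ValueError("replicates are required: n - n_u must be positive")
    ss_pure_error = 0.0
    for ys in groups.values():
        ybar_i = sum(ys) / len(ys)              # mean response at this level
        ss_pure_error += sum((y - ybar_i) ** 2 for y in ys)
    return sqrt(ss_pure_error / (n - n_u))


sigma_r = pure_error_sd(groups)
print(f"sigma_r = {sigma_r:.4f}")
```

Because no fitted function appears anywhere in `pure_error_sd`, the result is the same whichever model is later fit to these data.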
|
||
Estimating $\sigma$ Using the Model
|
Unlike the formula for $\hat{\sigma}_r$, the formula for $\hat{\sigma}_m$
$$\hat{\sigma}_m = \sqrt{\frac{1}{n_u - p} \sum_{i=1}^{n_u} n_i \left[ \bar{y}_i - f(\vec{x}_i; \hat{\vec{\beta}}) \right]^2}$$
(with $p$ denoting the number of unknown parameters in the model and $f(\vec{x}_i; \hat{\vec{\beta}})$ the fitted function evaluated at the $i$th combination of predictor variable levels) does
depend on the functional part of the model. If the model were correct, the value of the function would
be a good estimate of the mean value of the response for every combination of predictor variable values.
When the function provides good estimates of the mean response at the $i$th combination, then
$\hat{\sigma}_m$ should be close in value to $\hat{\sigma}_r$ and should also
be a good estimate of $\sigma$. If, on the other hand, the function is missing any
important terms (within the range of the data), or if any terms are misspecified, then the function will
provide a poor estimate of the mean response for some combinations of the predictors and
$\hat{\sigma}_m$ will tend to be greater than $\hat{\sigma}_r$.
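
The model-dependent estimate ($\hat{\sigma}_m$, built from the weighted squared differences between each group mean and the fitted function, divided by $n_u - p$) can be sketched in the same way. The example below assumes a straight-line model fit by ordinary least squares, so $p = 2$. The data, the helper names `fit_line` and `lack_of_fit_sd`, and the choice of a straight line are illustrative assumptions, not part of the handbook.

```python
from math import sqrt

# Hypothetical replicated data: predictor level -> responses observed at that level.
groups = {
    1.0: [1.1, 0.9, 1.0],
    2.0: [4.2, 3.8],
    3.0: [9.1, 8.9, 9.0],
    4.0: [16.3, 15.7],
}


def fit_line(groups):
    """Ordinary least-squares fit of y = b0 + b1 * x to every observation."""
    pairs = [(x, y) for x, ys in groups.items() for y in ys]
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def lack_of_fit_sd(groups, predict, p):
    """Estimate of sigma from group means versus the fitted function."""
    n_u = len(groups)
    if n_u <= p:
        raise ValueError("need more unique predictor levels than parameters")
    ss_lack_of_fit = 0.0
    for x, ys in groups.items():
        ybar_i = sum(ys) / len(ys)
        # Each group is weighted by its number of replicates n_i.
        ss_lack_of_fit += len(ys) * (ybar_i - predict(x)) ** 2
    return sqrt(ss_lack_of_fit / (n_u - p))


b0, b1 = fit_line(groups)
sigma_m = lack_of_fit_sd(groups, lambda x: b0 + b1 * x, p=2)
print(f"sigma_m = {sigma_m:.4f}")
```

The hypothetical responses were made to curve upward, so a straight line misses the group means and `sigma_m` comes out much larger than the replicate scatter would suggest.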
|
||
| Carrying Out the Test for Lack-of-Fit |
Combining the ideas presented in the previous two paragraphs, following the general strategy outlined
above, the adequacy of the functional part of the model
can be assessed by comparing the values of $\hat{\sigma}_m$ and
$\hat{\sigma}_r$. If $\hat{\sigma}_m > \hat{\sigma}_r$, then one or more
important terms may be missing or misspecified in the functional part of the model. Because of the random
error in the data, however, we know that $\hat{\sigma}_m$ will sometimes be
larger than $\hat{\sigma}_r$ even when the model is adequate. To make sure that
the hypothesis that the model is adequate is not rejected by chance, it is necessary to understand how much
greater than $\hat{\sigma}_r$ the value of $\hat{\sigma}_m$
might typically be when the model does fit the data. Then the
hypothesis can be rejected only when $\hat{\sigma}_m$ is
significantly greater than $\hat{\sigma}_r$.
|
||
When the model does fit the data, it turns out that the ratio
$$L = \frac{\hat{\sigma}_m^2}{\hat{\sigma}_r^2}$$
follows an F distribution. Knowing the probability distribution that describes the behavior of the statistic, $L$, we can control the probability of
rejecting the hypothesis that the model is adequate in cases when the model actually is adequate. Rejecting
the hypothesis that the model is adequate only when $L$ is greater than an upper-tail
cut-off value from the F distribution with a user-specified probability of wrongly rejecting the hypothesis
gives us a precise, objective, probabilistic definition of when $\hat{\sigma}_m$
is significantly greater than $\hat{\sigma}_r$. The user-specified probability
used to obtain the cut-off value from the F distribution is called the "significance level" of the test. The
significance level for most statistical tests is denoted by $\alpha$. The most commonly
used value for the significance level is $\alpha = 0.05$, which means that the hypothesis of
an adequate model will only be rejected in 5% of tests for which the model really is adequate. Cut-off values
can be computed using most statistical software or from tables
of the F distribution. In addition to needing the significance level to obtain the cut-off value, the F
distribution is indexed by the degrees of freedom associated with each of the two estimators of
$\sigma$. $\hat{\sigma}_m$, which appears in the numerator of
$L$, has $n_u - p$ degrees of freedom.
$\hat{\sigma}_r$, which appears in the denominator of $L$, has
$n - n_u$ degrees of freedom.
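
The test can be sketched end to end as follows: form $L$ as the ratio of the lack-of-fit mean square to the pure-error mean square, then compare it with an upper-tail F cut-off. This is a minimal sketch with hypothetical data and a hypothetical function name (`lack_of_fit_test`). It assumes a straight-line model ($p = 2$). The cut-off is passed in rather than computed: 5.14 is the tabulated upper 5% point of the F distribution with 2 and 6 degrees of freedom, which are the degrees of freedom for these particular data.

```python
# Hypothetical replicated data: predictor level -> responses observed at that level.
groups = {
    1.0: [1.1, 0.9, 1.0],
    2.0: [4.2, 3.8],
    3.0: [9.1, 8.9, 9.0],
    4.0: [16.3, 15.7],
}


def fit_line(groups):
    """Ordinary least-squares fit of y = b0 + b1 * x to every observation."""
    pairs = [(x, y) for x, ys in groups.items() for y in ys]
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in pairs)
          / sum((x - xbar) ** 2 for x, _ in pairs))
    return ybar - b1 * xbar, b1


def lack_of_fit_test(groups, predict, p, cutoff):
    """Return (L, numerator df, denominator df, reject?) for the given cut-off."""
    n = sum(len(ys) for ys in groups.values())
    n_u = len(groups)
    df_m, df_r = n_u - p, n - n_u
    if df_m <= 0 or df_r <= 0:
        raise ValueError("need n_u > p and replicates (n > n_u)")
    ss_pe = ss_lof = 0.0
    for x, ys in groups.items():
        ybar_i = sum(ys) / len(ys)
        ss_pe += sum((y - ybar_i) ** 2 for y in ys)         # pure error
        ss_lof += len(ys) * (ybar_i - predict(x)) ** 2      # lack of fit
    if ss_pe == 0.0:
        raise ValueError("replicates show no variation; L is undefined")
    L = (ss_lof / df_m) / (ss_pe / df_r)
    return L, df_m, df_r, L > cutoff


b0, b1 = fit_line(groups)
# Cut-off: tabulated upper 5% point of F(2, 6); valid only for these df.
L, df_m, df_r, reject = lack_of_fit_test(groups, lambda x: b0 + b1 * x,
                                         p=2, cutoff=5.14)
print(f"L = {L:.2f} on ({df_m}, {df_r}) df; reject adequacy: {reject}")
```

For other data sets the degrees of freedom change, so the cut-off must be looked up again or computed with statistical software.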
|
|||
Alternative Formula for $\hat{\sigma}_m$
|
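Although the formula given above more clearly shows the nature of $\hat{\sigma}_m$, the
numerically equivalent formula below is easier to use in computations
$$\hat{\sigma}_m = \sqrt{\frac{(n-p)\hat{\sigma}^2 - (n-n_u)\hat{\sigma}_r^2}{n_u - p}}$$
where $\hat{\sigma}$ is the residual standard deviation from the fit of the model. |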
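
The equivalence rests on the fact that the residual sum of squares, $(n-p)\hat{\sigma}^2$, splits into a pure-error part, $(n-n_u)\hat{\sigma}_r^2$, and a lack-of-fit part, $(n_u-p)\hat{\sigma}_m^2$. The sketch below checks this numerically on hypothetical data with an assumed straight-line fit ($p = 2$). All names in it are illustrative.

```python
from math import sqrt

# Hypothetical replicated data: predictor level -> responses observed at that level.
groups = {
    1.0: [1.1, 0.9, 1.0],
    2.0: [4.2, 3.8],
    3.0: [9.1, 8.9, 9.0],
    4.0: [16.3, 15.7],
}
p = 2  # parameters in the assumed straight-line model

# Ordinary least-squares straight line through all observations.
pairs = [(x, y) for x, ys in groups.items() for y in ys]
n, n_u = len(pairs), len(groups)
xbar = sum(x for x, _ in pairs) / n
ybar = sum(y for _, y in pairs) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in pairs)
      / sum((x - xbar) ** 2 for x, _ in pairs))
b0 = ybar - b1 * xbar


def predict(x):
    return b0 + b1 * x


# Squared residual standard deviation from the fit.
sigma_hat_sq = sum((y - predict(x)) ** 2 for x, y in pairs) / (n - p)

# Squared pure-error estimate from the replicates alone.
sigma_r_sq = sum(
    sum((y - sum(ys) / len(ys)) ** 2 for y in ys) for ys in groups.values()
) / (n - n_u)

# Direct formula: weighted squared gaps between group means and the fit.
sigma_m_direct = sqrt(
    sum(len(ys) * (sum(ys) / len(ys) - predict(x)) ** 2
        for x, ys in groups.items()) / (n_u - p)
)

# Alternative formula: subtract the pure-error part from the residual part.
sigma_m_alt = sqrt(((n - p) * sigma_hat_sq - (n - n_u) * sigma_r_sq) / (n_u - p))

print(f"direct = {sigma_m_direct:.6f}, alternative = {sigma_m_alt:.6f}")
```

The alternative form needs only quantities that regression output usually reports. When the lack of fit is close to zero, though, rounding error in the subtraction can push the difference slightly negative, which a production version would need to guard against.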
||