|
4. Process Modeling
4.4. Data Analysis for Process Modeling
|
|||
$R^2$ Is Not Enough!
|
Model validation is possibly the most important step in the model building
sequence. It is also one of the most overlooked. Often the validation of
a model seems to consist of nothing more than quoting the $R^2$
statistic from the fit (which measures the fraction
of the total variability in the response that is accounted for by the model).
Unfortunately, a high $R^2$ value does not guarantee
that the model fits the data well. A model that does not fit the
data well cannot provide good answers to the underlying engineering or
scientific questions under investigation.
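As a minimal sketch of this point, the example below fits a straight line to made-up data that actually follow a curve. The data, the helper names `fit_line` and `r_squared`, and the choice of a quadratic are all assumptions for illustration and are not taken from the handbook. The $R^2$ value comes out at about 0.93, yet the residuals are positive at both ends and negative in the middle, which is exactly the kind of structure a single summary number hides.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def r_squared(y, fitted):
    """Fraction of the total variability in y accounted for by the fit."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data: the true relationship is curved, not linear.
x = list(range(11))
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
r2 = r_squared(y, fitted)

print(f"R^2 = {r2:.3f}")
print("residuals:", [round(e, 1) for e in residuals])
```

The residuals trace a clear U shape even though the straight line accounts for most of the variability, so the wrong functional form would go unnoticed if only $R^2$ were reported.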
|
||
| Main Tool: Graphical Residual Analysis |
There are many statistical tools for model validation, but the primary tool
for most process modeling applications is graphical residual analysis.
Different types of plots of the residuals (see definition
below) from a fitted
model provide information on the adequacy of different aspects of the model.
Numerical methods for model validation, such as the
$R^2$ statistic, are also useful, but usually to a lesser degree than graphical
methods. Graphical methods have an advantage over numerical methods for
model validation because they readily illustrate a broad range of complex
aspects of the relationship between the model and the data. Numerical methods
for model validation tend to be narrowly focused on a particular aspect of the
relationship between the model and the data and often try to compress that
information into a single descriptive number or test result.
|
||
| Numerical Methods' Forte | Numerical methods do play an important role as confirmatory methods for graphical techniques, however. For example, the lack-of-fit test for assessing the correctness of the functional part of the model can aid in interpreting a borderline residual plot. There are also a few modeling situations in which graphical methods cannot easily be used. In these cases, numerical methods provide a fallback position for model validation. One common situation when numerical validation methods take precedence over graphical methods is when the number of parameters being estimated is relatively close to the size of the data set. In this situation residual plots are often difficult to interpret due to constraints on the residuals imposed by the estimation of the unknown parameters. One area in which this typically happens is in optimization applications using designed experiments. Logistic regression with binary data is another area in which graphical residual analysis can be difficult. | ||
| Residuals |
The residuals from a fitted model are the differences between the responses
observed at each combination of values of the explanatory variables and the corresponding
prediction of the response computed using the regression function.
Mathematically, the definition of the residual for the $i$th
observation in the data set is written
$$ e_i = y_i - f(\vec{x}_i;\hat{\vec{\beta}}), $$
with $y_i$ denoting the $i$th
response in the data set, $\vec{x}_i$
representing the list of explanatory variables, each set at the corresponding
values found in the $i$th observation in the data set, and
$f(\vec{x}_i;\hat{\vec{\beta}})$ denoting the regression function evaluated at the estimated parameters.
|
||
| Example |
The data listed below are from the
Pressure/Temperature example introduced
in Section 4.1.1. The first column
shows the order in which the observations were made, the second column indicates the
day on which each observation was made, and the third column gives the ambient
temperature recorded when each measurement was made. The fourth column lists the
temperature of the gas itself (the explanatory variable) and the fifth column
contains the observed pressure of the gas (the response variable). Finally, the sixth
column gives the corresponding values from the fitted straight-line regression function,
and the last column lists the residuals, the difference between columns five and six. |
||
| Data, Fitted Values & Residuals |
  Run           Ambient                          Fitted
Order  Day  Temperature  Temperature  Pressure    Value  Residual
    1    1       23.820       54.749   225.066  222.920     2.146
    2    1       24.120       23.323   100.331   99.411     0.920
    3    1       23.434       58.775   230.863  238.744    -7.881
    4    1       23.993       25.854   106.160  109.359    -3.199
    5    1       23.375       68.297   277.502  276.165     1.336
    6    1       23.233       37.481   148.314  155.056    -6.741
    7    1       24.162       49.542   197.562  202.456    -4.895
    8    1       23.667       34.101   138.537  141.770    -3.232
    9    1       24.056       33.901   137.969  140.983    -3.014
   10    1       22.786       29.242   117.410  122.674    -5.263
   11    2       23.785       39.506   164.442  163.013     1.429
   12    2       22.987       43.004   181.044  176.759     4.285
   13    2       23.799       53.226   222.179  216.933     5.246
   14    2       23.661       54.467   227.010  221.813     5.198
   15    2       23.852       57.549   232.496  233.925    -1.429
   16    2       23.379       61.204   253.557  248.288     5.269
   17    2       24.146       31.489   139.894  131.506     8.388
   18    2       24.187       68.476   273.931  276.871    -2.940
   19    2       24.159       51.144   207.969  208.753    -0.784
   20    2       23.803       68.774   280.205  278.040     2.165
   21    3       24.381       55.350   227.060  225.282     1.779
   22    3       24.027       44.692   180.605  183.396    -2.791
   23    3       24.342       50.995   206.229  208.167    -1.938
   24    3       23.670       21.602    91.464   92.649    -1.186
   25    3       24.246       54.673   223.869  222.622     1.247
   26    3       25.082       41.449   172.910  170.651     2.259
   27    3       24.575       35.451   152.073  147.075     4.998
   28    3       23.803       42.989   169.427  176.703    -7.276
   29    3       24.660       48.599   192.561  198.748    -6.188
   30    3       24.097       21.448    94.448   92.042     2.406
   31    4       22.816       56.982   222.794  231.697    -8.902
   32    4       24.167       47.901   199.003  196.008     2.996
   33    4       22.712       40.285   168.668  166.077     2.592
   34    4       23.611       25.609   109.387  108.397     0.990
   35    4       23.354       22.971    98.445   98.029     0.416
   36    4       23.669       25.838   110.987  109.295     1.692
   37    4       23.965       49.127   202.662  200.826     1.835
   38    4       22.917       54.936   224.773  223.653     1.120
   39    4       23.546       50.917   216.058  207.859     8.199
   40    4       24.450       41.976   171.469  172.720    -1.251
|
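The residual column can be reproduced directly from the pressure and fitted-value columns. The sketch below does this for the first five runs only; the tuple layout and variable names are illustrative choices, while the numbers are copied from the data listing. Because the listed values are rounded to three decimals, a recomputed residual may differ from the listed one by one unit in the last digit.

```python
# (temperature, pressure, fitted value, listed residual) for runs 1-5,
# copied from the data listing above.
rows = [
    (54.749, 225.066, 222.920, 2.146),
    (23.323, 100.331, 99.411, 0.920),
    (58.775, 230.863, 238.744, -7.881),
    (25.854, 106.160, 109.359, -3.199),
    (68.297, 277.502, 276.165, 1.336),
]

# Residual = observed response minus fitted value.
computed = [round(pressure - fitted, 3) for _, pressure, fitted, _ in rows]

for (temp, pressure, fitted, listed), e in zip(rows, computed):
    print(f"T={temp:7.3f}  observed={pressure:8.3f}  fitted={fitted:8.3f}  "
          f"residual={e:7.3f}  (listed {listed:7.3f})")

# Largest disagreement between recomputed and listed residuals.
max_gap = max(abs(e - listed) for (_, _, _, listed), e in zip(rows, computed))
```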
||
| Why Use Residuals? | If the model fit to the data were correct, the residuals would approximate the random errors that make the relationship between the explanatory variables and the response variable a statistical relationship. Therefore, if the residuals appear to behave randomly, it suggests that the model fits the data well. On the other hand, if non-random structure is evident in the residuals, it is a clear sign that the model fits the data poorly. The subsections listed below detail the types of plots to use to test different aspects of a model and give guidance on the correct interpretations of different results that could be observed for each type of plot. | ||
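One crude numerical companion to a residual plot is to count runs of same-signed residuals after ordering them by the explanatory variable. This sign-run count is a swapped-in illustration that this section does not itself describe, and the residual values and function names below are hypothetical. Far fewer runs than expected under a random ordering hints at the kind of non-random structure described above; it is a sketch to accompany a plot, not a substitute for one.

```python
def count_sign_runs(residuals):
    """Count maximal runs of consecutive residuals with the same sign.

    Residuals equal to zero are skipped.
    """
    signs = [e > 0 for e in residuals if e != 0]
    if not signs:
        return 0
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def expected_runs(residuals):
    """Expected number of sign runs if the signs were randomly ordered.

    Assumes at least one nonzero residual.
    """
    n_pos = sum(1 for e in residuals if e > 0)
    n_neg = sum(1 for e in residuals if e < 0)
    return 2 * n_pos * n_neg / (n_pos + n_neg) + 1


# Hypothetical residuals, already ordered by the explanatory variable.
curved = [15, 6, -1, -6, -9, -10, -9, -6, -1, 6, 15]   # systematic U shape
scattered = [2, 1, -3, -1, 4, -2, 3, 1, -4, -1, 2]     # no obvious pattern

for name, res in [("curved", curved), ("scattered", scattered)]:
    print(f"{name}: {count_sign_runs(res)} runs observed, "
          f"{expected_runs(res):.1f} expected under random ordering")
```

Here the U-shaped residuals produce 3 runs against roughly 6 expected, while the scattered residuals produce 7 runs, close to their expected count.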
| Model Validation Specifics |
|
||