4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.4. How can I tell if a model fits my data?


Unnecessary Terms in the Model Affect Inferences  Models that are generally correct in form, but that include extra, unnecessary terms are said to "overfit" the data. The term overfitting is used to describe this problem because the extra terms in the model make it more flexible than it should be, allowing it to fit some of the random variation in the data as if it were deterministic structure. Because the parameters for any unnecessary terms in the model usually have estimated values near zero, it may seem as though leaving them in the model would not hurt anything. It is true, actually, that having one or two extra terms in the model does not usually have much negative impact. However, if enough extra terms are left in the model, the consequences can be serious. Among other things, including unnecessary terms in the model can cause the uncertainties estimated from the data to be larger than necessary, potentially impacting scientific or engineering conclusions to be drawn from the analysis of the data.  
Empirical and Local Models Most Prone to Overfitting the Data  Overfitting is especially likely to occur when developing purely empirical models for processes when there is no external understanding of how much of the total variation in the data might be systematic and how much is random. It also happens more frequently when using regression methods that fit the data locally instead of using an explicitly specified function to describe the structure in the data. Explicit functions are usually relatively simple and have few terms. It is usually difficult to know how to specify an explicit function that fits the noise in the data, since noise will not typically display much structure. This is why overfitting is not usually a problem with these types of models. Local models, on the other hand, can easily be made to fit very complex patterns, allowing them to find apparent structure in process noise if care is not exercised.  
Statistical Tests for Overfitting  Just as statistical tests can be used to check for significant missing or misspecified terms in the functional part of a model, they can also be used to determine if any unnecessary terms have been included. In fact, checking for overfitting of the data is one area in which statistical tests are more effective than residual plots. To test for overfitting, however, individual tests of the importance of each parameter in the model are used, rather than a single overall test as is done when testing for terms that are missing or misspecified in the model.  
Tests of Individual Parameters  Most output from regression software also includes individual statistical tests that compare the hypothesis that each parameter is equal to zero with the alternative that it is not zero. These tests are convenient because they are automatically included in most computer output, do not require replicate measurements, and give specific information about each parameter in the model. However, if the different predictor variables included in the model have values that are correlated, these tests can also be quite difficult to interpret. This is because these tests are actually testing whether or not each parameter is zero given that all of the other predictors are included in the model.  
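The effect of correlated predictors on these tests can be illustrated numerically. The sketch below is not part of the original example: it assumes a linear model with an intercept and two predictors, for which a standard result says that the standard deviation of either slope estimate is multiplied by \(1/\sqrt{1-r^2}\), where \(r\) is the correlation between the two predictors. The predictor values `x1`, `wiggle`, and `x2_similar` are made up for illustration.

```python
import math


def correlation(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def std_dev_inflation(x1, x2):
    """Factor by which correlation between x1 and x2 multiplies the standard
    deviation of either slope estimate (two-predictor model with intercept),
    relative to uncorrelated predictors with the same spread."""
    r = correlation(x1, x2)
    return 1.0 / math.sqrt(1.0 - r * r)


# Hypothetical predictor values, chosen only for illustration.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
wiggle = [1, -1, -1, 1, 1, -1, -1, 1]               # uncorrelated with x1
x2_similar = [a + 0.1 * w for a, w in zip(x1, wiggle)]  # nearly a copy of x1

print(f"uncorrelated predictors: factor = {std_dev_inflation(x1, wiggle):.2f}")
print(f"nearly identical predictors: factor = {std_dev_inflation(x1, x2_similar):.2f}")
```

Because each t statistic is an estimate divided by its standard deviation, a large inflation factor can make both slopes look insignificant individually even when the pair of predictors clearly matters, which is one reason these tests are hard to interpret when predictors are correlated.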
Test Statistics Based on Student's t Distribution  The test statistics for testing whether or not each parameter is zero are typically based on Student's t distribution. Each parameter estimate in the model is measured in terms of how many standard deviations it is from its hypothesized value of zero. If the parameter's estimated value is close enough to the hypothesized value that any deviation can be attributed to random error, the hypothesis that the parameter's true value is zero is not rejected. If, on the other hand, the parameter's estimated value is so far away from the hypothesized value that the deviation cannot be plausibly explained by random error, the hypothesis that the true value of the parameter is zero is rejected.  
Because the hypothesized value of each parameter is zero, the test statistic for each of these tests is simply the estimated parameter value divided by its estimated standard deviation, $$ T = \frac{\hat{\beta}_i - 0}{\hat{\sigma}_{\hat{\beta}_i}} = \frac{\hat{\beta}_i}{\hat{\sigma}_{\hat{\beta}_i}} $$ which provides a measure of the distance between the estimated and hypothesized values of the parameter in standard deviations. Based on the assumptions that the random errors are normally distributed and the true value of the parameter is zero (as we have hypothesized), the test statistic has a Student's t distribution with \(n-p\) degrees of freedom, where \(n\) is the number of observations and \(p\) is the number of parameters in the model. Therefore, cutoff values for the t distribution can be used to determine how extreme the test statistic must be in order for each parameter estimate to be too far away from its hypothesized value for the deviation to be attributed to random error. Because these tests are generally used to simultaneously test whether or not a parameter value is greater than or less than zero, each test should use cutoff values that leave a probability of \(\alpha/2\) in each tail. This ensures that, when a parameter is truly zero, the hypothesis that it equals zero will be rejected by chance with probability \(\alpha\). Because of the symmetry of the t distribution, only one cutoff value, the upper or the lower one, needs to be determined, and the other will be its negative. Equivalently, many people simply compare the absolute value of the test statistic to the upper cutoff value.  
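The procedure above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method used by any particular regression package: the helper names (`t_pdf`, `t_cdf`, `t_critical`, `parameter_t_test`) are hypothetical, the cumulative probability is obtained by numerical integration of the t density, and the cutoff is found by bisection over an assumed search range of 0 to 1000.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry about zero.
    The number of steps must be even."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def t_critical(alpha, df):
    """Upper cutoff c such that P(|T| > c) = alpha, found by bisection."""
    target = 1 - alpha / 2          # alpha/2 is left in each tail
    lo, hi = 0.0, 1000.0            # assumed search range for the cutoff
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def parameter_t_test(estimate, std_dev, n, p, alpha=0.05):
    """Two-sided test of the hypothesis that a parameter equals zero.
    Returns (test statistic, upper cutoff, whether the hypothesis is rejected)."""
    t_stat = estimate / std_dev     # (estimate - 0) / std_dev
    cutoff = t_critical(alpha, n - p)
    return t_stat, cutoff, abs(t_stat) > cutoff


# Hypothetical inputs: an estimate of 1.5 with standard deviation 1.0,
# from a model with p = 3 parameters fit to n = 20 observations.
t_stat, cutoff, rejected = parameter_t_test(1.5, 1.0, n=20, p=3)
print(f"T = {t_stat:.3f}, cutoff = {cutoff:.3f}, reject: {rejected}")
```

Comparing the absolute value of the statistic with the single upper cutoff is equivalent to checking it against both the upper and lower cutoffs, because the t distribution is symmetric.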
Parameter Tests for the Pressure / Temperature Example  To illustrate the use of significance tests for each parameter in a model, the regression results for the Pressure/Temperature example are shown below. In this case a straight-line model was fit to the data, so the output includes tests of the significance of the intercept and slope. The estimates of the intercept and the slope are 7.74899 and 3.93014, respectively. Their estimated standard deviations are listed in the next column, followed by the test statistics used to determine whether or not each parameter is zero. The estimated residual standard deviation, \(\hat{\sigma}\), and its degrees of freedom are also listed.  
Parameter    Estimate    Std. Dev.    t Value
B0            7.74899       2.3538      3.292
B1            3.93014       0.0507     77.515

Residual standard deviation = 4.299098
Residual degrees of freedom = 38

Looking up the cutoff value from the tables of the t distribution using a significance level of \(\alpha = 0.05\) and 38 degrees of freedom yields a critical value of 2.024 (the critical value is obtained from the column labeled "0.975" since this is a two-sided test and \(1 - 0.05/2 = 0.975\)). Since both of the test statistics are larger in absolute value than the critical value of 2.024, the appropriate conclusion is that both the slope and intercept are significantly different from zero at the 5 % significance level (95 % confidence level). 
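The arithmetic behind this conclusion can be checked with a short script. The estimates, standard deviations, and critical value are copied from the output and text above; the variable and function names are chosen here for illustration. Because the printed standard deviations are rounded, the recomputed slope statistic may differ slightly from the tabulated 77.515 in its last digits.

```python
# Values copied from the regression output for the Pressure/Temperature example.
results = {
    "B0": {"estimate": 7.74899, "std_dev": 2.3538},
    "B1": {"estimate": 3.93014, "std_dev": 0.0507},
}
critical_value = 2.024  # t table, alpha = 0.05 (two-sided), 38 degrees of freedom


def t_value(estimate, std_dev):
    """Test statistic for the hypothesis that the parameter equals zero."""
    return estimate / std_dev


for name, row in results.items():
    t = t_value(row["estimate"], row["std_dev"])
    verdict = "reject" if abs(t) > critical_value else "do not reject"
    print(f"{name}: t = {t:.3f}; {verdict} the hypothesis that {name} = 0")
```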