4. Process Modeling
4.4. Data Analysis for Process Modeling
4.4.5. If my current model does not fit the data well, how can I improve it?


Basic Approach: Transformation
Unlike when correcting for nonconstant variation in the random errors, there is really only one basic approach to handling data with nonnormal random errors for most regression methods. This is because most methods rely on the assumption of normality and the use of linear estimation methods (like least squares) to make probabilistic inferences to answer scientific or engineering questions. For methods that rely on normality of the data, transforming the data directly so that the random errors become approximately normal is usually the best way to bring the data in line with this assumption.

The main alternative to transformation is to use a fitting criterion that directly takes the distribution of the random errors into account when estimating the unknown parameters. Fitting criteria of this type, such as maximum likelihood, can provide very good results. However, they are often much harder to use than the general fitting criteria used in most process modeling methods.
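
To make the idea of a distribution-specific fitting criterion concrete, the following minimal sketch compares two maximum likelihood criteria for a straight-line model. Least squares is the maximum likelihood criterion when the errors are normal, and the sum of absolute residuals is the maximum likelihood criterion when the errors follow a Laplace (double exponential) distribution. The Laplace case, the toy data, the crude grid search, and all function names are assumptions chosen for illustration; they are not taken from this handbook section.

```python
def sum_of_squares(b0, b1, x, y):
    """Least squares criterion: maximum likelihood under normal errors."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


def sum_of_abs(b0, b1, x, y):
    """Least absolute deviations: maximum likelihood under Laplace errors."""
    return sum(abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y))


def grid_fit(criterion, x, y, b0_grid, b1_grid):
    """Return the (b0, b1) grid point that minimizes the given criterion."""
    candidates = ((b0, b1) for b0 in b0_grid for b1 in b1_grid)
    return min(candidates, key=lambda p: criterion(p[0], p[1], x, y))


# Hypothetical data: the exact line y = 2 + 3x with one large error at x = 5.
x = [float(i) for i in range(11)]
y = [2.0 + 3.0 * xi for xi in x]
y[5] += 15.0

b0_grid = [i / 10 for i in range(0, 51)]   # candidate intercepts 0.0 to 5.0
b1_grid = [i / 10 for i in range(20, 41)]  # candidate slopes 2.0 to 4.0

ls_fit = grid_fit(sum_of_squares, x, y, b0_grid, b1_grid)
lad_fit = grid_fit(sum_of_abs, x, y, b0_grid, b1_grid)
print("least squares fit:     ", ls_fit)
print("absolute deviation fit:", lad_fit)
```

On these toy data the two criteria give different intercepts, which shows that the choice of criterion encodes an assumption about the error distribution. A grid search is used here only to keep the sketch short; practical maximum likelihood fitting normally requires a numerical optimizer, which is part of why such criteria are harder to use.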
Using Transformations 
The basic steps for using transformations to handle data with nonnormally distributed random
errors are essentially the same as those used to handle nonconstant variation of the random
errors.


Typical Transformations for Meeting Distributional Assumptions 
Not surprisingly, three transformations that are often effective for making the distribution
of the random errors approximately normal are:

  1. the square root, sqrt(y),
  2. the natural logarithm, ln(y), and
  3. the inverse, 1/y.
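
The square root, natural logarithm, and inverse transformations, which are the three applied in the example that follows, can be sketched in Python as shown here. The dictionary and function names are hypothetical. Note that the square root needs nonnegative values and that the logarithm and inverse need strictly positive (or at least nonzero, for the inverse) values.

```python
import math

# The three basic transformations, keyed by a short label.
TRANSFORMS = {
    "sqrt": math.sqrt,
    "ln": math.log,
    "inverse": lambda v: 1.0 / v,
}


def transform(values, name):
    """Apply the named transformation to every value in a list."""
    f = TRANSFORMS[name]
    return [f(v) for v in values]


# Hypothetical pressure readings, used only to show the call.
pressure = [1.0, 4.0, 16.0]
print(transform(pressure, "sqrt"))
print(transform(pressure, "ln"))
print(transform(pressure, "inverse"))
```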


Example
To illustrate how to use transformations to change the distribution of the random errors, we will look at a modified version of the Pressure/Temperature example in which the errors are uniformly distributed. Comparing the results obtained from fitting the data in their original units and under different transformations will directly illustrate the effects of the transformations on the distribution of the random errors.
Modified Pressure/Temperature Data with Uniform Random Errors  
Fit of Model to the Untransformed Data
A four-plot of the residuals obtained after fitting a straight-line model to the Pressure/Temperature data with uniformly distributed random errors is shown below. The histogram and normal probability plot on the bottom row of the four-plot are the most useful plots for assessing the distribution of the residuals. In this case the histogram suggests that the distribution is more rectangular than bell-shaped, indicating that the random errors are not likely to be normally distributed. The curvature in the normal probability plot also suggests that the random errors are not normally distributed. If the random errors were normally distributed, the normal probability plot should be a fairly straight line. Of course it would not be perfectly straight, but smooth curvature or several points lying far from the line are fairly strong indicators of nonnormality.
Residuals from Straight-Line Model of Untransformed Data with Uniform Random Errors
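
As a numerical companion to the normal probability plot, the sketch below fits a straight line by least squares and computes the correlation between the ordered residuals and approximate normal quantiles (a probability plot correlation coefficient). Values close to 1 correspond to a nearly straight plot. The simulated temperatures, coefficients, error range, and function names are all hypothetical, since the actual data set is not reproduced here, and Blom's plotting positions are an assumed choice.

```python
import math
import random
from statistics import NormalDist


def fit_line(x, y):
    """Ordinary least squares estimates (intercept, slope) for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def line_residuals(x, y):
    """Residuals from the least squares straight-line fit."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def normal_ppcc(values):
    """Correlation of ordered values with normal quantiles (Blom positions)."""
    n = len(values)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    return pearson(sorted(values), quantiles)


# Hypothetical straight-line data with uniformly distributed errors.
rng = random.Random(0)
temperature = [20.0 + 0.5 * i for i in range(100)]
pressure = [8.0 + 4.0 * t + rng.uniform(-8.0, 8.0) for t in temperature]

resid = line_residuals(temperature, pressure)
ppcc = normal_ppcc(resid)
print("normal probability plot correlation:", round(ppcc, 4))
```

A single coefficient cannot replace the plots, because it does not show whether a departure from the line is smooth curvature or a few outlying points, but it gives a quick way to compare several fits.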
Selection of Appropriate Transformations
Following a set of steps similar to those used to find transformations that stabilize the random variation, we choose pairs of transformations of the response and predictor that have a simple functional form and may have more normally distributed residuals. In the multiplots below, all of the possible combinations of basic transformations are applied to the temperature and pressure to find the pairs that have simple functional forms. In this case, which is typical, the data with square root-square root, ln-ln, and inverse-inverse transformations all appear to follow a straight-line model. The next step is to fit lines to each of these sets of data and then to compare the residual plots to see whether any have random errors that appear to be normally distributed.
sqrt(Pressure) vs Different Transformations of Temperature
log(Pressure) vs Different Transformations of Temperature
1/Pressure vs Different Transformations of Temperature
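
The screening of transformation pairs can also be roughed out numerically. The sketch below applies every combination of the basic transformations to a response and a predictor and records the absolute correlation of each pair as a crude stand-in for judging straightness by eye. Using correlation for this purpose is an assumption of this sketch, not the handbook's procedure, and the noise-free power-law data and all names are hypothetical.

```python
import math
from itertools import product

TRANSFORMS = {
    "none": lambda v: v,
    "sqrt": math.sqrt,
    "ln": math.log,
    "inverse": lambda v: 1.0 / v,
}


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def screen_pairs(x, y):
    """Map (response transform, predictor transform) to |correlation|."""
    scores = {}
    for (y_name, fy), (x_name, fx) in product(TRANSFORMS.items(), repeat=2):
        tx = [fx(v) for v in x]
        ty = [fy(v) for v in y]
        scores[(y_name, x_name)] = abs(pearson(tx, ty))
    return scores


# Hypothetical noise-free power law, so only the ln-ln pair is exactly linear.
x = [float(i) for i in range(1, 21)]
y = [3.0 * xi ** 1.5 for xi in x]

scores = screen_pairs(x, y)
best_pair = max(scores, key=scores.get)
for pair, r in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(pair, round(r, 6))
```

A high correlation does not by itself guarantee a straight-line relationship, so candidate pairs found this way should still be checked against the plots and the residuals.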
Fit of Model to Transformed Variables
The normal probability plots and histograms below show the results of fitting straight-line models to the three sets of transformed data. The results from the fit of the model to the data in their original units are also shown for comparison. From the four normal probability plots it looks like the model fit using the ln-ln transformations produces the most normally distributed random errors. Because the normal probability plot for the ln-ln data is so straight, it seems safe to conclude that taking the ln of the pressure makes the distribution of the random errors approximately normal. The histograms seem to confirm this, since the histogram of the ln-ln data looks reasonably bell-shaped while the other histograms are not particularly bell-shaped. Therefore, assuming the other residual plots also indicate that a straight-line model fits the transformed data, the use of ln-ln transformations appears to be appropriate for analysis of these data.
Residuals from the Fit to the Transformed Variables  
Residuals from the Fit to the Transformed Variables 
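
The comparison across the four fits can be sketched numerically by applying the same transformation to both variables, fitting a straight line, and ranking the fits by how closely the ordered residuals track normal quantiles. The simulated data below are hypothetical: they are generated so that the ln-ln scale has normal errors, which mimics the conclusion of the example but is not the handbook's data set. The function names and the use of Blom's plotting positions are likewise assumptions.

```python
import math
import random
from statistics import NormalDist

TRANSFORMS = {
    "none": lambda v: v,
    "sqrt": math.sqrt,
    "ln": math.log,
    "inverse": lambda v: 1.0 / v,
}


def line_residuals(x, y):
    """Residuals from the least squares fit of y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def normal_ppcc(values):
    """Correlation of ordered values with normal quantiles (Blom positions)."""
    n = len(values)
    std_normal = NormalDist()
    q = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    v = sorted(values)
    mv, mq = sum(v) / n, sum(q) / n
    svq = sum((a - mv) * (b - mq) for a, b in zip(v, q))
    svv = sum((a - mv) ** 2 for a in v)
    sqq = sum((b - mq) ** 2 for b in q)
    return svq / math.sqrt(svv * sqq)


def ppcc_by_transformation(x, y):
    """Apply each transformation to both variables, fit a line, score residuals."""
    return {name: normal_ppcc(line_residuals([f(v) for v in x],
                                             [f(v) for v in y]))
            for name, f in TRANSFORMS.items()}


# Hypothetical data: normal errors on the ln-ln scale.
rng = random.Random(1)
temperature = [20.0 + 0.5 * i for i in range(100)]
pressure = [math.exp(0.5 + 1.2 * math.log(t) + rng.gauss(0.0, 0.2))
            for t in temperature]

ppcc = ppcc_by_transformation(temperature, pressure)
ranking = sorted(ppcc, key=ppcc.get, reverse=True)
for name in ranking:
    print(name, round(ppcc[name], 4))
```

As with the plots, the ranking is only meaningful for transformation pairs under which a straight-line model actually fits the data.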