4. Process Modeling
4.5. Use and Interpretation of Process Models
4.5.1. What types of predictions can I make using the model?

## How do I estimate the average response for a particular set of predictor variable values?

**Step 1: Plug Predictors Into Estimated Function**

Once a model that gives a good description of the process has been developed, it can be used for estimation or prediction. To estimate the average response of the process, or, equivalently, the value of the regression function, for any particular combination of predictor variable values, the values of the predictor variables are simply substituted into the estimated regression function. These estimated function values are often called "predicted values" or "fitted values".
**Pressure / Temperature Example**

For example, in the Pressure/Temperature process, which is well described by a straight-line model relating pressure ($$y$$) to temperature ($$x$$), the estimated regression function is found to be $$\hat{y} = 7.74899 + 3.93014 \cdot x$$ by substituting the estimated parameter values into the functional part of the model. Then to estimate the average pressure at a temperature of 65, the predictor value of interest is substituted into the estimated regression function, yielding an estimated pressure of 263.21. $$\begin{array}{ccl} \hat{y} & = & 7.74899 + 3.93014 \cdot 65 \\ & = & 263.21 \end{array}$$ This estimation process works analogously for nonlinear models, LOESS models, and all other types of functional process models.
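Substituting a predictor value into the estimated straight-line function can be sketched in a few lines of code; the coefficients are the parameter estimates quoted above, and the function name is ours:

```python
# Estimated intercept and slope from the Pressure/Temperature fit.
b0, b1 = 7.74899, 3.93014

def predicted_pressure(temperature):
    """Evaluate the estimated regression function at a given temperature."""
    return b0 + b1 * temperature

print(round(predicted_pressure(65), 2))  # 263.21
```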
**Polymer Relaxation Example**

Based on the output from fitting the stretched exponential model in time ($$x_1$$) and temperature ($$x_2$$), the estimated regression function for the polymer relaxation data is $$\hat{y} = 4.99721 + 3.01998\exp\left[-\left(\frac{x_1}{3.06885+0.04187x_2+0.01441x_2^2}\right)^{1.16612}\right]$$ Therefore, the estimated torque ($$y$$) on a polymer sample after 60 minutes at a temperature of 40 is 5.26.
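The same substitution works for the nonlinear model: as a sketch, the estimated stretched exponential function can be coded directly from the parameter estimates above (the function name is ours, not from the original output):

```python
import math

# Estimated parameters of the stretched exponential model, substituted
# into the functional part of the model.
def predicted_torque(x1, x2):
    """Estimated torque after x1 minutes at temperature x2."""
    scale = 3.06885 + 0.04187 * x2 + 0.01441 * x2 ** 2
    return 4.99721 + 3.01998 * math.exp(-((x1 / scale) ** 1.16612))

print(round(predicted_torque(60, 40), 2))  # 5.26
```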
**Uncertainty Needed**

Knowing that the estimated average pressure is 263.21 at a temperature of 65, or that the estimated average torque on a polymer sample under particular conditions is 5.26, however, is not enough information to make scientific or engineering decisions about the process. This is because the pressure value of 263.21 is only an estimate of the average pressure at a temperature of 65. Because of the random error in the data, there is also random error in the estimated regression parameters, and in the values predicted using the model. To use the model correctly, therefore, the uncertainty in the prediction must also be quantified. For example, if the safe operational pressure of a particular type of gas tank that will be used at a temperature of 65 is 300, different engineering conclusions would be drawn from knowing the average actual pressure in the tank is likely to lie somewhere in the range $$263 \pm 52$$ versus lying in the range $$263.21 \pm 0.52$$.
**Confidence Intervals**

In order to provide the necessary information with which to make engineering or scientific decisions, predictions from process models are usually given as intervals of plausible values that have a probabilistic interpretation. In particular, intervals that specify a range of values that will contain the value of the regression function with a pre-specified probability are often used. These intervals are called confidence intervals. The probability with which the interval will capture the true value of the regression function is called the confidence level, and is most often set by the user to be 0.95, or 95 % in percentage terms. Any value between 0 % and 100 % could be specified, though it would almost never make sense to consider values outside a range of about 80 % to 99 %. The higher the confidence level is set, the more likely the true value of the regression function is to be contained in the interval. The trade-off for high confidence, however, is wide intervals. As the sample size is increased, the average width of the intervals typically decreases for any fixed confidence level.

The confidence level of an interval is usually denoted symbolically using the notation $$1-\alpha$$, with $$\alpha$$ denoting a user-specified probability, called the significance level, that the interval will not capture the true value of the regression function. The significance level is most often set to 5 % so that the associated confidence level will be 95 %.
**Computing Confidence Intervals**

Confidence intervals are computed using the estimated standard deviations of the estimated regression function values and a coverage factor that controls the confidence level of the interval and accounts for the variation in the estimate of the residual standard deviation.
The standard deviations of the predicted values of the estimated regression function depend on the standard deviation of the random errors in the data, the experimental design used to collect the data and fit the model, and the values of the predictor variables used to obtain the predicted values. These standard deviations are not simple quantities that can be read off the output summarizing the fit of the model, but they can often be obtained from the software used to fit the model. This is the best option, if available, because there are a variety of numerical issues that can arise when the standard deviations are calculated directly using typical theoretical formulas. Carefully written software should minimize the numerical problems encountered. If necessary, however, matrix formulas that can be used to directly compute these values are given in texts such as Neter, Wasserman, and Kutner.
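For the straight-line case, the general matrix formula reduces to a simple closed form, $$\hat{\sigma}_f(x_0) = \hat{\sigma}\sqrt{1/n + (x_0-\bar{x})^2/S_{xx}}$$, which can be sketched directly. The data below are hypothetical illustrations, not the handbook's measurements:

```python
import math

# Hypothetical straight-line data (for illustration only).
x = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
y = [86.3, 125.9, 165.2, 204.8, 244.1, 283.7]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

# Residual standard deviation, with n - 2 degrees of freedom.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

def sd_of_fit(x0):
    """Standard deviation of the estimated regression function at x0.

    Specializes the matrix formula sigma * sqrt(x0' (X'X)^-1 x0)
    for a straight-line model.
    """
    return sigma_hat * math.sqrt(1.0 / n + (x0 - xbar) ** 2 / sxx)

print(sd_of_fit(45.0))  # smallest near the center of the design
print(sd_of_fit(65.0))  # larger toward the edge of the design
```

Note that the standard deviation grows as the predictor value moves away from the center of the design, which matches the pattern in the tables further below.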
The coverage factor used to control the confidence level of the intervals depends on the distributional assumption about the errors and the amount of information available to estimate the residual standard deviation of the fit. For procedures that depend on the assumption that the random errors have a normal distribution, the coverage factor is typically a cut-off value from the Student's t distribution at the user's pre-specified confidence level and with the same number of degrees of freedom as used to estimate the residual standard deviation in the fit of the model. Tables of the t distribution (or functions in software) may be indexed by the confidence level ($$1-\alpha$$) or the significance level $$\alpha$$. It is also important to note that since these are two-sided intervals, half of the probability denoted by the significance level is usually assigned to each side of the interval, so the proper entry in a t table or in a software function may also be labeled with the value of $$\alpha/2$$, or $$1-\alpha/2$$, if the table or software is not exclusively designed for use with two-sided tests.
The estimated values of the regression function, their standard deviations, and the coverage factor are combined using the formula $$\hat{y} \pm t_{1-\alpha/2,\nu} \cdot \hat{\sigma}_f$$ with $$\hat{y}$$ denoting the estimated value of the regression function, $$t_{1-\alpha/2,\nu}$$ the coverage factor, indexed by a function of the significance level and by its degrees of freedom, and $$\hat{\sigma}_f$$ the standard deviation of $$\hat{y}$$. Some software may provide the total uncertainty for the confidence interval given by the equation above, or may provide the lower and upper confidence bounds by adding and subtracting the total uncertainty from the estimate of the average response. If available, this can save some computational effort when making predictions. Since there are many types of predictions that might be offered in a software package, however, it is a good idea to test the software on an example for which confidence limits are already available to make sure that the software is computing the expected type of intervals.
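Once the three ingredients are in hand, the interval itself is simple arithmetic. The sketch below uses the values for the pressure example at a temperature of 65; the coverage factor is hard-coded here, but could, for instance, be obtained from `scipy.stats.t.ppf(1 - alpha/2, nu)`:

```python
# Estimated pressure and its standard deviation at a temperature of 65.
y_hat = 263.2081
sigma_f = 1.2441620

# 95 % two-sided coverage factor from the t distribution with 38
# degrees of freedom (e.g. scipy.stats.t.ppf(0.975, 38)).
t_factor = 2.024394

half_width = t_factor * sigma_f
lower, upper = y_hat - half_width, y_hat + half_width
print(f"{lower:.1f}, {upper:.1f}")  # 260.7, 265.7
```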
**Confidence Intervals for the Example Applications**

Computing confidence intervals for the average pressure in the Pressure/Temperature example, for temperatures of 25, 45, and 65, and for the average torque on specimens from the polymer relaxation example at different times and temperatures gives the results listed in the tables below. Note: the number of significant digits shown in the tables below is larger than would normally be reported. However, as many significant digits as possible should be carried throughout all calculations, and results should only be rounded for final reporting. If reported numbers may be used in further calculations, they should not be rounded even when finally reported. A useful rule for rounding final results that will not be used for further computation is to round all of the reported values to one or two significant digits in the total uncertainty, $$t_{1-\alpha/2,\nu} \, \hat{\sigma}_f$$. This is the convention for rounding that has been used in the tables below.
**Pressure / Temperature Example**

| $$x$$ | $$\hat{y}$$ | $$\hat{\sigma}_f$$ | $$t_{1-\alpha/2,\nu}$$ | $$t_{1-\alpha/2,\nu} \, \hat{\sigma}_f$$ | Lower 95 % Confidence Bound | Upper 95 % Confidence Bound |
|---|---|---|---|---|---|---|
| 25 | 106.0025 | 1.1976162 | 2.024394 | 2.424447 | 103.6 | 108.4 |
| 45 | 184.6053 | 0.6803245 | 2.024394 | 1.377245 | 183.2 | 186.0 |
| 65 | 263.2081 | 1.2441620 | 2.024394 | 2.518674 | 260.7 | 265.7 |
**Polymer Relaxation Example**

| $$x_1$$ | $$x_2$$ | $$\hat{y}$$ | $$\hat{\sigma}_f$$ | $$t_{1-\alpha/2,\nu}$$ | $$t_{1-\alpha/2,\nu} \, \hat{\sigma}_f$$ | Lower 95 % Confidence Bound | Upper 95 % Confidence Bound |
|---|---|---|---|---|---|---|---|
| 20 | 25 | 5.586307 | 0.028402 | 2.000298 | 0.056812 | 5.529 | 5.643 |
| 80 | 25 | 4.998012 | 0.012171 | 2.000298 | 0.024346 | 4.974 | 5.022 |
| 20 | 50 | 6.960607 | 0.013711 | 2.000298 | 0.027427 | 6.933 | 6.988 |
| 80 | 50 | 5.342600 | 0.010077 | 2.000298 | 0.020158 | 5.322 | 5.363 |
| 20 | 75 | 7.521252 | 0.012054 | 2.000298 | 0.024112 | 7.497 | 7.545 |
| 80 | 75 | 6.220895 | 0.013307 | 2.000298 | 0.026618 | 6.194 | 6.248 |
**Interpretation of Confidence Intervals**

As mentioned above, confidence intervals capture the true value of the regression function with a user-specified probability, the confidence level, using the estimated regression function and the associated estimate of the error. Simulation of many sets of data from a process model provides a good way to obtain a detailed understanding of the probabilistic nature of these intervals. The advantage of using simulation is that the true model parameters are known, which is never the case for a real process. This allows direct comparison of how confidence intervals constructed from a limited amount of data relate to the true values that are being estimated.
The plot below shows 95 % confidence intervals computed using 50 independently generated data sets that follow the same model as the data in the Pressure/Temperature example. Random errors from a normal distribution with a mean of zero and a known standard deviation are added to each set of true temperatures and true pressures that lie on a perfect straight line to obtain the simulated data. Then each data set is used to compute a confidence interval for the average pressure at a temperature of 65. The dashed reference line marks the true value of the average pressure at a temperature of 65.
*Confidence Intervals Computed from 50 Sets of Simulated Data*
**Confidence Level Specifies Long-Run Interval Coverage**

From the plot it is easy to see that not all of the intervals contain the true value of the average pressure. Data sets 16, 26, and 39 all produced intervals that did not cover the true value of the average pressure at a temperature of 65. Sometimes the interval may fail to cover the true value because the estimated pressure is unusually high or low due to the random errors in the data set. In other cases, the variability in the data may be underestimated, leading to an interval that is too short to cover the true value. However, for 47 out of 50, or approximately 95 % of the data sets, the confidence intervals did cover the true average pressure. When the number of data sets was increased to 5000, confidence intervals computed for 4723, or 94.46 %, of the data sets covered the true average pressure. Finally, when the number of data sets was increased to 10000, 95.12 % of the computed confidence intervals covered the true average pressure.

Thus, the simulation shows that although any particular confidence interval might not cover its associated true value, in repeated experiments this method of constructing intervals produces intervals that cover the true value at the rate specified by the user as the confidence level. Unfortunately, when dealing with real processes with unknown parameters, it is impossible to know whether or not a particular confidence interval does contain the true value. It is reassuring, however, that the error rate can be controlled, and can be set so that it is far more likely than not that each interval produced does contain the true value.
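A coverage simulation along these lines can be sketched in a few lines of code. The true parameter values, error standard deviation, and design points below are assumptions chosen for illustration, not the values used to produce the plot above:

```python
import math
import random

random.seed(1)

# Assumed true model: y = 7.75 + 3.93 x, normal errors with sd 4.
b0_true, b1_true, sd_true = 7.75, 3.93, 4.0
x = list(range(10, 90, 2))   # 40 design points, reused for every data set
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
t38 = 2.024394               # 95 % two-sided t cutoff, nu = n - 2 = 38
x0 = 65.0
true_mean = b0_true + b1_true * x0

covered = 0
n_sets = 2000
for _ in range(n_sets):
    # Simulate one data set, fit the line by least squares.
    y = [b0_true + b1_true * xi + random.gauss(0, sd_true) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    # Residual variance and the standard deviation of the fit at x0.
    s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    sigma_f = math.sqrt(s2 * (1.0 / n + (x0 - xbar) ** 2 / sxx))
    # Does the 95 % interval at x0 cover the true average response?
    if abs((b0 + b1 * x0) - true_mean) <= t38 * sigma_f:
        covered += 1

print(covered / n_sets)  # close to 0.95
```

The printed proportion should land near 0.95, mirroring the long-run coverage behavior described above.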
**Interpretation Summary**

To summarize the interpretation of the probabilistic nature of confidence intervals in words: in independent, repeated experiments, $$100(1-\alpha) \, \%$$ of the intervals will cover the true values, given that the assumptions needed for the construction of the intervals hold.