4. Process Modeling
4.5. Use and Interpretation of Process Models
4.5.1. What types of predictions can I make using the model?
4.5.1.1. How do I estimate the average response for a particular set of predictor variable values?


Step 1: Plug Predictors Into Estimated Function

Once a model that gives a good description of the process has been developed, it can be used for
estimation or prediction. To estimate the average response of the process,
or, equivalently, the value of the regression function, for any particular combination of
predictor variable values, the values of the predictor variables are simply substituted in the
estimated regression function itself. These estimated function values are often called "predicted
values" or "fitted values".

Pressure / Temperature Example

For example, in the Pressure/Temperature process, which
is well described by a straight-line model relating pressure (\(y\))
to temperature (\(x\)),
the estimated regression function is found to be
$$ \hat{y} = 7.74899 + 3.93014 \cdot x $$
by substituting the estimated parameter values into
the functional part of the model. Then to estimate the average pressure at a temperature of 65,
the predictor value of interest is substituted in the estimated regression function, yielding an
estimated pressure of 263.21.
$$ \begin{array}{ccl}
\hat{y} & = & 7.74899 + 3.93014 \cdot 65 \\
        & = & 263.21
\end{array} $$
This estimation process works analogously for nonlinear models, LOESS models, and all other
types of functional process models.
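The substitution step can be sketched in a few lines of Python (a minimal sketch; the function name is hypothetical, and the parameter values are the ones from the fit above):

```python
# Estimated regression function for the Pressure/Temperature example:
# y-hat = 7.74899 + 3.93014 * x, with parameter values taken from the fit.

def predict_pressure(x):
    """Return the estimated average pressure at temperature x."""
    return 7.74899 + 3.93014 * x

print(round(predict_pressure(65), 2))  # 263.21
```

Evaluating the same function at other temperatures, such as 25 or 45, gives the other fitted values used later in this section.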

Polymer Relaxation Example

Based on the output from fitting the stretched exponential model in time (\(x_1\))
and temperature (\(x_2\)),
the estimated regression function for the
polymer relaxation data is
$$ \hat{y} = 4.99721 + 3.01998\exp\left[-\left(\frac{x_1}{3.06885+0.04187x_2+0.01441x_2^2}\right)^{1.16612}\right] $$
Therefore, the estimated torque (\(y\))
on a polymer sample after 60 minutes at a temperature of 40 is 5.26.
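The same substitution for the nonlinear model can be sketched as follows (the function name is hypothetical; the parameter values are taken from the fit reported above, and the exponent is negative so that the estimated torque decays toward the asymptote as time increases):

```python
import math

def predict_torque(x1, x2):
    """Estimated average torque from the stretched exponential fit,
    evaluated at time x1 and temperature x2."""
    denom = 3.06885 + 0.04187 * x2 + 0.01441 * x2 ** 2
    return 4.99721 + 3.01998 * math.exp(-(x1 / denom) ** 1.16612)

print(round(predict_torque(60, 40), 2))  # 5.26
```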

Uncertainty Needed

Knowing that the estimated average pressure is 263.21 at a temperature of 65, or that the estimated
average torque on a polymer sample under particular conditions is 5.26, however, is not enough
information to make scientific or engineering decisions about the process. This is because the
pressure value of 263.21 is only an estimate of the average pressure at a temperature of 65.
Because of the random error in the data, there is also random error in the estimated regression
parameters, and in the values predicted using the model. To use the model correctly, therefore,
the uncertainty in the prediction must also be quantified. For example, if the safe operational
pressure of a particular type of gas tank that will be used at a temperature of 65 is 300,
different engineering conclusions would be drawn from knowing the average actual pressure in the
tank is likely to lie somewhere in the range \(263 \pm 52\)
versus lying in the range \(263.21 \pm 0.52\).

Confidence Intervals

In order to provide the necessary information with which to make engineering or scientific
decisions, predictions from process models are usually given as intervals of plausible values that
have a probabilistic interpretation. In particular, intervals that specify a range of values that
will contain the value of the regression function with a prespecified probability are often used.
These intervals are called confidence intervals. The probability with which the interval will
capture the true value of the regression function is called the confidence level, and is most
often set by the user to be 0.95, or 95 % in percentage terms. Any value between 0 % and 100 % could
be specified, though it would almost never make sense to consider values outside a range of about
80 % to 99 %. The higher the confidence level is set, the more likely the true value of the
regression function is to be contained in the interval. The tradeoff for high confidence, however,
is wide intervals. As the sample size is increased, the average width of the intervals
typically decreases for any fixed confidence level. The confidence level of an interval is usually
denoted symbolically using the notation \(1-\alpha\),
with \(\alpha\)
denoting a user-specified probability, called the significance
level, that the interval will not capture the true value of the regression function. The
significance level is most often set to be 5 % so that the associated confidence level will be 95 %.

Computing Confidence Intervals

Confidence intervals are computed using the estimated standard deviations of the estimated regression function values and
a coverage factor that controls the confidence level of the interval and accounts for the variation in the estimate of the
residual standard deviation.


The standard deviations of the predicted values of the estimated regression function depend
on the standard deviation of the random errors in the data, the experimental design
used to collect the data and fit the model, and the values of the predictor variables used to
obtain the predicted values. These standard deviations are not simple quantities that
can be read off of the output summarizing the fit of the model, but they can often be obtained from
the software used to fit the model. This is the best option, if available, because there are a
variety of numerical issues that can arise when the standard deviations are calculated directly
using typical theoretical formulas. Carefully written software should minimize the numerical
problems encountered. If necessary, however, matrix formulas that can be used to directly compute
these values are given in texts such as Neter, Wasserman, and Kutner.


The coverage factor used to control the confidence level of the intervals depends on the
distributional assumption about the errors and the amount of information available to estimate
the residual standard deviation of the fit. For procedures that depend on the assumption that
the random errors have a normal distribution, the coverage factor is typically a cutoff value
from the Student's t distribution at the
user's prespecified
confidence level and with the same number of degrees of freedom as used to estimate the residual
standard deviation in the fit of the model. Tables of the t distribution (or functions in
software) may be indexed by the confidence level (\(1-\alpha\))
or the significance level \(\alpha\).
It is also important to note that since these
are two-sided intervals, half of the probability denoted by the significance level is usually
assigned to each side of the interval, so the proper entry in a t table or in a software
function may also be labeled with the value of \(\alpha/2\),
or \(1-\alpha/2\),
if the table or software is not exclusively designed for use with two-sided tests.


The estimated values of the regression function, their standard deviations, and the coverage
factor are combined using the formula
$$ \hat{y} \pm t_{1-\alpha/2,\nu} \cdot \hat{\sigma}_f $$
with \(\hat{y}\)
denoting the estimated value of the regression function, \(t_{1-\alpha/2,\nu}\)
denoting the coverage factor, indexed by a function of the
significance level and by its degrees of freedom, and \(\hat{\sigma}_f\)
denoting the standard deviation of \(\hat{y}\).
Some software may provide the
total uncertainty for the confidence interval given by the equation above, or may provide the
lower and upper confidence bounds by adding and subtracting the total uncertainty from the
estimate of the average response. When available, this can save some computational effort in
making predictions. Since there are many types of predictions that might be offered in a software package,
however, it is a good idea to test the software on an example for which confidence limits are
already available to make sure that the software is computing the expected type of intervals.
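The interval computation itself can be sketched directly using the tabulated values for the Pressure/Temperature example at a temperature of 65 (the 38 degrees of freedom are an assumption consistent with the tabulated coverage factor, not stated explicitly in the output):

```python
# Confidence interval y-hat +/- t * sigma-hat_f for the average pressure
# at a temperature of 65, using the fitted values from this example.

y_hat = 263.2081      # estimated average pressure at a temperature of 65
sigma_f = 1.2441620   # estimated standard deviation of y_hat
t = 2.024394          # t cutoff, 95 % confidence, 38 degrees of freedom (assumed)

half_width = t * sigma_f
lower, upper = y_hat - half_width, y_hat + half_width
print(round(lower, 1), round(upper, 1))  # 260.7 265.7
```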

Confidence Intervals for the Example Applications

Computing confidence intervals for the average pressure in the
Pressure/Temperature example, for temperatures of 25,
45, and 65, and for the average torque on specimens from the
polymer relaxation example at different times and
temperatures gives the results listed in the tables below. Note: the number of significant
digits shown in the tables below is larger than would normally be reported. However, as many
significant digits as possible should be carried throughout all calculations and results should
only be rounded for final reporting. If reported numbers may be used in further calculations,
they should not be rounded even when finally reported. A useful rule for rounding final results
that will not be used for further computation is to round all of the reported values to one or
two significant digits in the total uncertainty, \(t_{1-\alpha/2,\nu} \, \hat{\sigma}_f\).
This is the convention for rounding that has been used in the tables below.
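This rounding convention can be sketched for the interval at a temperature of 65 (the helper `round_sig` is hypothetical, introduced here only for illustration):

```python
import math

def round_sig(x, sig=2):
    """Round x to `sig` significant digits."""
    return round(x, sig - 1 - int(math.floor(math.log10(abs(x)))))

# Total uncertainty t * sigma-hat_f for the pressure estimate at 65:
half_width = 2.518674
u = round_sig(half_width)   # rounds to 2.5, which has one decimal place
decimals = 1
print(round(263.2081 - u, decimals), round(263.2081 + u, decimals))  # 260.7 265.7
```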

Pressure / Temperature Example

\(x\)  |  \(\hat{y}\)  |  \(\hat{\sigma}_f\)  |  \(t_{1-\alpha/2,\nu}\)  |  \(t_{1-\alpha/2,\nu} \, \hat{\sigma}_f\)  |  Lower 95% Confidence Bound  |  Upper 95% Confidence Bound
25  |  106.0025  |  1.1976162  |  2.024394  |  2.424447  |  103.6  |  108.4
45  |  184.6053  |  0.6803245  |  2.024394  |  1.377245  |  183.2  |  186.0
65  |  263.2081  |  1.2441620  |  2.024394  |  2.518674  |  260.7  |  265.7

Polymer Relaxation Example

\(x_1\)  |  \(x_2\)  |  \(\hat{y}\)  |  \(\hat{\sigma}_f\)  |  \(t_{1-\alpha/2,\nu}\)  |  \(t_{1-\alpha/2,\nu} \, \hat{\sigma}_f\)  |  Lower 95% Confidence Bound  |  Upper 95% Confidence Bound
20  |  25  |  5.586307  |  0.028402  |  2.000298  |  0.056812  |  5.529  |  5.643
80  |  25  |  4.998012  |  0.012171  |  2.000298  |  0.024346  |  4.974  |  5.022
20  |  50  |  6.960607  |  0.013711  |  2.000298  |  0.027427  |  6.933  |  6.988
80  |  50  |  5.342600  |  0.010077  |  2.000298  |  0.020158  |  5.322  |  5.363
20  |  75  |  7.521252  |  0.012054  |  2.000298  |  0.024112  |  7.497  |  7.545
80  |  75  |  6.220895  |  0.013307  |  2.000298  |  0.026618  |  6.194  |  6.248

Interpretation of Confidence Intervals

As mentioned above, confidence intervals capture the true value of the regression function with
a user-specified probability, the confidence level, using the estimated regression function and
the associated estimate of the error. Simulation of many sets of data from a process model
provides a good way to obtain a detailed understanding of the probabilistic nature of these
intervals. The advantage of using simulation is that the true model parameters are known, which
is never the case for a real process. This allows direct comparison of how confidence intervals
constructed from a limited amount of data relate to the true values that are being estimated.


The plot below shows 95 % confidence intervals computed using 50 independently generated data sets
that follow the same model as the data in the Pressure/Temperature example. Random errors from
a normal distribution with a mean of zero and a known standard deviation are added to each set of
true temperatures and true pressures that lie on a perfect straight line to obtain the simulated
data. Then each data set is used to compute a confidence interval for the average pressure at a
temperature of 65. The dashed reference line marks the true value of the average pressure at a
temperature of 65.
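A simulation along these lines can be sketched in pure Python. The true parameter values, error standard deviation, and design points below are illustrative assumptions, not the ones used to produce the plot, and the coverage factor is taken from the tabulated value for 38 degrees of freedom:

```python
# Monte Carlo sketch of confidence-interval coverage for a straight-line
# model resembling the Pressure/Temperature example.
import math
import random

random.seed(1)
beta0, beta1, sigma = 7.75, 3.93, 4.0   # assumed true parameters and error s.d.
xs = [20 + 2.5 * i for i in range(40)]  # 40 fixed temperatures, so df = 38
x0 = 65.0                               # predict the average pressure here
t_crit = 2.024394                       # t cutoff, 95 %, 38 degrees of freedom
true_mean = beta0 + beta1 * x0

n, xbar = len(xs), sum(xs) / len(xs)
sxx = sum((x - xbar) ** 2 for x in xs)

covered = 0
n_sets = 2000
for _ in range(n_sets):
    # Simulate one data set: true line plus normal random errors.
    ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]
    # Closed-form least squares fit of the straight line.
    b1 = sum((x - xbar) * y for x, y in zip(xs, ys)) / sxx
    b0 = sum(ys) / n - b1 * xbar
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))
    # Standard deviation of the estimated regression function at x0.
    se = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = b0 + b1 * x0
    if abs(yhat - true_mean) <= t_crit * se:
        covered += 1

print(covered / n_sets)  # close to 0.95
```

Repeating the simulation with more data sets drives the observed coverage proportion closer to the nominal 95 % level, mirroring the 5000- and 10000-set results described below.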

Confidence Intervals Computed from 50 Sets of Simulated Data


Confidence Level Specifies Long-Run Interval Coverage

From the plot it is easy to see that not all of the intervals contain the true value of the
average pressure. Data sets 16, 26, and 39 all produced intervals that did not cover the true
value of the average pressure at a temperature of 65. Sometimes the interval may fail to cover
the true value because random errors in the data set make the estimated pressure unusually high
or low. In other cases, the variability in the data may be underestimated,
leading to an interval that is too short to cover the true value. However, for 47 out of 50,
or approximately 95 % of the data sets, the confidence intervals did cover the true average
pressure. When the number of data sets was increased to 5000, confidence intervals computed for
4723, or 94.46 %, of the data sets covered the true average pressure. Finally, when the number
of data sets was increased to 10000, 95.12 % of the confidence intervals computed covered the
true average pressure. Thus, the simulation shows that although any particular confidence interval
might not cover its associated true value, in repeated experiments this method of constructing
intervals produces intervals that cover the true value at the rate specified by the user as the
confidence level. Unfortunately, when dealing with real processes with unknown parameters,
it is impossible to know whether or not a particular confidence interval does contain the true
value. It is nice to know that the error rate can be controlled, however, and can be set so that
it is far more likely than not that each interval produced does contain the true value.

Interpretation Summary

To summarize the interpretation of the probabilistic nature of confidence intervals in words:
in independent, repeated experiments, \(100(1-\alpha) \, \%\)
of the intervals
will cover the true values, given that the assumptions needed for the construction of the
intervals hold.
