|
4. Process Modeling
4.3. Data Collection for Process Modeling
|
|||
| Output from Process Model is Fitted Mathematical Function |
The output from process modeling is a fitted mathematical function
with estimated coefficients. For example, in modeling resistivity,
$y$, as a function of dopant density,
$x$, an analyst may suggest the function
$$y = \beta_0 + \beta_1 x + \beta_{11} x^2$$
in which the coefficients to be estimated are $\beta_0$, $\beta_1$, and $\beta_{11}$.
|
||
| What are Good Coefficient Values? |
Poor values of the coefficients are those for which the resulting
predicted values are considerably different from the observed raw
data $y$. Good values of the coefficients are those
for which the resulting predicted values are close to the observed
raw data $y$. The best values of the coefficients are those
for which the resulting predicted values are close to the observed raw
data $y$, and the statistical uncertainty connected with
each coefficient is small.
|
||
There are two considerations that are useful for the generation
of "best" coefficients:
1. the least squares criterion
2. design of experiment principles
|
|||
| Least Squares Criterion |
For a given data set (e.g., 10 $(x, y)$
pairs), the most common procedure for obtaining the
coefficients of the chosen model function is the
least squares estimation
criterion. This criterion yields coefficients with
predicted values that are closest to the raw data
in the sense that the sum of the squared differences between the
raw data and the predicted values is as small as possible.
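The overwhelming majority of regression programs today use the least squares criterion for estimating the model coefficients. Least squares estimates are popular because, under the standard regression assumptions, they are the best linear unbiased estimators and, for models that are linear in the coefficients, they can be computed in closed form.
For a given set of $x$ values, the least squares estimates cannot be improved upon; but frequently the choice of the $x$
values is under our control. If we can select the $x$ values,
the coefficients will have less variability than if the $x$ values
are not controlled.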
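The least squares criterion described above can be sketched for the straight-line case. This is a minimal illustration, not a reference implementation: the function names `fit_line` and `sum_sq_residuals` and the ten data pairs are made up for the example.

```python
# Minimal sketch: least squares fit of a straight line y = b0 + b1*x.
# The data values below are invented purely for illustration.

def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx             # estimated slope
    b0 = y_bar - b1 * x_bar    # estimated intercept
    return b0, b1

def sum_sq_residuals(xs, ys, b0, b1):
    """Sum of squared differences between raw data and predicted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

b0, b1 = fit_line(xs, ys)
best = sum_sq_residuals(xs, ys, b0, b1)
# Moving the coefficients away from the least squares values
# can only increase the sum of squared residuals.
nudged = sum_sq_residuals(xs, ys, b0 + 0.1, b1 - 0.02)
print(f"b0={b0:.3f}, b1={b1:.3f}, SSR={best:.4f}, nudged SSR={nudged:.4f}")
```

The comparison with the nudged coefficients shows the sense in which the fitted values are "closest" to the raw data: no other intercept and slope give a smaller sum of squares for these $x$ and $y$ values.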
|
||
| Design of Experiment Principles |
As to what values should be used for the $x$'s, we look
to established experimental design principles for guidance.
|
||
| Principle 1: Minimize Coefficient Estimation Variation |
The first principle of experimental design is to
control the values within the $x$ vector such that
after the data are collected, the subsequent model
coefficients are as good, in the sense of having the smallest
variation, as possible.
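The key underlying point with respect to design of
experiments and process modeling is that even though (for
simple $(x, y)$ fitting, for example) least squares gives the best coefficient estimates for a given set of $x$ values, some sets of $x$ values yield coefficient estimates with smaller variation than others.
|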
||
| Five Designs |
To see the effect of experimental design on process modeling,
consider the following simplest case of fitting a line:
$$y = \beta_0 + \beta_1 x$$
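Suppose the analyst can afford 10 observations (that is, 10
$(x, y)$ pairs) for estimating the intercept $\beta_0$ and the slope $\beta_1$. Where along the $x$ range should the 10 points be placed? Five candidate designs are compared: design #1 uses ten equi-spaced points; design #3 places half of the points at each extreme; design #4 places almost all of the points at the mid-range, with a single point at each end; designs #2 and #5 are other arrangements that use more than two distinct $x$ values. Or
perhaps it doesn't make any difference?
For each of the above five experimental designs, there will of course be $y$ data collected and a least squares line fitted.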
|
||
| Are the Fitted Lines Better for Some Designs? |
But are the fitted lines, i.e., the fitted process models, better
for some designs than for others? Are the coefficient estimator
variances smaller for some designs than for others? For given
estimates, are the resulting predicted values better (that is,
closer to the observed values) than for other designs? The
answer to all of the above is YES. It DOES make a difference.
The most popular answer to the above question about which
design to use for linear modeling is design #1 with ten
equi-spaced points. It can be shown, however, that the
variance of the estimated slope parameter depends on the
design according to the relationship
$$\mathrm{Var}(\hat{\beta}_1) \propto \frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Therefore, to obtain minimum variance estimators, one maximizes the denominator on the right. To maximize the denominator, it is (for an arbitrarily fixed $\bar{x}$) best to position the
$x_i$'s as far away from
$\bar{x}$ as possible. This is done by positioning half of the
$x_i$'s at the lower extreme and the other half
at the upper extreme.
This is design #3 above, and this "dumbbell" design (half low and
half high) is in fact the best possible design for fitting a line.
Upon reflection, this agrees with the adage that
"2 points define a line": it makes the most sense to place
those 2 points as far apart as possible (at the extremes) and to
determine each of them as well as possible (by putting half the data at each extreme).
Hence the design of experiment solution to process modeling
when the model is a line is the "dumbbell" design: half the
$x$'s at each extreme.
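The effect of the design on the slope variance can be checked numerically by computing the denominator $\sum (x_i - \bar{x})^2$ for each design. The sketch below is illustrative only. Designs 1, 3, and 4 follow the descriptions in the text; the exact layouts used for designs 2 and 5, the scaling of the $x$ range to $[-1, 1]$, and the helper name `sxx` are assumptions made for the example.

```python
# Sketch: compare designs by the denominator sum((x - x_bar)^2).
# A larger denominator means a smaller variance of the estimated slope.
# Designs 1, 3 and 4 follow the text; the layouts of designs 2 and 5
# are ASSUMED here. The x range is scaled to [-1, 1] (also assumed).

def sxx(xs):
    """Sum of squared deviations of the x values from their mean."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)

designs = {
    1: [-1 + 2 * i / 9 for i in range(10)],          # ten equi-spaced points
    2: [-1, -1, -0.5, -0.5, 0, 0, 0.5, 0.5, 1, 1],   # assumed layout
    3: [-1] * 5 + [1] * 5,                           # "dumbbell": half low, half high
    4: [-1] + [0] * 8 + [1],                         # almost all at mid-range
    5: [-1] * 4 + [0] * 2 + [1] * 4,                 # assumed layout
}

denominators = {k: sxx(v) for k, v in designs.items()}
# Rank designs from largest denominator (smallest slope variance) down.
ranking = sorted(denominators, key=denominators.get, reverse=True)
for k in ranking:
    print(f"design {k}: sum((x - x_bar)^2) = {denominators[k]:.3f}")
```

Under these assumed layouts the dumbbell design gives the largest denominator and the mid-range-heavy design gives the smallest, in line with the argument above.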
|
||
| What is the Worst Design? |
What is the worst design in the above case? Of the five designs, the worst design is the one that gives the estimated slope its maximum variation. In the mathematical expression above, it is the one that minimizes the denominator, and so this is design #4 above, for which almost all of the data are located at the mid-range. The estimated line in this case is going to chase the solitary point at each end, and so the resulting linear fit is intuitively inferior.
|
||
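A small simulation can make the contrast between the best and worst designs concrete. The sketch below repeatedly generates noisy data from an assumed true line, fits the slope by least squares, and compares the spread of the fitted slopes for the dumbbell design and the mid-range-heavy design. The true coefficients, noise level, number of simulations, seed, and the function names `fit_slope` and `slope_variance` are all assumptions chosen for illustration.

```python
import random
import statistics

def fit_slope(xs, ys):
    """Least squares slope for a straight-line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

def slope_variance(xs, n_sims=2000, beta0=1.0, beta1=2.0, sigma=1.0, seed=0):
    """Empirical variance of the fitted slope over simulated data sets.

    The true line (beta0, beta1) and noise level sigma are assumed values.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return statistics.variance(slopes)

dumbbell = [-1] * 5 + [1] * 5        # half low, half high
mid_heavy = [-1] + [0] * 8 + [1]     # one point at each end, eight at mid-range

var_dumbbell = slope_variance(dumbbell)
var_mid_heavy = slope_variance(mid_heavy)
print(f"dumbbell slope variance:  {var_dumbbell:.4f}")
print(f"mid-heavy slope variance: {var_mid_heavy:.4f}")
```

With ten points on $[-1, 1]$, the denominators are 10 for the dumbbell layout and 2 for the mid-range-heavy layout, so the simulated slope variance for the latter should come out roughly five times larger.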
| Designs 1, 2, and 5 |
What about the other 3 designs? Designs 1, 2, and 5 are
useful only for the case when we think the model may be
linear, but we are not sure, and so we allow additional
points that permit fitting a line if appropriate, but build
into the design the "capacity" to fit beyond a line (e.g.,
quadratic, cubic, etc.) if necessary. In this regard, the
ordering of the designs would be
|
||