4. Process Modeling
4.3. Data Collection for Process Modeling


Experimental Design Principles Applied to Process Modeling 
There are six principles of experimental design as applied to
process modeling:


Capacity for Primary Model  For your best-guess model, make sure that the design has the capacity for estimating the coefficients of that model. For a simple example of this, if you are fitting a quadratic model, then make sure you have at least three distinct horizontal-axis points.
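The three-point requirement generalizes: a polynomial of degree \(d\) in one factor has \(d + 1\) coefficients, and they can all be estimated only if the design contains at least \(d + 1\) distinct \(x\) values. A minimal Python sketch of that check follows; the function name `has_capacity` is hypothetical, and the check applies only to one-factor polynomial models.

```python
def has_capacity(x_values, degree):
    """Return True if a one-factor polynomial of the given degree can be
    estimated from a design with these x values.

    A degree-d polynomial has d + 1 coefficients, so the design needs
    at least d + 1 distinct x values (replicates do not count twice).
    """
    return len(set(x_values)) >= degree + 1


# Hypothetical designs for a quadratic (degree 2) model.
print(has_capacity([1.0, 2.0, 3.0], degree=2))       # three distinct points
print(has_capacity([1.0, 1.0, 2.0, 2.0], degree=2))  # only two distinct points
```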
Capacity for Alternative Model  If your best-guess model happens to be inadequate, make sure that the design has the capacity to estimate the coefficients of your best-guess backup alternative model (which means implicitly that you should have already identified such a model). For a simple example, if you suspect (but are not positive) that a linear model is appropriate, then it is best to employ a globally robust design (say, four points at each extreme and two points in the middle, for a ten-point design) as opposed to the locally optimal design (such as five points at each extreme). The locally optimal design will provide a best fit to the line, but have no capacity to fit a quadratic. The globally robust design will provide a good (though not optimal) fit to the line and additionally provide a good (though not optimal) fit to the quadratic.
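The trade-off between the two kinds of design can be illustrated numerically. The Python sketch below assumes a factor coded to the interval \([-1, 1]\) and two hypothetical ten-point layouts: five points at each extreme, and four points at each extreme with two in the middle. For a straight-line fit the variance of the slope is inversely proportional to \(\sum (x_{i} - \bar{x})^{2}\), so a larger sum means a more precise slope. The function name `design_summary` is an assumption made for illustration.

```python
def design_summary(x_values):
    """Summarize a one-factor design: size, distinct levels, spread."""
    n = len(x_values)
    x_bar = sum(x_values) / n
    # Sum of squares of x about its mean; Var(slope) = sigma^2 / sxx.
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    levels = len(set(x_values))
    return {
        "n": n,
        "levels": levels,
        "sxx": sxx,
        # A quadratic needs at least three distinct x values.
        "can_fit_quadratic": levels >= 3,
    }


# Hypothetical ten-point designs on a coded [-1, 1] axis.
locally_optimal = [-1.0] * 5 + [1.0] * 5
globally_robust = [-1.0] * 4 + [0.0] * 2 + [1.0] * 4

for name, design in [("locally optimal", locally_optimal),
                     ("globally robust", globally_robust)]:
    print(name, design_summary(design))
```

Under these assumptions the robust layout gives up some slope precision (its sum of squares is 8 rather than 10) in exchange for the third level that a quadratic fit requires.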
Minimum Variance of Coefficient Estimators  For a given model, make sure the design has the property of minimizing the variation of the least squares estimated coefficients. This is a general principle that is always in effect but which in practice is hard to implement for many models beyond the simpler 1-factor $$ y = f(x;\vec{\beta}) + \varepsilon $$ models. For more complicated 1-factor models, and for most multi-factor $$ y = f(\vec{x};\vec{\beta}) + \varepsilon $$ models, the expressions for the variance of the least squares estimators, although available, are complicated and assume more than the analyst typically knows. The net result is that this principle, though important, is harder to apply beyond the simple cases.
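For the simplest case, a straight-line model \(y = \beta_{0} + \beta_{1}x + \varepsilon\) with constant error variance \(\sigma^{2}\), the standard least squares result makes the principle concrete. The symbols \(\bar{x}\) and \(S_{xx}\) are notation introduced here for illustration, not taken from the text above.

$$ \mathrm{Var}(\hat{\beta}_{1}) = \frac{\sigma^{2}}{S_{xx}}, \qquad S_{xx} = \sum_{i=1}^{n} (x_{i} - \bar{x})^{2} $$

For a fixed number of points \(n\) on a fixed interval, \(S_{xx}\) is largest when the points are split evenly between the two ends of the interval, which is why a two-extremes design minimizes the variance of the slope of a straight line.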
Sample Where the Variation Is (Non-Constant Variance Case)  Regardless of the simplicity or complexity of the model, there are situations in which certain regions of the curve are noisier than others. A simple case is when there is a linear relationship between \(x\) and \(y\) but the recording device is proportional rather than absolute and so larger values of \(y\) are intrinsically noisier than smaller values of \(y\). In such cases, sampling where the variation is means to have more replicated points in those regions that are noisier. The practical answer to how many such replicated points there should be is $$ n_{i} \propto \sigma_{i}^{2} $$ with \(\sigma_{i}\) denoting the theoretical standard deviation for that given region of the curve, so that the variance of the local mean, \(\sigma_{i}^{2}/n_{i}\), is roughly the same in every region. Usually \(\sigma_{i}\) is estimated by a priori guesses for what the local standard deviations are.
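As a rough sketch of how such an allocation might be computed, the Python function below splits a fixed budget of runs across regions in proportion to the guessed variances \(\sigma_{i}^{2}\), so that \(\sigma_{i}^{2}/n_{i}\) comes out roughly equal and the noisier regions receive more replicates. The function name, the rounding rule (largest remainder), and the example standard deviations are assumptions made for illustration.

```python
def allocate_replicates(sigmas, total_runs):
    """Split total_runs across regions in proportion to sigma_i ** 2.

    sigmas: a priori guesses of the standard deviation in each region.
    Returns a list of integer replicate counts that sums to total_runs.
    A very quiet region can receive zero runs; adjust by hand if needed.
    """
    weights = [s * s for s in sigmas]
    weight_sum = sum(weights)
    raw = [total_runs * w / weight_sum for w in weights]
    counts = [int(r) for r in raw]  # round down first
    leftover = total_runs - sum(counts)
    # Hand the remaining runs to the regions with the largest remainders.
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical guesses: noise doubles, then triples, along the curve.
print(allocate_replicates([1.0, 2.0, 3.0], total_runs=28))  # -> [2, 8, 18]
```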
Sample Where the Variation Is (Steep Curve Case)  A common occurrence for nonlinear models is for some regions of the curve to be steeper than others. For example, in fitting an exponential model (small \(x\) corresponding to large \(y\), and large \(x\) corresponding to small \(y\)) it is often the case that the \(y\) data in the steep region are intrinsically noisier than the \(y\) data in the relatively flat regions. The reason for this is that commonly the \(x\) values themselves have a bit of noise and this \(x\)-noise gets translated into larger \(y\)-noise in the steep sections than in the shallow sections. In such cases, when we know the shape of the response curve well enough to identify steep-versus-shallow regions, it is often a good idea to sample more heavily in the steep
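regions than in the shallow regions. A practical rule-of-thumb for where to position the \(x\) values in such situations is to divide the vertical axis into equally spaced \(y\) values (one per design point), project each of them horizontally onto the curve, and drop a vertical line from each intersection to the horizontal axis; the resulting \(x\) values are the recommended design points. The above rough procedure for an exponentially decreasing curve would thus yield a logarithmic preponderance of points in the steep region of the curve and relatively few points in the flatter part of the curve.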
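One way to obtain such a logarithmic concentration of points is to space target \(y\) values evenly between the curve's largest and smallest values and then invert the curve to find the matching \(x\) values. The Python sketch below does this for an assumed model \(y = a e^{-bx}\); the function name and the parameter values are hypothetical, and in practice \(a\) and \(b\) would be a priori guesses.

```python
import math


def steep_curve_design(n_points, x_min, x_max, a=1.0, b=1.0):
    """Choose x values for the assumed curve y = a * exp(-b * x).

    The y range over [x_min, x_max] is cut into equally spaced values,
    and each one is mapped back to x through the inverse of the curve.
    Requires n_points >= 2, a > 0 and b > 0.
    """
    y_high = a * math.exp(-b * x_min)  # steep end of the curve
    y_low = a * math.exp(-b * x_max)   # flat end of the curve
    step = (y_high - y_low) / (n_points - 1)
    y_targets = [y_high - k * step for k in range(n_points)]
    # Invert y = a * exp(-b * x) to get x = -ln(y / a) / b.
    return [-math.log(y / a) / b for y in y_targets]


# Hypothetical example: five points on [0, 3] with a = b = 1.
for x in steep_curve_design(5, 0.0, 3.0):
    print(round(x, 3))
```

With these assumed values, four of the five points fall in the lower half of the \(x\) range, where the curve is steepest.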

Replication  If affordable, replication should be part of every design. Replication allows us to compute a model-independent estimate of the process standard deviation. Such an estimate may then be used as a criterion in an objective lack-of-fit test to assess whether a given model is adequate. Such an objective lack-of-fit F-test can be employed only if the design has built-in replication. Some replication is essential; replication at every point is ideal.
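To show what replication buys, the Python sketch below computes the pure-error sum of squares from replicated points (an estimate of process variation that does not depend on any model) and uses it in a lack-of-fit F statistic for a straight-line fit. The function names and the data are hypothetical, and the straight line is only an assumed candidate model. The statistic would be compared with an F distribution on the lack-of-fit and pure-error degrees of freedom.

```python
from collections import defaultdict


def pure_error(xs, ys):
    """Return (sum of squares, degrees of freedom, number of x levels)
    computed from replicates, i.e. runs sharing exactly the same x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    ss, df = 0.0, 0
    for values in groups.values():
        mean = sum(values) / len(values)
        ss += sum((v - mean) ** 2 for v in values)
        df += len(values) - 1
    return ss, df, len(groups)


def lack_of_fit_f(xs, ys):
    """Lack-of-fit F statistic for a least squares straight line.

    Needs at least three distinct x levels and some replication.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_resid = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_pe, df_pe, levels = pure_error(xs, ys)
    ss_lof = ss_resid - ss_pe  # residual variation not explained by noise
    df_lof = levels - 2        # x levels minus two line coefficients
    return (ss_lof / df_lof) / (ss_pe / df_pe)


# Hypothetical data: two replicates at each of three x values.
xs = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
ys = [1.0, 1.2, 3.9, 4.1, 9.0, 9.2]
ss_pe, df_pe, _ = pure_error(xs, ys)
print("pure-error standard deviation:", (ss_pe / df_pe) ** 0.5)
print("lack-of-fit F:", lack_of_fit_f(xs, ys))
```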
Randomization  Just because the \(x\)'s have some natural ordering does not mean that the data should be collected in the same order as the \(x\)'s. Some aspect of randomization should enter into every experiment, and experiments for process modeling are no exception. Thus if you are sampling ten points on a curve, the ten \(y\) values should not be collected by sequentially stepping through the \(x\) values from the smallest to the largest. If you do so, and if some extraneous drifting or wear occurs in the machine, the operator, the environment, the measuring device, etc., then that drift will unwittingly contaminate the \(y\) values and in turn contaminate the final fit. To minimize the effect of such potential drift, it is best to randomize (for example, with random number tables) the sequence of the \(x\) values. This will not make the drift go away, but it will spread the drift effect evenly over the entire curve, realistically inflating the variation of the fitted values, and providing some mechanism after the fact (at the residual analysis model validation stage) for uncovering or discovering such a drift. If you do not randomize the run sequence, you give up your ability to detect such a drift if it occurs.
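In place of random number tables, a run order can be generated in software. The Python sketch below shuffles the planned \(x\) values with the standard library; the function name and the ten example values are hypothetical, and the optional seed is there only so that a run sheet can be reproduced.

```python
import random


def randomized_run_order(x_values, seed=None):
    """Return the planned x values in a random run order.

    The input list is left unchanged. Passing a seed makes the
    order reproducible, which helps when documenting the design.
    """
    rng = random.Random(seed)
    order = list(x_values)
    rng.shuffle(order)
    return order


# Hypothetical plan: ten points on a curve, listed smallest to largest.
planned = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(randomized_run_order(planned, seed=2024))
```

Recording the run order alongside the data is what later allows residuals to be plotted against run sequence to look for drift.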