| Identify Important Factors |
The Yates analysis generates a large number of potential models. From this list, we want to select the most appropriate model. This requires balancing two goals: the model should include all important factors, and it should be as simple (parsimonious) as possible.
Seven criteria are used to define important factors. They are not all equally important, nor will they yield identical subsets, in which case a consensus subset or a weighted consensus subset must be extracted. These criteria are examined here in the context of the Eddy current data set. The Yates Analysis page gave the sample Yates output for these data, and the Defining Models and Predictions page listed the potential models from the Yates analysis.
In practice, not all of these criteria will be used with every analysis, and some analysts may have additional criteria. The criteria are given as useful guidelines; most analysts will focus on those they find most useful. |
||||||||||||||||||||
| Criteria for Including Terms in the Model |
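The seven criteria that we can use in determining whether to keep a factor in the model can be summarized as follows.
1. Effects: Engineering Significance
2. Effects: Order of Magnitude
3. Effects: Statistical Significance
4. Effects: Probability Plots
5. Effects: Youden Plot
6. Residual Standard Deviation: Engineering Significance
7. Residual Standard Deviation: Statistical Significance
The last section summarizes the conclusions based on all of the criteria. |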
||||||||||||||||||||
| Effects: Engineering Significance |
The minimum engineering significant difference is defined as
$|\hat{\beta}_i| > \Delta$
where $|\hat{\beta}_i|$ is the absolute value of the parameter estimate (i.e., the effect) and $\Delta$ is the minimum engineering significant difference.
That is, declare a factor as "important" if the effect is greater
than some a priori declared engineering difference. This
implies that the engineering staff have in fact stated what a
minimum effect will be. Oftentimes this is not the case. In
the absence of an a priori difference, a good rough rule for
the minimum engineering significant difference is to set it to a fixed fraction of the current production average.
Based on this minimum engineering significant difference criterion, we conclude that we should keep two terms: X1 and X2. |
||||||||||||||||||||
| Effects: Order of Magnitude |
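The order of magnitude criterion is defined as
$|\hat{\beta}_i| < 0.10 \cdot \max_j |\hat{\beta}_j|$
where $\hat{\beta}_i$ is the effect estimate for term $i$. That is, exclude any term whose effect is less than 10% of the largest effect. For the Eddy current data, the largest effect is X1 (3.10250), so the cutoff is about 0.31.
Based on the order-of-magnitude criterion, we thus conclude that we should keep two terms: X1 and X2. A third term, X2*X3 (.29750), is just slightly under the cutoff level, so we may consider keeping it based on the other criteria. |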
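The order-of-magnitude screen can be sketched in a few lines of Python. This is only an illustration: it assumes the cutoff is 10% of the largest absolute effect, and it uses only the three effect estimates quoted on this page (the remaining four terms are not listed here). The function name `order_of_magnitude_split` is hypothetical.

```python
# Effect estimates quoted in the text for the Eddy current data.
# Only these three values appear on this page; the other terms are omitted.
effects = {"X1": 3.10250, "X2": -0.86750, "X2*X3": 0.29750}


def order_of_magnitude_split(effects, fraction=0.10):
    """Split terms by whether |effect| reaches `fraction` of the largest |effect|.

    The 10% default is an assumption (one order of magnitude below the maximum).
    """
    cutoff = fraction * max(abs(e) for e in effects.values())
    keep = [name for name, e in effects.items() if abs(e) >= cutoff]
    below = [name for name, e in effects.items() if abs(e) < cutoff]
    return cutoff, keep, below


cutoff, keep, below = order_of_magnitude_split(effects)
print(f"cutoff = {cutoff:.5f}")
print("keep:", keep)
print("below cutoff:", below)
```

With these inputs X1 and X2 clear the cutoff, while X2*X3 falls just short of it, which matches the discussion above.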
||||||||||||||||||||
| Effects: Statistical Significance |
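Statistical significance is defined as
$|\hat{\beta}_i| > 2\,\mathrm{s.e.}(\hat{\beta}_i)$
where $\mathrm{s.e.}(\hat{\beta}_i)$ is the standard deviation of the effect estimate. That is, declare a term as "important" if its effect is more than 2 standard deviations away from 0 (0 meaning "no effect").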
The "2" comes from normal theory (more specifically, a value of 1.96 yields a 95% confidence interval). More precise values would come from t-distribution theory.
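The difficulty with this is that in order to invoke this criterion we need the standard deviation, $\sigma$, of an observation. For the Eddy current data, the resulting rule is to keep all terms whose absolute effect is > 0.2850.
This results in keeping three terms: X1 (3.10250), X2 (-.86750), and X2*X3 (.29750). |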
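As a small check on where the "2" comes from, the following Python sketch computes the two-sided 95% point of the standard normal distribution and then applies the 0.2850 cutoff quoted above. It is an illustration under stated assumptions: only the three effects quoted on this page are used, and the term with effect .29750 is labeled X2*X3, following the order-of-magnitude section.

```python
from statistics import NormalDist

# Two-sided 95% point of the standard normal: the source of the rounded "2".
z_95 = NormalDist().inv_cdf(0.975)
print(f"z for a 95% interval = {z_95:.4f}")

# Effect estimates quoted in the text (the other terms are not listed on this page).
effects = {"X1": 3.10250, "X2": -0.86750, "X2*X3": 0.29750}

# Cutoff quoted in the text for the Eddy current data (two standard deviations).
cutoff = 0.2850

significant = [name for name, e in effects.items() if abs(e) > cutoff]
print("statistically significant at the rough 2-sigma level:", significant)
```

Note that X2*X3 passes only narrowly, so a more precise multiplier from the t distribution could change whether it is retained.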
||||||||||||||||||||
| Effects: Probability Plots |
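Probability plots can be used in the following manner.
1. Normal probability plot: keep a factor as "important" if it is displaced well off the line through zero on a normal probability plot of the effect estimates.
2. Half-normal probability plot: keep a factor as "important" if it is displaced well off the line near zero on a half-normal probability plot of the absolute values of the effect estimates.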
Since the half-normal probability plot is only concerned with effect magnitudes as opposed to signed effects (which are subject to the vagaries of how the initial factor codings +1 and -1 were assigned), the half-normal probability plot is preferred by some over the normal probability plot. |
||||||||||||||||||||
| Normal Probability Plot of Effects and Half-Normal Probability Plot of Effects |
The following plots show the normal probability plot of the effect estimates and the half-normal probability plot of the absolute values of the estimates for the Eddy current data.
For the example at hand, both probability plots clearly show two factors displaced off the line, and from the third plot (with factor tags included), we see that those two factors are factor 1 and factor 2. All of the remaining five effects are behaving like random drawings from a normal distribution centered at zero, and so are deemed to be statistically non-significant. In conclusion, this rule keeps two factors: X1 (3.10250) and X2 (-.86750). |
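The coordinates behind a half-normal probability plot can be sketched as follows. This is a minimal illustration, not the handbook's own procedure: the plotting positions `(i - 0.5) / n` are one common choice and are an assumption here, the function name `half_normal_points` is hypothetical, and the effect values in `demo_effects` are made up for illustration (they are not the Eddy current estimates).

```python
from statistics import NormalDist


def half_normal_points(effects):
    """Return (term, |effect|, half-normal quantile) tuples sorted by |effect|."""
    ordered = sorted(effects.items(), key=lambda kv: abs(kv[1]))
    n = len(ordered)
    std_normal = NormalDist()
    points = []
    for i, (name, e) in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # assumed plotting position in (0, 1)
        # If Z is standard normal, |Z| has quantile function Phi^{-1}((1 + p) / 2).
        q = std_normal.inv_cdf(0.5 + 0.5 * p)
        points.append((name, abs(e), q))
    return points


# Hypothetical effects, for illustration only.
demo_effects = {
    "A": 3.0, "B": -0.9, "C": 0.2,
    "A*B": 0.15, "A*C": -0.25, "B*C": 0.3, "A*B*C": 0.1,
}

points = half_normal_points(demo_effects)
for name, abs_effect, quantile in points:
    print(f"{name:6s} |effect| = {abs_effect:5.2f}   quantile = {quantile:5.3f}")
```

Plotting `|effect|` against the quantile gives the half-normal plot; terms that behave like noise fall near a line through the origin, while important terms (here A and B) sit well above it.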
||||||||||||||||||||
| Effects: Youden Plot |
A Youden plot can be used in the following way. Keep a factor as "important" if it is displaced away from the central-tendency "bunch" in a Youden plot of high and low averages. By definition, a factor is important when its average response for the low (-1) setting is significantly different from its average response for the high (+1) setting. Conversely, if the low and high averages are about the same, then it makes little difference which setting is used, and there is no reason to consider the factor important. This fact, in combination with the intrinsic benefits of the Youden plot for comparing pairs of items, leads to the technique of generating a Youden plot of the low and high averages. |
||||||||||||||||||||
| Youden Plot of Effect Estimates |
The following is the Youden plot of the effect estimates for the Eddy current data.
For the example at hand, the Youden plot clearly shows a cluster of points near the grand average (2.65875) with two displaced points above (factor 1) and below (factor 2). Based on the Youden plot, we conclude that we should keep two factors: X1 (3.10250) and X2 (-.86750). |
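The coordinates of a Youden plot point can be recovered from the grand average and the effect estimate. The sketch below assumes a balanced two-level design in which the effect is the high average minus the low average, so the two averages sit symmetrically about the grand average. Only the values quoted on this page are used, and the function name `low_high_averages` is hypothetical.

```python
grand_mean = 2.65875  # grand average quoted in the text

# Effect estimates quoted in the text (the other terms are not listed on this page).
effects = {"X1": 3.10250, "X2": -0.86750, "X2*X3": 0.29750}


def low_high_averages(grand_mean, effect):
    """Return (low average, high average) for one term.

    Assumes a balanced design with effect = high average - low average.
    """
    half = effect / 2.0
    return grand_mean - half, grand_mean + half


youden_points = {name: low_high_averages(grand_mean, e) for name, e in effects.items()}
for name, (low, high) in youden_points.items():
    print(f"{name:6s} low = {low:.4f}   high = {high:.4f}")
```

Under these assumptions X1 has a high average well above its low average and X2 has the reverse, while X2*X3 stays close to the grand average on both axes.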
||||||||||||||||||||
| Residual Standard Deviation: Engineering Significance |
This criterion is defined as follows: keep adding terms to the model while the residual standard deviation exceeds a pre-specified engineering cutoff.
This criterion is different from the others in that it is model focused. In practice, starting with the largest effect, we cumulatively add terms to the model and monitor how the residual standard deviation of each progressively more complicated model becomes smaller. At some point, the cumulative model becomes comprehensive enough that its residual standard deviation drops below the pre-specified engineering cutoff. At that point, we stop adding terms and declare all of the terms in the model to be "important" and everything not in the model to be "unimportant".
This approach implies that the engineer has considered what a minimum residual standard deviation should be. In effect, this relates to what the engineer can tolerate for the magnitude of the typical residual (the difference between the raw data and the predicted value from the model). In other words, how good does the engineer want the prediction equation to be? Unfortunately, this engineering specification has not always been formulated, and so this criterion can become moot.
In the absence of a pre-specified cutoff, a good rough rule is to keep adding terms until the residual standard deviation just dips below, say, 5% of the current production average. For the Eddy current data, let's say that the average detector has a sensitivity of 2.5 ohms. This would suggest that we keep adding terms to the model until the residual standard deviation falls below 5% of 2.5 ohms = 0.125 ohms.
Based on the minimum residual standard deviation criterion, and by scanning the far right column of the Yates table, we conclude that all of the terms must be kept: only the full model drives the residual standard deviation below 0.125.
Again, the 5% rule is a rough-and-ready rule that has no basis in engineering or statistics; it is simply a numerical convenience. Ideally, the engineer has a better cutoff for the residual standard deviation, based on how well they want the equation to perform in practice. If such a number were available, then for this criterion and data set we would select something less than the entire collection of terms. |
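The cumulative procedure described above can be sketched as a short loop. This is an illustration only: the function name `terms_until_cutoff` is hypothetical, and the term labels and residual standard deviations in `demo_models` are made-up placeholders, not the values from the Yates table. Only the 5% rule and the 2.5 ohm production average come from the text.

```python
def terms_until_cutoff(cumulative_models, cutoff):
    """Add terms in order until the residual standard deviation drops below cutoff.

    cumulative_models: list of (term, residual_sd) pairs ordered from largest
    to smallest absolute effect, where residual_sd belongs to the model that
    contains that term and every term before it.
    """
    kept = []
    for term, residual_sd in cumulative_models:
        kept.append(term)
        if residual_sd < cutoff:
            break
    return kept


production_average = 2.5            # ohms, from the text
cutoff = 0.05 * production_average  # the rough 5% rule, 0.125 ohms

# Hypothetical residual standard deviations, for illustration only.
demo_models = [("T1", 1.20), ("T2", 0.40), ("T3", 0.30), ("T4", 0.20), ("T5", 0.0)]

kept_strict = terms_until_cutoff(demo_models, cutoff)
kept_loose = terms_until_cutoff(demo_models, 0.35)
print("cutoff 0.125:", kept_strict)
print("cutoff 0.350:", kept_loose)
```

In this made-up example the strict cutoff forces every term into the model, while a looser engineering cutoff stops after three terms, which mirrors the point that a better-motivated cutoff can yield a smaller model.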
||||||||||||||||||||
| Residual Standard Deviation: Statistical Significance |
This criterion is defined as
residual standard deviation > $\sigma$
where $\sigma$ is the standard deviation of an observation under replicated conditions.
That is, keep declaring terms "important" until the cumulative model that includes them has a residual standard deviation smaller than $\sigma$. In practice, this criterion may be difficult to apply because $\sigma$ may not be known and, without replication, cannot be estimated independently of the model.
For the current case study, no such value of $\sigma$ was available. Thus, for this case, this criterion could not be used to yield a subset of "important" factors. |
||||||||||||||||||||
| Conclusions |
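In summary, the seven criteria for specifying "important" factors yielded the following for the Eddy current data:
1. Effects, engineering significance: X1, X2
2. Effects, order of magnitude: X1, X2
3. Effects, statistical significance: X1, X2, X2*X3
4. Effects, probability plots: X1, X2
5. Effects, Youden plot: X1, X2
6. Residual standard deviation, engineering significance: all terms
7. Residual standard deviation, statistical significance: not applicable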
Such conflicting results are common. Arguably, the three most important criteria (listed in order of most important) are:
Scanning all of the above, we thus declare the following consensus for the Eddy current data: the important factors are X1 (3.10250) and X2 (-.86750).
|
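If the consensus model retains only X1 and X2, as most of the criteria above suggest, the corresponding prediction equation can be sketched as below. This assumes the usual two-level coding (-1 for low, +1 for high), under which each model coefficient is half of the effect estimate. The function name `predict` is hypothetical; the numeric values are the grand average and effects quoted on this page.

```python
def predict(x1, x2, grand_mean=2.65875, effect_x1=3.10250, effect_x2=-0.86750):
    """Predicted response for coded settings x1, x2 in {-1, +1}.

    Assumes each coefficient is half of the corresponding effect estimate.
    """
    return grand_mean + 0.5 * (effect_x1 * x1 + effect_x2 * x2)


# Predictions at the four corners of the X1, X2 design square.
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(f"X1 = {x1:+d}, X2 = {x2:+d}: predicted = {predict(x1, x2):.5f}")
```

Under these assumptions the largest predicted response occurs with X1 high and X2 low, consistent with the signs of the two effects.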
||||||||||||||||||||