Purpose: Detect significant factors |
The analysis of variance (ANOVA)
(Neter, Wasserman,
and Kutner, 1990) is used to detect significant
factors in a multi-factor model. In the multi-factor model,
there is a response (dependent) variable and one or more
factor (independent) variables. This is a common model
in designed experiments
where the experimenter sets the values for each of the factor
variables and then measures the response variable.
Each factor can take on a certain number of values, referred to as the levels of that factor. The number of levels can vary between factors. For designed experiments, the number of levels for a given factor tends to be small. Each combination of factor levels is a cell. Balanced designs are those in which every cell has the same number of observations; unbalanced designs are those in which the number of observations varies among cells. It is customary to use balanced designs in designed experiments. |
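The cell and balance definitions above can be checked mechanically. The following is a minimal sketch, assuming each observation is recorded as a tuple of factor levels; the function names and the 2x2 design are hypothetical and used only for illustration.

```python
from collections import Counter
from math import prod


def cell_counts(rows):
    """Count observations per cell; each row is one observation's tuple of factor levels."""
    return Counter(tuple(row) for row in rows)


def is_balanced(rows):
    """Return True if every possible cell occurs and all cells have the same count."""
    counts = cell_counts(rows)
    if not counts:
        return False
    n_factors = len(next(iter(counts)))
    # Number of cells implied by the distinct levels observed for each factor.
    n_cells = prod(len({cell[f] for cell in counts}) for f in range(n_factors))
    return len(counts) == n_cells and len(set(counts.values())) == 1


# Hypothetical design: 2 factors, 2 levels each, 3 replicates per cell.
design = [(a, b) for a in (-1, 1) for b in (-1, 1) for _ in range(3)]

print(cell_counts(design))       # observations per cell
print(is_balanced(design))       # all four cells hold 3 observations
print(is_balanced(design[:-1]))  # dropping one observation unbalances the design
```

A design with an entirely empty cell is also reported as unbalanced, because the number of observed cells is compared with the number implied by the factor levels.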
||
| Definition |
The Product and Process
Comparisons chapter (chapter 7) contains a more extensive
discussion of 2-factor
ANOVA, including the details for the mathematical
computations.
The model for the analysis of variance can be stated in two mathematically equivalent ways. We explain the model for a two-way ANOVA (the concepts are the same for additional factors). As noted above, each combination of factor levels is called a cell. The subscript i refers to the level of factor 1, j refers to the level of factor 2, and k refers to the kth observation within the (i,j)th cell. For example, $Y_{235}$ refers to the fifth observation in the cell defined by the second level of factor 1 and the third level of factor 2.
The first model is
$$Y_{ijk} = \mu_{ij} + E_{ijk}$$
This model decomposes the response into a mean for each cell, $\mu_{ij}$, and an error term, $E_{ijk}$.
The second model is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + E_{ijk}$$
This model decomposes the response into an overall (grand) mean $\mu$, factor effects
($\alpha_i$ and $\beta_j$ represent the effects of the
ith level of the first factor and the jth level
of the second factor, respectively), and an error
term. The analysis of variance provides estimates of the grand
mean and the factor effects. The predicted values and the
residuals of the model are
$$\hat{Y}_{ijk} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j$$
$$R_{ijk} = Y_{ijk} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j$$
The distinction between these models is that the second model divides the cell mean into an overall mean and factor effects. This second model makes the factor effect more explicit, so we will emphasize this approach. |
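To make the second model concrete, the sketch below estimates the grand mean and the factor effects for a balanced two-way layout, then forms the residuals as observed minus predicted. It assumes a balanced design, where each effect can be estimated as a level mean minus the grand mean; the data values and function names are hypothetical.

```python
from statistics import mean


def fit_two_way(cells):
    """Estimate grand mean and factor effects; cells maps (i, j) -> list of observations.

    Assumes a balanced design, so each effect is (level mean - grand mean).
    """
    grand = mean([y for ys in cells.values() for y in ys])
    levels_i = sorted({i for i, _ in cells})
    levels_j = sorted({j for _, j in cells})
    alpha = {
        i: mean([y for (a, _), ys in cells.items() if a == i for y in ys]) - grand
        for i in levels_i
    }
    beta = {
        j: mean([y for (_, b), ys in cells.items() if b == j for y in ys]) - grand
        for j in levels_j
    }
    return grand, alpha, beta


def residuals(cells, grand, alpha, beta):
    """Residual = observation minus predicted value (grand mean + both effects)."""
    return {
        (i, j): [y - (grand + alpha[i] + beta[j]) for y in ys]
        for (i, j), ys in cells.items()
    }


# Hypothetical balanced 2x2 data with two observations per cell.
cells = {
    (1, 1): [10.0, 12.0],
    (1, 2): [14.0, 16.0],
    (2, 1): [20.0, 22.0],
    (2, 2): [24.0, 26.0],
}

grand, alpha, beta = fit_two_way(cells)
res = residuals(cells, grand, alpha, beta)
print(grand)  # 18.0
print(alpha)  # {1: -5.0, 2: 5.0}
print(beta)   # {1: -2.0, 2: 2.0}
print(res[(1, 1)])  # [-1.0, 1.0]
```

In a balanced design the estimated effects for each factor sum to zero across its levels, as the example values show.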
||
| Model Validation | The ANOVA model assumes that the error term, $E_{ijk}$, follows the assumptions for a univariate measurement process. After performing an analysis of variance, the model should therefore be validated by analyzing the residuals. | ||
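As one possible numerical complement to residual analysis, the sketch below computes two simple summaries of residuals taken in run order: the lag-1 autocorrelation (values near zero are consistent with no serial dependence) and a comparison of the first and second halves of the sequence (a shift in mean or a change in spread would be a warning sign). The choice of these two checks, the function names, and the residual values are assumptions made for illustration; they do not replace graphical residual analysis.

```python
from statistics import mean, stdev


def lag1_autocorrelation(x):
    """Lag-1 sample autocorrelation of a sequence taken in run order."""
    m = mean(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


def half_split_summary(x):
    """Compare location and spread between the first and second halves of x."""
    h = len(x) // 2
    first, second = x[:h], x[h:]
    return {
        "mean_shift": mean(second) - mean(first),
        "sd_ratio": stdev(second) / stdev(first),
    }


# Hypothetical residuals in run order.
resid = [0.5, -1.2, 0.3, 0.9, -0.4, -0.6, 1.1, -0.6]

print(lag1_autocorrelation(resid))
print(half_split_summary(resid))
```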
|
Sample Output |
Dataplot generated the following ANOVA output for the
JAHANMI2.DAT data set:
**********************************
**********************************
** 4-WAY ANALYSIS OF VARIANCE **
**********************************
**********************************
NUMBER OF OBSERVATIONS = 480
NUMBER OF FACTORS = 4
NUMBER OF LEVELS FOR FACTOR 1 = 2
NUMBER OF LEVELS FOR FACTOR 2 = 2
NUMBER OF LEVELS FOR FACTOR 3 = 2
NUMBER OF LEVELS FOR FACTOR 4 = 2
BALANCED CASE
RESIDUAL STANDARD DEVIATION = 0.63057727814E+02
RESIDUAL DEGREES OF FREEDOM = 475
REPLICATION CASE
REPLICATION STANDARD DEVIATION = 0.61890106201E+02
REPLICATION DEGREES OF FREEDOM = 464
NUMBER OF DISTINCT CELLS = 16
*****************
* ANOVA TABLE *
*****************
SOURCE              DF  SUM OF SQUARES     MEAN SQUARE  F STATISTIC     F CDF  SIG
----------------------------------------------------------------------------------
TOTAL (CORRECTED)  479  2668446.000000     5570.868652
----------------------------------------------------------------------------------
FACTOR 1             1    26672.726562    26672.726562       6.7080   99.011%  **
FACTOR 2             1    11524.053711    11524.053711       2.8982   91.067%
FACTOR 3             1    14380.633789    14380.633789       3.6166   94.219%
FACTOR 4             1   727143.125000   727143.125000     182.8703  100.000%  **
----------------------------------------------------------------------------------
RESIDUAL           475  1888731.500000     3976.276855
RESIDUAL STANDARD DEVIATION = 63.05772781
RESIDUAL DEGREES OF FREEDOM = 475
REPLICATION STANDARD DEVIATION = 61.89010620
REPLICATION DEGREES OF FREEDOM = 464
LACK OF FIT F RATIO = 2.6447 = THE 99.7269% POINT OF THE
F DISTRIBUTION WITH 11 AND 464 DEGREES OF FREEDOM
****************
* ESTIMATION *
****************
GRAND MEAN = 0.65007739258E+03
GRAND STANDARD DEVIATION = 0.74638252258E+02
            LEVEL-ID    NI        MEAN     EFFECT  SD(EFFECT)
-------------------------------------------------------------
FACTOR 1--  -1.00000  240.   657.53168    7.45428     2.87818
        --   1.00000  240.   642.62286   -7.45453     2.87818
FACTOR 2--  -1.00000  240.   645.17755   -4.89984     2.87818
        --   1.00000  240.   654.97723    4.89984     2.87818
FACTOR 3--  -1.00000  240.   655.55084    5.47345     2.87818
        --   1.00000  240.   644.60376   -5.47363     2.87818
FACTOR 4--   1.00000  240.   688.99890   38.92151     2.87818
        --   2.00000  240.   611.15594  -38.92145     2.87818
MODEL                         RESIDUAL STANDARD DEVIATION
-------------------------------------------------------
CONSTANT ONLY--                     74.6382522583
CONSTANT & FACTOR 1 ONLY--          74.3419036865
CONSTANT & FACTOR 2 ONLY--          74.5548019409
CONSTANT & FACTOR 3 ONLY--          74.5147094727
CONSTANT & FACTOR 4 ONLY--          63.7284545898
CONSTANT & ALL 4 FACTORS --         63.0577278137
|
| Interpretation of Sample Output |
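The output is divided into three sections. The first section prints the number of observations (480), the number of factors (4), and the number of levels for each factor (2 for each factor), together with overall summary statistics such as the residual standard deviation (63.058) and the replication standard deviation (61.890). The second section prints the ANOVA table, which lists the degrees of freedom, sum of squares, mean square, F statistic, and F CDF for each factor; in this output, factors 1 and 4 carry the ** significance flag (F CDF values of 99.011% and 100.000%), while factors 2 and 3 (91.067% and 94.219%) do not. This section also reports a lack-of-fit F ratio of 2.6447. The third section prints the estimation results: the grand mean (650.077), the mean and estimated effect for each level of each factor, and the residual standard deviation for a series of models, which drops from 74.638 for the constant-only model to 63.058 when all four factors are included, with factor 4 alone accounting for most of that reduction (63.728).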
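Several entries in the sample output can be reproduced from other entries, which helps in reading the table. The sketch below recomputes the F statistics, the residual standard deviation, the lack-of-fit F ratio, and one effect with its standard deviation, using values copied from the output. The lack-of-fit and SD(EFFECT) formulas are the usual balanced-design expressions and are an assumption about how Dataplot arrived at these numbers, although they agree with the printed values to the displayed precision.

```python
from math import sqrt

# Values copied from the Dataplot output.
n_obs = 480
n_per_level = 240
ss_resid, df_resid, ms_resid = 1888731.5, 475, 3976.276855
sd_rep, df_rep = 61.890106201, 464
factor_ss = {1: 26672.726562, 2: 11524.053711, 3: 14380.633789, 4: 727143.125}
factor_df = 1  # each factor has 2 levels, hence 1 degree of freedom
grand_mean = 650.07739258
factor1_low_mean = 657.53168

# F statistic = factor mean square / residual mean square.
f_stats = {k: (ss / factor_df) / ms_resid for k, ss in factor_ss.items()}

# Residual standard deviation = square root of the residual mean square.
resid_sd = sqrt(ms_resid)

# Lack of fit (assumed decomposition): residual SS = lack-of-fit SS + pure-error SS,
# with pure-error SS recovered from the replication standard deviation.
ms_pure = sd_rep ** 2
ss_pure = df_rep * ms_pure
df_lof = df_resid - df_rep
lof_f = ((ss_resid - ss_pure) / df_lof) / ms_pure

# Effect = level mean - grand mean; its standard deviation in a balanced design
# (assumed formula) is s * sqrt(1/n_level - 1/N).
effect = factor1_low_mean - grand_mean
sd_effect = resid_sd * sqrt(1 / n_per_level - 1 / n_obs)

for k, f in f_stats.items():
    print(f"factor {k}: F = {f:.4f}")
print(f"residual SD = {resid_sd:.4f}")
print(f"lack-of-fit F = {lof_f:.4f} on {df_lof} and {df_rep} df")
print(f"factor 1, level -1: effect = {effect:.5f}, SD(effect) = {sd_effect:.5f}")
```

The component sums of squares add up to the corrected total only approximately (a difference of about 6 in 2.67 million), which is consistent with limited-precision arithmetic in the printed output.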
In addition to the quantitative ANOVA output, it is recommended that any analysis of variance be complemented with model validation. At a minimum, this should include
|
| Questions |
The analysis of variance can be used to answer the following
questions:
|
| Related Techniques |
One-factor analysis of variance, Two-sample t-test, Box plot, Block plot, DEX mean plot |
| Case Study | The quantitative ANOVA approach can be contrasted with the more graphical EDA approach in the ceramic strength case study. |
| Software | Most general-purpose statistical software programs, including Dataplot, can perform multi-factor analysis of variance. |