

1.3.5.5. Multi-factor Analysis of Variance

Purpose:
Detect significant factors
The analysis of variance (ANOVA) (Neter, Wasserman, and Kutner, 1990) is used to detect significant factors in a multi-factor model. In the multi-factor model, there is a response (dependent) variable and one or more factor (independent) variables. This is a common model in designed experiments where the experimenter sets the values for each of the factor variables and then measures the response variable.

Each factor can take on a certain number of values. These are referred to as the levels of a factor. The number of levels can vary between factors. For designed experiments, the number of levels for a given factor tends to be small. Each combination of factor levels is a cell. Balanced designs are those in which the cells have an equal number of observations, and unbalanced designs are those in which the number of observations varies among cells. It is customary to use balanced designs in designed experiments.
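
The distinction between balanced and unbalanced designs can be checked mechanically by counting the observations in each cell. The following is a minimal sketch under stated assumptions: the run list is made up for illustration, and the helper names cell_counts and is_balanced are hypothetical, not part of Dataplot or any library.

```python
from collections import Counter
from itertools import product

def cell_counts(runs):
    """Count observations in every cell, including cells with no runs."""
    counts = Counter(runs)
    n_factors = len(runs[0])
    # Observed levels of each factor, taken from the runs themselves.
    levels = [sorted({run[f] for run in runs}) for f in range(n_factors)]
    return {cell: counts.get(cell, 0) for cell in product(*levels)}

def is_balanced(runs):
    """A design is balanced when every cell has the same number of runs."""
    return len(set(cell_counts(runs).values())) == 1

# Hypothetical two-factor design: each tuple is (level of factor 1, level of factor 2).
runs = [(1, 1), (1, 1), (1, 2), (1, 2), (2, 1), (2, 1), (2, 2), (2, 2)]

print(cell_counts(runs))
print(is_balanced(runs))       # every cell has two observations
print(is_balanced(runs[:-1]))  # dropping one run leaves cell (2, 2) short
```

Enumerating the full cross product of levels means that a cell with no observations at all is counted as zero and therefore also makes the design unbalanced.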

Definition
The Product and Process Comparisons chapter (chapter 7) contains a more extensive discussion of 2-factor ANOVA, including the details for the mathematical computations.

The model for the analysis of variance can be stated in two mathematically equivalent ways. We explain the model for a two-way ANOVA (the concepts are the same for additional factors). As above, each combination of factor levels is called a cell. The subscript i refers to the level of factor 1, j refers to the level of factor 2, and k refers to the kth observation within the (i,j)th cell. For example, Y(235) refers to the fifth observation in the cell defined by the second level of factor 1 and the third level of factor 2.

The first model is

    Y(ijk) = u(ij) + E(ijk)
This model decomposes the response into a mean for each cell and an error term. The analysis of variance provides estimates for each cell mean. These cell means are the predicted values of the model, and the differences between the response variable and the estimated cell means are the residuals. That is,
    YHAT(ijk) = uhat(ij)

    R(ijk) = Y(ijk) - uhat(ij)

The second model is

    Y(ijk) = u + alpha(i) + beta(j) + E(ijk)
This model decomposes the response into an overall (grand) mean, factor effects (alpha(i) and beta(j) represent the effects of the ith level of the first factor and the jth level of the second factor, respectively), and an error term. The analysis of variance provides estimates of the grand mean and the factor effects. The predicted values and the residuals of the model are
    YHAT(ijk) = uhat + alphahat(i) + betahat(j)

    R(ijk) = Y(ijk) - uhat - alphahat(i) - betahat(j)

The distinction between these models is that the second model divides the cell mean into an overall mean and factor effects. This second model makes the factor effect more explicit, so we will emphasize this approach.
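
The two formulations can be compared on a small example. This is a sketch only: the data are hypothetical and were chosen to be exactly additive (no interaction between the factors), so the cell-mean predictions and the additive predictions coincide. With an interaction present, the additive predictions would generally differ from the cell means.

```python
from statistics import mean

# Hypothetical balanced 2x2 design with two observations per cell.
# Keys are (i, j): level of factor 1 and level of factor 2.
cells = {
    (1, 1): [10.0, 12.0],
    (1, 2): [14.0, 16.0],
    (2, 1): [20.0, 22.0],
    (2, 2): [24.0, 26.0],
}

# First model: Y(ijk) = u(ij) + E(ijk). The estimate uhat(ij) is the cell mean.
cell_mean = {ij: mean(ys) for ij, ys in cells.items()}
resid_cell = {ij: [y - cell_mean[ij] for y in ys] for ij, ys in cells.items()}

# Second model: Y(ijk) = u + alpha(i) + beta(j) + E(ijk).
grand = mean(y for ys in cells.values() for y in ys)
levels1 = sorted({i for i, _ in cells})
levels2 = sorted({j for _, j in cells})

# Each effect is the mean at that level minus the grand mean (balanced case).
alpha = {
    i: mean(y for (a, _), ys in cells.items() if a == i for y in ys) - grand
    for i in levels1
}
beta = {
    j: mean(y for (_, b), ys in cells.items() if b == j for y in ys) - grand
    for j in levels2
}

pred_additive = {(i, j): grand + alpha[i] + beta[j] for (i, j) in cells}
resid_additive = {
    ij: [y - pred_additive[ij] for y in ys] for ij, ys in cells.items()
}

print("grand mean:", grand)
print("alpha:", alpha)
print("beta:", beta)
print("cell means:", cell_mean)
print("additive predictions:", pred_additive)
```

Estimating effects as level means minus the grand mean is appropriate for a balanced design such as this one; an unbalanced design would call for a least squares fit instead.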

Model Validation
Note that the ANOVA model assumes that the error term, E(ijk), follows the assumptions for a univariate measurement process. That is, after performing an analysis of variance, the model should be validated by analyzing the residuals.
Sample Output
Dataplot generated the following ANOVA output for the JAHANMI2.DAT data set:
  
                 **********************************
                 **********************************
                 **  4-WAY ANALYSIS OF VARIANCE  **
                 **********************************
                 **********************************
  
       NUMBER OF OBSERVATIONS           =      480
       NUMBER OF FACTORS                =        4
       NUMBER OF LEVELS FOR FACTOR  1  =        2
       NUMBER OF LEVELS FOR FACTOR  2  =        2
       NUMBER OF LEVELS FOR FACTOR  3  =        2
       NUMBER OF LEVELS FOR FACTOR  4  =        2
       BALANCED CASE
       RESIDUAL    STANDARD DEVIATION   =    0.63057727814E+02
       RESIDUAL    DEGREES OF FREEDOM   =      475
       REPLICATION CASE
       REPLICATION STANDARD DEVIATION   =    0.61890106201E+02
       REPLICATION DEGREES OF FREEDOM   =      464
       NUMBER OF DISTINCT CELLS         =       16
  
                          *****************
                          *  ANOVA TABLE  *
                          *****************
  
 SOURCE              DF SUM OF SQUARES    MEAN SQUARE   F STATISTIC    F CDF SIG
 -------------------------------------------------------------------------------
 TOTAL (CORRECTED)  479 2668446.000000    5570.868652
 -------------------------------------------------------------------------------
 FACTOR  1            1   26672.726562   26672.726562        6.7080  99.011%  **
 FACTOR  2            1   11524.053711   11524.053711        2.8982  91.067%
 FACTOR  3            1   14380.633789   14380.633789        3.6166  94.219%
 FACTOR  4            1  727143.125000  727143.125000      182.8703 100.000%  **
 -------------------------------------------------------------------------------
 RESIDUAL           475 1888731.500000    3976.276855
  
       RESIDUAL    STANDARD DEVIATION =       63.05772781
       RESIDUAL    DEGREES OF FREEDOM =           475
       REPLICATION STANDARD DEVIATION =       61.89010620
       REPLICATION DEGREES OF FREEDOM =           464
       LACK OF FIT F RATIO =      2.6447 = THE   99.7269% POINT OF THE
       F DISTRIBUTION WITH     11 AND    464 DEGREES OF FREEDOM
  
                          ****************
                          *  ESTIMATION  *
                          ****************
  
       GRAND MEAN                       =    0.65007739258E+03
       GRAND STANDARD DEVIATION         =    0.74638252258E+02
  
  
              LEVEL-ID      NI      MEAN      EFFECT     SD(EFFECT)
 --------------------------------------------------------------------
 FACTOR 1--   -1.00000    240.  657.53168    7.45428    2.87818
         --    1.00000    240.  642.62286   -7.45453    2.87818
 FACTOR 2--   -1.00000    240.  645.17755   -4.89984    2.87818
         --    1.00000    240.  654.97723    4.89984    2.87818
 FACTOR 3--   -1.00000    240.  655.55084    5.47345    2.87818
         --    1.00000    240.  644.60376   -5.47363    2.87818
 FACTOR 4--    1.00000    240.  688.99890   38.92151    2.87818
         --    2.00000    240.  611.15594  -38.92145    2.87818
  
  
         MODEL               RESIDUAL STANDARD DEVIATION
 -------------------------------------------------------
 CONSTANT             ONLY--       74.6382522583
 CONSTANT & FACTOR  1 ONLY--       74.3419036865
 CONSTANT & FACTOR  2 ONLY--       74.5548019409
 CONSTANT & FACTOR  3 ONLY--       74.5147094727
 CONSTANT & FACTOR  4 ONLY--       63.7284545898
 CONSTANT & ALL 4 FACTORS --       63.0577278137
      
Interpretation of Sample Output
The output is divided into three sections.
  1. The first section prints the number of observations (480), the number of factors (4), and the number of levels for each factor (2 levels for each factor). It also prints some overall summary statistics. In particular, the residual standard deviation is 63.058. The smaller the residual standard deviation, the more we have accounted for the variance in the data.

  2. The second section prints an ANOVA table. The ANOVA table decomposes the variance into the following component sums of squares:
    • Total sum of squares. The degrees of freedom for this entry is the number of observations minus one.
    • Sum of squares for each of the factors. The degrees of freedom for these entries are the number of levels for the factor minus one. The mean square is the sum of squares divided by the number of degrees of freedom.
    • Residual sum of squares. The degrees of freedom is the total degrees of freedom minus the sum of the factor degrees of freedom. The mean square is the sum of squares divided by the number of degrees of freedom.
    That is, the table summarizes how much of the variance in the data (total sum of squares) is accounted for by the factor effects (factor sums of squares) and how much is random error (residual sum of squares). Ideally, we would like most of the variance to be explained by the factor effects. The ANOVA table provides a formal F test for the factor effects. The F statistic is the mean square for the factor divided by the mean square for the error (the residual mean square). This statistic follows an F distribution with (k-1) numerator degrees of freedom, where k is the number of levels for the given factor, and denominator degrees of freedom equal to the residual degrees of freedom (1 and 475 for each factor here). If the F CDF column for the factor effect is greater than 95%, then the factor is significant at the 5% level. Here, we see that the size of the effect of factor 4 dominates the size of the other effects. The F test shows that factors one and four are significant at the 1% level while factors two and three are not significant at the 5% level.

  3. The third section is an estimation section. It prints an overall mean and overall standard deviation. Then for each level of each factor, it prints the number of observations, the mean of the observations at that level, the factor effect (alphahat(i) and betahat(j) in the above terminology), which is the level mean minus the grand mean, and the standard deviation of the factor effect. Finally, it prints the residual standard deviation for the various possible models. For the four-way ANOVA here, it prints the constant model

      Y(i) = A0 + E(i)

    a model with each factor individually, and the model with all four factors included.

    For these data, we see that including factor 4 has a large impact on the residual standard deviation: it drops from 74.64 for the constant-only model to 63.73 when only the factor 4 effect is included, compared with 63.058 when all four factors are included.
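
The arithmetic behind the ANOVA table and the residual standard deviations can be retraced from the printed sums of squares. The sketch below copies its inputs from the sample output; the variable names and the helper model_sd are illustrative assumptions, and small differences from the printed values are expected because the output appears to be rounded.

```python
import math

# Values copied from the sample ANOVA table.
n_obs = 480
total_ss = 2668446.0
resid_ss = 1888731.5
factor_ss = {1: 26672.726562, 2: 11524.053711, 3: 14380.633789, 4: 727143.125}
levels = {1: 2, 2: 2, 3: 2, 4: 2}

# Degrees of freedom: N - 1 in total, (levels - 1) per factor,
# and the remainder for the residual.
total_df = n_obs - 1
factor_df = {f: k - 1 for f, k in levels.items()}
resid_df = total_df - sum(factor_df.values())

# Mean square = sum of squares / degrees of freedom.
resid_ms = resid_ss / resid_df

# F statistic = factor mean square / residual mean square.
f_stat = {f: (factor_ss[f] / factor_df[f]) / resid_ms for f in factor_ss}

def model_sd(included):
    """Residual SD for a model with a constant plus the listed factors.

    Subtracting factor sums of squares from the total relies on the
    design being balanced, as it is here.
    """
    ss = total_ss - sum(factor_ss[f] for f in included)
    df = total_df - sum(factor_df[f] for f in included)
    return math.sqrt(ss / df)

print("residual df:", resid_df)
for f in sorted(f_stat):
    print(f"factor {f}: F = {f_stat[f]:.4f}")
print("constant only:      ", round(model_sd([]), 3))
print("constant & factor 4:", round(model_sd([4]), 3))
print("all four factors:   ", round(model_sd([1, 2, 3, 4]), 3))
```

The F statistics and residual standard deviations computed this way should agree with the sample output to about three decimal places. The F CDF column is not reproduced here because it requires the F distribution function, which is not in the Python standard library.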

Output from other statistical software may look somewhat different from the above output.

In addition to the quantitative ANOVA output, it is recommended that any analysis of variance be complemented with model validation. At a minimum, this should include

  1. A run sequence plot of the residuals.
  2. A normal probability plot of the residuals.
  3. A scatter plot of the predicted values against the residuals.
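
The coordinates for these three plots can be prepared without any plotting library. The sketch below is illustrative only: the residuals and predicted values are made up, the function name normal_probability_points is hypothetical, and the plotting positions are assumed to be the Filliben approximation to the uniform order statistic medians.

```python
from statistics import NormalDist

# Hypothetical residuals and predicted values, listed in run order.
residuals = [-1.0, 1.0, 0.5, -0.5, 2.0, -2.0, 0.0, 0.0]
predicted = [11.0, 11.0, 15.0, 15.0, 21.0, 21.0, 25.0, 25.0]

# 1. Run sequence plot: residual against run order.
run_sequence = list(enumerate(residuals, start=1))

# 3. Scatter plot: residual against predicted value.
predicted_vs_residual = list(zip(predicted, residuals))

def normal_probability_points(values):
    """2. Normal probability plot: (normal quantile, ordered value) pairs.

    Assumes the Filliben approximation for the order statistic medians.
    """
    n = len(values)
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[-1] = 0.5 ** (1.0 / n)
    medians[0] = 1.0 - medians[-1]
    quantiles = [NormalDist().inv_cdf(p) for p in medians]
    return list(zip(quantiles, sorted(values)))

for x, y in normal_probability_points(residuals):
    print(f"{x:8.3f} {y:6.2f}")
```

Each list of pairs can be passed to any plotting tool. Roughly linear normal probability points, and no visible structure in the other two plots, would be consistent with the assumptions on the error term.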
Questions
The analysis of variance can be used to answer the following questions:
  1. Do any of the factors have a significant effect?
  2. Which is the most important factor?
  3. Can we account for most of the variability in the data?
Related Techniques
One-factor analysis of variance
Two-sample t-test
Box plot
Block plot
Dex mean plot
Case Study
The quantitative ANOVA approach can be contrasted with the more graphical EDA approach in the ceramic strength case study.
Software
Most general purpose statistical software programs, including Dataplot, can perform multi-factor analysis of variance.