
1. Exploratory Data Analysis
1.3. EDA Techniques
1.3.5. Quantitative Techniques

1.3.5.4. One-Factor ANOVA

Purpose:
Test for Equal Means Across Groups
One-factor analysis of variance (Snedecor and Cochran, 1989) is a special case of analysis of variance (ANOVA) for a single factor of interest, and a generalization of the two-sample t-test. The two-sample t-test is used to decide whether two groups (levels) of a factor have the same mean. One-way analysis of variance generalizes this to k levels, where k, the number of levels, is greater than or equal to 2.

For example, data collected on, say, five instruments have one factor (instruments) at five levels. The ANOVA tests whether instruments have a significant effect on the results.
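The link to the two-sample t-test can be checked numerically. The following is a minimal sketch, assuming the equal-variance (pooled) form of the t-test and two small hypothetical groups that are not taken from any data set in this handbook. With k = 2 levels, the one-way F statistic equals the square of the pooled t statistic. The function names `pooled_t` and `one_way_f` are illustrative only.

```python
from statistics import mean


def pooled_t(a, b):
    """Two-sample t statistic with a pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((y - ma) ** 2 for y in a)
    ss_b = sum((y - mb) ** 2 for y in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups (levels)."""
    all_obs = [y for g in groups for y in g]
    n, k = len(all_obs), len(groups)
    grand_mean = mean(all_obs)
    # Between-level (factor) and within-level (residual) sums of squares.
    ss_factor = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_residual = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_factor / (k - 1)) / (ss_residual / (n - k))


# Hypothetical data: two levels, three observations each.
level_1 = [10.0, 12.0, 11.0]
level_2 = [14.0, 15.0, 16.0]

t_stat = pooled_t(level_1, level_2)
f_stat = one_way_f([level_1, level_2])
print(t_stat ** 2, f_stat)  # the two values agree up to rounding error
```

Passing three or more groups to `one_way_f` gives the general k-level case, for which no single t statistic exists.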

Definition The Product and Process Comparisons chapter (chapter 7) contains a more extensive discussion of 1-factor ANOVA, including the details for the mathematical computations of one-way analysis of variance.

The model for the analysis of variance can be stated in two mathematically equivalent ways. In the following discussion, each level of each factor is called a cell. For the one-way case, a cell and a level are equivalent since there is only one factor. In the following, the subscript i refers to the level and the subscript j refers to the observation within a level. For example, Y(23) refers to the third observation in the second level.

The first model is

    Y(ij) = u(i) + E(ij)

This model decomposes the response into a mean for each cell and an error term. The analysis of variance provides estimates for each cell mean. These estimated cell means are the predicted values of the model, and the differences between the response variable and the estimated cell means are the residuals. That is,

    YHAT(ij) = uhat(i)

    R(ij) = Y(ij) - uhat(i)
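As an illustration of the first model, the sketch below estimates the cell means and forms the predicted values and residuals. The data are hypothetical (three levels, three observations each) and the variable names are chosen only for this example.

```python
from statistics import mean

# Hypothetical data: level i -> observations Y(i1), Y(i2), Y(i3).
groups = {
    1: [10.0, 12.0, 11.0],
    2: [14.0, 15.0, 16.0],
    3: [10.0, 9.0, 11.0],
}

# uhat(i): the estimated mean of each cell (level).
cell_means = {level: mean(obs) for level, obs in groups.items()}

# YHAT(ij) = uhat(i): every observation in a level is predicted by its cell mean.
predicted = {level: [cell_means[level]] * len(obs) for level, obs in groups.items()}

# R(ij) = Y(ij) - uhat(i): residuals about the cell mean.
residuals = {
    level: [y - cell_means[level] for y in obs] for level, obs in groups.items()
}

for level in groups:
    print(level, cell_means[level], residuals[level])
```

By construction, the residuals within each level sum to zero.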

The second model is

    Y(ij) = u + alpha(i) + E(ij)

This model decomposes the response into an overall (grand) mean, the effect of the ith factor level, and an error term. The analysis of variance provides estimates of the grand mean and the effect of the ith factor level. The predicted values and the residuals of the model are

    YHAT(ij) = uhat + alphahat(i)

    R(ij) = Y(ij) - uhat - alphahat(i)
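The second model can be sketched in the same way. The example below assumes a balanced design (equal numbers of observations per level), estimates the grand mean as the mean of all observations, and takes each effect as the level mean minus the grand mean. The data are hypothetical and the names are illustrative.

```python
from statistics import mean

# Hypothetical balanced data: level i -> observations.
groups = {
    1: [10.0, 12.0, 11.0],
    2: [14.0, 15.0, 16.0],
    3: [10.0, 9.0, 11.0],
}

# uhat: the grand mean over all observations.
all_obs = [y for obs in groups.values() for y in obs]
grand_mean = mean(all_obs)

# alphahat(i): level mean minus grand mean.
effects = {level: mean(obs) - grand_mean for level, obs in groups.items()}

# YHAT(ij) = uhat + alphahat(i), and R(ij) = Y(ij) - uhat - alphahat(i).
predicted = {level: grand_mean + effects[level] for level in groups}
residuals = {
    level: [y - grand_mean - effects[level] for y in obs]
    for level, obs in groups.items()
}

print(grand_mean, effects)
```

Because the grand mean plus the effect equals the cell mean, the predicted values and residuals are the same as those of the first model; only the parameterization differs. In the balanced case the estimated effects sum to zero.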

The distinction between these models is that the second model divides the cell mean into an overall mean and the effect of the ith factor level. This second model makes the factor effect more explicit, so we will emphasize this approach.

Model Validation Note that the ANOVA model assumes that the error term, E(ij), follows the assumptions for a univariate measurement process. Therefore, after performing an analysis of variance, the model should be validated by analyzing the residuals.
Sample Output
Dataplot generated the following output for the one-way analysis of variance from the GEAR.DAT data set.
  
       NUMBER OF OBSERVATIONS           =      100
       NUMBER OF FACTORS                =        1
       NUMBER OF LEVELS FOR FACTOR  1  =       10
       BALANCED CASE
       RESIDUAL    STANDARD DEVIATION   =    0.59385783970E-02
       RESIDUAL    DEGREES OF FREEDOM   =       90
       REPLICATION CASE
       REPLICATION STANDARD DEVIATION   =    0.59385774657E-02
       REPLICATION DEGREES OF FREEDOM   =       90
       NUMBER OF DISTINCT CELLS         =       10
  
                          *****************
                          *  ANOVA TABLE  *
                          *****************
  
 SOURCE              DF SUM OF SQUARES    MEAN SQUARE   F STATISTIC    F CDF SIG
 -------------------------------------------------------------------------------
 TOTAL (CORRECTED)   99       0.003903       0.000039
 -------------------------------------------------------------------------------
 FACTOR  1            9       0.000729       0.000081        2.2969  97.734%   *
 -------------------------------------------------------------------------------
 RESIDUAL            90       0.003174       0.000035
  
       RESIDUAL    STANDARD DEVIATION =        0.00593857840
       RESIDUAL    DEGREES OF FREEDOM =            90
       REPLICATION STANDARD DEVIATION =        0.00593857747
       REPLICATION DEGREES OF FREEDOM =            90

                          ****************
                          *  ESTIMATION  *
                          ****************
  
       GRAND MEAN                       =    0.99764001369E+00
       GRAND STANDARD DEVIATION         =    0.62789078802E-02
  
  
              LEVEL-ID      NI      MEAN      EFFECT     SD(EFFECT)
 --------------------------------------------------------------------
 FACTOR 1--    1.00000     10.    0.99800    0.00036    0.00178
         --    2.00000     10.    0.99910    0.00146    0.00178
         --    3.00000     10.    0.99540   -0.00224    0.00178
         --    4.00000     10.    0.99820    0.00056    0.00178
         --    5.00000     10.    0.99190   -0.00574    0.00178
         --    6.00000     10.    0.99880    0.00116    0.00178
         --    7.00000     10.    1.00150    0.00386    0.00178
         --    8.00000     10.    1.00040    0.00276    0.00178
         --    9.00000     10.    0.99830    0.00066    0.00178
         --   10.00000     10.    0.99480   -0.00284    0.00178
  
  
         MODEL               RESIDUAL STANDARD DEVIATION
 -------------------------------------------------------
 CONSTANT             ONLY--        0.0062789079
 CONSTANT & FACTOR  1 ONLY--        0.0059385784
  
  

Interpretation of Sample Output The output is divided into three sections.
  1. The first section prints the number of observations (100), the number of factors (1), and the number of levels for each factor (10 levels for factor 1). It also prints some overall summary statistics. In particular, the residual standard deviation is 0.0059. The smaller the residual standard deviation, the more of the variance in the data we have accounted for.

  2. The second section prints an ANOVA table. The ANOVA table decomposes the variance into the following component sums of squares:
    • Total sum of squares. The degrees of freedom for this entry is the number of observations minus one.
    • Sum of squares for the factor. The degrees of freedom for this entry is the number of levels minus one. The mean square is the sum of squares divided by the number of degrees of freedom.
    • Residual sum of squares. The degrees of freedom is the total degrees of freedom minus the factor degrees of freedom. The mean square is the sum of squares divided by the number of degrees of freedom.
    That is, it summarizes how much of the variance in the data (total sum of squares) is accounted for by the factor effect (factor sum of squares) and how much is random error (residual sum of squares). Ideally, we would like most of the variance to be explained by the factor effect. The ANOVA table provides a formal F test for the factor effect. The F statistic is the mean square for the factor divided by the mean square for the error (the residual mean square). This statistic follows an F distribution with (k-1) and (N-k) degrees of freedom, where k is the number of levels and N is the total number of observations (9 and 90 degrees of freedom for these data). If the F CDF column for the factor effect is greater than 95%, then the factor is significant at the 5% level.

  3. The third section is the estimation section. It prints an overall mean and overall standard deviation. Then, for each level of each factor, it prints the number of observations, the mean for the observations of each cell (uhat(i) in the above terminology), the factor effect (alphahat(i) in the above terminology), and the standard deviation of the factor effect. Finally, it prints the residual standard deviation for the various possible models. For the one-way ANOVA, the two models are the constant model, i.e.,

      Y(ij) = A0 + E(ij)

    and the model with a factor effect

      Y(ij) = u + alpha(i) + E(ij)

    For these data, including the factor effect reduces the residual standard deviation from 0.00628 to 0.00594. That is, although the factor is statistically significant, it yields only a minimal improvement over a simple constant model. This is because the factor is just barely significant.
Output from other statistical software may look somewhat different from the above output.
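The arithmetic behind the ANOVA table and the model comparison can be retraced from the printed values. The sketch below uses the rounded sums of squares exactly as they appear in the sample output, so its results agree with the Dataplot figures only to a few significant digits. The formula used for the standard deviation of an effect is the standard balanced-design expression; it is an assumption here, since the output does not state how that column is computed.

```python
from math import sqrt

# Values copied from the sample output (rounded as printed).
n_obs, n_levels, n_per_level = 100, 10, 10
ss_total, ss_factor, ss_residual = 0.003903, 0.000729, 0.003174

# Degrees of freedom.
df_total = n_obs - 1                 # number of observations minus one
df_factor = n_levels - 1             # number of levels minus one
df_residual = df_total - df_factor   # total minus factor

# Mean squares and the F statistic.
ms_factor = ss_factor / df_factor
ms_residual = ss_residual / df_residual
f_statistic = ms_factor / ms_residual

# Residual standard deviations of the two candidate models.
sd_constant_only = sqrt(ss_total / df_total)  # constant model
sd_with_factor = sqrt(ms_residual)            # constant and factor 1

# Assumed balanced-design formula for the standard deviation of an effect.
sd_effect = sd_with_factor * sqrt(1 / n_per_level - 1 / n_obs)

print(f"F statistic          : {f_statistic:.4f}")
print(f"SD, constant only    : {sd_constant_only:.5f}")
print(f"SD, constant + factor: {sd_with_factor:.5f}")
print(f"SD(effect)           : {sd_effect:.5f}")
```

Comparing the two residual standard deviations shows how little the factor reduces the unexplained variation, even though the F test marks it as significant.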

In addition to the quantitative ANOVA output, it is recommended that any analysis of variance be complemented with model validation. At a minimum, this should include

  1. A run sequence plot of the residuals.
  2. A normal probability plot of the residuals.
  3. A scatter plot of the predicted values against the residuals.
Question The analysis of variance can be used to answer the following question:
  • Are means the same across groups in the data?
Importance The analysis of uncertainty depends on whether the factor significantly affects the outcome.
Related Techniques Two-sample t-test
Multi-factor analysis of variance
Regression
Box plot
Software Most general purpose statistical software programs, including Dataplot, can generate an analysis of variance.