1. Exploratory Data Analysis
1.4. EDA Case Studies
1.4.2. Case Studies
1.4.2.8. Heat Flow Meter 1

1.4.2.8.3. Quantitative Output and Interpretation

Summary Statistics

As a first step in the analysis, a table of summary statistics is computed from the data. The following table, generated by Dataplot, shows a typical set of statistics.
 
                                SUMMARY
 
                     NUMBER OF OBSERVATIONS =      195
 
 
***********************************************************************
*        LOCATION MEASURES         *       DISPERSION MEASURES        *
***********************************************************************
*  MIDRANGE     =   0.9262411E+01  *  RANGE        =   0.1311255E+00  *
*  MEAN         =   0.9261460E+01  *  STAND. DEV.  =   0.2278881E-01  *
*  MIDMEAN      =   0.9259412E+01  *  AV. AB. DEV. =   0.1788945E-01  *
*  MEDIAN       =   0.9261952E+01  *  MINIMUM      =   0.9196848E+01  *
*               =                  *  LOWER QUART. =   0.9246826E+01  *
*               =                  *  LOWER HINGE  =   0.9246496E+01  *
*               =                  *  UPPER HINGE  =   0.9275530E+01  *
*               =                  *  UPPER QUART. =   0.9275708E+01  *
*               =                  *  MAXIMUM      =   0.9327973E+01  *
***********************************************************************
*       RANDOMNESS MEASURES        *     DISTRIBUTIONAL MEASURES      *
***********************************************************************
*  AUTOCO COEF  =   0.2805789E+00  *  ST. 3RD MOM. =  -0.8537455E-02  *
*               =   0.0000000E+00  *  ST. 4TH MOM. =   0.3049067E+01  *
*               =   0.0000000E+00  *  ST. WILK-SHA =   0.9458605E+01  *
*               =                  *  UNIFORM PPCC =   0.9735289E+00  *
*               =                  *  NORMAL  PPCC =   0.9989640E+00  *
*               =                  *  TUK -.5 PPCC =   0.8927904E+00  *
*               =                  *  CAUCHY  PPCC =   0.6360204E+00  *
***********************************************************************
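
As a rough illustration of how a few of the location and dispersion measures in this table are defined, the following Python sketch computes them with the standard library. The function name `summarize` is hypothetical, and it is an assumption here that the average absolute deviation is taken about the mean; Dataplot's exact definitions may differ.

```python
import statistics


def summarize(data):
    """Return a few location and dispersion measures for a sequence of numbers."""
    lo, hi = min(data), max(data)
    mean = statistics.fmean(data)
    return {
        "midrange": (lo + hi) / 2,
        "mean": mean,
        "median": statistics.median(data),
        "range": hi - lo,
        "stand_dev": statistics.stdev(data),  # sample SD, divisor N - 1
        # Assumption: average absolute deviation is taken about the mean.
        "av_ab_dev": sum(abs(y - mean) for y in data) / len(data),
    }


# Consistency check that uses only values quoted in the table above.
minimum, maximum = 9.196848, 9.327973
midrange = (minimum + maximum) / 2   # compare with MIDRANGE in the table
data_range = maximum - minimum       # compare with RANGE in the table
print(midrange, data_range)
```

The midrange and range recomputed from the quoted minimum and maximum agree with the tabulated values to within rounding of the inputs.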
 
Location

One way to quantify a change in location over time is to fit a straight line to the data set using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no significant drift in the location, the slope parameter estimate should not differ significantly from zero. For this data set, Dataplot generates the following output:
 LEAST SQUARES MULTILINEAR FIT
       SAMPLE SIZE N       =      195
       NUMBER OF VARIABLES =        1
       NO REPLICATION CASE
  
  
               PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
        1  A0                   9.26699       (0.3253E-02)        2849.
        2  A1       X         -0.564115E-04   (0.2878E-04)       -1.960
  
       RESIDUAL    STANDARD DEVIATION =        0.2262372E-01
       RESIDUAL    DEGREES OF FREEDOM =         193
The slope parameter, A1, has a t value of -1.96, which is (barely) statistically significant since it is essentially equal to the 95% level cutoff of -1.96. However, notice that the value of the slope parameter estimate is only -0.0000564. Over the 195 observations this amounts to a fitted change of about -0.011, roughly half of the residual standard deviation of 0.0226. This slope, even though statistically significant, can essentially be considered zero.
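
The drift check can be sketched in Python as an ordinary least-squares fit of the data against the index 1, 2, ..., N. The function name `drift_fit` is hypothetical, and the sketch is not Dataplot's implementation. The last lines only redo the arithmetic on the estimates quoted in the output above.

```python
import math


def drift_fit(y):
    """Fit y = a0 + a1 * x with x = 1..N; return (a0, a1, se_a1, t_a1)."""
    n = len(y)
    x = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_mean - a1 * x_mean
    rss = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(rss / (n - 2))   # residual SD with N - 2 degrees of freedom
    se_a1 = resid_sd / math.sqrt(sxx)     # standard error of the slope
    return a0, a1, se_a1, a1 / se_a1


# Arithmetic on the quoted estimates (no raw data needed).
slope, slope_se, n_obs = -0.564115e-04, 0.2878e-04, 195
t_value = slope / slope_se            # close to -1.96
total_drift = slope * (n_obs - 1)     # fitted change from first to last point
print(t_value, total_drift)
```

The function needs data with some scatter about the fitted line; a perfectly linear input gives a zero standard error and the t value is then undefined.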
Variation

One simple way to detect a change in variation is with a Bartlett test after dividing the data set into several equal-sized intervals. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for the Bartlett test.
               BARTLETT TEST
           (STANDARD DEFINITION)
 NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
  
 TEST:
    DEGREES OF FREEDOM          =    3.000000
  
    TEST STATISTIC VALUE        =    3.147338
    CUTOFF: 95% PERCENT POINT   =    7.814727
    CUTOFF: 99% PERCENT POINT   =    11.34487
  
    CHI-SQUARE CDF VALUE        =    0.630538
  
   NULL          NULL HYPOTHESIS        NULL HYPOTHESIS
   HYPOTHESIS    ACCEPTANCE INTERVAL    CONCLUSION
 ALL SIGMA EQUAL    (0.000,0.950)         ACCEPT
  
In this case, since the Bartlett test statistic of 3.15 is less than the critical value at the 5% significance level of 7.81, we conclude that the standard deviations are not significantly different in the 4 intervals. That is, there is no evidence against the assumption of constant scale.
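
For readers who want to see how the statistic is built, here is a minimal Python sketch of the standard Bartlett statistic. The helper `split_into_intervals` is hypothetical: how Dataplot divides 195 points into four intervals is not stated, so nearly equal contiguous chunks are assumed. The 95% cutoff is copied from the output above rather than computed.

```python
import math
import statistics


def split_into_intervals(data, k):
    """Split data into k contiguous chunks of nearly equal size (assumed scheme)."""
    n = len(data)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(k)]


def bartlett_statistic(groups):
    """Bartlett's test statistic for equal variances across groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [statistics.variance(g) for g in groups]
    # Pooled variance, weighted by each group's degrees of freedom.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction


CUTOFF_95 = 7.814727  # chi-square 95% point for 3 degrees of freedom, from the output


def variances_differ(data, k=4, cutoff=CUTOFF_95):
    """True if the statistic exceeds the cutoff (the cutoff must match k - 1 df)."""
    return bartlett_statistic(split_into_intervals(data, k)) > cutoff
```

Under the null hypothesis the statistic is approximately chi-square with k - 1 degrees of freedom, which is why the quoted cutoff applies only when k = 4.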
Randomness

There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the previous section is a simple graphical technique.

Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.

(Figure: autocorrelation plot)

The lag 1 autocorrelation, which is generally the one of greatest interest, is 0.281. With N = 195, the critical values at the 5% significance level are approximately -0.140 and 0.140 (that is, plus or minus 1.96 divided by the square root of N). This indicates that the lag 1 autocorrelation is statistically significant, so there is evidence of non-randomness.
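
A minimal Python sketch of the lag autocorrelation and of the usual large-sample significance band follows. The function names are hypothetical, and the band uses the common white-noise approximation of a normal quantile divided by the square root of N, which is assumed rather than confirmed to be what Dataplot plots.

```python
import math
from statistics import NormalDist, fmean


def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    mean = fmean(y)
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i + lag] for i in range(len(y) - lag))
    return num / denom


def white_noise_band(n, confidence=0.95):
    """Half-width of the approximate band for a random series of length n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z / math.sqrt(n)


# Compare the quoted lag 1 autocorrelation with the band for N = 195.
band = white_noise_band(195)
print(0.281 > band)
```

The comparison on the last line uses only the quoted autocorrelation and the sample size, so it can be checked without the raw data.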

A common test for randomness is the runs test.

 
                      RUNS UP
 
           STATISTIC = NUMBER OF RUNS UP
               OF LENGTH EXACTLY I
 
   I         STAT     EXP(STAT)    SD(STAT)       Z
 
   1        35.0     40.6667      6.4079       -0.88
   2         8.0     17.7583      3.3021       -2.96
   3        12.0      5.0806      2.0096        3.44
   4         3.0      1.1014      1.0154        1.87
   5         0.0      0.1936      0.4367       -0.44
   6         0.0      0.0287      0.1692       -0.17
   7         0.0      0.0037      0.0607       -0.06
   8         0.0      0.0004      0.0204       -0.02
   9         0.0      0.0000      0.0065       -0.01
  10         0.0      0.0000      0.0020        0.00
 
 
           STATISTIC = NUMBER OF RUNS UP
               OF LENGTH I OR MORE
 
   I         STAT     EXP(STAT)    SD(STAT)       Z
 
   1        58.0     64.8333      4.1439       -1.65
   2        23.0     24.1667      2.7729       -0.42
   3        15.0      6.4083      2.1363        4.02
   4         3.0      1.3278      1.1043        1.51
   5         0.0      0.2264      0.4716       -0.48
   6         0.0      0.0328      0.1809       -0.18
   7         0.0      0.0041      0.0644       -0.06
   8         0.0      0.0005      0.0215       -0.02
   9         0.0      0.0000      0.0068       -0.01
  10         0.0      0.0000      0.0021        0.00
 
 
                     RUNS DOWN
 
           STATISTIC = NUMBER OF RUNS DOWN
               OF LENGTH EXACTLY I
 
   I         STAT     EXP(STAT)    SD(STAT)       Z
 
   1        33.0     40.6667      6.4079       -1.20
   2        18.0     17.7583      3.3021        0.07
   3         3.0      5.0806      2.0096       -1.04
   4         3.0      1.1014      1.0154        1.87
   5         1.0      0.1936      0.4367        1.85
   6         0.0      0.0287      0.1692       -0.17
   7         0.0      0.0037      0.0607       -0.06
   8         0.0      0.0004      0.0204       -0.02
   9         0.0      0.0000      0.0065       -0.01
  10         0.0      0.0000      0.0020        0.00
 
 
           STATISTIC = NUMBER OF RUNS DOWN
               OF LENGTH I OR MORE
 
 
   I         STAT     EXP(STAT)    SD(STAT)       Z
 
   1        58.0     64.8333      4.1439       -1.65
   2        25.0     24.1667      2.7729        0.30
   3         7.0      6.4083      2.1363        0.28
   4         4.0      1.3278      1.1043        2.42
   5         1.0      0.2264      0.4716        1.64
   6         0.0      0.0328      0.1809       -0.18
   7         0.0      0.0041      0.0644       -0.06
   8         0.0      0.0005      0.0215       -0.02
   9         0.0      0.0000      0.0068       -0.01
  10         0.0      0.0000      0.0021        0.00
 
 
           RUNS TOTAL = RUNS UP + RUNS DOWN
 
         STATISTIC = NUMBER OF RUNS TOTAL
              OF LENGTH EXACTLY I
 
   I         STAT     EXP(STAT)    SD(STAT)       Z
 
   1        68.0     81.3333      9.0621       -1.47
   2        26.0     35.5167      4.6698       -2.04
   3        15.0     10.1611      2.8420        1.70
   4         6.0      2.2028      1.4360        2.64
   5         1.0      0.3871      0.6176        0.99
   6         0.0      0.0574      0.2392       -0.24
   7         0.0      0.0074      0.0858       -0.09
   8         0.0      0.0008      0.0289       -0.03
   9         0.0      0.0001      0.0092       -0.01
  10         0.0      0.0000      0.0028        0.00
 
 
         STATISTIC = NUMBER OF RUNS TOTAL
               OF LENGTH I OR MORE
 
   I         STAT     EXP(STAT)    SD(STAT)       Z
 
   1       116.0    129.6667      5.8604       -2.33
   2        48.0     48.3333      3.9215       -0.09
   3        22.0     12.8167      3.0213        3.04
   4         7.0      2.6556      1.5617        2.78
   5         1.0      0.4528      0.6669        0.82
   6         0.0      0.0657      0.2559       -0.26
   7         0.0      0.0083      0.0911       -0.09
   8         0.0      0.0009      0.0305       -0.03
   9         0.0      0.0001      0.0097       -0.01
  10         0.0      0.0000      0.0029        0.00
 
 
          LENGTH OF THE LONGEST RUN UP         =     4
          LENGTH OF THE LONGEST RUN DOWN       =     5
          LENGTH OF THE LONGEST RUN UP OR DOWN =     5
 
          NUMBER OF POSITIVE DIFFERENCES =    98
          NUMBER OF NEGATIVE DIFFERENCES =    95
          NUMBER OF ZERO     DIFFERENCES =     1
  
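Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Several entries exceed this threshold, for example the number of runs up of length exactly 3 (Z = 3.44) and the total number of runs of length 1 or more (Z = -2.33), so the runs test does indicate some non-randomness.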
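
The "runs total, length 1 or more" row can be reproduced from the sample size alone. The sketch below counts runs up and down and applies the standard large-sample mean and variance for the total number of runs. The function names are hypothetical, and dropping zero differences before counting is an assumption, since Dataplot's handling of ties is not stated here.

```python
import math


def count_runs(y):
    """Total number of runs up and down in the sequence y."""
    # Sign of each successive difference; ties are dropped (assumed handling).
    signs = [1 if b > a else -1 for a, b in zip(y, y[1:]) if b != a]
    if not signs:
        return 0
    return 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def total_runs_z(n_runs, n):
    """Expected value, SD and Z score for the total number of runs up and down."""
    expected = (2 * n - 1) / 3
    sd = math.sqrt((16 * n - 29) / 90)
    return expected, sd, (n_runs - expected) / sd


# Values from the output: 116 runs in total, N = 195.
expected, sd, z = total_runs_z(116, 195)
print(round(expected, 4), round(sd, 4), round(z, 2))
```

With 116 runs and N = 195, the expected value, standard deviation and Z score match the first row of the final table above (129.6667, 5.8604 and -2.33).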

Although the autocorrelation plot and the runs test indicate some mild non-randomness, the violation of the randomness assumption is not serious enough to warrant developing a more sophisticated model. It is common in practice that some of the assumptions are mildly violated and it is a judgement call as to whether or not the violations are serious enough to warrant developing a more sophisticated model for the data.

Distributional Analysis

Probability plots are a graphical test for assessing if a particular distribution provides an adequate fit to a data set.

A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient is 0.999. Since this is greater than the critical value of 0.987 (this is a tabulated value), the normality assumption is not rejected.
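
One way to compute a normal probability plot correlation coefficient is sketched below. It correlates the sorted data with normal quantiles evaluated at Filliben's approximate order statistic medians. The function name `normal_ppcc` is hypothetical, and the choice of plotting positions is an assumption that may not match Dataplot exactly.

```python
import math
from statistics import NormalDist, fmean


def normal_ppcc(data):
    """Correlation between sorted data and approximate normal order statistic medians."""
    n = len(data)  # requires n >= 2
    # Filliben's approximation to the medians of uniform order statistics.
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1 / n)
    m[0] = 1 - m[-1]
    q = [NormalDist().inv_cdf(p) for p in m]
    y = sorted(data)
    q_mean, y_mean = fmean(q), fmean(y)
    sxy = sum((a - q_mean) * (b - y_mean) for a, b in zip(q, y))
    sxx = sum((a - q_mean) ** 2 for a in q)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

The resulting coefficient is compared with a tabulated critical value that depends on the sample size, so the 0.987 quoted above applies only to N = 195.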

Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used to test for normality. Dataplot generates the following output for the Anderson-Darling normality test.

  
               ANDERSON-DARLING 1-SAMPLE TEST
               THAT THE DATA CAME FROM A NORMAL DISTRIBUTION
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS                =      195
       MEAN                                  =    9.261460
       STANDARD DEVIATION                    =   0.2278881E-01
  
       ANDERSON-DARLING TEST STATISTIC VALUE =   0.1264954
       ADJUSTED TEST STATISTIC VALUE         =   0.1290070
  
 2. CRITICAL VALUES:
       90         % POINT    =   0.6560000
       95         % POINT    =   0.7870000
       97.5       % POINT    =   0.9180000
       99         % POINT    =    1.092000
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THE DATA DO COME FROM A NORMAL DISTRIBUTION.
  
The Anderson-Darling test also does not reject the normality assumption because the adjusted test statistic, 0.129, is less than the critical value at the 5% significance level of 0.787.
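
The Anderson-Darling statistic for normality, with mean and standard deviation estimated from the data, can be sketched as follows. The function names are hypothetical. The small-sample factor 1 + 4/N - 25/N^2 is inferred from the two statistic values in the output above (it reproduces the adjusted value from the unadjusted one) and is not taken from Dataplot documentation.

```python
import math
from statistics import NormalDist, fmean, stdev


def anderson_darling_normal(data):
    """Unadjusted A-squared statistic for a normal fit with estimated parameters."""
    n = len(data)
    mean, sd = fmean(data), stdev(data)
    std_normal = NormalDist()
    cdf = [std_normal.cdf((v - mean) / sd) for v in sorted(data)]
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


def small_sample_adjust(a2, n):
    """Assumed adjustment factor, inferred from the quoted output values."""
    return a2 * (1 + 4 / n - 25 / n ** 2)


# Reproduce the adjusted value from the unadjusted one quoted in the output.
adjusted = small_sample_adjust(0.1264954, 195)
print(round(adjusted, 4), adjusted < 0.787)
```

Extreme outliers can push a fitted probability to exactly 0 or 1 in floating point, in which case the logarithm fails; a production version would need to guard against that.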
Outlier Analysis

Grubbs' test is a test for outliers that assumes the data are approximately normal. Dataplot generated the following output for Grubbs' test.
  
               GRUBBS TEST FOR OUTLIERS
               (ASSUMPTION: NORMALITY)
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS      =      195
       MINIMUM                     =    9.196848
       MEAN                        =    9.261460
       MAXIMUM                     =    9.327973
       STANDARD DEVIATION          =   0.2278881E-01
  
       GRUBBS TEST STATISTIC       =    2.918673
  
 2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
    FOR GRUBBS TEST STATISTIC
       0          % POINT    =    0.000000
       50         % POINT    =    2.984294
       75         % POINT    =    3.181226
       90         % POINT    =    3.424672
       95         % POINT    =    3.597898
       97.5       % POINT    =    3.763061
       99         % POINT    =    3.970215
       100        % POINT    =    13.89263
 
 3. CONCLUSION (AT THE 5% LEVEL):
       THERE ARE NO OUTLIERS.
  
For this data set, Grubbs' test does not detect any outliers at the 25%, 10%, 5%, and 1% significance levels.
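
The Grubbs statistic is the largest absolute deviation from the sample mean, divided by the sample standard deviation. The sketch below defines it for raw data (the function name is hypothetical) and then recomputes it from the minimum, mean, maximum and standard deviation quoted in the output, comparing it with the quoted 95% point.

```python
from statistics import fmean, stdev


def grubbs_statistic(data):
    """Largest absolute deviation from the mean, in units of the sample SD."""
    mean, sd = fmean(data), stdev(data)
    return max(abs(v - mean) for v in data) / sd


# Recompute from the summary values quoted in the output above.
minimum, mean, maximum, sd = 9.196848, 9.261460, 9.327973, 0.02278881
g_from_summary = max(maximum - mean, mean - minimum) / sd
CUTOFF_95 = 3.597898  # 95% point of the reference distribution, from the output
print(round(g_from_summary, 3), g_from_summary < CUTOFF_95)  # prints 2.919 True
```

The maximum lies slightly farther from the mean than the minimum does, so it is the maximum that determines the statistic here.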
Model

Since the underlying assumptions were validated both graphically and analytically, with a mild violation of the randomness assumption, we conclude that a reasonable model for the data is the constant model Y(i) = C + E(i), with C estimated by the sample mean:
    Y(i) = 9.26146 + E(i)
We can express the uncertainty for C, here estimated by 9.26146, as the 95% confidence interval (9.258242,9.264679).
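
The confidence interval can be checked from the summary statistics alone, as mean plus or minus t times s divided by the square root of N. In the sketch below the t quantile for 194 degrees of freedom is an approximate tabulated value entered by hand (an assumption, since the Python standard library has no Student t quantile function).

```python
import math

# Summary values quoted earlier in this section.
n, mean, sd = 195, 9.261460, 0.02278881

# Assumption: approximate 0.975 quantile of Student's t with 194 degrees of
# freedom, taken from standard t tables.
T_975_194 = 1.9723

se_mean = sd / math.sqrt(n)      # standard deviation of the mean
half_width = T_975_194 * se_mean
ci = (mean - half_width, mean + half_width)
print(round(se_mean, 6), tuple(round(v, 6) for v in ci))
```

The standard deviation of the mean comes out at about 0.001632, and the interval endpoints agree with the quoted ones to about one unit in the sixth decimal place, the residual difference being rounding in the inputs and in the t quantile.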
Univariate Report

It is sometimes useful and convenient to summarize the above results in a report. The report for the heat flow meter data follows.
  
 Analysis for heat flow meter data
  
 1: Sample Size                           = 195
  
 2: Location
    Mean                                  = 9.26146
    Standard Deviation of Mean            = 0.001632
    95% Confidence Interval for Mean      = (9.258242,9.264679)
    Drift with respect to location?       = NO
  
 3: Variation
    Standard Deviation                    = 0.022789
    95% Confidence Interval for SD        = (0.02073,0.025307)
    Drift with respect to variation?
    (based on Bartlett's test on quarters
    of the data)                          = NO
  
 4: Randomness
    Autocorrelation                       = 0.280579
    Data are Random?
      (as measured by autocorrelation)    = NO
  
 5: Distribution
    Normal PPCC                           = 0.998965
    Data are Normal?
      (as measured by Normal PPCC)        = YES
  
 6: Statistical Control
    (i.e., no drift in location or scale,
    data are random, distribution is 
    fixed, here we are testing only for
    fixed normal)
    Data Set is in Statistical Control?   = YES
  
 7: Outliers?
    (as determined by Grubbs' test)       = NO
  