1.4.2.4. Josephson Junction Cryothermometry

1.4.2.4.3. Quantitative Output and Interpretation

Summary Statistics

As a first step in the analysis, a table of summary statistics is computed from the data. The following table, generated by Dataplot, shows a typical set of statistics.
                                 SUMMARY
  
                      NUMBER OF OBSERVATIONS =      700
  
  
 ***********************************************************************
 *        LOCATION MEASURES         *       DISPERSION MEASURES        *
 ***********************************************************************
 *  MIDRANGE     =   0.2898500E+04  *  RANGE        =   0.7000000E+01  *
 *  MEAN         =   0.2898562E+04  *  STAND. DEV.  =   0.1304969E+01  *
 *  MIDMEAN      =   0.2898363E+04  *  AV. AB. DEV. =   0.1058571E+01  *
 *  MEDIAN       =   0.2899000E+04  *  MINIMUM      =   0.2895000E+04  *
 *               =                  *  LOWER QUART. =   0.2898000E+04  *
 *               =                  *  LOWER HINGE  =   0.2898000E+04  *
 *               =                  *  UPPER HINGE  =   0.2899000E+04  *
 *               =                  *  UPPER QUART. =   0.2899000E+04  *
 *               =                  *  MAXIMUM      =   0.2902000E+04  *
 ***********************************************************************
 *       RANDOMNESS MEASURES        *     DISTRIBUTIONAL MEASURES      *
 ***********************************************************************
 *  AUTOCO COEF  =   0.3148023E+00  *  ST. 3RD MOM. =   0.6580265E-02  *
 *               =   0.0000000E+00  *  ST. 4TH MOM. =   0.2825334E+01  *
 *               =   0.0000000E+00  *  ST. WILK-SHA =  -0.2272378E+02  *
 *               =                  *  UNIFORM PPCC =   0.9554127E+00  *
 *               =                  *  NORMAL  PPCC =   0.9748405E+00  *
 *               =                  *  TUK -.5 PPCC =   0.7935873E+00  *
 *               =                  *  CAUCHY  PPCC =   0.4231319E+00  *
 ***********************************************************************
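Several entries in this table are simple to reproduce by hand. The following is a minimal sketch in Python (standard library only) that computes a few of the location and dispersion measures for any list of readings. The function name `summary_measures` and the sample readings are illustrative assumptions, not part of Dataplot or of the case-study data, and the average absolute deviation is taken about the mean, which is an assumption about the definition used in the table.

```python
import statistics

def summary_measures(y):
    """Return a few of the location and dispersion measures from the table."""
    mean = statistics.fmean(y)
    return {
        "midrange": (min(y) + max(y)) / 2,
        "mean": mean,
        "median": statistics.median(y),
        "range": max(y) - min(y),
        # Sample standard deviation (divisor N - 1).
        "stand_dev": statistics.stdev(y),
        # Average absolute deviation, taken about the mean (assumption).
        "av_ab_dev": sum(abs(v - mean) for v in y) / len(y),
    }

# Made-up readings for illustration only (not the case-study data).
stats = summary_measures([2897, 2898, 2898, 2899, 2901])
```

As a consistency check on the table itself, the midrange 2898.5 is the average of the minimum 2895 and the maximum 2902, and the range 7 is their difference.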
Location

One way to quantify a change in location over time is to fit a straight line to the data set using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no significant drift in the location, the slope parameter should be zero. For this data set, Dataplot generates the following output:
 LEAST SQUARES MULTILINEAR FIT
       SAMPLE SIZE N       =      700
       NUMBER OF VARIABLES =        1
       NO REPLICATION CASE
  
  
               PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
        1  A0                   2898.19       (0.9745E-01)       0.2974E+05
        2  A1       X          0.107075E-02   (0.2409E-03)        4.445
  
       RESIDUAL    STANDARD DEVIATION =         1.287802
       RESIDUAL    DEGREES OF FREEDOM =         698
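The slope parameter, A1, has a t value of 4.445, which is statistically significant (the critical value at the 5% level is approximately 1.96). However, the estimated slope is only 0.0011 per observation, which amounts to a drift of about 0.75 units over the 700 observations, less than the residual standard deviation of 1.29. Given that the slope is nearly zero, the assumption of constant location is not seriously violated even though the drift is statistically significant.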
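The t value in the fit output is the slope estimate divided by its approximate standard deviation. The following sketch fits the straight line against the index variable and returns the slope, its standard error, and the t value. The function name `drift_fit` and the short series are illustrative assumptions; the last two lines only redo the arithmetic on the numbers printed in the Dataplot output.

```python
import math

def drift_fit(y):
    """Fit y = a0 + a1*x by least squares with x = 1..N; return (a1, se, t)."""
    n = len(y)
    x = range(1, n + 1)
    x_bar = (n + 1) / 2
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    sse = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))   # residual standard deviation
    se = resid_sd / math.sqrt(sxx)        # standard error of the slope
    return a1, se, a1 / se

# Made-up series for illustration only (not the case-study data).
a1, se, t = drift_fit([1, 3, 2, 5, 4])

# Arithmetic check on the values printed in the Dataplot output.
t_printed = 0.107075e-02 / 0.2409e-03       # slope / st. dev., about 4.445
drift_over_run = 0.107075e-02 * (700 - 1)   # fitted change from first to last point
```

The fitted change from the first to the last observation is about 0.75, which is small compared with the residual standard deviation of 1.287802.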
Variation

One simple way to detect a change in variation is with a Bartlett test after dividing the data set into several equal-sized intervals. However, the Bartlett test is not robust to non-normality. Since the nature of the data (a few distinct points repeated many times) makes the normality assumption questionable, we use the alternative Levene test. In particular, we use the Levene test based on the median rather than the mean. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for the Levene test.
               LEVENE F-TEST FOR SHIFT IN VARIATION
                      (ASSUMPTION: NORMALITY)
  
 1. STATISTICS
       NUMBER OF OBSERVATIONS    =      700
       NUMBER OF GROUPS          =        4
       LEVENE F TEST STATISTIC   =    1.432365
  
  
    FOR LEVENE TEST STATISTIC
       0          % POINT    =    0.000000
       50         % POINT    =   0.7894323
       75         % POINT    =    1.372513
       90         % POINT    =    2.091688
       95         % POINT    =    2.617726
       99         % POINT    =    3.809943
       99.9       % POINT    =    5.482234
  
  
          76.79006       % Point:     1.432365
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THERE IS NO SHIFT IN VARIATION.
       THUS THE GROUPS ARE HOMOGENEOUS WITH RESPECT TO VARIATION.
Since the Levene test statistic value of 1.43 is less than the 95% critical value of 2.62, we conclude that the standard deviations are not significantly different in the 4 intervals.
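The median-based Levene statistic can be sketched as follows: the series is cut into k contiguous intervals, each value is replaced by its absolute deviation from its interval median, and a one-way analysis of variance F statistic is computed on those deviations. The function names `split_into_intervals` and `levene_median` and the short series are illustrative assumptions; this is not Dataplot's implementation.

```python
import statistics

def split_into_intervals(y, k):
    """Split y into k contiguous intervals of (nearly) equal size."""
    n = len(y)
    return [y[i * n // k:(i + 1) * n // k] for i in range(k)]

def levene_median(groups):
    """Levene F statistic using absolute deviations from each group median."""
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(v - med) for v in g])
    n = sum(len(g) for g in z)
    k = len(z)
    z_bar = sum(sum(g) for g in z) / n
    group_means = [sum(g) / len(g) for g in z]
    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(g) * (m - z_bar) ** 2 for g, m in zip(z, group_means))
    within = sum((v - m) ** 2 for g, m in zip(z, group_means) for v in g)
    return (n - k) / (k - 1) * between / within

# Made-up series for illustration only, split into 2 intervals.
w = levene_median(split_into_intervals([1, 2, 3, 2, 4, 6], 2))
```

The resulting statistic is compared with the F distribution with k - 1 and N - k degrees of freedom, which is where the percent points in the Dataplot output come from.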
Randomness

There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the previous section is a simple graphical technique.

Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside these bands indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.

[Figure: autocorrelation plot]

The lag 1 autocorrelation, which is generally the one of most interest, is 0.31. The critical values at the 5% level of significance are -0.087 and 0.087. This indicates that the lag 1 autocorrelation is statistically significant, so there is some evidence for non-randomness.
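The lag k autocorrelation is the sum of products of deviations k steps apart, divided by the total sum of squared deviations. A minimal sketch follows; the function name `autocorrelation` and the short series are illustrative assumptions. The band computed on the last line uses the common large-sample approximation of 1.96 divided by the square root of N, which is an assumption and may not be the formula behind the critical values quoted above.

```python
import math

def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    y_bar = sum(y) / n
    denom = sum((v - y_bar) ** 2 for v in y)
    numer = sum((y[i] - y_bar) * (y[i + lag] - y_bar) for i in range(n - lag))
    return numer / denom

# Made-up series for illustration only (not the case-study data).
r1 = autocorrelation([1, 2, 3, 4, 5], lag=1)

# Large-sample 95% band for N = 700 under the approximation named above.
band = 1.96 / math.sqrt(700)
```

Under this approximation the band is roughly plus or minus 0.074. The lag 1 value of 0.31 lies well outside either band, so the conclusion is the same.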

A common test for randomness is the runs test.

                       RUNS UP
  
            STATISTIC = NUMBER OF RUNS UP
                OF LENGTH EXACTLY I
  
    I         STAT     EXP(STAT)    SD(STAT)       Z
  
    1       102.0    145.8750     12.1665       -3.61
    2        48.0     64.0500      6.2731       -2.56
    3        23.0     18.4069      3.8239        1.20
    4        11.0      4.0071      1.9366        3.61
    5         4.0      0.7071      0.8347        3.95
    6         2.0      0.1052      0.3240        5.85
    7         2.0      0.0136      0.1164       17.06
    8         0.0      0.0015      0.0393       -0.04
    9         0.0      0.0002      0.0125       -0.01
   10         0.0      0.0000      0.0038        0.00
  
  
            STATISTIC = NUMBER OF RUNS UP
                OF LENGTH I OR MORE
  
    I         STAT     EXP(STAT)    SD(STAT)       Z
  
    1       192.0    233.1667      7.8779       -5.23
    2        90.0     87.2917      5.2610        0.51
    3        42.0     23.2417      4.0657        4.61
    4        19.0      4.8347      2.1067        6.72
    5         8.0      0.8276      0.9016        7.96
    6         4.0      0.1205      0.3466       11.19
    7         2.0      0.0153      0.1236       16.06
    8         0.0      0.0017      0.0414       -0.04
    9         0.0      0.0002      0.0132       -0.01
   10         0.0      0.0000      0.0040        0.00
  
  
                      RUNS DOWN
  
            STATISTIC = NUMBER OF RUNS DOWN
                OF LENGTH EXACTLY I
  
    I         STAT     EXP(STAT)    SD(STAT)       Z
  
    1       106.0    145.8750     12.1665       -3.28
    2        47.0     64.0500      6.2731       -2.72
    3        24.0     18.4069      3.8239        1.46
    4         8.0      4.0071      1.9366        2.06
    5         4.0      0.7071      0.8347        3.95
    6         3.0      0.1052      0.3240        8.94
    7         0.0      0.0136      0.1164       -0.12
    8         0.0      0.0015      0.0393       -0.04
    9         0.0      0.0002      0.0125       -0.01
   10         0.0      0.0000      0.0038        0.00
  
  
            STATISTIC = NUMBER OF RUNS DOWN
                OF LENGTH I OR MORE
  
  
    I         STAT     EXP(STAT)    SD(STAT)       Z
  
    1       192.0    233.1667      7.8779       -5.23
    2        86.0     87.2917      5.2610       -0.25
    3        39.0     23.2417      4.0657        3.88
    4        15.0      4.8347      2.1067        4.83
    5         7.0      0.8276      0.9016        6.85
    6         3.0      0.1205      0.3466        8.31
    7         0.0      0.0153      0.1236       -0.12
    8         0.0      0.0017      0.0414       -0.04
    9         0.0      0.0002      0.0132       -0.01
   10         0.0      0.0000      0.0040        0.00
  
  
            RUNS TOTAL = RUNS UP + RUNS DOWN
  
          STATISTIC = NUMBER OF RUNS TOTAL
               OF LENGTH EXACTLY I
  
    I         STAT     EXP(STAT)    SD(STAT)       Z
  
    1       208.0    291.7500     17.2060       -4.87
    2        95.0    128.1000      8.8716       -3.73
    3        47.0     36.8139      5.4079        1.88
    4        19.0      8.0143      2.7387        4.01
    5         8.0      1.4141      1.1805        5.58
    6         5.0      0.2105      0.4582       10.45
    7         2.0      0.0271      0.1647       11.98
    8         0.0      0.0031      0.0556       -0.06
    9         0.0      0.0003      0.0177       -0.02
   10         0.0      0.0000      0.0054       -0.01
  
  
          STATISTIC = NUMBER OF RUNS TOTAL
                OF LENGTH I OR MORE
  
    I         STAT     EXP(STAT)    SD(STAT)       Z
  
    1       384.0    466.3333     11.1410       -7.39
    2       176.0    174.5833      7.4402        0.19
    3        81.0     46.4833      5.7498        6.00
    4        34.0      9.6694      2.9794        8.17
    5        15.0      1.6552      1.2751       10.47
    6         7.0      0.2410      0.4902       13.79
    7         2.0      0.0306      0.1748       11.27
    8         0.0      0.0034      0.0586       -0.06
    9         0.0      0.0003      0.0186       -0.02
   10         0.0      0.0000      0.0056       -0.01
  
  
           LENGTH OF THE LONGEST RUN UP         =     7
           LENGTH OF THE LONGEST RUN DOWN       =     6
           LENGTH OF THE LONGEST RUN UP OR DOWN =     7
  
           NUMBER OF POSITIVE DIFFERENCES =   262
           NUMBER OF NEGATIVE DIFFERENCES =   258
           NUMBER OF ZERO     DIFFERENCES =   179
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. The runs test indicates some mild non-randomness.
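Each Z value is the observed count minus its expected value, divided by its standard deviation. For the total number of runs up and down (the row I = 1 of the "length I or more" table for runs total), the expected value and variance under randomness have the standard closed forms (2N - 1)/3 and (16N - 29)/90. The sketch below applies them; the function name `total_runs_z` is an illustrative assumption, and the closed forms for the other rows are not reproduced here.

```python
import math

def total_runs_z(observed_runs, n):
    """Z score for the total number of runs up and down in a series of length n."""
    expected = (2 * n - 1) / 3
    sd = math.sqrt((16 * n - 29) / 90)
    return expected, sd, (observed_runs - expected) / sd

# Observed total of 384 runs in N = 700 observations, taken from the output.
expected, sd, z = total_runs_z(384, 700)
```

This reproduces the printed values 466.3333, 11.1410, and -7.39 for that row.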

Although the runs test and lag 1 autocorrelation indicate some mild non-randomness, it is not sufficient to reject the Yi = C + Ei model. At least part of the non-randomness can be explained by the discrete nature of the data.

Distributional Analysis

Probability plots are a graphical test for assessing if a particular distribution provides an adequate fit to a data set.

A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient is 0.975. Since this is less than the critical value of 0.987 (this is a tabulated value), the normality assumption is rejected.
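The probability plot correlation coefficient is the ordinary correlation between the sorted data and the corresponding theoretical quantiles. The sketch below uses Filliben's approximation to the uniform order statistic medians for the plotting positions; that choice, the function name `normal_ppcc`, and the sample readings are assumptions for illustration and are not confirmed to match Dataplot's computation exactly.

```python
import math
from statistics import NormalDist

def normal_ppcc(y):
    """Correlation between sorted data and normal order statistic medians."""
    n = len(y)
    # Filliben's approximation to the uniform order statistic medians.
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1 / n)
    m[0] = 1 - m[-1]
    q = [NormalDist().inv_cdf(p) for p in m]   # theoretical normal quantiles
    ys = sorted(y)
    q_bar, y_bar = sum(q) / n, sum(ys) / n
    sqy = sum((a - q_bar) * (b - y_bar) for a, b in zip(q, ys))
    sqq = sum((a - q_bar) ** 2 for a in q)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sqy / math.sqrt(sqq * syy)

# Made-up discrete readings for illustration only.
ppcc = normal_ppcc([2897, 2898, 2898, 2898, 2899, 2899, 2899, 2900, 2901])
```

Data that take only a few distinct values produce a staircase pattern on the probability plot, which pulls the coefficient below 1 even when the underlying distribution is close to normal.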

Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used to test for normality. Dataplot generates the following output for the Anderson-Darling normality test.

               ANDERSON-DARLING 1-SAMPLE TEST
               THAT THE DATA CAME FROM A NORMAL DISTRIBUTION
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS                =      700
       MEAN                                  =    2898.562
       STANDARD DEVIATION                    =    1.304969
  
       ANDERSON-DARLING TEST STATISTIC VALUE =    16.76349
       ADJUSTED TEST STATISTIC VALUE         =    16.85843
  
 2. CRITICAL VALUES:
       90         % POINT    =   0.6560000
       95         % POINT    =   0.7870000
       97.5       % POINT    =   0.9180000
       99         % POINT    =    1.092000
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THE DATA DO NOT COME FROM A NORMAL DISTRIBUTION.
The Anderson-Darling test rejects the normality assumption because the test statistic, 16.76, is greater than the 99% critical value 1.092.
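The unadjusted Anderson-Darling statistic compares the fitted normal cumulative distribution function, evaluated at the sorted data, with the uniform spacing expected under normality. A minimal sketch, with the mean and standard deviation estimated from the data, is given below. The function name `anderson_darling_normal` and the two-valued sample are illustrative assumptions, and the adjustment that Dataplot applies to obtain its adjusted statistic is not reproduced.

```python
import math
from statistics import NormalDist, fmean, stdev

def anderson_darling_normal(y):
    """A-squared statistic for normality with mean and SD estimated from y."""
    n = len(y)
    fitted = NormalDist(fmean(y), stdev(y))
    f = [fitted.cdf(v) for v in sorted(y)]
    # Sum over i = 1..N of (2i - 1) * [ln F(Y_i) + ln(1 - F(Y_{N+1-i}))].
    s = sum((2 * i - 1) * (math.log(f[i - 1]) + math.log(1 - f[n - i]))
            for i in range(1, n + 1))
    return -n - s / n

# Made-up, strongly discrete sample for illustration only:
# half the readings at one value and half at another.
a2_discrete = anderson_darling_normal([0] * 50 + [1] * 50)
```

A sample concentrated on a few distinct values yields a large statistic, which is consistent with the observation that part of the non-normality in this data set comes from its discreteness.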

Although the data are not strictly normal, the violation of the normality assumption is not severe enough to conclude that the Yi = C + Ei model is unreasonable. At least part of the non-normality can be explained by the discrete nature of the data.

Outlier Analysis

A test for outliers is the Grubbs test. Dataplot generated the following output for Grubbs' test.
               GRUBBS TEST FOR OUTLIERS
               (ASSUMPTION: NORMALITY)
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS      =      700
       MINIMUM                     =    2895.000
       MEAN                        =    2898.562
       MAXIMUM                     =    2902.000
       STANDARD DEVIATION          =    1.304969
  
       GRUBBS TEST STATISTIC       =    2.729201
  
 2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
    FOR GRUBBS TEST STATISTIC
       0          % POINT    =    0.000000
       50         % POINT    =    3.371397
       75         % POINT    =    3.554906
       90         % POINT    =    3.784969
       95         % POINT    =    3.950619
       97.5       % POINT    =    4.109569
       99         % POINT    =    4.311552
       100        % POINT    =    26.41972
 
 3. CONCLUSION (AT THE 5% LEVEL):
       THERE ARE NO OUTLIERS.
For this data set, Grubbs' test does not detect any outliers at the 10%, 5%, and 1% significance levels.
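The Grubbs statistic is the largest absolute deviation from the sample mean, expressed in units of the sample standard deviation. The sketch below computes it for a list of values and then redoes the arithmetic from the rounded numbers printed in the output. The function name `grubbs_statistic` and the short sample are illustrative assumptions.

```python
import statistics

def grubbs_statistic(y):
    """Largest absolute deviation from the mean, in standard deviation units."""
    mean = statistics.fmean(y)
    sd = statistics.stdev(y)
    return max(abs(v - mean) for v in y) / sd

# Made-up sample for illustration only (not the case-study data).
g_toy = grubbs_statistic([1, 2, 3, 4, 10])

# Arithmetic check using the minimum, mean, maximum, and standard deviation
# printed in the Grubbs output (the mean is rounded, so agreement is approximate).
g_printed = max(2902.000 - 2898.562, 2898.562 - 2895.000) / 1.304969
```

The check gives about 2.73, in line with the reported statistic of 2.729201 and well below the 95% point of 3.950619.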
Model

Although the randomness and normality assumptions were mildly violated, we conclude that a reasonable model for the data is:
    Y(i) = 2898.6 + E(i)
In addition, a 95% confidence interval for the mean value is (2898.465,2898.658).
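A confidence interval for the mean follows from the sample mean, the sample standard deviation, and the sample size. The sketch below uses the normal quantile as a large-sample stand-in for the t quantile, which is an assumption that matters little at N = 700. The function name `mean_confidence_interval` is illustrative; the inputs are the values printed in the summary table.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, level=0.95):
    """Large-sample confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Standard deviation of the mean from the printed standard deviation and N.
se = 1.304969 / math.sqrt(700)

# Interval from the printed mean, standard deviation, and sample size.
lower, upper = mean_confidence_interval(2898.562, 1.304969, 700)
```

To within rounding, this gives the standard deviation of the mean 0.049323 and the interval (2898.465,2898.658) that appear in the univariate report.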
Univariate Report

It is sometimes useful and convenient to summarize the above results in a report.
 Analysis for Josephson Junction Cryothermometry Data
  
 1: Sample Size                           = 700
  
 2: Location
    Mean                                  = 2898.562
    Standard Deviation of Mean            = 0.049323
    95% Confidence Interval for Mean      = (2898.465,2898.658)
    Drift with respect to location?       = YES
    (Further analysis indicates that
    the drift, while statistically
    significant, is not practically
    significant)
  
 3: Variation
    Standard Deviation                    = 1.30497
    95% Confidence Interval for SD        = (1.240007,1.377169)
    Drift with respect to variation?
    (based on Levene's test on quarters
    of the data)                          = NO
  
 4: Distribution
    Normal PPCC                           = 0.97484
    Data are Normal?
      (as measured by Normal PPCC)        = NO
  
 5: Randomness
    Autocorrelation                       = 0.314802
    Data are Random?
      (as measured by autocorrelation)    = NO
  
 6: Statistical Control
    (i.e., no drift in location or scale,
    data are random, distribution is 
    fixed, here we are testing only for
    fixed normal)
    Data Set is in Statistical Control?   = NO
  
    Note: Although we have violations of
    the assumptions, they are mild enough,
    and at least partially explained by the
    discrete nature of the data, so we may model
    the data as if it were in statistical
    control
  
 7: Outliers?
    (as determined by Grubbs test)        = NO