1. Exploratory Data Analysis
1.4. EDA Case Studies
1.4.2. Case Studies
1.4.2.2. Uniform Random Numbers

1.4.2.2.3. Quantitative Output and Interpretation

Summary Statistics As a first step in the analysis, a table of summary statistics is computed from the data. The following table, generated by Dataplot, shows a typical set of statistics.
                                 SUMMARY
  
                      NUMBER OF OBSERVATIONS =      500
  
  
 ***********************************************************************
 *        LOCATION MEASURES         *       DISPERSION MEASURES        *
 ***********************************************************************
 *  MIDRANGE     =   0.4997850E+00  *  RANGE        =   0.9945900E+00  *
 *  MEAN         =   0.5078304E+00  *  STAND. DEV.  =   0.2943252E+00  *
 *  MIDMEAN      =   0.5045621E+00  *  AV. AB. DEV. =   0.2526468E+00  *
 *  MEDIAN       =   0.5183650E+00  *  MINIMUM      =   0.2490000E-02  *
 *               =                  *  LOWER QUART. =   0.2508093E+00  *
 *               =                  *  LOWER HINGE  =   0.2505935E+00  *
 *               =                  *  UPPER HINGE  =   0.7594775E+00  *
 *               =                  *  UPPER QUART. =   0.7591152E+00  *
 *               =                  *  MAXIMUM      =   0.9970800E+00  *
 ***********************************************************************
 *       RANDOMNESS MEASURES        *     DISTRIBUTIONAL MEASURES      *
 ***********************************************************************
 *  AUTOCO COEF  =  -0.3098569E-01  *  ST. 3RD MOM. =  -0.3443941E-01  *
 *               =   0.0000000E+00  *  ST. 4TH MOM. =   0.1796969E+01  *
 *               =   0.0000000E+00  *  ST. WILK-SHA =  -0.2004886E+02  *
 *               =                  *  UNIFORM PPCC =   0.9995682E+00  *
 *               =                  *  NORMAL  PPCC =   0.9771602E+00  *
 *               =                  *  TUK -.5 PPCC =   0.7229201E+00  *
 *               =                  *  CAUCHY  PPCC =   0.3591767E+00  *
 ***********************************************************************

Note that under the distributional measures the uniform probability plot correlation coefficient (PPCC) value is significantly larger than the normal PPCC value. This is evidence that the uniform distribution fits these data better than does a normal distribution.
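The PPCC comparison can be sketched in a few lines of Python. This is a minimal illustration, not Dataplot's implementation: it assumes Filliben's approximation for the uniform order statistic medians, and it runs on a synthetic sample from Python's own generator because the 500 original observations are not reproduced on this page. The names filliben_medians, pearson, and ppcc are hypothetical helpers.

```python
import math
import random
from statistics import NormalDist


def filliben_medians(n):
    """Approximate medians of the uniform order statistics (Filliben's formula)."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson(x, y):
    """Ordinary Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ppcc(data, quantile):
    """Correlate the sorted data with theoretical quantiles of a candidate distribution."""
    ordered = sorted(data)
    theoretical = [quantile(p) for p in filliben_medians(len(ordered))]
    return pearson(ordered, theoretical)


# Synthetic stand-in for the case-study data (assumption, not the original sample).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]

uniform_ppcc = ppcc(sample, lambda p: p)              # uniform quantile function is the identity
normal_ppcc = ppcc(sample, NormalDist().inv_cdf)      # standard normal quantile function
print("uniform PPCC:", round(uniform_ppcc, 4))
print("normal  PPCC:", round(normal_ppcc, 4))
```

For uniform data the uniform PPCC should sit very close to 1 while the normal PPCC is noticeably lower, which is the pattern seen in the summary table.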

Location One way to quantify a change in location over time is to fit a straight line to the data set using the index variable X = 1, 2, ..., N, with N denoting the number of observations. If there is no significant drift in the location, the slope parameter should be zero. For this data set, Dataplot generated the following output:
  
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       =      500
NUMBER OF VARIABLES =        1
NO REPLICATION CASE
 
 
        PARAMETER ESTIMATES           (APPROX. ST. DEV.)    T VALUE
 1  A0                  0.522923       (0.2638E-01)        19.82
 2  A1       X         -0.602478E-04   (0.9125E-04)        -0.6603
 
RESIDUAL    STANDARD DEVIATION =         0.2944917
RESIDUAL    DEGREES OF FREEDOM =         498
  
The slope parameter, A1, has a t value of -0.66 which is statistically not significant. This indicates that the slope can in fact be considered zero.
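As a rough illustration of how the slope and its t value are obtained, the following sketch fits a straight line against the index 1, 2, ..., N by ordinary least squares. It is not Dataplot's code; slope_t_test is a hypothetical helper, and the data are a synthetic stand-in for the original sample.

```python
import math
import random


def slope_t_test(y):
    """Fit y = A0 + A1*x with x = 1..N and return (A1, sd of A1, t value)."""
    n = len(y)
    x = range(1, n + 1)
    mx = (n + 1) / 2
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual standard deviation uses N - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))
    slope_sd = resid_sd / math.sqrt(sxx)
    return slope, slope_sd, slope / slope_sd


# Synthetic stand-in for the case-study data (assumption).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]
a1, a1_sd, t_value = slope_t_test(sample)
print("slope:", a1, "sd:", a1_sd, "t:", round(t_value, 3))
print("significant drift at 5% level:", abs(t_value) > 1.96)
```

With 498 residual degrees of freedom the t distribution is close to normal, so comparing |t| with 1.96 is a reasonable approximation to the 5% two-sided test.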
Variation One simple way to detect a change in variation is with a Bartlett test after dividing the data set into several equal-sized intervals. However, the Bartlett test is not robust to non-normality. Since we know this data set is not approximated well by the normal distribution, we use the alternative Levene test. In particular, we use the Levene test based on the median rather than the mean. The choice of the number of intervals is somewhat arbitrary, although values of 4 or 8 are reasonable. Dataplot generated the following output for the Levene test.
               LEVENE F-TEST FOR SHIFT IN VARIATION
                      (ASSUMPTION: NORMALITY)
  
 1. STATISTICS
       NUMBER OF OBSERVATIONS    =      500
       NUMBER OF GROUPS          =        4
       LEVENE F TEST STATISTIC   =   0.7983007E-01
  
  
    FOR LEVENE TEST STATISTIC
       0          % POINT    =   0.0000000E+00
       50         % POINT    =   0.7897459
       75         % POINT    =    1.373753
       90         % POINT    =    2.094885
       95         % POINT    =    2.622929
       99         % POINT    =    3.821479
       99.9       % POINT    =    5.506884
  
  
          2.905608       % Point:    0.7983007E-01
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THERE IS NO SHIFT IN VARIATION.
       THUS: HOMOGENEOUS WITH RESPECT TO VARIATION.
  
In this case, the Levene test indicates that the standard deviations are not significantly different in the 4 intervals.
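The median-based Levene statistic is a one-way analysis of variance F statistic computed on the absolute deviations of each observation from its group median. A minimal sketch follows, assuming the data are split into four consecutive blocks of 125 points; levene_median is a hypothetical helper, and the sample is synthetic rather than the original data.

```python
import random
from statistics import median


def levene_median(groups):
    """Levene F statistic using absolute deviations from each group's median."""
    z = []
    for g in groups:
        center = median(g)
        z.append([abs(v - center) for v in g])
    k = len(z)
    n = sum(len(zi) for zi in z)
    group_means = [sum(zi) / len(zi) for zi in z]
    grand_mean = sum(sum(zi) for zi in z) / n
    between = sum(len(zi) * (m - grand_mean) ** 2 for zi, m in zip(z, group_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, group_means) for v in zi)
    return ((n - k) / (k - 1)) * between / within


# Synthetic stand-in for the case-study data (assumption).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]
quarters = [sample[i:i + 125] for i in range(0, 500, 125)]

f_stat = levene_median(quarters)
# 2.622929 is the 95% point quoted in the Dataplot output above.
print("Levene F:", round(f_stat, 4), "shift in variation:", f_stat > 2.622929)
```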
Randomness There are many ways in which data can be non-random. However, most common forms of non-randomness can be detected with a few simple tests. The lag plot in the 4-plot in the previous section is a simple graphical technique.

Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted using 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.

[Figure: autocorrelation plot]

The lag 1 autocorrelation, which is generally the one of most interest, is -0.03 (the AUTOCO COEF entry in the summary table). The critical values at the 5% significance level are -0.087 and 0.087. Because -0.03 lies well inside this interval, the lag 1 autocorrelation is not statistically significant, so there is no evidence of non-randomness.
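The lag 1 check can be sketched as follows. The critical values are taken here as plus or minus 1.96 divided by the square root of N, the usual large-sample approximation, which gives about 0.0877 for N = 500 and is in line with the values quoted above. The function name autocorrelation is a hypothetical helper and the sample is synthetic.

```python
import math
import random


def autocorrelation(y, lag=1):
    """Sample autocorrelation at the given lag (autocovariance ratio c_k / c_0)."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((v - mean) ** 2 for v in y)
    ck = sum((y[i] - mean) * (y[i + lag] - mean) for i in range(n - lag))
    return ck / c0


# Synthetic stand-in for the case-study data (assumption).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]

r1 = autocorrelation(sample, lag=1)
critical = 1.96 / math.sqrt(len(sample))
print("lag 1 autocorrelation:", round(r1, 4))
print("5% critical values: +/-", round(critical, 4))
print("significant:", abs(r1) > critical)
```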

A common test for randomness is the runs test.

                    RUNS UP
         STATISTIC = NUMBER OF RUNS UP
             OF LENGTH EXACTLY I
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       103.0    104.2083     10.2792       -0.12
 2        48.0     45.7167      5.2996        0.43
 3        11.0     13.1292      3.2297       -0.66
 4         6.0      2.8563      1.6351        1.92
 5         0.0      0.5037      0.7045       -0.71
 6         0.0      0.0749      0.2733       -0.27
 7         1.0      0.0097      0.0982       10.08
 8         0.0      0.0011      0.0331       -0.03
 9         0.0      0.0001      0.0106       -0.01
10         0.0      0.0000      0.0032        0.00
         STATISTIC = NUMBER OF RUNS UP
             OF LENGTH I OR MORE
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       169.0    166.5000      6.6546        0.38
 2        66.0     62.2917      4.4454        0.83
 3        18.0     16.5750      3.4338        0.41
 4         7.0      3.4458      1.7786        2.00
 5         1.0      0.5895      0.7609        0.54
 6         1.0      0.0858      0.2924        3.13
 7         1.0      0.0109      0.1042        9.49
 8         0.0      0.0012      0.0349       -0.03
 9         0.0      0.0001      0.0111       -0.01
10         0.0      0.0000      0.0034        0.00
                   RUNS DOWN
         STATISTIC = NUMBER OF RUNS DOWN
             OF LENGTH EXACTLY I
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       113.0    104.2083     10.2792        0.86
 2        43.0     45.7167      5.2996       -0.51
 3        11.0     13.1292      3.2297       -0.66
 4         1.0      2.8563      1.6351       -1.14
 5         0.0      0.5037      0.7045       -0.71
 6         0.0      0.0749      0.2733       -0.27
 7         0.0      0.0097      0.0982       -0.10
 8         0.0      0.0011      0.0331       -0.03
 9         0.0      0.0001      0.0106       -0.01
10         0.0      0.0000      0.0032        0.00
         STATISTIC = NUMBER OF RUNS DOWN
             OF LENGTH I OR MORE
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       168.0    166.5000      6.6546        0.23
 2        55.0     62.2917      4.4454       -1.64
 3        12.0     16.5750      3.4338       -1.33
 4         1.0      3.4458      1.7786       -1.38
 5         0.0      0.5895      0.7609       -0.77
 6         0.0      0.0858      0.2924       -0.29
 7         0.0      0.0109      0.1042       -0.10
 8         0.0      0.0012      0.0349       -0.03
 9         0.0      0.0001      0.0111       -0.01
10         0.0      0.0000      0.0034        0.00
         RUNS TOTAL = RUNS UP + RUNS DOWN
       STATISTIC = NUMBER OF RUNS TOTAL
            OF LENGTH EXACTLY I
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       216.0    208.4167     14.5370        0.52
 2        91.0     91.4333      7.4947       -0.06
 3        22.0     26.2583      4.5674       -0.93
 4         7.0      5.7127      2.3123        0.56
 5         0.0      1.0074      0.9963       -1.01
 6         0.0      0.1498      0.3866       -0.39
 7         1.0      0.0193      0.1389        7.06
 8         0.0      0.0022      0.0468       -0.05
 9         0.0      0.0002      0.0150       -0.01
10         0.0      0.0000      0.0045        0.00
       STATISTIC = NUMBER OF RUNS TOTAL
             OF LENGTH I OR MORE
 I         STAT     EXP(STAT)    SD(STAT)       Z
  
 1       337.0    333.0000      9.4110        0.43
 2       121.0    124.5833      6.2868       -0.57
 3        30.0     33.1500      4.8561       -0.65
 4         8.0      6.8917      2.5154        0.44
 5         1.0      1.1790      1.0761       -0.17
 6         1.0      0.1716      0.4136        2.00
 7         1.0      0.0217      0.1474        6.64
 8         0.0      0.0024      0.0494       -0.05
 9         0.0      0.0002      0.0157       -0.02
10         0.0      0.0000      0.0047        0.00
        LENGTH OF THE LONGEST RUN UP         =     7
        LENGTH OF THE LONGEST RUN DOWN       =     4
        LENGTH OF THE LONGEST RUN UP OR DOWN =     7
  
        NUMBER OF POSITIVE DIFFERENCES =   263
        NUMBER OF NEGATIVE DIFFERENCES =   236
        NUMBER OF ZERO     DIFFERENCES =     0
  
Values in the column labeled "Z" greater than 1.96 or less than -1.96 are statistically significant at the 5% level. Overall, this runs test does not indicate any significant non-randomness. The large Z values all trace back to a single run up of length 7, observed where almost none is expected (0.0097); that one run also produces the large values in the rows for runs of length 6 or more and 7 or more. The only other value beyond 1.96 is the borderline 2.00 for runs up of length 4 or more. When the expected count is this close to zero, one long run is not sufficient evidence to conclude that the data are non-random.
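A simplified sketch of the counting behind the runs table is given below. It tallies runs up and runs down by length, and computes a Z value only for the total number of runs, using the standard results that the expected number of runs is (2N - 1)/3 with variance (16N - 29)/90. For N = 500 these give 333.0 and a standard deviation of 9.411, matching the first row of the "runs total, length I or more" table. The per-length expectations in the other rows are not reproduced here. The helper names are hypothetical, ties are treated as ending a run, and the sample is synthetic.

```python
import math
import random
from collections import Counter


def run_lengths(y):
    """Return (runs up, runs down) as Counters mapping run length to count."""
    up, down = Counter(), Counter()
    direction, length = 0, 0
    for prev, cur in zip(y, y[1:]):
        step = (cur > prev) - (cur < prev)  # +1 up, -1 down, 0 tie
        if step != 0 and step == direction:
            length += 1
            continue
        # Direction changed (or a tie occurred): close the current run.
        if direction == 1:
            up[length] += 1
        elif direction == -1:
            down[length] += 1
        direction, length = step, abs(step)
    # Close the final run.
    if direction == 1:
        up[length] += 1
    elif direction == -1:
        down[length] += 1
    return up, down


def total_runs_z(n_runs, n):
    """Z value for the total number of runs up and down in n observations."""
    expected = (2 * n - 1) / 3
    sd = math.sqrt((16 * n - 29) / 90)
    return (n_runs - expected) / sd


# Synthetic stand-in for the case-study data (assumption).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]
runs_up, runs_down = run_lengths(sample)
n_total = sum(runs_up.values()) + sum(runs_down.values())
print("runs up by length:  ", dict(sorted(runs_up.items())))
print("runs down by length:", dict(sorted(runs_down.items())))
print("total runs:", n_total, "Z:", round(total_runs_z(n_total, len(sample)), 2))
```

Applied to the 337 total runs reported in the table, total_runs_z(337, 500) gives roughly 0.43, the same Z value Dataplot lists.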
Distributional Analysis Probability plots are a graphical technique for assessing whether a particular distribution provides an adequate fit to a data set.

A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient of the normal probability plot (NORMAL PPCC in the summary table above) is 0.977. Since this is less than the critical value of 0.987 (this is a tabulated value), the normality assumption is rejected.

Chi-square and Kolmogorov-Smirnov goodness-of-fit tests are alternative methods for assessing distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests can be used to test for normality. Dataplot generates the following output for the Anderson-Darling normality test.

               ANDERSON-DARLING 1-SAMPLE TEST
               THAT THE DATA CAME FROM A NORMAL DISTRIBUTION
  
 1. STATISTICS:
       NUMBER OF OBSERVATIONS                =      500
       MEAN                                  =   0.5078304
       STANDARD DEVIATION                    =   0.2943252
  
       ANDERSON-DARLING TEST STATISTIC VALUE =    5.719849
       ADJUSTED TEST STATISTIC VALUE         =    5.765036
  
 2. CRITICAL VALUES:
       90         % POINT    =   0.6560000
       95         % POINT    =   0.7870000
       97.5       % POINT    =   0.9180000
       99         % POINT    =    1.092000
  
 3. CONCLUSION (AT THE 5% LEVEL):
       THE DATA DO NOT COME FROM A NORMAL DISTRIBUTION.
The Anderson-Darling test rejects the normality assumption because the value of the test statistic, 5.72, is larger than the critical value of 1.092 at the 1% significance level.
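The Anderson-Darling statistic for normality can be sketched directly from its definition, with the mean and standard deviation estimated from the data. The adjustment factor used below, 1 + 4/N - 25/N^2, is an inference from the output above (5.719849 times 1.0079 is approximately 5.765036) rather than something the output states. The function anderson_darling_normal is a hypothetical helper and the sample is synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(data):
    """Return (A^2, adjusted A^2) for a normal fit with estimated mean and sd."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    cdf = [fitted.cdf(v) for v in sorted(data)]
    # With 0-based index i, the weight (2i + 1) corresponds to (2j - 1) for j = i + 1.
    s = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    adjusted = a2 * (1 + 4 / n - 25 / n ** 2)  # factor inferred from the Dataplot output
    return a2, adjusted


# Synthetic stand-in for the case-study data (assumption).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]
a2, a2_adjusted = anderson_darling_normal(sample)

# 0.787 is the 95% point quoted in the Dataplot output above.
print("A^2:", round(a2, 4), "adjusted:", round(a2_adjusted, 4))
print("reject normality at 5% level:", a2_adjusted > 0.787)
```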
Model Based on the graphical and quantitative analysis, we use the model
    Yi = C + Ei
where C is estimated by the mid-range and the uncertainty interval for C is based on a bootstrap analysis. Specifically,
    C = 0.499
    95% confidence limit for C = (0.497,0.503)
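The source does not say which bootstrap variant was used, so the following is only one plausible sketch: a percentile bootstrap of the mid-range, resampling the data with replacement. The helper names, the number of resamples, and the seed are all assumptions, and the sample is a synthetic stand-in, so the interval it prints will not reproduce the limits quoted above.

```python
import random


def midrange(values):
    """Mid-range: the average of the smallest and largest observation."""
    return (min(values) + max(values)) / 2


def bootstrap_interval(data, stat, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = reps[int((1 - level) / 2 * n_boot)]
    upper = reps[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Synthetic stand-in for the case-study data (assumption).
rng = random.Random(0)
sample = [rng.random() for _ in range(500)]

c_hat = midrange(sample)
low, high = bootstrap_interval(sample, midrange)
print("mid-range estimate of C:", round(c_hat, 4))
print("95% bootstrap interval:", (round(low, 4), round(high, 4)))
```

The mid-range is a natural location estimate for uniform data because it depends only on the two extreme observations, which is also why its bootstrap interval tends to be much narrower than the corresponding interval for the mean.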
Univariate Report It is sometimes useful and convenient to summarize the above results in a report. The report for the 500 uniform random numbers follows.
  
 Analysis for 500 uniform random numbers
  
 1: Sample Size                           = 500
  
 2: Location
    Mean                                  = 0.50783
    Standard Deviation of Mean            = 0.013163
    95% Confidence Interval for Mean      = (0.48197,0.533692)
    Drift with respect to location?       = NO
  
 3: Variation
    Standard Deviation                    = 0.294326
    95% Confidence Interval for SD        = (0.277144,0.313796)
    Drift with respect to variation?
    (based on Levene's test on quarters
    of the data)                          = NO
  
 4: Distribution
    Normal PPCC                           = 0.977160
    Data are Normal?
      (as measured by Normal PPCC)        = NO
  
    Uniform PPCC                          = 0.9995
    Data are Uniform?
      (as measured by Uniform PPCC)       = YES
  
 5: Randomness
    Autocorrelation                       = -0.03099
    Data are Random?
      (as measured by autocorrelation)    = YES
  
 6: Statistical Control
    (i.e., no drift in location or scale,
    data is random, distribution is 
    fixed, here we are testing only for
    fixed uniform)
    Data Set is in Statistical Control?   = YES
  
  