|
1.
Exploratory Data Analysis
1.4. EDA Case Studies 1.4.2. Case Studies 1.4.2.2. Uniform Random Numbers
|
|||
| Summary Statistics |
As a first step in the analysis, a table of summary statistics is
computed from the data. The following table, generated by
Dataplot, shows a typical set of
statistics.
SUMMARY
NUMBER OF OBSERVATIONS = 500
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.4997850E+00 * RANGE = 0.9945900E+00 *
* MEAN = 0.5078304E+00 * STAND. DEV. = 0.2943252E+00 *
* MIDMEAN = 0.5045621E+00 * AV. AB. DEV. = 0.2526468E+00 *
* MEDIAN = 0.5183650E+00 * MINIMUM = 0.2490000E-02 *
* = * LOWER QUART. = 0.2508093E+00 *
* = * LOWER HINGE = 0.2505935E+00 *
* = * UPPER HINGE = 0.7594775E+00 *
* = * UPPER QUART. = 0.7591152E+00 *
* = * MAXIMUM = 0.9970800E+00 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = -0.3098569E-01 * ST. 3RD MOM. = -0.3443941E-01 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.1796969E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.2004886E+02 *
* = * UNIFORM PPCC = 0.9995682E+00 *
* = * NORMAL PPCC = 0.9771602E+00 *
* = * TUK -.5 PPCC = 0.7229201E+00 *
* = * CAUCHY PPCC = 0.3591767E+00 *
***********************************************************************
Note that under the distributional measures the uniform probability plot correlation coefficient (PPCC) value is significantly larger than the normal PPCC value. This is evidence that the uniform distribution fits these data better than does a normal distribution. |
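As a rough illustration of how the location, dispersion, and moment measures in the table could be computed outside Dataplot, here is a minimal Python sketch. It assumes NumPy and SciPy are available and uses simulated uniform data as a stand-in, since the 500 observations of the case study are not reproduced here. The `summary` helper and its midmean definition (mean of the values between the quartiles) are illustrative assumptions, not Dataplot's exact algorithms.

```python
import numpy as np
from scipy import stats

# Assumption: simulated stand-in data, not the case study's data set.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=500)

def summary(y):
    """Return a dict of basic location, dispersion, and moment measures."""
    y = np.sort(np.asarray(y, dtype=float))
    q1, q3 = np.percentile(y, [25, 75])
    inner = y[(y >= q1) & (y <= q3)]  # values between the quartiles
    return {
        "midrange": (y[0] + y[-1]) / 2,
        "mean": y.mean(),
        "midmean": inner.mean(),
        "median": np.median(y),
        "range": y[-1] - y[0],
        "std": y.std(ddof=1),
        "avg_abs_dev": np.mean(np.abs(y - y.mean())),
        "min": y[0],
        "max": y[-1],
        "std_3rd_moment": stats.skew(y),                    # skewness
        "std_4th_moment": stats.kurtosis(y, fisher=False),  # kurtosis
    }

s = summary(y)
for name, value in s.items():
    print(f"{name:>15s} = {value:.6f}")
```

For uniform data on (0, 1) the standardized fourth moment should be near 1.8, which is in line with the value of about 1.80 reported in the table.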
||
| Location |
One way to quantify a change in location over time is to
fit a straight line to the
data set using the index variable X = 1, 2, ..., N, with N denoting the
number of observations. If there is no significant drift in
the location, the slope parameter should be zero. For this data set,
Dataplot generated the following output:
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N = 500
NUMBER OF VARIABLES = 1
NO REPLICATION CASE
PARAMETER ESTIMATES (APPROX. ST. DEV.) T VALUE
1 A0 0.522923 (0.2638E-01) 19.82
2 A1 X -0.602478E-04 (0.9125E-04) -0.6603
RESIDUAL STANDARD DEVIATION = 0.2944917
RESIDUAL DEGREES OF FREEDOM = 498
The slope parameter, A1, has a
t value of -0.66, which is
not statistically significant: its magnitude is well below the
two-sided 5% critical value of about 1.96 for 498 degrees of freedom.
The slope can therefore be treated as zero.
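The drift check can be sketched in Python as follows. This is a minimal example assuming SciPy's `linregress`, run on simulated stand-in data rather than the case study's observations. The last lines recompute the t value from the estimate and standard deviation printed in the Dataplot output above.

```python
import numpy as np
from scipy import stats

# Assumption: simulated stand-in data, not the case study's data set.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=500)
x = np.arange(1, y.size + 1)  # index variable X = 1, 2, ..., N

fit = stats.linregress(x, y)
t_slope = fit.slope / fit.stderr                # t value for the slope
t_crit = stats.t.ppf(0.975, df=y.size - 2)      # two-sided 5% critical value
drift = abs(t_slope) > t_crit

print(f"slope = {fit.slope:.6e}, t = {t_slope:.4f}, critical = {t_crit:.4f}")
print("significant drift in location:", drift)

# t value implied by the Dataplot estimates quoted in the text.
t_reported = -0.602478e-04 / 0.9125e-04
print(f"t from reported estimates = {t_reported:.4f}")
```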
|
||
| Variation |
One simple way to detect a change in variation is with a
Bartlett test after dividing the
data set into several equal-sized intervals. However, the Bartlett
test is not robust to non-normality. Since we know this data set is
not approximated well by the normal distribution,
we use the alternative Levene
test. In particular, we use the Levene test based on the median
rather than the mean. The choice of the number of intervals is somewhat
arbitrary, although values of 4 or 8 are reasonable. Dataplot
generated the following output for the Levene test.
LEVENE F-TEST FOR SHIFT IN VARIATION
(ASSUMPTION: NORMALITY)
1. STATISTICS
NUMBER OF OBSERVATIONS = 500
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 0.7983007E-01
FOR LEVENE TEST STATISTIC
0 % POINT = 0.0000000E+00
50 % POINT = 0.7897459
75 % POINT = 1.373753
90 % POINT = 2.094885
95 % POINT = 2.622929
99 % POINT = 3.821479
99.9 % POINT = 5.506884
2.905608 % Point: 0.7983007E-01
3. CONCLUSION (AT THE 5% LEVEL):
THERE IS NO SHIFT IN VARIATION.
THUS: HOMOGENEOUS WITH RESPECT TO VARIATION.
In this case, the Levene test statistic (0.0798) is far below the
95% point of the reference distribution (2.623), so the standard
deviations are not significantly different in the 4 intervals.
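A median-based Levene test on four equal-sized intervals can be sketched in Python as below. This assumes SciPy's `levene` with `center="median"` and uses simulated stand-in data, so the statistic will not match the Dataplot value. The 95% point of the F distribution with 3 and 496 degrees of freedom should agree with the percent point listed in the output above.

```python
import numpy as np
from scipy import stats

# Assumption: simulated stand-in data, not the case study's data set.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=500)

k = 4                                   # number of intervals
groups = np.array_split(y, k)           # consecutive, equal-sized intervals
stat, p_value = stats.levene(*groups, center="median")

# Reference distribution: F with (k - 1, N - k) degrees of freedom.
f_crit = stats.f.ppf(0.95, dfn=k - 1, dfd=y.size - k)

print(f"Levene F = {stat:.6f}, p = {p_value:.4f}, 95% point = {f_crit:.6f}")
print("shift in variation at the 5% level:", stat > f_crit)
```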
|
||
| Randomness |
There are many ways in which data can be non-random. However,
most common forms of non-randomness can be detected with a
few simple tests. The lag plot in the 4-plot in the previous section
is a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted using 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is -0.03 (see the summary table above). The critical values at the 5% significance level are -0.087 and 0.087. Since the lag 1 autocorrelation lies within these bounds, it is not statistically significant, so there is no evidence of non-randomness.
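The lag 1 check can be sketched in Python as follows. The `autocorr` helper is a hypothetical name introduced for illustration, the data are simulated stand-ins, and the bounds use the usual large-sample approximation of plus or minus 1.96 divided by the square root of N, which for N = 500 gives about 0.0877.

```python
import numpy as np
from scipy import stats

def autocorr(y, lag):
    """Sample autocorrelation at a positive integer lag (lag >= 1)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return np.sum(d[:-lag] * d[lag:]) / np.sum(d * d)

# Assumption: simulated stand-in data, not the case study's data set.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=500)

r1 = autocorr(y, 1)
bound = stats.norm.ppf(0.975) / np.sqrt(y.size)  # 5% significance bounds

print(f"lag 1 autocorrelation = {r1:.5f}, bounds = +/-{bound:.4f}")
print("significant at the 5% level:", abs(r1) > bound)

# The value reported in the summary table lies inside the bounds.
r1_reported = -0.3098569e-01
print("reported value significant:", abs(r1_reported) > bound)
```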
A common test for randomness is the
runs test.
|
||
| Distributional Analysis |
Probability plots are a
graphical technique for assessing whether a particular distribution provides
an adequate fit to a data set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient of the normal probability plot (the NORMAL PPCC value in the summary table above) is 0.977. Since this is less than the tabulated critical value of 0.987, the normality assumption is rejected.
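Probability plot correlation coefficients for the normal and uniform distributions can be sketched in Python with SciPy's `probplot`, which returns the correlation of the ordered data with the theoretical quantiles. The data here are simulated stand-ins, and the critical value 0.987 is simply the tabulated value quoted in the text. SciPy's plotting positions may differ slightly from Dataplot's, so small differences in the coefficients are to be expected.

```python
import numpy as np
from scipy import stats

# Assumption: simulated stand-in data, not the case study's data set.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=500)

# probplot returns ((theoretical, ordered), (slope, intercept, r)).
_, (_, _, r_norm) = stats.probplot(y, dist="norm")
_, (_, _, r_unif) = stats.probplot(y, dist="uniform")

critical = 0.987  # tabulated critical value quoted in the text

print(f"normal PPCC  = {r_norm:.6f}")
print(f"uniform PPCC = {r_unif:.6f}")
print("normality rejected:", r_norm < critical)
```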
Chi-square and
Kolmogorov-Smirnov
goodness-of-fit tests are alternative methods for assessing
distributional adequacy. The Wilk-Shapiro and Anderson-Darling tests
can be used to test for normality. Dataplot generates the following
output for the Anderson-Darling normality test.
|
||
| Model |
Based on the graphical and quantitative analysis, we use the model
Y(i) = C + E(i)
where C is a constant location parameter and E(i) is the random error.
95% confidence limit for C = (0.497,0.503) |
||
| Univariate Report |
It is sometimes useful and convenient to summarize the above results
in a report. The report for the 500 uniform random numbers follows.
Analysis for 500 uniform random numbers
1: Sample Size = 500
2: Location
Mean = 0.50783
Standard Deviation of Mean = 0.013163
95% Confidence Interval for Mean = (0.48197,0.533692)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 0.294326
95% Confidence Interval for SD = (0.277144,0.313796)
Drift with respect to variation?
(based on Levene's test on quarters
of the data) = NO
4: Distribution
Normal PPCC = 0.977160
Data are Normal?
(as measured by Normal PPCC) = NO
Uniform PPCC = 0.9995
Data are Uniform?
(as measured by Uniform PPCC) = YES
5: Randomness
Autocorrelation = -0.03099
Data are Random?
(as measured by autocorrelation) = YES
6: Statistical Control
(i.e., no drift in location or scale,
data is random, distribution is
fixed, here we are testing only for
fixed uniform)
Data Set is in Statistical Control? = YES
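The interval estimates in the report can be checked with a short Python sketch. It assumes the standard t-based interval for the mean and the chi-square-based interval for the standard deviation, and takes N, the mean, and the standard deviation from the summary table above. Strictly, both intervals rely on normal theory, so for these uniform data they are approximations.

```python
import math
from scipy import stats

# Values taken from the summary table in the text.
n = 500
mean = 0.5078304
sd = 0.2943252

# Standard deviation of the mean and t-based 95% interval for the mean.
se = sd / math.sqrt(n)
t = stats.t.ppf(0.975, df=n - 1)
ci_mean = (mean - t * se, mean + t * se)

# Chi-square-based 95% interval for the standard deviation.
chi_lo, chi_hi = stats.chi2.ppf([0.025, 0.975], df=n - 1)
ci_sd = (sd * math.sqrt((n - 1) / chi_hi), sd * math.sqrt((n - 1) / chi_lo))

print(f"standard deviation of mean = {se:.6f}")
print(f"95% CI for mean = ({ci_mean[0]:.5f}, {ci_mean[1]:.6f})")
print(f"95% CI for SD   = ({ci_sd[0]:.6f}, {ci_sd[1]:.6f})")
```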
|
||