|
1. Exploratory Data Analysis
1.4. EDA Case Studies
1.4.2. Case Studies
1.4.2.1. Normal Random Numbers
|
|||
| Summary Statistics |
As a first step in the analysis, a table of summary statistics is
computed from the data. The following table, generated by
Dataplot, shows a typical set of
statistics.
SUMMARY
NUMBER OF OBSERVATIONS = 500
***********************************************************************
*        LOCATION MEASURES         *       DISPERSION MEASURES        *
***********************************************************************
*  MIDRANGE     =   0.3945000E+00  *  RANGE        =   0.6083000E+01  *
*  MEAN         =  -0.2935997E-02  *  STAND. DEV.  =   0.1021041E+01  *
*  MIDMEAN      =   0.1623600E-01  *  AV. AB. DEV. =   0.8174360E+00  *
*  MEDIAN       =  -0.9300000E-01  *  MINIMUM      =  -0.2647000E+01  *
*               =                  *  LOWER QUART. =  -0.7204999E+00  *
*               =                  *  LOWER HINGE  =  -0.7210000E+00  *
*               =                  *  UPPER HINGE  =   0.6455001E+00  *
*               =                  *  UPPER QUART. =   0.6447501E+00  *
*               =                  *  MAXIMUM      =   0.3436000E+01  *
***********************************************************************
*       RANDOMNESS MEASURES        *     DISTRIBUTIONAL MEASURES      *
***********************************************************************
*  AUTOCO COEF  =   0.4505888E-01  *  ST. 3RD MOM. =   0.3072273E+00  *
*               =   0.0000000E+00  *  ST. 4TH MOM. =   0.2990314E+01  *
*               =   0.0000000E+00  *  ST. WILK-SHA =   0.7515639E+01  *
*               =                  *  UNIFORM PPCC =   0.9756625E+00  *
*               =                  *  NORMAL PPCC  =   0.9961721E+00  *
*               =                  *  TUK -.5 PPCC =   0.8366451E+00  *
*               =                  *  CAUCHY PPCC  =   0.4922674E+00  *
***********************************************************************
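A few of the location and dispersion measures in this table can be reproduced with a short script. The following is a minimal Python sketch, not Dataplot's implementation. The function name `summary` is hypothetical, the sample is simulated stand-in data rather than the handbook's data set, and the N - 1 denominator for the standard deviation is an assumption about how the tabulated value was computed. The last lines only check that the reported midrange and range are consistent with the reported minimum and maximum.

```python
import random
import statistics


def summary(values):
    """Return a few location and dispersion measures for a list of numbers."""
    lo, hi = min(values), max(values)
    return {
        "n": len(values),
        "midrange": (lo + hi) / 2,
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "range": hi - lo,
        "stand_dev": statistics.stdev(values),  # assumes the N - 1 denominator
        "minimum": lo,
        "maximum": hi,
    }


# Stand-in data: 500 simulated standard normal values (NOT the handbook data).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
stats = summary(sample)

# Consistency check that uses only the extremes reported in the table.
reported_min, reported_max = -2.647, 3.436
reported_midrange = (reported_min + reported_max) / 2  # table: 0.3945000E+00
reported_range = reported_max - reported_min           # table: 0.6083000E+01
```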
|
||
| Location |
One way to quantify a change in location over time is to
fit a straight line
to the data set, using the index variable X = 1, 2, ..., N, with
N denoting the number of observations. If there is no significant
drift in the location, the slope parameter should be zero. For this
data set, Dataplot generated the following output:
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N       = 500
NUMBER OF VARIABLES = 1
NO REPLICATION CASE
       PARAMETER ESTIMATES    (APPROX. ST. DEV.)    T VALUE
1  A0         0.699127E-02       (0.9155E-01)       0.7636E-01
2  A1   X    -0.396298E-04       (0.3167E-03)      -0.1251
RESIDUAL STANDARD DEVIATION = 1.02205
RESIDUAL DEGREES OF FREEDOM = 498
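The slope parameter, A1, has a t value of -0.13, which is not
statistically significant: its magnitude is far below the critical
value of about 1.96 for a two-sided test at the 5% level with 498
degrees of freedom. The data therefore give no evidence of drift in
location, and the slope can be treated as zero.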
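The drift check amounts to an ordinary least squares fit of the data against the index 1, 2, ..., N, followed by dividing the slope estimate by its standard error. The following is a minimal Python sketch under that reading. The function name `slope_t_value` is hypothetical and this is not Dataplot's code. The last lines only confirm that the reported t value is the ratio of the reported slope to its reported standard deviation.

```python
import math


def slope_t_value(y):
    """Fit y = a0 + a1*x with x = 1..N; return (a0, a1, se_a1, t_a1)."""
    n = len(y)
    x = range(1, n + 1)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    sse = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))  # residual degrees of freedom = N - 2
    se_a1 = resid_sd / math.sqrt(sxx)
    return a0, a1, se_a1, a1 / se_a1


# Ratio of the reported slope estimate to its reported standard deviation.
reported_t = -0.396298e-04 / 0.3167e-03  # output above: -0.1251
```

A |t| value well below 2 is consistent with no drift, which matches the conclusion drawn from the output above.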
|
||
| Variation |
One simple way to detect a change in variation is with a
Bartlett test, after dividing
the data set into several equal-sized intervals.
The choice of the number of intervals is somewhat arbitrary, although
values of 4 or 8 are reasonable. Dataplot generated the following
output for the Bartlett test.
BARTLETT TEST
(STANDARD DEFINITION)
NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
TEST:
DEGREES OF FREEDOM = 3.000000
TEST STATISTIC VALUE = 2.373660
CUTOFF: 95% PERCENT POINT = 7.814727
CUTOFF: 99% PERCENT POINT = 11.34487
CHI-SQUARE CDF VALUE = 0.501443
NULL                NULL HYPOTHESIS        NULL HYPOTHESIS
HYPOTHESIS          ACCEPTANCE INTERVAL    CONCLUSION
ALL SIGMA EQUAL     (0.000,0.950)          ACCEPT
In this case, the test statistic of 2.37 is well below the 95% cutoff
of 7.81, so the Bartlett test indicates that the standard deviations
are not significantly different in the 4 intervals.
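The Bartlett statistic itself is straightforward to compute once the data are split into groups. The following is a minimal Python sketch of the standard Bartlett formula. The helper names `split_equal` and `bartlett_statistic` are hypothetical and this is not Dataplot's implementation. Rather than computing a chi-square quantile, it reuses the 95% cutoff for 3 degrees of freedom printed in the output above.

```python
import math
import statistics


def split_equal(values, k):
    """Split values into k consecutive groups of equal size (extras dropped)."""
    size = len(values) // k
    return [values[i * size:(i + 1) * size] for i in range(k)]


def bartlett_statistic(groups):
    """Bartlett's test statistic for equality of variances across groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [statistics.variance(g) for g in groups]  # N - 1 denominator
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (
        sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)
    ) / (3 * (k - 1))
    return numerator / correction


# Values taken from the output above (4 groups, hence 3 degrees of freedom).
CUTOFF_95 = 7.814727
reported_statistic = 2.373660
reject_equal_sigmas = reported_statistic > CUTOFF_95  # False
```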
|
||
| Randomness |
There are many ways in which data can be non-random. However,
most common forms of non-randomness can be detected with a
few simple tests. The lag plot in the 4-plot above is a
simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels; points outside a band indicate autocorrelations that are statistically significant at that level (the lag 0 autocorrelation is always 1 by definition). Dataplot generated the following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.045. The critical values at the 5% significance level are -0.087 and 0.087. Thus, since 0.045 is in the interval, the lag 1 autocorrelation is not statistically significant, so there is no evidence of non-randomness.
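The comparison above can be sketched in a few lines of Python. The sketch assumes the usual large-sample approximation in which the 5% critical values are plus or minus 1.96 divided by the square root of N. That formula is an assumption about where the quoted bounds come from, although it gives about 0.0877 for N = 500, close to the 0.087 quoted above. The function names are hypothetical.

```python
import math


def lag_autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    numer = sum((y[i] - mean) * (y[i + lag] - mean) for i in range(n - lag))
    return numer / denom


def critical_value(n, z=1.96):
    """Approximate two-sided 5% bound for a sample autocorrelation."""
    return z / math.sqrt(n)


bound = critical_value(500)
reported_r1 = 0.045  # lag 1 autocorrelation quoted in the text
is_significant = abs(reported_r1) > bound  # False
```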
A common test for randomness is the
runs test.
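As an illustration of the idea behind a runs test, the following Python sketch implements the simple runs test above and below the median (the Wald-Wolfowitz form) with a normal approximation. This is a swapped-in, simpler variant chosen for brevity. It is not claimed to be the runs test that Dataplot performs, and the function name `runs_test_z` is hypothetical.

```python
import math
import statistics


def runs_test_z(values):
    """Runs above/below the median; return (runs, expected_runs, z)."""
    med = statistics.median(values)
    # Values equal to the median are dropped, a common convention.
    signs = [v > med for v in values if v != med]
    n1 = sum(signs)            # count above the median
    n2 = len(signs) - n1       # count below the median
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / (
        (n1 + n2) ** 2 * (n1 + n2 - 1)
    )
    return runs, expected, (runs - expected) / math.sqrt(variance)
```

Under this variant, a |z| value above roughly 1.96 would suggest non-randomness at the 5% level: too many runs points to rapid oscillation, too few to trends or clustering.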
|
||
| Distributional Analysis |
Probability plots
are a graphical technique for assessing whether a
particular distribution provides an adequate fit to a data
set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient is 0.996. Since this is greater than the critical value of 0.987 (this is a tabulated value), the normality assumption is not rejected.
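The probability plot correlation coefficient (PPCC) can be sketched as the correlation between the sorted data and the normal quantiles of a set of plotting positions. The Python sketch below assumes Filliben's approximation to the uniform order statistic medians as the plotting positions. Whether Dataplot uses exactly these positions is an assumption, and the function names are hypothetical. The critical value 0.987 is the tabulated value quoted above.

```python
import math
from statistics import NormalDist


def uniform_order_medians(n):
    """Filliben's approximation to the uniform order statistic medians."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def normal_order_medians(n):
    """Normal quantiles at the uniform order statistic medians."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(p) for p in uniform_order_medians(n)]


def normal_ppcc(values):
    """Correlation between the sorted data and the normal order medians."""
    n = len(values)
    y = sorted(values)
    x = normal_order_medians(n)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


CRITICAL_VALUE = 0.987   # tabulated value quoted in the text
reported_ppcc = 0.996
normality_rejected = reported_ppcc < CRITICAL_VALUE  # False
```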
Chi-square
and
Kolmogorov-Smirnov
goodness-of-fit tests are
alternative methods for assessing distributional adequacy.
The Wilk-Shapiro and
Anderson-Darling tests can be
used to test for normality. Dataplot generates the following
output for the Anderson-Darling normality test.
|
||
| Outlier Analysis |
A test for outliers is the Grubbs
test. Dataplot generated the following output for Grubbs' test.
GRUBBS TEST FOR OUTLIERS
(ASSUMPTION: NORMALITY)
1. STATISTICS:
NUMBER OF OBSERVATIONS = 500
MINIMUM = -2.647000
MEAN = -0.2935997E-02
MAXIMUM = 3.436000
STANDARD DEVIATION = 1.021041
GRUBBS TEST STATISTIC = 3.368068
2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
FOR GRUBBS TEST STATISTIC
0 % POINT = 0.000000
50 % POINT = 3.274338
75 % POINT = 3.461431
90 % POINT = 3.695134
95 % POINT = 3.863087
97.5 % POINT = 4.024592
99 % POINT = 4.228033
100 % POINT = 22.31596
3. CONCLUSION (AT THE 5% LEVEL):
THERE ARE NO OUTLIERS.
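For this data set, Grubbs' test does not detect any outliers at
the 25%, 10%, 5%, and 1% significance levels, because the test
statistic of 3.37 is below the 75% point of 3.46 and therefore below
all of the higher percent points.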
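The Grubbs statistic in the output can be recomputed from the reported minimum, mean, maximum, and standard deviation as the largest absolute deviation from the mean divided by the standard deviation. The following Python sketch does that and compares the result with the percent points printed above. The function name `grubbs_statistic` is hypothetical and the percent points are copied from the output, not recomputed.

```python
def grubbs_statistic(minimum, mean, maximum, sd):
    """Largest absolute deviation from the mean, in standard deviation units."""
    return max(maximum - mean, mean - minimum) / sd


# Values copied from the Grubbs output above.
g = grubbs_statistic(-2.647, -0.2935997e-02, 3.436, 1.021041)

# Percent points copied from the output; key = percent point, value = cutoff.
percent_points = {75: 3.461431, 90: 3.695134, 95: 3.863087, 99: 4.228033}

# Map each significance level (100 - percent point) to "outlier detected?".
outlier_detected = {100 - p: g > cutoff for p, cutoff in percent_points.items()}
```

Here `g` agrees with the reported 3.368068 to within rounding, and every entry of `outlier_detected` is `False`.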
|
||
| Model |
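Since the underlying assumptions were validated both graphically
and analytically, we conclude that a reasonable model for the
data is:
Y_i = C + E_i
where C is a constant, estimated here by the sample mean of -0.00294,
and E_i is a random error term.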
|
||
| Univariate Report |
It is sometimes useful and convenient to summarize the above results
in a report. The report for the 500 normal random numbers follows.
Analysis for 500 normal random numbers
1: Sample Size = 500
2: Location
Mean = -0.00294
Standard Deviation of Mean = 0.045663
95% Confidence Interval for Mean = (-0.09266,0.086779)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 1.021042
95% Confidence Interval for SD = (0.961437,1.088585)
Drift with respect to variation?
(based on Bartlett's test on quarters
of the data) = NO
4: Distribution
Normal PPCC = 0.996173
Data are Normal?
(as measured by Normal PPCC) = YES
5: Randomness
Autocorrelation = 0.045059
Data are Random?
(as measured by autocorrelation) = YES
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed normal)
Data Set is in Statistical Control? = YES
7: Outliers?
(as determined by Grubbs' test) = NO
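The location entries of the report follow from the sample mean and standard deviation. The Python sketch below recomputes the standard deviation of the mean as s divided by the square root of N, and the 95% confidence interval as the mean plus or minus t times that value. The t critical value of 1.9647 for 499 degrees of freedom is an assumed tabulated value, not something printed in the report, and small differences from the report are expected because the inputs are rounded.

```python
import math

# Values copied from the report above.
n = 500
mean = -0.00294
sd = 1.021042

# Assumption: approximate 97.5% point of the t distribution with 499 df.
t_crit = 1.9647

sd_of_mean = sd / math.sqrt(n)            # report: 0.045663
half_width = t_crit * sd_of_mean
ci_mean = (mean - half_width, mean + half_width)  # report: (-0.09266, 0.086779)
```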
|
||