|
1.4.2.8. Heat Flow Meter 1
|
|||
| Summary Statistics |
As a first step in the analysis, a table of summary statistics is
computed from the data. The following table, generated by
Dataplot, shows a typical set of
statistics.
SUMMARY
NUMBER OF OBSERVATIONS = 195
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.9262411E+01 * RANGE = 0.1311255E+00 *
* MEAN = 0.9261460E+01 * STAND. DEV. = 0.2278881E-01 *
* MIDMEAN = 0.9259412E+01 * AV. AB. DEV. = 0.1788945E-01 *
* MEDIAN = 0.9261952E+01 * MINIMUM = 0.9196848E+01 *
* = * LOWER QUART. = 0.9246826E+01 *
* = * LOWER HINGE = 0.9246496E+01 *
* = * UPPER HINGE = 0.9275530E+01 *
* = * UPPER QUART. = 0.9275708E+01 *
* = * MAXIMUM = 0.9327973E+01 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.2805789E+00 * ST. 3RD MOM. = -0.8537455E-02 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.3049067E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = 0.9458605E+01 *
* = * UNIFORM PPCC = 0.9735289E+00 *
* = * NORMAL PPCC = 0.9989640E+00 *
* = * TUK -.5 PPCC = 0.8927904E+00 *
* = * CAUCHY PPCC = 0.6360204E+00 *
***********************************************************************
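As a quick consistency check on the table, the midrange and range can be recomputed from the reported minimum and maximum. This is a minimal Python sketch rather than part of the Dataplot output; the variable names are illustrative.

```python
# Values copied from the summary table above.
minimum = 9.196848
maximum = 9.327973

# Midrange is the average of the two extremes; range is their difference.
midrange = (minimum + maximum) / 2
data_range = maximum - minimum

print(f"midrange = {midrange:.6f}")
print(f"range    = {data_range:.6f}")
```

Both values agree with the tabulated MIDRANGE and RANGE entries to within the rounding of the printed minimum and maximum.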
|
||
| Location |
One way to quantify a change in location over time is to
fit a straight line to the
data set using the index variable X = 1, 2, ..., N, with N denoting
the number of observations. If there is no significant drift in
the location, the slope parameter should be zero. For this data set,
Dataplot generates the following output:
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N = 195
NUMBER OF VARIABLES = 1
NO REPLICATION CASE
PARAMETER ESTIMATES (APPROX. ST. DEV.) T VALUE
1 A0 9.26699 (0.3253E-02) 2849.
2 A1 X -0.564115E-04 (0.2878E-04) -1.960
RESIDUAL STANDARD DEVIATION = 0.2262372E-01
RESIDUAL DEGREES OF FREEDOM = 193
The slope parameter, A1, has a
t value of -1.96,
which is (barely) statistically significant since it is essentially
equal to the 95% level cutoff of -1.96. However, notice that the
value of the slope parameter estimate is only -0.000056. Accumulated
over all 195 observations, this amounts to a drift of about -0.011,
roughly half of the residual standard deviation of 0.0226. This slope,
even though statistically significant, can essentially be considered zero.
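The fit against the index variable can be sketched in a few lines of Python. This is an illustrative least-squares implementation, not Dataplot's code: the function name `index_slope_fit` and the short demo series are assumptions made for the example, and the demo values are not the heat flow meter data.

```python
import math

def index_slope_fit(y):
    """Fit y = a0 + a1*x by least squares with x = 1, 2, ..., N.

    Returns (a0, a1, t value of a1, residual standard deviation).
    """
    n = len(y)
    x = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_mean - a1 * x_mean
    sse = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))      # N - 2 residual degrees of freedom
    se_a1 = resid_sd / math.sqrt(sxx)        # approximate st. dev. of the slope
    return a0, a1, a1 / se_a1, resid_sd

# Made-up demo series (NOT the heat flow meter data).
demo = [9.27, 9.25, 9.26, 9.28, 9.24, 9.26, 9.25, 9.27]
a0, a1, t_value, resid_sd = index_slope_fit(demo)
print(f"A0 = {a0:.5f}, A1 = {a1:.6f}, t = {t_value:.3f}")

# The reported t value is the estimate divided by its standard deviation,
# using the two numbers printed in the Dataplot output above.
reported_t = -0.564115e-04 / 0.2878e-04
print(f"reported t value = {reported_t:.3f}")
```

The last two lines only reuse numbers from the Dataplot output; they show that the T VALUE column is the parameter estimate divided by its approximate standard deviation.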
|
||
| Variation |
One simple way to detect a change in variation is with a
Bartlett test
after dividing the data set into several equal-sized intervals.
The choice of the number of intervals is somewhat arbitrary, although
values of 4 or 8 are reasonable. Dataplot generated the following
output for the Bartlett test.
BARTLETT TEST
(STANDARD DEFINITION)
NULL HYPOTHESIS UNDER TEST--ALL SIGMA(I) ARE EQUAL
TEST:
DEGREES OF FREEDOM = 3.000000
TEST STATISTIC VALUE = 3.147338
CUTOFF: 95% PERCENT POINT = 7.814727
CUTOFF: 99% PERCENT POINT = 11.34487
CHI-SQUARE CDF VALUE = 0.630538
NULL NULL HYPOTHESIS NULL HYPOTHESIS
HYPOTHESIS ACCEPTANCE INTERVAL CONCLUSION
ALL SIGMA EQUAL (0.000,0.950) ACCEPT
In this case, since the Bartlett test statistic of 3.15 is less than
the critical value at the 5% significance level of 7.81, we conclude
that the standard deviations are not significantly different in the
4 intervals. That is, the assumption of constant scale is valid.
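A minimal Python sketch of the Bartlett statistic follows, using the standard textbook definition. Several things here are assumptions for illustration: the helper names, the rule of dropping leftover points when N is not divisible by the number of intervals (the text does not say how Dataplot handles this), and the synthetic series, which is a stand-in and not the heat flow meter data. The 95% cutoff is the value printed in the output above for 3 degrees of freedom.

```python
import math
import random
from statistics import variance

def bartlett_statistic(groups):
    """Bartlett's test statistic for equal variances across groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    variances = [variance(g) for g in groups]   # sample variances (n - 1 divisor)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - k)
    numerator = (n_total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction

def split_into_intervals(y, k):
    """Split y into k consecutive equal-sized intervals.

    Assumption: any leftover points at the end are dropped.
    """
    size = len(y) // k
    return [y[i * size:(i + 1) * size] for i in range(k)]

# Synthetic stand-in series (NOT the heat flow meter data).
rng = random.Random(0)
y = [rng.gauss(9.26, 0.023) for _ in range(195)]

CUTOFF_95 = 7.814727   # chi-square 95% point for 3 degrees of freedom, from the output above
stat = bartlett_statistic(split_into_intervals(y, 4))
print(f"Bartlett statistic = {stat:.4f}")
print("reject equal sigma" if stat > CUTOFF_95 else "do not reject equal sigma")
```

With 4 intervals the statistic is compared against a chi-square distribution with 4 - 1 = 3 degrees of freedom, which is why the Dataplot output reports DEGREES OF FREEDOM = 3.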
|
||
| Randomness |
There are many ways in which data can be non-random. However,
most common forms of non-randomness can be detected with a
few simple tests. The lag plot in the previous section is a
simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels. Points outside this band indicate statistically significant values (lag 0 is always 1). Dataplot generated the following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of greatest interest, is 0.281. With N = 195, the critical values at the 5% significance level are approximately -0.140 and 0.140 (that is, ±1.96/√N). This indicates that the lag 1 autocorrelation is statistically significant, so there is evidence of non-randomness.
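The lag autocorrelation and its large-sample significance band can be sketched as follows. This is an illustrative Python version, assuming the usual estimator (lagged cross-products divided by the total sum of squares) and the approximation that the band half-width is z/√N; the function names are hypothetical.

```python
import math

def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    num = sum((y[i] - mean) * (y[i + lag] - mean) for i in range(n - lag))
    return num / denom

def significance_band(n, z=1.96):
    """Half-width of the approximate band for a random series of length n.

    z = 1.96 gives the 95% band; z = 2.576 gives the 99% band.
    """
    return z / math.sqrt(n)

def is_significant(r, n, z=1.96):
    """True if the autocorrelation r falls outside the band."""
    return abs(r) > significance_band(n, z)

# Lag 1 autocorrelation reported for this data set (N = 195).
print(is_significant(0.2805789, 195))
```

The same check can be applied at every lag to reproduce the points that fall outside the bands on an autocorrelation plot.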
A common test for randomness is the
runs test.
Although the autocorrelation plot and the runs test indicate some
mild non-randomness, the violation of the randomness assumption is
not serious enough to warrant developing a more
sophisticated model. It is common in practice that some of the
assumptions are mildly violated and it is a judgement call as to
whether or not the violations are serious enough to warrant
developing a more sophisticated model for the data.
|
||
| Distributional Analysis |
Probability plots
are a graphical test for assessing if a
particular distribution provides an adequate fit to a data set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient is 0.999. Since this is greater than the critical value of 0.987 (this is a tabulated value), the normality assumption is not rejected.
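A normal probability plot correlation coefficient (PPCC) can be sketched in Python as the correlation between the sorted data and approximate normal order statistic medians. The plotting positions below follow Filliben's approximation; whether Dataplot uses exactly these positions is an assumption, the function names are hypothetical, and the demo series is synthetic rather than the heat flow meter data.

```python
import math
import random
from statistics import NormalDist

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def normal_ppcc(y):
    """Correlation between sorted data and normal order statistic medians."""
    n = len(y)
    ordered = sorted(y)
    # Uniform order statistic medians (Filliben's approximation).
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[-1] = 0.5 ** (1 / n)
    medians[0] = 1 - medians[-1]
    # Map the uniform medians to the standard normal scale.
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf(m) for m in medians]
    return pearson(ordered, quantiles)

# Synthetic stand-in series (NOT the heat flow meter data).
rng = random.Random(1)
sample = [rng.gauss(9.26, 0.023) for _ in range(195)]
print(f"normal PPCC = {normal_ppcc(sample):.4f}")
```

A value close to 1 means the points on the normal probability plot lie close to a straight line; the decision itself still requires the tabulated critical value for the sample size.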
Chi-square and
Kolmogorov-Smirnov
goodness-of-fit tests are
alternative methods for assessing distributional adequacy.
The Wilk-Shapiro and
Anderson-Darling tests can be used
to test for normality. Dataplot generates the following
output for the Anderson-Darling normality test.
|
||
| Outlier Analysis |
A test for outliers is the
Grubbs' test. Dataplot generated
the following output for Grubbs' test.
GRUBBS TEST FOR OUTLIERS
(ASSUMPTION: NORMALITY)
1. STATISTICS:
NUMBER OF OBSERVATIONS = 195
MINIMUM = 9.196848
MEAN = 9.261460
MAXIMUM = 9.327973
STANDARD DEVIATION = 0.2278881E-01
GRUBBS TEST STATISTIC = 2.918673
2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
FOR GRUBBS TEST STATISTIC
0 % POINT = 0.000000
50 % POINT = 2.984294
75 % POINT = 3.181226
90 % POINT = 3.424672
95 % POINT = 3.597898
97.5 % POINT = 3.763061
99 % POINT = 3.970215
100 % POINT = 13.89263
3. CONCLUSION (AT THE 5% LEVEL):
THERE ARE NO OUTLIERS.
For this data set, Grubbs' test does not detect any outliers at
the 25%, 10%, 5%, and 1% significance levels.
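The Grubbs statistic can be recomputed from the values in the output: it is the largest absolute deviation from the mean divided by the standard deviation. The sketch below is illustrative Python; the variable names are assumptions, and the cutoffs are the percent points printed above, where a significance level α corresponds to the 100(1 - α) % point.

```python
# Values copied from the Grubbs output above.
minimum = 9.196848
mean = 9.261460
maximum = 9.327973
std_dev = 0.2278881e-01

# Largest absolute deviation from the mean, in standard deviation units.
grubbs_stat = max(maximum - mean, mean - minimum) / std_dev

# Percent points from the output, keyed by significance level.
cutoffs = {0.25: 3.181226, 0.10: 3.424672, 0.05: 3.597898, 0.01: 3.970215}

# An outlier is flagged at a given level only if the statistic exceeds the cutoff.
outlier_flags = {alpha: grubbs_stat > cutoff for alpha, cutoff in cutoffs.items()}

print(f"Grubbs statistic = {grubbs_stat:.4f}")
for alpha, flagged in outlier_flags.items():
    print(f"alpha = {alpha}: {'outlier detected' if flagged else 'no outlier'}")
```

The recomputed statistic agrees with the reported 2.918673 to within the rounding of the printed mean, and it lies below even the 50% point of the reference distribution.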
|
||
| Model |
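Since the underlying assumptions were validated both graphically
and analytically, with a mild violation of the randomness
assumption, we conclude that a reasonable model for the data is
the constant-location model
Y(i) = 9.26146 + E(i)
where 9.26146 is the sample mean and E(i) is the random error.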
|
||
| Univariate Report |
It is sometimes useful and convenient to summarize the above
results in a report. The report for the heat flow meter data follows.
Analysis for heat flow meter data
1: Sample Size = 195
2: Location
Mean = 9.26146
Standard Deviation of Mean = 0.001632
95% Confidence Interval for Mean = (9.258242,9.264679)
Drift with respect to location? = NO
3: Variation
Standard Deviation = 0.022789
95% Confidence Interval for SD = (0.02073,0.025307)
Drift with respect to variation?
(based on Bartlett's test on quarters
of the data) = NO
4: Randomness
Autocorrelation = 0.280579
Data are Random?
(as measured by autocorrelation) = NO
5: Distribution
Normal PPCC = 0.998965
Data are Normal?
(as measured by Normal PPCC) = YES
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed normal)
Data Set is in Statistical Control? = YES
7: Outliers?
(as determined by Grubbs' test) = NO
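The location entries of the report can be reproduced from the sample size, mean, and standard deviation. In this Python sketch the t percent point `T_975` is an assumed approximate value for 193 degrees of freedom; it does not appear in the report and would normally be looked up in a table or computed with a statistics library.

```python
import math

# Values copied from the report above.
n = 195
mean = 9.26146
std_dev = 0.022789

# Standard deviation of the mean is s / sqrt(N).
sd_of_mean = std_dev / math.sqrt(n)

# ASSUMPTION: approximate 97.5% point of Student's t with N - 1 = 194
# degrees of freedom is indistinguishable at this precision from the
# value used here; it is not taken from the report.
T_975 = 1.9723

half_width = T_975 * sd_of_mean
ci_lower = mean - half_width
ci_upper = mean + half_width

print(f"standard deviation of mean = {sd_of_mean:.6f}")
print(f"95% confidence interval    = ({ci_lower:.6f}, {ci_upper:.6f})")
```

The results match the reported standard deviation of the mean and confidence interval up to rounding in the last printed digit.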
|
||