1.4.2.4. Josephson Junction Cryothermometry
|
|||
| Summary Statistics |
As a first step in the analysis, a table of summary statistics is
computed from the data. The following table, generated by
Dataplot, shows a typical set of
statistics.
SUMMARY
NUMBER OF OBSERVATIONS = 700
***********************************************************************
* LOCATION MEASURES * DISPERSION MEASURES *
***********************************************************************
* MIDRANGE = 0.2898500E+04 * RANGE = 0.7000000E+01 *
* MEAN = 0.2898562E+04 * STAND. DEV. = 0.1304969E+01 *
* MIDMEAN = 0.2898363E+04 * AV. AB. DEV. = 0.1058571E+01 *
* MEDIAN = 0.2899000E+04 * MINIMUM = 0.2895000E+04 *
* = * LOWER QUART. = 0.2898000E+04 *
* = * LOWER HINGE = 0.2898000E+04 *
* = * UPPER HINGE = 0.2899000E+04 *
* = * UPPER QUART. = 0.2899000E+04 *
* = * MAXIMUM = 0.2902000E+04 *
***********************************************************************
* RANDOMNESS MEASURES * DISTRIBUTIONAL MEASURES *
***********************************************************************
* AUTOCO COEF = 0.3148023E+00 * ST. 3RD MOM. = 0.6580265E-02 *
* = 0.0000000E+00 * ST. 4TH MOM. = 0.2825334E+01 *
* = 0.0000000E+00 * ST. WILK-SHA = -0.2272378E+02 *
* = * UNIFORM PPCC = 0.9554127E+00 *
* = * NORMAL PPCC = 0.9748405E+00 *
* = * TUK -.5 PPCC = 0.7935873E+00 *
* = * CAUCHY PPCC = 0.4231319E+00 *
***********************************************************************
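Several of the location and dispersion measures in the table above can be reproduced with a few lines of code. The following is a minimal sketch, not Dataplot's implementation. The function name `summary_statistics` and the sample values are hypothetical, and it assumes that the standard deviation uses the N - 1 denominator and that the average absolute deviation is taken about the mean.

```python
import statistics


def summary_statistics(y):
    """Return a few location and dispersion measures for a sequence y."""
    n = len(y)
    mean = statistics.fmean(y)
    return {
        "n": n,
        "midrange": (min(y) + max(y)) / 2,
        "mean": mean,
        "median": statistics.median(y),
        "range": max(y) - min(y),
        # Assumption: sample standard deviation (N - 1 denominator).
        "stand_dev": statistics.stdev(y),
        # Assumption: average absolute deviation about the mean.
        "av_ab_dev": sum(abs(v - mean) for v in y) / n,
        "minimum": min(y),
        "maximum": max(y),
    }


# Hypothetical stand-in sample; NOT the Josephson junction data.
sample = [2899, 2898, 2897, 2899, 2900, 2898, 2899, 2901]
stats = summary_statistics(sample)
```

As a consistency check on the table itself, the midrange is the average of the minimum and maximum: (2895 + 2902) / 2 = 2898.5, which agrees with the reported value.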
|
||
| Location |
One way to quantify a change in location over time is to
fit a straight line to the
data set using the index variable X = 1, 2, ..., N, with N denoting the
number of observations. If there is no significant drift in
the location, the slope parameter should be zero. For this data set,
Dataplot generates the following output:
LEAST SQUARES MULTILINEAR FIT
SAMPLE SIZE N = 700
NUMBER OF VARIABLES = 1
NO REPLICATION CASE
PARAMETER ESTIMATES (APPROX. ST. DEV.) T VALUE
1 A0 2898.19 (0.9745E-01) 0.2974E+05
2 A1 X 0.107075E-02 (0.2409E-03) 4.445
RESIDUAL STANDARD DEVIATION = 1.287802
RESIDUAL DEGREES OF FREEDOM = 698
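The slope parameter, A1, has a
t value of 4.445, which is
statistically significant (the critical value at the 5% level is
approximately 1.96 for 698 degrees of freedom). However, the
estimated slope is only 0.0011 per observation, which amounts to a
fitted change of about 0.75 over all 700 observations, less than the
residual standard deviation of 1.29. Given that the slope is nearly
zero, the assumption of constant location is not seriously violated
even though the slope is statistically significant.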
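The fit of the data against the index variable can be sketched as follows. This is an illustrative ordinary least-squares calculation, not Dataplot's code, and the function name `fit_line_against_index` is hypothetical. The last two lines use only numbers printed in the Dataplot output above: the ratio of the A1 estimate to its approximate standard deviation, and the fitted change between X = 1 and X = 700.

```python
import math


def fit_line_against_index(y):
    """Least-squares fit of y against x = 1..N.

    Returns (a0, a1, se_a1, t_a1): intercept, slope, standard error of
    the slope, and the slope's t value.
    """
    n = len(y)
    x = range(1, n + 1)
    x_bar = (n + 1) / 2
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    # Residual standard deviation uses N - 2 degrees of freedom.
    sse = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))
    se_a1 = resid_sd / math.sqrt(sxx)
    return a0, a1, se_a1, a1 / se_a1


# Values copied from the Dataplot output above.
t_from_output = 0.107075e-02 / 0.2409e-03   # slope / approx. st. dev.
total_drift = 0.107075e-02 * (700 - 1)      # fitted change over the run
```

The ratio evaluates to about 4.44, matching the T VALUE column, and the fitted change over the run is about 0.75, which is smaller than the residual standard deviation of 1.29.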
|
||
| Variation |
One simple way to detect a change in variation is with a
Bartlett test after dividing the
data set into several equal-sized intervals. However, the Bartlett
test is not robust to non-normality. Since the nature of the data
(a few distinct points repeated many times) makes the normality
assumption questionable,
we use the alternative Levene
test. In particular, we use the Levene test based on the median
rather than the mean. The choice of the number of intervals is somewhat
arbitrary, although values of 4 or 8 are reasonable. Dataplot
generated the following output for the Levene test.
LEVENE F-TEST FOR SHIFT IN VARIATION
(ASSUMPTION: NORMALITY)
1. STATISTICS
NUMBER OF OBSERVATIONS = 700
NUMBER OF GROUPS = 4
LEVENE F TEST STATISTIC = 1.432365
2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
FOR LEVENE TEST STATISTIC
0 % POINT = 0.000000
50 % POINT = 0.7894323
75 % POINT = 1.372513
90 % POINT = 2.091688
95 % POINT = 2.617726
99 % POINT = 3.809943
99.9 % POINT = 5.482234
76.79006 % Point: 1.432365
3. CONCLUSION (AT THE 5% LEVEL):
THERE IS NO SHIFT IN VARIATION.
THUS THE GROUPS ARE HOMOGENEOUS WITH RESPECT TO VARIATION.
Since the Levene test statistic value of 1.43 is less than the
95% critical value of 2.62, we conclude that the standard
deviations are not significantly different in the 4 intervals.
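The median-based Levene statistic described above can be sketched as a one-way analysis of variance on the absolute deviations from each group's median. This is an illustrative version under stated assumptions, not Dataplot's implementation: the function name `levene_median` is hypothetical, the groups are taken as consecutive equal-sized blocks, and any leftover points at the end of the series are ignored.

```python
import statistics


def levene_median(y, n_groups=4):
    """Median-based Levene F statistic on consecutive equal-sized groups."""
    size = len(y) // n_groups  # leftover points at the end are ignored
    groups = [y[i * size:(i + 1) * size] for i in range(n_groups)]

    # Absolute deviation of each point from its own group's median.
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(v - med) for v in g])

    n_total = size * n_groups
    z_means = [statistics.fmean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total

    # One-way ANOVA F ratio computed on the absolute deviations.
    between = sum(size * (m - z_grand) ** 2 for m in z_means)
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (between / (n_groups - 1)) / (within / (n_total - n_groups))


# Hypothetical example: the second half is ten times more spread out.
f_stat = levene_median([0, 1, 2, 3, 4, 0, 10, 20, 30, 40], n_groups=2)
```

For the 700 observations split into 4 groups, the resulting statistic would be compared with the 95% point of the reference distribution listed in the output above (2.617726).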
|
||
| Randomness |
There are many ways in which data can be non-random. However,
most common forms of non-randomness can be detected with a
few simple tests. The lag plot in the
previous section is a simple graphical technique.
Another check is an autocorrelation plot that shows the autocorrelations for various lags. Confidence bands can be plotted at the 95% and 99% confidence levels, and points outside these bands indicate statistically significant values (the lag 0 autocorrelation is always 1 and is not of interest). Dataplot generated the following autocorrelation plot.
The lag 1 autocorrelation, which is generally the one of most interest, is 0.31. The critical values at the 5% level of significance are -0.087 and 0.087. This indicates that the lag 1 autocorrelation is statistically significant, so there is some evidence for non-randomness.
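The lag 1 autocorrelation quoted above is the usual sample autocorrelation coefficient. A minimal sketch follows; the function name `autocorrelation` is hypothetical, and the comparison at the end simply reuses the values reported in the text (0.3148023 from the summary table and the critical value 0.087).

```python
def autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag (lag 0 gives 1.0)."""
    n = len(y)
    y_bar = sum(y) / n
    denom = sum((v - y_bar) ** 2 for v in y)
    numer = sum((y[i] - y_bar) * (y[i + lag] - y_bar) for i in range(n - lag))
    return numer / denom


# Values reported in the text; not recomputed from the raw data.
r1_reported = 0.3148023
critical = 0.087
is_significant = abs(r1_reported) > critical
```

Calling `autocorrelation(y, lag=k)` for k = 1, 2, ... gives the heights plotted in an autocorrelation plot.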
A common test for randomness is the
runs test.
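One widely used form of the runs test counts runs above and below the median and compares the count with its expected value under randomness. The sketch below implements that variant as an illustration. It is an assumption that this is a useful stand-in here: Dataplot's runs test may use a different definition of a run, and the function name `runs_test_z` is hypothetical.

```python
import math
import statistics


def runs_test_z(y):
    """Z statistic for the number of runs above/below the median."""
    med = statistics.median(y)
    # Values equal to the median are dropped (a common convention).
    signs = [v > med for v in y if v != med]
    n1 = sum(signs)            # points above the median
    n2 = len(signs) - n1       # points below the median
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)


# Hypothetical sequences: strict alternation versus one long block of each.
z_alternating = runs_test_z([1, 2] * 10)
z_blocked = runs_test_z([1] * 10 + [2] * 10)
```

A large positive value indicates too many runs (rapid oscillation) and a large negative value indicates too few (clustering or drift). With data that take only a few distinct values, many observations equal the median and are dropped, which is one way discreteness affects this kind of test.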
Although the runs test and lag 1 autocorrelation indicate some mild non-randomness, it is not sufficient to reject the Yi = C + Ei model. At least part of the non-randomness can be explained by the discrete nature of the data. |
||
| Distributional Analysis |
Probability plots are a graphical test for assessing if a
particular distribution provides an adequate fit to a data
set.
A quantitative enhancement to the probability plot is the correlation coefficient of the points on the probability plot. For this data set the correlation coefficient is 0.975. Since this is less than the critical value of 0.987 (this is a tabulated value), the normality assumption is rejected.
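The probability plot correlation coefficient can be sketched as the Pearson correlation between the sorted data and approximate normal order statistic medians. The version below uses Filliben's approximation for the uniform order statistic medians; whether Dataplot uses exactly these plotting positions is an assumption, and the function name `normal_ppcc` is hypothetical.

```python
import math
from statistics import NormalDist


def normal_ppcc(y):
    """Correlation between sorted y and approximate normal order medians."""
    n = len(y)
    # Filliben's approximation to the uniform order statistic medians.
    u = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    u[-1] = 0.5 ** (1 / n)
    u[0] = 1 - u[-1]
    x = [NormalDist().inv_cdf(p) for p in u]
    ys = sorted(y)
    x_bar, y_bar = sum(x) / n, sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, ys))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical inputs: near-normal quantiles versus a two-valued sample.
near_normal = [2898.5 + 1.3 * NormalDist().inv_cdf((i - 0.5) / 200)
               for i in range(1, 201)]
two_valued = [0] * 50 + [1] * 50
```

The resulting coefficient is compared with a tabulated critical value (0.987 in the text above); values below the critical value lead to rejecting normality. Heavily discretized data, such as the two-valued example, give a noticeably lower coefficient.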
Chi-square and
Kolmogorov-Smirnov
goodness-of-fit tests are
alternative methods for assessing distributional adequacy.
The Wilk-Shapiro
and Anderson-Darling
tests can be used to test for normality. Dataplot generates the
following output for the Anderson-Darling normality test.
Although the data are not strictly normal, the violation of the normality assumption is not severe enough to conclude that the Yi = C + Ei model is unreasonable. At least part of the non-normality can be explained by the discrete nature of the data. |
||
| Outlier Analysis |
A test for outliers is the Grubbs
test. Dataplot generated the following output for Grubbs' test.
GRUBBS TEST FOR OUTLIERS
(ASSUMPTION: NORMALITY)
1. STATISTICS:
NUMBER OF OBSERVATIONS = 700
MINIMUM = 2895.000
MEAN = 2898.562
MAXIMUM = 2902.000
STANDARD DEVIATION = 1.304969
GRUBBS TEST STATISTIC = 2.729201
2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
FOR GRUBBS TEST STATISTIC
0 % POINT = 0.000000
50 % POINT = 3.371397
75 % POINT = 3.554906
90 % POINT = 3.784969
95 % POINT = 3.950619
97.5 % POINT = 4.109569
99 % POINT = 4.311552
100 % POINT = 26.41972
3. CONCLUSION (AT THE 5% LEVEL):
THERE ARE NO OUTLIERS.
For this data set, Grubbs' test does not detect any outliers at
the 10%, 5%, and 1% significance levels.
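The Grubbs statistic is the largest absolute deviation from the sample mean, expressed in units of the sample standard deviation. The sketch below shows the calculation; the function name `grubbs_statistic` is hypothetical. The second part recomputes the statistic from the rounded summary values printed in the output above, so small differences in the last digits are expected.

```python
import statistics


def grubbs_statistic(y):
    """Largest absolute deviation from the mean, in sample SD units."""
    mean = statistics.fmean(y)
    sd = statistics.stdev(y)
    return max(abs(v - mean) for v in y) / sd


# Rounded summary values copied from the Dataplot output above.
minimum, mean, maximum, sd = 2895.000, 2898.562, 2902.000, 1.304969
g_from_output = max(mean - minimum, maximum - mean) / sd
```

The recomputed value is about 2.73, in agreement with the reported test statistic of 2.729201 and well below the 95% point of 3.950619. The statistic is driven by the minimum (2895), which lies slightly farther from the mean than the maximum does.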
|
||
| Model |
Although the randomness and normality assumptions were
mildly violated, we conclude that a reasonable model for the
data is:
Yi = C + Ei
where C is estimated by the sample mean, 2898.562.
|
||
| Univariate Report |
It is sometimes useful and convenient to summarize the above
results in a report.
Analysis for Josephson Junction Cryothermometry Data
1: Sample Size = 700
2: Location
Mean = 2898.562
Standard Deviation of Mean = 0.049323
95% Confidence Interval for Mean = (2898.465,2898.658)
Drift with respect to location? = YES
(Further analysis indicates that
the drift, while statistically
significant, is not practically
significant)
3: Variation
Standard Deviation = 1.30497
95% Confidence Interval for SD = (1.240007,1.377169)
Drift with respect to variation?
(based on Levene's test on quarters
of the data) = NO
4: Distribution
Normal PPCC = 0.97484
Data are Normal?
(as measured by Normal PPCC) = NO
5: Randomness
Autocorrelation = 0.314802
Data are Random?
(as measured by autocorrelation) = NO
6: Statistical Control
(i.e., no drift in location or scale,
data are random, distribution is
fixed, here we are testing only for
fixed normal)
Data Set is in Statistical Control? = NO
Note: Although we have violations of
the assumptions, they are mild enough,
and at least partially explained by the
discrete nature of the data, so we may model
the data as if it were in statistical
control
7: Outliers?
(as determined by Grubbs test) = NO
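The standard deviation of the mean and the confidence interval for the mean in item 2 of the report follow directly from the sample size, mean, and standard deviation. The sketch below reproduces them from the rounded values in the report. As a stated assumption, it uses the normal quantile in place of the t quantile with 699 degrees of freedom, which differs only slightly at this sample size.

```python
import math
from statistics import NormalDist

# Rounded values copied from the report above.
n, mean, sd = 700, 2898.562, 1.304969

sd_of_mean = sd / math.sqrt(n)

# Assumption: normal quantile as a stand-in for the t quantile (699 d.f.).
z = NormalDist().inv_cdf(0.975)
ci_mean = (mean - z * sd_of_mean, mean + z * sd_of_mean)
```

This gives a standard deviation of the mean of about 0.0493 and an interval of roughly (2898.465, 2898.659), which agrees with the report to within rounding of the inputs. The interval for the standard deviation requires chi-square quantiles and is not reproduced here.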
|
||