Purpose:
Check
underlying
statistical assumptions
|
The 4-plot is a collection of 4 specific
EDA graphical techniques whose purpose is to test the assumptions
which underlie most measurement processes.
A 4-plot consists of a
- run sequence plot;
- lag plot;
- histogram;
- normal probability plot.
If the 4 underlying assumptions of a typical measurement process hold,
then the above 4 plots will have a characteristic appearance
(see the normal random numbers example below);
if any of the underlying assumptions fail to hold, then the failure
will be revealed by an anomalous appearance in one or more
of the plots.
|
Sample Plot:
This 4-plot reveals a process which
has fixed location, fixed variation,
is non-random (oscillatory),
has a non-normal, U-shaped distribution, and has
3 outliers.
|
|
Importance:
Testing the 4 underlying assumptions
helps ensure the validity
of the final scientific and
engineering conclusions
|
There are 4 assumptions which typically
underlie all measurement processes, namely,
that the data from the
process at hand "behave like":
- random drawings;
- from a fixed distribution;
- with that distribution having a fixed location; and
- with that distribution having a fixed variation.
Predictability is an all-important goal in science and engineering.
If the above 4 assumptions hold, then we have achieved
probabilistic predictability--the ability to make
probability statements not only about the process
in the past, but also about the process in the future.
In short, such processes
are said to be "statistically in control".
If the 4 assumptions do not hold, then we have a process
which is drifting (with respect to location, variation,
or distribution), is unpredictable, and is out of control.
A simple characterization of such processes by a location estimate, a variation estimate, or a distribution "estimate"
inevitably leads to optimistic and grossly invalid engineering conclusions.
Because the
validity of the final scientific and engineering conclusions
is inextricably linked to the validity of these
same 4 underlying assumptions, each of the 4 assumptions
should be routinely tested. The 4-plot (run sequence plot, lag plot,
histogram, and normal probability plot) is a simple, efficient, and powerful
way of carrying out this routine checking.
|
Interpretation:
Flat, equi-banded,
random, bell-shaped,
and linear
|
Of the 4 underlying assumptions:
- If the fixed location assumption holds,
then the run sequence plot will be flat and non-drifting.
- If the fixed variation assumption holds,
then the vertical spread in the run sequence plot
will be approximately the same over the entire horizontal axis.
- If the randomness assumption holds,
then the lag plot will be structureless
and random.
- If the fixed distribution assumption holds
(in particular, if the fixed normal distribution holds),
then the histogram will be bell-shaped and the normal
probability plot will be linear.
If all 4 of the assumptions hold,
then the process is "statistically in control".
In practice, many processes fall short of achieving this ideal.
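The visual checks above can be complemented by a few rough numbers. The following is a minimal Python sketch and is not part of the original Dataplot tooling. The function name `assumption_summary` and the choice to compare the first half of the data with the second half are assumptions made purely for illustration; the numbers are a companion to the plots, not a substitute for them.

```python
from statistics import mean, stdev


def assumption_summary(y):
    """Rough numeric companions to the run sequence plot and lag plot.

    Hypothetical helper: splitting the data into halves is an
    illustrative choice, not a prescribed procedure.
    """
    n = len(y)
    if n < 4:
        raise ValueError("need at least four observations")

    half = n // 2
    first, second = y[:half], y[half:]

    # Fixed location: the two half-means should be close (flat run sequence plot).
    location_shift = mean(second) - mean(first)

    # Fixed variation: the ratio of half standard deviations should be near 1.
    s1, s2 = stdev(first), stdev(second)
    spread_ratio = s2 / s1 if s1 > 0 else float("inf")

    # Randomness: lag-1 autocorrelation should be near 0 (structureless lag plot).
    ybar = mean(y)
    num = sum((y[i] - ybar) * (y[i - 1] - ybar) for i in range(1, n))
    den = sum((v - ybar) ** 2 for v in y)
    lag1 = num / den if den > 0 else 0.0

    return {
        "location_shift": location_shift,
        "spread_ratio": spread_ratio,
        "lag1_autocorrelation": lag1,
    }
```

A steadily drifting series such as 1, 2, ..., 8 yields a large location shift and a strongly positive lag-1 autocorrelation, while an alternating series yields a negative one. How close to 0 or 1 these values must be depends on the sample size, so they are best read alongside the plots.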
|
|
Questions
|
4-plots can provide answers to many questions:
- Is the process in-control, stable, and predictable?
- Is the process drifting with respect to location?
- Is the process drifting with respect to variation?
- Is the data random?
- Is an observation related to an adjacent observation?
- If a time series, is it white noise?
- If not white noise, is it sinusoidal, autoregressive, etc.?
- If non-random, what is a better model?
- Does the process follow a normal distribution?
- If non-normal, what distribution does the process follow?
- Is the model Y = constant + error valid and sufficient?
- If the default model is insufficient, what is a better model?
- Is the formula SD(Ybar) = SD(Y) / sqrt(n) valid?
- Is the sample mean a good estimator of the process location?
- If not, what would be a better estimator?
- Are there any outliers?
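The question about the standard deviation of the sample mean is tied directly to the randomness assumption. As a short worked derivation (the symbols n for the sample size and sigma for the common standard deviation of the Y(i) are notation assumed here), the variance of the mean under fixed location and fixed variation is:

```latex
\operatorname{Var}(\bar{Y})
  = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(Y_i, Y_j)
  = \frac{\sigma^{2}}{n}
    + \frac{2}{n^{2}} \sum_{i<j} \operatorname{Cov}(Y_i, Y_j)
```

When the observations are uncorrelated, the second term vanishes and SD(Ybar) = sigma / sqrt(n). When adjacent observations are positively correlated, as a structured lag plot would suggest, the second term is positive and the simple formula understates the uncertainty of the mean.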
|
Input:
1 Variable: Y
|
A single variable Y--that is, a single column of numbers.
These numbers need not have been collected equi-spaced in time.
|
Definition:
1. Run sequence plot;
2. Lag plot;
3. Histogram;
4. Normal probability plot
|
The 4-plot consists of the following:
- Run sequence plot (to test fixed location & variation)
  - Vertically: Y(i)
  - Horizontally: i
- Lag plot (to test randomness)
  - Vertically: Y(i)
  - Horizontally: Y(i-1)
- Histogram (to test (normal) distribution)
  - Vertically: counts
  - Horizontally: Y
- Normal probability plot (to test (normal) distribution)
  - Vertically: ordered Y(i)
  - Horizontally: theoretical values from the normal N(0,1) distribution for the ordered Y(i)
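The axis definitions above can be sketched in code. The following minimal Python example computes the point sets behind each of the 4 panels using only the standard library; drawing them is left to any plotting tool. The function name `four_plot_coordinates`, the default of 10 equal-width histogram bins, and the plotting positions (i - 0.5)/n for the normal probability plot are all assumptions of this sketch, not the definitions used by Dataplot's 4-PLOT command.

```python
from statistics import NormalDist


def four_plot_coordinates(y, bins=10):
    """Return the coordinates behind each panel of a 4-plot (hypothetical helper)."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")

    # Run sequence plot: Y(i) vertically against the index i horizontally.
    run_sequence = [(i + 1, y[i]) for i in range(n)]

    # Lag plot: Y(i) vertically against Y(i-1) horizontally, stored as (Y(i-1), Y(i)).
    lag = [(y[i - 1], y[i]) for i in range(1, n)]

    # Histogram: counts in equal-width bins spanning [min(y), max(y)].
    lo, hi = min(y), max(y)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for value in y:
        k = min(int((value - lo) / width), bins - 1)  # maximum goes in the last bin
        counts[k] += 1

    # Normal probability plot: ordered Y vertically against N(0,1) values.
    # ASSUMPTION: plotting positions (i - 0.5) / n, chosen for simplicity.
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    probability = list(zip(theoretical, sorted(y)))

    return {
        "run_sequence": run_sequence,
        "lag": lag,
        "histogram": counts,
        "normal_probability": probability,
    }
```

Each entry of the returned dictionary holds (horizontal, vertical) pairs, except the histogram, which holds one count per bin. For a single column of numbers `y`, `four_plot_coordinates(y)["lag"]` gives the n - 1 pairs that a lag plot would display.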
|
|
Examples:
|
- 1. Normal random numbers (the ideal)
- 2. Uniform random numbers
- 3. Random walk
- 4. Flicker noise
- 5. Josephson junction cryothermometry
- 6. Beam deflections
- 7. Filter transmittance
- 8. Standard resistor
- 9. Heat flow meter 1
- 10. Heat flow meter 2
- 11. Heat flow meter 3
|
Stat Category:
Time Series & Univariate
|
The 4-plot should be used for the routine analysis of both
univariate data sets and time series.
Its scope is much broader, however, since the 4-plot can be
applied to the residuals from any regression or ANOVA to ascertain
whether the broader underlying assumptions hold for these
more general categories.
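As an illustration of the residual case, the following minimal Python sketch fits a straight line by ordinary least squares and returns the residuals, which would then take the place of Y in the 4 panels. The function name `linear_fit_residuals` and the restriction to a single predictor are assumptions made for this example; any regression or ANOVA fit would serve the same role.

```python
def linear_fit_residuals(x, y):
    """Residuals from an ordinary least squares line y = a + b*x (hypothetical helper)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")

    # Closed-form least squares estimates of slope and intercept.
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar

    # Residual = observed value minus fitted value, kept in the original run order.
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

Keeping the residuals in the original run order matters, because the run sequence plot and lag plot are only meaningful when the order of collection is preserved.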
|
Usage:
Moderate
|
The 4-plot receives moderate usage. Given the importance
of assumption testing to the validity of the final scientific and
engineering conclusions, the 4-plot should receive much heavier usage.
The 4-plot is arguably the most important EDA tool for
checking underlying assumptions in the univariate case.
|
|
Other Issues
|
|
|
References
|
Filliben ...
|
|
Dataplot Command
|
4-PLOT
|
|
Related Techniques
|
Run Sequence Plot
Lag Plot
Histogram
Normal Probability Plot
Autocorrelation Plot
Spectral Plot
PPCC Plot
|
|