1. Exploratory Data Analysis
1.4. EDA Case Studies
1.4.1. Case Studies Introduction
|
Purpose
|
The first eight case studies show how EDA graphics and quantitative
measures and tests are applied to data from scientific processes, and
how those data are critiqued against the assumptions that typically
underlie a measurement process; namely, that the data behave like:
- random drawings
- from a fixed distribution
- with a fixed location
- with a fixed standard deviation
Case studies 9 and 10 show the use of EDA techniques in distributional
modeling and the analysis of a designed experiment, respectively.
|
\( Y_i = C + E_i \)
|
If the above assumptions are satisfied, the process is said to be
statistically "in control" with the core characteristic of having
"predictability". That is, probability statements can be made about
the process, not only in the past, but also in the future.
An appropriate model for an "in control" process is
\( Y_i = C + E_i \)
where C is a constant (the "deterministic" or
"structural" component), and where \(E_i\) is the
error term (or "random" component).
The constant C is the average value of the process--it
is the primary summary number which shows up on any report. Although
C is (assumed) fixed, it is unknown, and so a primary
analysis objective of the engineer is to arrive at an estimate of
C.
This goal partitions into 4 sub-goals:
- Is the most common estimator of C,
\(\bar{Y}\), the best estimator for C? What does
"best" mean?
- If
\(\bar{Y}\) is best, what is the uncertainty
\(s_{\bar{Y}}\) for
\(\bar{Y}\)? In particular, is the usual formula for the
uncertainty of \(\bar{Y}\):
\( s_{\bar{Y}} = s/\sqrt{N} \)
valid? Here, s is the standard deviation of
the data and N is the sample size.
- If
\(\bar{Y}\) is not the best estimator for C,
what is a better estimator for C (for example,
median, midrange, midmean)?
- If there is a better estimator,
\(\hat{C}\), what is its uncertainty? That is, what is
\(s_{\hat{C}}\)?
EDA and the routine checking of underlying assumptions
provide insight into all of the above.
- Location and
variation checks provide
information as to whether C is really constant.
- Distributional checks indicate whether
\(\bar{Y}\) is the best estimator. Techniques for
distributional checking include
histograms,
normal probability plots,
and probability
plot correlation coefficient plots.
- Randomness checks ascertain whether the usual
\( s_{\bar{Y}} = s/\sqrt{N} \)
is valid.
- Distributional tests assist in determining a better estimator,
if needed.
- Simulation tools (namely
bootstrapping) provide
values for the uncertainty of alternative estimators.
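The estimators and uncertainties discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the handbook's own procedure: the data are simulated (C = 10 with normal errors of standard deviation 2 is an assumed example), and the function names mean_and_standard_error, bootstrap_uncertainty, and midrange are hypothetical helpers introduced here.

```python
import math
import random
import statistics


def mean_and_standard_error(data):
    """Return (Y-bar, s / sqrt(N)) for a sequence of measurements."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation (N - 1 divisor)
    return statistics.fmean(data), s / math.sqrt(n)


def bootstrap_uncertainty(data, estimator, n_resamples=2000, seed=0):
    """Standard deviation of `estimator` over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


def midrange(data):
    """Average of the smallest and largest observation."""
    return (min(data) + max(data)) / 2


# Assumed example data: C = 10 plus normal noise with standard deviation 2.
rng = random.Random(42)
y = [10 + rng.gauss(0, 2) for _ in range(200)]

y_bar, se_mean = mean_and_standard_error(y)
boot_mean = bootstrap_uncertainty(y, statistics.fmean)
boot_median = bootstrap_uncertainty(y, statistics.median)
boot_midrange = bootstrap_uncertainty(y, midrange)

print(f"mean = {y_bar:.3f}")
print(f"s/sqrt(N) for the mean      = {se_mean:.3f}")
print(f"bootstrap uncertainty, mean     = {boot_mean:.3f}")
print(f"bootstrap uncertainty, median   = {boot_median:.3f}")
print(f"bootstrap uncertainty, midrange = {boot_midrange:.3f}")
```

For random data such as these, the bootstrap value for the mean should land close to \( s/\sqrt{N} \). The same resampling loop gives an uncertainty for the median or midrange, for which no equally simple formula is at hand.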
|
Assumptions not satisfied
|
If one or more of the above assumptions is not satisfied, then
we use EDA techniques, or some mix of EDA and classical techniques,
to find a more appropriate model for the data. That is,
\( Y_i = D + E_i \)
where D is the deterministic part and \(E_i\)
is an error component.
If the data are not random, then we may investigate fitting
some simple time series models to the data. If the constant location
and scale assumptions are violated, we may need to investigate the
measurement process to see if there is an explanation.
The assumptions on the error term are still quite relevant in the
sense that for an appropriate model the error component should follow
the assumptions. The criterion for validating the model, or
comparing competing models, is framed in terms of these assumptions.
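As a rough sketch of this idea, the code below fits one simple time series model to non-random data and then checks whether the residuals look random. The choice of a first-order autoregressive model fitted by least squares, the simulated random walk, and the helper names lag1_autocorrelation and fit_ar1 are all assumptions made for illustration.

```python
import random
import statistics


def lag1_autocorrelation(data):
    """Lag-1 sample autocorrelation coefficient."""
    mean = statistics.fmean(data)
    num = sum((a - mean) * (b - mean) for a, b in zip(data, data[1:]))
    den = sum((y - mean) ** 2 for y in data)
    return num / den


def fit_ar1(data):
    """Least-squares fit of Y_i = a0 + a1 * Y_{i-1} + E_i.

    Returns (a0, a1, residuals), where the residuals estimate E_i.
    """
    x, y = data[:-1], data[1:]
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    residuals = [yi - a0 - a1 * xi for xi, yi in zip(x, y)]
    return a0, a1, residuals


# Assumed example data: a random walk with standard normal steps.
rng = random.Random(7)
walk, position = [], 0.0
for _ in range(500):
    position += rng.gauss(0, 1)
    walk.append(position)

a0, a1, residuals = fit_ar1(walk)
r_raw = lag1_autocorrelation(walk)
r_resid = lag1_autocorrelation(residuals)
print(f"lag-1 autocorrelation of the data      = {r_raw:.3f}")
print(f"fitted a1                              = {a1:.3f}")
print(f"lag-1 autocorrelation of the residuals = {r_resid:.3f}")
```

The raw data are strongly autocorrelated, so they fail the randomness assumption. The residuals from the fitted model should show little autocorrelation, which is the sense in which the error component of an appropriate model follows the assumptions.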
|
Multivariable data
|
Although the case studies in this chapter utilize univariate data,
the assumptions above are relevant for multivariable data as well.
If the data are not univariate, then we are trying to find a model
\( Y_i = F(X_1, \ldots, X_k) + E_i \)
where F is some function based on one or more variables.
The error component, which is a univariate data set, of a good model
should satisfy the assumptions given above. The criterion for
validating and comparing models is based on how well the error
component follows these assumptions.
The load cell calibration
case study in the process modeling chapter shows an example of this
in the regression context.
|
First three case studies utilize data with known characteristics
|
The first three case studies utilize data that are randomly
generated from the following distributions:
- normal distribution with mean 0 and standard deviation 1
- uniform distribution with mean 0.5 and standard deviation
\(\sqrt{1/12}\) (uniform over the interval (0,1))
- random walk
The other univariate case studies utilize data from scientific
processes. The goal is to determine if
\( Y_i = C + E_i \)
is a reasonable model. This is done by testing the underlying
assumptions. If the assumptions are satisfied, then an estimate of
C and an estimate of the uncertainty of C
are computed. If the assumptions are not satisfied, we attempt to
find a model where the error component does satisfy the underlying
assumptions.
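Data sets with the three known characteristics listed above can be generated along the following lines. This is only a sketch: the sample size, the seed, the function name make_case_study_data, and the construction of the random walk as a cumulative sum of uniform steps centered on zero are assumptions, not details given in this section.

```python
import random
import statistics


def make_case_study_data(n=500, seed=1):
    """Return (normal, uniform, walk) lists of length n."""
    rng = random.Random(seed)
    normal = [rng.gauss(0.0, 1.0) for _ in range(n)]   # mean 0, standard deviation 1
    uniform = [rng.random() for _ in range(n)]         # uniform over (0,1)
    # Random walk: running sum of uniform steps shifted to lie in [-0.5, 0.5).
    walk, position = [], 0.0
    for _ in range(n):
        position += rng.random() - 0.5
        walk.append(position)
    return normal, uniform, walk


normal, uniform, walk = make_case_study_data()
for name, data in [("normal", normal), ("uniform", uniform), ("random walk", walk)]:
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    print(f"{name:12s} mean = {mean:7.3f}  standard deviation = {sd:6.3f}")
```

The normal and uniform samples should give a mean and standard deviation near their theoretical values (0 and 1, and 0.5 and \(\sqrt{1/12}\), respectively). The random walk has no fixed location, so its summary values depend heavily on the particular realization.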
|
Graphical methods that are applied to the data
|
To test the underlying assumptions, each data set is analyzed using
four graphical methods that are particularly suited for this purpose:
- run sequence plot
which is useful for detecting shifts of location or scale
- lag plot
which is useful for detecting non-randomness in the data
- histogram which is
useful for trying to determine the underlying distribution
- normal probability plot
for deciding whether the data follow the normal distribution
There are a number of other techniques for addressing the underlying
assumptions. However, the four plots listed above together address
all of the assumptions on a single page of graphics.
Additional graphical techniques are used in certain case studies
to develop models that do have error components that satisfy the
underlying assumptions.
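The coordinates behind the four plots can be computed as sketched below, leaving the drawing to whatever plotting tool is at hand. The function names four_plot_coordinates and pearson_r are hypothetical. The equal-width histogram bins and the plotting positions (i - 0.5)/N for the normal probability plot are common conventions assumed here; they are not prescribed by this section.

```python
import math
import random
import statistics


def four_plot_coordinates(data, n_bins=10):
    """Return point lists for the run sequence plot, lag plot, histogram,
    and normal probability plot. Assumes the data are not all identical."""
    n = len(data)
    run_sequence = list(enumerate(data, start=1))   # (i, Y_i)
    lag = list(zip(data[:-1], data[1:]))            # (Y_{i-1}, Y_i)

    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for y in data:
        k = min(int((y - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[k] += 1
    histogram = [(lo + (k + 0.5) * width, c) for k, c in enumerate(counts)]

    std_normal = statistics.NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    probability = list(zip(quantiles, sorted(data)))  # (normal quantile, ordered Y)
    return run_sequence, lag, histogram, probability


def pearson_r(points):
    """Correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Assumed example data: 200 standard normal random numbers.
rng = random.Random(11)
data = [rng.gauss(0, 1) for _ in range(200)]
run_sequence, lag, histogram, probability = four_plot_coordinates(data)

print(f"lag plot correlation                = {pearson_r(lag):.3f}")
print(f"normal probability plot correlation = {pearson_r(probability):.3f}")
```

A lag plot with no structure (correlation near zero) supports randomness, and a nearly linear normal probability plot (correlation near one) supports normality. The run sequence and histogram points are read by eye for shifts in location or scale and for distributional shape.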
|
Quantitative methods that are applied to the data
|
The normal and uniform random number data sets are also analyzed with
the following quantitative techniques, which are explained in more
detail in an earlier section:
- Summary statistics which include:
- Linear fit of the data as a function of time to assess
drift (test for fixed location)
- Bartlett test
for fixed variance
- Autocorrelation plot
and coefficient to test for randomness
- Runs test
to test for lack of randomness
- Anderson-Darling test
for a normal distribution
- Grubbs test
for outliers
- Summary report
Although the graphical methods applied to the normal and uniform
random numbers are sufficient to assess the validity of the
underlying assumptions, the quantitative techniques are
included to contrast the flavor of the graphical and
quantitative approaches.
The remaining case studies intermix one or more of these
quantitative techniques into the analysis where appropriate.
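Two of the quantitative checks listed above can be sketched as follows: a linear fit against run order to assess drift, and a runs test for randomness. The helper names are hypothetical, the data are simulated, and the runs test shown is the variant that counts runs above and below the median, which may differ from the form used in the case studies.

```python
import math
import random
import statistics


def drift_t_statistic(data):
    """t statistic for the slope of a least-squares line of Y against run order."""
    n = len(data)
    x = range(1, n + 1)
    x_bar = (n + 1) / 2
    y_bar = statistics.fmean(data)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, data))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, data))
    slope_se = math.sqrt(sse / (n - 2) / sxx)
    return slope / slope_se


def runs_z_statistic(data):
    """z statistic for the number of runs above and below the median.

    Assumes some observations fall on each side of the median.
    """
    median = statistics.median(data)
    signs = [y > median for y in data if y != median]
    n_above = sum(signs)
    n_below = len(signs) - n_above
    n = n_above + n_below
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    return (runs - expected) / math.sqrt(variance)


# Assumed example data: steady noise, the same noise with a drift, and a random walk.
rng = random.Random(3)
steady = [rng.gauss(0, 1) for _ in range(300)]
drifting = [0.01 * i + e for i, e in enumerate(steady, start=1)]
walk, position = [], 0.0
for _ in range(300):
    position += rng.gauss(0, 1)
    walk.append(position)

for name, data in [("steady", steady), ("drifting", drifting), ("random walk", walk)]:
    print(f"{name:12s} drift t = {drift_t_statistic(data):8.2f}  "
          f"runs z = {runs_z_statistic(data):8.2f}")
```

Values of either statistic far from zero (beyond about 2 in absolute value at roughly the 5 % level) point to a violated assumption: a large drift t suggests the location is not fixed, and a large negative runs z suggests too few runs for random data.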
|