1.4.1. Case Studies Introduction

Purpose

The purpose of the first eight case studies is to show how EDA graphics and quantitative measures and tests are applied to data from scientific processes and to critique those data with regard to the following assumptions that typically underlie a measurement process; namely, that the data behave like:

random drawings
from a fixed distribution
with a fixed location
with a fixed standard deviation

Case studies 9 and 10 show the use of EDA techniques in distributional modeling and the analysis of a designed experiment, respectively.

Y_i = C + E_i

If the above assumptions are satisfied, the process is said to be statistically "in control", with "predictability" as its core characteristic. That is, probability statements can be made about the process, not only in the past, but also in the future.

An appropriate model for an "in control" process is

Y_i = C + E_i

where C is a constant (the "deterministic" or "structural" component), and where E_i is the error term (or "random" component).

The constant C is the average value of the process; it is the primary summary number that shows up on any report. Although C is (assumed) fixed, it is unknown, and so a primary analysis objective of the engineer is to arrive at an estimate of C.

This goal partitions into 4 sub-goals:

Is the most common estimator of C, \(\bar{Y}\), the best estimator for C? What does "best" mean?
If \(\bar{Y}\) is best, what is the uncertainty \(s_{\bar{Y}}\) for \(\bar{Y}\)? In particular, is the usual formula for the uncertainty of \(\bar{Y}\), \(s_{\bar{Y}} = s/\sqrt{N}\), valid? Here, s is the standard deviation of the data and N is the sample size.
If \(\bar{Y}\) is not the best estimator for C, what is a better estimator for C (for example, median, midrange, midmean)?
If there is a better estimator, \(\hat{C}\), what is its uncertainty? That is, what is \(s_{\hat{C}}\)?
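
The candidate estimators named in the sub-goals above can be compared numerically. The following is a minimal Python sketch, not part of the original case studies: the simulated data, the seed, the value C = 10, and the function names are all illustrative assumptions, and the midmean is taken here as the mean of roughly the middle half of the sorted data.

```python
import math
import random
import statistics

def location_estimates(y):
    """Return several candidate estimators of the constant C."""
    ys = sorted(y)
    n = len(ys)
    lo, hi = n // 4, n - n // 4          # indices bounding roughly the middle 50%
    return {
        "mean": statistics.fmean(ys),
        "median": statistics.median(ys),
        "midrange": (ys[0] + ys[-1]) / 2,
        "midmean": statistics.fmean(ys[lo:hi]),
    }

def std_error_of_mean(y):
    """Usual formula s / sqrt(N), with s the sample standard deviation."""
    return statistics.stdev(y) / math.sqrt(len(y))

rng = random.Random(0)                   # arbitrary seed, for reproducibility only
C = 10.0                                 # hypothetical process constant
y = [C + rng.gauss(0.0, 1.0) for _ in range(500)]   # Y_i = C + E_i

print(location_estimates(y))
print(std_error_of_mean(y))
```

For data that really are random draws from a normal distribution, all four estimates should land close to C. Which of them is "best" depends on the distribution, which is why the distributional checks matter.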

EDA and the routine checking of underlying assumptions provide insight into all of the above.

Location and variation checks provide information as to whether C is really constant.
Distributional checks indicate whether \(\bar{Y}\) is the best estimator. Techniques for distributional checking include histograms, normal probability plots, and probability plot correlation coefficient plots.
Randomness checks ascertain whether the usual \(s_{\bar{Y}} = s/\sqrt{N}\) is valid.
Distributional tests assist in determining a better estimator, if needed.
Simulation tools (namely, bootstrapping) provide values for the uncertainty of alternative estimators.
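
As an illustration of the bootstrapping idea, the sketch below estimates the uncertainty of the median by resampling the data with replacement. It is a minimal example under stated assumptions: the function name `bootstrap_std_error`, the number of replicates, the seeds, and the simulated data are hypothetical choices, not taken from the case studies.

```python
import random
import statistics

def bootstrap_std_error(y, estimator, n_boot=1000, seed=0):
    """Standard deviation of the estimator over bootstrap resamples of y."""
    rng = random.Random(seed)
    n = len(y)
    # Each replicate applies the estimator to n values drawn with replacement.
    replicates = [estimator(rng.choices(y, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)

data_rng = random.Random(42)             # arbitrary seed for the example data
y = [data_rng.gauss(0.0, 1.0) for _ in range(400)]

print(bootstrap_std_error(y, statistics.median))
print(bootstrap_std_error(y, statistics.fmean))
```

Passing `statistics.fmean` as the estimator gives a value that can be compared against s divided by the square root of N. This resampling scheme treats the observations as independent, so it is only appropriate once the randomness assumption has been checked.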

Assumptions not satisfied

If one or more of the above assumptions is not satisfied, then we use EDA techniques, or some mix of EDA and classical techniques, to find a more appropriate model for the data. That is,

Y_i = D + E_i

where D is the deterministic part and E_i is the error component.

If the data are not random, then we may investigate fitting some simple time series models to the data. If the constant location and scale assumptions are violated, we may need to investigate the measurement process to see if there is an explanation.
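
To make the time series remark concrete, the sketch below fits one simple model, a first-order autoregression, by least squares and compares the lag-1 autocorrelation of the raw data with that of the residuals. The AR(1) form, the coefficient 0.8, the seed, and the function names are assumptions made for illustration; the text does not prescribe a particular time series model.

```python
import random

def lag1_autocorr(y):
    """Lag-1 sample autocorrelation coefficient."""
    n = len(y)
    mean = sum(y) / n
    num = sum((y[i] - mean) * (y[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in y)
    return num / den

def fit_ar1(y):
    """Least-squares fit of Y_i = a0 + a1 * Y_{i-1} + E_i.

    Returns (a0, a1, residuals).
    """
    x, z = y[:-1], y[1:]                 # predictor Y_{i-1}, response Y_i
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    a1 = sxz / sxx
    a0 = mz - a1 * mx
    resid = [b - (a0 + a1 * a) for a, b in zip(x, z)]
    return a0, a1, resid

rng = random.Random(1)                   # arbitrary seed
y = [0.0]
for _ in range(999):
    y.append(0.8 * y[-1] + rng.gauss(0.0, 1.0))   # hypothetical non-random data

a0, a1, resid = fit_ar1(y)
print(a1, lag1_autocorr(y), lag1_autocorr(resid))
```

If the model is adequate, the residual autocorrelation should be near zero even though the raw data are strongly autocorrelated. The residuals, not the raw data, are then what should satisfy the underlying assumptions.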

The assumptions on the error term are still quite relevant in the sense that for an appropriate model the error component should follow the assumptions. The criterion for validating the model, or comparing competing models, is framed in terms of these assumptions.

Multivariable data

Although the case studies in this chapter utilize univariate data, the assumptions above are relevant for multivariable data as well.

If the data are not univariate, then we are trying to find a model

Y_i = F(X_1, ..., X_k) + E_i

where F is some function based on one or more variables. The error component, which is a univariate data set, of a good model should satisfy the assumptions given above. The criterion for validating and comparing models is based on how well the error component follows these assumptions.

The load cell calibration case study in the process modeling chapter shows an example of this in the regression context.

First three case studies utilize data with known characteristics

The first three case studies utilize data that are randomly generated from the following distributions:

normal distribution with mean 0 and standard deviation 1
uniform distribution with mean 0.5 and standard deviation \(\sqrt{1/12}\) (uniform over the interval (0,1))
random walk
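
Data sets of these three kinds can be generated along the following lines. This is a sketch only: the sample size, the seed, and the function name are arbitrary, and the random walk is assumed here to be a cumulative sum of uniform steps centered at zero, which may differ from how the case study data were actually produced.

```python
import random
import statistics

def make_case_study_data(n=500, seed=0):
    """Return (normal, uniform, random_walk) samples of length n."""
    rng = random.Random(seed)
    normal = [rng.gauss(0.0, 1.0) for _ in range(n)]   # mean 0, standard deviation 1
    uniform = [rng.random() for _ in range(n)]         # uniform over the unit interval
    walk, total = [], 0.0
    for _ in range(n):
        total += rng.random() - 0.5      # assumed step: uniform on (-0.5, 0.5)
        walk.append(total)
    return normal, uniform, walk

normal, uniform, walk = make_case_study_data()
print(statistics.fmean(normal), statistics.stdev(normal))
print(statistics.fmean(uniform), statistics.stdev(uniform))   # stdev near sqrt(1/12), about 0.2887
print(walk[:5])
```

The first two samples satisfy all four assumptions by construction. The random walk does not, because each value depends on the one before it.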

The other univariate case studies utilize data from scientific processes. The goal is to determine if

Y_i = C + E_i

is a reasonable model. This is done by testing the underlying assumptions. If the assumptions are satisfied, then an estimate of C and an estimate of the uncertainty of C are computed. If the assumptions are not satisfied, we attempt to find a model where the error component does satisfy the underlying assumptions.

Graphical methods that are applied to the data

To test the underlying assumptions, each data set is analyzed using four graphical methods that are particularly suited for this purpose:

run sequence plot, which is useful for detecting shifts of location or scale
lag plot, which is useful for detecting non-randomness in the data
histogram, which is useful for trying to determine the underlying distribution
normal probability plot, which is useful for deciding whether the data follow the normal distribution

There are a number of other techniques for addressing the underlying assumptions. However, the four plots listed above provide an excellent opportunity for addressing all of the assumptions on a single page of graphics.
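
The numerical ingredients of the four plots can be assembled with a short script, leaving the drawing to whatever graphics package is at hand. The sketch below is an illustration under assumptions: the function names, the bin count, and the plotting positions (i - 0.5)/N used for the normal probability plot are choices made here, and other conventions for the plotting positions exist.

```python
import math
import random
from statistics import NormalDist

def four_plot_data(y, bins=10):
    """Return the point sets behind the four plots for the data y."""
    n = len(y)
    run_sequence = list(enumerate(y, start=1))         # (i, Y_i)
    lag = list(zip(y[:-1], y[1:]))                     # (Y_{i-1}, Y_i)
    lo, hi = min(y), max(y)
    width = ((hi - lo) / bins) or 1.0                  # guard against constant data
    counts = [0] * bins                                # histogram bin counts
    for v in y:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    # Normal probability plot: ordered data against standard normal quantiles.
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    prob = list(zip(quantiles, sorted(y)))
    return run_sequence, lag, counts, prob

def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)                   # arbitrary seed
y = [rng.gauss(0.0, 1.0) for _ in range(500)]
run_sequence, lag, counts, prob = four_plot_data(y)
print(counts)
print(correlation(lag))                  # near 0 suggests no lag-1 structure
print(correlation(prob))                 # near 1 suggests approximate normality
```

The two correlations give rough numerical companions to the lag plot and the normal probability plot. They summarize the plots but do not replace looking at them.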

Additional graphical techniques are used in certain case studies to develop models that do have error components that satisfy the underlying assumptions.

Quantitative methods that are applied to the data

The normal and uniform random number data sets are also analyzed with the following quantitative techniques, which are explained in more detail in an earlier section:

Summary statistics which include:
- mean
- standard deviation
- autocorrelation coefficient to test for randomness
- normal and uniform probability plot correlation coefficients (ppcc) to test for a normal or uniform distribution, respectively
- Shapiro-Wilk (also written Wilk-Shapiro) test for a normal distribution
Linear fit of the data as a function of time to assess drift (test for fixed location)
Bartlett test for fixed variance
Autocorrelation plot and coefficient to test for randomness
Runs test to test for lack of randomness
Anderson-Darling test for a normal distribution
Grubbs test for outliers
Summary report
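
Two of the listed checks are simple enough to sketch directly: the linear fit against time as a test for drift, and a runs test for randomness. The code below is an illustrative sketch under assumptions. The function names and simulated data are hypothetical, and the runs test shown is the variant that counts runs above and below the median, which is one common form and not necessarily the one used in the case studies.

```python
import math
import random
import statistics

def drift_t_statistic(y):
    """Least-squares slope of Y_i against i, and the slope over its standard error."""
    n = len(y)
    x = range(1, n + 1)
    mx = (n + 1) / 2
    my = sum(y) / n
    sxx = sum((i - mx) ** 2 for i in x)
    slope = sum((i - mx) * (v - my) for i, v in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((v - (intercept + slope * i)) ** 2 for i, v in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)          # residual variance / Sxx
    return slope, slope / se_slope

def runs_test_z(y):
    """z statistic for the number of runs above and below the median."""
    med = statistics.median(y)
    signs = [v > med for v in y if v != med]           # values tied with the median are dropped
    n1 = sum(signs)
    n2 = len(signs) - n1
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)

rng = random.Random(0)                   # arbitrary seed
y = [10.0 + rng.gauss(0.0, 1.0) for _ in range(500)]   # hypothetical in-control data
print(drift_t_statistic(y))
print(runs_test_z(y))
```

For in-control data, both statistics should be small in magnitude, with values beyond roughly 2 in absolute value usually taken as a warning sign. A large slope statistic points to a drift in location, and a large runs statistic points to non-randomness.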

Although the graphical methods applied to the normal and uniform random numbers are sufficient to assess the validity of the underlying assumptions, the quantitative techniques are included to show how the graphical and quantitative approaches differ in flavor.

The remaining case studies intermix one or more of these quantitative techniques into the analysis where appropriate.