1.1.7. General Problem Categories


Problem Classification

The following table is a convenient way to classify EDA
problems.

Univariate and Control

UNIVARIATE
Data:
A single column of numbers, Y.
Model:
y = constant + error
Output:
 A number (the estimated constant in the model).
 An estimate of uncertainty for the constant.
 An estimate of the distribution for the error.
Techniques:
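As a rough illustration of the univariate problem, the sketch below assumes the constant-plus-error model implied by the Output list. It uses the sample mean as the estimate of the constant and the standard error of the mean as its uncertainty. The function name univariate_summary and the data values are hypothetical, chosen only for illustration.

```python
import math
import statistics

def univariate_summary(y):
    """Estimate the constant in y = constant + error, plus its uncertainty."""
    n = len(y)
    constant = statistics.mean(y)            # estimate of the constant
    sd = statistics.stdev(y)                 # sample standard deviation of the error
    std_error = sd / math.sqrt(n)            # uncertainty of the estimated constant
    residuals = [v - constant for v in y]    # raw material for studying the error distribution
    return constant, std_error, residuals

# Made-up measurements, for illustration only.
y = [9.8, 10.1, 10.0, 9.9, 10.2]
constant, std_error, residuals = univariate_summary(y)
print(f"constant = {constant:.3f}, standard error = {std_error:.3f}")
```

The residuals returned here are what a distributional technique, such as a probability plot, would then examine.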

CONTROL
Data:
A single column of numbers, Y.
Model:
y = constant + error
Output:
A "yes" or "no" to the question "Is the
system out of control?".
Techniques:
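One simple way to arrive at the yes/no answer is sketched below. It assumes a baseline sample known to be in control, sets limits at the baseline mean plus or minus three sample standard deviations, and flags any new point outside those limits. This is a simplification: practical control charts often estimate the spread differently and apply additional run rules. The function name out_of_control and all data values are hypothetical.

```python
import statistics

def out_of_control(baseline, new_points, k=3.0):
    """Return True if any new point falls outside center +/- k * sigma."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)       # assumption: spread estimated from the baseline
    lower, upper = center - k * sigma, center + k * sigma
    return any(p < lower or p > upper for p in new_points)

# Made-up baseline and new observations, for illustration only.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9]
stable = out_of_control(baseline, [10.1, 9.9, 10.0])
shifted = out_of_control(baseline, [10.1, 11.5, 10.0])
print(stable, shifted)
```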


Comparative and Screening

COMPARATIVE
Data:
A single response variable and k independent variables
(Y, X_{1}, X_{2}, ..., X_{k}). The primary focus is on one of these
independent variables (the primary factor).
Model:
y = f(x_{1}, x_{2}, ..., x_{k}) + error
Output:
A "yes" or "no" to the question "Is the primary factor
significant?".
Techniques:
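A minimal sketch of how the significance question might be answered numerically is given below. It swaps in a one-way analysis of variance on the primary factor alone, ignoring the other factors, and compares the F statistic with a tabulated critical value supplied by the caller. The function name one_way_f, the data, and the choice of a 5 % significance level are assumptions made for illustration.

```python
import statistics

def one_way_f(groups):
    """F statistic for equality of group means (one-way ANOVA)."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up responses at three levels of the primary factor.
groups = [[10, 11, 9, 10], [12, 13, 11, 12], [10, 10, 11, 9]]
f_stat = one_way_f(groups)
f_critical = 4.26   # tabulated 5 % point of F with (2, 9) degrees of freedom
significant = f_stat > f_critical
print(f"F = {f_stat:.2f}, primary factor significant: {significant}")
```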

SCREENING
Data:
A single response variable and k independent variables
(Y, X_{1}, X_{2}, ..., X_{k}).
Model:
y = f(x_{1}, x_{2}, ..., x_{k}) + error
Output:
 A ranked list (from most important to least
important) of factors.
 Best settings for the factors.
 A good model/prediction equation relating Y to
the factors.
Techniques:
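The ranked list and the best settings can be sketched for the simplest case: a two-level full factorial design with factors coded as -1 and +1. The sketch below ranks factors by the absolute size of their main effects and, assuming a larger response is better, picks the level with the higher mean response for each factor. The function name main_effects, the factor names, and the synthetic responses are hypothetical.

```python
import itertools

def main_effects(design, y):
    """Main effect of each factor: mean response at +1 minus mean response at -1."""
    effects = []
    for j in range(len(design[0])):
        high = [yi for row, yi in zip(design, y) if row[j] == 1]
        low = [yi for row, yi in zip(design, y) if row[j] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects

# 2^3 full factorial design; responses are synthetic, for illustration only.
design = list(itertools.product([-1, 1], repeat=3))
y = [44, 48, 42, 46, 54, 58, 52, 56]
names = ["X1", "X2", "X3"]

effects = main_effects(design, y)
# Rank factors from most important to least important by |effect|.
ranked = sorted(zip(names, effects), key=lambda pair: abs(pair[1]), reverse=True)
# Assumption: a larger response is better, so choose the level with the higher mean.
best = {name: (1 if e > 0 else -1) for name, e in zip(names, effects)}
print(ranked)
print(best)
```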


Optimization and Regression

OPTIMIZATION
Data:
A single response variable and k independent variables
(Y, X_{1}, X_{2}, ..., X_{k}).
Model:
y = f(x_{1}, x_{2}, ..., x_{k}) + error
Output:
Best settings for the factor variables.
Techniques:
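For a single factor, the idea of locating a best setting can be sketched by passing a parabola exactly through the responses at three equally spaced settings and taking its vertex. This is only an illustrative stand-in for a fitted response model; the function name parabola_vertex and the data are hypothetical.

```python
def parabola_vertex(x0, h, y_minus, y_zero, y_plus):
    """Vertex of the parabola through (x0 - h, y_minus), (x0, y_zero), (x0 + h, y_plus).

    Returns (x_star, curvature); a negative curvature means x_star is a maximum,
    a positive curvature means it is a minimum.
    """
    curvature = y_minus - 2.0 * y_zero + y_plus   # proportional to the second derivative
    if curvature == 0:
        raise ValueError("responses are collinear; the parabola has no vertex")
    x_star = x0 + h * (y_minus - y_plus) / (2.0 * curvature)
    return x_star, curvature

# Made-up responses at settings 1, 2, and 3 of one factor.
x_star, curvature = parabola_vertex(x0=2.0, h=1.0, y_minus=6.0, y_zero=9.0, y_plus=10.0)
print(f"best setting near x = {x_star}, curvature = {curvature}")
```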

REGRESSION
Data:
A single response variable and k independent variables
(Y, X_{1}, X_{2}, ..., X_{k}).
The independent variables can be continuous.
Model:
y = f(x_{1}, x_{2}, ..., x_{k}) + error
Output:
A good model/prediction equation relating Y to
the factors.
Techniques:
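As a minimal sketch of a prediction equation, the code below fits a straight line by ordinary least squares for the special case of one continuous independent variable (k = 1) and f assumed linear. The function name fit_line and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x + error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```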


Time Series and Multivariate

TIME SERIES
Data:
A column of time-dependent numbers, Y. In addition,
time is an independent variable. The time variable
can be either explicit or implied. If the data
are not equispaced, the time variable should be
explicitly provided.
Model:
y_{t} = f(t) + error
The model can be either time-domain based or
frequency-domain based.
Output:
A good model/prediction equation relating Y to
previous values of Y.
Techniques:
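A prediction equation relating Y to its previous values can be sketched in the time domain with a first-order autoregressive form, y_{t} = c + phi * y_{t-1} + error, fitted by least squares on the lagged pairs. The sketch assumes equispaced data with an implied time variable. The function name fit_ar1 is hypothetical, and the series is synthetic and noise-free so that the fit can be checked by hand.

```python
def fit_ar1(y):
    """Least-squares fit of y[t] = c + phi * y[t-1] for an equispaced series."""
    prev, curr = y[:-1], y[1:]               # lagged pairs (y[t-1], y[t])
    n = len(prev)
    p_bar, c_bar = sum(prev) / n, sum(curr) / n
    spp = sum((p - p_bar) ** 2 for p in prev)
    spc = sum((p - p_bar) * (c - c_bar) for p, c in zip(prev, curr))
    phi = spc / spp
    c = c_bar - phi * p_bar
    return c, phi

# Synthetic series generated from y[t] = 2 + 0.5 * y[t-1], starting at 10.
series = [10.0, 7.0, 5.5, 4.75, 4.375]
c, phi = fit_ar1(series)
forecast = c + phi * series[-1]              # one-step-ahead prediction
print(f"c = {c:.3f}, phi = {phi:.3f}, next value = {forecast:.4f}")
```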

MULTIVARIATE
Data:
k factor variables (X_{1}, X_{2}, ..., X_{k}).
Model:
The model is not explicit.
Output:
Identify underlying correlation structure in the data.
Techniques:
Note that multivariate analysis is covered only lightly
in this Handbook.
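One elementary view of the correlation structure is the matrix of pairwise Pearson correlations among the k variables, sketched below. The function names pearson and correlation_matrix and the data are hypothetical, chosen only for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return sab / math.sqrt(saa * sbb)

def correlation_matrix(columns):
    """k x k matrix of pairwise correlations for a list of k columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Made-up columns X1, X2, X3, for illustration only.
columns = [
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
    [1, 3, 2, 5, 4],
]
R = correlation_matrix(columns)
for row in R:
    print(["%.2f" % r for r in row])
```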

