Find transformation to normalize data
Many statistical tests and intervals are based on the assumption of
normality. The assumption of normality often leads to tests that are
simple, mathematically tractable, and powerful compared to tests that
do not make the normality assumption. Unfortunately, many real data
sets are in fact not approximately normal. However, an appropriate
transformation of a data set can often yield a data set that does
follow approximately a normal distribution. This increases the
applicability and usefulness of statistical techniques based on the
normality assumption.
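The Box-Cox transformation is a particularly useful family of transformations. It is defined as:

\[ T(Y) = \frac{Y^{\lambda} - 1}{\lambda} \]

where Y is the response variable and λ is the transformation parameter. For λ = 0, the natural logarithm of the data is taken instead of using the above formula.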
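As an illustration, the transformation can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard one-parameter form (Y^λ - 1)/λ with the natural logarithm used at λ = 0, and assuming strictly positive data; the function name `box_cox` is illustrative and not part of any particular package.

```python
import math


def box_cox(y, lam):
    """Return the Box-Cox transform of the positive values in y.

    Assumes the one-parameter form (v**lam - 1) / lam, with log(v) at lam == 0.
    """
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        # Limiting case of the family as lam approaches zero.
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


# Illustrative values only (not taken from the handbook).
sample = [1.0, 2.0, 4.0, 8.0]
print(box_cox(sample, 0.5))
print(box_cox(sample, 0))
```

Note that λ = 1 only shifts the data by one unit, so it leaves the shape of the distribution unchanged.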
Given a particular transformation such as the Box-Cox transformation defined above, it is helpful to define a measure of the normality of the transformed data. One such measure is the correlation coefficient of a normal probability plot. The correlation is computed between the vertical and horizontal axis variables of the probability plot and is a convenient measure of the linearity of the probability plot (the more linear the probability plot, the better a normal distribution fits the data).
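The correlation measure described here can be sketched as follows. This is a hedged illustration, not a reference implementation: it assumes Filliben's approximation for the uniform order statistic medians as the plotting positions (other plotting positions would give slightly different values), and the function names `normal_order_medians` and `normal_ppcc` are hypothetical.

```python
import math
from statistics import NormalDist


def normal_order_medians(n):
    """Approximate normal order statistic medians (horizontal axis values).

    Assumption: Filliben's approximation to the uniform order statistic medians.
    """
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    std_normal = NormalDist()
    return [std_normal.inv_cdf(p) for p in m]


def normal_ppcc(values):
    """Correlation between the sorted data and the normal order statistic medians."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    x = normal_order_medians(n)
    y = sorted(values)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if syy == 0:
        raise ValueError("data must not be constant")
    return sxy / math.sqrt(sxx * syy)


# Illustrative right-skewed values (not taken from the handbook).
skewed = [0.2, 0.3, 0.5, 0.6, 0.9, 1.1, 1.8, 2.9, 5.4, 12.0]
print(normal_ppcc(skewed))
```

A value close to 1 indicates a nearly linear probability plot; strongly skewed data produce a visibly lower correlation.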
The Box-Cox normality plot is a plot of these correlation coefficients for various values of the parameter λ. The value of λ corresponding to the maximum correlation on the plot is then the optimal choice for λ.
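The search behind the plot can be sketched as a simple grid evaluation: transform the data for each candidate parameter value, compute the probability plot correlation, and keep the value with the largest correlation. The sketch below is self-contained and rests on stated assumptions: the standard Box-Cox form with the logarithm at zero, Filliben's plotting positions, and a grid from -2 to 2 in steps of 0.1. All function names are illustrative.

```python
import math
from statistics import NormalDist


def box_cox(y, lam):
    """Box-Cox transform of strictly positive data (log at lam == 0)."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def normal_order_medians(n):
    """Normal quantiles at Filliben's plotting positions (an assumption)."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    std_normal = NormalDist()
    return [std_normal.inv_cdf(p) for p in m]


def normal_ppcc(values):
    """Correlation coefficient of the normal probability plot."""
    n = len(values)
    x = normal_order_medians(n)
    y = sorted(values)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def box_cox_normality_curve(y, lambdas):
    """Return (lambda, correlation) pairs: the points of the normality plot."""
    return [(lam, normal_ppcc(box_cox(y, lam))) for lam in lambdas]


def best_lambda(y, lambdas):
    """Return the (lambda, correlation) pair with the largest correlation."""
    return max(box_cox_normality_curve(y, lambdas), key=lambda pair: pair[1])


# Synthetic data: exponentials of normal quantiles, so the log transform
# makes the probability plot exactly linear.
data = [math.exp(q) for q in normal_order_medians(50)]
grid = [k / 10 for k in range(-20, 21)]
lam, r = best_lambda(data, grid)
print(lam, r)
```

Because the synthetic data are constructed so that their logarithms fall exactly on the normal quantiles, the search should select the grid point at zero. With real data the curve is usually flatter near its maximum, so a nearby convenient value is often a reasonable choice.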
The histogram in the upper left-hand corner shows a data set that has significant right skewness (and so does not follow a normal distribution). The Box-Cox normality plot shows that the maximum value of the correlation coefficient is at λ = -0.3. The histogram of the data after applying the Box-Cox transformation with λ = -0.3 shows a data set for which the normality assumption is reasonable. This is verified with a normal probability plot of the transformed data.
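Box-Cox normality plots are formed by:
Vertical axis: Correlation coefficient from the normal probability plot after applying the Box-Cox transformation
Horizontal axis: Value of λ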
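The Box-Cox normality plot can provide answers to the following
questions:
1. Is there a transformation that will normalize my data?
2. What is the optimal value of the transformation parameter?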
Normalization Improves Validity of Tests
Normality assumptions are critical for many univariate intervals and hypothesis tests. It is important to test the normality assumption. If the data are in fact clearly not normal, the Box-Cox normality plot can often be used to find a transformation that will approximately normalize the data.
Related Techniques
Normal Probability Plot
Box-Cox Linearity Plot
Software
Box-Cox normality plots are not a standard part of most general purpose statistical software programs. However, the underlying technique is based on a normal probability plot and computing a correlation coefficient. So if a statistical program supports these capabilities, writing a macro for a Box-Cox normality plot should be feasible. Dataplot supports a Box-Cox normality plot directly.