1.3.3.6. Box-Cox Normality Plot

Purpose:
Find transformation to normalize data

Many statistical tests and intervals are based on the assumption of normality. The assumption of normality often leads to tests that are simple, mathematically tractable, and powerful compared to tests that do not make the normality assumption. Unfortunately, many real data sets are not approximately normal. However, an appropriate transformation of a data set can often yield a data set that does approximately follow a normal distribution. This increases the applicability and usefulness of statistical techniques based on the normality assumption.

The Box-Cox transformation is a particularly useful family of transformations. It is defined as:

$T(Y) = (Y^{λ} - 1) / λ$

where Y is the response variable and $λ$ is the transformation parameter. For $λ$ = 0, the natural log of the data is taken instead of using the above formula.
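The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `box_cox` is hypothetical, and the check for strictly positive data is an assumption added here (the power and log are not defined for all real exponents otherwise). The log is the natural choice at $λ$ = 0 because (Y^λ - 1) / λ tends to ln(Y) as λ approaches 0.

```python
import math


def box_cox(values, lam):
    """Apply T(Y) = (Y**lam - 1) / lam to each value; use ln(Y) when lam == 0.

    Assumption: every value is strictly positive.
    """
    if any(y <= 0 for y in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]


data = [1.0, 2.0, 4.0, 8.0]
print(box_cox(data, 1))    # lam = 1 only shifts the data by -1
print(box_cox(data, 0))    # lam = 0 takes natural logs
print(box_cox(data, 0.5))  # lam = 0.5 behaves like a square-root transform
```

Because the transformation is monotone increasing in Y for every value of the parameter, it changes the shape of the distribution without changing the order of the observations.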

Given a particular transformation such as the Box-Cox transformation defined above, it is helpful to define a measure of the normality of the transformed data. One such measure is the correlation coefficient of a normal probability plot. The correlation is computed between the vertical and horizontal axis variables of the probability plot and is a convenient measure of the linearity of the probability plot (the more linear the probability plot, the better a normal distribution fits the data).
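The correlation measure described above might be computed as in the following sketch. The text does not say which plotting positions define the horizontal axis of the probability plot; this example assumes Filliben's approximation to the uniform order statistic medians, which is one common choice. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def uniform_order_medians(n):
    """Filliben's approximate medians of the n uniform order statistics (assumed choice)."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normal_ppcc(values):
    """Correlation between the sorted data and the matching normal quantiles."""
    ordered = sorted(values)
    quantiles = [NormalDist().inv_cdf(p) for p in uniform_order_medians(len(ordered))]
    return pearson(quantiles, ordered)


# Constructed examples: an exactly "normal-shaped" sample and a right-skewed one.
probs = uniform_order_medians(30)
normal_like = [10.0 + 2.0 * NormalDist().inv_cdf(p) for p in probs]
skewed = [math.exp(NormalDist().inv_cdf(p)) for p in probs]

print(normal_ppcc(normal_like))  # essentially 1: the probability plot is a straight line
print(normal_ppcc(skewed))       # smaller: the probability plot is curved
```

The coefficient is unaffected by shifting or rescaling the data, so it responds only to the shape of the distribution.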

The Box-Cox normality plot is a plot of these correlation coefficients for various values of the $λ$ parameter. The value of $λ$ corresponding to the maximum correlation on the plot is then the optimal choice for $λ$.
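Putting the pieces together, the points of a Box-Cox normality plot could be generated as sketched below. The grid of parameter values from -2 to 2 in steps of 0.1, the Filliben plotting positions, the constructed sample, and every function name are assumptions made for illustration. Drawing the plot is left out; the code only produces the (parameter, correlation) pairs and picks the maximum.

```python
import math
from statistics import NormalDist


def box_cox(values, lam):
    """Box-Cox transform; natural log when lam == 0 (data assumed positive)."""
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]


def uniform_order_medians(n):
    """Filliben's approximate medians of uniform order statistics (assumed choice)."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normal_ppcc(values):
    """Normal probability plot correlation coefficient of a sample."""
    ordered = sorted(values)
    quantiles = [NormalDist().inv_cdf(p) for p in uniform_order_medians(len(ordered))]
    return pearson(quantiles, ordered)


def box_cox_normality_curve(values, lambdas):
    """Return (lam, correlation) pairs: the points of a Box-Cox normality plot."""
    return [(lam, normal_ppcc(box_cox(values, lam))) for lam in lambdas]


# Constructed right-skewed sample: exponentiated normal quantiles.
sample = [math.exp(NormalDist().inv_cdf(p)) for p in uniform_order_medians(50)]
lambdas = [(k - 20) / 10 for k in range(41)]  # -2.0, -1.9, ..., 2.0

curve = box_cox_normality_curve(sample, lambdas)
best_lam, best_r = max(curve, key=lambda pair: pair[1])
print(best_lam, best_r)
```

For this constructed sample the maximum falls at 0, because taking logs recovers the normal quantiles the sample was built from. With real data the maximum is often broad, so a nearby convenient value (for example 0 or 0.5) may serve nearly as well as the exact maximizer.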

Sample Plot

The histogram in the upper left-hand corner shows a data set (first column) that has significant right skewness (and so does not follow a normal distribution). The Box-Cox normality plot shows that the maximum value of the correlation coefficient is at $λ$ = -0.3. The histogram of the data after applying the Box-Cox transformation with $λ$ = -0.3 shows a data set for which the normality assumption is reasonable. This is verified with a normal probability plot of the transformed data.

Definition

Box-Cox normality plots are formed by:

Vertical axis: Correlation coefficient from the normal probability plot after applying Box-Cox transformation
Horizontal axis: Value for $λ$

Questions

The Box-Cox normality plot can provide answers to the following questions:

Is there a transformation that will normalize my data?
What is the optimal value of the transformation parameter?

Importance:
Normalization Improves Validity of Tests

Normality assumptions are critical for many univariate intervals and hypothesis tests. It is important to test the normality assumption. If the data are in fact clearly not normal, the Box-Cox normality plot can often be used to find a transformation that will approximately normalize the data.

Related Techniques

Normal Probability Plot
Box-Cox Linearity Plot

Software

Box-Cox normality plots are not a standard part of most general-purpose statistical software programs. However, the underlying technique is based on a normal probability plot and the computation of a correlation coefficient, so if a statistical program supports these capabilities, writing a macro for a Box-Cox normality plot should be feasible.