## Scatter Plot: Variation of Y Does Depend on X (heteroscedastic)

Figure: Scatter plot showing heteroscedastic variability

Discussion: This scatter plot of the Alaska pipeline data reveals an approximately linear relationship between X and Y, but more importantly, it reveals a statistical condition referred to as heteroscedasticity (that is, nonconstant variation in Y over the values of X). For a heteroscedastic data set, the variation in Y differs depending on the value of X. In this example, small values of X yield small scatter in Y while large values of X result in large scatter in Y.
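One rough numerical companion to the visual check is to fit a straight line and compare the residual scatter at small and large X. The sketch below uses synthetic data, not the Alaska pipeline measurements, and assumes a noise standard deviation proportional to X. The function name `residual_spread_by_half` and all numeric settings are illustrative choices.

```python
import numpy as np

def residual_spread_by_half(x, y):
    """Fit a straight line by ordinary least squares and return the residual
    standard deviation in the lower and upper halves of the X values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    low = x <= np.median(x)
    return resid[low].std(ddof=1), resid[~low].std(ddof=1)

# Synthetic data (assumption): noise standard deviation grows in proportion to X.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 100.0, 400)
y = 2.0 + 1.0 * x + rng.normal(0.0, 0.1 * x)

sd_low, sd_high = residual_spread_by_half(x, y)
print(f"residual SD, small X: {sd_low:.2f}")
print(f"residual SD, large X: {sd_high:.2f}")
```

A markedly larger residual standard deviation in the upper half than in the lower half is consistent with the fan-shaped pattern described above. Splitting at the median is only a coarse check; the scatter plot itself remains the primary diagnostic.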

Heteroscedasticity complicates the analysis somewhat, but its effects can be overcome by:

1. proper weighting of the data with noisier data being weighted less, or by

2. performing a Y variable transformation to achieve homoscedasticity. The Box-Cox normality plot can help determine a suitable transformation.

Impact of Ignoring Unequal Variability in the Data: Fortunately, unweighted regression analyses on heteroscedastic data produce estimates of the coefficients that are unbiased. However, the coefficients will not be as precise as they would be with proper weighting.
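The claim about unbiasedness and precision can be illustrated with a small simulation. The sketch below is not taken from the handbook: it assumes a straight-line model with a noise standard deviation proportional to X, and it gives the weighted fit the true variances, which would not be known in practice. The helper `fit_line` is a hypothetical name.

```python
import numpy as np

def fit_line(x, y, w=None):
    """Return (intercept, slope) of a least-squares line.

    If w is given, minimize sum(w_i * (y_i - a - b * x_i)**2)."""
    X = np.column_stack([np.ones_like(x), x])
    if w is None:
        w = np.ones_like(x)
    sw = np.sqrt(w)  # scaling rows by sqrt(w) turns WLS into ordinary LS
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

rng = np.random.default_rng(1)
x = np.linspace(1.0, 100.0, 50)
sigma = 0.1 * x                      # assumption: noise SD proportional to X
true_intercept, true_slope = 2.0, 1.0

ols, wls = [], []
for _ in range(2000):
    y = true_intercept + true_slope * x + rng.normal(0.0, sigma)
    ols.append(fit_line(x, y))                      # unweighted fit
    wls.append(fit_line(x, y, w=1.0 / sigma**2))    # weights = 1 / variance
ols = np.array(ols)
wls = np.array(wls)

print("mean unweighted (intercept, slope):", ols.mean(axis=0))
print("mean weighted   (intercept, slope):", wls.mean(axis=0))
print("SD   unweighted (intercept, slope):", ols.std(axis=0, ddof=1))
print("SD   weighted   (intercept, slope):", wls.std(axis=0, ddof=1))
```

Under these assumptions both sets of estimates should average close to the true coefficients, while the weighted estimates should vary less from one simulated data set to the next.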

Note further that if heteroscedasticity does exist, it is frequently useful to plot and model the local variation $\mbox{var}(Y_i | X_i)$ as a function of X, as in

$$\mbox{var}(Y_i | X_i) = g(X_i)$$

This modeling has two advantages:

1. it provides additional insight and understanding as to how the response Y relates to X; and

2. it provides a convenient means of forming weights for a weighted regression by simply using

$$w_i = W(Y_i | X_i) = \frac{1}{\mbox{var}(Y_i | X_i)} = \frac{1}{g(X_i)}$$

The topic of non-constant variation is discussed in some detail in the process modeling chapter.
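As a sketch of how a variance model can be turned into weights, the example below assumes a power-law form $g(X) = c X^p$ and estimates it from binned residuals of an unweighted fit. Both the power-law form and the binning approach are assumptions made for illustration, as are the synthetic data and the helper name `estimate_variance_function`.

```python
import numpy as np

def estimate_variance_function(x, resid, n_bins=10):
    """Estimate g(X) = c * X**p from residuals by binning.

    The data are sorted by X and split into n_bins groups; the log of each
    group's residual variance is regressed on the log of its mean X.
    Returns (c, p). Assumes all X > 0."""
    order = np.argsort(x)
    x_groups = np.array_split(x[order], n_bins)
    r_groups = np.array_split(resid[order], n_bins)
    mean_x = np.array([g.mean() for g in x_groups])
    var_r = np.array([g.var(ddof=1) for g in r_groups])
    p, log_c = np.polyfit(np.log(mean_x), np.log(var_r), 1)
    return np.exp(log_c), p

# Synthetic data (assumption): true variance function is 0.01 * X**2.
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 100.0, 500)
y = 2.0 + 1.0 * x + rng.normal(0.0, 0.1 * x)

# Step 1: unweighted fit, used only to obtain residuals.
slope0, intercept0 = np.polyfit(x, y, 1)
resid = y - (intercept0 + slope0 * x)

# Step 2: model the local variation as g(X) = c * X**p.
c, p = estimate_variance_function(x, resid)

# Step 3: form weights w_i = 1 / g(X_i) and refit.
w = 1.0 / (c * x**p)
# np.polyfit multiplies each residual by its `w` argument before squaring,
# so it is given sqrt(w_i), i.e. 1 / (estimated standard deviation).
slope_w, intercept_w = np.polyfit(x, y, 1, w=np.sqrt(w))

print(f"estimated g(X) = {c:.4f} * X**{p:.2f}")
print(f"weighted fit: intercept = {intercept_w:.3f}, slope = {slope_w:.3f}")
```

Plotting the binned variances against X alongside the fitted $g(X)$ is a natural check on whether the assumed functional form is reasonable before the weights are used.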