1. Exploratory Data Analysis
1.3. EDA Techniques
1.3.3. Graphical Techniques: Alphabetic
Histogram
Histogram Interpretation: Symmetric with Outlier

A symmetric distribution is one in which the two "halves" of the histogram appear as approximate mirror images of one another. The above example is symmetric with the exception of outlying data near Y = 9.45.

An outlier is a data point that comes from a distribution different (in location, scale, or distributional form) from the bulk of the data. In the real world, outliers have a range of causes, from simple ones such as

1. operator blunders
2. equipment failures
3. day-to-day effects
4. batch-to-batch differences
5. anomalous input conditions
6. warm-up effects

to more subtle causes such as

7. a change in settings of factors
     which (knowingly or unknowingly)
     affect the response.
8. nature trying to tell us something
     (new science!)

All outliers should be taken seriously and should be investigated thoroughly for explanations. Automatic outlier-rejection schemes (such as throwing out all data beyond 4 sample standard deviations from the sample mean) are particularly dangerous. The classic case of automatic outlier rejection becoming automatic information rejection was the South Pole ozone depletion problem. Ozone depletion over the South Pole would have been detected years earlier except that the processing of the satellite data recording the low ozone readings included outlier-rejection code that automatically screened out the "outliers" (that is, the low ozone readings) before the analysis was conducted. Such inadvertent (and incorrect) purging went on for years. It was not until ground-based South Pole readings started detecting low ozone levels that someone decided to double-check why the satellite had not picked up this fact. It had, but the readings had been thrown out!
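As a minimal sketch of the safer alternative, the 4-standard-deviation rule mentioned above can be used to report suspect points for investigation instead of deleting them. The function name `flag_beyond_k_sd` and the sample values are hypothetical, chosen only to illustrate the idea; they are not data from the handbook.

```python
from statistics import mean, stdev

def flag_beyond_k_sd(values, k=4.0):
    """Return (index, value, z) for points more than k sample SDs from the sample mean.

    Nothing is removed from `values`; the caller decides what to do
    after investigating each flagged point.
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(i, v, (v - m) / s) for i, v in enumerate(values) if abs(v - m) > k * s]

# Hypothetical readings: 30 values clustered near 9.20 plus one reading at 9.45.
readings = [9.18, 9.19, 9.20, 9.21, 9.22] * 6 + [9.45]

for index, value, z in flag_beyond_k_sd(readings):
    print(f"investigate reading {index}: value={value}, z={z:.2f}")
```

Note that with a small sample a single point can never exceed 4 sample standard deviations (the largest possible deviation for n points is (n - 1)/sqrt(n)), which is one more reason not to rely on a fixed cutoff alone.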

The best attitude is that outliers are our "friends": they are trying to tell us something, and we should not stop until we are comfortable with the explanation for each outlier.

Recommended Next Steps
1. Graphically check for outliers (in
     the commonly-encountered normal case)
     by generating a normal probability plot--
     if the plot is linear except for point(s)
     at the end, then that would suggest
     such points are outliers.

2. Quantitatively check for outliers (in
     the commonly-encountered normal case)
     by carrying out Grubbs' test, which
     indicates how many sample standard
     deviations away from the sample
     mean the data in question are--
     large values indicate outliers.
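
The two checks above can be sketched in plain Python. This is an illustration under stated assumptions, not a definitive implementation: the plotting positions use Filliben's approximation to the uniform order statistic medians (one common choice, assumed here), the function names and sample data are hypothetical, and the Grubbs statistic is only computed, not compared against a critical value, which would have to come from a table or a t-distribution routine.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def uniform_order_medians(n):
    """Filliben's approximation to the medians of the uniform order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return m

def normal_probability_points(values):
    """Pair each sorted observation with its theoretical standard normal quantile."""
    ordered = sorted(values)
    quantiles = [NormalDist().inv_cdf(p) for p in uniform_order_medians(len(ordered))]
    return list(zip(quantiles, ordered))

def pearson_r(pairs):
    """Correlation of the plotted points; values near 1 indicate a nearly linear plot."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def grubbs_statistic(values):
    """Return (G, index), where G = max |x_i - mean| / s and index locates that point."""
    m, s = mean(values), stdev(values)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[idx] - m) / s, idx

# Hypothetical data: 30 values clustered near 9.20 plus one reading at 9.45.
data = [9.18, 9.19, 9.20, 9.21, 9.22] * 6 + [9.45]

r_all = pearson_r(normal_probability_points(data))
r_trimmed = pearson_r(normal_probability_points(data[:-1]))
g, worst = grubbs_statistic(data)
print(f"plot correlation with suspect point: {r_all:.3f}")
print(f"plot correlation without it:         {r_trimmed:.3f}")
print(f"Grubbs statistic G = {g:.2f} at index {worst} (value {data[worst]})")
```

Plotting the pairs from `normal_probability_points` (theoretical quantile on the horizontal axis, ordered data on the vertical axis) gives the normal probability plot itself. The correlation is only a numeric stand-in for judging linearity by eye, and dropping the point in this sketch is for comparison only, not a recommendation to discard it.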
