Purpose: Test for Distributional Adequacy |
The Kolmogorov-Smirnov test
(Chakravarti, Laha,
and Roy, 1967) is used to decide if a sample comes from a
population with a specific distribution.
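The Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (ECDF). Given N ordered data points Y1, Y2, ..., YN, the ECDF is defined as

$$E_N(y) = \frac{n(y)}{N}$$

where n(y) is the number of data points less than or equal to y. This is a step function that increases by 1/N at the value of each ordered data point.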
The graph below is a plot of the empirical distribution function with a normal cumulative distribution function for 100 normal random numbers. The K-S test is based on the maximum distance between these two curves.
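The empirical distribution function described above can be sketched in a few lines of Python. This is a minimal illustration, not Handbook code: the function name `ecdf` and the five data values are made up for the example, and the sketch assumes the convention that the step function counts points less than or equal to y.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function E(y) = (number of points <= y) / N.

    The result is a step function that rises by 1/N at each data point.
    """
    ordered = sorted(sample)
    n = len(ordered)

    def evaluate(y):
        # bisect_right counts how many ordered points are <= y
        return bisect_right(ordered, y) / n

    return evaluate

# Illustrative data (hypothetical values, chosen only for this sketch)
data = [2.1, 0.4, 1.3, 3.7, 0.9]
E = ecdf(data)
for y in (0.0, 0.4, 1.0, 3.7, 5.0):
    print(f"E({y}) = {E(y):.1f}")
```

Evaluating `E` on a fine grid and plotting it against a normal cumulative distribution function would give the kind of comparison the K-S test is built on.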
|
||||||||||
| Characteristics and Limitations of the K-S Test |
An attractive feature of this test is that
the distribution of the K-S test statistic itself does not
depend on the underlying cumulative distribution function being
tested. Another advantage is that it is an exact test (the
chi-square goodness-of-fit test depends on an adequate sample size
for the approximations to be valid). Despite these advantages,
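the K-S test has several important limitations:
1. It only applies to continuous distributions.
2. It tends to be more sensitive near the center of the distribution than at the tails.
3. The distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid and typically must be determined by simulation.
Due to limitations 2 and 3 above, many analysts prefer to use the Anderson-Darling goodness-of-fit test. However, the Anderson-Darling test is only available for a few specific distributions. |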
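The K-S critical values assume that the hypothesized distribution is fully specified in advance. A rough way to see what happens otherwise is to simulate the statistic under the null hypothesis twice: once against a known N(0, 1), and once against a normal whose mean and standard deviation are estimated from the same sample. The Python sketch below does this; the sample size (20), replication count, seed, and function names are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, cdf):
    """Maximum distance between the sample's ECDF and a continuous CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        max(cdf(y) - (i - 1) / n, i / n - cdf(y))
        for i, y in enumerate(ordered, start=1)
    )

def simulate_null(n=20, reps=2000, seed=1):
    """Simulate D under H0 with specified vs. estimated normal parameters."""
    rng = random.Random(seed)
    specified, estimated = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Case 1: the true N(0, 1) is supplied in advance
        specified.append(ks_statistic(sample, NormalDist(0.0, 1.0).cdf))
        # Case 2: mean and standard deviation come from the same sample
        fitted = NormalDist(mean(sample), stdev(sample))
        estimated.append(ks_statistic(sample, fitted.cdf))
    return sorted(specified), sorted(estimated)

def percentile_95(sorted_values):
    """Simple empirical 95th percentile of an already sorted list."""
    return sorted_values[int(0.95 * len(sorted_values)) - 1]

specified, estimated = simulate_null()
print(f"95th percentile of D, parameters specified: {percentile_95(specified):.3f}")
print(f"95th percentile of D, parameters estimated: {percentile_95(estimated):.3f}")
```

The second percentile should come out noticeably smaller than the first, because fitting the parameters pulls the hypothesized curve toward the data. Using the standard table with estimated parameters therefore makes the test reject less often than its nominal level.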
||||||||||
| Definition |
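The Kolmogorov-Smirnov test is defined by:
H0: The data follow a specified distribution.
Ha: The data do not follow the specified distribution.
Test statistic: The Kolmogorov-Smirnov test statistic is defined as

$$D = \max_{1 \le i \le N} \left( F(Y_i) - \frac{i-1}{N},\; \frac{i}{N} - F(Y_i) \right)$$

where F is the theoretical cumulative distribution function of the distribution being tested, which must be continuous and fully specified (location, scale, and shape parameters cannot be estimated from the data).
Significance level: α.
Critical region: The hypothesis regarding the distributional form is rejected if D is greater than the critical value obtained from a table or from software.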
|
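As a hedged illustration, the statistic can be computed directly as the largest gap between the hypothesized CDF and the empirical step function, checking both the bottom, (i-1)/N, and the top, i/N, of each step. The Python sketch below tests 20 simulated values against a standard normal and compares D with the 5% critical value of 0.294 that this page quotes for N = 20. The function name, seed, and simulated data are assumptions made for the example.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample K-S statistic D for a fully specified continuous CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, y in enumerate(ordered, start=1):
        f = cdf(y)
        # The ECDF jumps from (i-1)/N to i/N at y, so check both sides
        d = max(d, f - (i - 1) / n, i / n - f)
    return d

CRITICAL_5PCT_N20 = 0.294  # 5% critical value for N = 20, as quoted on this page

rng = random.Random(12345)  # arbitrary seed, for repeatability only
sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
d = ks_statistic(sample, NormalDist(0.0, 1.0).cdf)
print(f"D = {d:.4f}")
print("reject H0 at 5%" if d > CRITICAL_5PCT_N20 else "do not reject H0 at 5%")
```

Any continuous, fully specified CDF can be passed as `cdf`. The normal is used here only because it is available in the Python standard library.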
||||||||||
| Technical Note |
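Previous editions of the e-Handbook gave the following formula
for the computation of the Kolmogorov-Smirnov goodness-of-fit
statistic:

$$D = \max_{1 \le i \le N} \left| F(Y_i) - \frac{i}{N} \right|$$

This formula is not correct. It can be rewritten as

$$D = \max_{1 \le i \le N} \left( F(Y_i) - \frac{i}{N},\; \frac{i}{N} - F(Y_i) \right)$$

whereas the correct statistic replaces the first term with F(Y_i) - (i-1)/N, so the two formulas differ by at most 1/N.
For example, for N = 20, the upper bound on the difference between these two formulas is 0.05 (for comparison, the 5% critical value is 0.294). For N = 100, the upper bound is 0.01. In practice, if you have moderate to large sample sizes (say N ≥ 50), these formulas are essentially equivalent. |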
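The size of the discrepancy can be checked numerically. The Python sketch below assumes that the earlier formula was the maximum of |F(Y_i) - i/N| and that the corrected statistic takes the larger of F(Y_i) - (i-1)/N and i/N - F(Y_i). It then records the largest gap seen over repeated simulated normal samples. Function names, the seed, and the trial count are illustrative assumptions.

```python
import random
from statistics import NormalDist

def d_corrected(sample, cdf):
    """K-S statistic checking both the bottom and the top of each ECDF step."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        max(cdf(y) - (i - 1) / n, i / n - cdf(y))
        for i, y in enumerate(ordered, start=1)
    )

def d_previous(sample, cdf):
    """Assumed earlier formula: compares F only with the top of each step."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(abs(cdf(y) - i / n) for i, y in enumerate(ordered, start=1))

def largest_gap(n, trials=200, seed=7):
    """Largest observed value of d_corrected - d_previous over simulated samples."""
    rng = random.Random(seed)
    cdf = NormalDist(0.0, 1.0).cdf
    gap = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        gap = max(gap, d_corrected(sample, cdf) - d_previous(sample, cdf))
    return gap

for n in (20, 100):
    print(f"N = {n:>3}: largest observed gap = {largest_gap(n):.4f}, bound 1/N = {1 / n:.4f}")
```

The corrected statistic is never smaller than the earlier one, and the gap never exceeds 1/N, since F(Y_i) - (i-1)/N is exactly 1/N larger than F(Y_i) - i/N.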
||||||||||
|
Sample Output |
Dataplot generated the following output for the
Kolmogorov-Smirnov test, where 1,000 random numbers were
generated for each of a normal, double exponential, t with 3
degrees of freedom, and lognormal distribution. In all cases,
the Kolmogorov-Smirnov test was applied to test for a normal
distribution. The test does not reject the normality
hypothesis for the normal data and rejects it for the
double exponential, t, and lognormal data, with the
exception that the double exponential data are not rejected
at the 0.01 significance level.
The normal random numbers were stored in the variable Y1,
the double exponential random numbers were stored in the
variable Y2, the t random numbers were stored in the
variable Y3, and the lognormal random numbers were stored
in the variable Y4.
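A rough Python analogue of this experiment is sketched below. It is not the Dataplot procedure, and its numbers will differ from Dataplot's because the random draws differ. Several choices are assumptions made for the sketch: the null hypothesis is a standard normal N(0, 1), every distribution uses unit scale parameters, the seed is arbitrary, and the critical values use the common large-sample approximation c/sqrt(N) with c = 1.224, 1.358, and 1.628 for the 0.10, 0.05, and 0.01 levels.

```python
import math
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Maximum distance between the sample's ECDF and a continuous CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        max(cdf(y) - (i - 1) / n, i / n - cdf(y))
        for i, y in enumerate(ordered, start=1)
    )

def t3(rng):
    """Student t with 3 degrees of freedom: Z / sqrt(chi-square_3 / 3)."""
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 3.0)

N = 1000
rng = random.Random(2024)  # arbitrary seed, for repeatability only

samples = {
    "Y1 normal": [rng.gauss(0.0, 1.0) for _ in range(N)],
    # The difference of two unit exponentials is double exponential (Laplace)
    "Y2 double exponential": [
        rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(N)
    ],
    "Y3 t(3)": [t3(rng) for _ in range(N)],
    "Y4 lognormal": [rng.lognormvariate(0.0, 1.0) for _ in range(N)],
}

# Large-sample approximations to the critical values: c(alpha) / sqrt(N)
critical = {
    alpha: c / math.sqrt(N)
    for alpha, c in ((0.10, 1.224), (0.05, 1.358), (0.01, 1.628))
}

std_normal_cdf = NormalDist(0.0, 1.0).cdf
results = {name: ks_statistic(y, std_normal_cdf) for name, y in samples.items()}

for name, d in results.items():
    verdicts = ", ".join(
        f"{alpha:.2f}: {'reject' if d > cv else 'do not reject'}"
        for alpha, cv in critical.items()
    )
    print(f"{name:<22} D = {d:.4f}  ({verdicts})")
```

The lognormal sample is always rejected decisively in this setup, because every value is positive while half of a standard normal lies below zero. The verdicts for the other three samples depend on the particular draws and on the assumed null parameters.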
|
||||||||||
| Questions |
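The Kolmogorov-Smirnov test can be used to answer the following
types of questions:
1. Are the data from a normal distribution?
2. Are the data from a log-normal distribution?
3. Are the data from a Weibull distribution?
4. Are the data from an exponential distribution?
5. Are the data from a logistic distribution?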
|
||||||||||
| Importance |
Many statistical tests and procedures are based on specific
distributional assumptions.
The assumption of normality
is particularly common in classical statistical tests.
Much reliability modeling is based on the assumption
that the data follow a Weibull distribution.
There are many non-parametric and robust techniques that are not based on strong distributional assumptions. By non-parametric, we mean a technique, such as the sign test, that is not based on a specific distributional assumption. By robust, we mean a statistical technique that performs well under a wide range of distributional assumptions.

However, techniques based on specific distributional assumptions are in general more powerful than these non-parametric and robust techniques. By power, we mean the ability to detect a difference when that difference actually exists. Therefore, if the distributional assumptions can be confirmed, the parametric techniques are generally preferred.

If you are using a technique that makes a normality (or some other type of distributional) assumption, it is important to confirm that this assumption is in fact justified. If it is, the more powerful parametric techniques can be used. If the distributional assumption is not justified, using a non-parametric or robust technique may be required. |
||||||||||
| Related Techniques |
Anderson-Darling Goodness-of-Fit Test
Chi-Square Goodness-of-Fit Test
Shapiro-Wilk Normality Test
Probability Plots
Probability Plot Correlation Coefficient Plot |
||||||||||
| Case Study |
Airplane glass failure times data |
||||||||||
| Software |
Some general purpose statistical software programs, including Dataplot, support the Kolmogorov-Smirnov goodness-of-fit test, at least for some of the more common distributions. |
||||||||||