Kolmogorov-Smirnov Goodness-of-Fit Test


Purpose: Test for Distributional Adequacy 
The Kolmogorov-Smirnov test
(Chakravarti, Laha,
and Roy, 1967) is used to decide if a sample comes from a
population with a specific distribution.
The Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (ECDF). Given N ordered data points Y_{1}, Y_{2}, ..., Y_{N}, the ECDF is defined as

    E_{N} = n(i)/N

where n(i) is the number of points less than Y_{i}, and the Y_{i} are ordered from smallest to largest value. This is a step function that increases by 1/N at the value of each ordered data point.
A plot of the empirical distribution function overlaid with a normal cumulative distribution function for 100 normal random numbers illustrates the idea: the K-S test is based on the maximum vertical distance between these two curves.
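As a concrete illustration (a Python sketch using numpy and scipy, which are not the software the handbook itself uses; the seed and sample are arbitrary), the ECDF of 100 standard normal random numbers can be compared with the normal CDF:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
y = np.sort(rng.normal(size=100))
n = len(y)

# ECDF: E_N(x) = n(i)/N, the fraction of sample points at or below x.
# Evaluated at the ordered sample values this is simply i/N.
ecdf = np.arange(1, n + 1) / n

# Vertical distances between the ECDF and the hypothesized normal CDF;
# the K-S statistic is the largest such distance over the whole curve.
gap = np.max(np.abs(ecdf - norm.cdf(y)))
print(round(gap, 4))
```

The maximum gap shrinks roughly like 1/sqrt(N) when the hypothesized distribution is correct, which is why a large gap is evidence against the null hypothesis.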


Characteristics and Limitations of the K-S Test 
An attractive feature of this test is that
the distribution of the K-S test statistic itself does not
depend on the underlying cumulative distribution function being
tested. Another advantage is that it is an exact test (the
chi-square goodness-of-fit test depends on an adequate sample size
for the approximations to be valid). Despite these advantages,
the K-S test has several important limitations:

1. It only applies to continuous distributions.
2. It tends to be more sensitive near the center of the distribution than at the tails.
3. Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid. It typically must be determined by simulation.

Several goodness-of-fit tests, such as the Anderson-Darling test and the Cramer-von Mises test, are refinements of the K-S test. As these refined tests are generally considered to be more powerful than the original K-S test, many analysts prefer them. Also, the advantage for the K-S test of having the critical values be independent of the underlying distribution is not as much of an advantage as it first appears. This is due to limitation 3 above (i.e., the distribution parameters are typically not known and have to be estimated from the data). So in practice, the critical values for the K-S test have to be determined by simulation just as for the Anderson-Darling and Cramer-von Mises (and related) tests.

Note that although the K-S test is typically developed in the context of continuous distributions for uncensored and ungrouped data, the test has in fact been extended to discrete distributions and to censored and grouped data. We do not discuss those cases here.
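The need to determine critical values by simulation when parameters are estimated from the data can be illustrated with a small Monte Carlo sketch in Python (scipy assumed; the sample size, replication count, and seed are all arbitrary choices for illustration). Estimating the mean and standard deviation from the same sample makes the statistic D systematically smaller, so the standard-table critical value is too large and the test would be too conservative:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
n, reps = 50, 2000  # illustrative sample size and Monte Carlo replications

# Simulate the null distribution of D when mu and sigma are estimated
# from the same sample that is being tested.
d_null = np.empty(reps)
for i in range(reps):
    x = rng.normal(size=n)
    d_null[i] = kstest(x, 'norm', args=(x.mean(), x.std(ddof=1))).statistic

crit_est = np.quantile(d_null, 0.95)  # simulated 5% critical value
crit_std = 1.358 / np.sqrt(n)         # large-sample standard-table value
print(round(crit_est, 3), round(crit_std, 3))
```

The simulated critical value comes out well below the standard-table value, which is exactly the situation handled by Lilliefors-type tables for the normal case.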

Definition 
The Kolmogorov-Smirnov test is defined by:

H_{0}: The data follow a specified distribution
H_{a}: The data do not follow the specified distribution

Test statistic: The Kolmogorov-Smirnov test statistic is defined as

    D = max_{1 <= i <= N} ( F(Y_{i}) - (i-1)/N , i/N - F(Y_{i}) )

where F is the theoretical cumulative distribution of the distribution being tested, which must be a continuous distribution (i.e., no discrete distributions such as the binomial or Poisson), and it must be fully specified (i.e., the location, scale, and shape parameters cannot be estimated from the data).

Significance level: α

Critical region: The hypothesis regarding the distributional form is rejected if the test statistic, D, is greater than the critical value obtained from a table. There are several variations of these tables in the literature that use somewhat different scalings for the K-S test statistic and critical regions.
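The two-sided statistic D = max( F(Y_{i}) - (i-1)/N , i/N - F(Y_{i}) ) can be computed directly, as in this Python sketch (scipy assumed; the seed and sample size are arbitrary). The exact 5% critical value for a fully specified null distribution is available from scipy.stats.kstwo:

```python
import numpy as np
from scipy.stats import norm, kstwo

rng = np.random.default_rng(7)  # arbitrary seed
n = 20
y = np.sort(rng.normal(size=n))

# The ECDF jumps at each Y_i, so both endpoints of the step
# are compared against the hypothesized CDF F(Y_i).
f = norm.cdf(y)
i = np.arange(1, n + 1)
d = np.max(np.maximum(f - (i - 1) / n, i / n - f))

# Exact 5% critical value of the two-sided statistic for N = 20,
# valid when the null distribution is fully specified.
crit = kstwo.ppf(0.95, n)
print(round(d, 4), round(crit, 4), d > crit)
```

For N = 20 the exact critical value agrees with the tabled value of 0.294 quoted in the Technical Note below.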


Technical Note 
Previous editions of the e-Handbook gave the following formula
for the computation of the Kolmogorov-Smirnov goodness-of-fit
statistic:

    D = max_{1 <= i <= N} | F(Y_{i}) - i/N |

This formula is not correct for the two-sided test; however, the difference between it and the formula given in the Definition above is bounded by 1/N. For example, for N = 20, the upper bound on the difference between these two formulas is 0.05 (for comparison, the 5% critical value is 0.294). For N = 100, the upper bound is 0.01. In practice, if you have moderate to large sample sizes (say N ≥ 50), these formulas are essentially equivalent.
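The closeness of the earlier one-sided formula max |F(Y_{i}) - i/N| to the two-sided definition can be checked numerically (a Python sketch assuming numpy and scipy; the seed is arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)  # arbitrary seed
for n in (20, 100):
    y = np.sort(rng.normal(size=n))
    f = norm.cdf(y)
    i = np.arange(1, n + 1)
    d_old = np.max(np.abs(f - i / n))                       # earlier e-Handbook formula
    d_new = np.max(np.maximum(f - (i - 1) / n, i / n - f))  # current definition
    assert 0 <= d_new - d_old <= 1 / n                      # bound stated in the text
    print(n, round(d_old, 4), round(d_new, 4))
```

Each term F(Y_{i}) - (i-1)/N exceeds F(Y_{i}) - i/N by exactly 1/N, which is why the two maxima can never differ by more than 1/N.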

Kolmogorov-Smirnov Test Example 
We generated 1,000 random numbers for normal, double exponential,
t with 3 degrees of freedom, and lognormal distributions.
In all cases, the Kolmogorov-Smirnov test was applied to test for
a normal distribution.
The normal random numbers were stored in the variable Y1, the double exponential random numbers in the variable Y2, the t random numbers in the variable Y3, and the lognormal random numbers in the variable Y4.

H_{0}: the data are normally distributed
H_{a}: the data are not normally distributed

Y1 test statistic: D = 0.0241492
Y2 test statistic: D = 0.0514086
Y3 test statistic: D = 0.0611935
Y4 test statistic: D = 0.5354889

Significance level: α = 0.05
Critical value: 0.04301
Critical region: Reject H_{0} if D > 0.04301

As expected, the null hypothesis is not rejected for the normally distributed data (Y1), but is rejected for the remaining three data sets that are not normally distributed.
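The structure of this example can be reproduced with scipy (an illustrative sketch, not the Dataplot/R code the handbook refers to; the random seed is arbitrary, so the statistics will differ from those quoted above, and 1.358/√N is the usual large-sample 5% approximation to the critical value):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(123)  # arbitrary seed; statistics differ from the text
n = 1000
samples = {
    'normal':             rng.normal(size=n),
    'double exponential': rng.laplace(size=n),
    't (3 df)':           rng.standard_t(3, size=n),
    'lognormal':          rng.lognormal(size=n),
}

crit = 1.358 / np.sqrt(n)  # large-sample 5% critical value, about 0.0429
for name, y in samples.items():
    # Test each sample against a fully specified standard normal
    d = kstest(y, 'norm').statistic
    print(f'{name:18s} D = {d:.4f}  reject H0: {d > crit}')
```

Note that testing against a standard normal with known parameters is what keeps the tabled critical value valid here; if the mean and standard deviation were estimated from the data, the critical value would have to be simulated as discussed above.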

Questions 
The Kolmogorov-Smirnov test can be used to answer the following
types of questions:

- Are the data from a normal distribution?
- Are the data from a log-normal distribution?
- Are the data from a Weibull distribution?
- Are the data from an exponential distribution?
- Are the data from a logistic distribution?


Importance 
Many statistical tests and procedures are based on specific
distributional assumptions.
The assumption of normality
is particularly common in classical statistical tests.
Much reliability modeling is based on the assumption
that the data follow a Weibull distribution.
There are many nonparametric and robust techniques that are not based on strong distributional assumptions. By nonparametric, we mean a technique, such as the sign test, that is not based on a specific distributional assumption. By robust, we mean a statistical technique that performs well under a wide range of distributional assumptions.

However, techniques based on specific distributional assumptions are in general more powerful than these nonparametric and robust techniques. By power, we mean the ability to detect a difference when that difference actually exists. Therefore, if the distributional assumptions can be confirmed, the parametric techniques are generally preferred.

If you are using a technique that makes a normality (or some other type of distributional) assumption, it is important to confirm that this assumption is in fact justified. If it is, the more powerful parametric techniques can be used. If the distributional assumption is not justified, using a nonparametric or robust technique may be required.

Related Techniques 
- Anderson-Darling Goodness-of-Fit Test
- Chi-Square Goodness-of-Fit Test
- Shapiro-Wilk Normality Test
- Probability Plots
- Probability Plot Correlation Coefficient Plot

Software 
Some general purpose statistical software programs support the Kolmogorov-Smirnov goodness-of-fit test, at least for the more common distributions. Both Dataplot code and R code can be used to generate the analyses in this section.