|
Purpose: Test for distributional adequacy |
The chi-square test
(Snedecor and Cochran,
1989) is used to test if a sample
of data came from a population with a specific distribution.
An attractive feature of the chi-square goodness-of-fit test is that it can be applied to any univariate distribution for which you can calculate the cumulative distribution function. The chi-square goodness-of-fit test is applied to binned data (i.e., data put into classes). This is actually not a restriction, since for non-binned data you can simply calculate a histogram or frequency table before generating the chi-square test. However, the value of the chi-square test statistic depends on how the data are binned. Another disadvantage of the chi-square test is that it requires a sufficient sample size in order for the chi-square approximation to be valid.
The chi-square test is an alternative to the Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests. Unlike those tests, which are restricted to continuous distributions, the chi-square goodness-of-fit test can also be applied to discrete distributions such as the binomial and the Poisson.
Additional discussion of the chi-square goodness-of-fit test is contained in the product and process comparisons chapter (chapter 7). |
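The binning step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Dataplot implementation: the function name `chi_square_statistic`, the cut points, the seed, and the standard normal hypothesis are all choices made for this example. It computes the usual statistic, the sum over bins of (observed - expected)^2 / expected, with expected counts taken from the hypothesized cumulative distribution function.

```python
import bisect
import random
from statistics import NormalDist


def chi_square_statistic(data, cut_points, cdf):
    """Bin `data` at `cut_points` and compare the counts with a hypothesized CDF.

    The outermost bins are open-ended, so the expected counts sum to len(data).
    Returns (statistic, observed_counts, expected_counts).
    """
    n = len(data)
    k = len(cut_points) + 1
    observed = [0] * k
    for y in data:
        # bisect_right returns the number of cut points <= y, i.e. the bin index
        observed[bisect.bisect_right(cut_points, y)] += 1
    # CDF at each bin boundary, with 0 and 1 standing in for the open tails
    bounds = [0.0] + [cdf(c) for c in cut_points] + [1.0]
    expected = [n * (bounds[i + 1] - bounds[i]) for i in range(k)]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, observed, expected


rng = random.Random(0)  # fixed seed (an arbitrary choice) so the sketch is repeatable
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
standard_normal_cdf = NormalDist(0.0, 1.0).cdf

coarse_cuts = [-1.0, 0.0, 1.0]                  # 4 bins
fine_cuts = [-2.0 + 0.5 * i for i in range(9)]  # 10 bins

stat_coarse, obs_coarse, exp_coarse = chi_square_statistic(
    sample, coarse_cuts, standard_normal_cdf)
stat_fine, obs_fine, exp_fine = chi_square_statistic(
    sample, fine_cuts, standard_normal_cdf)

print("coarse binning:", len(obs_coarse), "bins, statistic =", round(stat_coarse, 3))
print("fine binning:  ", len(obs_fine), "bins, statistic =", round(stat_fine, 3))
```

Running the same sample through both sets of cut points gives two different statistics (and two different degrees of freedom), which is the binning dependence noted above.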
||||||||||
| Definition |
The chi-square test is defined for the hypothesis:
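In its standard textbook form, which is assumed here rather than quoted from this page, the hypotheses, test statistic, and rejection rule can be written as below. The symbols $k$, $O_i$, $E_i$, $F$, $Y_u$, $Y_l$, $N$, $c$, and $\alpha$ are notation introduced for this sketch.

```latex
\begin{aligned}
H_0 &: \text{the data follow the specified distribution} \\
H_a &: \text{the data do not follow the specified distribution} \\[4pt]
\chi^2 &= \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = N \bigl( F(Y_u) - F(Y_l) \bigr) \\[4pt]
\text{Reject } H_0 &\text{ if } \chi^2 > \chi^2_{1-\alpha,\; k-c}
\end{aligned}
```

Here $O_i$ and $E_i$ are the observed and expected frequencies for bin $i$, $F$ is the cumulative distribution function of the hypothesized distribution, $Y_u$ and $Y_l$ are the upper and lower limits of bin $i$, $N$ is the sample size, $k$ is the number of non-empty bins, and $c$ is the number of estimated parameters plus one. With 24 non-empty cells and 0 estimated parameters, for example, this gives $k - c = 23$ degrees of freedom.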
|
||||||||||
|
Sample Output |
Dataplot generated the following output for the chi-square
test where 1,000 random numbers were generated for the normal,
double exponential, t with 3 degrees
of freedom, and lognormal distributions. In all cases,
the chi-square test was applied to test for a normal
distribution. The test statistics show the characteristics
of the test: when the data are from a normal distribution,
the test statistic is small and the hypothesis is not rejected
(Dataplot reports this as ACCEPT H0); when the data are from
the double exponential, t, and lognormal distributions, the
statistics are significant and
the hypothesis of an underlying normal distribution is
rejected at significance levels of 0.10, 0.05, and 0.01.
The normal random numbers were stored in the variable Y1, the double exponential random numbers were stored in the variable Y2, the t random numbers were stored in the variable Y3, and the lognormal random numbers were stored in the variable Y4.
*************************************************
** normal chi-square goodness of fit test y1 **
*************************************************
CHI-SQUARED GOODNESS-OF-FIT TEST
NULL HYPOTHESIS H0: DISTRIBUTION FITS THE DATA
ALTERNATE HYPOTHESIS HA: DISTRIBUTION DOES NOT FIT THE DATA
DISTRIBUTION: NORMAL
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 24
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 17.52155
DEGREES OF FREEDOM = 23
CHI-SQUARED CDF VALUE = 0.217101
ALPHA LEVEL CUTOFF CONCLUSION
10% 32.00690 ACCEPT H0
5% 35.17246 ACCEPT H0
1% 41.63840 ACCEPT H0
CELL NUMBER, BIN MIDPOINT, OBSERVED FREQUENCY,
AND EXPECTED FREQUENCY
WRITTEN TO FILE DPST1F.DAT
*************************************************
** normal chi-square goodness of fit test y2 **
*************************************************
CHI-SQUARED GOODNESS-OF-FIT TEST
NULL HYPOTHESIS H0: DISTRIBUTION FITS THE DATA
ALTERNATE HYPOTHESIS HA: DISTRIBUTION DOES NOT FIT THE DATA
DISTRIBUTION: NORMAL
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 26
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 2030.784
DEGREES OF FREEDOM = 25
CHI-SQUARED CDF VALUE = 1.000000
ALPHA LEVEL CUTOFF CONCLUSION
10% 34.38158 REJECT H0
5% 37.65248 REJECT H0
1% 44.31411 REJECT H0
CELL NUMBER, BIN MIDPOINT, OBSERVED FREQUENCY,
AND EXPECTED FREQUENCY
WRITTEN TO FILE DPST1F.DAT
*************************************************
** normal chi-square goodness of fit test y3 **
*************************************************
CHI-SQUARED GOODNESS-OF-FIT TEST
NULL HYPOTHESIS H0: DISTRIBUTION FITS THE DATA
ALTERNATE HYPOTHESIS HA: DISTRIBUTION DOES NOT FIT THE DATA
DISTRIBUTION: NORMAL
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 25
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 103165.4
DEGREES OF FREEDOM = 24
CHI-SQUARED CDF VALUE = 1.000000
ALPHA LEVEL CUTOFF CONCLUSION
10% 33.19624 REJECT H0
5% 36.41503 REJECT H0
1% 42.97982 REJECT H0
CELL NUMBER, BIN MIDPOINT, OBSERVED FREQUENCY,
AND EXPECTED FREQUENCY
WRITTEN TO FILE DPST1F.DAT
*************************************************
** normal chi-square goodness of fit test y4 **
*************************************************
CHI-SQUARED GOODNESS-OF-FIT TEST
NULL HYPOTHESIS H0: DISTRIBUTION FITS THE DATA
ALTERNATE HYPOTHESIS HA: DISTRIBUTION DOES NOT FIT THE DATA
DISTRIBUTION: NORMAL
SAMPLE:
NUMBER OF OBSERVATIONS = 1000
NUMBER OF NON-EMPTY CELLS = 10
NUMBER OF PARAMETERS USED = 0
TEST:
CHI-SQUARED TEST STATISTIC = 1162098.
DEGREES OF FREEDOM = 9
CHI-SQUARED CDF VALUE = 1.000000
ALPHA LEVEL CUTOFF CONCLUSION
10% 14.68366 REJECT H0
5% 16.91898 REJECT H0
1% 21.66600 REJECT H0
CELL NUMBER, BIN MIDPOINT, OBSERVED FREQUENCY,
AND EXPECTED FREQUENCY
WRITTEN TO FILE DPST1F.DAT
As we would hope, the chi-square test does not reject the
normality hypothesis for the normal distribution data set and
rejects it for the three non-normal cases.
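The cutoff values and conclusions in the output above can be cross-checked with a short script. The following is a sketch using only the Python standard library; `chi2_cdf` and `chi2_ppf` are hypothetical helper names written for this example (they are not Dataplot routines), and in practice a statistics library such as SciPy (`scipy.stats.chi2.ppf`) would normally be used instead. The test statistics and degrees of freedom are copied from the output above.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the regularized lower
    incomplete gamma function. Intended for moderate x only."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_ppf(p, df):
    """Chi-square quantile found by bisection on chi2_cdf (p not too close to 1)."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# (test statistic, degrees of freedom) as reported in the output above
reported = {
    "Y1": (17.52155, 23),
    "Y2": (2030.784, 25),
    "Y3": (103165.4, 24),
    "Y4": (1162098.0, 9),
}

decisions = {}
for name, (statistic, df) in reported.items():
    decisions[name] = []
    for alpha in (0.10, 0.05, 0.01):
        cutoff = chi2_ppf(1.0 - alpha, df)
        # Reject H0 when the statistic exceeds the upper-tail critical value
        reject = statistic > cutoff
        decisions[name].append(reject)
        print(f"{name}  alpha={alpha:.2f}  cutoff={cutoff:.5f}  "
              f"{'REJECT H0' if reject else 'do not reject H0'}")
```

The decision is made by comparing each statistic with its cutoff rather than by evaluating the CDF at the statistic, because this simple series is not suited to arguments as large as the Y3 and Y4 statistics.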
|
||||||||||
| Questions |
The chi-square test can be used to answer the following
types of questions:
|
||||||||||
| Importance |
Many statistical tests and procedures are based on specific
distributional assumptions.
The assumption of normality
is particularly common in classical statistical tests.
Much reliability modeling is based on the assumption that
the distribution of the data follows a Weibull distribution.
There are many non-parametric and robust techniques that are not based on strong distributional assumptions. By non-parametric, we mean a technique, such as the sign test, that is not based on a specific distributional assumption. By robust, we mean a statistical technique that performs well under a wide range of distributional assumptions. However, techniques based on specific distributional assumptions are in general more powerful than these non-parametric and robust techniques. By power, we mean the ability to detect a difference when that difference actually exists. Therefore, if the distributional assumption can be confirmed, the parametric techniques are generally preferred.
If you are using a technique that makes a normality (or some other type of distributional) assumption, it is important to confirm that this assumption is in fact justified. If it is, the more powerful parametric techniques can be used. If the distributional assumption is not justified, a non-parametric or robust technique may be required. |
||||||||||
| Related Techniques |
Anderson-Darling Goodness-of-Fit Test
Kolmogorov-Smirnov Test
Shapiro-Wilk Normality Test
Probability Plots
Probability Plot Correlation Coefficient Plot |
||||||||||
| Case Study | Airplane glass failure times data. | ||||||||||
| Software | Some general purpose statistical software programs, including Dataplot, provide a chi-square goodness-of-fit test for at least some of the common distributions. | ||||||||||