
# CHI-SQUARE INDEPENDENCE TEST

Name:
CHI-SQUARE INDEPENDENCE TEST (LET)
Type:
Analysis Command
Purpose:
Perform a chi-square test of independence for a two-way contingency table.
Description:
If we have N observations with two variables where each observation can be classified into one of R mutually exclusive categories for variable one and one of C mutually exclusive categories for variable two, then a cross-tabulation of the data results in a two-way contingency table (also referred to as an RxC contingency table). The resulting contingency table has R rows and C columns.

A common question with regard to a two-way contingency table is whether we have independence. By independence, we mean that the row and column variables are unassociated (i.e., knowing the value of the row variable will not help us predict the value of the column variable, and likewise knowing the value of the column variable will not help us predict the value of the row variable).

A more technical definition for independence is that

P(row i, column j) = P(row i)*P(column j)       for all i,j

The chi-square test for independence is one test of this hypothesis.

H0: The row and column variables are independent
Ha: The row and column variables are not independent
Test Statistic:
$$T = \sum_{i=1}^{r}{\sum_{j=1}^{c}{\frac{(O_{ij} - E_{ij})^2} {E_{ij}}}}$$

where

r = the number of rows in the contingency table
c = the number of columns in the contingency table
Oij = the observed frequency of the ith row and jth column
Eij = the expected frequency of the ith row and jth column = $$\frac{R_i C_j}{N}$$
Ri = the sum of the observed frequencies for row i
Cj = the sum of the observed frequencies for column j
N = the total sample size
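The expected frequency follows from the independence definition: under H0 the cell probability is estimated by (Ri/N)*(Cj/N), and multiplying by N gives Ri*Cj/N. The following is a minimal Python sketch of the statistic, not Dataplot's implementation; the function names are hypothetical and the table is the one used in the Program section of this page.

```python
def expected_frequencies(table):
    """Return E[i][j] = R_i * C_j / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Return (T, degrees of freedom) for the uncorrected test."""
    expected = expected_frequencies(table)
    t = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return t, df


# Table from the Program section of this page.
table = [
    [5, 29, 14, 16],
    [15, 54, 14, 10],
    [20, 84, 17, 94],
    [68, 119, 26, 7],
]
t, df = chi_square_statistic(table)
print(round(t, 4), df)
```

The printed statistic and degrees of freedom can be compared with the "WITHOUT YATES CONTINUITY CORRECTION" values in the sample output.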

Significance Level: $$\alpha$$
Critical Region: T > CHSPPF(1-$$\alpha$$,(r-1)*(c-1))

where CHSPPF is the percent point function of the chi-square distribution and (r-1)*(c-1) is the degrees of freedom
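As a rough illustration of the decision rule, the sketch below approximates the upper-tail critical value (the 1-alpha quantile of the chi-square distribution) with the Wilson-Hilferty normal approximation. This is a stand-in chosen so the example needs only the Python standard library; it is not the CHSPPF routine, and the function names are hypothetical.

```python
from statistics import NormalDist


def chi_square_critical_value(alpha, df):
    """Approximate the (1 - alpha) chi-square quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    k = 2.0 / (9.0 * df)
    return df * (1.0 - k + z * k ** 0.5) ** 3


def reject_independence(t, alpha, rows, cols):
    """Return True if T falls in the critical region for an r x c table."""
    df = (rows - 1) * (cols - 1)
    return t > chi_square_critical_value(alpha, df)


crit = chi_square_critical_value(0.05, 9)
print(round(crit, 2))
print(reject_independence(138.2898, 0.05, 4, 4))
```

For 9 degrees of freedom and alpha = 0.05 the approximation lands close to the 16.92 shown in the sample output table; an exact quantile function should be preferred when one is available.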

Conclusion: Reject the independence hypothesis if the value of the test statistic is greater than the chi-square critical value.

This test statistic can also be formulated as

$$T = \sum_{i=1}^{r}{\sum_{j=1}^{c}{d_{ij}^2}}$$

where

$$d_{ij} = \frac{O_{ij} - E_{ij}} {\sqrt{E_{ij}}}$$

The dij are referred to as the standardized residuals. The squared residuals show the contribution of each cell to the chi-square test statistic.
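A small sketch of the residual form, using a hypothetical 2x2 table chosen only for illustration. It computes each residual as (O - E)/sqrt(E) and sums the squares to recover the test statistic; the function name is an assumption, not a Dataplot routine.

```python
from math import sqrt


def standardized_residuals(table):
    """Return d[i][j] = (O_ij - E_ij) / sqrt(E_ij) with E_ij = R_i * C_j / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [(o - r * c / n) / sqrt(r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]


# Hypothetical 2x2 table, not taken from the references.
table = [[10, 20], [30, 40]]
d = standardized_residuals(table)

# Summing the squared residuals gives the chi-square statistic.
t = sum(d_ij ** 2 for row in d for d_ij in row)
print(d)
print(t)
```

The sign of each residual indicates whether the cell was observed more often (positive) or less often (negative) than independence would predict.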

Syntax 1:
CHI-SQUARE INDEPENDENCE TEST <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax is used for the case where you have raw data (i.e., the data has not yet been cross tabulated into a two-way table).
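To make the distinction between raw data and a two-way table concrete, the sketch below cross-tabulates two raw variables in Python. The data values are hypothetical, and ordering the levels by sorted value is an assumption made for the example rather than a statement about how Dataplot orders them.

```python
from collections import Counter


def cross_tabulate(y1, y2):
    """Return (row_levels, col_levels, table) for two equal-length sequences."""
    if len(y1) != len(y2):
        raise ValueError("y1 and y2 must have the same length")
    row_levels = sorted(set(y1))
    col_levels = sorted(set(y2))
    counts = Counter(zip(y1, y2))  # missing pairs count as 0
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


# Hypothetical raw data: one pair of category codes per observation.
y1 = [1, 1, 2, 2, 2, 1, 2, 1]
y2 = [1, 2, 1, 2, 2, 1, 2, 2]
rows, cols, table = cross_tabulate(y1, y2)
print(table)
```

The resulting table has one row per distinct value of the first variable and one column per distinct value of the second, which is the form expected by the matrix syntax.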

Syntax 2:
CHI-SQUARE INDEPENDENCE TEST <m>
<SUBSET/EXCEPT/FOR qualification>
where <m> is a matrix containing the two-way table;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax is used for the case where the data have already been cross-tabulated into a two-way contingency table.

Syntax 3:
CHI-SQUARE INDEPENDENCE TEST <n11> <n12> <n21> <n22>
where <n11> is a parameter containing the value for row 1, column 1 of a 2x2 table;
<n12> is a parameter containing the value for row 1, column 2 of a 2x2 table;
<n21> is a parameter containing the value for row 2, column 1 of a 2x2 table;
and <n22> is a parameter containing the value for row 2, column 2 of a 2x2 table.

This syntax is used for the special case where you have a 2x2 table. In this case, you can enter the 4 values directly, although you do need to be careful that the parameters are entered in the order expected above.

Examples:
CHI-SQUARE INDEPENDENCE TEST Y1 Y2
CHI-SQUARE INDEPENDENCE TEST M
CHI-SQUARE INDEPENDENCE TEST N11 N12 N21 N22
Note:
The chi-square approximation is asymptotic. This means that the critical values may not be valid if the expected frequencies are too small.

Cochran suggests that if the minimum expected frequency is less than 1 or if more than 20% of the expected frequencies are less than 5, the approximation may be poor. However, Conover suggests that this is probably too conservative, particularly if r and c are not too small. He suggests that the minimum expected frequency should be at least 0.5 and that at least half the expected frequencies should be greater than 1.

In any event, if there are too many low expected frequencies, you can do one of the following:

1. If rows or columns with small expected frequencies can be intelligently combined, then this may result in expected frequencies that are sufficiently large.

2. Use Fisher's exact test.
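The Cochran and Conover rules of thumb described in this note can be checked mechanically before trusting the chi-square approximation. The sketch below is one possible reading of those rules (the Cochran rule is taken as "more than 20% of cells below 5"); the function names and the sparse table are hypothetical.

```python
def expected_frequencies(table):
    """Return E[i][j] = R_i * C_j / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def check_expected_frequencies(table):
    """Evaluate both rules of thumb on the expected frequencies."""
    flat = [e for row in expected_frequencies(table) for e in row]
    smallest = min(flat)
    frac_below_5 = sum(e < 5 for e in flat) / len(flat)
    frac_above_1 = sum(e > 1 for e in flat) / len(flat)
    return {
        "min_expected": smallest,
        "cochran_ok": smallest >= 1 and frac_below_5 <= 0.20,
        "conover_ok": smallest >= 0.5 and frac_above_1 >= 0.5,
    }


# Table from the Program section of this page.
dense = [[5, 29, 14, 16], [15, 54, 14, 10], [20, 84, 17, 94], [68, 119, 26, 7]]
# Hypothetical sparse table with one very small expected frequency.
sparse = [[1, 0], [2, 50]]

dense_check = check_expected_frequencies(dense)
sparse_check = check_expected_frequencies(sparse)
print(dense_check)
print(sparse_check)
```

When either check fails, combining categories or switching to Fisher's exact test, as listed above, are the usual remedies.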
Note:
Conover points out that there are really 3 distinct tests:

1. Only N is fixed. The row and column totals are not fixed (i.e., they are random).

2. Either the row totals or the column totals are fixed beforehand.

3. Both the row totals and the column totals are fixed beforehand.

Note that in all three cases, the test statistic and the chi-square approximation are the same. What differs is the exact distribution of the test statistic. When either the row or column totals (or both) are fixed, the possible number of contingency tables is reduced.

As long as the expected frequencies are sufficiently large, the chi-square approximation should be adequate for practical purposes.

Note:
Some authors recommend using a continuity correction (the Yates correction) for this test. In this case, 0.5 is subtracted from the absolute difference |Oij - Eij| in each cell before squaring. Dataplot performs this test both with the continuity correction and without the continuity correction.
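The sketch below assumes the usual Yates form of the correction, in which 0.5 is subtracted from |O - E| in each cell before squaring. Applied to the table from the Program section, this form agrees with the corrected statistic reported in the sample output. The function name is hypothetical, and this is not Dataplot's source code.

```python
def yates_chi_square(table):
    """Chi-square statistic with 0.5 subtracted from |O - E| in every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    t = 0.0
    for row, r in zip(table, row_totals):
        for o, c in zip(row, col_totals):
            e = r * c / n  # expected frequency under independence
            t += (abs(o - e) - 0.5) ** 2 / e
    return t


# Table from the Program section of this page.
table = [
    [5, 29, 14, 16],
    [15, 54, 14, 10],
    [20, 84, 17, 94],
    [68, 119, 26, 7],
]
t_yates = yates_chi_square(table)
print(round(t_yates, 2))
```

The corrected statistic is smaller than the uncorrected one here, so the correction makes the test more conservative.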
Note:
The following information is written to the file dpst1f.dat (in the current directory):

Column 1 - row id
Column 2 - column id
Column 3 - row total
Column 4 - column total
Column 5 - expected frequency (Eij)
Column 6 - observed frequency (Oij)

To read this information into Dataplot, enter

SKIP 1
READ DPST1F.DAT ROWID COLID ROWTOT COLTOT ...
EXPFREQ OBSFREQ
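Outside Dataplot, the same six columns could be parsed with a few lines of Python. This sketch assumes the file has one header line followed by whitespace-separated numeric columns in the order listed above; the exact layout of dpst1f.dat is not specified here, so the parser, its name, and the sample lines are all hypothetical.

```python
def parse_dpst1f(lines, skip=1):
    """Parse an iterable of text lines into one dict per table cell."""
    names = ("rowid", "colid", "rowtot", "coltot", "expfreq", "obsfreq")
    records = []
    for line in list(lines)[skip:]:
        fields = line.split()
        if len(fields) != len(names):
            continue  # ignore blank or unexpected lines
        records.append(dict(zip(names, map(float, fields))))
    return records


# Hypothetical contents for the 2x2 table [[10, 20], [30, 40]].
sample = [
    "ROW COLUMN ROWTOTAL COLTOTAL EXPECTED OBSERVED",
    "1 1 30 40 12 10",
    "1 2 30 60 18 20",
    "2 1 70 40 28 30",
    "2 2 70 60 42 40",
]
records = parse_dpst1f(sample)
print(records[0])
```

To read a real file, the lines could be supplied with `open("dpst1f.dat")` in place of the sample list, provided the layout assumption holds.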
Note:
The ASSOCIATION PLOT command can be used to plot the standardized residuals of the chi-square analysis.

The ODDS RATIO INDEPENDENCE TEST is an alternative test for independence based on the LOG(odds ratio).

Default:
None
Synonyms:
None
Related Commands:
ODDS RATIO INDEPENDENCE TEST = Perform a log(odds ratio) test for independence.
FISHER EXACT TEST = Perform Fisher's exact test.
ASSOCIATION PLOT = Generate an association plot.
SIEVE PLOT = Generate a sieve plot.
ROSE PLOT = Generate a Rose plot.
BINARY TABULATION PLOT = Generate a binary tabulation plot.
ROC CURVE = Generate a ROC curve.
ODDS RATIO = Compute the bias corrected odds ratio.
LOG ODDS RATIO = Compute the bias corrected log(odds ratio).
Reference:
Conover (1999), "Practical Nonparametric Statistics", Third Edition, Wiley, pp. 204-216.

Friendly (2000), "Visualizing Categorical Data", SAS Institute Inc., p. 90.

Cochran (1952), "The Chi-Square Test of Goodness of Fit", Annals of Mathematical Statistics, 23, pp. 315-345.

Applications:
Categorical Data Analysis
Implementation Date:
2007/3
Program:

. Example from page 61 of Friendly
read matrix m
5  29 14 16
15  54 14 10
20  84 17 94
68 119 26 7
end of data
.
chi-square independence test m

The following output is generated:
           CHI-SQUARE TEST FOR INDEPENDENCE (RXC TABLE)

NULL HYPOTHESIS: THE TWO VARIABLES ARE INDEPENDENT
ALTERNATIVE HYPOTHESIS: THE TWO VARIABLES ARE NOT INDEPENDENT

SAMPLE 1:
NUMBER OF OBSERVATIONS                    =      592
NUMBER OF LEVELS (ROWS)                   =        4

SAMPLE 2:
NUMBER OF OBSERVATIONS                    =      592
NUMBER OF LEVELS (COLUMNS)                =        4

WITHOUT YATES CONTINUITY CORRECTION:
CHI-SQUARE TEST STATISTIC                =    138.2898
DEGREES OF FREEDOM                       =        9
CDF VALUE OF TEST STATISTIC              =    1.000000

WITH YATES CONTINUITY CORRECTION:
CHI-SQUARE TEST STATISTIC                =    132.0374
DEGREES OF FREEDOM                       =        9
CDF VALUE OF TEST STATISTIC              =    1.000000

WITHOUT YATES CONTINUITY CORRECTION
                                      NULL HYPOTHESIS   NULL
NULL          CONFIDENCE    CRITICAL  ACCEPTANCE        HYPOTHESIS
HYPOTHESIS    LEVEL         VALUE     INTERVAL          CONCLUSION
===================================================================
INDEPENDENT      50.0%        8.34     (0,0.500)        REJECT
INDEPENDENT      80.0%       12.24     (0,0.800)        REJECT
INDEPENDENT      90.0%       14.68     (0,0.900)        REJECT
INDEPENDENT      95.0%       16.92     (0,0.950)        REJECT
INDEPENDENT      97.5%       19.02     (0,0.975)        REJECT
INDEPENDENT      99.0%       21.67     (0,0.990)        REJECT

WITH YATES CONTINUITY CORRECTION
                                      NULL HYPOTHESIS   NULL
NULL          CONFIDENCE    CRITICAL  ACCEPTANCE        HYPOTHESIS
HYPOTHESIS    LEVEL         VALUE     INTERVAL          CONCLUSION
===================================================================
INDEPENDENT      50.0%        8.34     (0,0.500)        REJECT
INDEPENDENT      80.0%       12.24     (0,0.800)        REJECT
INDEPENDENT      90.0%       14.68     (0,0.900)        REJECT
INDEPENDENT      95.0%       16.92     (0,0.950)        REJECT
INDEPENDENT      97.5%       19.02     (0,0.975)        REJECT
INDEPENDENT      99.0%       21.67     (0,0.990)        REJECT

Date created: 07/25/2007
Last updated: 12/11/2023