Dataplot Vol 1 Auxiliary Chapter

PEARSON CONTINGENCY COEFFICIENT

Name:
    PEARSON CONTINGENCY COEFFICIENT (LET)
Type:
    Let Subcommand
Purpose:
    Compute Pearson's contingency coefficient for an RxC contingency table.
Description:
    If we have N observations with two variables where each observation can be classified into one of R mutually exclusive categories for variable one and one of C mutually exclusive categories for variable two, then a cross-tabulation of the data results in a two-way contingency table (also referred to as an RxC contingency table). The resulting contingency table has R rows and C columns.

    A common question with regard to a two-way contingency table is whether we have independence. By independence, we mean that the row and column variables are unassociated (i.e., knowing the value of the row variable will not help us predict the value of the column variable, and likewise knowing the value of the column variable will not help us predict the value of the row variable).

    A more technical definition for independence is that

      P(row i, column j) = P(row i)*P(column j)       for all i,j

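    The standard test statistic for determining independence is the chi-square test statistic:

      T = SUM[i=1 to r]SUM[j=1 to c][(O(ij) - E(ij))**2/E(ij)]

    where

      O(ij) = the observed count in row i, column j
      E(ij) = the expected count under independence
            = (row i total)*(column j total)/N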
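
    The chi-square statistic above can be sketched outside Dataplot as follows. This is an illustrative Python cross-check, not Dataplot code; the function name chi_square_statistic and the small 2x2 table are assumptions made for the example, and the expected counts are taken as (row total)*(column total)/N.

```python
def chi_square_statistic(table):
    """Return the chi-square statistic T for an r x c table of counts.

    Assumes every row total and column total is positive, so that no
    expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    t = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            t += (observed - expected) ** 2 / expected
    return t


# Hypothetical 2x2 table, used only for illustration.
example = [[10, 20], [30, 40]]
t_example = chi_square_statistic(example)

# A table whose rows are proportional to each other is exactly
# independent, so its statistic is zero.
t_independent = chi_square_statistic([[10, 20], [20, 40]])
print(t_example, t_independent)
```

    For the hypothetical table the expected counts are 12, 18, 28 and 42, so T = 4/12 + 4/18 + 4/28 + 4/42 = 200/252.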

    One criticism of this statistic is that it does not give a meaningful description of the degree of dependence (or strength of association). That is, it is useful for determining whether there is dependence. However, because the value of the test statistic depends on the sample size and the degrees of freedom as well as on the strength of the association, it is not easy to interpret as a measure of that strength.

    Pearson's contingency coefficient is one way to provide an easier-to-interpret measure of the strength of association. Specifically, it is:

      Pearson's Coefficient = SQRT(T/(N+T))

    where

      T = the chi-square test statistic given above
      N = the total sample size

    This statistic rescales the chi-square statistic to a value between 0 (no association) and 1. The attainable maximum is SQRT((q-1)/q), where q is the smaller of r and c, so the coefficient is always strictly less than 1. It has the desirable property of scale invariance. That is, if every cell count is multiplied by the same constant (so the sample size increases while the relative cell frequencies stay the same), the value of Pearson's contingency coefficient does not change.
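
    The scaling behavior of the coefficient can be checked numerically. The following is a minimal Python sketch, not Dataplot code; the function names and the example tables are hypothetical. It computes SQRT(T/(N+T)) for a table, for the same table with every count multiplied by 10, and for a 2x2 table with perfect association.

```python
import math


def chi_square_statistic(table):
    """Chi-square statistic T for an r x c table with positive margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


def pearson_contingency(table):
    """Pearson's contingency coefficient SQRT(T/(N+T))."""
    t = chi_square_statistic(table)
    n = sum(sum(row) for row in table)
    return math.sqrt(t / (n + t))


# Hypothetical table and the same table with all counts times 10.
base = [[10, 20], [30, 40]]
scaled = [[10 * count for count in row] for row in base]
c_base = pearson_contingency(base)
c_scaled = pearson_contingency(scaled)

# Perfect association in a 2x2 table: T equals N, so the coefficient
# is SQRT(1/2) rather than 1.
c_perfect = pearson_contingency([[50, 0], [0, 50]])
print(c_base, c_scaled, c_perfect)
```

    The first two values agree because T and N are both multiplied by 10, which leaves T/(N+T) unchanged. The third value illustrates that even a perfectly associated 2x2 table gives a coefficient below 1.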

    The data for the contingency table can be specified in either of the following two ways:

    1. raw data

      In this case, you will have two variables. The first will contain r distinct values and the second will contain c distinct values. Dataplot will automatically perform the cross-tabulation to obtain the counts for each cell. Although the distinct values will typically be integers, this is not strictly required.

    2. table data

      If you only have the resulting contingency table (i.e., the counts for each cell), then you can use the READ MATRIX (or CREATE MATRIX) command to create a matrix with the data. This is demonstrated in the example program below.

      In this case, your data should contain non-negative integers since they represent the counts for each cell.
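
    The relationship between the two input forms can be illustrated with a short Python sketch. It is not Dataplot's implementation; the function name cross_tabulate and the sample values are assumptions. Each distinct value of the first variable becomes a row, each distinct value of the second becomes a column, and each cell counts how often that pair of values occurs.

```python
from collections import Counter


def cross_tabulate(y1, y2):
    """Cross-tabulate two equal-length variables into a table of counts.

    Returns the sorted distinct row values, the sorted distinct column
    values, and the r x c table as a list of lists.
    """
    if len(y1) != len(y2):
        raise ValueError("both variables need the same number of observations")
    row_values = sorted(set(y1))
    col_values = sorted(set(y2))
    pair_counts = Counter(zip(y1, y2))
    # Counter returns 0 for pairs that never occur, which gives empty cells.
    table = [[pair_counts[(r, c)] for c in col_values] for r in row_values]
    return row_values, col_values, table


# Hypothetical raw data: 6 observations, 3 row categories, 2 column categories.
y1 = [1, 1, 2, 2, 2, 3]
y2 = [1, 2, 1, 1, 2, 2]
row_values, col_values, table = cross_tabulate(y1, y2)
print(table)
```

    The resulting table has one row per distinct value of y1 and one column per distinct value of y2, which is the form expected when the data are supplied as a matrix of counts.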

Syntax 1:
    LET <par> = PEARSON CONTINGENCY COEFFICIENT <y1> <y2>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> is the first response variable;
                <y2> is the second response variable;
                <par> is a parameter where the computed Pearson contingency coefficient is stored;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    Use this syntax for raw data.

Syntax 2:
    LET <p> = MATRIX GRAND PEARSON CONTINGENCY COEFFICIENT <m>
                            <SUBSET/EXCEPT/FOR qualification>
    where <m> is a matrix containing the contingency table;
                <p> is a parameter where the computed Pearson contingency coefficient is stored;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    Use this syntax if your data is a contingency table.

Examples:
    LET A = PEARSON CONTINGENCY COEFFICIENT Y1 Y2
    LET A = MATRIX GRAND PEARSON CONTINGENCY COEFFICIENT M
Note:
    The Cramer contingency coefficient is more commonly used than the Pearson contingency coefficient.
Note:
    The following additional commands are supported

      TABULATE PEARSON CONTINGENCY COEFFICIENT Y1 Y2 X
      CROSS TABULATE PEARSON CONTINGENCY COEFFICIENT ...
                  Y1 Y2 X1 X2

      PEARSON CONTINGENCY COEFFICIENT PLOT Y1 Y2 X
      CROSS TABULATE PEARSON CONTINGENCY COEFFICIENT PLOT ...
                  Y1 Y2 X1 X2

      BOOTSTRAP PEARSON CONTINGENCY COEFFICIENT PLOT Y1 Y2
      JACKNIFE PEARSON CONTINGENCY COEFFICIENT PLOT Y1 Y2

    The above commands expect the variables to have the same number of observations.

    Note that the above commands are only available if you have raw data.

Default:
    None
Synonyms:
    None
Related Commands:
Reference:
    Conover (1999), "Practical Nonparametric Statistics", Third Edition, Wiley, pp. 229-230.

    Friendly (2000), "Visualizing Categorical Data", SAS Institute Inc., p. 61.

Applications:
    Categorical Data Analysis
Implementation Date:
    2007/5
Program:
     
    . Sample data from page 61 of Friendly
    read matrix m
     5  29 14 16
    15  54 14 10
    20  84 17 94
    68 119 26 7
    end of data
    .
    let a = matrix grand pearson contingency coefficient m
        
    The resulting Pearson's contingency coefficient is 0.435.
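
    The reported value can be cross-checked independently of Dataplot. The following Python sketch (the function name is hypothetical, and this is not how Dataplot computes the result internally) applies SQRT(T/(N+T)) to the same 4x4 table of counts.

```python
import math


def pearson_contingency(table):
    """Return (T, N, C) for an r x c table of counts with positive margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    t = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            t += (observed - expected) ** 2 / expected
    return t, n, math.sqrt(t / (n + t))


# Same counts as the matrix read in the Dataplot program above.
m = [
    [5, 29, 14, 16],
    [15, 54, 14, 10],
    [20, 84, 17, 94],
    [68, 119, 26, 7],
]
t, n, c = pearson_contingency(m)
print(n, round(t, 2), round(c, 3))
```

    With N = 592 and a chi-square statistic of about 138.29, the coefficient rounds to 0.435, in agreement with the value above.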

Date created: 7/24/2007
Last updated: 7/24/2007
Please email comments on this WWW page to alan.heckert@nist.gov.