Dataplot Vol 1 Auxiliary Chapter

BEST CP

Name:
    BEST CP
Type:
    Analysis Command
Purpose:
    Performs a best CP analysis.
Description:
    In multilinear regression, a common task is to determine the "best" set of independent variables to use in the fit. There are three basic approaches to this problem:

    1. Perform all regressions and pick the best candidate models from this list. This is called "all subsets" regression.

    2. Start with the independent variable that provides the best fit for models containing only one independent variable. Then at each step, add one more independent variable from the remaining variables that provides the most improvement in the fit. Continue until adding an additional variable results in no significant improvement in the model. This is called "forward stepwise" regression.

    3. "Backward stepwise" regression is similar to forward stepwise regression. However, instead of starting with one independent variable and adding one at each stage, the initial model contains all the independent variables. Then one variable is deleted at each stage until the best model is reached.
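    As a rough illustration of the forward stepwise idea in item 2, here is a minimal Python sketch (it is not Dataplot code and not the algorithm Dataplot uses). The names rss and forward_select, the min_gain parameter, and the stopping rule (stop when the best remaining variable improves R-squared by less than min_gain) are all assumptions made for this example. The small normal-equations solver is chosen for compactness, not numerical quality, and assumes that the columns are not collinear and that y is not constant.

```python
def rss(columns, y):
    """Residual sum of squares for a least-squares fit of y on the given
    columns plus a constant term (normal equations, Gauss-Jordan)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    m = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(m)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(m)]
    for k in range(m):
        # Partial pivoting, then eliminate column k from every other row.
        piv = max(range(k, m), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(m):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][m] / A[r][r] for r in range(m)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def forward_select(xs, y, min_gain=0.001):
    """xs maps variable name -> column. Hypothetical stopping rule: quit
    when the best candidate raises R-squared by less than min_gain."""
    total = rss([], y)  # total sum of squares about the mean
    chosen, current = [], total
    while len(chosen) < len(xs):
        trials = {name: rss([xs[v] for v in chosen + [name]], y)
                  for name in xs if name not in chosen}
        best = min(trials, key=trials.get)
        if (current - trials[best]) / total < min_gain:
            break
        chosen.append(best)
        current = trials[best]
    return chosen


# Made-up data: y depends on x1 and x2 only; x3 is unrelated.
xs = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
}
y = [5 * a + b for a, b in zip(xs["x1"], xs["x2"])]
print(forward_select(xs, y))
```

    Backward stepwise regression could be sketched the same way by starting from all variables and removing the one whose deletion increases the residual sum of squares the least.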

    There can be some variations in the above approaches. For example,

    1. Different criteria have been proposed for deciding which variable to add/delete at a given step in forward/backward stepwise regression.

    2. Different criteria have been proposed for deciding when to quit adding/deleting variables in forward/backward stepwise regression.

    3. Some algorithms allow an entered variable to be removed at a later step.

    The choice of these criteria is complicated by the fact that adding variables will always increase the R-squared of the fit (or at least not decrease it). However, including too many variables increases multicollinearity, which results in numerically unstable models (i.e., you are essentially fitting noise). In addition, the model becomes more complex than it needs to be. A number of criteria have been proposed that attempt to balance maximizing the fit against protecting against overfitting.

    All subsets regression is the preferred algorithm in that it examines all models. However, it can be computationally impractical to perform all subsets regression when the number of independent variables becomes large. The primary disadvantage of forward/backward stepwise regression is that it may miss good candidate models. Also, it picks a single model rather than a list of good candidate models that can be examined more closely.
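    To give a sense of the scale involved, the following short Python sketch (an illustration only, not part of Dataplot; the function name n_subset_models is hypothetical) counts the non-empty subsets of k independent variables, which is the number of regressions an exhaustive all subsets search would have to fit.

```python
from math import comb


def n_subset_models(k):
    """Number of non-empty subsets of k candidate variables (2**k - 1)."""
    return sum(comb(k, size) for size in range(1, k + 1))


for k in (3, 4, 10, 20, 38):
    print(k, n_subset_models(k))
```

    The count doubles (plus one) with every additional variable, which is why exhaustive enumeration quickly becomes impractical.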

    Dataplot addresses this issue with the BEST CP command. This is based on the following:

    1. It implements the "leaps and bounds" algorithm of Furnival and Wilson (see reference below). This algorithm is a compromise between all subsets regression and forward/backward stepwise regression. It identifies the best candidate models efficiently without actually computing all possible models. It thus provides the advantage of all subsets regression (i.e., all potential models are considered) while remaining computationally practical for a large number of independent variables.

    2. Dataplot uses the Mallows Cp criterion (suggested by Mallows, see References below). The Cp statistic is defined as follows:

        C(p) = RSS(p)/(sigmahat)**2 + 2*p - n

      where

        n = the number of observations
        p = the number of variables in the regression
        RSS(p) = the residual sum of squares using p variables
        (sigmahat)**2 = an independent estimate of the error variance

      The residual variance from the full model is used as the estimate of (sigmahat)**2.

      If the model is satisfactory, Cp will be approximately equal to p.
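
      The formula can be checked with a minimal Python sketch (an illustration, not Dataplot code). The function name mallows_cp and all numeric values are hypothetical. The sketch assumes that p counts the constant term as well as the independent variables. That assumption appears consistent with the sample output on this page, where the four-variable model reports C(P) = 5.000. It shows one consequence of using the full-model residual variance as (sigmahat)**2: the full model always has C(p) equal to p.

```python
def mallows_cp(rss_p, sigma2_hat, p, n):
    """C(p) = RSS(p)/(sigmahat)**2 + 2*p - n."""
    return rss_p / sigma2_hat + 2 * p - n


# Hypothetical values, chosen only to illustrate the arithmetic.
n = 13           # number of observations
p_full = 5       # four variables plus the constant term (assumption)
rss_full = 48.0  # residual sum of squares of the full model

# Residual variance of the full model, used as the estimate of sigma**2.
sigma2_hat = rss_full / (n - p_full)

# For the full model: RSS/sigma2_hat = n - p, so C(p) = (n - p) + 2p - n = p.
cp_full = mallows_cp(rss_full, sigma2_hat, p_full, n)
print(cp_full)

# A hypothetical two-variable submodel (p = 3) with a larger RSS.
cp_sub = mallows_cp(60.0, sigma2_hat, 3, n)
print(cp_sub)
```

      A submodel whose C(p) is close to its own p loses little relative to the full model. A C(p) far above p indicates that the omitted variables matter.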

    It should be emphasized that the BEST CP command is intended simply to identify good candidate models. Also, the BEST CP command uses a computationally fast algorithm that is not as accurate as the algorithm used by the FIT command. The FIT command should be applied to the identified models that are of interest. In addition, standard regression diagnostics should be examined for the candidate models of interest.

Syntax:
    BEST CP <y> <x1> ... <xk>               <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response (dependent) variable;
                  <x1> .... <xk> is a list of one or more independent variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Examples:
    BEST CP Y X1 X2 X3 X4 X5 X6 X7
    BEST CP Y X1 X2 X3 X4 X5 X6 X7 SUBSET TAG > 1
Note:
    The BEST CP command requires at least three independent variables and at most 38 independent variables.
Note:
    By default, the BEST CP command returns the 10 best candidate models (7 if there are only three independent variables). Note that more than 10 may actually be returned, because additional models may be "free" in the leaps and bounds computations.

    To change the number of candidate models chosen, enter the command

      SET NUMBER OF CP <value>

    where <value> identifies the number of candidate models.

    Note that increasing <value> will result in greater time to generate the best candidate models. In most cases, the default value of 10 is adequate.

Note:
    It can be helpful to plot the results of the BEST CP command.

    Dataplot writes the results of the CP analysis to files. The example program below shows how to generate a CP plot using these files. Specifically,

    1. The Cp statistic and the corresponding model are written to "dpst1f.dat". The following is a sample dpst1f.dat file:
        1 138.731 : X4
        1 142.487 : X2
        1 202.549 : X1
        1 315.155 : X3
        2   2.678 : X1       X2
        2   5.496 : X1       X4
        2  22.373 : X3       X4
        2  62.438 : X2       X3
        2 138.226 : X2       X4
        3   3.018 : X1       X2       X4
        3   3.041 : X1       X2       X3
        3   3.497 : X1       X3       X4
        3   7.338 : X2       X3       X4
        4   5.000 : X1       X2       X3       X4
                   
    2. A coded form of the model is written to the file "dpst2f.dat". This coded form is useful as an identifying plot character in a CP plot. The rows of dpst1f.dat correspond to the rows of dpst2f.dat. The following is a sample dpst2f.dat file:
        4
        2
        1
        3
        12
        14
        34
        23
        24
        124
        123
        134
        234
        1234
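
    For readers who want to post-process these files outside Dataplot, here is a minimal Python sketch. The function names parse_cp_line and coded_label are hypothetical. The layout is inferred only from the samples above: each row is "count  Cp : names", and the variables are assumed to be named X followed by a single digit. How Dataplot codes variables numbered 10 or higher is not shown in the samples and is not handled here.

```python
def parse_cp_line(line):
    """Split one dpst1f.dat row into (variable count, Cp value, names)."""
    head, _, names = line.partition(":")
    count_text, cp_text = head.split()
    return int(count_text), float(cp_text), names.split()


def coded_label(variables):
    """Rebuild the dpst2f.dat style label, e.g. ['X1', 'X2'] -> '12'."""
    return "".join(name.lstrip("X") for name in variables)


sample = [
    "        1 138.731 : X4",
    "        2   2.678 : X1       X2",
    "        3   3.018 : X1       X2       X4",
]
for row in sample:
    count, cp, names = parse_cp_line(row)
    print(count, cp, coded_label(names))
```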
                   
Note:
    Dataplot uses code from OMNITAB to implement the BEST CP command.
Note:
    The Mallows CP statistic can be affected by outliers. This is discussed by Ryan (see References below). Currently, Dataplot makes no provisions for outliers in the BEST CP command.
Default:
    10 candidate models are extracted
Synonyms:
    None
Related Commands:
    FIT = Perform linear/nonlinear fitting.
    PARTIAL RESIDUAL PLOT = Generate a partial residual plot.
    PARTIAL REGRESSION PLOT = Generate a partial regression plot.
    PARTIAL LEVERAGE PLOT = Generate a partial leverage plot.
    PARTIAL CCPR PLOT = Generate a CCPR plot.
    PLOT = Generates a data/function plot.
References:
    "Regression by Leaps and Bounds", Furnival and Wilson, Technometrics, Vol. 16, No. 4, November, 1974.

    "Choosing a Subset Regression", C. L. Mallows, Joint Statistical Meetings, Los Angeles, CA, 1966.

    "OMNITAB 80: An Interpretive System for Statistical and Numerical Data Analysis", Sally Peavy, Shirley Bremer, Ruth Varner, and David Hogben, NIST Special Publication 701.

    "Modern Regression Methods", Thomas Ryan, John Wiley, 1997.

Applications:
    Multilinear Fitting
Implementation Date:
    2002/7
Program:
     
    skip 25
    read hald647.dat y x1 x2 x3 x4
    .
    echo on
    capture junk.dat
    best cp y x1 x2 x3 x4
    end of capture
    .
    skip 0
    read dpst1f.dat p cp
    read row labels dpst2f.dat
    title case asis
    label case asis
    character rowlabels
    line blank
    tic offset units data
    xtic offset 0.3 0.3
    ytic offset 10 0
    let maxp = maximum p
    major xtic mark number maxp
    xlimits 1 maxp
    title Best CP Plot (HALD647.DAT Example)
    x1label P
    y1label C(p)
    plot cp p
    line solid
    draw data 1 1 maxp maxp
        
    The following output is generated for the BEST CP command.
         REGRESSION WITH 1 VARIABLE
               C(P) STATISTIC  VARIABLES
                    138.731     4
                    142.486     2
                    202.549     1
                    315.154     3
      
         REGRESSIONS WITH  2 VARIABLES
      
                        C(P) =   2.678
         VARIABLE         COEFFICIENT       F RATIO
         X1               0.1468306E+01     146.523
         X2               0.6622506E+00     208.582
      
                        C(P) =   5.496
         VARIABLE         COEFFICIENT       F RATIO
         X1               0.1439958E+01     108.224
         X4              -0.6139536E+00     159.295
      
                        C(P) =  22.373
         VARIABLE         COEFFICIENT       F RATIO
         X3              -0.1199851E+01      40.294
         X4              -0.7246001E+00     100.357
      
                        C(P) =  62.438
         VARIABLE         COEFFICIENT       F RATIO
         X2               0.7313297E+00      36.683
         X3              -0.1008386E+01      11.816
      
                        C(P) = 138.226
         VARIABLE         COEFFICIENT       F RATIO
         X2               0.3109062E+00       0.172
         X4              -0.4569406E+00       0.431
               C(P) STATISTIC  VARIABLES
                    198.094     1  3
      
         REGRESSIONS WITH  3 VARIABLES
      
                        C(P) =   3.018
         VARIABLE         COEFFICIENT       F RATIO
         X1               0.1451938E+01     154.008
         X2               0.4161106E+00       5.026
         X4              -0.2365394E+00       1.863
      
                        C(P) =   3.041
         VARIABLE         COEFFICIENT       F RATIO
         X1               0.1695890E+01      68.717
         X2               0.6569149E+00     220.547
         X3               0.2500176E+00       1.832
      
                        C(P) =   3.497
         VARIABLE         COEFFICIENT       F RATIO
         X1               0.1051854E+01      22.113
         X3              -0.4100432E+00       4.236
         X4              -0.6427962E+00     208.240
      
                        C(P) =   7.338
         VARIABLE         COEFFICIENT       F RATIO
         X2              -0.9234153E+00      12.427
         X3              -0.1447971E+01      96.939
         X4              -0.1557044E+01      41.653
      
         REGRESSIONS WITH  4 VARIABLES
      
                        C(P) =   5.000
         VARIABLE         COEFFICIENT       F RATIO
         X1               0.1551119E+01       4.337
         X2               0.5101846E+00       0.497
         X3               0.1019266E+00       0.018
         X4              -0.1440443E+00       0.041
              14 REGRESSIONS          56 OPERATIONS
      
           NUMBER OF VARIABLES, CP VALUE, VARIABLE LIST WRITTEN
                  TO FILE DPST1F.DAT
           CODED VARIABLE LIST WRITTEN TO TO FILE DPST2F.DAT
        
    The output can be displayed in graphical form.

    [Figure: plot generated by sample program]

Date created: 8/12/2002
Last updated: 4/4/2003
Please email comments on this WWW page to alan.heckert@nist.gov.