

    Analysis Command
    Performs a best Cp analysis.
    In multilinear regression, a common task is to determine the "best" set of independent variables to use in the fit. There are three basic approaches to this problem:

    1. Perform all regressions and pick the best candidate models from this list. This is called "all subsets" regression.

    2. Start with the independent variable that provides the best fit for models containing only one independent variable. Then at each step, add one more independent variable from the remaining variables that provides the most improvement in the fit. Continue until adding an additional variable results in no significant improvement in the model. This is called "forward stepwise" regression.

    3. "Backward stepwise" regression is similar to forward stepwise regression. However, instead of starting with one independent variable and adding one at each stage, the initial model contains all the independent variables. Then one variable is deleted at each stage until the best model is reached.
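The forward stepwise approach described above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not Dataplot's implementation; the F-to-enter threshold of 4.0 is an assumed stopping rule.

```python
# Minimal sketch of forward stepwise regression (illustrative only;
# this is not Dataplot's implementation).
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, f_enter=4.0):
    """Add, at each step, the variable that most reduces the RSS;
    stop when the partial F statistic falls below f_enter."""
    n = len(y)
    remaining = list(range(X.shape[1]))
    selected = []
    current = float(((y - y.mean()) ** 2).sum())  # intercept-only RSS
    while remaining:
        best_rss, best_j = min((rss(X[:, selected + [j]], y), j)
                               for j in remaining)
        df = n - len(selected) - 2  # residual df after adding best_j
        f_stat = (current - best_rss) / (best_rss / df)
        if f_stat < f_enter:
            break  # no significant improvement; stop
        selected.append(best_j)
        remaining.remove(best_j)
        current = best_rss
    return selected
```

On synthetic data where the response depends only on two of the columns, those two columns are the first variables selected.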

    There can be some variations in the above approaches. For example,

    1. Different criteria have been proposed for deciding which variable to add/delete at a given step in forward/backward stepwise regression.

    2. Different criteria have been proposed for deciding when to quit adding/deleting variables in forward/backward stepwise regression.

    3. Some algorithms allow an entered variable to be removed at a later step.

    The choice among these criteria is complicated by the fact that adding variables always increases the R-squared of the fit (or at least does not decrease it). However, including too many variables increases multicollinearity, which results in numerically unstable models (i.e., you are essentially fitting noise). In addition, the model becomes more complex than it needs to be. A number of criteria have been proposed that attempt to balance maximizing the quality of the fit while protecting against overfitting.

    All subsets regression is the preferred approach in that it examines all models. However, it can be computationally impractical when the number of independent variables becomes large. The primary disadvantage of forward/backward stepwise regression is that it may miss good candidate models. Also, stepwise methods pick a single model rather than a list of good candidate models that can be examined more closely.

    Dataplot addresses this issue with the BEST CP command. This is based on the following:

    1. It implements the "leaps and bounds" algorithm of Furnival and Wilson (see References below). This algorithm is a compromise between all subsets regression and forward/backward stepwise regression. It provides an efficient method for identifying the best candidate models without actually computing all possible models. It thus provides the advantage of all subsets regression (i.e., all potential models are considered) while remaining computationally practical for a large number of independent variables.

    2. Dataplot uses the Mallows Cp criterion (suggested by Mallows; see References below). The Cp statistic is defined as follows:

        C(p) = RSS(p)/(sigmahat)**2 + 2*p - n

    where

        n             = the number of observations
        p             = the number of coefficients in the regression
        RSS(p)        = the residual sum of squares for the reduced model
        (sigmahat)**2 = an independent estimate of the error variance (this value is typically unknown, so it is estimated using the residual variance from the full model)

      If the model is satisfactory, Cp will be approximately equal to p.
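The Cp computation above can be illustrated with a brute-force sketch in Python/NumPy. This is illustrative only: it enumerates every subset directly rather than using leaps and bounds, and it is not Dataplot's code. Here p counts the coefficients including the intercept, matching the definition above.

```python
# Illustrative brute-force computation of Mallows' Cp for every subset
# (BEST CP avoids this full enumeration via leaps and bounds).
from itertools import combinations
import numpy as np

def fit_rss(X, y, cols):
    """RSS of a least-squares fit on the given columns, with intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def mallows_cp(X, y):
    """Return (Cp, columns) for every non-empty subset, sorted by Cp."""
    n, k = X.shape
    # (sigmahat)**2 from the residual variance of the full model
    sigma2 = fit_rss(X, y, range(k)) / (n - k - 1)
    results = []
    for m in range(1, k + 1):
        for cols in combinations(range(k), m):
            p = m + 1  # number of coefficients, including the intercept
            cp = fit_rss(X, y, cols) / sigma2 + 2 * p - n
            results.append((cp, cols))
    return sorted(results)
```

By construction, the full model always has Cp exactly equal to p; with four independent variables that is 5.000, as in the sample dpst1f.dat listing and output below.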

    It should be emphasized that the BEST CP command is intended only to identify good candidate models. The BEST CP command uses a computationally fast algorithm that is not as accurate as the algorithm used by the FIT command, so the FIT command should be applied to any identified models of interest. Standard regression diagnostics should also be examined for those candidate models.
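The pruning idea behind leaps and bounds can be sketched as a toy branch-and-bound (an illustration assuming only the monotonicity of RSS, not Furnival and Wilson's actual algorithm): since adding regressors can never increase the RSS, the RSS of the largest model reachable within a branch lower-bounds every model in that branch, so whole groups of subsets can be skipped without being fit.

```python
# Toy branch-and-bound for best-subset selection, illustrating the
# bounding idea behind leaps and bounds (not the actual algorithm).
import numpy as np

def subset_rss(X, y, cols):
    """RSS of a least-squares fit on the given columns, with intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y, size):
    """Best subset of exactly `size` columns by RSS, pruning branches
    whose lower bound is already worse than the incumbent."""
    k = X.shape[1]
    best = {"rss": np.inf, "cols": None}

    def recurse(chosen, next_idx):
        if len(chosen) == size:
            r = subset_rss(X, y, chosen)
            if r < best["rss"]:
                best["rss"], best["cols"] = r, tuple(chosen)
            return
        remaining = list(range(next_idx, k))
        if len(chosen) + len(remaining) < size:
            return  # not enough columns left to reach `size`
        # Adding columns never increases RSS, so this bounds the branch:
        if subset_rss(X, y, chosen + remaining) >= best["rss"]:
            return  # "leap" over every subset in this branch
        for j in remaining:
            recurse(chosen + [j], j + 1)

    recurse([], 0)
    return best["cols"], best["rss"]
```

The pruned search returns the same subset as exhaustive enumeration, just with fewer regressions fit.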

    BEST CP <y> <x1> ... <xk>               <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response (dependent) variable;
          <x1> ... <xk> is a list of one or more independent variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.
    BEST CP Y X1 X2 X3 X4 X5 X6 X7
    BEST CP Y X1 X2 X3 X4 X5 X6 X7 SUBSET TAG > 1
    The BEST CP command requires at least three and at most 38 independent variables.
    By default, the BEST CP command returns the 10 best candidate models (7 if there are only three independent variables). Note that more than 10 models may actually be returned, since additional models may come "for free" in the leaps and bounds computations.

    To change the number of candidate models chosen, enter the command

      SET NUMBER OF CP <value>

    where <value> identifies the number of candidate models.

    Note that increasing <value> increases the time required to generate the best candidate models. In most cases, the default value of 10 is adequate.

    It can be helpful to plot the results of the BEST CP command.

    Dataplot writes the results of the Cp analysis to files. The example program below shows how to generate a Cp plot using these files. Specifically,

    1. The Cp statistic and the corresponding model are written to "dpst1f.dat". The following is a sample dpst1f.dat file:
        1 138.731 : X4
        1 142.487 : X2
        1 202.549 : X1
        1 315.155 : X3
        2   2.678 : X1       X2
        2   5.496 : X1       X4
        2  22.373 : X3       X4
        2  62.438 : X2       X3
        2 138.226 : X2       X4
        3   3.018 : X1       X2       X4
        3   3.041 : X1       X2       X3
        3   3.497 : X1       X3       X4
        3   7.338 : X2       X3       X4
        4   5.000 : X1       X2       X3       X4
    2. A coded form of the model is written to the file "dpst2f.dat". This coded form is useful as an identifying plot character in a Cp plot. The rows of dpst1f.dat correspond to the rows of dpst2f.dat.
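For post-processing outside Dataplot, lines in the dpst1f.dat format shown above ("<p> <Cp> : <variable names>") can be parsed with a small helper like the following (a hypothetical sketch, not part of Dataplot):

```python
# Parse one line of the dpst1f.dat format shown above:
#   "<number of variables> <Cp statistic> : <variable names>"
# (hypothetical helper for post-processing, not part of Dataplot)
def parse_cp_line(line):
    left, _, right = line.partition(":")
    fields = left.split()
    p, cp = int(fields[0]), float(fields[1])
    return p, cp, right.split()
```

For example, parse_cp_line("2   2.678 : X1       X2") returns (2, 2.678, ["X1", "X2"]).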
    Dataplot uses code from OMNITAB to implement the BEST CP command. The OMNITAB algorithm is based on the Furnival and Wilson leaps and bounds algorithm.
    The Mallows CP statistic can be affected by outliers. This is discussed by Ryan (see References below). Currently, Dataplot makes no provisions for outliers in the BEST CP command.
Default:
    10 candidate models are extracted
References:
    Furnival and Wilson (1974), "Regression by Leaps and Bounds," Technometrics, Vol. 16, No. 4.

    C. L. Mallows (1966), "Choosing a Subset Regression," Joint Statistical Meetings, Los Angeles, CA.

    Sally Peavy, Shirley Bremer, Ruth Varner, and David Hogben (1986), "OMNITAB 80: An Interpretive System for Statistical and Numerical Data Analysis," NIST Special Publication 701.

    Thomas Ryan (1997), "Modern Regression Methods," John Wiley, pp. 223-228.

Applications:
    Multilinear Fitting
Implementation Date:
Program:
    skip 25
    read hald647.dat y x1 x2 x3 x4
    echo on
    capture junk.dat
    best cp y x1 x2 x3 x4
    end of capture
    skip 0
    read dpst1f.dat p cp
    read row labels dpst2f.dat
    title case asis
    label case asis
    character rowlabels
    line blank
    tic offset units data
    xtic offset 0.3 0.3
    ytic offset 10 0
    let maxp = maximum p
    major xtic mark number maxp
    xlimits 1 maxp
    title Best CP Plot (HALD647.DAT Example)
    x1label P
    y1label C(p)
    plot cp p
    line solid
    draw data 1 1 maxp maxp
    The following output is generated for the BEST CP command.
                Regression with One Variable
     C(p) Statistic      Variables
       138.7308                  4
       142.4864                  2
       202.5488                  1
       315.1543                  3
                Regressions with   2 Variables
    C(p) =        2.678
           Variable    Coefficient        F Ratio
           X1         1.468306            146.522
           X2        0.6622505            208.581
    C(p) =        5.496
           Variable    Coefficient        F Ratio
           X1         1.439958            108.223
           X4       -0.6139536            159.295
    C(p) =       22.373
           Variable    Coefficient        F Ratio
           X3        -1.199851             40.294
           X4       -0.7246001            100.357
    C(p) =       62.438
           Variable    Coefficient        F Ratio
           X2        0.7313296             36.682
           X3        -1.008386             11.816
    C(p) =      138.226
           Variable    Coefficient        F Ratio
           X2        0.3109047              0.172
           X4       -0.4569419              0.431
    C(p) =      138.226
     C(p) Statistic      Variables
       198.0947               1  3
                Regressions with   3 Variables
    C(p) =        3.018
           Variable    Coefficient        F Ratio
           X1         1.451938            154.007
           X2        0.4161098              5.025
           X4       -0.2365402              1.863
    C(p) =        3.041
           Variable    Coefficient        F Ratio
           X1         1.695890             68.716
           X2        0.6569149            220.547
           X3        0.2500176              1.832
    C(p) =        3.497
           Variable    Coefficient        F Ratio
           X1         1.051854             22.112
           X3       -0.4100433              4.235
           X4       -0.6427961            208.240
    C(p) =        7.337
           Variable    Coefficient        F Ratio
           X2       -0.9234160             12.427
           X3        -1.447971             96.940
           X4        -1.557045             41.653
                Regressions with   4 Variables
    C(p) =        5.000
           Variable    Coefficient        F Ratio
           X1         1.551103              4.337
           X2        0.5101676              0.496
           X3        0.1019094              0.018
           X4       -0.1440610              0.041
    The output can also be displayed in graphical form, as in the Cp plot generated by the sample program above.

Date created: 08/12/2002
Last updated: 12/15/2013
