Dataplot

BEST CP

Name:

BEST CP

Type:

Analysis Command

Purpose:

Performs a best CP analysis.

Description:

Perform all regressions and pick the best candidate models from this list. This is called "all subsets" regression.
Start with the independent variable that provides the best fit for models containing only one independent variable. Then at each step, add one more independent variable from the remaining variables that provides the most improvement in the fit. Continue until adding an additional variable results in no significant improvement in the model. This is called "forward stepwise" regression.
"Backward stepwise" regression is similar to forward stepwise regression. However, instead of starting with one independent variable and adding one at each stage, the initial model contains all the independent variables. Then one variable is deleted at each stage until the best model is reached.

There can be some variations in the above approaches. For example,

Different criteria have been proposed for deciding which variable to add/delete at a given step in forward/backward stepwise regression.
Different criteria have been proposed for deciding when to quit adding/deleting variables in forward/backward stepwise regression.
Some algorithms allow an entered variable to be removed at a later step.

The choice among these criteria is complicated by the fact that adding variables never decreases the R² of the fit. However, including too many variables increases multicollinearity, which results in numerically unstable models (i.e., you are essentially fitting noise). In addition, the model becomes more complex than it needs to be. A number of criteria have been proposed that attempt to balance maximizing the fit against protecting against overfitting.
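The forward stepwise procedure described above can be sketched in Python as follows. This is only an illustration, not Dataplot code: the function name forward_stepwise, the rss_of callback, the min_gain stopping rule (a minimum fractional drop in the residual sum of squares), and the RSS table are all hypothetical choices made for this sketch.

```python
def forward_stepwise(candidates, rss_of, min_gain=0.05):
    """Greedy forward selection (illustrative sketch).

    candidates : iterable of variable names
    rss_of     : callable mapping a frozenset of names to that model's RSS
    min_gain   : hypothetical stopping rule, the minimum fractional drop
                 in RSS required to accept another variable
    """
    selected = []
    remaining = list(candidates)
    current = None  # RSS of the current model; None before the first step
    while remaining:
        # Try each remaining variable and keep the one giving the lowest RSS.
        best = min(remaining, key=lambda v: rss_of(frozenset(selected + [v])))
        best_rss = rss_of(frozenset(selected + [best]))
        # Stop when the improvement is too small to justify another variable.
        if current is not None and (current - best_rss) / current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = best_rss
    return selected


# Made-up RSS values for three candidate variables A, B, and C.
rss_table = {
    frozenset({"A"}): 100.0,
    frozenset({"B"}): 80.0,
    frozenset({"C"}): 120.0,
    frozenset({"A", "B"}): 30.0,
    frozenset({"A", "C"}): 20.0,
    frozenset({"B", "C"}): 75.0,
    frozenset({"A", "B", "C"}): 19.0,
}

chosen = forward_stepwise(["A", "B", "C"], rss_table.__getitem__, min_gain=0.5)
```

With these made-up numbers the greedy search enters B first and then A, so it never evaluates the pair A and C, even though that pair has the smaller residual sum of squares.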

All subsets regression is the preferred algorithm in that it examines all models. However, it can be computationally impractical to perform all subsets regression when the number of independent variables becomes large. The primary disadvantage of forward/backward stepwise regression is that it may miss good candidate models. Also, these methods pick a single model rather than a list of good candidate models that can be examined more closely.
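To give a sense of why all subsets regression becomes impractical, the following Python sketch counts and enumerates the candidate models. It assumes every non-empty subset of the independent variables is a candidate model; the function names are hypothetical and are not part of Dataplot.

```python
from itertools import combinations


def count_subset_models(k):
    """Number of non-empty subsets of k candidate variables."""
    return 2 ** k - 1


def all_subsets(names):
    """Yield every non-empty subset of names, smallest models first."""
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            yield combo


# Four candidate variables give a short list that can be fit exhaustively.
models = list(all_subsets(["X1", "X2", "X3", "X4"]))

# The count doubles with each added variable.
for k in (4, 10, 20, 38):
    print(k, count_subset_models(k))
```

Because the count roughly doubles with each additional variable, an exhaustive search quickly stops being feasible.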

Dataplot addresses this issue with the BEST CP command. This is based on the following:

It implements the "leaps and bounds" algorithm of Furnival and Wilson (see reference below). This algorithm is a compromise between all subsets regression and forward/backward stepwise regression. It provides an efficient algorithm for identifying the best candidate models without actually computing all possible models. It thus provides the advantage of all subsets regression (i.e., all potential models are considered) while remaining computationally practical for a large number of independent variables.

Dataplot uses Mallows' C_p criterion (suggested by Mallows, see References below). The C_p statistic is defined as follows:

\( C_{p} = \frac{RSS_{p}}{\hat{\sigma}^2} + 2p - n \)

where

n	= the number of observations
p	= the number of coefficients in the regression
RSS_p	= the residual sum of squares for the reduced model
\( \hat{\sigma}^2 \)	= an independent estimate of the error variance (this value is typically unknown, so it is estimated using the residual variance from the full model)

If the model is satisfactory, C_p will be approximately equal to p.
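The C_p formula can be checked with a short Python sketch. It assumes that the residual variance of the full model is computed as RSS_full / (n - p_full); the function name mallows_cp and all numeric values are made up for illustration and do not come from Dataplot.

```python
def mallows_cp(rss_p, sigma2, p, n):
    """Mallows' C_p = RSS_p / sigma^2 + 2p - n."""
    return rss_p / sigma2 + 2 * p - n


# Made-up values: 13 observations, full model with 5 coefficients.
n = 13
p_full = 5
rss_full = 48.0

# Assumption: error variance estimated by the full model's residual variance.
sigma2 = rss_full / (n - p_full)

# For the full model, RSS / sigma^2 = n - p, so C_p reduces to p exactly.
cp_full = mallows_cp(rss_full, sigma2, p_full, n)

# A hypothetical reduced model with 3 coefficients and a larger RSS.
cp_reduced = mallows_cp(60.0, sigma2, 3, n)
```

Under this estimate of the error variance, the full model always has C_p equal to its number of coefficients. The sample output further below reports C(p) = 5.000 for the model with four variables, which is consistent with the constant term being counted among the p coefficients.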

It should be emphasized that the BEST CP command is intended simply to identify good candidate models. Also, the BEST CP command uses a computationally fast algorithm that is not as accurate as the algorithm used by the FIT command. The FIT command should be applied to the identified models that are of interest, and standard regression diagnostics should be examined for those candidate models.

Syntax:

Examples:

Note:

The BEST CP command requires at least three independent variables and at most 38 independent variables.

Note:

To change the number of candidate models chosen, enter the command

SET NUMBER OF CP <value>

where <value> identifies the number of candidate models.

Note that increasing <value> will result in greater time to generate the best candidate models. In most cases, the default value of 10 is adequate.

Note:

Dataplot writes the results of the CP analysis to files. The example program below shows how to generate a CP plot using these files. Specifically,

The C_p statistic and the corresponding model are written to "dpst1f.dat". The following is a sample dpst1f.dat file:

1 138.731 : X4
1 142.487 : X2
1 202.549 : X1
1 315.155 : X3
2   2.678 : X1       X2
2   5.496 : X1       X4
2  22.373 : X3       X4
2  62.438 : X2       X3
2 138.226 : X2       X4
3   3.018 : X1       X2       X4
3   3.041 : X1       X2       X3
3   3.497 : X1       X3       X4
3   7.338 : X2       X3       X4
4   5.000 : X1       X2       X3       X4
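If the dpst1f.dat file needs to be processed outside Dataplot, a reader along the following lines may be enough. This Python sketch infers the layout purely from the sample above (number of variables, the C_p value, a colon, then the variable names); the function name parse_cp_line is hypothetical and the actual file format may have details not shown in the sample.

```python
def parse_cp_line(line):
    """Split one dpst1f.dat-style line into (size, cp, variable names)."""
    left, _, right = line.partition(":")
    size_text, cp_text = left.split()
    return int(size_text), float(cp_text), right.split()


# A few lines copied from the sample file above.
sample = """\
1 138.731 : X4
2   2.678 : X1       X2
2   5.496 : X1       X4
3   3.018 : X1       X2       X4
4   5.000 : X1       X2       X3       X4
"""

records = [parse_cp_line(line) for line in sample.splitlines() if line.strip()]

# Keep the model with the smallest C_p for each model size.
best_by_size = {}
for size, cp, names in records:
    if size not in best_by_size or cp < best_by_size[size][0]:
        best_by_size[size] = (cp, names)
```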

A coded form of the model is written to the file "dpst2f.dat". This coded form is useful as an identifying plot character in a CP plot. The rows of dpst1f.dat correspond to the rows of dpst2f.dat. The following is a sample dpst2f.dat file:

Note:

Dataplot uses code from OMNITAB to implement the BEST CP command. The OMNITAB algorithm is based on the Furnival and Wilson leaps and bounds algorithm.

Note:

The Mallows CP statistic can be affected by outliers. This is discussed by Ryan (see References below). Currently, Dataplot makes no provisions for outliers in the BEST CP command.

Note:

Schwarz introduced an alternative information criterion called the Bayesian Information Criterion (BIC). The BIC penalizes the likelihood more than the AIC for additional parameters. For large n, the BIC can be approximated by

\( \mbox{BIC} = -2 \log(\hat{L}) + k \log(n) \)

where \(\hat{L}\) is the maximized value of the likelihood function. In the context of regression, the BIC can be computed as

\( \mbox{BIC} = n \log(\mbox{resvar}) + k \log(n) \)

where

n	=	sample size
k	=	the number of estimated coefficients
resvar	=	\( \frac{\sum_{i=1}^{n}{(Y_{i} - Y_{\mbox{pred},i})^2}} {n} \)

The 2013/10 version of Dataplot added the BIC value for the selected models to the output. Note that the models are selected on the basis of Mallows' CP, not BIC. BIC is provided as an additional comparison.
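The regression form of the BIC given above can be sketched in Python as follows. The sketch assumes that log denotes the natural logarithm; the function name regression_bic and the data values are made up for illustration and this is not Dataplot's own code.

```python
import math


def regression_bic(y, y_pred, k):
    """BIC = n * log(resvar) + k * log(n), with resvar = sum of squared residuals / n."""
    n = len(y)
    resvar = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / n
    return n * math.log(resvar) + k * math.log(n)


# Made-up observed and predicted values for a model with k = 2 coefficients.
y = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

bic_small = regression_bic(y, y_pred, k=2)

# Same fit, but charged for one more coefficient: the penalty grows by log(n).
bic_large = regression_bic(y, y_pred, k=3)
```

For the same residuals, each additional coefficient raises the BIC by log(n), so a larger model is preferred only if it reduces the residual variance enough to offset that penalty.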

Default:

10 candidate models are extracted.

Synonyms:

None

Related Commands:

FIT	= Perform linear/nonlinear fitting.
PARTIAL RESIDUAL PLOT	= Generate a partial residual plot.
PARTIAL REGRESSION PLOT	= Generate a partial regression plot.
PARTIAL LEVERAGE PLOT	= Generate a partial leverage plot.
PARTIAL CCPR PLOT	= Generate a CCPR plot.
PLOT	= Generates a data/function plot.

Reference:

Furnival and Wilson (1974), "Regressions by Leaps and Bounds," Technometrics, Vol. 16, No. 4, pp. 499-511.

C. L. Mallows (1966), "Choosing a Subset Regression," Joint Statistical Meetings, Los Angeles, CA.

Sally Peavy, Shirley Bremer, Ruth Varner, and David Hogben (1986), "OMNITAB 80: An Interpretive System for Statistical and Numerical Data Analysis," NIST Special Publication 701.

Thomas Ryan (1997), "Modern Regression Methods," John Wiley, pp. 223-228.

Schwarz (1978), "Estimating the dimension of a model," Annals of Statistics, Vol. 6, No. 2, pp. 461–464.

Boisbunon, Canu, Fourdrinier, Strawderman, and Wells (2013), "AIC and Cp as estimators of loss for spherically symmetric distributions," arXiv:1308.2766.

Applications:

Multilinear Fitting

Implementation Date:

Program:

 
skip 25
read hald647.dat y x1 x2 x3 x4
.
echo on
capture junk.dat
best cp y x1 x2 x3 x4
end of capture
.
skip 0
read dpst1f.dat p cp
read row labels dpst2f.dat
title case asis
label case asis
character rowlabels
line blank
tic offset units data
xtic offset 0.3 0.3
ytic offset 10 0
let maxp = maximum p
major xtic mark number maxp
xlimits 1 maxp
title Best CP Plot (HALD647.DAT Example)
x1label P
y1label C(p)
plot cp p
line solid
draw data 1 1 maxp maxp

            Regression with One Variable
 
---------------------------------------------
 C(p) Statistic            BIC      Variables
---------------------------------------------
      138.73082       59.98154              4
      142.48641       60.30789              2
      202.54876       64.64937              1
      315.15428       70.19729              3
 
 
            Regressions with   2 Variables
 
C(p) =        2.678, BIC =       27.115
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X1              1.46831        146.522
       X2              0.66225        208.581
 
 
C(p) =        5.496, BIC =       30.437
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X1              1.43995        108.224
       X4             -0.61395        159.294
 
 
C(p) =       22.373, BIC =       41.547
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X3             -1.19985         40.295
       X4             -0.72460        100.356
 
 
C(p) =       62.438, BIC =       52.732
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X2              0.73133         36.682
       X3             -1.00838         11.816
 
 
C(p) =      138.226, BIC =       62.324
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X2              0.31090          0.172
       X4             -0.45694          0.431
 
 
---------------------------------------------
 C(p) Statistic            BIC      Variables
---------------------------------------------
      198.09465       66.81153           1  3
 
 
            Regressions with   3 Variables
 
C(p) =        3.018, BIC =       27.234
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X1              1.45194        154.008
       X2              0.41611          5.025
       X4             -0.23654          1.863
 
 
C(p) =        3.041, BIC =       27.271
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X1              1.69588         68.715
       X2              0.65691        220.546
       X3              0.25002          1.832
 
 
C(p) =        3.497, BIC =       27.987
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X1              1.05184         22.112
       X3             -0.41004          4.235
       X4             -0.64280        208.240
 
 
C(p) =        7.337, BIC =       32.836
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X2             -0.92342         12.426
       X3             -1.44797         96.939
       X4             -1.55704         41.654
 
 
            Regressions with   4 Variables
 
C(p) =        5.000, BIC =       29.769
---------------------------------------------
       Variable    Coefficient        F Ratio
---------------------------------------------
       X1              1.55109          4.336
       X2              0.51017          0.497
       X3              0.10191          0.018
       X4             -0.14406          0.041
 
         14 REGRESSIONS          56 OPERATIONS