Name:
BEST CP
There can be some variations in the above approaches. For example,
The choice of these criteria is complicated by the fact that adding variables will always increase the R^2 of the fit (or at least not decrease it). However, including too many variables increases multicollinearity, which results in numerically unstable models (i.e., you are essentially fitting noise). In addition, the model becomes more complex than it needs to be. A number of criteria have been proposed that attempt to balance maximizing the fit against protecting against overfitting.

All subsets regression is the preferred algorithm in that it examines all models. However, it can be computationally impractical when the number of independent variables becomes large. The primary disadvantage of forward/backward stepwise regression is that it may miss good candidate models. Also, it picks a single model rather than a list of good candidate models that can be examined more closely.

Dataplot addresses this issue with the BEST CP command. This is based on the following:
It should be emphasized that the BEST CP command is intended simply to identify good candidate models. The BEST CP command uses a computationally fast algorithm that is not as accurate as the algorithm used by the FIT command, so the FIT command should be applied to the identified models that are of interest. Standard regression diagnostics should also be examined for the candidate models of interest.
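As an illustration of the CP criterion and of all subsets regression, the following Python sketch fits every subset of a small set of predictors by ordinary least squares and computes Mallows' CP as SSE_p/s^2 - (n - 2p). Here SSE_p is the residual sum of squares of the subset model, s^2 is the residual mean square of the full model, n is the number of observations, and p is the number of fitted parameters (the variables in the subset plus the constant term). This is not the algorithm used by BEST CP. It is a brute-force sketch, and the data values and function names (solve, sse, all_subsets_cp) are made up for the example.

```python
from itertools import combinations


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares for a fit of y on a constant plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[u] * row[v] for row in design) for v in range(k)] for u in range(k)]
    xty = [sum(design[i][u] * y[i] for i in range(n)) for u in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def all_subsets_cp(xs, y):
    """Return (p, cp, label) for every non-empty subset, sorted by p and then by cp."""
    n, k = len(y), len(xs)
    s2 = sse(xs, y) / (n - k - 1)  # residual mean square of the full model
    results = []
    for size in range(1, k + 1):
        for idx in combinations(range(k), size):
            p = size + 1  # variables in the subset plus the constant term
            cp = sse([xs[j] for j in idx], y) / s2 - (n - 2 * p)
            label = "".join(str(j + 1) for j in idx)  # e.g. "13" for X1 and X3
            results.append((p, cp, label))
    return sorted(results, key=lambda r: (r[0], r[1]))


# Made-up illustrative data (not from any Dataplot file)
xs = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 4, 3, 6, 5, 8, 7],
    [5, 3, 8, 1, 9, 2, 7, 4],
]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 10.7, 15.2, 14.9]

results = all_subsets_cp(xs, y)
for p, cp, label in results:
    print(f"{label:>4}  p={p}  CP={cp:10.3f}")
```

With this definition the full model always has CP equal to p, which is why candidate models are usually judged by how close their CP is to their own p. The number of subsets doubles with each added variable, which is the computational cost noted above.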
BEST CP <y> <x1> ... <xk> <SUBSET/EXCEPT/FOR qualification>
where <y> is the response (dependent) variable; <x1> ... <xk> is a list of one or more independent variables; and where the <SUBSET/EXCEPT/FOR qualification> is optional.
BEST CP Y X1 X2 X3 X4 X5 X6 X7 SUBSET TAG > 1
To change the number of candidate models chosen, enter the command
where <value> identifies the number of candidate models. Note that increasing <value> increases the time needed to generate the best candidate models. In most cases, the default value of 10 is adequate.
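Dataplot writes the results of the CP analysis to files. The example program below shows how to generate a CP plot using these files. Specifically, the file DPST1F.DAT contains the number of variables, the CP value, and the variable list for each candidate model, and the file DPST2F.DAT contains a coded variable list (for example, 124 identifies the model containing X1, X2, and X4). For the example below, the coded variable list is: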
4
2
1
3
12
14
34
23
24
124
123
134
234
1234
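The values listed above appear to be coded variable lists in which each digit is the index of a variable included in the model, which is consistent with the order of the models in the sample output. Under that assumption, a coded label can be expanded into variable names with a short Python helper. The function name decode_label is hypothetical, and the sketch only applies when there are fewer than 10 independent variables; how larger indices are coded is not covered here.

```python
def decode_label(code):
    """Expand a coded label such as "124" into ["X1", "X2", "X4"].

    Assumes one digit per variable, so it is only valid for fewer than 10 variables.
    """
    return ["X" + digit for digit in str(code).strip()]


labels = ["4", "12", "124", "1234"]
decoded = {code: decode_label(code) for code in labels}
for code, names in decoded.items():
    print(code, "->", " ".join(names))
```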
Note:

Reference:
"Choosing a Subset Regression", C. L. Mallows, Joint Statistical Meetings, Los Angeles, CA 1966.
"OMNITAB 80: An Interpretive System for Statistical and Numerical Data Analysis", Sally Peavy, Shirley Bremer, Ruth Varner, and David Hogben, NIST Special Publication 701.
"Modern Regression Methods", Thomas Ryan, John Wiley, 1997.
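. Step 1: Read the data, skipping the 25 header lines of the file
skip 25
read hald647.dat y x1 x2 x3 x4
.
. Step 2: Run BEST CP, capturing the printed output in junk.dat
echo on
capture junk.dat
best cp y x1 x2 x3 x4
end of capture
.
. Step 3: Read the CP results and the coded variable lists
skip 0
read dpst1f.dat p cp
read row labels dpst2f.dat
.
. Step 4: Set the plot control options
title case asis
label case asis
character rowlabels
line blank
tic offset units data
xtic offset 0.3 0.3
ytic offset 10 0
let maxp = maximum p
major xtic mark number maxp
xlimits 1 maxp
title Best CP Plot (HALD647.DAT Example)
x1label P
y1label C(p)
.
. Step 5: Plot CP against P, using the coded variable lists as plot characters
plot cp p
.
. Step 6: Add the reference line from (1,1) to (maxp,maxp)
line solid
draw data 1 1 maxp maxp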
The following output is generated for the BEST CP command.
REGRESSION WITH 1 VARIABLE
C(P) STATISTIC VARIABLES
138.731 4
142.486 2
202.549 1
315.154 3
REGRESSIONS WITH 2 VARIABLES
C(P) = 2.678
VARIABLE COEFFICIENT F RATIO
X1 0.1468306E+01 146.523
X2 0.6622506E+00 208.582
C(P) = 5.496
VARIABLE COEFFICIENT F RATIO
X1 0.1439958E+01 108.224
X4 -0.6139536E+00 159.295
C(P) = 22.373
VARIABLE COEFFICIENT F RATIO
X3 -0.1199851E+01 40.294
X4 -0.7246001E+00 100.357
C(P) = 62.438
VARIABLE COEFFICIENT F RATIO
X2 0.7313297E+00 36.683
X3 -0.1008386E+01 11.816
C(P) = 138.226
VARIABLE COEFFICIENT F RATIO
X2 0.3109062E+00 0.172
X4 -0.4569406E+00 0.431
C(P) STATISTIC VARIABLES
198.094 1 3
REGRESSIONS WITH 3 VARIABLES
C(P) = 3.018
VARIABLE COEFFICIENT F RATIO
X1 0.1451938E+01 154.008
X2 0.4161106E+00 5.026
X4 -0.2365394E+00 1.863
C(P) = 3.041
VARIABLE COEFFICIENT F RATIO
X1 0.1695890E+01 68.717
X2 0.6569149E+00 220.547
X3 0.2500176E+00 1.832
C(P) = 3.497
VARIABLE COEFFICIENT F RATIO
X1 0.1051854E+01 22.113
X3 -0.4100432E+00 4.236
X4 -0.6427962E+00 208.240
C(P) = 7.338
VARIABLE COEFFICIENT F RATIO
X2 -0.9234153E+00 12.427
X3 -0.1447971E+01 96.939
X4 -0.1557044E+01 41.653
REGRESSIONS WITH 4 VARIABLES
C(P) = 5.000
VARIABLE COEFFICIENT F RATIO
X1 0.1551119E+01 4.337
X2 0.5101846E+00 0.497
X3 0.1019266E+00 0.018
X4 -0.1440443E+00 0.041
14 REGRESSIONS 56 OPERATIONS
NUMBER OF VARIABLES, CP VALUE, VARIABLE LIST WRITTEN
TO FILE DPST1F.DAT
CODED VARIABLE LIST WRITTEN TO TO FILE DPST2F.DAT
The output can be displayed in graphical form.
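A common rule of thumb when reading a CP plot is that good candidate models have a CP value at or below p, or close to it. As a rough sketch of that screening step, the Python code below applies the rule to the CP values transcribed from the sample output above. It assumes that p counts the variables in the model plus the constant term, which is consistent with C(P) = 5.000 for the full four-variable model. The function name screen and its tolerance parameter are hypothetical.

```python
# (coded variable list, CP) pairs transcribed from the BEST CP output above
models = [
    ("4", 138.731), ("2", 142.486), ("1", 202.549), ("3", 315.154),
    ("12", 2.678), ("14", 5.496), ("34", 22.373), ("23", 62.438),
    ("24", 138.226), ("13", 198.094),
    ("124", 3.018), ("123", 3.041), ("134", 3.497), ("234", 7.338),
    ("1234", 5.000),
]


def screen(models, tolerance=0.0):
    """Return (label, p, cp) for models whose CP is at most p + tolerance."""
    kept = []
    for label, cp in models:
        p = len(label) + 1  # assumption: one digit per variable, plus the constant
        if cp <= p + tolerance:
            kept.append((label, p, cp))
    return kept


candidates = screen(models)
for label, p, cp in candidates:
    print(f"{label:>4}  p={p}  CP={cp:.3f}")
```

The models that pass the screen are only candidates. As noted earlier, each one should be refit with the FIT command and checked with standard regression diagnostics.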
Date created: 8/12/2002