
BEST CP
There can be some variations in the above approaches. For example,
The choice of a criterion is complicated by the fact that adding variables will always increase the R^{2} of the fit (or at least not decrease it). However, including too many variables increases multicollinearity, which results in numerically unstable models (i.e., you are essentially fitting noise). In addition, the model becomes more complex than it needs to be. A number of criteria have been proposed that attempt to balance maximizing the fit against protecting against overfitting.

All subsets regression is the preferred algorithm in that it examines all models. However, it can be computationally impractical to perform all subsets regression when the number of independent variables becomes large. The primary disadvantage of forward/backward stepwise regression is that it may miss good candidate models. Also, these methods pick a single model rather than a list of good candidate models that can be examined more closely. Dataplot addresses this issue with the BEST CP command. This is based on the following:
It should be emphasized that the BEST CP command is intended simply to identify good candidate models. The BEST CP command uses a computationally fast algorithm that is not as accurate as the algorithm used by the FIT command, so the FIT command should be applied to the identified models that are of interest. Standard regression diagnostics should also be examined for the candidate models of interest.
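To make the idea of scoring every subset concrete, here is a minimal Python sketch. It is not Dataplot's algorithm (the text above notes that BEST CP uses a faster, less accurate method); it is a brute-force illustration that assumes the usual definition of Mallows' statistic, C(p) = SSE_p / s^2 - n + 2p, where SSE_p is the residual sum of squares of a model with p coefficients (intercept included) and s^2 is the residual mean square of the full model. The function names and the small data set are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def sse(y, columns):
    """Residual sum of squares for y fitted on an intercept plus columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations; fine for a tiny illustration
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def all_subsets_cp(y, xs):
    """Return (C(p), p, variable numbers) for every non-empty subset of xs."""
    n, k = len(y), len(xs)
    s2 = sse(y, xs) / (n - k - 1)  # residual mean square of the full model
    results = []
    for size in range(1, k + 1):
        for idx in combinations(range(k), size):
            p = size + 1  # number of coefficients, counting the intercept
            cp = sse(y, [xs[i] for i in idx]) / s2 - n + 2 * p
            results.append((cp, p, tuple(i + 1 for i in idx)))
    return sorted(results, key=lambda t: (t[1], t[0]))


# Hypothetical toy data (not from any Dataplot file).
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.7, 15.3, 16.9]

res = all_subsets_cp(y, [x1, x2, x3])
for cp, p, variables in res:
    print(f"p={p}  C(p)={cp:10.3f}  variables={variables}")
```

With this definition the full model always has C(p) equal to p, which is a useful sanity check. The number of subsets doubles with each added variable, which is why exhaustive enumeration becomes impractical for large numbers of independent variables.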
where <y> is the response (dependent) variable; <x1> ... <xk> is a list of one or more independent variables; and where the <SUBSET/EXCEPT/FOR qualification> is optional.
BEST CP Y X1 X2 X3 X4 X5 X6 X7 SUBSET TAG > 1
To change the number of candidate models chosen, enter the command
where <value> identifies the number of candidate models. Note that increasing <value> will result in greater time to generate the best candidate models. In most cases, the default value of 10 is adequate.
Dataplot writes the results of the CP analysis to file. The example program below shows how to generate a CP plot using these files. Specifically,
4 2 1 3 12 14 34 23 24 124 123 134 234 1234
C. L. Mallows (1966), "Choosing a Subset Regression," Joint Statistical Meetings, Los Angeles, CA.

Sally Peavy, Shirley Bremer, Ruth Varner, and David Hogben (1986), "OMNITAB 80: An Interpretive System for Statistical and Numerical Data Analysis," NIST Special Publication 701.

Thomas Ryan (1997), "Modern Regression Methods," John Wiley, pp. 223-228.
    skip 25
    read hald647.dat y x1 x2 x3 x4
    .
    echo on
    capture junk.dat
    best cp y x1 x2 x3 x4
    end of capture
    .
    skip 0
    read dpst1f.dat p cp
    read row labels dpst2f.dat
    title case asis
    label case asis
    character rowlabels
    line blank
    tic offset units data
    xtic offset 0.3 0.3
    ytic offset 10 0
    let maxp = maximum p
    major xtic mark number maxp
    xlimits 1 maxp
    title Best CP Plot (HALD647.DAT Example)
    x1label P
    y1label C(p)
    plot cp p
    line solid
    draw data 1 1 maxp maxp

The following output is generated for the BEST CP command.

    Regression with One Variable

      C(p) Statistic    Variables
      138.7308          4
      142.4864          2
      202.5488          1
      315.1543          3

    Regressions with 2 Variables

    C(p) = 2.678
      Variable    Coefficient    F Ratio
      X1          1.468306       146.522
      X2          0.6622505      208.581

    C(p) = 5.496
      Variable    Coefficient    F Ratio
      X1          1.439958       108.223
      X4          0.6139536      159.295

    C(p) = 22.373
      Variable    Coefficient    F Ratio
      X3          1.199851        40.294
      X4          0.7246001      100.357

    C(p) = 62.438
      Variable    Coefficient    F Ratio
      X2          0.7313296       36.682
      X3          1.008386        11.816

    C(p) = 138.226
      Variable    Coefficient    F Ratio
      X2          0.3109047        0.172
      X4          0.4569419        0.431

    C(p) = 138.226

      C(p) Statistic    Variables
      198.0947          1 3

    Regressions with 3 Variables

    C(p) = 3.018
      Variable    Coefficient    F Ratio
      X1          1.451938       154.007
      X2          0.4161098        5.025
      X4          0.2365402        1.863

    C(p) = 3.041
      Variable    Coefficient    F Ratio
      X1          1.695890        68.716
      X2          0.6569149      220.547
      X3          0.2500176        1.832

    C(p) = 3.497
      Variable    Coefficient    F Ratio
      X1          1.051854        22.112
      X3          0.4100433        4.235
      X4          0.6427961      208.240

    C(p) = 7.337
      Variable    Coefficient    F Ratio
      X2          0.9234160       12.427
      X3          1.447971        96.940
      X4          1.557045        41.653

    Regressions with 4 Variables

    C(p) = 5.000
      Variable    Coefficient    F Ratio
      X1          1.551103         4.337
      X2          0.5101676        0.496
      X3          0.1019094        0.018
      X4          0.1440610        0.041

The output can be displayed in graphical form.
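The example program ends by drawing a line from (1, 1) to (maxp, maxp), that is, the line C(p) = p. A common way to read such a plot, stated here as an assumption since the text does not spell it out, is to favor models whose C(p) lies on or below that line, taking p as the number of variables plus one for the intercept. The Python sketch below applies that screen to the C(p) values listed in the output above. The dictionary layout, the function name, and the `tol` parameter are hypothetical conveniences, not part of Dataplot.

```python
# C(p) values copied from the BEST CP output above, keyed by the
# variable numbers in each model (e.g. "124" means X1, X2, X4).
reported = {
    "4": 138.7308, "2": 142.4864, "1": 202.5488, "3": 315.1543,
    "12": 2.678, "14": 5.496, "34": 22.373, "23": 62.438,
    "24": 138.226, "13": 198.0947,
    "124": 3.018, "123": 3.041, "134": 3.497, "234": 7.337,
    "1234": 5.000,
}


def near_reference_line(models, tol=0.0):
    """Keep models with C(p) <= p + tol, where p = number of variables + 1."""
    keep = []
    for label, cp in models.items():
        p = len(label) + 1  # one coefficient per variable plus the intercept
        if cp <= p + tol:
            keep.append((label, p, cp))
    # Order by model size, then by C(p) within each size.
    return sorted(keep, key=lambda t: (t[1], t[2]))


candidates = near_reference_line(reported)
for label, p, cp in candidates:
    print(f"variables={label:<5} p={p}  C(p)={cp:.3f}")
```

The full model always satisfies C(p) = p, so it passes the screen automatically and says little on its own; the smaller models that pass are the ones worth refitting with FIT and checking with regression diagnostics.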
 
NIST is an agency of the U.S. Commerce Department.
Date created: 08/12/2002 