
# REGRESSION DIAGNOSTICS

Name:
REGRESSION DIAGNOSTICS
Description:
The help for FIT and LINEAR FIT describes the standard output and intervals that are generated by the fits and the standard residual analysis that is typically performed after a fit to validate the model. This help section describes the following.

1. A number of plots that can be generated to help determine the most appropriate model.

2. A number of additional intervals and tests, not included in the standard Dataplot output, that can be generated from information provided by the fit.

3. A number of regression diagnostics that go beyond standard residual analysis.

4. When multiple independent variables are being considered, an assessment of multi-collinearity.

The above are demonstrated in the Program example below. Note that this program example is meant to show the mechanics of the various plots and commands and is not intended to be treated as a case study.

One purpose of many regression diagnostics is to identify observations with either high leverage or high influence. High leverage points are those that are outliers with respect to the independent variables. Influential points are those that cause large changes in the parameter estimates when they are deleted. Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point. The leverage is typically defined as the diagonal of the hat matrix (hat matrix = H = X(X'X)**(-1)X'). Dataplot currently writes a number of measures of influence and leverage to the file dpst3f.dat (e.g., the diagonal of the hat matrix, Cook's distance, DFFITS).
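The leverage calculation can be sketched directly from this definition. The following is a minimal numpy example with a hypothetical design matrix (the values are illustrative, not from the program example below):

```python
import numpy as np

# Hypothetical design matrix with a constant column; the values are
# illustrative only (the last row is an outlier in x).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 10.0]])
n, p = X.shape

# Hat matrix H = X (X'X)^(-1) X'; its diagonal gives the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)
# leverage -> [0.43, 0.33, 0.27, 0.97]; the leverages lie in [1/n, 1],
# sum to trace(H) = p, and the outlying row has by far the largest value.
```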

Pre-Fit Plots
When there are multiple independent variables, it is common to plot the dependent variable against each of the independent variables. However, these basic plots do not take the effects of the other independent variables in the model into account. Several plots have been proposed to address this. Specifically,

1. Partial Residual Plot

Partial residual plots are formed as:

$$\mbox{Res} + \hat{\beta}_{i} X_{i}$$ versus $$X_{i}$$

where

Res = residuals from the full model
$$\hat{\beta}_{i}$$ = regression coefficient for the i-th independent variable in the full model
Xi = the i-th independent variable

Partial residual plots are widely discussed in the regression diagnostics literature (e.g., see the References section below). Although they can often be useful, be aware that they can also fail to indicate the proper relationship. In particular, if Xi is highly correlated with any of the other independent variables, the variance indicated by the partial residual plot can be much less than the actual variance.
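The partial residual construction can be sketched in numpy with hypothetical simulated data (the variable names and coefficients below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Hypothetical simulated data; the coefficients 2.0 and -1.5 are illustrative.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.3, size=n)

# Full-model least squares fit with a constant term.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res = y - X @ beta

# Partial residual for x1: Res + beta_1 * x1, plotted against x1.
partial_res_x1 = res + beta[1] * x1

# Because the full-model residuals are orthogonal to x1, the least squares
# slope of the partial residual plot recovers beta_1 exactly.
slope = np.polyfit(x1, partial_res_x1, 1)[0]
```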

2. Component and Component-Plus-Residual (CCPR) Plot

The CCPR plot is a refinement of the partial residual plot. It generates a partial residual plot but also adds

$$\hat{\beta}_{i} X_{i}$$ versus $$X_{i}$$

This is the "component" part of the plot and is intended to show where the "fitted line" would lie.

3. Partial Regression Plot

Partial regression plots are also referred to as added variable plots, adjusted variable plots and individual coefficient plots.

Partial regression plots attempt to show the effect of adding an additional variable to the model (given that one or more independent variables are already in the model). Partial regression plots are formed by:

1. Compute the residuals of regressing the response variable against the independent variables but omitting Xi

2. Compute the residuals from regressing Xi against the remaining independent variables.

3. Plot the residuals from (1) against the residuals from (2).

Velleman and Welsch (see References below) express this mathematically as:

Y.[i] versus Xi.[i]

where

Y.[i] = residuals from regressing Y (the response variable) against all the independent variables except Xi
Xi.[i] = residuals from regressing Xi against the remaining independent variables.

Velleman and Welsch list the following useful properties for this plot:

1. The least squares linear fit to this plot has slope $$\hat{\beta}_{i}$$ and intercept zero.

2. The residuals from the least squares linear fit to this plot are identical to the residuals from the least squares fit of the original model (Y against all the independent variables including Xi).

3. The influences of individual data values on the estimation of a coefficient are easy to see in this plot.

4. It is easy to see many kinds of failures of the model or violations of the underlying assumptions (nonlinearity, heteroscedasticity, unusual patterns).
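Properties (1) and (2) can be checked numerically. The following numpy sketch builds the two sets of residuals for hypothetical simulated data and verifies that the slope of the partial regression plot equals the full-model coefficient, and that the plot's residuals match the full-model residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
# Hypothetical simulated data with correlated regressors (illustrative only).
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 3.0 + 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

def ols_residuals(X, y):
    """Residuals from a least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

ones = np.ones(n)
# Step 1: residuals from regressing y on everything except x2.
e_y = ols_residuals(np.column_stack([ones, x1]), y)
# Step 2: residuals from regressing x2 on the remaining regressors.
e_x2 = ols_residuals(np.column_stack([ones, x1]), x2)
# Step 3: the partial regression plot is e_y versus e_x2.  Its least
# squares slope (through the origin) equals the full-model coefficient of x2.
slope = (e_x2 @ e_y) / (e_x2 @ e_x2)

beta_full, *_ = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y, rcond=None)
res_full = y - np.column_stack([ones, x1, x2]) @ beta_full
```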

Partial regression plots are widely discussed in the regression diagnostics literature (e.g., see the References section below). Since the strengths and weaknesses of partial regression plots are widely discussed in the literature, we will not discuss that in any detail here.

Partial regression plots are related to, but distinct from, partial residual plots. Partial regression plots are most commonly used to identify leverage points and influential data points that might not be leverage points. Partial residual plots are most commonly used to identify the nature of the relationship between Y and Xi (given the effect of the other independent variables in the model). Note that since the simple correlation between the two sets of residuals plotted is equal to the partial correlation between the response variable and Xi, partial regression plots will show the correct strength of the linear relationship between the response variable and Xi. This is not true for partial residual plots. On the other hand, for the partial regression plot, the x axis is not Xi. This limits its usefulness in determining the need for a transformation (which is the primary purpose of the partial residual plot).

4. Partial Leverage Plot

Partial leverage is used to measure the contribution of the individual independent variables to the leverage of each observation. That is, if $$H_{ii}$$ is the i-th diagonal element of the hat matrix, the partial leverage measures how $$H_{ii}$$ changes as a variable is added to the regression model.

The partial leverage is computed as:

$$(PL_{j})_{i} = \frac{(X_{j.\left[ j\right] })_{i}^{2}} {\sum_{k=1}^{n}{(X_{j.\left[ j\right] })_{k}^{2}}}$$

where

j = jth independent variable
i = the ith observation
Xj.[j] = residuals from regressing Xj against the remaining independent variables

Note that the partial leverage is the leverage of the i-th point in the partial regression plot for the jth variable.

The interpretation of the partial leverage plot is that data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model building procedures (e.g., the BEST CP command in Dataplot).
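The partial leverage formula can be sketched as follows, with hypothetical simulated data; by the note above, the values are exactly the leverages of the through-the-origin fit in the corresponding partial regression plot:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
# Hypothetical regressors (illustrative only).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# X_{2.[2]}: residuals from regressing x2 on the remaining regressors
# (here a constant and x1).
X_other = np.column_stack([np.ones(n), x1])
beta, *_ = np.linalg.lstsq(X_other, x2, rcond=None)
e = x2 - X_other @ beta

# Partial leverage of each observation for x2: the squared residual
# divided by the sum of squared residuals.
partial_leverage = e**2 / np.sum(e**2)
```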

Additional Tests and Intervals:
The standard Dataplot output (including what is written to dpst1f.dat, ... , dpst5f.dat) for linear fits includes parameter estimates and associated confidence intervals, prediction limits and associated confidence intervals (including Bonferroni and Hotelling joint confidence intervals), and the parameter covariance and correlation matrices. Some additional tests and intervals can also be generated from information provided by the fit.

1. The Program example demonstrates how to obtain point estimates and confidence intervals (including joint confidence intervals) for one or more new points.

2. The Program example demonstrates how to compute confidence intervals for the prediction limits when a different significance level is desired.
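As a sketch of the underlying calculation (not Dataplot's internal code): the variance of the fitted value at a new point $$x_0$$ is $$\mbox{MSE} \cdot x_0'(X'X)^{-1}x_0$$, and the prediction variance adds MSE to this. A minimal numpy example with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 3
# Hypothetical data with two regressors plus a constant (illustrative only).
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res = y - X @ beta
mse = res @ res / (n - p)
xtx_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 0.5, -0.2])       # hypothetical new point (with constant)
yhat = x0 @ beta                       # point estimate for the new point
s2_yhat = mse * x0 @ xtx_inv @ x0      # variance of the fitted value
s2_ypred = mse + s2_yhat               # variance for a prediction interval
# A 95% confidence interval for the new point is
#     yhat +/- t(0.975, n - p) * sqrt(s2_yhat)
# and the corresponding prediction interval uses sqrt(s2_ypred) instead.
```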

Regression Diagnostics
After a fit, it is recommended that various diagnostics be generated to assess the adequacy of the model. The typical assumptions for a good model are that the errors from the model are independent and identically distributed (typically a normal distribution).

At a minimum, the following diagnostics should be generated

1. a 4-plot of the residuals. The 4-plot generates a run sequence plot (to assess common location and scale for the residuals), a lag plot (to check for first order autocorrelation of the residuals), a histogram, and a normal probability plot.

This plot provides a useful check on the basic assumptions for the error term. If the assumptions are violated, this is an indication that there is structure in the data that is not accounted for.

2. If there is a single independent variable, then it is useful to plot the predicted values and the dependent variable against the independent variable.

The term "regression diagnostics" is typically used to denote diagnostic statistics and plots that go beyond the standard residual analysis. This is a brief discussion of "regression diagnostics" with respect to linear fits performed with a non-iterative algorithm.

Regression diagnostics are used to identify outliers in the dependent variable, to identify influential points, to identify high leverage points, or, in general, to uncover features of the data that could cause difficulties for the fitted regression model.

The books by Belsley, Kuh, and Welsch and by Cook and Weisberg listed in the References section discuss regression diagnostics in detail. The mathematical derivations for the diagnostics covered here can be found in these books. Chapter 11 of the Neter, Wasserman, and Kutner book listed in the References section discusses regression diagnostics in a less theoretical way. We will only give a brief overview of the various regression diagnostics. For more thorough guidance on when these various diagnostics are appropriate and how to interpret them, see one of these texts (or some other text on regression that covers these topics).

At a minimum, diagnostic analysis of a linear fit includes the various plots of the residuals and predicted values described above. For more complete diagnostics, the variables written to the file dpst3f.dat can be analyzed. Specifically, this file contains

Column 1: the diagonals of the hat matrix (the hat matrix is $$X(X'X)^{-1}X'$$ where $$X'$$ is the transpose of the $$X$$ matrix). In themselves, the diagonal elements are measures of the leverage of a given point. The minimum leverage is $$\frac{1} {n}$$, the maximum leverage is 1.0, and the average leverage is $$\frac{p} {n}$$ where $$p$$ is the number of variables in the fit. These elements are also used to calculate many other diagnostic statistics. Note that

$$H_{ii} = \frac{\mbox{VAR(Predicted Value)}} {\mbox{Residual Variance}}$$

Column 2: the variance of the residuals:

$$\mbox{VAR(res)} = \mbox{MSE} (1 - H_{ii})$$

Column 3: the standardized residuals. These are the residuals divided by the square root of the mean square error:

$$\mbox{STRES} = \frac{\mbox{residual}} {\sqrt{\mbox{MSE}}}$$

Column 4: the internally studentized residuals. These are the residuals divided by their standard deviations.

Column 5: the deleted residuals. These are obtained by subtracting the predicted value with the i-th case omitted from the observed value.

Column 6: the externally studentized residuals. These are the deleted residuals divided by their standard deviation.

Column 7: Cook's distance. This is a measure of the impact of the i-th case on all of the estimated regression coefficients:

$$\mbox{Cook} = \frac{\mbox{res}^2}{p \mbox{MSE}} \frac{H_{ii}} {(1 - H_{ii})^2}$$

Column 8: DFFITS:

$$\mbox{DFFITS} = \mbox{EXTSRES} \sqrt{\frac{H_{ii}} {1 - H_{ii}}}$$

Additional diagnostic statistics can be computed from these values. Several of the texts in the REFERENCE section below discuss the use and interpretation of these statistics in more detail. These variables can be read in as follows:

SKIP 1
SET READ FORMAT 8E15.7
READ DPST3F.DAT HII VARRES STDRES ISTUDRES DELRES ...
ESTUDRES COOK DFFITS
SKIP 0

Many analysts prefer to use the standardized residuals or the internally studentized residuals for the basic residual analysis. Deleted residuals and externally studentized residuals are used to identify outlying Y observations that the ordinary residuals do not identify (cases with high leverage tend to generate small residuals even if they are outliers).

Many regression diagnostics depend on the hat matrix $$X(X'X)^{-1}X'$$. Dataplot has limits on the maximum number of columns/rows for a matrix, which may prohibit the creation of the full hat matrix for some problems. Fortunately, most of the relevant statistics can be derived from the diagonal elements of this matrix (which can be read from the dpst3f.dat file). These diagonal elements are referred to as the leverages. The minimum leverage is $$(1/n)$$, the maximum leverage is 1.0, and the average leverage is $$(p/n)$$ where $$p$$ is the number of variables in the fit. As a rule of thumb, points with leverage greater than twice the average leverage can be considered high leverage. High leverage points are outliers in terms of the $$X$$ matrix and have an unduly large influence on the predicted values. High leverage points also tend to have small residuals, so they are not typically detected by standard residual analysis.

The DFFITS values are a measure of influence that observation i has on the i-th predicted value. As a rule of thumb, for small or moderate size data sets, values greater than 1 indicate an influential case. For large data sets, values greater than $$2 \sqrt{p/n}$$ indicate influential cases.

Cook's distance is a measure of the combined impact of the i-th observation on all of the regression coefficients. It is typically compared to percentiles of the F distribution with $$p$$ and $$n - p$$ degrees of freedom. Values near or above the 50th percentile imply that the observation has substantial influence on the regression parameters.

The DFFITS values and the Cook’s distance are used to identify high leverage points that are also influential (in the sense that they have a large effect on the fitted model).
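These diagnostics can all be derived from the residuals and the hat-matrix diagonal without refitting the model case by case. A minimal numpy sketch with hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 2
# Hypothetical simple linear regression data (illustrative only).
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages H_ii
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res = y - X @ beta
mse = res @ res / (n - p)

# Internally studentized residuals: each residual over its standard deviation.
stud = res / np.sqrt(mse * (1.0 - h))
# Externally studentized residuals via the closed-form leave-one-out MSE,
# so no additional regressions are needed.
mse_loo = ((n - p) * mse - res**2 / (1.0 - h)) / (n - p - 1)
ext_stud = res / np.sqrt(mse_loo * (1.0 - h))

# Cook's distance and DFFITS from the same ingredients.
cook = res**2 / (p * mse) * h / (1.0 - h)**2
dffits = ext_stud * np.sqrt(h / (1.0 - h))

# Rule-of-thumb flag for small data sets: |DFFITS| > 1.
influential = np.abs(dffits) > 1.0
```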

Once these variables have been read in, they can be printed, plotted, and used in further calculations like any other variable. This is demonstrated in the program example below. They can also be used to derive additional diagnostic statistics. For example, the program example shows how to compute the Mahalanobis distance and Cook's V statistic. The Mahalanobis distance is a measure of the distance of an observation from the "center" of the observations and is essentially an alternate measure of leverage. Cook's V statistic is the ratio of the variance of the predicted values to the variance of the residuals.

Another popular diagnostic is the DFBETA statistic. This is similar to Cook's distance in that it tries to identify points that have a large influence on the estimated regression coefficients. The distinction is that DFBETA assesses the influence on each individual parameter rather than on the parameters as a whole. The DFBETA statistics require the catcher matrix, which is described in the multi-collinearity section below, for easy computation. The usual recommendation is that absolute DFBETA values larger than 1 for small and medium size data sets, and larger than $$2/\sqrt{N}$$ for large data sets, should be considered influential.

The variables written to file dpst3f.dat are calculated without computing any additional regressions. The statistics based on a case being deleted are computed from mathematically equivalent formulas that do not require additional regressions to be performed. The Neter, Wasserman, and Kutner text gives the formulas that are actually used.

Robust regression is often recommended when there are significant outliers in the data. One common robust technique is iteratively reweighted least squares (IRLS), also called M-estimation. Note that these techniques protect against outlying Y values, but they do not protect against outliers in the X matrix. Also, they test for single outliers and are not as sensitive to a group of similar outliers. See the documentation for WEIGHTS, BIWEIGHT, and TRICUBE for more information on performing IRLS regression in Dataplot. Techniques for protecting against outliers in the X matrix use alternatives to least squares. Two such methods are least median of squares regression (also called LMS regression) and least trimmed squares regression (also called LTS regression). Dataplot does not support these techniques at this time. The documentation for the WEIGHTS command in the Support chapter discusses one approach for dealing with outliers in the X matrix in the context of IRLS.

Multi-collinearity
Multi-collinearity results when the columns of the X matrix have significant interdependence (that is, one column is close to a linear combination of some collection of other columns). Multi-collinearity typically results in numerically unstable estimates in the sense that small changes in the X matrix can result in significant changes in the estimated regression coefficients. It can also cause other significant problems. Pairwise collinearity can be detected from correlation coefficients between independent variables (or from pairwise plots). However, this does not detect higher order multi-collinearity. One measure of higher order multi-collinearity is $$R_{j}$$, the multiple correlation coefficient from regressing the j-th independent variable on the rest of the independent variables. The Variance Inflation Factor (VIF) is a scaled version of this:

$$\mbox{VIF}_{j} = \frac{1} {1 - R_{j}^2}$$

The VIF values are often given as their reciprocals (this is called the tolerance). Fortunately, these values can be computed without performing any additional regressions. The computing formulas are based on the catcher matrix, which is $$X(X'X)^{-1}$$. The equation is:

$$\mbox{VIF}_{j} = \sum_{i=1}^{n}{c_{ij}^2} \sum_{i=1}^{n}{(x_{ij} - \bar{x}_{j})^2}$$

where c is the catcher matrix.
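The VIF and tolerance definitions can be sketched by direct auxiliary regressions (numpy, with hypothetical data; the x2 column below is constructed to be strongly collinear with x1):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
# Hypothetical regressors; x2 is nearly a multiple of x1 (illustrative only).
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)
x3 = rng.normal(size=n)

def vif(xj, others):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing variable j on
    the other independent variables (constant term included)."""
    X = np.column_stack([np.ones(len(xj))] + others)
    beta, *_ = np.linalg.lstsq(X, xj, rcond=None)
    res = xj - X @ beta
    r2 = 1.0 - (res @ res) / np.sum((xj - xj.mean())**2)
    return 1.0 / (1.0 - r2)

vifs = [vif(x1, [x2, x3]), vif(x2, [x1, x3]), vif(x3, [x1, x2])]
tolerances = [1.0 / v for v in vifs]   # tolerance is the reciprocal of VIF
```

The collinear pair (x1, x2) should produce large VIF values, while the independent x3 should have a VIF near 1.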

Another measure of multi-collinearity is given by the condition indices. The condition indices are calculated as follows:

1. Scale the columns of the X matrix to have unit sums of squares.

2. Calculate the singular values of the scaled X matrix and square them.

3. The condition indices are the ratios of the largest singular value to each singular value (equivalently, the square roots of the ratios of the largest squared singular value to each squared singular value). Large condition indices indicate near linear dependencies among the columns of X.
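These steps can be sketched in numpy with a hypothetical X matrix whose last two columns are nearly dependent (the ratio convention below, largest singular value over each singular value, follows Belsley, Kuh, and Welsch):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
# Hypothetical X matrix; x2 is nearly identical to x1 (illustrative only).
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Step 1: scale each column to unit sum of squares.
Xs = X / np.sqrt(np.sum(X**2, axis=0))
# Step 2: singular values of the scaled matrix.
s = np.linalg.svd(Xs, compute_uv=False)
# Condition indices: largest singular value over each singular value.
# A common rule of thumb flags indices above about 30.
cond_indices = s.max() / s
```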

References:
Cook and Weisberg (1982), "Residuals and Influence in Regression", Chapman and Hall.

Belsley, Kuh, and Welsch (1980), "Regression Diagnostics", John Wiley.

Neter, Wasserman, and Kutner (1990), "Applied Linear Statistical Models", 3rd ed., Irwin.

Program:

. Note: this program example is meant simply to show how to create
.       the various plots, intervals and statistics.  It is not a
.       case study.
.
. ZARTHAN COMPANY EXAMPLE FROM
. "APPLIED LINEAR STATISTICAL MODELS", BY NETER, WASSERMAN, KUTNER
.
. Y = SALES
. X1 = CONSTANT TERM
. X2 = TARGET POPULATION (IN THOUSANDS)
. X3 = PER CAPITA DISCRETIONARY INCOME (DOLLARS)
.
. Step 1: Read the data
.
DIMENSION 500 COLUMNS
LET NVAR = 2
READ DISTRICT Y X2 X3
1 162 274 2450
2 120 180 3254
3 223 375 3802
4 131 205 2838
5 67 86 2347
6 169 265 3782
7 81 98 3008
8 192 330 2450
9 116 195 2137
10 55 53 2560
11 252 430 4020
12 232 372 4427
13 144 236 2660
14 103 157 2088
15 212 370 2605
END OF DATA
.
LET N = SIZE Y
LET X1 = 1 FOR I = 1 1 N
LET STRING SY  = Sales
LET STRING SX2 = Population
LET STRING SX3 = Income
LET P = NVAR + 1
.
. Step 1: Basic Preliminary Plots
.         1) Independent against dependent
.         2) INDEPENDENT AGAINST INDEPENDENT
.
TITLE OFFSET 2
TITLE AUTOMATIC
TITLE CASE ASIS
LABEL CASE ASIS
CASE ASIS
.
LINE BLANK
CHARACTER CIRCLE
CHARACTER HW 1 0.75
CHARACTER FILL ON
Y1LABEL DISPLACEMENT 12
X1LABEL DISPLACEMENT 10
X2LABEL DISPLACEMENT 15
.
MULTIPLOT CORNER COORDINATES 5 5 95 95
MULTIPLOT SCALE FACTOR 2
MULTIPLOT 2 2
.
Y1LABEL ^SY
LOOP FOR K = 2 1 P
LET RIJ = CORRELATION Y X^K
LET RIJ = ROUND(RIJ,3)
X1LABEL ^SX^K
X2LABEL Correlation: ^RIJ
PLOT Y VS X^K
END OF LOOP
.
LOOP FOR K = 2 1 P
LET IK1 = K + 1
Y1LABEL ^SX^K
LOOP FOR J = IK1 1 P
LET RIJ = CORRELATION X^K X^J
LET RIJ = ROUND(RIJ,3)
X1LABEL ^SX^J
X2LABEL Correlation: ^RIJ
PLOT X^K VS X^J
END OF LOOP
END OF LOOP
END OF MULTIPLOT
LABEL
.
JUSTIFICATION CENTER
MOVE 50 97
TEXT Basic Preliminary Plots

.
. Step 2: Generate the fit
.
SET WRITE DECIMALS 5
SET LIST NEW WINDOW OFF
FEEDBACK OFF
capture program3.out
FIT Y X2 X3
WRITE " "
WRITE " "
WRITE " "
WRITE "         ANOVA Table"
WRITE " "
LIST dpst5f.dat
WRITE " "
WRITE " "
CAPTURE SUSPEND
.
. Step 3b: Generate the basic residual analysis
.
TITLE SIZE 4
TIC MARK LABEL SIZE 4
CHARACTER HW 2 1.5
SET 4PLOT MULTIPLOT ON
TITLE AUTOMATIC
4-PLOT RES
TITLE SIZE 2
TIC MARK LABEL SIZE 2
CHARACTER HW 1 0.75

.
MULTIPLOT 2 2
TITLE AUTOMATIC
Y1LABEL Predicted Values
X1LABEL Response Values
PLOT PRED VS Y
Y1LABEL Residuals
X1LABEL Predicted Values
PLOT RES VS PRED
LOOP FOR K = 2 1 P
Y1LABEL Residuals
X1LABEL Predicted Values
PLOT RES VS X^K
END OF LOOP
END OF MULTIPLOT
LABEL
TITLE

.
. Step 3c: Read the information written to auxiliary files.  Some of
.          these values will be used later in this macro.
.
SKIP 1
READ dpst1f.dat COEF COEFSD TVALUE PARBONLL PARBONUL
READ dpst2f.dat PREDSD PRED95LL PRED95UL PREDBLL PREDBUL PREDHLL PREDHUL
READ dpst3f.dat HII VARRES STDRES STUDRES DELRES ESTUDRES COOK DFFITS
READ dpst4f.dat TEMP1 TEMP2
LET S2B    = VARIABLE TO MATRIX TEMP1 P
LET XTXINV = VARIABLE TO MATRIX TEMP2 P
DELETE TEMP1 TEMP2
SKIP 0
.
. Step 4: Partial Residual and Partial Regression Plots, CCPR and
.         Partial Leverage Plots not generated
.
MULTIPLOT 2 2
TITLE AUTOMATIC
Y1LABEL Residuals + A1*X2
X1LABEL X2
PARTIAL RESIDUAL   PLOT Y X2 X3 X2
Y1LABEL Residuals + A2*X3
X1LABEL X3
PARTIAL RESIDUAL   PLOT Y X2 X3 X3
.
Y1LABEL Residuals: X2 Removed
X1LABEL Residuals: X2 Versus X3
PARTIAL REGRESSION PLOT Y X2 X3 X2
Y1LABEL Residuals: X3 Removed
X1LABEL Residuals: X3 Versus X2
PARTIAL REGRESSION PLOT Y X2 X3 X3
END OF MULTIPLOT
TITLE
LABEL
JUSTIFICATION CENTER
MOVE 50 97
Text Partial Residual and Partial Regression Plots
.

. Step 5: Define a function to calculate the regression estimate
.         for new data (F)
.
LET A0 = COEF(1)
LET FUNCTION F = A0
LET DUMMY = PREDSD(1)
.
.         Use (2*NVAR) rather than (2*P) if no constant term in
.         joint interval
.
LOOP FOR K = 1 1 NVAR
LET INDX = K + 1
LET A^K = COEF(INDX)
LET FUNCTION F = F + (A^K)*(Z^K)
END OF LOOP
.
LET Z1 = DATA 220 375
LET Z2 = DATA 2500 3500
LET YNEW = F
.
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "NEW X VALUES, ESTIMATED NEW VALUE"
PRINT Z1 Z2 YNEW
CAPTURE SUSPEND
.
. Step 5b: Print the X'X inverse matrix and parameter
.          variance-covariance matrix
.
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "THE X'X INVERSE MATRIX"
PRINT XTXINV
.
PRINT " "
PRINT " "
PRINT "THE PARAMETER VARIANCE-COVARIANCE MATRIX"
PRINT S2B
CAPTURE SUSPEND
.
. Step 5c: Calculate:
.          1) The variance of a new point (S2YHAT)
.          2) A confidence interval for a new point
.          3) A joint confidence interval for more than one point
.          4) A prediction interval for a new point
.          5) A Scheffe joint prediction interval for more than one point
.
LET NPT = SIZE YNEW
LET NP = N - P
LOOP FOR IK = 1 1 NPT
LET XNEW(1) = 1
LET XNEW(2) = Z1(IK)
LET XNEW(3) = Z2(IK)
LOOP FOR K = 1 1 P
LET DUMMY2 = VECTOR DOT PRODUCT XNEW S2B^K
LET SUMZ(K) = DUMMY2
END OF LOOP
LET S2YHAT = VECTOR DOT PRODUCT SUMZ XNEW
LET S2YPRED = MSE + S2YHAT
LET YHATS2(IK) = S2YHAT
LET YPREDS2(IK) = S2YPRED
LET SYHAT = SQRT(S2YHAT)
LET YHATS(IK) = SYHAT
LET SYPRED = SQRT(S2YPRED)
LET YPREDS(IK) = SYPRED
LET YHAT = YNEW(IK)
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "THE PREDICTED VALUE FOR THE NEW POINT = ^YHAT"
PRINT "THE VARIANCE OF THE NEW VALUE = ^S2YHAT"
PRINT "THE VARIANCE FOR PREDICTION INTERVALS = ^S2YPRED"
CAPTURE SUSPEND
LET T = TPPF(.975,NP)
LET YHATU = YHAT + T*SYHAT
LET YHATL = YHAT - T*SYHAT
LET YPREDU = YHAT + T*SYPRED
LET YPREDL = YHAT - T*SYPRED
CAPTURE RESUME
PRINT " "
PRINT "95% CONFIDENCE INTERVAL FOR YHAT: ^YHATL <= YHAT <= ^YHATU"
PRINT "95% PREDICTION INTERVAL FOR YHAT: ^YPREDL <= YHAT <= ^YPREDU"
CAPTURE SUSPEND
END OF LOOP
.
LET ALPHA = 0.10
LET DUMMY = 1 - ALPHA/(2*NPT)
LET B = TPPF(DUMMY,NP)
LET JOINTBU = YNEW + B*YHATS
LET JOINTBL = YNEW - B*YHATS
CAPTURE RESUME
PRINT " "
PRINT "90% BONFERRONI JOINT CONFIDENCE INTERVALS FOR NEW VALUES"
PRINT JOINTBL YNEW JOINTBU
CAPTURE SUSPEND
LET W = P*FPPF(.90,P,NP)
LET W = SQRT(W)
LET JOINTWU = YNEW + W*YHATS
LET JOINTWL = YNEW - W*YHATS
CAPTURE RESUME
PRINT " "
PRINT "90% HOTELLING JOINT CONFIDENCE INTERVALS FOR NEW VALUES"
PRINT JOINTWL YNEW JOINTWU
CAPTURE SUSPEND
LET JOINTBU = YNEW + B*YPREDS
LET JOINTBL = YNEW - B*YPREDS
CAPTURE RESUME
PRINT " "
PRINT "90% BONFERRONI JOINT PREDICTION INTERVALS FOR NEW VALUES"
PRINT JOINTBL YNEW JOINTBU
CAPTURE SUSPEND
LET S = NPT*FPPF(.90,NPT,NP)
LET S = SQRT(S)
LET JOINTSU = YNEW + S*YPREDS
LET JOINTSL = YNEW - S*YPREDS
CAPTURE RESUME
PRINT " "
PRINT "90% SCHEFFE JOINT PREDICTION INTERVALS FOR NEW VALUES"
PRINT JOINTSL YNEW JOINTSU
CAPTURE SUSPEND
.
. Step 6: Plot various diagnostic statistics
.
.         Derive Cook's V statistic and Mahalanobis distance
.
LET V = (PREDSD**2)/(RESSD**2)
LET MAHAL = ((HII-1/N)/(1-HII))*(N*(N-2)/(N-1))
LET HBAR = P/N
LET DUMMY = SUM HII
SET WRITE FORMAT 5F10.5
CAPTURE RESUME
PRINT " "
PRINT "            HII           COOK         DFFITS              V          MAHAL"
PRINT HII COOK DFFITS V MAHAL
CAPTURE SUSPEND
SET WRITE FORMAT
.
. Step 6b: Plot various diagnostic statistics
.
.          Plot various residuals
.
X1LABEL
XLIMITS 0 15
MAJOR XTIC MARK NUMBER 4
XTIC OFFSET 0 1
MULTIPLOT 2 2
TITLE STANDARDIZED RESIDUALS
PLOT STDRES
TITLE INTERNALLY STUDENTIZED RESIDUALS
PLOT STUDRES
TITLE DELETED RESIDUALS
X1LABEL PRESS STATISTIC = ^PRESSP
PLOT DELRES
X1LABEL
TITLE EXTERNALLY STUDENTIZED RESIDUALS
PLOT ESTUDRES
END OF MULTIPLOT

.
.          Plot several diagnostic statistics
.
.
MULTIPLOT 2 2
CHARACTER FILL ON OFF
CHARACTER CIRCLE BLANK
LINE BLANK SOLID DOTTED
TITLE PLOT OF LEVERAGE POINTS
Y1LABEL
X1LABEL DOTTED LINE AT 2*AVERAGE LEVERAGE
YTIC OFFSET 0 0.1
LET TEMP6 = DATA 1 N
LET DUMMY = 2*HBAR
LET TEMP4 = DATA DUMMY DUMMY
LET TEMP5 = DATA HBAR HBAR
SPIKE ON
SPIKE BASE HBAR
SPIKE LINE DOTTED
.
PLOT HII AND
PLOT TEMP4 TEMP5 VS TEMP6
SPIKE OFF
YTIC OFFSET 0 0
.
CHARACTER CIRCLE BLANK BLANK
LINE BLANK DOTTED DOTTED
Y1LABEL
X1LABEL DOTTED LINES AT 1 AND 2*SQRT(P/N)
LET TEMP4 = DATA 1 1
LET DUMMY = 2*SQRT(P/N)
LET TEMP5 = DATA DUMMY DUMMY
TITLE PLOT OF DFFITS POINTS
.
PLOT DFFITS AND
PLOT TEMP4 TEMP5 VS TEMP6
.
X1LABEL DOTTED LINES AT 20 AND 50TH PERCENTILES OF F DISTRIBUTION
LET DUMMY = FPPF(.20,P,NP)
LET TEMP4 = DATA DUMMY DUMMY
LET DUMMY = FPPF(.50,P,NP)
LET TEMP5 = DATA DUMMY DUMMY
TITLE PLOT OF COOK'S DISTANCE
.
PLOT COOK AND
PLOT TEMP4 TEMP5 VS TEMP6
.
X1LABEL
TITLE PLOT OF MAHALANOBIS DISTANCE
PLOT MAHAL
.
END OF MULTIPLOT

.
. Step 6c: Calculate:
.
.          1. Catcher matrix
.          2. Variance inflation factors
.          3. Condition numbers of X'X (based on singular
.             values of scaled X)
.          4. Partial regression plots (also called added variable plots)
.          5. Partial leverage plots
.          6. DFBETA'S
.
LET XMAT  = CREATE MATRIX X1 X2 X3
LET C     = CATCHER MATRIX XMAT
LET VIF   = VARIANCE INFLATION FACTORS XMAT
LET RJ    = 0 FOR I = 1 1 NP
LET RJ    = 1 - (1/VIF)  SUBSET VIF > 0
LET TOL   = 1/VIF        SUBSET VIF > 0
LET DCOND = CONDITION INDICES XMAT
CAPTURE RESUME
PRINT " "
PRINT " "
PRINT "                                 CONDITION"
PRINT " Rj-SQUARE       VIF   TOLERANCE   INDICES"
SET WRITE FORMAT 5X,F5.3,F10.5,F12.5,F10.5
PRINT RJ VIF TOL DCOND
CAPTURE SUSPEND
.
MULTIPLOT 2 2
CHARACTER CIRCLE
LINE BLANK
SPIKE BLANK
LIMITS DEFAULT
TIC OFFSET 0 0
MAJOR TIC MARK NUMBER DEFAULT
.
LOOP FOR K = 2 1 P
TITLE Partial Regression Plot for X^K
PARTIAL REGRESSION PLOT Y X2 X3 X^K
.
TITLE PARTIAL LEVERAGE FOR X^K
PARTIAL LEVERAGE PLOT Y X2 X3 X^K
.
LET DUMMY = XTXINV^K(K)
LET DFBETA^K = (C^K*ESTUDRES)/SQRT(DUMMY*(1-HII))
END OF LOOP
END OF MULTIPLOT

.
LET DUMMY = XTXINV1(1)
TITLE PLOT OF DFBETA'S (B0, B1, B2)
LINE BLANK ALL
CHARACTER B0 B1 B2
CHARACTER SIZE 2 ALL
LET TEMP4 = SEQUENCE 1 1 N
LET DFBETA1 = (C1*ESTUDRES)/SQRT(DUMMY*(1-HII))
PLOT DFBETA1 DFBETA2 DFBETA3 VS TEMP4


Date created: 09/02/2021
Last updated: 12/04/2023
