LINEAR FIT

Name:
    ... FIT
Type:
    Analysis Command
Purpose:
    Estimate the parameters for a linear, polynomial, or multi-linear least squares fit.
Description:
    The Dataplot FIT command can fit either non-linear models or linear (including polynomial and multi-linear) models.

    Non-linear models are specified by entering an equation (e.g., FIT Y = A + B*X). For non-linear fits, Dataplot uses an iterative modified Levenberg-Marquardt algorithm. Although this algorithm can handle linear and polynomial models, non-iterative methods designed specifically for linear models are more efficient and allow additional diagnostics to be computed. The non-iterative fit method is described here (the help for the non-linear fit can be accessed with the HELP FIT command).

    When the FIT command gives a list of variables without a functional equation, the non-iterative (linear) algorithm is used.

    For linear fits, Dataplot adopted the fitting code from the OMNITAB II statistical program. This is a modified Gram-Schmidt algorithm with iterative refinement. The Gram-Schmidt algorithm is based on the QR decomposition and is intended for full rank models. Since Gram-Schmidt algorithms and QR decompositions are well documented in the literature, we do not give the mathematical details here.
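
    As a rough illustration of the idea, the following is a minimal Python sketch (not Dataplot code, and not the OMNITAB II routine) of a least squares fit via modified Gram-Schmidt orthogonalization. It assumes a full rank design matrix and omits the iterative refinement step. The function name mgs_least_squares and the sample data are illustrative assumptions only.

```python
def mgs_least_squares(columns, y):
    """Least squares via modified Gram-Schmidt QR (full rank assumed).

    columns: list of design-matrix columns, each a list of floats.
    Returns one coefficient per column.
    """
    p = len(columns)
    q = [list(col) for col in columns]      # working copies; become orthonormal
    r = [[0.0] * p for _ in range(p)]       # upper-triangular factor R
    for j in range(p):
        norm = sum(v * v for v in q[j]) ** 0.5
        if norm == 0.0:
            raise ValueError("design matrix is rank deficient")
        r[j][j] = norm
        q[j] = [v / norm for v in q[j]]
        # Modified variant: remove q[j] from every later column right away.
        for k in range(j + 1, p):
            r[j][k] = sum(a * b for a, b in zip(q[j], q[k]))
            q[k] = [b - r[j][k] * a for a, b in zip(q[j], q[k])]
    # Solve R b = Q'y by back substitution.
    qty = [sum(a * b for a, b in zip(q[j], y)) for j in range(p)]
    b = [0.0] * p
    for j in reversed(range(p)):
        s = qty[j] - sum(r[j][k] * b[k] for k in range(j + 1, p))
        b[j] = s / r[j][j]
    return b


# Illustrative data: a column of ones gives the constant term A0.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
ones = [1.0] * len(x)
a0, a1 = mgs_least_squares([ones, x], y)
```

    Passing additional columns (for example the squares of x) would give the polynomial or multi-linear case under the same assumptions.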

    For linear fits, the FIT command generates the following output.

    1. A table containing the parameter estimates, the parameter standard deviations, and the parameter t-values is printed. The t-value is used to determine if a given parameter is statistically significant.

      These values are also written to the file dpst1f.dat. In addition, lower and upper Bonferroni joint confidence limits for the parameters are written to dpst1f.dat with a 5E15.7 format. By default, 95% intervals are used for the Bonferroni intervals. You can define the parameter ALPHA to change the significance level. For example, to use 90% intervals, enter the command:

        LET ALPHA = 0.9

      To read these values into Dataplot variables, enter the command

        SKIP 1
        READ DPST1F.DAT COEF COEFSD TVAL BONL BONU

    2. The following are written to the file dpst2f.dat

        Column 1: standard deviations of the predicted values
        Column 2: 95% lower confidence limit for the predicted values
        Column 3: 95% upper confidence limit for the predicted values
        Column 4: 99% lower confidence limit for the predicted values
        Column 5: 99% upper confidence limit for the predicted values
        Column 6: 95% lower joint Bonferroni confidence limit for the predicted values
        Column 7: 95% upper joint Bonferroni confidence limit for the predicted values
        Column 8: 95% lower joint Hotelling confidence limit for the predicted values
        Column 9: 95% upper joint Hotelling confidence limit for the predicted values

      These values are written with a 9E15.7 format. By default, 95% intervals are used for the Bonferroni and Hotelling intervals. You can define the parameter ALPHA to change the significance level. For example, to use 90% intervals, enter the command:

        LET ALPHA = 0.9

      To read these values into Dataplot after the FIT, enter the command

        SKIP 1
        SET READ FORMAT 9E15.7
        READ DPST2F.DAT PREDSD PRED95LL PRED95UL PRED99LL PRED99UL ...
                    PREDBLL PREDBUL PREDHLL PREDHUL

    3. The following are written to the file dpst3f.dat

      The variables written to this file are used in "regression diagnostics". More will be said about this later.

        Column 1: the diagonals of the hat matrix (the hat matrix is \( X(X'X)^{-1}X' \) where \( X' \) is the transpose of the \( X \) matrix). In themselves, the diagonal elements are measures of the leverage of a given point. The minimum leverage is \( \frac{1} {n} \), the maximum leverage is 1.0 and the average leverage is \( \frac{p} {n} \) where \( p \) is the number of variables in the fit (including the constant term). These elements are also used to calculate many other diagnostic statistics. Note that

          \( H_{ii} = \frac{\mbox{VAR(Predicted Value)}} {\mbox{Residual Variance}} \)
        Column 2: the variance of the residuals

          \( \mbox{VAR(res)} = \mbox{MSE} (1 - H_{ii}) \)
        Column 3: the standardized residuals. These are the residuals divided by the square root of the mean square error.

          \( \mbox{STRES} = \frac{\mbox{residual}} {\sqrt{\mbox{MSE}}} \)
        Column 4: the internally studentized residuals. These are the residuals divided by their standard deviations.
        Column 5: the deleted residuals. These are residuals obtained from subtracting the predicted values with the i-th case omitted from the observed value.
        Column 6: the externally studentized residuals. These are the deleted residuals divided by their standard deviation.
        Column 7: Cook's distance. This is a measure of the impact of the i-th case on all of the estimated regression coefficients.

          \( \mbox{Cook} = \frac{\mbox{res}^2}{p \mbox{MSE}} \frac{H_{ii}} {(1 - H_{ii})^2} \)
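        Column 8: DFFITS. This is the externally studentized residual scaled by the leverage of the i-th case.

          \( \mbox{DFFITS} = \mbox{EXTSRES} \sqrt{\frac{H_{ii}} {1 - H_{ii}}} \)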

      Additional diagnostic statistics can be computed from these values. Several of the texts in the REFERENCE section below discuss the use and interpretation of these statistics in more detail. These variables can be read in as follows:

        SKIP 1
        SET READ FORMAT 8E15.7
        READ DPST3F.DAT HII VARRES STDRES ISTUDRES DELRES ...
                    ESTUDRES COOK DFFITS
        SKIP 0
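
      To make the definitions concrete, the following minimal Python sketch (not Dataplot code) computes the eight quantities for a straight-line fit with a constant term, where the leverage has the closed form 1/n + (x - xbar)^2 / Sxx. It uses the standard textbook formulas; whether Dataplot evaluates them in exactly this way internally is an assumption. The function name and data are illustrative, and the dictionary keys simply mirror the variable names in the READ example.

```python
def simple_fit_diagnostics(x, y):
    """Case diagnostics for the straight-line fit y = a0 + a1*x.

    Returns one dict per observation (requires more than 3 observations).
    """
    n, p = len(x), 2                      # p = number of fitted parameters
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    a1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a0 = ybar - a1 * xbar
    res = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in res) / (n - p)

    rows = []
    for xi, e in zip(x, res):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx          # hat-matrix diagonal
        var_res = mse * (1.0 - h)
        # Mean square error of the fit with this case left out.
        mse_del = ((n - p) * mse - e * e / (1.0 - h)) / (n - p - 1)
        ext = e / (mse_del * (1.0 - h)) ** 0.5
        rows.append({
            "HII": h,
            "VARRES": var_res,
            "STDRES": e / mse ** 0.5,
            "ISTUDRES": e / var_res ** 0.5,
            "DELRES": e / (1.0 - h),
            "ESTUDRES": ext,
            "COOK": e * e / (p * mse) * h / (1.0 - h) ** 2,
            "DFFITS": ext * (h / (1.0 - h)) ** 0.5,
        })
    return rows


x = [1.0, 2.0, 3.0, 4.0, 10.0]            # the last point has high leverage
y = [1.2, 1.9, 3.2, 3.8, 12.0]
diagnostics = simple_fit_diagnostics(x, y)
```

      In this sketch the leverages sum to the number of parameters, and the deleted residual agrees with the value obtained by actually refitting without the case.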

      For more discussion of how these variables can be used, see the HELP REGRESSION DIAGNOSTICS section.

    4. The variance-covariance matrix of the parameters and the inverse of the \( X'X \) matrix are written to the file dpst4f.dat. These values can be used in deriving additional statistics, intervals and tests. The use of these matrices is demonstrated in the Program example given in the HELP REGRESSION DIAGNOSTICS section.

      To read these, you can do the following

        SKIP 1
        READ DPST4F.DAT TEMP1 TEMP2
        LET P = 2; . P denotes the number of parameters
        LET S2B = VARIABLE TO MATRIX TEMP1 P
        LET XTXINV = VARIABLE TO MATRIX TEMP2 P

    5. A regression ANOVA table is written to dpst5f.dat. In addition to the ANOVA table, the \( R^2 \), adjusted \( R^2 \), and PRESS P statistics are printed. These three values are also saved as the internal parameters RSQUARE, ADJRSQUA, and PRESSP, respectively.

      To view the ANOVA table, enter

      LIST dpst5f.dat

      Starting with the August 2021 version, the following values printed in the ANOVA table are now saved as internal parameters

        RESSS - the residual sum of squares
        SSREG - the regression sum of squares
        SSTOTAL - the total sum of squares
        MSE - the mean square error
        MSR - the mean square of the regression
        FSTAT - the value of the F statistic
        FCV95 - the 95% critical value for the F statistic
        FCV99 - the 99% critical value for the F statistic

    6. The residual standard deviation and its corresponding degrees of freedom are stored in the parameters RESSD and RESDF, respectively. RESDF is the number of observations minus the number of independent variables in the fit (including the constant term). The formula for RESSD is:

        \( \mbox{RESSD} = \sqrt{\frac{\sum_{i=1}^{n} {(Y_{i} - \hat{Y}_{i})^{2}}} {\mbox{RESDF}}} \)
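
      The ANOVA quantities and RESSD are related by simple arithmetic. The following minimal Python sketch (not Dataplot code) shows those relations, assuming the model includes a constant term and that the total sum of squares is corrected for the mean; Dataplot's exact conventions are not confirmed here. FCV95, FCV99, and PRESSP are not computed, since they need F quantiles or leave-one-out fits. The function name and data are illustrative.

```python
def regression_anova(y, pred, n_params):
    """Sums of squares and related statistics for a fit with a constant term.

    y: observed responses; pred: fitted values; n_params: number of fitted
    parameters, including the constant.
    """
    n = len(y)
    ybar = sum(y) / n
    sstotal = sum((yi - ybar) ** 2 for yi in y)
    resss = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ssreg = sstotal - resss               # valid for least squares with a constant
    resdf = n - n_params
    mse = resss / resdf
    msr = ssreg / (n_params - 1)
    return {
        "RESSS": resss,
        "SSREG": ssreg,
        "SSTOTAL": sstotal,
        "MSE": mse,
        "MSR": msr,
        "FSTAT": msr / mse,
        "RSQUARE": ssreg / sstotal,
        "ADJRSQUA": 1.0 - mse / (sstotal / (n - 1)),
        "RESSD": mse ** 0.5,
        "RESDF": resdf,
    }


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
pred = [1.09 + 1.97 * xi for xi in x]     # least squares line for these data
table = regression_anova(y, pred, n_params=2)
```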

    7. If there is replication in the independent variables, the replication standard deviation and corresponding degrees of freedom are printed. In addition, a lack of fit F test is performed. These are stored in the parameters REPSD, REPDF, and LOFCDF, respectively. The formulas are:

        \( \mbox{REPDF} = \sum_{i=1}^{nrep}{(n_{i} - 1)} \)

        \( \mbox{REPSD} = \sqrt{\frac{\sum_{i=1}^{nrep} \sum_{j=1}^{n_{i}} {(Y_{ij} - \bar{Y}_{i})^{2}}} {\mbox{REPDF}}} \)

        with \( nrep \), \( n_{i} \), \( Y_{ij} \), and \( \bar{Y}_{i} \) denoting the number of replication sets (distinct settings of the independent variables), the number of observations in the i-th replication set, the j-th response in the i-th replication set, and the mean response of the i-th replication set, respectively.
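
      The following minimal Python sketch (not Dataplot code) illustrates the replication standard deviation and the usual lack of fit F ratio, that is, the lack of fit mean square divided by the replication (pure error) mean square. It assumes a single independent variable whose repeated values match exactly; the function names are illustrative, and the cdf value stored in LOFCDF is not computed because it needs the F distribution.

```python
from collections import defaultdict


def replication_sd(x, y):
    """Pooled within-group standard deviation over repeated x values."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)             # exact matches on x form a group
    ss_pure = 0.0
    for ys in groups.values():
        gmean = sum(ys) / len(ys)
        ss_pure += sum((v - gmean) ** 2 for v in ys)
    repdf = len(y) - len(groups)          # same as the sum of (n_i - 1)
    if repdf == 0:
        raise ValueError("no replication in x")
    return (ss_pure / repdf) ** 0.5, repdf


def lack_of_fit_f(ressd, resdf, repsd, repdf):
    """Lack of fit F ratio and its numerator degrees of freedom."""
    ss_res = ressd ** 2 * resdf           # residual sum of squares
    ss_pure = repsd ** 2 * repdf          # replication sum of squares
    lof_df = resdf - repdf
    return ((ss_res - ss_pure) / lof_df) / (ss_pure / repdf), lof_df


repsd, repdf = replication_sd([1, 1, 2, 2, 3, 3], [1.0, 1.2, 2.1, 1.9, 3.4, 3.0])

# Summary values taken from the sample output in the Program section.
f_ratio, lof_df = lack_of_fit_f(7.86476, 105, 6.47902, 68)
```

      With the summary values from the sample output in the Program section, this ratio comes out close to the reported Lack of Fit F Ratio of 2.34374 on 37 and 68 degrees of freedom.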

    8. Dataplot saves the predicted values from a fit in the variable PRED and the residual values in the variable RES. These variables can be used in subsequent LET and PLOT commands to generate diagnostic plots of residuals and predicted values.
Syntax:
    <d> FIT <y> <x1> ... <xk>
                <SUBSET/EXCEPT/FOR qualification>
    where <d> is the optional specification of the desired degree:
                      LINEAR or FIRST-DEGREE (the default)
                      QUADRATIC or SECOND-DEGREE
                      CUBIC or THIRD-DEGREE
                      QUARTIC or FOURTH-DEGREE
                      QUINTIC or FIFTH-DEGREE
                      SEXTIC or SIXTH-DEGREE
                      SEPTIC or SEVENTH-DEGREE
                      OCTIC or EIGHT-DEGREE
                      NONIC or NINTH-DEGREE
                      DEXIC or TENTH-DEGREE;
                <y> is the response (= dependent) variable;
                <x1> ... <xk> is a list of 1 to 35 independent variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    The estimated parameters are stored in A0, A1, ... , Ak.

    If <d> is omitted, a linear fit is performed. In practice, the linear and quadratic fits receive heavy use while the other degrees are rarely used.

Examples:
    FIT Y X
    LINEAR FIT Y X
    FIT Y X1 X2 X5
    FIT Y X1 X2 X5 SUBSET TAG > 1
    QUADRATIC FIT PRESSURE TEMP
    CUBIC FIT V R
Note:
    Weighted fits are typically used in the following two situations.

    1. Weighting is one approach for dealing with non-constant variation in the residuals. It is not uncommon for the variance of the residuals to increase for the largest (or smallest) values of the independent variable. In this case, weights can be used to give less weight to the less precise measurements. The NIST/SEMATECH e-Handbook contains a discussion of weighted fits and an example of using weights to address non-constant variation.

    2. Weights can also be used to implement certain types of robust fitting. In this case, weights are used to down-weight observations based on the size of the associated residual. Outlier observations can sometimes distort a fit (i.e., in trying to fit the outlier point(s), the bulk of the data is poorly fit). Weighting based on the residuals can often provide a good fit to the bulk of the data without eliminating the outlier observations from the analysis.

      Enter HELP WEIGHTS and HELP BIWEIGHT for examples of this use of weighted fits in Dataplot.

    To specify weights for a least squares fit, enter the command

      WEIGHTS <var>

    where <var> is a variable containing the weights.

    Note that the RES variable contains the raw (unweighted) residuals after the fit. For residual plots and analysis, it may be preferable to work with the weighted residuals. You can create this with the command

      LET RESW = W*RES

    where W contains the weight variable.
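
    The effect of weights on a fit can be illustrated with a minimal Python sketch (not Dataplot code) of a weighted straight-line fit, which minimizes the weighted sum of squared residuals. The weights here are chosen by hand to down-weight one outlier; this is not the biweight procedure, and the function name and data are illustrative assumptions.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a0 + a1*x with non-negative weights w."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    a1 = sxy / sxx
    return ybar - a1 * xbar, a1


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 30.0]            # last point is an outlier from y = 2x
w = [1.0, 1.0, 1.0, 1.0, 0.0]             # hand-picked weights: ignore the outlier

a0_u, a1_u = weighted_line_fit(x, y, [1.0] * len(x))   # unweighted fit
a0_w, a1_w = weighted_line_fit(x, y, w)                # weighted fit

# Raw and weighted residuals, mirroring LET RESW = W*RES.
res = [yi - (a0_w + a1_w * xi) for xi, yi in zip(x, y)]
resw = [wi * ei for wi, ei in zip(w, res)]
```

    In this example the unweighted line is pulled strongly toward the outlier, while the weighted line recovers the trend of the remaining points.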

Note:
    When there are a large number of independent variables, subset selection procedures are often employed to identify the best candidate models. The BEST CP command can be used to perform a "best subsets" analysis based on Mallows Cp. Enter HELP BEST CP for details.

    Another approach is to generate principal components of the independent variables and to perform the fit based on the first several principal components. Although this approach can reduce problems introduced by multicollinearity, the downside is that the model may be less interpretable.

Note:
    The following matrix commands can be useful in regression diagnostics:

      LET VIF = VARIANCE INFLATION FACTORS
      LET C = CONDITION INDICES X
      LET XTXINV = XTXINV MATRIX X
      LET C = CATCHER MATRIX X

    The Program example in the HELP REGRESSION DIAGNOSTICS section also gives an example of using these commands.

Note:
    For multi-linear fits, enter the following command to omit the constant term from the model

      SET FIT ADDITIVE CONSTANT OFF

    To restore the default of including the constant term, enter

      SET FIT ADDITIVE CONSTANT ON
Note:
    Data transformations are often used to improve the quality of the fit. For example, some types of non-linear fits can be restated as linear fits with an appropriate transformation. Also, transformations are often applied to address non-homogeneous variation in the fit. The NIST/SEMATECH e-Handbook contains a discussion of this issue.

    Data transformations can be generated easily if needed via the LET command. The BOX-COX LINEARITY PLOT can be a useful command for determining an appropriate transformation.

    Some analysts prefer to standardize the independent variables and the dependent variable by subtracting the mean and dividing by the standard deviation. This is done to provide numerical stability (note that Dataplot scales the data internally before performing the regression calculations) and also so that the data and regression coefficients are on a common scale. The original regression and standardized model are related as follows

      \( x_{i}^{'} = \frac{x_{i} - \bar{x}}{s_{x}} \)

      \( y_{i}^{'} = \frac{y_{i} - \bar{y}}{s_{y}} \)

    with \( \bar{x} \) and \( s_x \) denoting the mean and standard deviation of the independent variable and \( \bar{y} \) and \( s_y \) denoting the mean and standard deviation of the dependent variable.

    The parameters are related by

      \( \beta_{k} = \frac{s_{y}}{s_{x_k}} \beta_{k}^{'} \)

      \( \beta_{0} = \bar{y} - \beta_{1} \bar{x}_1 - \ldots - \beta_{p} \bar{x}_p \)
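
    The back-transformation can be checked numerically. The following minimal Python sketch (not Dataplot code) fits a straight line to standardized data, converts the standardized slope back to the original scale, and recovers the constant term from the means. The helper line_fit and the data are illustrative assumptions.

```python
from statistics import mean, stdev


def line_fit(x, y):
    """Ordinary least squares for y = a0 + a1*x; returns (a0, a1)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    a1 = sxy / sxx
    return ybar - a1 * xbar, a1


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

# Standardize both variables: subtract the mean, divide by the standard deviation.
zx = [(xi - mean(x)) / stdev(x) for xi in x]
zy = [(yi - mean(y)) / stdev(y) for yi in y]

b0_std, b1_std = line_fit(zx, zy)         # fit on the standardized scale

# Convert back to the original scale.
b1 = stdev(y) / stdev(x) * b1_std
b0 = mean(y) - b1 * mean(x)

a0, a1 = line_fit(x, y)                   # direct fit, for comparison
```

    In the standardized fit the constant term is zero up to rounding, which is why the original constant has to be recovered from the means.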

    A variation on this is the correlation transformation (also called the standardized regression model). Specifically

      \( y_{i}^{'} = \frac{1}{\sqrt{n-1}} \frac{y_{i} - \bar{y}}{s_{y}} \)

      \( x_{ik}^{'} = \frac{1}{\sqrt{n-1}} \frac{x_{ik} - \bar{x}_{k}} {s_{x_k}} \)

    With this transformation, the \( X'X \) matrix reduces to a correlation matrix of the independent variables. If there are \( p \) independent variables, these transformations can be generated with the commands

       
      LET N = SIZE Y
      LET AFACT = 1/SQRT(N-1)
      LOOP FOR K = 1 1 P
          LET Z^K = STANDARDIZE X^K
          LET Z^K = AFACT*Z^K
      END OF LOOP
      LET YT = STANDARDIZE Y
      LET YT = AFACT*YT
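
    The claim that the cross-product matrix becomes a correlation matrix can be verified with a minimal Python sketch (not Dataplot code). It applies the correlation transformation to two illustrative variables and forms the matrix of cross products; the function names and data are assumptions made for the example.

```python
from statistics import mean, stdev


def correlation_transform(v):
    """Scale v to (v - mean) / (sd * sqrt(n - 1))."""
    n, m, s = len(v), mean(v), stdev(v)
    fact = 1.0 / (n - 1) ** 0.5
    return [fact * (vi - m) / s for vi in v]


def cross_products(cols):
    """Matrix of inner products between columns, i.e. X'X."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
xtx = cross_products([correlation_transform(x1), correlation_transform(x2)])
```

    The diagonal entries of xtx are 1 and the off-diagonal entry is the ordinary correlation coefficient between x1 and x2.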
Note:
    It is recommended that a FIT be followed by a residual analysis to assess the model adequacy. Specifically, the typical assumptions for the residuals are that they are independent with a common distribution having fixed location and variation. It is usually assumed that the common distribution is a normal distribution. The 4-PLOT command generates 4 plots that are useful in testing these assumptions. The NIST/SEMATECH e-Handbook contains a more detailed discussion of this issue.

    In addition, if there is a single independent variable in the model, it can be useful to plot the data with the fitted values overlaid.

    Linear fits allow a much richer set of diagnostics. For a fuller description and an example demonstrating these, see the HELP REGRESSION DIAGNOSTICS section.

Note:
    If you want to suppress the output to files dpst1f.dat, dpst2f.dat, dpst3f.dat, dpst4f.dat and dpst5f.dat, enter the command

      SET FIT AUXILLARY FILES OFF
Note:
    By default, the values written to dpst1f.dat, dpst2f.dat, dpst3f.dat and dpst4f.dat are written using a Fortran E15.7 format (that is, exponential format with 7 significant digits). You can specify the number of significant digits with the command

      SET AUXILLARY FILES DECIMAL POINTS <value>

    where the default is 7.
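
    If the auxiliary files need to be read outside Dataplot, the fixed-width layout can be split by column position. The following minimal Python sketch assumes each value occupies exactly 15 characters, as the E15.7 format implies; the sample record is made up for illustration and is not copied from an actual Dataplot file, and the function name is hypothetical.

```python
def parse_e15_line(line, width=15):
    """Split one fixed-width record into floats, 'width' characters per field."""
    line = line.rstrip("\r\n")
    fields = (line[i:i + width] for i in range(0, len(line), width))
    # Skip fields that are entirely blank (for example trailing padding).
    return [float(f) for f in fields if f.strip()]


# Hypothetical record holding two values in E15.7 layout.
record = "  0.1222970E+01  0.4107000E-01"
values = parse_e15_line(record)
```

    If the number of digits is changed with the command above, the field width in the file may differ, so the width argument would need to be adjusted accordingly.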

Default:
    None
Synonyms:
    None
Related Commands:
    FIT = Generate a non-linear fit.
    PRED = A variable where predicted values are stored.
    RES = A variable where residuals are stored.
    RESSD = A parameter where the residual standard deviation is stored.
    RESDF = A parameter where the residual degrees of freedom is stored.
    REPSD = A parameter where the replication standard deviation is stored.
    REPDF = A parameter where the replication degrees of freedom is stored.
    LOFCDF = A parameter where the lack of fit cdf is stored.
    WEIGHTS = Sets the weights for the fit command.
    BIWEIGHT = Perform a biweight transformation.
    EXACT RATIONAL FIT = Perform an exact rational fit.
    CALIBRATION = Perform a linear or quadratic calibration fit.
    LOWESS = Perform a locally weighted least squares smoothing.
    BOOTSTRAP FIT = Perform a linear or multi-linear fit based on the bootstrap.
    ORTHOGONAL DISTANCE FIT = Perform an orthogonal distance fit (useful for errors-in-variables models).
    SPLINE FIT = Perform a spline fit.
    SMOOTH = Perform a smoothing.
    ANOVA = Perform a fixed effects analysis of variance.
    MEDIAN POLISH = Perform a median polish.
    PLOT = Generate a data/function plot.
    4-PLOT = Generate a 4-plot.
References:
    Draper and Smith (1998), "Applied Regression Analysis", Third ed., John Wiley.

    Mosteller and Tukey (1977), "Data Analysis and Regression", Addison-Wesley.

    Cook and Weisberg (1982), "Residuals and Influence in Regression", Chapman and Hall.

    Belsley, Kuh, and Welsch (1980), "Regression Diagnostics", John Wiley.

    Neter, Wasserman, and Kutner (1990), "Applied Linear Statistical Models", Third ed., Irwin.

    Note that linear regression is covered in great detail in many statistics textbooks.

Applications:
    Fitting
Implementation Date:
    1987/06
    1988/09: Support for constant fit
    1992/03: Write COEF, SDCOEF, TCDF to dpst1f.dat
    1993/07: Write diagonal of hat matrix and parameter covariance matrix to file
    1994/01: Write SDPRED and limits to file
    1994/06: Fix bug in dpst4f.dat file for polynomial models
    1996/01: Fix bomb with constant fit
    2002/04: Support for no constant term
    2002/04: Print error message if singularity detected
    2002/06: Additional variables to dpst2f.dat and dpst3f.dat file
    2002/06: Write ANOVA table to dpst5f.dat
    2003/10: Support for HTML and LaTex output
    2013/10: Support for BIC statistic
    2014/06: User option to suppress writing to auxiliary files
    2019/04: User option to specify number of decimal points for auxiliary files
    2021/08: Save RESSS, SSREG, SSTOTAL, MSE, MSR, FSTAT, FCV95, and FCV99 as internal parameters
Program:
     
    . ALASKA PIPELINE RADIOGRAPHIC DEFECT BIAS CURVE
    . PERFORM A LINEAR REGRESSION
    SKIP 25
    READ BERGER1.DAT TRUE MEAS BATCH
    FIT MEAS TRUE
    .
    TITLE OFFSET 2
    TITLE CASE ASIS
    LABEL CASE ASIS
    CASE ASIS
    .
    TITLE Original Data with Predicted Values
    X1LABEL True Depth (in .001 inch)
    Y1LABEL Measured Depth
    CHARACTERS X
    LINES BLANK
    .
    PLOT MEAS PRED VS TRUE
    .
    LABEL
    TITLE
    MULTIPLOT CORNER COORDINATES 0 0 100 100
    SET 4-PLOT MULTIPLOT ON
    TIC MARK LABEL SIZE 4
    CHARACTER SIZE 4
    .
    4-PLOT RES
    .
    END OF MULTIPLOT
    JUSTIFICATION CENTER
    MOVE 50 97
    TEXT 4-Plot of Residuals (BERGER1.DAT)
        
    The following output is generated
                 Least Squares Multilinear Fit
      
     Sample Size:                                        107
     Number of Variables:                                  1
     Residual Standard Deviation:                    7.86476
     Residual Degrees of Freedom:                        105
     BIC:                                          448.67856
      
     Replication Case:
     Replication Standard Deviation:                 6.47902
     Replication Degrees of Freedom:                      68
     Number of Distinct Subsets:                          39
     Lack of Fit F Ratio:                            2.34374
     Lack of Fit F CDF (%):                         99.88354
     Lack of Fit Degrees of Freedom 1:                    37
     Lack of Fit Degrees of Freedom 2:                    68
      
     --------------------------------------------------------------------
                                                    Approximate
                Parameter Estimates          Standard Deviation   t-Value
     --------------------------------------------------------------------
       1  A0                       -1.96750             1.57479   -1.2494
       2  A1        TRUE            1.22297             0.04107   29.7781
        
    plot generated by sample program

    plot generated by sample program

Date created: 09/02/2021
Last updated: 12/04/2023

Please email comments on this WWW page to alan.heckert@nist.gov.