
GRUBBS TEST

Name:
    GRUBBS TEST
Type:
    Analysis Command
Purpose:
    Perform a Grubbs test for outliers.
Description:
    The Grubbs test, also known as the maximum normalized residual test, can be used to test for outliers in a univariate data set. Note that this test assumes normality, so you should test the data for normality before applying the Grubbs test.

    The Grubbs test detects one outlier at a time. For multiple outliers, delete the single outlier detected and rerun the Grubbs test on the remaining data. Repeat this process until no outliers are detected.
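    The delete-and-repeat procedure can be sketched in Python as follows. This is only an illustration, not Dataplot's implementation: the function name sequential_grubbs and the critical_value callback are hypothetical, and the constant cutoff of 2.2 in the usage lines is a placeholder rather than a real critical value.

```python
import statistics

def sequential_grubbs(data, critical_value):
    """Repeatedly remove the most extreme point while the Grubbs statistic
    exceeds the cutoff. critical_value(n) must return the cutoff for a
    sample of size n (supplied by the caller; hypothetical interface)."""
    remaining = list(data)
    removed = []
    while len(remaining) >= 3:
        mean = statistics.fmean(remaining)
        sd = statistics.stdev(remaining)  # sample SD (divisor N - 1)
        if sd == 0:
            break  # all values identical, nothing to test
        worst = max(remaining, key=lambda x: abs(x - mean))
        if abs(worst - mean) / sd <= critical_value(len(remaining)):
            break  # no outlier detected, stop
        remaining.remove(worst)
        removed.append(worst)
    return removed, remaining

# Hypothetical data and a placeholder cutoff (NOT a real critical value)
removed, remaining = sequential_grubbs([1, 2, 3, 4, 5, 6, 7, 8, 20], lambda n: 2.2)
print(removed, remaining)
```

    In practice the callback would return the critical value for the chosen significance level and the current sample size.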

    More formally, the Grubbs test can be defined as follows.
    H0: There are no outliers in the data.
    Ha: There is at least one outlier in the data.
    Test Statistic: \( G = \frac{\max_{i} |X_i - \bar{X}|} {s} \)

    where \( \bar{X} \) and s are the sample mean and standard deviation of the data. That is, the Grubbs test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation.
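    As a minimal sketch, the statistic can be computed with the Python standard library. The function name grubbs_statistic and the sample values are illustrative assumptions; the data are not taken from any Dataplot file.

```python
import statistics

def grubbs_statistic(data):
    """Largest absolute deviation from the sample mean, in units of the
    sample standard deviation (divisor N - 1)."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return max(abs(x - mean) for x in data) / sd

# Hypothetical data with one suspicious value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 20]
g = grubbs_statistic(sample)
print(round(g, 3))
```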

    Significance Level: \( \alpha \)
    Critical Region: The hypothesis of no outliers is rejected if

    \( G > \frac{N - 1} {\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/(2N),N-2)}} {N - 2 + t^2_{(\alpha/(2N),N-2)}}} \)

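    where \( t_{(\alpha/(2N),N-2)} \) is the upper critical value of the t distribution with N - 2 degrees of freedom at a significance level of \( \alpha/(2N) \), obtained from the percent point function of the t distribution, and N is the number of observations.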
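    The critical value can be evaluated numerically. The sketch below uses only the Python standard library, so the t percent point function is approximated by Simpson integration of the density plus bisection; this is an assumption made for self-containment and not how Dataplot computes TPPF. All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df):
    """CDF by Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    steps = 2 * max(100, math.ceil(a / 0.02))  # even count, step <= 0.01
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half

def t_upper_critical(tail, df):
    """Value t with P(T > t) = tail, located by bisection on [0, 1000]."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1.0 - t_cdf(mid, df) > tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def grubbs_critical(n, alpha, two_sided=True):
    """Critical value for the Grubbs test from the formula above.
    The one-sided tests use alpha/n instead of alpha/(2n)."""
    tail = alpha / (2 * n) if two_sided else alpha / n
    t = t_upper_critical(tail, n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))

crit = grubbs_critical(38, 0.05)
print(round(crit, 3))
```

    For N = 38 and a 5% significance level this can be compared with the 95 percent point listed in the sample output of the Program section.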

    Note that the above is actually a combination of the following two tests:

    1. the test that the minimum value is an outlier.

    2. the test that the maximum value is an outlier.

    To generate these one-sided tests, the test statistic is

      \( G = \frac{\bar{X} - X_{min}} {s} \)

    or

      \( G = \frac{X_{max} - \bar{X}} {s} \)

    For the one-sided tests, the significance level passed to the t percent point function (TPPF) is doubled: \( \alpha/N \) is used in place of \( \alpha/(2N) \) in the critical region.

    You can request that one of the one-sided tests be performed (see the Syntax section).
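    The relationship between the one-sided and two-sided statistics can be illustrated with a short Python sketch. The function name grubbs_min_max and the data are hypothetical.

```python
import statistics

def grubbs_min_max(data):
    """Return the one-sided Grubbs statistics for the minimum and maximum."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample SD (divisor N - 1)
    g_min = (mean - min(data)) / sd
    g_max = (max(data) - mean) / sd
    return g_min, g_max

# Hypothetical data
sample = [1, 2, 3, 4, 5, 6, 7, 8, 20]
g_min, g_max = grubbs_min_max(sample)

# The two-sided statistic is the larger of the two one-sided statistics
g_two_sided = max(g_min, g_max)
print(round(g_min, 3), round(g_max, 3), round(g_two_sided, 3))
```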

    Generally, graphical methods such as the box plot or histogram are used to detect outliers. However, the Grubbs test can be used if you prefer a more formal test.

Syntax 1:
    GRUBBS TEST <y>             <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response variable being tested;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax performs the two-sided test.

Syntax 2:
    GRUBBS MINIMUM TEST <y>             <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response variable being tested;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax performs the one-sided test for the minimum value.

Syntax 3:
    GRUBBS MAXIMUM TEST <y>             <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response variable being tested;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax performs the one-sided test for the maximum value.

Syntax 4:
    GRUBBS TEST <y> <labid>             <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response variable being tested;
                <labid> is a variable containing the lab-id corresponding to each value of the response variable;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax can also be used with the MINIMUM and MAXIMUM versions of the test. The <labid> variable is used to identify the lab-id of the minimum and maximum points. It is not used in the computation of the statistic.

Syntax 5:
    GRUBBS MULTIPLE TEST <y1> ... <yk>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y1> ... <yk> is a list of up to k response variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax can also be used with the MINIMUM and MAXIMUM versions of the test. This syntax performs a Grubbs test on <y1>, then on <y2>, and so on. Up to 30 response variables can be specified.

    Note that the syntax

      GRUBB MULTIPLE TEST Y1 TO Y4

    is supported. This is equivalent to

      GRUBB MULTIPLE TEST Y1 Y2 Y3 Y4
Syntax 6:
    GRUBBS REPLICATED TEST <y> <x1> ... <xk>
                            <SUBSET/EXCEPT/FOR qualification>
    where <y> is the response variable;
    <x1> ... <xk> is a list of up to k group-id variables;
    and where the <SUBSET/EXCEPT/FOR qualification> is optional.

    This syntax can also be used with the MINIMUM and MAXIMUM versions of the test. This syntax performs a cross-tabulation of <x1> ... <xk> and performs a Grubbs test for each unique combination of cross-tabulated values. For example, if X1 has 3 levels and X2 has 2 levels, there will be a total of 6 Grubbs tests performed.

    Up to six group-id variables can be specified.
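    The cross-tabulation can be pictured with the following Python sketch, which groups the response by each unique combination of group-id values and computes one Grubbs statistic per cell. It is an illustration of the idea only, not Dataplot's code; the function names and data are hypothetical.

```python
import statistics
from collections import defaultdict

def grubbs_statistic(values):
    """Two-sided Grubbs statistic for one cell."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / sd

def replicated_grubbs(y, *group_vars):
    """Compute one Grubbs statistic per unique combination of group ids."""
    cells = defaultdict(list)
    for i, value in enumerate(y):
        key = tuple(g[i] for g in group_vars)
        cells[key].append(value)
    return {key: grubbs_statistic(vals) for key, vals in sorted(cells.items())}

# Hypothetical data: X1 has 3 levels, X2 has 2 levels, 3 replicates per cell
x1 = [1] * 6 + [2] * 6 + [3] * 6
x2 = [1, 1, 1, 2, 2, 2] * 3
y = [10, 11, 15, 9, 10, 11, 20, 21, 22, 19, 25, 20, 30, 31, 29, 30, 30, 36]

results = replicated_grubbs(y, x1, x2)
for key, g in results.items():
    print(key, round(g, 3))
```

    With 3 levels of X1 and 2 levels of X2 the dictionary has 6 entries, one per test.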

    Note that the syntax

      GRUBB REPLICATED TEST Y X1 TO X4

    is supported. This is equivalent to

      GRUBB REPLICATED TEST Y X1 X2 X3 X4
Examples:
    GRUBBS TEST Y1
    GRUBBS TEST Y1 LABID
    GRUBBS MULTIPLE TEST Y1 Y2 Y3
    GRUBBS REPLICATED TEST Y X1 X2
    GRUBBS TEST Y1 SUBSET TAG > 2
    GRUBBS MINIMUM TEST Y1
    GRUBBS MAXIMUM TEST Y1
Note:
    Masking and swamping are two issues that can affect outlier tests.

    Masking can occur when we specify too few outliers in the test. For example, if we are testing for a single outlier when there are in fact two (or more) outliers, these additional outliers may influence the value of the test statistic enough so that no points are declared as outliers.

    On the other hand, swamping can occur when we specify too many outliers in the test. For example, if we are testing for two outliers when there is in fact only a single outlier, both points may be declared outliers.

    The possibility of masking and swamping is an important reason why it is useful to complement formal outlier tests with graphical methods. Graphics can often help identify cases where masking or swamping may be an issue.

    Also, masking is one reason that trying to apply a single outlier test sequentially can fail. If there are multiple outliers, masking may cause the outlier test for the first outlier to return a conclusion of no outliers (and so the testing for any additional outliers is not done).

    The Grubbs test is used to check for a single outlier. If there are in fact multiple outliers, the results of the Grubbs test can be distorted.
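    A small Python sketch shows how masking arises. Adding a second extreme value inflates both the sample mean and the sample standard deviation, so the test statistic shrinks and the single-outlier test becomes less likely to reject. The data and the function name are hypothetical.

```python
import statistics

def grubbs_statistic(data):
    """Two-sided Grubbs statistic."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return max(abs(x - mean) for x in data) / sd

# Hypothetical data: the same base values with one or two extreme points
one_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 20]
two_outliers = one_outlier + [20]

g_one = grubbs_statistic(one_outlier)
g_two = grubbs_statistic(two_outliers)

# The statistic is smaller when a second outlier is present
print(round(g_one, 3), round(g_two, 3))
```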

    If multiple outliers are suspected, then the Tietjen-Moore or the generalized extreme studentized deviate tests may be preferred. The Tietjen-Moore test is a generalization of the Grubbs test for the case where multiple outliers may be present. The Tietjen-Moore test requires that the number of suspected outliers be specified exactly while the generalized extreme studentized deviate test only requires that an upper bound on the suspected number of outliers be specified.

Note:
    Tests for outliers are dependent on knowing the distribution of the data. The Grubbs test assumes that the data come from an approximately normal distribution. For this reason, it is strongly recommended that the Grubbs test be complemented with a normal probability test. If the data are not approximately normally distributed, then the Grubbs test may be detecting the non-normality of the data rather than the presence of an outlier.
Note:
    You can specify the number of digits in the Grubbs output with the command

      SET WRITE DECIMALS <value>
Note:
    The GRUBBS TEST command automatically saves the following parameters:

      STATVAL = the value of the test statistic
      CUTOFF0 = the 0 percent point of the reference distribution
      CUTOFF50 = the 50 percent point of the reference distribution
      CUTOFF75 = the 75 percent point of the reference distribution
      CUTOFF90 = the 90 percent point of the reference distribution
      CUTOFF95 = the 95 percent point of the reference distribution
      CUTOFF975 = the 97.5 percent point of the reference distribution
      CUTOFF99 = the 99 percent point of the reference distribution

    If the MULTIPLE or REPLICATED option is used, these values will be written to the file "dpst1f.dat" instead.

Note:
    In addition to the GRUBBS TEST command, the following commands can also be used:

      LET A = GRUBBS CDF Y
      LET A = GRUBBS DIRECTION Y
      LET A = GRUBBS INDEX Y
      LET A = GRUBBS Y

    The GRUBBS INDEX returns the row index of the most extreme point and GRUBBS DIRECTION specifies whether the most extreme point is in the minimum direction (a -1 is returned) or the maximum direction (a +1 is returned).
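    The meaning of these two quantities can be sketched in Python. This is an analogue for illustration only; the function name is hypothetical, and the 1-based row index is an assumption about how rows are counted.

```python
import statistics

def grubbs_index_direction(data):
    """Return (row index, direction) of the point farthest from the mean.
    The index is 1-based (assumed); direction is +1 for the maximum side
    and -1 for the minimum side."""
    mean = statistics.fmean(data)
    deviations = [x - mean for x in data]
    pos = max(range(len(data)), key=lambda i: abs(deviations[i]))
    direction = 1 if deviations[pos] > 0 else -1
    return pos + 1, direction

# Hypothetical data
high = grubbs_index_direction([1, 2, 3, 4, 5, 6, 7, 8, 20])
low = grubbs_index_direction([-20, 1, 2, 3])
print(high, low)
```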

    In addition to the above LET commands, built-in statistics are supported for more than 20 different commands (enter HELP STATISTICS for details).

Note:
    ASTM E178-16a addresses the case where an independent estimate of the standard deviation is available. In this case, the independent estimate replaces the standard deviation based on the current sample data in the test statistic for the Grubbs test. This independent estimate of the standard deviation will also have an associated degrees of freedom (typically the sample size of the data used to compute this independent estimate of standard deviation).

    Alternatively, the population standard deviation may be considered to be known accurately (usually based on extensive historical data).

    In either of these cases, the critical values for the Grubbs test are modified.

    To support these options, enter the commands

    SET GRUBB STANDARD DEVIATION <value>
    SET GRUBB DEGREES OF FREEDOM <value>

    If the specified standard deviation is positive, Dataplot uses the formulas based on the independent estimate of the standard deviation. If the degrees of freedom are not specified, a value of 10,000 is used. Any value greater than 120 is effectively treated as a "known" population standard deviation.

    To compute the critical values using simulation, enter the command

    SET GRUBB TEST CRITICAL VALUES SIMULATION

    To reset the default of basing the critical values on a formula, enter

    SET GRUBB TEST CRITICAL VALUES FORMULA

    The formula from the E178 standard is

      \[ T_{n}(\alpha) = t_{\alpha/n,\nu} \sqrt{1 - (1/n)} \]

    where t is the percent point function of the t distribution and \( \nu \) is the degrees of freedom. For the "known" standard deviation case, the t distribution is replaced with a normal distribution.
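    For the "known" standard deviation case, the formula can be evaluated with the Python standard library. The sketch assumes that the subscript \( \alpha/n \) denotes an upper-tail probability, so the normal percent point function is evaluated at \( 1 - \alpha/n \); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def e178_known_sd_critical(n, alpha):
    """Critical value T_n(alpha) when the population SD is treated as known.
    Assumes alpha/n is an upper-tail probability of the standard normal."""
    z = NormalDist().inv_cdf(1 - alpha / n)
    return z * math.sqrt(1 - 1 / n)

value = e178_known_sd_critical(10, 0.05)
print(round(value, 4))
```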

Default:
    None
Synonyms:
    MULTIPLE GRUBBS TEST is a synonym for GRUBBS MULTIPLE TEST
    REPLICATED GRUBBS TEST is a synonym for GRUBBS REPLICATED TEST
Related Commands:
Reference:
    Grubbs, F. E., "Procedures for Detecting Outlying Observations in Samples", Technometrics, Vol. 11, No. 1, February, 1969, pp. 1-21.

    Stefansky, W., "Rejecting Outliers in Factorial Designs", Technometrics, Vol. 14, 1972, pp. 469-479.

    ASTM E178-16a (2016), "Standard Practice for Dealing with Outlying Observations", ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959, USA.

Applications:
    Outlier Detection
Implementation Date:
    1998/05  
    2005/5: Corrected the significance levels for the two-sided case (previous version was actually using the significance level for the one-sided case)
    2005/5: Added support for the one-sided tests
    2006/3: Replaced 2005/5 update with Syntax 2 and Syntax 3
    2009/10: Significantly modified the output format
    2009/10: Added support for Syntax 4, Syntax 5, and Syntax 6
    2019/10: Added support for an independent estimate of the standard deviation
Program:
     
    SKIP 25
    READ VANGEL31.DAT Y
    SET WRITE DECIMALS 4
    GRUBBS TEST Y
        
    The following output is generated:
                Grubb Test for Outliers: Test for Minimum and Maximum
                               (Assumption: Normality)
     
    Response Variable: Y
     
    H0: There are no outliers
    Ha: The extreme point is an outlier
     
    Summary Statistics:
    Number of Observations:                              38
    Sample Minimum:                                147.0000
    ID for Sample Minimum:                                1
    Sample Maximum:                                231.0000
    ID for Sample Maximum:                               38
    Sample Mean:                                   185.7894
    Sample SD:                                      18.5954
     
    Grubbs Test Statistic Value:                     2.4312
     
     
    Percent Points of the Reference Distribution
    -----------------------------------
      Percent Point               Value
    -----------------------------------
                0.0    =          0.000
               50.0    =          2.392
               75.0    =          2.601
               90.0    =          2.846
               95.0    =          3.013
               97.5    =          3.169
               99.0    =          3.355
              100.0    =          6.001
     
    Conclusions (Upper 1-Tailed Test)
    ----------------------------------------------
      Alpha    CDF   Critical Value     Conclusion
    ----------------------------------------------
        10%    90%            2.846      Accept H0
         5%    95%            3.013      Accept H0
       2.5%  97.5%            3.169      Accept H0
         1%    99%            3.355      Accept H0
        
Date created: 06/05/2001
Last updated: 12/11/2023
