DIXON TEST

Name:

DIXON TEST

Type:

Analysis Command

Purpose:

Perform a Dixon test for a single outlier.

Description:

The Dixon test is based on comparing the distance of one end observation from its neighbors with the range of all the observations (or all but one or two observations). This is in contrast to the Grubbs test (and its generalizations, the Tietjen-Moore and extreme studentized deviate tests), which are based on the number of standard deviations separating the extreme observations from the mean.

Specifically, given a set of ordered observations Y₁, Y₂, ..., Y_N, the Dixon test statistic is computed as follows:

Sample Size	Test for Minimum	Test for Maximum
3 ≤ N ≤ 7	$\frac{Y_{2} - Y_{1}}{Y_{N} - Y_{1}}$	$\frac{Y_{N} - Y_{N - 1}}{Y_{N} - Y_{1}}$
8 ≤ N ≤ 10	$\frac{Y_{2} - Y_{1}}{Y_{N - 1} - Y_{1}}$	$\frac{Y_{N} - Y_{N - 1}}{Y_{N} - Y_{2}}$
11 ≤ N ≤ 13	$\frac{Y_{3} - Y_{1}}{Y_{N - 1} - Y_{1}}$	$\frac{Y_{N} - Y_{N - 2}}{Y_{N} - Y_{2}}$
14 ≤ N ≤ 30	$\frac{Y_{3} - Y_{1}}{Y_{N - 2} - Y_{1}}$	$\frac{Y_{N} - Y_{N - 2}}{Y_{N} - Y_{3}}$
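The table can be sketched in Python as follows. This is an illustrative sketch, not Dataplot's implementation: the function name `dixon_statistic` and its `tail` argument are assumptions made for this example. It relies on the symmetry of the table, where the test for the maximum is the test for the minimum applied to the negated data.

```python
def dixon_statistic(values, tail="max"):
    """Dixon ratio for the smallest ("min") or largest ("max") observation."""
    n = len(values)
    if not 3 <= n <= 30:
        raise ValueError("the table covers 3 <= N <= 30 only")
    if tail not in ("min", "max"):
        raise ValueError("tail must be 'min' or 'max'")

    # (i, j): the numerator is the gap between the tested point and its
    # i-th nearest neighbor; the denominator is the range after dropping
    # j points from the opposite end of the ordered sample.
    if n <= 7:
        i, j = 1, 0
    elif n <= 10:
        i, j = 1, 1
    elif n <= 13:
        i, j = 2, 1
    else:
        i, j = 2, 2

    # Testing the maximum is the minimum test applied to the negated data.
    y = sorted(values) if tail == "min" else sorted(-v for v in values)

    denominator = y[n - 1 - j] - y[0]
    if denominator == 0:
        raise ValueError("statistic undefined: the reference range is zero")
    return (y[i] - y[0]) / denominator


# Copper wire breaking strengths from the program example on this page.
wire = [568, 570, 570, 570, 572, 578, 584, 596]
print(dixon_statistic(wire, "max"))  # (596 - 584) / (596 - 570)
print(dixon_statistic(wire, "min"))  # (570 - 568) / (584 - 568)
```

For the wire data (N = 8), the maximum case evaluates to 12/26, which agrees with the test statistic reported in the sample output.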

The critical values are obtained via simulation: standard normal random samples of size N are generated and the Dixon test statistic is computed for each sample. The critical values are dynamically generated using 25,000 random samples.
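The simulation idea can be sketched in Python. This is a minimal sketch under stated assumptions, not Dataplot's code: the function names, the seed, and the simple empirical-quantile rule are all choices made for this illustration, and only the test for the maximum is shown.

```python
import random


def dixon_max_statistic(sample):
    """Dixon ratio for the largest observation (3 <= N <= 30)."""
    y = sorted(sample)
    n = len(y)
    if not 3 <= n <= 30:
        raise ValueError("need 3 <= N <= 30 observations")
    i, j = (1, 0) if n <= 7 else (1, 1) if n <= 10 else (2, 1) if n <= 13 else (2, 2)
    return (y[-1] - y[-1 - i]) / (y[-1] - y[j])


def simulate_critical_values(n, levels=(0.90, 0.95, 0.975, 0.99),
                             n_sim=25000, seed=0):
    """Estimate percent points of the statistic under normality."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    stats = sorted(
        dixon_max_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    )
    # Simple empirical quantile: the order statistic at position p * n_sim.
    return {p: stats[min(n_sim - 1, int(p * n_sim))] for p in levels}


critical = simulate_critical_values(8)
for level, value in critical.items():
    print(f"{100 * level:5.1f}%  {value:.3f}")
```

Because the values are simulated, they vary slightly from run to run and from seed to seed. For N = 8 they should land near the percent points listed in the sample output on this page (for example, about 0.55 at the 95 percent point).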

The null hypothesis of no outliers is rejected if the test statistic is greater than the critical value.
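As a small illustration of this decision rule, the sketch below compares the statistic for the copper wire data against the critical values printed in the sample output on this page. The helper name `decide` is hypothetical.

```python
# Statistic for the maximum of the wire data (N = 8): (Y_N - Y_{N-1}) / (Y_N - Y_2).
statistic = (596 - 584) / (596 - 570)

# Critical values copied from the sample output (simulated, so approximate).
critical_values = {0.10: 0.478, 0.05: 0.552, 0.025: 0.615, 0.01: 0.684}


def decide(statistic, critical_value):
    """Upper one-tailed rule: reject only if the statistic exceeds the cutoff."""
    return "reject H0" if statistic > critical_value else "do not reject H0"


decisions = {alpha: decide(statistic, cv) for alpha, cv in critical_values.items()}
for alpha, verdict in decisions.items():
    print(f"alpha = {alpha}: {verdict}")
```

The statistic (about 0.46) is below every listed critical value, so the null hypothesis is not rejected at any of these levels, matching the conclusions table in the sample output.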

There are a number of variants of the Dixon test (e.g., it can be adapted to handle more than one outlier). Dataplot uses the formulation of the Dixon test as given in the ASTM-E178 standard (this is taken from Dixon's Biometrics paper).

Dixon's test is generally limited to the case of small samples. One reason for this is that it is quite sensitive to the number of outliers being tested for, and this number can be difficult to determine for larger samples. It also assumes that the underlying data distribution (with the exception of the outlier) is normal. For this reason, it is recommended that a Dixon test be preceded by a normal probability plot. The normal probability plot can be used to determine whether the assumption of normality and the presence of at most one outlier are in fact reasonable.

Syntax 1:

If neither MINIMUM nor MAXIMUM is given, both the minimum and maximum points will be tested (the more extreme of the two values will be used).

Syntax 2:

The <labid> variable is only used to identify the point being tested as an outlier. It does not affect the computations.

Syntax 3:

This syntax performs a Dixon test on <y1> then on <y2> and so on.

Note that the syntax

DIXON MULTIPLE TEST Y1 TO Y4

is supported. This is equivalent to

DIXON MULTIPLE TEST Y1 Y2 Y3 Y4

Syntax 4:

This syntax performs a cross-tabulation of <x1> ... <xk> and performs a Dixon test for each unique combination of cross-tabulated values. For example, if X1 has 3 levels and X2 has 2 levels, there will be a total of 6 Dixon tests performed.
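The grouping logic can be illustrated with a short Python sketch. This shows the idea only and is not how Dataplot implements it: the function names and the data are hypothetical, and only the statistic for the maximum is computed in each cell.

```python
from collections import defaultdict


def dixon_max_statistic(sample):
    """Dixon ratio for the largest observation (3 <= N <= 30)."""
    y = sorted(sample)
    n = len(y)
    if not 3 <= n <= 30:
        raise ValueError("need 3 <= N <= 30 observations per cell")
    i, j = (1, 0) if n <= 7 else (1, 1) if n <= 10 else (2, 1) if n <= 13 else (2, 2)
    return (y[-1] - y[-1 - i]) / (y[-1] - y[j])


def replicated_dixon(y, *factors):
    """Compute one statistic per unique combination of factor levels."""
    cells = defaultdict(list)
    for value, *levels in zip(y, *factors):
        cells[tuple(levels)].append(value)
    return {cell: dixon_max_statistic(obs) for cell, obs in sorted(cells.items())}


# Hypothetical data: X1 has 3 levels, X2 has 2 levels, 3 observations per cell.
x1 = [1] * 6 + [2] * 6 + [3] * 6
x2 = ([1] * 3 + [2] * 3) * 3
y = [10.1, 10.4, 12.0, 9.8, 10.0, 10.3,
     11.2, 11.3, 11.9, 10.5, 10.6, 13.0,
     9.9, 10.2, 10.4, 10.0, 10.8, 11.1]

results = replicated_dixon(y, x1, x2)
for cell, statistic in results.items():
    print(cell, round(statistic, 3))
```

With 3 levels of X1 and 2 levels of X2, the dictionary holds 6 entries, one per cell of the cross-tabulation.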

Note that the syntax

DIXON REPLICATED TEST Y X1 TO X4

is supported. This is equivalent to

DIXON REPLICATED TEST Y X1 X2 X3 X4

If either the first or last replication variable has all unique elements, this variable will be interpreted as a lab-id variable rather than a replication variable.

Examples:

Note:

Masking can occur when we specify too few outliers in the test. For example, if we are testing for a single outlier when there are in fact two (or more) outliers, these additional outliers may influence the value of the test statistic enough so that no points are declared as outliers.

On the other hand, swamping can occur when we specify too many outliers in the test. For example, if we are testing for two outliers when there is in fact only a single outlier, both points may be declared outliers.

The possibility of masking and swamping is an important reason why it is useful to complement formal outlier tests with graphical methods. Graphics can often help identify cases where masking or swamping may be an issue.

Also, masking is one reason that trying to apply a single outlier test sequentially can fail. If there are multiple outliers, masking may cause the outlier test for the first outlier to return a conclusion of no outliers (and so the testing for any additional outliers is not done).
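Masking is easy to see numerically. The sketch below uses two hypothetical samples of size 8 (variations on the copper wire data, invented for this illustration) and the 95 percent point for N = 8 from the sample output on this page. The helper name is an assumption.

```python
def dixon_max_n8_to_10(values):
    """Dixon ratio for the maximum when 8 <= N <= 10: (Y_N - Y_{N-1}) / (Y_N - Y_2)."""
    y = sorted(values)
    if not 8 <= len(y) <= 10:
        raise ValueError("this helper only covers 8 <= N <= 10")
    return (y[-1] - y[-2]) / (y[-1] - y[1])


# Approximate 5% critical value for N = 8, taken from the sample output.
CRITICAL_5PCT = 0.552

# Hypothetical samples: one high outlier versus two high outliers.
one_outlier = [568, 570, 570, 570, 572, 578, 584, 620]
two_outliers = [568, 570, 570, 570, 572, 578, 619, 620]

stat_one = dixon_max_n8_to_10(one_outlier)   # 36 / 50
stat_two = dixon_max_n8_to_10(two_outliers)  # 1 / 50

print(stat_one, stat_one > CRITICAL_5PCT)
print(stat_two, stat_two > CRITICAL_5PCT)
```

With a single high value the ratio is 36/50 and the maximum is flagged. With two high values close together the gap in the numerator collapses to 1, the ratio drops to 1/50, and neither point is flagged even though both are far from the rest of the data.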

The Dixon and Grubbs tests are used to check for a single outlier. If there are in fact multiple outliers, the results of these tests can be distorted.

If multiple outliers are suspected, then the Tietjen-Moore or the generalized extreme studentized deviate tests may be preferred. The Tietjen-Moore test is a generalization of the Grubbs test for the case where multiple outliers may be present. The Tietjen-Moore test requires that the number of suspected outliers be specified exactly, while the generalized extreme studentized deviate test only requires that an upper bound on the suspected number of outliers be specified.

Note:

If you perform a formal goodness of fit test for assessing normality, it is recommended you omit the potential outlier from the test (i.e., we want to distinguish between an outlier and non-normality and the potential outlier may distort the normality test).

Note:

The number of decimal places in the output can be controlled with the command

SET WRITE DECIMALS <value>

Note:

The following parameters are saved by the command:

STATVAL	=	the value of the test statistic
CUTOFF0	=	the 0 percent point of the reference distribution
CUTOFF50	=	the 50 percent point of the reference distribution
CUTOFF75	=	the 75 percent point of the reference distribution
CUTOFF90	=	the 90 percent point of the reference distribution
CUTOFF95	=	the 95 percent point of the reference distribution
CUTOFF975	=	the 97.5 percent point of the reference distribution
CUTOFF99	=	the 99 percent point of the reference distribution

If the MULTIPLE or REPLICATED option is used, these values will be written to the file "dpst1f.dat" instead.

Note:

LET A = DIXON TEST Y

In addition to the above LET command, built-in statistics are supported for about 17 different commands (enter HELP STATISTICS for details).

Default:

None

Synonyms:

Related Commands:

GRUBBS TEST	= Perform a Grubbs outlier test.
TIETJEN-MOORE TEST	= Perform a Tietjen-Moore outlier test.
EXTREME STUDENTIZED DEVIATE TEST	= Perform an extreme studentized deviate outlier test.
GOODNESS OF FIT TEST	= Perform a goodness of fit test (Anderson-Darling, Kolmogorov-Smirnov, chi-square, PPCC)
WILKS SHAPIRO NORMALITY TEST	= Perform a Wilks Shapiro normality test.
HISTOGRAM	= Generate a histogram.
PROBABILITY PLOT	= Generate a probability plot.
BOX PLOT	= Generate a box plot.

Reference:

Biometrics

Dixon and Massey (1957), "Introduction to Statistical Analysis," Second Edition, McGraw-Hill, pp. 275-278.

ASTM E 178 - 08, "Standard Practice for Dealing with Outlying Observations," ASTM International, 100 Barr Harbor Drive, PO BOX C700, West Conshohocken, PA 19428-2959, USA.

Iglewicz and Hoaglin (1993), "Volume 16: How To Detect and Handle Outliers," The ASQC Basic Reference in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

Applications:

Outlier Detection

Implementation Date:

2011/09

Program:

 
.  Following uses example 1 from ASTM E 178 - 08 standard.
.
.  Response variable is breaking strength (in pounds) of
.  0.104-in hard-drawn copper wire.
.  
let y = data 568 570 570 570 572 578 584 596
.
let a = dixon maximum test y
.
set write decimals 5
dixon maximum test y

            Dixon Test for a Single Outlier: Maximum Case
                          (Assumption: Normality)
 
Response Variable: Y
 
H0: There are no outliers
Ha: The maximum point is an outlier
 
Summary Statistics:
Number of Observations:                               8
Sample Minimum:                               568.00000
ID for Sample Minimum:                                0
Sample Maximum:                               596.00000
ID for Sample Maximum:                                0
Sample Mean:                                  576.00000
Sample SD:                                      9.68061
Sample Range:                                  28.00000
 
Dixon Test Statistic Value:                     0.46153
CDF Value:                                      0.88704
P-Value                                         0.11295
 
 
 
Percent Points of the Reference Distribution
-----------------------------------
  Percent Point               Value
-----------------------------------
            0.0    =          0.000
           25.0    =          0.101
           50.0    =          0.210
           75.0    =          0.349
           80.0    =          0.384
           90.0    =          0.478
           95.0    =          0.552
           97.5    =          0.615
           99.0    =          0.684
           99.5    =          0.724
          100.0    =          0.904
 
Conclusions (Upper 1-Tailed Test)
----------------------------------------------
  Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
    10%    90%            0.478      Accept H0
     5%    95%            0.552      Accept H0
   2.5%   97.5            0.615      Accept H0
     1%    99%            0.684      Accept H0
 
  *Critical Values Based on    25000 Monte Carlo Simulations