
# DIXON TEST

Name:
DIXON TEST
Type:
Analysis Command
Purpose:
Perform a Dixon test for a single outlier.
Description:
The Dixon test can be used to test for a single outlier in a univariate data set. This test is primarily used for small data sets (Dataplot limits the sample to be between 3 and 30). It can be used to test whether the minimum value is an outlier, the maximum value is an outlier, or either the minimum or maximum value is an outlier.

The Dixon test is based on comparing the distance of one end observation from its neighbors with the range of all the observations (or all but one or two observations). This is in contrast to the Grubbs test (and its generalizations, the Tietjen-Moore and extreme studentized deviate tests), which are based on the number of standard deviations separating the extreme observations from the mean.

Specifically, given a set of observations Y1, Y2, ..., YN ordered from smallest to largest, the Dixon test statistic is computed as follows:

| Sample Size | Test for Minimum | Test for Maximum |
|---|---|---|
| 3 ≤ N ≤ 7 | $$\frac{Y_2 - Y_1}{Y_N - Y_1}$$ | $$\frac{Y_N - Y_{N-1}}{Y_N - Y_1}$$ |
| 8 ≤ N ≤ 10 | $$\frac{Y_2 - Y_1}{Y_{N-1} - Y_1}$$ | $$\frac{Y_N - Y_{N-1}}{Y_N - Y_2}$$ |
| 11 ≤ N ≤ 13 | $$\frac{Y_3 - Y_1}{Y_{N-1} - Y_1}$$ | $$\frac{Y_N - Y_{N-2}}{Y_N - Y_2}$$ |
| 14 ≤ N ≤ 30 | $$\frac{Y_3 - Y_1}{Y_{N-2} - Y_1}$$ | $$\frac{Y_N - Y_{N-2}}{Y_N - Y_3}$$ |
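The size-dependent ratios can be sketched in a few lines of Python. This is an illustration only, not Dataplot's implementation. The function name `dixon_statistic` and its `side` argument are hypothetical, and for 14 ≤ N ≤ 30 the maximum case is assumed to be the mirror image of the minimum case, (Y_N - Y_{N-2}) / (Y_N - Y_3), as in ASTM E178.

```python
def dixon_statistic(values, side="max"):
    """Dixon ratio for the largest ("max") or smallest ("min") observation."""
    s = sorted(values)
    n = len(s)
    if not 3 <= n <= 30:
        raise ValueError("sample size must be between 3 and 30")
    # j = how far the numerator gap reaches in from the tested end,
    # i = how many points are dropped from the opposite end in the denominator.
    if n <= 7:
        j, i = 1, 0
    elif n <= 10:
        j, i = 1, 1
    elif n <= 13:
        j, i = 2, 1
    else:
        j, i = 2, 2
    if side == "max":
        gap, spread = s[-1] - s[-1 - j], s[-1] - s[i]
    elif side == "min":
        gap, spread = s[j] - s[0], s[-1 - i] - s[0]
    else:
        raise ValueError("side must be 'min' or 'max'")
    if spread == 0:
        raise ValueError("denominator is zero (tied observations)")
    return gap / spread


# Copper-wire breaking strengths used in the Program section (N = 8).
y = [568, 570, 570, 570, 572, 578, 584, 596]
print(dixon_statistic(y, "max"))  # (596 - 584) / (596 - 570) = 12/26
print(dixon_statistic(y, "min"))  # (570 - 568) / (584 - 568) = 2/16
```

For the copper-wire data the maximum-case ratio is 12/26, which agrees with the test statistic reported in the sample output up to the last printed digit.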

The critical values are obtained via simulation: standard normal random samples of the given size are generated and the Dixon test statistic is computed for each sample. The critical values are dynamically generated using 25,000 random samples.
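The simulation can be sketched as follows. This is a rough Python illustration under stated assumptions, not Dataplot's code: the function names, the seed, and the simple order-statistic quantile rule are all hypothetical choices. Only the maximum case is simulated; because the normal distribution is symmetric, the minimum case has the same reference distribution.

```python
import random


def dixon_max_statistic(sample):
    """Dixon ratio for the largest value, using the sample-size rules above."""
    s = sorted(sample)
    n = len(s)
    j, i = (1, 0) if n <= 7 else (1, 1) if n <= 10 else (2, 1) if n <= 13 else (2, 2)
    return (s[-1] - s[-1 - j]) / (s[-1] - s[i])


def simulated_critical_value(n, level, n_sim=25000, seed=12345):
    """Empirical `level` quantile of the statistic for standard normal samples."""
    rng = random.Random(seed)
    stats = sorted(
        dixon_max_statistic([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(n_sim)
    )
    # Plain order-statistic quantile; the interpolation rule Dataplot uses
    # is not documented here, so this choice is an assumption.
    index = min(n_sim - 1, int(level * n_sim))
    return stats[index]


cv95 = simulated_critical_value(8, 0.95)
print(f"simulated 95% point for N = 8: {cv95:.3f}")
```

Because the values are simulated, they vary slightly from run to run and from seed to seed. For N = 8 the result should land near the 95 percent point listed in the sample output in the Program section.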

The null hypothesis of no outliers is rejected if the test statistic is greater than the critical value.

There are a number of variants of the Dixon test (e.g., it can be adapted to handle more than one outlier). Dataplot uses the formulation of the Dixon test given in the ASTM E 178 standard (which is taken from Dixon's Biometrics paper).

Dixon's test is generally limited to small samples. One reason is that it is quite sensitive to the number of outliers being tested for, which can be difficult to determine for larger samples. It also assumes that the underlying data distribution (with the exception of the outlier) is normal. For this reason, it is recommended that a Dixon test be preceded by a normal probability plot. The normal probability plot can be used to determine whether the assumption of normality and the presence of at most one outlier are in fact reasonable.

Syntax 1:
<MINIMUM/MAXIMUM> DIXON TEST <y>
<SUBSET/EXCEPT/FOR qualification>
where <MINIMUM/MAXIMUM> is an optional keyword that specifies whether the minimum or maximum value is tested as an outlier;
<y> is the response variable being tested;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

If neither MINIMUM nor MAXIMUM is given, both the minimum and maximum points will be tested (the more extreme of the two values will be used).

Syntax 2:
<MINIMUM/MAXIMUM> DIXON TEST <y> <labid>
<SUBSET/EXCEPT/FOR qualification>
where <MINIMUM/MAXIMUM> is an optional keyword that specifies whether the minimum or maximum value is tested as an outlier;
<y> is the response variable being tested;
<labid> is an id-variable;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

The <labid> variable is only used to identify the point being tested as an outlier. It does not affect the computations.

Syntax 3:
<MINIMUM/MAXIMUM> DIXON MULTIPLE TEST <y1> ... <yk>
<SUBSET/EXCEPT/FOR qualification>
where <MINIMUM/MAXIMUM> is an optional keyword that specifies whether the minimum or maximum value is tested as an outlier;
<y1> ... <yk> is a list of 1 to 30 response variables;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax performs a Dixon test on <y1> then on <y2> and so on.

Note that the syntax

DIXON MULTIPLE TEST Y1 TO Y4

is supported. This is equivalent to

DIXON MULTIPLE TEST Y1 Y2 Y3 Y4
Syntax 4:
<MINIMUM/MAXIMUM> DIXON REPLICATED TEST <y> <x1> ... <xk>
<SUBSET/EXCEPT/FOR qualification>
where <MINIMUM/MAXIMUM> is an optional keyword that specifies whether the minimum or maximum value is tested as an outlier;
<y> is the response variable;
<x1> ... <xk> is a list of 1 to 6 group-id variables;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.

This syntax performs a cross-tabulation of <x1> ... <xk> and performs a Dixon test for each unique combination of cross-tabulated values. For example, if X1 has 3 levels and X2 has 2 levels, a total of 6 Dixon tests will be performed.
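The cross-tabulation step can be pictured with a small Python sketch. The data below are hypothetical placeholders, and the grouping shown here only illustrates the idea; it is not how Dataplot implements the REPLICATED option.

```python
from collections import defaultdict

# Hypothetical group-id variables: X1 has 3 levels, X2 has 2 levels.
x1 = [1, 2, 3] * 6
x2 = [1] * 9 + [2] * 9
# Hypothetical response values, one per row.
y = [10.0 + 0.1 * k for k in range(18)]

# Collect the response values for each unique (X1, X2) combination.
groups = defaultdict(list)
for yi, a, b in zip(y, x1, x2):
    groups[(a, b)].append(yi)

for key in sorted(groups):
    print(key, len(groups[key]), "observations")
print("number of separate Dixon tests:", len(groups))
```

Each of the resulting groups would then be tested on its own, so every group needs to satisfy the sample size limits of the test.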

Note that the syntax

DIXON REPLICATED TEST Y X1 TO X4

is supported. This is equivalent to

DIXON REPLICATED TEST Y X1 X2 X3 X4

If either the first or last replication variable has all unique elements, this variable will be interpreted as a lab-id variable rather than a replication variable.

Examples:
DIXON TEST Y1
DIXON TEST Y1 LABID
DIXON MULTIPLE TEST Y1 Y2 Y3
DIXON REPLICATED TEST Y X1 X2
DIXON TEST Y1 SUBSET TAG > 2
DIXON MINIMUM TEST Y1
DIXON MAXIMUM TEST Y1
Note:
Masking and swamping are two issues that can affect outlier tests.

Masking can occur when we specify too few outliers in the test. For example, if we are testing for a single outlier when there are in fact two (or more) outliers, these additional outliers may influence the value of the test statistic enough so that no points are declared as outliers.

On the other hand, swamping can occur when we specify too many outliers in the test. For example, if we are testing for two outliers when there is in fact only a single outlier, both points may be declared outliers.

The possibility of masking and swamping is an important reason why it is useful to complement formal outlier tests with graphical methods. Graphics can often help identify cases where masking or swamping may be an issue.

Also, masking is one reason that trying to apply a single outlier test sequentially can fail. If there are multiple outliers, masking may cause the outlier test for the first outlier to return a conclusion of no outliers (and so the testing for any additional outliers is not done).
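Masking is easy to see numerically with the small-sample form of the statistic, (Y_N - Y_{N-1}) / (Y_N - Y_1) for 3 ≤ N ≤ 7. The following Python sketch uses two hypothetical data sets of size 7 and a hypothetical helper name; it is an illustration only.

```python
def dixon_max_small_sample(values):
    """Maximum-case Dixon ratio for 3 <= N <= 7: (Y_N - Y_{N-1}) / (Y_N - Y_1)."""
    s = sorted(values)
    if not 3 <= len(s) <= 7:
        raise ValueError("this form of the ratio applies only for 3 <= N <= 7")
    return (s[-1] - s[-2]) / (s[-1] - s[0])


# Hypothetical data: one high value versus two high values close together.
one_outlier = [10, 11, 12, 13, 14, 15, 31]
two_outliers = [10, 11, 12, 13, 14, 30, 31]

print(round(dixon_max_small_sample(one_outlier), 3))   # 16/21
print(round(dixon_max_small_sample(two_outliers), 3))  # 1/21
```

With a single high value the ratio is large, but when a second high value sits next to it the numerator gap collapses and the ratio becomes very small, so the single-outlier test would not flag either point.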

The Dixon and Grubbs tests are used to check for a single outlier. If there are in fact multiple outliers, the results of these tests can be distorted.

If multiple outliers are suspected, then the Tietjen-Moore or the generalized extreme studentized deviate tests may be preferred. The Tietjen-Moore test is a generalization of the Grubbs test for the case where multiple outliers may be present. The Tietjen-Moore test requires that the number of suspected outliers be specified exactly, while the generalized extreme studentized deviate test only requires that an upper bound on the suspected number of outliers be specified.

Note:
Tests for outliers depend on knowing the distribution of the data. The Dixon test assumes that the data come from an approximately normal distribution. For this reason, it is strongly recommended that the Dixon test be complemented with a normal probability plot. If the data are not approximately normally distributed, then the Dixon test may be detecting the non-normality of the data rather than the presence of an outlier.

If you perform a formal goodness of fit test for assessing normality, it is recommended you omit the potential outlier from the test (i.e., we want to distinguish between an outlier and non-normality and the potential outlier may distort the normality test).

Note:
You can specify the number of digits in the Dixon output with the command

SET WRITE DECIMALS <value>
Note:
The DIXON TEST command automatically saves the following parameters:

 STATVAL   = the value of the test statistic
 CUTOFF0   = the 0 percent point of the reference distribution
 CUTOFF50  = the 50 percent point of the reference distribution
 CUTOFF75  = the 75 percent point of the reference distribution
 CUTOFF90  = the 90 percent point of the reference distribution
 CUTOFF95  = the 95 percent point of the reference distribution
 CUTOFF975 = the 97.5 percent point of the reference distribution
 CUTOFF99  = the 99 percent point of the reference distribution

If the MULTIPLE or REPLICATED option is used, these values will be written to the file "dpst1f.dat" instead.

Note:
In addition to the DIXON TEST command, the following command can also be used:

LET A = DIXON TEST Y

In addition to the above LET command, built-in statistics are supported for about 17 different commands (enter HELP STATISTICS for details).

Default:
None
Synonyms:
MULTIPLE DIXON TEST is a synonym for DIXON MULTIPLE TEST
REPLICATED DIXON TEST is a synonym for DIXON REPLICATED TEST
Related Commands:
 GRUBBS TEST = Perform a Grubbs outlier test.
 TIETJEN-MOORE TEST = Perform a Tietjen-Moore outlier test.
 EXTREME STUDENTIZED DEVIATE TEST = Perform an extreme studentized deviate outlier test.
 GOODNESS OF FIT TEST = Perform a goodness of fit test (Anderson-Darling, Kolmogorov-Smirnov, chi-square, PPCC).
 WILKS SHAPIRO NORMALITY TEST = Perform a Wilks Shapiro normality test.
 HISTOGRAM = Generate a histogram.
 PROBABILITY PLOT = Generate a probability plot.
 BOX PLOT = Generate a box plot.
Reference:
Dixon (1953), "Processing Data for Outliers," Biometrics, Vol. 9, No. 1, pp. 74-89.

Dixon and Massey (1957), "Introduction to Statistical Analysis," Second Edition, McGraw-Hill, pp. 275-278.

ASTM E 178 - 08, "Standard Practice for Dealing with Outlying Observations," ASTM International, 100 Barr Harbor Drive, PO BOX C700, West Conshohocken, PA 19428-2959, USA.

Iglewicz and Hoaglin (1993), "Volume 16: How To Detect and Handle Outliers," The ASQC Basic Reference in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.

Applications:
Outlier Detection
Implementation Date:
2011/09
Program:

.  Following uses example 1 from ASTM E 178 - 08 standard.
.
.  Response variable is breaking strength (in pounds) of
.  0.104-in hard-drawn copper wire.
.
let y = data 568 570 570 570 572 578 584 596
.
let a = dixon maximum test y
.
set write decimals 5
dixon maximum test y

The following output is generated.
            Dixon Test for a Single Outlier: Maximum Case
(Assumption: Normality)

Response Variable: Y

H0: There are no outliers
Ha: The maximum point is an outlier

Summary Statistics:
Number of Observations:                               8
Sample Minimum:                               568.00000
ID for Sample Minimum:                                0
Sample Maximum:                               596.00000
ID for Sample Maximum:                                0
Sample Mean:                                  576.00000
Sample SD:                                      9.68061
Sample Range:                                  28.00000

Dixon Test Statistic Value:                     0.46153
CDF Value:                                      0.88704
P-Value                                         0.11295

Percent Points of the Reference Distribution
-----------------------------------
Percent Point               Value
-----------------------------------
0.0    =          0.000
25.0    =          0.101
50.0    =          0.210
75.0    =          0.349
80.0    =          0.384
90.0    =          0.478
95.0    =          0.552
97.5    =          0.615
99.0    =          0.684
99.5    =          0.724
100.0    =          0.904

Conclusions (Upper 1-Tailed Test)
----------------------------------------------
Alpha    CDF   Critical Value     Conclusion
----------------------------------------------
10%    90%            0.478      Accept H0
5%    95%            0.552      Accept H0
2.5%   97.5%           0.615      Accept H0
1%    99%            0.684      Accept H0

*Critical Values Based on    25000 Monte Carlo Simulations
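The conclusions in this output can be re-derived by hand. The Python sketch below is only a check on the arithmetic: the critical values and the CDF value are copied from the output above rather than recomputed, and the variable names are hypothetical.

```python
# Copper-wire data from the program above (N = 8).
y = [568, 570, 570, 570, 572, 578, 584, 596]
s = sorted(y)

# N = 8 falls in the 8 <= N <= 10 row: (Y_N - Y_{N-1}) / (Y_N - Y_2).
statistic = (s[-1] - s[-2]) / (s[-1] - s[1])

# Critical values copied from the "Conclusions" table in the output.
critical = {0.10: 0.478, 0.05: 0.552, 0.025: 0.615, 0.01: 0.684}
for alpha, cv in critical.items():
    decision = "reject H0" if statistic > cv else "do not reject H0"
    print(f"alpha = {alpha}: {statistic:.5f} vs {cv:.3f} -> {decision}")

# The p-value of the upper-tailed test is 1 minus the reported CDF value.
cdf_value = 0.88704
print(f"p-value = 1 - CDF = {1 - cdf_value:.5f}")
```

The statistic is 12/26, which is below every listed critical value, so the maximum of 596 is not declared an outlier at any of the four significance levels. The recomputed values may differ from the printed output in the last digit because of rounding.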


NIST is an agency of the U.S. Commerce Department.

Date created: 09/22/2011
Last updated: 10/13/2015