
# BINARY MATCH DISSIMILARITY

Name:
BINARY MATCH DISSIMILARITY (LET)
BINARY MATCH SIMILARITY (LET)
BINARY ROGERS MATCH DISSIMILARITY (LET)
BINARY ROGERS MATCH SIMILARITY (LET)
BINARY SOKAL MATCH DISSIMILARITY (LET)
BINARY SOKAL MATCH SIMILARITY (LET)
BINARY JACCARD DISSIMILARITY (LET)
BINARY JACCARD SIMILARITY (LET)
BINARY ASYMMETRIC SOKAL MATCH DISSIMILARITY (LET)
BINARY ASYMMETRIC SOKAL MATCH SIMILARITY (LET)
BINARY ASYMMETRIC DICE MATCH DISSIMILARITY (LET)
BINARY ASYMMETRIC DICE MATCH SIMILARITY (LET)
YULES Q (LET)
YULES Y (LET)
YOUDEN INDEX (LET)
Type:
Let Subcommand
Purpose:
Given two binary (i.e., 0 or 1 values) response variables, compute various matching statistics that define either a similarity or dissimilarity score.
Description:
Given two variables with n paired observations, where each variable has exactly two possible outcomes, we can generate the following 2x2 table:

| Variable 1 \ Variable 2 | Not Present | Present | Row Total |
|---|---|---|---|
| Not Present | A | B | A + B |
| Present | C | D | C + D |
| Column Total | A + C | B + D | A + B + C + D |

In the data, we use a value of "0" to denote "not present" and a value of "1" to denote "present".

The parameters A, B, C, and D denote the counts for each category. The various matching statistics combine A, B, C, and D in different ways. A distinction is made between "symmetric" and "asymmetric" matching statistics. Symmetric statistics are typically preferred when the "0" and the "1" outcomes are considered equally meaningful. Asymmetric statistics are preferred when the "1" outcome is more meaningful. For example, asymmetric scores are recommended when matching on the presence of a rare event is what matters.
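The tallying of A, B, C, and D can be sketched in Python as follows. This is only an illustration of the table convention on this page (rows index variable 1, columns index variable 2), not Dataplot's own code; the function name `binary_counts` is hypothetical, and the sample vectors are the first two columns (X1 and X2) of the data in the Program section.

```python
from typing import Sequence, Tuple


def binary_counts(y1: Sequence[int], y2: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (A, B, C, D) for two equal-length 0/1 sequences.

    Following the 2x2 table on this page:
    A = pairs (0, 0), B = pairs (0, 1), C = pairs (1, 0), D = pairs (1, 1),
    where the first element of each pair comes from variable 1.
    """
    if len(y1) != len(y2):
        raise ValueError("y1 and y2 must have the same number of elements")
    a = b = c = d = 0
    for v1, v2 in zip(y1, y2):
        if v1 not in (0, 1) or v2 not in (0, 1):
            raise ValueError("values must be 0 or 1")
        if v1 == 0 and v2 == 0:
            a += 1
        elif v1 == 0 and v2 == 1:
            b += 1
        elif v1 == 1 and v2 == 0:
            c += 1
        else:
            d += 1
    return a, b, c, d


# X1 and X2 from the Program section (8 observations each).
x1 = [1, 0, 0, 0, 1, 1, 0, 0]
x2 = [0, 1, 0, 1, 1, 1, 0, 0]
print(binary_counts(x1, x2))  # (3, 2, 1, 2)
```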

Specifically

Symmetric Binary Variables
Similarity:
Matching Coefficient: $$\frac{A + D}{A + B + C + D}$$
Rogers and Tanimoto: $$\frac{A + D}{(A + D) + 2(B + C)}$$
Sokal and Sneath: $$\frac{2(A + D)}{2(A + D) + (B + C)}$$

Dissimilarity:
Matching Coefficient: $$\frac{B + C}{A + B + C + D}$$
Rogers and Tanimoto: $$\frac{2(B + C)}{(A + D) + 2(B + C)}$$
Sokal and Sneath: $$\frac{B + C}{2(A + D) + (B + C)}$$
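As a rough illustration of the symmetric coefficients, the Python sketch below computes each similarity from the counts and takes the dissimilarity as one minus the similarity, which is what the paired formulas amount to when written over a common denominator. The function name `symmetric_scores` is hypothetical, and the counts (A, B, C, D) = (3, 2, 1, 2) are those of X1 and X2 in the Program section.

```python
def symmetric_scores(a: int, b: int, c: int, d: int) -> dict:
    """Return {name: (similarity, dissimilarity)} for the symmetric coefficients.

    Assumes at least one observation, so the denominators are nonzero.
    """
    match = a + d      # pairs where both variables agree
    mismatch = b + c   # pairs where the variables disagree
    similarities = {
        "matching": match / (match + mismatch),
        "rogers_tanimoto": match / (match + 2 * mismatch),
        "sokal_sneath": 2 * match / (2 * match + mismatch),
    }
    # Each dissimilarity is the complement of its similarity.
    return {name: (s, 1.0 - s) for name, s in similarities.items()}


scores = symmetric_scores(3, 2, 1, 2)
for name, (sim, dis) in scores.items():
    print(f"{name}: similarity={sim:.3f} dissimilarity={dis:.3f}")
```

For these counts the matching dissimilarity is 3/8 = 0.375, which agrees with the X1/X2 entry of the first matrix in the Program output.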

Asymmetric Binary Variables (most important value coded as 1)

Similarity:
Jaccard Coefficient: $$\frac{A}{A + B + C}$$
Dice Coefficient: $$\frac{2A}{2A + B + C}$$
Sokal Coefficient: $$\frac{A}{A + 2(B + C)}$$

Dissimilarity:
Jaccard Coefficient: $$\frac{B + C}{A + B + C}$$
Dice Coefficient: $$\frac{B + C}{2A + B + C}$$
Sokal Coefficient: $$\frac{2(B + C)}{A + 2(B + C)}$$
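The asymmetric coefficients can be sketched the same way. The Python below transcribes the formulas exactly as written on this page, with A taken from the 2x2 table above; the function name `asymmetric_scores` is hypothetical, and the counts (A, B, C, D) = (3, 2, 1, 2) are those of X1 and X2 in the Program section. Note that D does not enter any of these formulas.

```python
def asymmetric_scores(a: int, b: int, c: int) -> dict:
    """Return {name: (similarity, dissimilarity)} for the asymmetric coefficients.

    Uses the formulas as printed on this page. Assumes a + b + c > 0.
    """
    mismatch = b + c
    return {
        "jaccard": (a / (a + mismatch), mismatch / (a + mismatch)),
        "dice": (2 * a / (2 * a + mismatch), mismatch / (2 * a + mismatch)),
        "sokal": (a / (a + 2 * mismatch), 2 * mismatch / (a + 2 * mismatch)),
    }


asym = asymmetric_scores(3, 2, 1)
for name, (sim, dis) in asym.items():
    print(f"{name}: similarity={sim:.3f} dissimilarity={dis:.3f}")
```

For these counts the Jaccard dissimilarity is 3/6 = 0.500, which agrees with the X1/X2 entry of the second matrix in the Program output.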

Three related statistics are

Yule's Q: $$\frac{A*D - B*C}{A*D + B*C}$$
Yule's Y: $$\frac{\sqrt{A*D} - \sqrt{B*C}}{\sqrt{A*D} + \sqrt{B*C}}$$
Youden index: $$\frac{A*D - B*C}{(A+B)(C+D)}$$

These statistics are often used to create dissimilarity or similarity matrices that will be used as input to various multivariate procedures such as clustering.

The above statistics were taken from Kaufman and Rousseeuw (see Reference below). They recommend using the matching coefficient for the symmetric case and the Jaccard coefficient for the asymmetric case. However, the above list is not exhaustive and other authors recommend other choices. Also, other sources may have somewhat different formulas for these statistics.

The Youden index (also known as Youden's J statistic) can also be expressed as "sensitivity + specificity - 1". It has a value from 0 (a test gives the same proportion of positive results for groups with and without the disease, i.e., the test has no value) to 1 (there are no false positives and no false negatives).
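The equivalence with the count formula can be checked directly. The short derivation below assumes that variable 1 records the true status and variable 2 records the test result, so that sensitivity is D/(C + D) and specificity is A/(A + B); that reading of the table is an assumption made here for illustration.

$$\mbox{sensitivity} + \mbox{specificity} - 1 = \frac{D}{C + D} + \frac{A}{A + B} - 1$$

$$= \frac{D(A + B) + A(C + D) - (A + B)(C + D)}{(A + B)(C + D)}$$

$$= \frac{(A*D + B*D) + (A*C + A*D) - (A*C + A*D + B*C + B*D)}{(A + B)(C + D)} = \frac{A*D - B*C}{(A + B)(C + D)}$$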

Yule's Q can take a value from -1 to +1 where -1 indicates total negative correlation, 0 indicates no association, and +1 indicates total positive correlation. Yule's Q is related to the odds ratio in the following way

$$Q = \frac{\mbox{Odds ratio} - 1} {\mbox{Odds ratio} + 1}$$

Yule's Y can be defined in terms of Yule's Q as

$$Y = \frac{1 - \sqrt{1 - Q^2}}{Q}$$

or in terms of the odds ratio

$$Y = \frac{\sqrt{\mbox{Odds ratio}} - 1} {\sqrt{\mbox{Odds ratio}} + 1}$$

Yule's Y is also known as the coefficient of colligation.
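The relationships among Yule's Q, Yule's Y, and the odds ratio can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, the odds ratio is taken as the usual cross-product ratio AD/(BC) (an assumption, since this page does not spell it out), and the counts (3, 2, 1, 2) are those of X1 and X2 in the Program section.

```python
import math


def yules_q(a: int, b: int, c: int, d: int) -> float:
    """Yule's Q from the 2x2 counts (requires a*d + b*c > 0)."""
    return (a * d - b * c) / (a * d + b * c)


def yules_y(a: int, b: int, c: int, d: int) -> float:
    """Yule's Y (coefficient of colligation) from the 2x2 counts."""
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


a, b, c, d = 3, 2, 1, 2
odds_ratio = (a * d) / (b * c)  # assumed definition; needs b*c != 0

q = yules_q(a, b, c, d)
y = yules_y(a, b, c, d)

# The same quantities via the odds ratio and via Q.
q_from_or = (odds_ratio - 1) / (odds_ratio + 1)
y_from_or = (math.sqrt(odds_ratio) - 1) / (math.sqrt(odds_ratio) + 1)
y_from_q = (1 - math.sqrt(1 - q**2)) / q  # not usable when q == 0

print(q, q_from_or)
print(y, y_from_or, y_from_q)
```

The expression for Y in terms of Q divides by Q, so it cannot be evaluated when Q is exactly 0; the count-based formula returns Y = 0 in that case.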

Syntax 1:
LET <par> = BINARY MATCH DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed matching dissimilarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 2:
LET <par> = BINARY MATCH SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed matching similarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 3:
LET <par> = BINARY ROGERS MATCH DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Rogers and Tanimoto matching dissimilarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 4:
LET <par> = BINARY ROGERS MATCH SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Rogers and Tanimoto matching similarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 5:
LET <par> = BINARY SOKAL MATCH DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Sokal and Sneath matching dissimilarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 6:
LET <par> = BINARY SOKAL MATCH SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Sokal and Sneath matching similarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 7:
LET <par> = BINARY JACCARD DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Jaccard dissimilarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 8:
LET <par> = BINARY JACCARD SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Jaccard similarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 9:
LET <par> = BINARY ASYMMETRIC SOKAL MATCH DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Sokal asymmetric matching dissimilarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 10:
LET <par> = BINARY ASYMMETRIC SOKAL MATCH SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Sokal asymmetric matching similarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 11:
LET <par> = BINARY ASYMMETRIC DICE MATCH DISSIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Dice asymmetric matching dissimilarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 12:
LET <par> = BINARY ASYMMETRIC DICE MATCH SIMILARITY <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Dice asymmetric matching similarity coefficient is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 13:
LET <par> = YOUDEN INDEX <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Youden index is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 14:
LET <par> = YULES Q <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Yule's Q is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Syntax 15:
LET <par> = YULES Y <y1> <y2>
<SUBSET/EXCEPT/FOR qualification>
where <y1> is the first response variable;
<y2> is the second response variable;
<par> is a parameter where the computed Yule's Y is stored;
and where the <SUBSET/EXCEPT/FOR qualification> is optional.
Examples:
LET A = BINARY MATCH DISSIMILARITY Y1 Y2
LET A = BINARY MATCH DISSIMILARITY Y1 Y2 ...
SUBSET Y1 0 1 SUBSET Y2 0 1
LET A = BINARY MATCH SIMILARITY Y1 Y2
LET A = BINARY ROGERS MATCH DISSIMILARITY Y1 Y2
LET A = BINARY ROGERS MATCH SIMILARITY Y1 Y2
LET A = BINARY SOKAL MATCH DISSIMILARITY Y1 Y2
LET A = BINARY SOKAL MATCH SIMILARITY Y1 Y2
LET A = BINARY JACCARD DISSIMILARITY Y1 Y2
LET A = BINARY JACCARD SIMILARITY Y1 Y2
LET A = BINARY ASYMMETRIC SOKAL MATCH DISSIMILARITY Y1 Y2
LET A = BINARY ASYMMETRIC SOKAL MATCH SIMILARITY Y1 Y2
LET A = BINARY ASYMMETRIC DICE MATCH DISSIMILARITY Y1 Y2
LET A = BINARY ASYMMETRIC DICE MATCH SIMILARITY Y1 Y2
LET A = YOUDEN INDEX Y1 Y2
LET A = YULES Q Y1 Y2
Note:
The two response variables must have the same number of elements. For raw data, the response variables should only contain the values 0 and 1. See the next Note for a discussion of how to enter the A, B, C, and D values directly.
Note:
There are two ways you can define the response variables:
1. Raw data - in this case, the variables contain 0's and 1's.

If the data is not coded as 0's and 1's, Dataplot will check for the number of distinct values. If there are two distinct values, the minimum value is converted to 0's and the maximum value is converted to 1's. If there is a single distinct value, it is converted to 0's if it is less than 0.5 and to 1's if it is greater than or equal to 0.5. If there are more than two distinct values, an error is returned.

2. Summary data - if there are two observations, the data is assumed to be the 2x2 summary table. That is,

Y1(1) = A
Y1(2) = C
Y2(1) = B
Y2(2) = D
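The two input conventions described in this Note can be sketched in Python as follows. This is a minimal illustration of the stated rules, not Dataplot's implementation; the function names `recode_binary` and `summary_counts` are hypothetical.

```python
from typing import List, Sequence, Tuple


def recode_binary(values: Sequence[float]) -> List[int]:
    """Recode raw data to 0/1 following the rule described in the Note."""
    if not values:
        raise ValueError("no observations")
    distinct = sorted(set(values))
    if len(distinct) > 2:
        raise ValueError("more than two distinct values")
    if len(distinct) == 2:
        low = distinct[0]
        # Minimum value becomes 0, maximum value becomes 1.
        return [0 if v == low else 1 for v in values]
    # Single distinct value: 0 if below 0.5, otherwise 1.
    code = 0 if distinct[0] < 0.5 else 1
    return [code] * len(values)


def summary_counts(y1: Sequence[int], y2: Sequence[int]) -> Tuple[int, int, int, int]:
    """Read (A, B, C, D) from two-element summary variables.

    Y1(1) = A, Y1(2) = C, Y2(1) = B, Y2(2) = D.
    """
    if len(y1) != 2 or len(y2) != 2:
        raise ValueError("summary data needs exactly two observations per variable")
    a, c = y1
    b, d = y2
    return a, b, c, d


print(recode_binary([2, 5, 5, 2]))       # [0, 1, 1, 0]
print(summary_counts([3, 1], [2, 2]))    # (3, 2, 1, 2)
```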
Note:
Dataplot statistics can be used in a number of commands. For details, enter

Default:
None
Synonyms:
None
Related Commands:
PEARSON DISSIMILARITY = Compute the dissimilarity of two variables based on Pearson correlation.
SPEARMAN DISSIMILARITY = Compute the dissimilarity of two variables based on Spearman's rank correlation.
KENDALL TAU DISSIMILARITY = Compute the dissimilarity of two variables based on Kendall's tau correlation.
COSINE DISTANCE = Compute the cosine distance.
MANHATTAN DISTANCE = Compute the Manhattan distance.
EUCLIDEAN DISTANCE = Compute the Euclidean distance.
MATRIX DISTANCE = Compute various distance metrics for a matrix.
GENERATE MATRIX = Compute a matrix of pairwise statistic values.
CLUSTER = Perform a cluster analysis.
Reference:
Kaufman and Rousseeuw (1990), "Finding Groups in Data: An Introduction To Cluster Analysis," Wiley.

Youden, W.J. (1950), "Index for Rating Diagnostic Tests," Cancer, Vol. 3, No. 1, pp. 32–35.

Yule, G. Udny (1912), "On the Methods of Measuring Association Between Two Attributes," Journal of the Royal Statistical Society, Vol. 75, No. 6, pp. 579–652.

Applications:
Clustering, Multivariate Analysis
Implementation Date:
2017/08
2019/01: Support for Youden index
2019/08: Support for Yule's Y
Program:

.  Example from page 24 of Kaufman and Rousseeuw text.
.  The rows are 8 people and the columns are 10 binary variables
.
set write decimals 3
dimension 100 columns
.
read x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
1   0  1  1  0  0  1  0  0  0
0   1  0  0  1  0  0  0  0  0
0   0  1  0  0  0  1  0  0  1
0   1  0  0  0  0  0  1  1  0
1   1  0  0  1  1  0  1  1  0
1   1  0  0  1  0  1  1  0  0
0   0  0  1  0  1  0  0  0  0
0   0  0  1  0  1  0  0  0  0
end of data
.
let d = generate matrix binary match dissimilarity ...
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
print d1 d2 d3 d4 d5
print d6 d7 d8 d9 d10
.
let ad = generate matrix binary jaccard dissimilarity ...
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10

The following output is generated
---------------------------------------------------------------------------
D1             D2             D3             D4             D5
---------------------------------------------------------------------------
0.000          0.375          0.375          0.500          0.250
0.375          0.000          0.750          0.875          0.125
0.375          0.750          0.000          0.375          0.625
0.500          0.875          0.375          0.000          0.750
0.250          0.125          0.625          0.750          0.000
0.500          0.625          0.625          0.250          0.500
0.250          0.625          0.125          0.500          0.500
0.250          0.125          0.625          0.750          0.250
0.375          0.250          0.500          0.625          0.375
0.500          0.625          0.125          0.500          0.500

---------------------------------------------------------------------------
D6             D7             D8             D9            D10
---------------------------------------------------------------------------
0.500          0.250          0.250          0.375          0.500
0.625          0.625          0.125          0.250          0.625
0.625          0.125          0.625          0.500          0.125
0.250          0.500          0.750          0.625          0.500
0.500          0.500          0.250          0.375          0.500
0.000          0.750          0.500          0.375          0.500
0.750          0.000          0.500          0.625          0.250
0.500          0.500          0.000          0.125          0.500
0.375          0.625          0.125          0.000          0.375
0.500          0.250          0.500          0.375          0.000

---------------------------------------------------------------------------
---------------------------------------------------------------------------
0.000          0.500          0.429          0.571          0.333
0.500          0.000          0.750          0.875          0.200
0.429          0.750          0.000          0.429          0.625
0.571          0.875          0.429          0.000          0.750
0.333          0.200          0.625          0.750          0.000
0.571          0.714          0.625          0.333          0.571
0.333          0.714          0.167          0.571          0.571
0.333          0.200          0.625          0.750          0.333
0.429          0.333          0.500          0.625          0.429
0.500          0.625          0.143          0.500          0.500

---------------------------------------------------------------------------
---------------------------------------------------------------------------
0.571          0.333          0.333          0.429          0.500
0.714          0.714          0.200          0.333          0.625
0.625          0.167          0.625          0.500          0.143
0.333          0.571          0.750          0.625          0.500
0.571          0.571          0.333          0.429          0.500
0.000          0.750          0.571          0.429          0.500
0.750          0.000          0.571          0.625          0.286
0.571          0.571          0.000          0.167          0.500
0.429          0.625          0.167          0.000          0.375
0.500          0.286          0.500          0.375          0.000
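As an independent cross-check of the output above, the Python sketch below recomputes the pairwise dissimilarities between the 10 columns of the example data, using the matching and Jaccard formulas as written on this page. It is not Dataplot code, and the helper names are hypothetical. The first row of each matrix (X1 against X1 through X10) rounds to the values listed under the first column of each printed matrix.

```python
# Example data from the Program section: 8 rows (people), 10 binary variables.
rows = [
    [1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
]
columns = [list(col) for col in zip(*rows)]  # X1 ... X10


def counts(y1, y2):
    """(A, B, C, D) with A = (0,0), B = (0,1), C = (1,0), D = (1,1) pairs."""
    pairs = list(zip(y1, y2))
    return (pairs.count((0, 0)), pairs.count((0, 1)),
            pairs.count((1, 0)), pairs.count((1, 1)))


def match_dissimilarity(y1, y2):
    a, b, c, d = counts(y1, y2)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(y1, y2):
    # Formula as printed on this page; undefined if a + b + c == 0.
    a, b, c, _ = counts(y1, y2)
    return (b + c) / (a + b + c)


def pairwise(stat):
    """Square matrix of stat(Xi, Xj) over all pairs of columns."""
    return [[stat(ci, cj) for cj in columns] for ci in columns]


match = pairwise(match_dissimilarity)
jaccard = pairwise(jaccard_dissimilarity)

print([round(v, 3) for v in match[0]])
print([round(v, 3) for v in jaccard[0]])
```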



NIST is an agency of the U.S. Commerce Department.

Date created: 09/20/2017
Last updated: 08/30/2019