The DISTANCE Procedure

Proximity Measures

Subsections:

Methods That Accept All Measurement Levels
Methods That Accept Ratio, Interval, and Ordinal Variables
Methods That Accept Ratio Variables
Methods That Accept Symmetric Nominal Variables
Methods That Accept Asymmetric Nominal and Ratio Variables
Methods That Accept Asymmetric Nominal Variables

The following notation is used in this section:

v

the number of variables or the dimensionality

$x_ j$

data for observation x and the jth variable, where $j = 1 \mr{\ to\ } v$

$y_ j$

data for observation y and the jth variable, where $j = 1 \mr{\ to\ } v$

$w_ j$

weight for the jth variable from the WEIGHTS= option in the VAR statement. $w_ j = 0$ when either $x_ j$ or $y_ j$ is missing.

W

the sum of all weights. The weight of a variable is included in this sum whether or not the value of that variable is missing.

$\bar{x}$

mean for observation x

$\bar{x} = \sum _{j=1}^{v}w_ j{x_ j}/\sum _{j=1}^{v}w_ j$

$\bar{y}$

mean for observation y

$\bar{y} = \sum _{j=1}^{v}w_ j{y_ j}/\sum _{j=1}^{v}w_ j$

$d(x,y)$

the distance or dissimilarity between observations x and y

$s(x,y)$

the similarity between observations x and y

The factor $W / \sum _{j=1}^{v}w_ j$ is used to adjust some of the proximity measures for missing values.
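The way $w_ j$, W, and the adjustment factor interact can be illustrated with a short Python sketch. It is not part of the procedure: the function names and the use of None to mark a missing value are assumptions made only for this illustration.

```python
import math

def effective_weights(x, y, weights):
    """Set w_j to 0 wherever x_j or y_j is missing (None is a stand-in for missing)."""
    return [0.0 if (xj is None or yj is None) else float(wj)
            for xj, yj, wj in zip(x, y, weights)]

def adjustment_factor(x, y, weights):
    """Return W / sum_j w_j, where W counts every weight, missing or not."""
    W = float(sum(weights))
    w_sum = sum(effective_weights(x, y, weights))
    return W / w_sum if w_sum > 0 else math.nan

# Hypothetical data: the second variable is missing for observation x.
x = [1.0, None, 3.0, 4.0]
y = [2.0, 5.0, 1.0, 4.0]
weights = [1.0, 1.0, 1.0, 1.0]

w_eff = effective_weights(x, y, weights)
factor = adjustment_factor(x, y, weights)
print(w_eff, factor)
```

With one of four equally weighted variables missing, the factor is 4/3, so a sum accumulated over the three nonmissing variables is scaled up as if all four had contributed.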

Methods That Accept All Measurement Levels

GOWER

Gower’s similarity $s_{1}(x,y) = {\sum _{j=1}^{v}w_ j\delta _{x,y}^{j}d_{x,y}^ j}/ {\sum _{j=1}^{v}w_ j\delta _{x,y}^ j}$

$\delta _{x,y}^ j$ is computed as follows:

For a nominal, ordinal, interval, or ratio variable,

$\begin{eqnarray*} \delta _{x,y}^ j & = & 1 \end{eqnarray*}$

For an asymmetric nominal variable,

$\begin{eqnarray*} \delta _{x,y}^ j & =& 1 \mr{,\ if\ either \ } x_ j \mr{\ or\ } y_ j \mr{\ is\ present}\\ \delta _{x,y}^ j & =& 0 \mr{,\ if\ both \ } x_ j \mr{\ and\ } y_ j \mr{\ are\ absent} \end{eqnarray*}$

For a nominal or asymmetric nominal variable,

$\begin{eqnarray*} d_{x,y}^ j & =& 1 \mr{,\ if \ } x_ j = y_ j \\ d_{x,y}^ j & =& 0 \mr{,\ if \ } x_ j \neq y_ j \end{eqnarray*}$

For an ordinal, interval, or ratio variable,

$\begin{eqnarray*} d_{x,y}^ j & =& 1-|x_ j-y_ j| \end{eqnarray*}$

DGOWER

1 minus Gower $d_{2}(x,y) = 1 - s_{1}(x,y)$
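The GOWER and DGOWER definitions can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's implementation: the level labels, the `absent` parameter, and the use of None for missing values are hypothetical, and the numeric variables are assumed to be already scaled so that $|x_ j - y_ j|$ lies between 0 and 1.

```python
import math

def gower_similarity(x, y, weights, levels, absent=0):
    """Weighted Gower similarity for mixed measurement levels.

    levels[j] is one of "nominal", "anominal", "ordinal", "interval", "ratio"
    (hypothetical labels). For "anominal" variables, a value equal to
    `absent` is treated as absence of the attribute.
    """
    num = 0.0
    den = 0.0
    for xj, yj, wj, level in zip(x, y, weights, levels):
        if xj is None or yj is None:
            continue  # w_j = 0 when either value is missing
        # delta is 0 only for an asymmetric nominal pair that is absent in both
        if level == "anominal" and xj == absent and yj == absent:
            delta = 0.0
        else:
            delta = 1.0
        if level in ("nominal", "anominal"):
            d = 1.0 if xj == yj else 0.0
        else:
            d = 1.0 - abs(xj - yj)  # assumes values scaled to a unit range
        num += wj * delta * d
        den += wj * delta
    return num / den if den > 0 else math.nan

# Hypothetical mixed-level observations.
levels = ["nominal", "anominal", "anominal", "interval", "ratio"]
x = ["red", 1, 0, 0.2, 0.5]
y = ["red", 0, 0, 0.6, 0.5]
weights = [1, 1, 1, 1, 1]

s_gower = gower_similarity(x, y, weights, levels)
d_gower = 1.0 - s_gower
print(s_gower, d_gower)
```

The third variable is absent in both observations, so it drops out of both the numerator and the denominator; the remaining four variables give a similarity of about 0.65 and a DGOWER value of about 0.35.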

Methods That Accept Ratio, Interval, and Ordinal Variables

EUCLID

Euclidean distance $d_{3}(x,y) = \sqrt { (\sum _{j=1}^ v{w_ j{(x_ j-y_ j)}^2}) W / (\sum _{j=1}^{v}w_ j)}$

SQEUCLID

squared Euclidean distance $d_{4}(x,y) = (\sum _{j=1}^ v{w_ j{(x_ j-y_ j)}^2}) W / (\sum _{j=1}^{v}w_ j)$

SIZE

size distance $d_{5}(x,y) = |\sum _{j=1}^ v{w_ j(x_ j-y_ j)}| \sqrt {W} / (\sum _{j=1}^{v}w_ j)$

SHAPE

shape distance $d_{6}(x,y) = \sqrt { (\sum _{j=1}^{v}w_ j[(x_ j-\bar{x})- (y_ j-\bar{y})]^2) W / (\sum _{j=1}^{v}w_ j) }$

Note: squared shape distance plus squared size distance equals squared Euclidean distance.
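The identity in the note can be checked numerically. The following Python sketch transcribes the SQEUCLID, SIZE, and SHAPE formulas directly; the function names, the sample data, and the use of None for a missing value are assumptions made for illustration.

```python
import math

def _nonmissing(x, y, weights):
    """Return nonmissing (x_j, y_j, w_j) triples, W, and the sum of effective weights."""
    W = float(sum(weights))
    pairs = [(a, b, w) for a, b, w in zip(x, y, weights)
             if a is not None and b is not None]
    w_sum = float(sum(w for _, _, w in pairs))
    return pairs, W, w_sum

def sqeuclid(x, y, weights):
    pairs, W, w_sum = _nonmissing(x, y, weights)
    return sum(w * (a - b) ** 2 for a, b, w in pairs) * W / w_sum

def size(x, y, weights):
    pairs, W, w_sum = _nonmissing(x, y, weights)
    return abs(sum(w * (a - b) for a, b, w in pairs)) * math.sqrt(W) / w_sum

def shape(x, y, weights):
    pairs, W, w_sum = _nonmissing(x, y, weights)
    xbar = sum(w * a for a, _, w in pairs) / w_sum
    ybar = sum(w * b for _, b, w in pairs) / w_sum
    ss = sum(w * ((a - xbar) - (b - ybar)) ** 2 for a, b, w in pairs)
    return math.sqrt(ss * W / w_sum)

# Hypothetical data with unequal weights and one missing value.
x = [1.0, 2.0, None, 4.0]
y = [3.0, 1.0, 5.0, 0.0]
weights = [1.0, 2.0, 1.0, 1.0]

d4 = sqeuclid(x, y, weights)
d5 = size(x, y, weights)
d6 = shape(x, y, weights)
print(d4, d5 ** 2 + d6 ** 2)
```

Both printed values agree up to rounding, which reflects the usual decomposition of a weighted sum of squares into a part due to the mean difference (size) and a part due to deviations from it (shape).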

COV

covariance similarity coefficient $s_{7}(x,y) = \sum _{j=1}^{v}w_ j(x_ j-\bar{x}) (y_ j-\bar{y})/\mi{vardiv}$ , where

$\begin{eqnarray*} \mi{vardiv }& =& v \mr{\ \ if \ VARDEF=N }\\ & =& v-1 \mr{\ \ if \ VARDEF=DF}\\ & =& \textstyle \sum _{j=1}^{v}w_ j \mr{\ \ if \ VARDEF=WEIGHT}\\ & =& \textstyle \sum _{j=1}^{v}w_ j-1 \mr{\ \ if \ VARDEF=WDF} \end{eqnarray*}$

CORR

correlation similarity coefficient $s_{8}(x,y) = \frac{\sum _{j=1}^{v}w_ j(x_ j-\bar{x}) (y_ j-\bar{y})}{\sqrt {\sum _{j=1}^{v}w_ j(x_ j-\bar{x})^2 \sum _{j=1}^{v}w_ j(y_ j-\bar{y})^2}}$

DCORR

correlation transformed to Euclidean distance as sqrt(1–CORR) $d_{9}(x,y) = \sqrt {1 - s_{8}(x,y)}$

SQCORR

squared correlation $s_{10}(x,y) = \frac{[\sum _{j=1}^{v}w_ j(x_ j-\bar{x}) (y_ j-\bar{y})]^2}{\sum _{j=1}^{v}w_ j(x_ j-\bar{x})^2 \sum _{j=1}^{v}w_ j(y_ j-\bar{y})^2}$

DSQCORR

squared correlation transformed to squared Euclidean distance as (1–SQCORR)

$d_{11}(x,y) = 1 - s_{10}(x,y)$
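The relationship among CORR, DCORR, SQCORR, and DSQCORR can be illustrated with a small Python sketch. The function name and data are hypothetical, None marks a missing value, and the sketch assumes that neither observation is constant across variables (otherwise the denominator is zero).

```python
import math

def corr(x, y, weights):
    """Weighted correlation between two observations across variables."""
    pairs = [(a, b, w) for a, b, w in zip(x, y, weights)
             if a is not None and b is not None]
    w_sum = sum(w for _, _, w in pairs)
    xbar = sum(w * a for a, _, w in pairs) / w_sum
    ybar = sum(w * b for _, b, w in pairs) / w_sum
    sxy = sum(w * (a - xbar) * (b - ybar) for a, b, w in pairs)
    sxx = sum(w * (a - xbar) ** 2 for a, _, w in pairs)
    syy = sum(w * (b - ybar) ** 2 for _, b, w in pairs)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical observations with perfectly opposite profiles.
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 3.0, 2.0, 1.0]
weights = [1.0, 1.0, 1.0, 1.0]

s8 = corr(x, y, weights)
d9 = math.sqrt(max(0.0, 1.0 - s8))   # DCORR; max() guards against rounding below zero
s10 = s8 ** 2                        # SQCORR
d11 = 1.0 - s10                      # DSQCORR
print(s8, d9, s10, d11)
```

For these two observations the correlation is -1, so DCORR takes its largest value, $\sqrt{2}$, while DSQCORR is 0: the squared versions treat a perfect negative relationship the same as a perfect positive one.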

L(p)

Minkowski ( $\mr{L}_ p$ ) distance, where p is a positive numeric value

$d_{12}(x,y) = [ (\sum _{j=1}^ v{w_ j{|x_ j-y_ j|}^ p}) W / (\sum _{j=1}^{v}w_ j) ]^{1/p}$

CITYBLOCK

$\mr{L}_1$ $d_{13}(x,y) = (\sum _{j=1}^ v{w_ j|x_ j-y_ j|}) W / (\sum _{j=1}^{v}w_ j)$

CHEBYCHEV

$\mr{L}_\infty$ $d_{14}(x,y) = \max _{j=1}^{v}w_ j|x_ j-y_ j|$

POWER(p,r)

generalized Euclidean distance, where p is a nonnegative numeric value and r is a positive numeric value. The distance between two observations is the rth root of the sum of the pth powers of the absolute differences between the values for the observations:

$d_{15}(x,y) = [ (\sum _{j=1}^ v{w_ j{|x_ j-y_ j| }^ p}) W / (\sum _{j=1}^{v}w_ j) ]^{1/r}$
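Comparing the formulas shows that several of the preceding methods are special cases of POWER(p,r): L(p) is POWER(p,p), CITYBLOCK is L(1), EUCLID is L(2), and SQEUCLID is POWER(2,1). The Python sketch below illustrates this; the function names and data are hypothetical, and None marks a missing value.

```python
def power_distance(x, y, weights, p, r):
    """POWER(p, r): [ (sum_j w_j |x_j - y_j|^p) * W / sum_j w_j ] ** (1/r)."""
    W = float(sum(weights))
    pairs = [(a, b, w) for a, b, w in zip(x, y, weights)
             if a is not None and b is not None]
    w_sum = float(sum(w for _, _, w in pairs))
    total = sum(w * abs(a - b) ** p for a, b, w in pairs)
    return (total * W / w_sum) ** (1.0 / r)

def minkowski(x, y, weights, p):
    """L(p) is the special case r = p."""
    return power_distance(x, y, weights, p, p)

def chebychev(x, y, weights):
    """Largest weighted absolute difference over the nonmissing variables."""
    return max(w * abs(a - b) for a, b, w in zip(x, y, weights)
               if a is not None and b is not None)

# Hypothetical data: a 3-4-5 right triangle.
x = [0.0, 0.0]
y = [3.0, 4.0]
weights = [1.0, 1.0]

cityblock = minkowski(x, y, weights, 1)
euclid = minkowski(x, y, weights, 2)
sq_euclid = power_distance(x, y, weights, 2, 1)
cheb = chebychev(x, y, weights)
print(cityblock, euclid, sq_euclid, cheb)
```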

Methods That Accept Ratio Variables

SIMRATIO

similarity ratio $s_{16}(x,y) = \frac{\sum _{j=1}^{v}w_ j(x_{j}y_ j)}{\sum _{j=1}^{v}w_ j(x_{j}y_ j)+\sum _{j=1}^{v}w_ j(x_ j-y_ j)^2}$

DISRATIO

one minus similarity ratio $d_{17}(x,y) = 1-s_{16}(x,y)$

NONMETRIC

Lance-Williams nonmetric coefficient $d_{18}(x,y) = \frac{\sum _{j=1}^{v}w_ j|x_ j-y_ j|}{\sum _{j=1}^{v}w_ j(x_ j+y_ j)}$

CANBERRA

Canberra metric coefficient. See Sneath and Sokal (1973, pp. 125–126) $d_{19}(x,y) = \sum _{j=1}^{v}{\frac{{w_ j|x_ j-y_ j|}}{w_ j(x_ j+y_ j)}}$

COSINE

cosine coefficient $s_{20}(x,y) = \frac{\sum _{j=1}^{v}w_ j(x_{j}y_ j)}{\sqrt {\sum _{j=1}^{v}w_ j{x_ j}^2\sum _{j=1}^{v}w_ j{y_ j}^2}}$

DOT

dot (inner) product coefficient $s_{21}(x,y) = [\sum _{j=1}^{v}w_ j(x_{j}y_ j)]/ \sum _{j=1}^{v}w_ j$

OVERLAP

sum of the minimum values $s_{22}(x,y) = \sum _{j=1}^{v}w_ j[\min (x_ j,y_ j)]$

DOVERLAP

maximum of the sum of the x and the sum of y minus overlap $d_{23}(x,y) = \max (\sum _{j=1}^{v}w_{j}x_ j,\sum _{j=1}^{v}w_{j}y_ j)-s_{22}(x,y)$
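A few of the ratio-variable measures above can be sketched in Python to show how they relate. The function names and data are hypothetical, and for brevity the sketch assumes that there are no missing values.

```python
import math

def simratio(x, y, weights):
    xy = sum(w * a * b for a, b, w in zip(x, y, weights))
    sq = sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights))
    return xy / (xy + sq)

def cosine(x, y, weights):
    xy = sum(w * a * b for a, b, w in zip(x, y, weights))
    xx = sum(w * a * a for a, w in zip(x, weights))
    yy = sum(w * b * b for b, w in zip(y, weights))
    return xy / math.sqrt(xx * yy)

def overlap(x, y, weights):
    return sum(w * min(a, b) for a, b, w in zip(x, y, weights))

def doverlap(x, y, weights):
    sum_x = sum(w * a for a, w in zip(x, weights))
    sum_y = sum(w * b for b, w in zip(y, weights))
    return max(sum_x, sum_y) - overlap(x, y, weights)

# Hypothetical nonnegative ratio data.
x = [1.0, 2.0, 3.0]
y = [2.0, 2.0, 1.0]
weights = [1.0, 1.0, 1.0]

s16 = simratio(x, y, weights)
d17 = 1.0 - s16                      # DISRATIO
s20 = cosine(x, y, weights)
s22 = overlap(x, y, weights)
d23 = doverlap(x, y, weights)
print(s16, d17, s20, s22, d23)
```

Here the weighted cross product is 9 and the weighted squared difference is 5, so SIMRATIO is 9/14; OVERLAP is 4 and DOVERLAP is 6 - 4 = 2.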

CHISQ

chi-square. If the data represent frequency counts, the chi-square dissimilarity between two sets of frequencies can be computed. The following 2-by-v contingency table illustrates how the chi-square dissimilarity is computed:

	Variable				Row
Observation	Var 1	Var 2	…	Var v	Sum
X	$x_1$	$x_2$	…	$x_ v$	$r_ x$
Y	$y_1$	$y_2$	…	$y_ v$	$r_ y$
Column Sum	$c_1$	$c_2$	…	$c_ v$	T

where

$\begin{eqnarray*} r_ x & =& \textstyle \sum _{j=1}^{v}w_ jx_ j \\ r_ y & =& \textstyle \sum _{j=1}^{v}w_ jy_ j\\ c_ j & =& w_ j(x_ j+y_ j)\\ T & =& r_ x+r_ y=\textstyle \sum _{j=1}^{v}c_ j \end{eqnarray*}$

The chi-square measure is computed as follows:

$d_{24}(x,y) = (\sum _{j=1}^ v{\frac{(w_ jx_ j-E(x_ j))^2}{E(x_ j)}}+ \sum _{j=1}^ v{\frac{(w_ jy_ j-E(y_ j))^2}{E(y_ j)}}) W / (\sum _{j=1}^{v}w_ j)$

where for j= 1, 2, …, v

$\begin{eqnarray*} E(x_ j) & =& r_ xc_ j/T \\ E(y_ j) & =& r_ yc_ j/T \end{eqnarray*}$

CHI

square root of chi-square $d_{25}(x,y) = \sqrt {d_{24}(x,y)}$

PHISQ

phi-square. This is the CHISQ dissimilarity normalized by the sum of weights: $d_{26}(x,y) = d_{24}(x,y) / (\sum _{j=1}^{v}w_ j)$

PHI

square root of phi-square $d_{27}(x,y) = \sqrt {d_{26}(x,y)}$
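The chi-square family can be worked through on a small frequency table. The Python sketch below follows the contingency-table definitions given for CHISQ and takes CHI and PHI as the square roots of CHISQ and PHISQ, as their descriptions state. The function name and counts are hypothetical, None marks a missing value, and every column sum $c_ j$ is assumed to be positive.

```python
import math

def chisq(x, y, weights):
    """Chi-square dissimilarity between two rows of frequency counts."""
    W = float(sum(weights))
    pairs = [(a, b, w) for a, b, w in zip(x, y, weights)
             if a is not None and b is not None]
    w_sum = float(sum(w for _, _, w in pairs))
    r_x = sum(w * a for a, _, w in pairs)      # row sum for x
    r_y = sum(w * b for _, b, w in pairs)      # row sum for y
    T = r_x + r_y                              # grand total
    total = 0.0
    for a, b, w in pairs:
        c = w * (a + b)                        # column sum, assumed > 0
        e_x = r_x * c / T                      # expected count for x_j
        e_y = r_y * c / T                      # expected count for y_j
        total += (w * a - e_x) ** 2 / e_x + (w * b - e_y) ** 2 / e_y
    return total * W / w_sum

# Hypothetical frequency counts for two observations.
x = [10.0, 20.0]
y = [20.0, 10.0]
weights = [1.0, 1.0]

d_chisq = chisq(x, y, weights)
d_chi = math.sqrt(d_chisq)
d_phisq = d_chisq / sum(weights)   # no missing values here, so sum of w_j = sum(weights)
d_phi = math.sqrt(d_phisq)
print(d_chisq, d_chi, d_phisq, d_phi)
```

Every expected count in this table is 15, each of the four cells contributes $25/15$, and CHISQ is therefore $100/15 \approx 6.67$.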

Methods That Accept Symmetric Nominal Variables

The following notation is used for computing $d_{28}(x,y)$ to $s_{35}(x,y)$ . Only nonmissing pairs are considered; every pair with at least one missing value is excluded from the computations in this section because $w_ j = 0 \mbox{, if either } x_ j \mbox{ or } y_ j \mbox{ is missing.}$

M

nonmissing matches

$M= \sum _{j=1}^{v}w_ j\delta _{x,y}^{j}$ , where

$\begin{eqnarray*} \delta _{x,y}^{j}& =& 1 \mr{,\ if \ } x_ j = y_ j \\ \delta _{x,y}^{j}& =& 0 \mr{,\ otherwise} \end{eqnarray*}$

X

nonmissing mismatches

$X= \sum _{j=1}^{v}w_ j\delta _{x,y}^{j}$ , where

$\begin{eqnarray*} \delta _{x,y}^{j}& =& 1 \mr{,\ if \ } x_ j \neq y_ j \\ \delta _{x,y}^{j}& =& 0 \mr{, \ otherwise} \end{eqnarray*}$

N

total nonmissing pairs

$N= \sum _{j=1}^{v}w_ j$

HAMMING: Hamming distance $d_{28}(x,y) = X$
MATCH: simple matching coefficient $s_{29}(x,y) = M/N$
DMATCH: simple matching coefficient transformed to Euclidean distance $d_{30}(x,y) = \sqrt {1-M/N} = \sqrt {(X/N)}$
DSQMATCH: simple matching coefficient transformed to squared Euclidean distance $d_{31}(x,y) = 1-M/N = X/N$
HAMANN: Hamann coefficient $s_{32}(x,y) = (M-X) / N$
RT: Rogers and Tanimoto $s_{33}(x,y) = M / (M+2X)$
SS1: Sokal and Sneath 1 $s_{34}(x,y) = 2M / (2M+X)$
SS3: Sokal and Sneath 3. The coefficient between an observation and itself is always indeterminate (missing) since there is no mismatch. $s_{35}(x,y) = M / X$
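The matching coefficients above all derive from the three counts M, X, and N. A minimal Python sketch, with hypothetical function names and data and with None standing in for a missing value, shows how they can be computed together:

```python
import math

def match_counts(x, y, weights):
    """Return (M, X, N): weighted matches, mismatches, and nonmissing pairs."""
    M = 0.0
    X = 0.0
    for a, b, w in zip(x, y, weights):
        if a is None or b is None:
            continue  # w_j = 0 for pairs with a missing value
        if a == b:
            M += w
        else:
            X += w
    return M, X, M + X

def symmetric_nominal_measures(x, y, weights):
    M, X, N = match_counts(x, y, weights)
    return {
        "HAMMING": X,
        "MATCH": M / N,
        "DMATCH": math.sqrt(X / N),
        "DSQMATCH": X / N,
        "HAMANN": (M - X) / N,
        "RT": M / (M + 2 * X),
        "SS1": 2 * M / (2 * M + X),
        # SS3 is indeterminate when there is no mismatch
        "SS3": M / X if X > 0 else math.nan,
    }

# Hypothetical nominal data; the last pair is missing for x.
x = ["a", "b", "c", "d", None]
y = ["a", "b", "x", "y", "e"]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]

measures = symmetric_nominal_measures(x, y, weights)
self_ss3 = symmetric_nominal_measures(y, y, weights)["SS3"]
print(measures, self_ss3)
```

With two matches and two mismatches among four nonmissing pairs, MATCH is 0.5 and HAMANN is 0. Comparing an observation with itself gives X = 0, so SS3 comes out as a missing value (NaN in this sketch).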

The following notation is used for computing $s_{36}(x,y)$ to $d_{41}(x,y)$ . Only nonmissing pairs are considered; every pair with at least one missing value is excluded from the computations in this section because $w_ j=0 \mr{, if\ either\ } x_ j \mr{\ or\ } y_ j \mr{\ is\ missing.}$

Also, the observed nonmissing data of an asymmetric binary variable can have only two possible outcomes: presence or absence. Therefore PX (present mismatches) is always zero for an asymmetric binary variable.

The following methods distinguish between the presence and absence of attributes.

X

mismatches with at least one present

$X= \sum _{j=1}^{v}w_ j\delta _{x,y}^{j}$ , where

$\begin{eqnarray*} \delta _{x,y}^{j}& =& 1 \mr{,\ if \ } x_ j \neq y_ j \mr{\ and\ not\ both \ } x_ j \mr{\ and \ } y_ j \mr{\ are\ absent} \\ \delta _{x,y}^{j}& =& 0 \mr{,\ otherwise} \end{eqnarray*}$

PM

present matches

$\mathit{PM}= \sum _{j=1}^{v}w_ j\delta _{x,y}^{j}$ , where

$\begin{eqnarray*} \delta _{x,y}^{j}& =& 1 \mr{,\ if \ } x_ j = y_ j \mr{\ and\ both \ } x_ j \mr{\ and \ } y_ j \mr{\ are\ present} \\ \delta _{x,y}^{j}& = & 0 \mr{,\ otherwise} \end{eqnarray*}$

PX

present mismatches

$\mathit{PX}= \sum _{j=1}^{v}w_ j\delta _{x,y}^{j}$ , where

$\begin{eqnarray*} \delta _{x,y}^{j}& =& 1 \mr{,\ if \ } x_ j \neq y_ j \mr{\ and\ both \ } x_ j \mr{\ and \ } y_ j \mr{\ are\ present} \\ \delta _{x,y}^{j}& = & 0 \mr{,\ otherwise} \end{eqnarray*}$

PP

both present = PM + PX

P

at least one present = PM + X

PAX

present-absent mismatches

$\mathit{PAX}= \sum _{j=1}^{v}w_ j\delta _{x,y}^{j}$ , where

$\begin{eqnarray*} \delta _{x,y}^{j}& =& 1 \mr{,\ if \ } x_ j \neq y_ j \mr{\ and\ either\ } x_ j \mr{\ is\ present\ and \ } y_ j \mr{\ is\ absent\ or}\\ & & x_ j \mr{\ is\ absent\ and \ } y_ j \mr{\ is\ present}\\ \delta _{x,y}^{j}& =& 0 \mr{\ otherwise} \end{eqnarray*}$

N

total nonmissing pairs

$N= \sum _{j=1}^{v}w_ j$

Methods That Accept Asymmetric Nominal and Ratio Variables

JACCARD

Jaccard similarity coefficient

The JACCARD method is equivalent to the SIMRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (SIMRATIO) and the coefficient from the asymmetric nominal variables.

$s_{36}(x,y) = s_{16}(x,y) + \mathit{PM} / P$

DJACCARD

Jaccard dissimilarity coefficient

The DJACCARD method is equivalent to the DISRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (DISRATIO) and the coefficient from the asymmetric nominal variables.

$d_{37}(x,y) = d_{17}(x,y) + X / P$

Methods That Accept Asymmetric Nominal Variables

DICE

Dice coefficient or Czekanowski/Sorensen similarity coefficient $s_{38}(x,y) = 2\mathit{PM} / (P + \mathit{PM})$

RR

Russell and Rao. This is the binary equivalent of the dot product coefficient. $s_{39}(x,y) = \mathit{PM} / N$

BLWNM | BRAYCURTIS

Binary Lance and Williams, also known as Bray and Curtis coefficient

$d_{40}(x,y) = X / (\mathit{PAX}+2\mathit{PP})$

K1

Kulczynski 1. The coefficient between an observation and itself is always indeterminate (missing) since there is no mismatch. $d_{41}(x,y) = \mathit{PM} / X$
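The presence/absence counts and the coefficients built from them can be sketched in Python as follows. This is an illustration only: the function names, the `absent` parameter (the value that encodes absence of an attribute), and the use of None for missing values are assumptions, not part of the procedure.

```python
import math

def asymmetric_counts(x, y, weights, absent=0):
    """Return the weighted counts PM, PX, PAX, X, PP, P, and N."""
    PM = PX = PAX = N = 0.0
    for a, b, w in zip(x, y, weights):
        if a is None or b is None:
            continue  # w_j = 0 for pairs with a missing value
        N += w
        a_present = a != absent
        b_present = b != absent
        if a_present and b_present:
            if a == b:
                PM += w        # present match
            else:
                PX += w        # present mismatch (always 0 for binary data)
        elif a_present or b_present:
            PAX += w           # one present, one absent
    X = PX + PAX               # mismatches with at least one present
    return {"PM": PM, "PX": PX, "PAX": PAX, "X": X,
            "PP": PM + PX, "P": PM + X, "N": N}

def asymmetric_nominal_measures(x, y, weights, absent=0):
    c = asymmetric_counts(x, y, weights, absent)
    return {
        "DICE": 2 * c["PM"] / (c["P"] + c["PM"]),
        "RR": c["PM"] / c["N"],
        "BLWNM": c["X"] / (c["PAX"] + 2 * c["PP"]),
        # K1 is indeterminate when there is no mismatch
        "K1": c["PM"] / c["X"] if c["X"] > 0 else math.nan,
    }

# Hypothetical binary data (1 = present, 0 = absent); the last pair is missing for x.
x = [1, 1, 0, 0, 1, None]
y = [1, 0, 1, 0, 1, 1]
weights = [1.0] * 6

counts = asymmetric_counts(x, y, weights)
measures = asymmetric_nominal_measures(x, y, weights)
print(counts, measures)
```

In this example PM = 2, X = PAX = 2, P = 4, and N = 5, so DICE is 2/3, RR is 0.4, BLWNM is 1/3, and K1 is 1. The pair that is absent in both observations counts toward N only, which is why RR is smaller than the other similarity coefficients.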