PROC DISTANCE: Proximity Measures :: SAS/STAT(R) 9.2 User's Guide, Second Edition

The DISTANCE Procedure

Proximity Measures

The following notation is used in this section:

$\text{[math]}$

the number of variables or the dimensionality

$\text{[math]}$

data for observation $\text{[math]}$ and the $\text{[math]}$ th variable, where $\text{[math]}$

$\text{[math]}$

data for observation $\text{[math]}$ and the $\text{[math]}$ th variable, where $\text{[math]}$

$\text{[math]}$

weight for the $\text{[math]}$ th variable from the WEIGHTS= option in the VAR statement. $\text{[math]}$ when either $\text{[math]}$ or $\text{[math]}$ is missing.

$\text{[math]}$

the sum of all weights. A variable's weight is added to this sum whether or not the observation is missing for that variable.

$\text{[math]}$

mean for observation $\text{[math]}$

$\text{[math]}$

$\text{[math]}$

mean for observation $\text{[math]}$

$\text{[math]}$

$\text{[math]}$

the distance or dissimilarity between observations $\text{[math]}$ and $\text{[math]}$

$\text{[math]}$

the similarity between observations $\text{[math]}$ and $\text{[math]}$

The factor $\text{[math]}$ is used to adjust some of the proximity measures for missing values.
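As a minimal sketch of how such an adjustment can work, the following Python function computes a weighted Euclidean distance over the nonmissing pairs only and then rescales it. It assumes that the factor is the ratio of the total weight to the sum of the weights of the nonmissing pairs. The function name `euclid_adjusted` and the use of `None` for a missing value are illustrative choices, not part of PROC DISTANCE.

```python
import math

def euclid_adjusted(x, y, w=None):
    """Weighted Euclidean distance over nonmissing pairs, rescaled for missing values.

    Assumption: the adjustment factor is (total weight) / (weight of nonmissing pairs).
    """
    if w is None:
        w = [1.0] * len(x)
    total_w = sum(w)          # every variable's weight counts, missing or not
    used_w = 0.0              # weight of the pairs that are actually compared
    ss = 0.0
    for xj, yj, wj in zip(x, y, w):
        if xj is None or yj is None:
            continue          # a pair with a missing value contributes nothing
        used_w += wj
        ss += wj * (xj - yj) ** 2
    if used_w == 0:
        return None           # no comparable pair: the distance is missing
    return math.sqrt(ss * total_w / used_w)

full = euclid_adjusted([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
partial = euclid_adjusted([1.0, 2.0, None, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Under this assumption, the rescaling keeps a distance computed from three of four variables on the same scale as a distance computed from all four.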

Methods Accepting All Measurement Levels

GOWER

Gower’s similarity
$\text{[math]}$

$\text{[math]}$ is computed as follows:

For a nominal, ordinal, interval, or ratio variable,

$\text{[math]}$

For an asymmetric nominal variable,

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

For a nominal or asymmetric nominal variable,

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

For an ordinal, interval, or ratio variable,

$\text{[math]}$

DGOWER

1 minus Gower
$\text{[math]}$
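The per-variable rules above can be sketched in Python as follows. This is a hedged illustration of the usual form of Gower's coefficient, not the procedure's implementation. It assumes that numeric variables are already scaled to the range 0 to 1, that an asymmetric nominal variable is coded 1 for present and 0 for absent, and that a pair in which both values are absent is left out of the average. The level labels and the function name `gower_similarity` are hypothetical.

```python
def gower_similarity(x, y, levels, w=None):
    """Weighted average of per-variable agreement scores (Gower's similarity).

    levels[j] is one of "nominal", "anominal", "ordinal", "interval", "ratio".
    Assumption: numeric values are pre-scaled to [0, 1]; None marks a missing value.
    """
    if w is None:
        w = [1.0] * len(x)
    num = 0.0
    den = 0.0
    for xj, yj, level, wj in zip(x, y, levels, w):
        if xj is None or yj is None:
            continue                      # missing pairs are skipped
        if level == "anominal" and xj == 0 and yj == 0:
            continue                      # both absent: not counted at all
        if level in ("nominal", "anominal"):
            score = 1.0 if xj == yj else 0.0
        else:
            score = 1.0 - abs(xj - yj)    # ordinal, interval, or ratio
        num += wj * score
        den += wj
    return num / den if den else None

x = ["a", 0.2, 1, 0]
y = ["a", 0.7, 0, 0]
levels = ["nominal", "interval", "anominal", "anominal"]
s = gower_similarity(x, y, levels)
d = 1.0 - s   # the DGOWER dissimilarity is 1 minus the similarity
```

In this example the last variable is absent in both observations, so only three variables enter the average.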

Methods Accepting Ratio, Interval, and Ordinal Variables

EUCLID

Euclidean distance
$\text{[math]}$

SQEUCLID

squared Euclidean distance
$\text{[math]}$

SIZE

size distance
$\text{[math]}$

SHAPE

shape distance
$\text{[math]}$

Note: Squared shape distance plus squared size distance equals squared Euclidean distance.
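The identity in the preceding note can be checked numerically. The sketch below assumes the unweighted case with no missing values, with size distance taken as the absolute difference of the observation means scaled by the square root of the number of variables, and shape distance taken as the Euclidean distance between the mean-centered observations. The function names are illustrative.

```python
import math

def mean(v):
    return sum(v) / len(v)

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def size_distance(x, y):
    # Assumed form: |mean(x) - mean(y)| scaled by sqrt(number of variables)
    return math.sqrt(len(x)) * abs(mean(x) - mean(y))

def shape_distance(x, y):
    # Assumed form: Euclidean distance after centering each observation at its own mean
    mx, my = mean(x), mean(y)
    return math.sqrt(sum(((a - mx) - (b - my)) ** 2 for a, b in zip(x, y)))

x = [1.0, 4.0, 2.0]
y = [3.0, 3.0, 6.0]
lhs = size_distance(x, y) ** 2 + shape_distance(x, y) ** 2
rhs = euclid(x, y) ** 2
```

Here `lhs` and `rhs` agree up to floating-point rounding, because the sum of squared differences splits into a between-means part and a within-observation part.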

COV

covariance similarity coefficient
$\text{[math]}$ , where

$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

CORR

correlation similarity coefficient
$\text{[math]}$

DCORR

correlation transformed to Euclidean distance as sqrt(1–CORR)
$\text{[math]}$

SQCORR

squared correlation
$\text{[math]}$

DSQCORR

squared correlation transformed to squared Euclidean distance as (1–SQCORR)

$\text{[math]}$
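The four correlation-based measures are related by simple transformations, which the following sketch illustrates. It assumes the unweighted case with no missing values, where the correlation between two observations is the Pearson correlation computed across their variables. The function name `corr` is illustrative.

```python
import math

def corr(x, y):
    """Pearson correlation between two observations, computed across variables."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]

r = corr(x, y)                 # CORR
dcorr = math.sqrt(1.0 - r)     # DCORR: sqrt(1 - CORR)
sqcorr = r ** 2                # SQCORR
dsqcorr = 1.0 - sqcorr         # DSQCORR: 1 - SQCORR
```

For these perfectly negatively correlated observations, DCORR reaches its maximum while DSQCORR is zero, because squaring the correlation discards its sign.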

L( $\text{[math]}$ )

Minkowski ( $\text{[math]}$ ) distance, where $\text{[math]}$ is a positive numeric value

$\text{[math]}$

CITYBLOCK

$\text{[math]}$
$\text{[math]}$

CHEBYCHEV

$\text{[math]}$
$\text{[math]}$

POWER( $\text{[math]}$ )

generalized Euclidean distance, where $\text{[math]}$ is a nonnegative numeric value and $\text{[math]}$ is a positive numeric value. The distance between two observations is the $\text{[math]}$ th root of the sum of the absolute differences, each raised to the $\text{[math]}$ th power, between the values for the observations:

$\text{[math]}$
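The verbal definition of the generalized Euclidean distance translates directly into code. The following sketch assumes the unweighted case with no missing values. The parameter names `p` (the power applied to each absolute difference) and `r` (the root applied to the sum) are labels chosen for this example.

```python
def power_distance(x, y, p, r):
    """r-th root of the sum of absolute differences raised to the p-th power."""
    if r <= 0:
        raise ValueError("r must be positive")
    if p < 0:
        raise ValueError("p must be nonnegative")
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / r)

x = [0.0, 0.0]
y = [3.0, 4.0]

d_euclid = power_distance(x, y, p=2, r=2)      # same as Euclidean distance
d_sqeuclid = power_distance(x, y, p=2, r=1)    # same as squared Euclidean distance
d_cityblock = power_distance(x, y, p=1, r=1)   # same as city-block distance
```

Setting the power and the root to the same value gives a Minkowski distance, so several of the earlier methods appear as special cases.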

Methods Accepting Ratio Variables

SIMRATIO

similarity ratio
$\text{[math]}$

DISRATIO

one minus similarity ratio
$\text{[math]}$
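A hedged sketch of the similarity ratio and its complement follows. It assumes the unweighted case with no missing values and the commonly cited form of the ratio, in which the sum of cross products is divided by that same sum plus the sum of squared differences. The function name `simratio` is illustrative.

```python
def simratio(x, y):
    """Assumed form: sum(x*y) / (sum(x*y) + sum((x - y)^2))."""
    cross = sum(a * b for a, b in zip(x, y))
    sqdiff = sum((a - b) ** 2 for a, b in zip(x, y))
    return cross / (cross + sqdiff)

x = [1.0, 2.0]
y = [2.0, 1.0]

s = simratio(x, y)        # SIMRATIO
d = 1.0 - s               # DISRATIO: one minus the similarity ratio
s_self = simratio(x, x)   # identical observations give a ratio of 1
```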

NONMETRIC

Lance-Williams nonmetric coefficient
$\text{[math]}$

CANBERRA

Canberra metric coefficient. See Sneath and Sokal (1973), pp. 125–126.
$\text{[math]}$

COSINE

cosine coefficient
$\text{[math]}$

DOT

dot (inner) product coefficient
$\text{[math]}$

OVERLAP

sum of the minimum values
$\text{[math]}$

DOVERLAP

maximum of the sum of the $\text{[math]}$ and the sum of $\text{[math]}$ minus overlap
$\text{[math]}$
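The OVERLAP and DOVERLAP descriptions can be sketched as follows, assuming unweighted, nonnegative ratio data with no missing values. The function names are illustrative.

```python
def overlap(x, y):
    """Sum over variables of the smaller of the two values."""
    return sum(min(a, b) for a, b in zip(x, y))

def doverlap(x, y):
    """Larger of the two observation totals, minus the overlap."""
    return max(sum(x), sum(y)) - overlap(x, y)

x = [1.0, 5.0, 2.0]
y = [3.0, 2.0, 2.0]

s = overlap(x, y)     # 1 + 2 + 2
d = doverlap(x, y)    # max(8, 7) - 5
```

The dissimilarity is zero when the two observations are identical, because the overlap then equals each observation's total.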

CHISQ

chi-squared
If the data represent frequency counts, the chi-squared dissimilarity between two sets of frequencies can be computed. The following 2-by- $\text{[math]}$ contingency table illustrates how the chi-squared dissimilarity is computed:

	Variable				Row
Observation	Var 1	Var 2	...	Var v	Sum
X	$\text{[math]}$	$\text{[math]}$	...	$\text{[math]}$	$\text{[math]}$
Y	$\text{[math]}$	$\text{[math]}$	...	$\text{[math]}$	$\text{[math]}$
Column Sum	$\text{[math]}$	$\text{[math]}$	...	$\text{[math]}$	$\text{[math]}$

where

$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

The chi-squared measure is computed as follows:

$\text{[math]}$

where for $\text{[math]}$ = 1, 2, ..., $\text{[math]}$

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
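The contingency-table computation can be sketched in Python. This example assumes the unweighted case with no missing values and uses the standard Pearson form, in which each expected count is the row sum times the column sum divided by the grand total. The function name `chisq_dissimilarity` is illustrative, and columns whose sum is zero are skipped as a safeguard chosen for this sketch.

```python
import math

def chisq_dissimilarity(x, y):
    """Pearson chi-squared statistic of the 2-by-v table formed by x and y."""
    row_x = sum(x)
    row_y = sum(y)
    total = row_x + row_y
    chisq = 0.0
    for xj, yj in zip(x, y):
        col = xj + yj
        if col == 0:
            continue                      # empty column: no expected count
        ex = row_x * col / total          # expected count for observation X
        ey = row_y * col / total          # expected count for observation Y
        chisq += (xj - ex) ** 2 / ex + (yj - ey) ** 2 / ey
    return chisq

x = [10.0, 20.0]
y = [20.0, 10.0]

d_chisq = chisq_dissimilarity(x, y)
d_chi = math.sqrt(d_chisq)                # CHI is the square root of CHISQ
```

Two observations with proportional frequencies have identical observed and expected counts, so their chi-squared dissimilarity is zero.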

CHI

square root of chi-squared
$\text{[math]}$

PHISQ

phi-squared
This is the CHISQ dissimilarity normalized by the sum of weights.
$\text{[math]}$

PHI

square root of phi-squared
$\text{[math]}$

Methods Accepting Symmetric Nominal Variables

The following notation is used for computing $\text{[math]}$ to $\text{[math]}$ . Only nonmissing pairs are considered; every pair with at least one missing value is excluded from the computations in this section because $\text{[math]}$.

$\text{[math]}$

nonmissing matches

$\text{[math]}$ , where

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

$\text{[math]}$

nonmissing mismatches

$\text{[math]}$ , where

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

$\text{[math]}$

total nonmissing pairs

$\text{[math]}$

HAMMING: Hamming distance
$\text{[math]}$
MATCH: simple matching coefficient
$\text{[math]}$
DMATCH: simple matching coefficient transformed to Euclidean distance
$\text{[math]}$
DSQMATCH: simple matching coefficient transformed to squared Euclidean distance
$\text{[math]}$
HAMANN: Hamann coefficient
$\text{[math]}$
RT: Rogers and Tanimoto
$\text{[math]}$
SS1: Sokal and Sneath 1
$\text{[math]}$
SS3: Sokal and Sneath 3. The coefficient between an observation and itself is always indeterminate (missing) since there is no mismatch.
$\text{[math]}$
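The coefficients in this list are all functions of the match and mismatch counts. The following sketch assumes the unweighted case and the standard textbook definitions of each coefficient, which are expected to agree with the formulas in this section. The function name `matching_measures`, the dictionary layout, and the use of `None` for a missing value are illustrative.

```python
import math

def matching_measures(x, y):
    """Matching-based coefficients computed from nonmissing pairs only."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    m = sum(1 for a, b in pairs if a == b)   # nonmissing matches
    mis = len(pairs) - m                     # nonmissing mismatches
    n = len(pairs)                           # total nonmissing pairs
    return {
        "HAMMING": mis,
        "MATCH": m / n,
        "DMATCH": math.sqrt(mis / n),
        "DSQMATCH": mis / n,
        "HAMANN": (m - mis) / n,
        "RT": m / (m + 2 * mis),
        "SS1": 2 * m / (2 * m + mis),
        # With no mismatch the ratio is indeterminate, so return a missing value
        "SS3": m / mis if mis else None,
    }

x = ["a", "a", "b", "b", "c", None]
y = ["a", "a", "b", "c", "c", "d"]
result = matching_measures(x, y)
self_result = matching_measures(x[:5], x[:5])
```

Comparing an observation with itself gives no mismatches, which is why SS3 comes back as a missing value in `self_result`.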

The following notation is used for computing $\text{[math]}$ to $\text{[math]}$ . Only nonmissing pairs are considered; every pair with at least one missing value is excluded from the computations in this section because $\text{[math]}$.

Also, the observed nonmissing data of an asymmetric binary variable can have only two possible outcomes: presence or absence. Therefore, PX (present mismatches) is always zero for an asymmetric binary variable.

The following methods distinguish between the presence and absence of attributes.

$\text{[math]}$

mismatches with at least one present

$\text{[math]}$ , where

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

$\text{[math]}$

present matches

$\text{[math]}$ , where

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

$\text{[math]}$

present mismatches

$\text{[math]}$ , where

	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
	$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

$\text{[math]}$

both present = $\text{[math]}$

$\text{[math]}$

at least one present = $\text{[math]}$

$\text{[math]}$

present-absent mismatches

$\text{[math]}$ , where

$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$
$\text{[math]}$	$\text{[math]}$	$\text{[math]}$

$\text{[math]}$

total nonmissing pairs

$\text{[math]}$

Methods Accepting Asymmetric Nominal and Ratio Variables

JACCARD

Jaccard similarity coefficient

The JACCARD method is equivalent to the SIMRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (SIMRATIO) and the coefficient from the asymmetric nominal variables.

$\text{[math]}$

DJACCARD

Jaccard dissimilarity coefficient

The DJACCARD method is equivalent to the DISRATIO method if there are only ratio variables; if there are both ratio and asymmetric nominal variables, the coefficient is computed as the sum of the coefficient from the ratio variables (DISRATIO) and the coefficient from the asymmetric nominal variables.

$\text{[math]}$
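For the case with only asymmetric binary variables, the Jaccard coefficients can be sketched as follows. This is a minimal illustration that assumes the standard definition: the number of variables present in both observations divided by the number present in at least one, with 1 coding presence and 0 coding absence. The mixed case with ratio variables is not covered, and the function name `jaccard_binary` is illustrative.

```python
def jaccard_binary(x, y):
    """Jaccard similarity for binary (present = 1, absent = 0) observations."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    both_present = sum(1 for a, b in pairs if a == 1 and b == 1)
    any_present = sum(1 for a, b in pairs if a == 1 or b == 1)
    if any_present == 0:
        return None     # nothing present in either observation: indeterminate
    return both_present / any_present

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]

s = jaccard_binary(x, y)   # 2 present in both, 4 present in at least one
d = 1.0 - s                # Jaccard dissimilarity
```

Pairs in which both values are absent affect neither count, which is what distinguishes this coefficient from the simple matching coefficient.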