PROC DISTANCE: PROC DISTANCE Statement :: SAS/STAT(R) 9.22 User's Guide

The DISTANCE Procedure

PROC DISTANCE Statement

PROC DISTANCE <options> ;

The options available with the PROC DISTANCE statement are summarized in Table 32.1 and discussed in the following section.

Table 32.1 Summary of PROC DISTANCE Statement Options
Option	Description
Standardize variables
ADD=	specifies the constant to add to each value after standardizing and multiplying by the value specified in the MULT= option
FUZZ=	specifies the relative fuzz factor for writing the output
INITIAL=	specifies the method for computing initial estimates for the A-estimates
MULT=	specifies the constant to multiply each value by after standardizing
NORM	normalizes the scale estimator to be consistent for the standard deviation of a normal distribution
NOSTD	suppresses standardization
SNORM	normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution
STDONLY	standardizes variables only (suppresses computation of the distance matrix)
VARDEF=	specifies the variance divisor
Generate distance matrix
ABSENT=	specifies the value to be used as an absence value for all the asymmetric nominal variables
METHOD=	specifies the method for computing proximity measures
PREFIX=	specifies a prefix for naming the distance variables in the OUT= data set
RANKSCORE=	specifies the method of assigning scores to ordinal variables
SHAPE=	specifies the shape of the proximity matrix to be stored in the OUT= data set
UNDEF=	specifies the numeric constant used to replace undefined distances
Replace missing values
NOMISS	omits observations with missing values from computation of the location and scale measures, if standardization applies; outputs missing values to the distance matrix for observations with missing values
REPLACE	replaces missing data with zero in the standardized data
REPONLY	replaces missing data with the location measure (does not standardize the data)
Specify data set details
DATA=	specifies the input data set
OUT=	specifies the output data set
OUTSDZ=	specifies the output data set for standardized scores

These options and their abbreviations are described (in alphabetical order) in the remainder of this section.

ABSENT=number | qs

specifies the value to be used as an absence value in an irrelevant absent-absent match for all of the asymmetric nominal variables. If you want to specify a different absence value for a particular variable, use the ABSENT= option in the VAR statement. See the ABSENT= option in the section VAR Statement for details.

An absence value for a variable can be either a numeric value or a quoted string consisting of combinations of characters. For instance, ., -999, and "NA" are legal values for the ABSENT= option.

The default absence value for a character variable is "NONE" (notice that a blank value is considered a missing value), and the default absence value for a numeric variable is 0.

ADD=c

specifies a constant, $c$, to add to each value after standardizing and multiplying by the value you specify in the MULT= option. The default value is 0.

DATA=SAS-data-set

specifies the input data set containing observations from which the proximity is computed. If you omit the DATA= option, the most recently created SAS data set is used.

FUZZ=c

specifies the relative fuzz factor for computing the standardized scores. The default value is 1E–14. For the OUTSDZ= data set, the score is computed as follows:

$\text{if } |\text{standardized score}| < m \times |c| \text{ then standardized score} = 0$

where $m$ is the numeric constant specified in the MULT= option, or 1 if the MULT= option is not specified.
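
The effect of a relative fuzz factor can be sketched in Python. This is only an illustration, not the procedure's implementation: the function name `apply_fuzz` and its parameters are hypothetical, and the sketch assumes that a standardized score is set to zero when its magnitude falls below $m \times |c|$, where $m$ is the MULT= constant and $c$ is the FUZZ= value.

```python
def apply_fuzz(score, fuzz=1e-14, mult=1.0):
    """Return 0.0 when |score| is below mult * |fuzz|; otherwise return score.

    Assumption: the threshold is mult * |fuzz|, with mult defaulting to 1
    when no MULT= value is given.
    """
    threshold = mult * abs(fuzz)
    return 0.0 if abs(score) < threshold else score
```

Under these assumptions, `apply_fuzz(3e-15)` yields zero, while a score such as 0.25 is returned unchanged.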

INITIAL=method

specifies the method of computing initial estimates for the A-estimates (ABW, AWAVE, and AHUBER). The following methods are not allowed for the INITIAL= option: ABW, AHUBER, AWAVE, and IN.

The default value is INITIAL=MAD.

METHOD=method

specifies the method of computing proximity measures.

For use in PROC CLUSTER, distance or dissimilarity measures such as METHOD=EUCLID or METHOD=DGOWER should be chosen.

The following six tables outline the proximity measures available for the METHOD= option. These tables are classified by levels of measurement accepted by each method. There are three to four columns in each table: the proximity measures (Method) column, the upper and lower bounds (Range) column(s), and the types of proximity (Type) column.

The Type column has two possible values: "sim" if a method generates similarity measures, or "dis" if a method generates distance or dissimilarity measures.

For formulas and descriptions of these methods, see the section Details: DISTANCE Procedure.

Table 32.2 lists the range and output matrix type of the GOWER and DGOWER methods. These two methods accept all measurement levels including ratio, interval, ordinal, nominal, and asymmetric nominal. METHOD=GOWER or METHOD=DGOWER always implies standardization. Assuming all the numeric (ordinal, interval, and ratio) variables are standardized by their corresponding default methods, the possible range values for both methods in the second column of this table are on or between 0 and 1. To find out the default methods of standardization for METHOD=GOWER or METHOD=DGOWER, see the STD= option in the section VAR Statement. Entries in this table are as follows:

GOWER: Gower and Legendre (1986) similarity
DGOWER: 1 minus GOWER

Table 32.2 Methods Accepting All Measurement Levels
Method	Range	Type
GOWER	0 to 1	sim
DGOWER	0 to 1	dis

Table 32.3 lists methods accepting ratio, interval, and ordinal variables. Entries in this table are as follows:

EUCLID: Euclidean distance
SQEUCLID: squared Euclidean distance
SIZE: size distance
SHAPE: shape distance
COV: covariance
CORR: correlation
DCORR: correlation transformed to Euclidean distance
SQCORR: squared correlation
DSQCORR: one minus squared correlation
L($p$): Minkowski ($L_p$) distance, where $p$ is a positive numeric value
CITYBLOCK: $L_1$, city-block, or Manhattan distance
CHEBYCHEV: $L_\infty$
POWER($p,r$): generalized Euclidean distance, where $p$ is a positive numeric value and $r$ is a nonnegative numeric value. The distance between two observations is the $r$th root of the sum of the absolute differences to the $p$th power between the values for the observations.

Table 32.3 Methods Accepting Ratio, Interval, and Ordinal Variables
Method	Range	Type
EUCLID	$\geq 0$	dis
SQEUCLID	$\geq 0$	dis
SIZE	$\geq 0$	dis
SHAPE	$\geq 0$	dis
COV	$\text{[math]}$	sim
CORR	–1 to 1	sim
DCORR	0 to 2	dis
SQCORR	0 to 1	sim
DSQCORR	0 to 1	dis
L($p$)	$\geq 0$	dis
CITYBLOCK	$\geq 0$	dis
CHEBYCHEV	$\geq 0$	dis
POWER($p,r$)	$\geq 0$	dis
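
The Minkowski family of distances in this table can be illustrated with a short Python sketch. It is not the procedure's implementation. The function names are hypothetical, and the sketch assumes that in POWER the first argument (`p`) is the exponent applied to the absolute differences and the second (`r`) is the root taken of their sum; it ignores weights and standardization.

```python
def power_distance(x, y, p, r):
    """Generalized Euclidean distance: r-th root of sum(|x_j - y_j| ** p)."""
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / r)


def minkowski(x, y, p):
    """L(p) distance: the special case of power_distance with r equal to p."""
    return power_distance(x, y, p, p)


def cityblock(x, y):
    """L1 (Manhattan) distance."""
    return minkowski(x, y, 1)


def euclid(x, y):
    """L2 (Euclidean) distance."""
    return minkowski(x, y, 2)


def sqeuclid(x, y):
    """Squared Euclidean distance: power 2, no root taken (r = 1)."""
    return power_distance(x, y, 2, 1)


def chebychev(x, y):
    """L-infinity distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For the points (0, 0) and (3, 4), these functions give a Euclidean distance of 5, a city-block distance of 7, and a Chebychev distance of 4, all of which are nonnegative as the Range column indicates.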

Table 32.4 lists methods accepting ratio variables. Notice that in the second column of this table, all of the possible range values are nonnegative, because ratio variables are assumed to be positive. Entries in this table are as follows:

SIMRATIO: similarity ratio (if variables are binary, this is the Jaccard coefficient)
DISRATIO: one minus similarity ratio
NONMETRIC: Lance and Williams nonmetric coefficient
CANBERRA: Canberra metric distance coefficient
COSINE: cosine coefficient
DOT: dot (inner) product coefficient
OVERLAP: overlap similarity
DOVERLAP: overlap dissimilarity
CHISQ: chi-squared coefficient
CHI: square root of the chi-squared coefficient
PHISQ: phi-squared coefficient
PHI: square root of the phi-squared coefficient

Table 32.4 Methods Accepting Ratio Variables
Method	Range	Type
SIMRATIO	0 to 1	sim
DISRATIO	0 to 1	dis
NONMETRIC	0 to 1	dis
CANBERRA	0 to 1	dis
COSINE	0 to 1	sim
DOT	$\geq 0$	sim
OVERLAP	$\geq 0$	sim
DOVERLAP	$\geq 0$	dis
CHISQ	$\geq 0$	dis
CHI	$\geq 0$	dis
PHISQ	$\geq 0$	dis
PHI	$\geq 0$	dis
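
A few of the ratio-variable measures can be sketched in Python to show why the similarity ratio reduces to the Jaccard coefficient for binary data. This is an illustration only: the function names are hypothetical, weights are ignored, and the similarity ratio is assumed to take its usual form, the sum of products divided by the sum of products plus the sum of squared differences.

```python
import math


def dot(x, y):
    """Dot (inner) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine coefficient; both vectors are assumed to be nonzero."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def simratio(x, y):
    """Similarity ratio: sum(xy) / (sum(xy) + sum((x - y) ** 2))."""
    xy = dot(x, y)
    squared_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return xy / (xy + squared_diff)


def disratio(x, y):
    """One minus the similarity ratio."""
    return 1.0 - simratio(x, y)
```

With the binary vectors `[1, 1, 0, 0]` and `[1, 0, 1, 0]`, one position is present in both vectors and three positions are present in at least one, so both the Jaccard coefficient and `simratio` equal 1/3.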

Table 32.5 lists methods accepting nominal variables. Entries in this table are as follows:

HAMMING: Hamming distance
MATCH: simple matching coefficient
DMATCH: simple matching coefficient transformed to Euclidean distance
DSQMATCH: simple matching coefficient transformed to squared Euclidean distance
HAMANN: Hamann coefficient
RT: Rogers and Tanimoto
SS1: Sokal and Sneath 1
SS3: Sokal and Sneath 3

Table 32.5 Methods Accepting Nominal Variables
Method	Range	Type
HAMMING	0 to $v$	dis
MATCH	0 to 1	sim
DMATCH	0 to 1	dis
DSQMATCH	0 to 1	dis
HAMANN	–1 to 1	sim
RT	0 to 1	sim
SS1	0 to 1	sim
SS3	0 to 1	sim

Note that $v$ denotes the number of variables or dimensionality.
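
The ranges in this table follow from simple counts of matching and mismatching variables, as the following Python sketch suggests. It is an illustration under stated assumptions, not the procedure's implementation: the function names are hypothetical, weights are ignored, and the measures are assumed to take their standard textbook forms.

```python
import math


def match_counts(x, y):
    """Return (matches, mismatches) for two equal-length nominal vectors."""
    if len(x) != len(y):
        raise ValueError("observations must have the same number of variables")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches, len(x) - matches


def hamming(x, y):
    """Number of mismatching variables (0 to the number of variables)."""
    return match_counts(x, y)[1]


def match(x, y):
    """Simple matching coefficient: proportion of matching variables."""
    m, mm = match_counts(x, y)
    return m / (m + mm)


def dsqmatch(x, y):
    """Proportion of mismatches (matching as squared Euclidean distance)."""
    m, mm = match_counts(x, y)
    return mm / (m + mm)


def dmatch(x, y):
    """Square root of the proportion of mismatches."""
    return math.sqrt(dsqmatch(x, y))


def hamann(x, y):
    """Hamann coefficient: (matches - mismatches) / number of variables."""
    m, mm = match_counts(x, y)
    return (m - mm) / (m + mm)
```

Two observations that agree on two of four variables have a Hamming distance of 2, a simple matching coefficient of 0.5, and a Hamann coefficient of 0.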

Table 32.6 lists methods that accept asymmetric nominal variables. Use the ABSENT= option to create a value to be considered absent. Entries in this table are as follows:

DICE: Dice coefficient or Czekanowski/Sorensen similarity coefficient
RR: Russell and Rao
BLWNM: Binary Lance and Williams nonmetric, or Bray-Curtis coefficient
K1: Kulczynski 1

Table 32.6 Methods Accepting Asymmetric Nominal Variables
Method	Range	Type
DICE	0 to 1	sim
RR	0 to 1	sim
BLWNM	0 to 1	dis
K1	$\geq 0$	sim
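
The role of the absence value can be sketched in Python for the simplest case, where each variable is either present or absent. This is an illustration only: the function names and the `absent` parameter are hypothetical, the coefficients are assumed to take their standard binary forms, and `None` stands in for an undefined result when a divisor is zero.

```python
def presence_counts(x, y, absent=0):
    """Count (both present, only in x, only in y, both absent)."""
    both = only_x = only_y = neither = 0
    for a, b in zip(x, y):
        a_present, b_present = a != absent, b != absent
        if a_present and b_present:
            both += 1
        elif a_present:
            only_x += 1
        elif b_present:
            only_y += 1
        else:
            neither += 1
    return both, only_x, only_y, neither


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the divisor is zero."""
    return None if denominator == 0 else numerator / denominator


def dice(x, y, absent=0):
    both, only_x, only_y, _ = presence_counts(x, y, absent)
    return _ratio(2 * both, 2 * both + only_x + only_y)


def rr(x, y, absent=0):
    both, only_x, only_y, neither = presence_counts(x, y, absent)
    return _ratio(both, both + only_x + only_y + neither)


def blwnm(x, y, absent=0):
    both, only_x, only_y, _ = presence_counts(x, y, absent)
    return _ratio(only_x + only_y, 2 * both + only_x + only_y)


def k1(x, y, absent=0):
    both, only_x, only_y, _ = presence_counts(x, y, absent)
    return _ratio(both, only_x + only_y)
```

In this sketch the absent-absent count does not enter DICE, BLWNM, or K1 at all, which is what makes such matches irrelevant for those measures. K1 has no upper bound because its divisor counts only mismatches.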

Table 32.7 lists methods accepting asymmetric nominal and ratio variables. Use the ABSENT= option to create a value to be considered absent. There are four instead of three columns in this table. The second column contains possible range values if only one level of measurement (either ratio or asymmetric nominal but not both) is specified; the third column contains possible range values if both levels are specified.

The JACCARD method is equivalent to the SIMRATIO method if there is no asymmetric nominal variable; if both ratio and asymmetric nominal variables are present, the coefficient is computed as the sum of the coefficient from the ratio variables and the coefficient from the asymmetric nominal variables. See "Proximity Measures" in the section Details: DISTANCE Procedure for the formula and descriptions of the JACCARD method. Entries in this table are as follows:

JACCARD: Jaccard similarity coefficient
DJACCARD: Jaccard dissimilarity coefficient

Table 32.7 Methods Accepting Asymmetric Nominal and Ratio Variables
Method	Range (one level)	Range (two levels)	Type
JACCARD	0 to 1	0 to 2	sim
DJACCARD	0 to 1	0 to 2	dis

MULT=c

specifies a numeric constant, $c$, by which to multiply each value after standardizing. The default value is 1.
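
The order in which the MULT= and ADD= constants are applied can be sketched in Python. The function name `rescale` and its parameters are hypothetical, and the input is assumed to be a value that has already been standardized.

```python
def rescale(z, mult=1.0, add=0.0):
    """Multiply a standardized value by mult, then add the constant add."""
    return mult * z + add
```

With the defaults (MULT=1, ADD=0) the standardized value is unchanged; with `mult=10` and `add=50`, a standardized value of 2 becomes 70.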

NOMISS

omits observations with missing values from computation of the location and scale measures when standardizing; generates undefined (missing) distances for observations with missing values when computing distances. Use the UNDEF= option to specify the undefined values.

If a distance matrix is created to be used as an input to PROC CLUSTER, the NOMISS option should not be used because the CLUSTER procedure does not accept distance matrices with missing values.

NORM

normalizes the scale estimator to be consistent for the standard deviation of a normal distribution when you specify the option STD=AGK, STD=IQR, STD=MAD, or STD=SPACING in the VAR statement.

NOSTD

suppresses standardization of the variables. The NOSTD option should not be specified with the STDONLY option or with the REPLACE option.

OUT=SAS-data-set

specifies the name of the SAS data set created by PROC DISTANCE. The output data set contains the BY variables, the ID variable, computed distance variables, the COPY variables, the FREQ variable, and the WEIGHT variables.

If you omit the OUT= option, PROC DISTANCE creates an output data set named according to the DATAn convention.

OUTSDZ=SAS-data-set

specifies the name of the SAS data set containing the standardized scores. The output data set contains a copy of the DATA= data set, except that the analyzed variables have been standardized. Analyzed variables are those listed in the VAR statement.

PREFIX=name

specifies a prefix for naming the distance variables in the OUT= data set. By default, the names are Dist1, Dist2, ..., Distn. If you specify PREFIX=ABC, the variables are named ABC1, ABC2, ..., ABCn. If the ID statement is also specified, the variables are named by appending the value of the ID variable to the prefix.

RANKSCORE=MIDRANK | INDEX

specifies the method of assigning scores to ordinal variables. The available methods are listed as follows:

MIDRANK: assigns midranks to each category, which depend on the frequencies of the categories. This is the default method.
INDEX: assigns consecutive integers to each category regardless of frequencies.

The following example explains how each method assigns the rank scores. Suppose the data contain an ordinal variable ABC with values A, B, C. There are two ways to assign numbers. One is to use midranks, which depend on the frequencies of each category. Another is to assign consecutive integers to each category, regardless of frequencies.

Table 32.8 Example of Assigning Rank Scores
ABC	MIDRANK	INDEX
A	1.5	1
A	1.5	1
B	4	2
B	4	2
B	4	2
C	6	3
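
The scores in Table 32.8 can be reproduced with a short Python sketch. It is only an illustration: the function name `rank_scores` is hypothetical, categories are assumed to be ordered by sorting their values, and each observation is assumed to have a frequency of 1.

```python
from collections import Counter


def rank_scores(values, method="MIDRANK"):
    """Assign a score to each observation of an ordinal variable.

    MIDRANK: the average of the ranks occupied by the category.
    INDEX:   consecutive integers 1, 2, ... in category order.
    """
    counts = Counter(values)
    scores = {}
    seen = 0  # number of observations in lower categories
    for index, category in enumerate(sorted(counts), start=1):
        n = counts[category]
        if method.upper() == "INDEX":
            scores[category] = index
        else:
            # Ranks seen+1 .. seen+n average to seen + (n + 1) / 2.
            scores[category] = seen + (n + 1) / 2
        seen += n
    return [scores[v] for v in values]


abc = ["A", "A", "B", "B", "B", "C"]
print(rank_scores(abc, "MIDRANK"))  # [1.5, 1.5, 4.0, 4.0, 4.0, 6.0]
print(rank_scores(abc, "INDEX"))    # [1, 1, 2, 2, 2, 3]
```

The three B observations occupy ranks 3, 4, and 5, so their midrank is 4, whereas INDEX gives B the score 2 because it is the second category.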

REPLACE

replaces missing data with zero in the standardized data (to correspond to the location measure before standardizing). To replace missing data with something else, use the MISSING= option in the VAR statement. The REPLACE option implies standardization.

You cannot specify the following pairs of options together:

the REPLACE and REPONLY options
the REPLACE and NOSTD options

REPONLY

replaces missing data with the location measure specified by the MISSING= option or the STD= option (if the MISSING= option is not specified), but does not standardize the data. If the MISSING= option is not specified and METHOD=GOWER is specified, missing values are replaced by the location measure from the RANGE method (the minimum value), no matter what the value of the STD= option is.

You cannot specify both the REPLACE and the REPONLY options.

SHAPE=TRIANGLE | TRI | SQUARE | SQU | SQR

specifies the shape of the proximity matrix to be stored in the OUT= data set. SHAPE=TRIANGLE requests that the matrix be stored as a lower triangular matrix; SHAPE=SQUARE requests that the matrix be stored as a square matrix. Use SHAPE=SQUARE if the output data set is to be used as input to the MODECLUS procedure. The default is TRIANGLE.
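
The difference between the two shapes can be sketched in Python. This is an illustration only, not the layout of the OUT= data set: the function names are hypothetical, and the sketch assumes that the triangular form keeps the diagonal and marks the entries above it as missing (`None` here).

```python
def to_triangle(square):
    """Keep entries on and below the diagonal; mark the rest as missing."""
    return [
        [value if j <= i else None for j, value in enumerate(row)]
        for i, row in enumerate(square)
    ]


def to_square(triangle):
    """Fill the upper triangle by mirroring the lower triangle."""
    n = len(triangle)
    return [
        [triangle[i][j] if j <= i else triangle[j][i] for j in range(n)]
        for i in range(n)
    ]
```

Because a proximity matrix is symmetric, the lower triangle carries all of the information, and the square form can be recovered from it by mirroring.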

SNORM

normalizes the scale estimator to have an expectation of approximately 1 for a standard normal distribution when the STD=SPACING option is specified.

STDONLY

standardizes variables only and computes no distance matrix. You must use the OUTSDZ= option to save the standardized scores. You cannot specify both the STDONLY option and the NOSTD option.

UNDEF=n

specifies the numeric constant used to replace undefined distances, such as when an observation has all missing values, or if a divisor is zero.

VARDEF=DF | N | WDF | WEIGHT | WGT

specifies the divisor to be used in the calculation of distance, dissimilarity, or similarity measures, and for standardizing variables whenever a variance or covariance is computed. By default, VARDEF=DF. The values and associated divisors are as follows:

Value	Divisor	Formula
DF	degrees of freedom	$n - 1$
N	number of observations	$n$
WDF	sum of weights minus 1	$\left(\sum_i w_i\right) - 1$
WEIGHT | WGT	sum of weights	$\sum_i w_i$
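
The four divisors can be sketched in Python. The function name `vardef_divisor` is hypothetical, and the sketch assumes one weight per observation, so that the number of observations is the length of the weight list.

```python
def vardef_divisor(vardef, weights):
    """Return the variance divisor for a VARDEF= value and observation weights."""
    n = len(weights)
    total = sum(weights)
    key = vardef.upper()
    if key == "DF":
        return n - 1          # degrees of freedom
    if key == "N":
        return n              # number of observations
    if key == "WDF":
        return total - 1      # sum of weights minus 1
    if key in ("WEIGHT", "WGT"):
        return total          # sum of weights
    raise ValueError(f"unknown VARDEF value: {vardef!r}")
```

For three observations with weights 1, 2, and 3, the divisors are 2 (DF), 3 (N), 5 (WDF), and 6 (WEIGHT or WGT).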
