VAR Statement
VAR level (variables < / opt-list >) < level (variables < / opt-list >) ... > ;
where the syntax for the opt-list is as follows:
ABSENT=value
MISSING=miss-method | value
ORDER=order-option
STD=std-method
WEIGHTS=weight-list
The VAR statement lists variables from which distances are to be computed. The VAR statement is required. The variables can be numeric or character depending on their measurement levels. A variable cannot appear more than once in either the same list or a different list.
level is required. It declares the level of measurement for the variables specified within the parentheses. Available values for level are as follows:
ANOMINAL
variables are asymmetric nominal and can be either numeric or character.
NOMINAL
variables are symmetric nominal and can be either numeric or character.
ORDINAL
variables are ordinal and can be either numeric or character. Values of ordinal variables are replaced by their corresponding rank scores. If standardization is required, the standardized rank scores are output to the data set specified in the OUTSDZ= option. See the RANKSCORE= option in the PROC DISTANCE statement for methods available for assigning rank scores to ordinal variables. After being replaced by scores, ordinal variables are considered interval.
INTERVAL
variables are interval and numeric.
RATIO
variables are ratio and numeric. Ratio variables should always contain positive measurements.
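As an illustration of the level keywords, the following is a minimal sketch rather than an example from this documentation. The data set pets and all of its variables are hypothetical, and METHOD=DGOWER is used on the assumption that it accepts every level of measurement.

```sas
/* Hypothetical data; all data set and variable names are illustrative. */
data pets;
   input name $ color $ indoor grade temp mass;
   datalines;
Rex    brown 0 3 38.5 30.2
Tom    black 1 2 38.1  4.1
Polly  green 1 1 41.0  0.4
Goldie gold  1 1 22.0  0.1
;
run;

/* One variable list per level of measurement. */
proc distance data=pets method=dgower out=petdist;
   var nominal(color)      /* symmetric nominal, character             */
       anominal(indoor)    /* asymmetric nominal, numeric (0 = absent) */
       ordinal(grade)      /* replaced by rank scores                  */
       interval(temp)      /* interval, numeric                        */
       ratio(mass);        /* ratio, numeric, positive                 */
   id name;
run;
```

Each variable appears in exactly one list, which satisfies the rule that a variable cannot appear more than once.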
Each variable list can be followed by an option list. Use "/ " after the list of variables to start the option list. An option list contains options that are applied to the variables. The following options are available in the option list:
ABSENT=
specifies the value to be used as an absence value in an irrelevant absent-absent match for asymmetric nominal variables.
MISSING=
specifies the method (or numeric value) with which to replace missing data.
ORDER=
selects the order for assigning scores to ordinal variables.
STD=
selects the standardization method.
WEIGHTS=
assigns weights to the variables in the list.
If an option is omitted from the current option list, PROC DISTANCE applies its default value to all the variables in that list.
For example, in the VAR statement
var ratio(x1-x4/std= mad weights= .5 .5 .1 .5 missing= -99) interval(x5/std= range) ordinal(x6/order= desc);
the first option list defines x1–x4 as ratio variables to be standardized by the MAD method. Also, any missing values in x1–x4 should be replaced by –99. x1 is given a weight of 0.5, x2 is given a weight of 0.5, x3 is given a weight of 0.1, and x4 is given a weight of 0.5.
The second option list defines x5 as an interval variable to be standardized by the RANGE method. If the REPLACE option is specified in the PROC DISTANCE statement, missing values in x5 are replaced by the location estimate from the RANGE method. By default, x5 is given a weight of 1.
The last option list defines x6 as an ordinal variable. The scores are assigned from highest to lowest by its unformatted values. Although the STD= option is not specified, x6 is standardized by the default method (STD) because there is more than one level of measurement (ratio, interval, and ordinal) in the VAR statement. Again, if the REPLACE option is specified, missing values in x6 are replaced by the location estimate from the STD method. Finally, by default, x6 is given a weight of 1.
The options are described in more detail as follows.
STD=std-method
specifies the standardization method. Valid values for std-method are MEAN, MEDIAN, SUM, EUCLEN, USTD, STD, RANGE, MIDRANGE, MAXABS, IQR, MAD, ABW, AHUBER, AWAVE, AGK, SPACING, and L. Table 33.9 lists available methods of standardization as well as their corresponding location and scale measures.
Method | Scale | Location |
---|---|---|
MEAN | 1 | mean |
MEDIAN | 1 | median |
SUM | sum | 0 |
EUCLEN | Euclidean length | 0 |
USTD | standard deviation about origin | 0 |
STD | standard deviation | mean |
RANGE | range | minimum |
MIDRANGE | range/2 | midrange |
MAXABS | maximum absolute value | 0 |
IQR | interquartile range | median |
MAD | median absolute deviation from median | median |
ABW(c) | biweight A-estimate | biweight 1-step M-estimate |
AHUBER(c) | Huber A-estimate | Huber 1-step M-estimate |
AWAVE(c) | Wave A-estimate | Wave 1-step M-estimate |
AGK(p) | AGK estimate (ACECLUS) | mean |
SPACING(p) | minimum spacing | mid minimum-spacing |
L(p) | | |
These standardization methods are further documented in the section on the METHOD= option in the PROC STDIZE statement of the STDIZE procedure (see the section Standardization Methods in Chapter 84, The STDIZE Procedure).
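The following sketch shows one way to inspect the effect of a standardization method. The data set scores and its variables are hypothetical. With STD=RANGE, Table 33.9 pairs the minimum (location) with the range (scale), so one would expect each value to be rescaled as (value - minimum) / range; the OUTSDZ= data set can be printed to check this.

```sas
/* Hypothetical data; names are illustrative. */
data scores;
   input obs $ x1 x2;
   datalines;
a 10 200
b 20 400
c 40 300
;
run;

/* Standardize both interval variables by the RANGE method and */
/* save the standardized values in the OUTSDZ= data set.       */
proc distance data=scores method=euclid out=dist outsdz=stdized;
   var interval(x1 x2 / std=range);
   id obs;
run;

/* Inspect the standardized values. */
proc print data=stdized;
run;
```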
Standardization is not required if there is only one level of measurement, or if only asymmetric nominal and nominal levels are specified; otherwise, standardization is mandatory. When standardization is mandatory, a default method is provided when the STD= option is not specified. You can suppress the mandatory standardization by using the NOSTD option in the PROC DISTANCE statement. See the NOSTD option in the section PROC DISTANCE Statement and the section Mandatory Standardization for details.
The default method is STD for standardizing interval variables and MAXABS for standardizing ratio variables unless METHOD=GOWER or METHOD=DGOWER is specified. If METHOD=GOWER is specified, interval variables are standardized by the RANGE method, and whatever is specified in the STD= option is ignored; if METHOD=DGOWER is specified, the RANGE method is the default standardization method for interval variables. The MAXABS method is the default standardization method for ratio variables for both the GOWER and DGOWER methods.
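As a sketch of these defaults, assume a hypothetical data set mixed with one interval and one ratio variable. Because two levels of measurement are present and no STD= option is given, the first step should standardize temp by STD and mass by MAXABS; the second step suppresses that with NOSTD.

```sas
/* Hypothetical data; names are illustrative. */
data mixed;
   input obs $ temp mass;
   datalines;
a 21.5 3.2
b 19.0 7.8
c 25.4 1.1
;
run;

/* Two levels of measurement and no STD= option: standardization is */
/* mandatory, so the default methods described above apply.         */
proc distance data=mixed method=euclid out=d_default;
   var interval(temp) ratio(mass);
   id obs;
run;

/* NOSTD in the PROC DISTANCE statement suppresses the mandatory standardization. */
proc distance data=mixed method=euclid nostd out=d_raw;
   var interval(temp) ratio(mass);
   id obs;
run;
```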
Notice that a ratio variable should always be positive.
Table 33.10 lists standardization methods and the levels of measurement that can be accepted by each method. For example, the SUM method can be used to standardize ratio variables but not interval or ordinal variables. Also, the AGK and SPACING methods should not be used to standardize ordinal variables. If you apply AGK and SPACING to ranks, the results are degenerate because all the spacings of a given order are equal.
Standardization Method | Legitimate Levels of Measurement |
---|---|
MEAN | ratio, interval, ordinal |
MEDIAN | ratio, interval, ordinal |
SUM | ratio |
EUCLEN | ratio |
USTD | ratio |
STD | ratio, interval, ordinal |
RANGE | ratio, interval, ordinal |
MIDRANGE | ratio, interval, ordinal |
MAXABS | ratio |
IQR | ratio, interval, ordinal |
MAD | ratio, interval, ordinal |
ABW(c) | ratio, interval, ordinal |
AHUBER(c) | ratio, interval, ordinal |
AWAVE(c) | ratio, interval, ordinal |
AGK(p) | ratio, interval |
SPACING(p) | ratio, interval |
L(p) | ratio, interval, ordinal |
ABSENT=value
specifies the value to be used as an absence value in an irrelevant absent-absent match for asymmetric nominal variables. The absence value specified here overrides the absence value specified through the ABSENT= option in the PROC DISTANCE statement for those variables in the current variable list.
An absence value for a variable can be either a numeric value or a quoted string consisting of combinations of characters. For instance, ., –999, and "NA" are legal values for the ABSENT= option.
The default for an absence value for a character variable is "NONE" (notice that a blank value is considered a missing value), and the default for an absence value for a numeric variable is 0.
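A minimal sketch of a list-level absence value follows. The data set patients and its variables are hypothetical, and METHOD=DJACCARD is assumed here as a method that accepts asymmetric nominal variables. Because the symptoms are coded as the characters Y and N rather than the default "NONE", the option list declares "N" as the absence value.

```sas
/* Hypothetical data; names are illustrative. */
data patients;
   input pt $ fever $ cough $ rash $;
   datalines;
p1 Y N N
p2 Y Y N
p3 N N N
p4 N Y Y
;
run;

/* Treat "N" as absent, so that N-N pairs count as irrelevant */
/* absent-absent matches for these three variables.           */
proc distance data=patients method=djaccard out=dj;
   var anominal(fever cough rash / absent="N");
   id pt;
run;
```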
MISSING=miss-method | value
specifies the method or a numeric value for replacing missing values. If you omit the MISSING= option, the REPLACE option replaces missing values with the location measure given by the STD= option. Specify the MISSING= option when you want to replace missing values with a different value. You can specify any method that is valid in the STD= option. The corresponding location measure is used to replace missing values.
If a numeric value is given, the value replaces missing values after standardizing the data. However, when standardization is not mandatory, you can specify the REPONLY option with the MISSING= option to suppress standardization for cases in which you want only to replace missing values.
If the NOSTD option is specified, there is no standardization, but missing values are replaced by the corresponding location measures or by the numeric value of the MISSING= option. See the section Missing Values for details about missing values replacement with and without standardization.
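The replacement-only case can be sketched as follows. The data set readings is hypothetical. Only one level of measurement is present, so standardization is not mandatory, and REPONLY together with MISSING=MEDIAN should replace each missing value with the median of its variable without standardizing the data.

```sas
/* Hypothetical data with missing values (.); names are illustrative. */
data readings;
   input site $ t1 t2 t3;
   datalines;
s1 10  . 30
s2 12 22  .
s3  . 25 33
s4 11 24 31
;
run;

/* REPONLY: replace missing values only, with no standardization. */
/* MISSING=MEDIAN: use each variable's median as the replacement. */
proc distance data=readings method=euclid reponly out=d;
   var interval(t1-t3 / missing=median);
   id site;
run;
```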
ORDER=order-option
specifies the order for assigning scores to ordinal variables. The value for the ORDER= option can be one of the following:
ASCENDING | ASC
scores are assigned in lowest-to-highest order of unformatted values.
DESCENDING | DESC
scores are assigned in highest-to-lowest order of unformatted values.
ASCFORMATTED | ASCFMT
scores are assigned in ascending order by their formatted values. This option can be applied to character variables only, since unformatted values are always used for numeric variables.
DESFORMATTED | DESFMT
scores are assigned in descending order by their formatted values. This option can be applied to character variables only, since unformatted values are always used for numeric variables.
DSORDER | DATA
scores are assigned according to the order of their appearance in the input data set.
The default value is ASCENDING.
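As a sketch, the following step reverses the default scoring order for a numerically coded ordinal variable. The data set shirts and its variables are hypothetical; ORDER=DESC is the same abbreviation used in the VAR statement example earlier in this section.

```sas
/* Hypothetical data; size is coded 1 (small) to 3 (large). */
data shirts;
   input item $ size;
   datalines;
tee  1
polo 3
tank 2
;
run;

/* Assign scores from the highest to the lowest unformatted value of size. */
proc distance data=shirts method=euclid out=d;
   var ordinal(size / order=desc);
   id item;
run;
```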
WEIGHTS=weight-list
specifies a list of values for weighting individual variables while computing the proximity. Values in this list can be separated by blanks or commas. You can include one or more items of the form start TO stop BY increment. This list should contain at least one weight. The maximum number of weights you can list is equal to the number of variables. If the number of weights is less than the number of variables, the last value in the weight-list is used for the rest of the variables; conversely, if the number of weights is greater than the number of variables, the trailing weights are discarded.
The default value is 1.
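The two forms of the weight list can be sketched as follows, using a hypothetical data set m. In the first step the list is shorter than the variable list, so by the rule above x1 should receive a weight of 2 and x2 through x4 should reuse the last weight, 1. The second step uses the start TO stop BY increment form to supply one weight per variable.

```sas
/* Hypothetical data; names are illustrative. */
data m;
   input obs $ x1-x4;
   datalines;
a 1 2 3 4
b 2 4 6 8
c 3 1 4 1
;
run;

/* Short list: x1 gets 2; the last weight (1) is reused for x2-x4. */
proc distance data=m method=euclid out=d_short;
   var interval(x1-x4 / weights=2 1);
   id obs;
run;

/* Range form: weights 1, 2, 3, 4 for x1, x2, x3, x4. */
proc distance data=m method=euclid out=d_range;
   var interval(x1-x4 / weights=1 to 4 by 1);
   id obs;
run;
```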