The CLUSTER Procedure 
PROC CLUSTER Statement 
The PROC CLUSTER statement starts the CLUSTER procedure, specifies a clustering method, and optionally specifies details for clustering methods, data sets, data processing, and displayed output.
The METHOD=name specification determines the clustering method used by the procedure. You can specify any one of the following 11 methods for name:
AVERAGE
requests average linkage (group average, unweighted pair-group method using arithmetic averages, UPGMA). Distance data are squared unless you specify the NOSQUARE option.
CENTROID
requests the centroid method (unweighted pair-group method using centroids, UPGMC, centroid sorting, weighted-group method). Distance data are squared unless you specify the NOSQUARE option.
COMPLETE
requests complete linkage (furthest neighbor, maximum method, diameter method, rank order typal analysis). To reduce distortion of clusters by outliers, the TRIM= option is recommended.
DENSITY
requests density linkage, which is a class of clustering methods using nonparametric probability density estimation. You must also specify the K=, R=, or HYBRID option to indicate the type of density estimation to be used. See also the MODE= and DIM= options in this section.
EML
requests maximum-likelihood hierarchical clustering for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions. Use METHOD=EML only with coordinate data. See the PENALTY= option for details. The NONORM option does not affect the reported likelihood values but does affect other unrelated criteria. The EML method is much slower than the other methods in the CLUSTER procedure.
FLEXIBLE
requests the Lance-Williams flexible-beta method. See the BETA= option in this section.
MCQUITTY
requests McQuitty’s similarity analysis (weighted average linkage, weighted pair-group method using arithmetic averages, WPGMA).
MEDIAN
requests Gower’s median method (weighted pair-group method using centroids, WPGMC). Distance data are squared unless you specify the NOSQUARE option.
SINGLE
requests single linkage (nearest neighbor, minimum method, connectedness method, elementary linkage analysis, or dendritic method). To reduce chaining, you can use the TRIM= option with METHOD=SINGLE.
TWOSTAGE
requests two-stage density linkage. You must also specify the K=, R=, or HYBRID option to indicate the type of density estimation to be used. See also the MODE= and DIM= options in this section.
WARD
requests Ward’s minimum-variance method (error sum of squares, trace W). Distance data are squared unless you specify the NOSQUARE option. To reduce distortion by outliers, the TRIM= option is recommended. See the NONORM option.
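As a minimal sketch of how METHOD= is specified, the following steps run Ward's method and density linkage on the same coordinate data. The data set WORK.POINTS, its variables name, x, and y, and the output data set names are hypothetical and chosen only for illustration.

```sas
/* Hypothetical coordinate data: six labeled points in two dimensions */
data work.points;
   input name $ x y;
   datalines;
p1 1.0 1.1
p2 1.2 0.9
p3 5.0 5.2
p4 5.1 4.8
p5 9.0 1.0
p6 9.2 1.3
;
run;

/* Ward's minimum-variance method on coordinate data */
proc cluster data=work.points method=ward outtree=work.tree_ward;
   var x y;
   id name;
run;

/* Density linkage requires K=, R=, or HYBRID to select the density estimate */
proc cluster data=work.points method=density k=3 outtree=work.tree_den;
   var x y;
   id name;
run;
```

The value K=3 satisfies the requirement that the number of neighbors be at least two but less than the number of observations.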
Table 29.1 summarizes the options in the PROC CLUSTER statement.
Option       Description

Specify input and output data sets
DATA=        specifies input data set
OUTTREE=     creates output data set

Specify clustering methods
METHOD=      specifies clustering method
BETA=        specifies beta value for flexible beta method
MODE=        specifies the minimum number of members for modal clusters
PENALTY=     specifies the penalty coefficient for maximum likelihood
HYBRID       specifies Wong’s hybrid clustering method

Control data processing prior to clustering
NOEIGEN      suppresses computation of eigenvalues
NONORM       suppresses normalizing of distances
NOSQUARE     suppresses squaring of distances
STANDARD     standardizes variables
TRIM=        omits points with low probability densities

Control density estimation
K=           specifies number of neighbors for kth-nearest-neighbor density estimation
R=           specifies radius of sphere of support for uniform-kernel density estimation

Ties
NOTIE        suppresses checking for ties

Control display of the cluster history
CCC          displays cubic clustering criterion
NOID         suppresses display of ID values
PRINT=       specifies number of generations to display
PSEUDO       displays pseudo F and t² statistics
RMSSTD       displays root mean square standard deviation
RSQUARE      displays R square and semipartial R square

Control other aspects of output
NOPRINT      suppresses display of all output
SIMPLE       displays simple summary statistics
PLOTS=       specifies ODS graphics details

The following list provides details on these options.
BETA=n
specifies the beta parameter for METHOD=FLEXIBLE. The value of n should be less than 1, usually between 0 and -1. By default, BETA=-0.25. Milligan (1987) suggests a somewhat smaller value, perhaps -0.5, for data with many outliers.
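The following sketch shows one way to request the flexible-beta method with the smaller beta value suggested for data with many outliers. The data set WORK.POINTS and its variables x and y are hypothetical, generated here only so that the step can run.

```sas
/* Hypothetical coordinate data: three loose groups in two dimensions */
data work.points;
   call streaminit(2009);
   do i = 1 to 150;
      g = mod(i, 3);
      x = 4*g + rand('normal');
      y = 2*g + rand('normal');
      output;
   end;
   keep x y;
run;

/* Flexible-beta method with a beta value below the default */
proc cluster data=work.points method=flexible beta=-0.5
             outtree=work.tree_flex;
   var x y;
run;
```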
CCC
displays the cubic clustering criterion and approximate expected R square under the uniform null hypothesis (Sarle 1983). The statistics associated with the RSQUARE option, R square and semipartial R square, are also displayed. The CCC option applies only to coordinate data. The CCC option is not appropriate with METHOD=SINGLE because of the method’s tendency to chop off tails of distributions. Computation of the CCC requires the eigenvalues of the covariance matrix. If the number of variables is large, computing the eigenvalues requires much computer time and memory.
DATA=SAS-data-set
names the input data set containing observations to be clustered. By default, the procedure uses the most recently created SAS data set. If the data set is TYPE=DISTANCE, the data are interpreted as a distance matrix; the number of variables must equal the number of observations in the data set or in each BY group. The distances are assumed to be Euclidean, but the procedure accepts other types of distances or dissimilarities. If the data set is not TYPE=DISTANCE, the data are interpreted as coordinates in a Euclidean space, and Euclidean distances are computed. For more about TYPE=DISTANCE data sets, see Appendix A, Special SAS Data Sets.
You cannot use a TYPE=CORR data set as input to PROC CLUSTER, since the procedure uses dissimilarity measures. Instead, you can use a DATA step or the IML procedure to extract the correlation matrix from a TYPE=CORR data set and transform the values to dissimilarities such as 1 - r or 1 - r², where r is the correlation.
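The DATA step approach described above might look like the following sketch, which clusters variables rather than observations. The data set WORK.MEASURES, the variables v1 through v4, and the choice of 1 - r² as the dissimilarity are assumptions made for illustration only.

```sas
/* Hypothetical data: four numeric variables, two correlated pairs */
data work.measures;
   call streaminit(1);
   do i = 1 to 50;
      v1 = rand('normal');
      v2 = v1 + rand('normal');
      v3 = rand('normal');
      v4 = v3 + rand('normal');
      output;
   end;
   drop i;
run;

/* Write the correlation matrix to a TYPE=CORR data set */
proc corr data=work.measures outp=work.corrmat noprint;
   var v1-v4;
run;

/* Keep only the correlation rows and convert r to 1 - r**2 */
data work.dissim(type=distance);
   set work.corrmat;
   where _type_ = 'CORR';
   array v{*} v1-v4;
   do j = 1 to dim(v);
      v{j} = 1 - v{j}**2;
   end;
   drop j _type_;
run;

/* NOSQUARE keeps the dissimilarities from being squared again */
proc cluster data=work.dissim method=average nosquare
             outtree=work.vartree;
   var v1-v4;
   id _name_;
run;
```

Because these dissimilarities are not Euclidean distances, statistics that are defined in a Euclidean coordinate system should be interpreted with caution.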
All methods produce the same results when used with coordinate data as when used with Euclidean distances computed from the coordinates. However, the DIM= option must be used with distance data if you specify METHOD=TWOSTAGE or METHOD=DENSITY or if you specify the TRIM= option.
Certain methods that are most naturally defined in terms of coordinates require squared Euclidean distances to be used in the combinatorial distance formulas (Lance and Williams 1967). For this reason, distance data are automatically squared when used with METHOD=AVERAGE, METHOD=CENTROID, METHOD=MEDIAN, or METHOD=WARD. If you want the combinatorial formulas to be applied to the (unsquared) distances with these methods, use the NOSQUARE option.
DIM=n
specifies the dimensionality used when computing density estimates with the TRIM= option, METHOD=DENSITY, or METHOD=TWOSTAGE. The value of n must be greater than or equal to 1. The default is the number of variables if the data are coordinates; the default is 1 if the data are distances.
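As a sketch of why DIM= matters with distance data, the following steps build a TYPE=DISTANCE data set from two-dimensional coordinates and then tell PROC CLUSTER that the underlying dimensionality is 2 rather than the default of 1. The use of PROC DISTANCE to build the distance matrix, and the data set and variable names, are assumptions made for illustration.

```sas
/* Hypothetical coordinate data: six labeled points in two dimensions */
data work.points;
   input name $ x y;
   datalines;
p1 1.0 1.1
p2 1.2 0.9
p3 5.0 5.2
p4 5.1 4.8
p5 9.0 1.0
p6 9.2 1.3
;
run;

/* Euclidean distance matrix, written as a TYPE=DISTANCE data set */
proc distance data=work.points out=work.dist method=euclid;
   var interval(x y);
   id name;
run;

/* Density linkage on distance data: DIM=2 matches the original coordinates */
proc cluster data=work.dist method=density k=3 dim=2
             outtree=work.tree_dist;
   id name;
run;
```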
HYBRID
requests Wong’s (1982) hybrid clustering method in which density estimates are computed from a preliminary cluster analysis using the k-means method. The DATA= data set must contain means, frequencies, and root mean square standard deviations of the preliminary clusters (see the FREQ and RMSSTD statements). To use HYBRID, you must use either a FREQ statement or a DATA= data set that contains a _FREQ_ variable, and you must also use either an RMSSTD statement or a DATA= data set that contains an _RMSSTD_ variable.
The MEAN= data set produced by the FASTCLUS procedure is suitable for input to the CLUSTER procedure for hybrid clustering. Since this data set contains _FREQ_ and _RMSSTD_ variables, you can use it as input and then omit the FREQ and RMSSTD statements.
You must specify either METHOD=DENSITY or METHOD=TWOSTAGE with the HYBRID option. You cannot use this option in combination with the TRIM=, K=, or R= option.
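A minimal sketch of the hybrid workflow follows: PROC FASTCLUS produces preliminary clusters, and its MEAN= data set is passed to PROC CLUSTER with the HYBRID option. The data set names, the variables x and y, and the choice of 10 preliminary clusters are hypothetical.

```sas
/* Hypothetical coordinate data: three loose groups in two dimensions */
data work.points;
   call streaminit(2009);
   do i = 1 to 300;
      g = mod(i, 3);
      x = 4*g + rand('normal');
      y = 2*g + rand('normal');
      output;
   end;
   keep x y;
run;

/* Preliminary k-means analysis; MEAN= holds cluster means, _FREQ_, _RMSSTD_ */
proc fastclus data=work.points maxclusters=10 mean=work.prelim noprint;
   var x y;
run;

/* Hybrid clustering: no FREQ or RMSSTD statement is needed here */
proc cluster data=work.prelim method=twostage hybrid
             outtree=work.tree_hyb;
   var x y;
run;
```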
K=n
specifies the number of neighbors to use for kth-nearest-neighbor density estimation (Silverman 1986, pp. 19–21 and 96–99). The number of neighbors (n) must be at least two but less than the number of observations. See the MODE= option, which follows.
Density estimation is used with the TRIM=, METHOD=DENSITY, and METHOD=TWOSTAGE options.
MODE=n
specifies that, when two clusters are joined, each must have at least n members in order for either cluster to be designated a modal cluster. If you specify MODE=1, each cluster must also have a maximum density greater than the fusion density in order for either cluster to be designated a modal cluster.
Use the MODE= option only with METHOD=DENSITY or METHOD=TWOSTAGE. With METHOD=TWOSTAGE, the MODE= option affects the number of modal clusters formed. With METHOD=DENSITY, the MODE= option does not affect the clustering process but does determine the number of modal clusters reported on the output and identified by the _MODE_ variable in the output data set.
If you specify the K= option, the default value of MODE= is the same as the value of K= because the use of kth-nearest-neighbor density estimation limits the resolution that can be obtained for clusters with fewer than k members. If you do not specify the K= option, the default is MODE=2.
If you specify MODE=0, the default value is used instead of 0.
If you specify a FREQ statement or if a _FREQ_ variable appears in the input data set, the MODE= value is compared with the number of actual observations in the clusters being joined, not with the sum of the frequencies in the clusters.
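The interaction between K= and MODE= can be sketched as follows. Without MODE=, the default here would equal the K= value of 10; the step overrides it so that clusters with as few as 5 members can be designated modal. The data set and variable names are hypothetical.

```sas
/* Hypothetical coordinate data: three loose groups in two dimensions */
data work.points;
   call streaminit(2009);
   do i = 1 to 150;
      g = mod(i, 3);
      x = 4*g + rand('normal');
      y = 2*g + rand('normal');
      output;
   end;
   keep x y;
run;

/* Two-stage density linkage: K= sets the density estimate, MODE= the
   minimum size of a modal cluster (otherwise it would default to 10) */
proc cluster data=work.points method=twostage k=10 mode=5
             outtree=work.tree_two;
   var x y;
run;
```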
NOEIGEN
suppresses computation of the eigenvalues of the covariance matrix and substitutes the variances of the variables for the eigenvalues when computing the cubic clustering criterion. The NOEIGEN option saves time if the number of variables is large, but it should be used only if the variables are nearly uncorrelated. If you specify the NOEIGEN option and the variables are highly correlated, the cubic clustering criterion might be very liberal. The NOEIGEN option applies only to coordinate data.
NOID
suppresses the display of ID values for the clusters joined at each generation of the cluster history.
NONORM
prevents the distances from being normalized to unit mean or unit root mean square with most methods. With METHOD=WARD, the NONORM option prevents the between-cluster sum of squares from being normalized by the total sum of squares to yield a squared semipartial correlation. The NONORM option does not affect the reported likelihood values with METHOD=EML, but it does affect other unrelated criteria, such as the _DIST_ variable.
NOPRINT
suppresses the display of all output. Note that this option temporarily disables the Output Delivery System (ODS). For more information, see Chapter 20, Using the Output Delivery System.
NOSQUARE
prevents input distances from being squared with METHOD=AVERAGE, METHOD=CENTROID, METHOD=MEDIAN, or METHOD=WARD.
If you specify the NOSQUARE option with distance data, the data are assumed to be squared Euclidean distances for computing R square and related statistics defined in a Euclidean coordinate system.
If you specify the NOSQUARE option with coordinate data with METHOD=CENTROID, METHOD=MEDIAN, or METHOD=WARD, then the combinatorial formula is applied to unsquared Euclidean distances. The resulting cluster distances do not have their usual Euclidean interpretation and are therefore labeled "False" in the output.
NOTIE
prevents PROC CLUSTER from checking for ties for minimum distance between clusters at each generation of the cluster history. If your data are measured with such precision that ties are unlikely, then you can specify the NOTIE option to slightly reduce the time and space required by the procedure. See the section Ties for more information.
OUTTREE=SAS-data-set
creates an output data set that can be used by the TREE procedure to draw a tree diagram. You must give the data set a two-level name to save it. See SAS Language Reference: Concepts for a discussion of permanent data sets. If you omit the OUTTREE= option, the data set is named by using the DATAn convention and is not permanently saved. If you do not want to create an output data set, use OUTTREE=_NULL_.
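As a sketch of how the OUTTREE= data set is typically consumed, the following steps pass it to PROC TREE to assign each observation to one of three clusters. The data set names, the variables x and y, and the choice of three clusters are hypothetical.

```sas
/* Hypothetical coordinate data: three loose groups in two dimensions */
data work.points;
   call streaminit(2009);
   do i = 1 to 150;
      g = mod(i, 3);
      x = 4*g + rand('normal');
      y = 2*g + rand('normal');
      output;
   end;
   keep x y;
run;

/* Save the cluster history; a one-level or WORK name is temporary */
proc cluster data=work.points method=average outtree=work.tree noprint;
   var x y;
run;

/* Cut the tree at three clusters and keep the coordinates alongside */
proc tree data=work.tree nclusters=3 out=work.assign noprint;
   copy x y;
run;
```

To keep the tree beyond the current session, replace WORK.TREE with a two-level name that refers to a permanent library.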
PENALTY=p
specifies the penalty coefficient used with METHOD=EML. See the section Clustering Methods for more information. Values for p must be greater than zero. By default, PENALTY=2.
PLOTS=
controls the plots produced through ODS Graphics.
PROC CLUSTER can produce line plots of the cubic clustering criterion, the pseudo F statistic, and the pseudo t² statistic from the cluster history table. These statistics are useful for estimating the number of clusters. Each statistic is plotted against the number of clusters.
To obtain ODS Graphics plots from PROC CLUSTER, you must do two things. First, enable ODS Graphics before running PROC CLUSTER. For example:
   ods graphics on;
   proc cluster plots=all;
   run;
   ods graphics off;
Second, request that PROC CLUSTER compute the desired statistics by specifying the CCC or PSEUDO option, or by specifying the statistics in a plot-request in the PLOTS= option. PROC CLUSTER might be unable to compute the statistics in some cases; for details, see the CCC and PSEUDO options. If a statistic cannot be computed, it cannot be plotted. PROC CLUSTER plots all of these statistics that are computed unless you tell it specifically what to plot by using PLOTS=.
The maximum number of clusters shown in all the plots is the minimum of the following quantities:
the number of observations
the value of the PRINT= option, if that option is specified
the maximum number of clusters for which CCC is computed, if CCC is plotted
The global-plot-options apply to all plots generated by the CLUSTER procedure. The global-plot-options are as follows:
UNPACKPANELS
breaks a plot that is otherwise paneled into separate plots for each statistic. This option can be abbreviated as UNPACK.
ONLY
has no effect, but is accepted for consistency with other procedures.
The following plot-requests can be specified:
ALL
implicitly specifies the CCC and PSEUDO options and, if possible, produces all three plots.
NONE
suppresses all plots.
CCC
implicitly specifies the CCC option and, if possible, plots the cubic clustering criterion against the number of clusters.
PSEUDO
implicitly specifies the PSEUDO option and, if possible, plots the pseudo F statistic and the pseudo t² statistic against the number of clusters.
PSF
implicitly specifies the PSEUDO option and, if possible, plots the pseudo F statistic against the number of clusters.
PST2
implicitly specifies the PSEUDO option and, if possible, plots the pseudo t² statistic against the number of clusters.
When you specify only one plot-request, you can omit the parentheses around the plot-request. You can specify one or more of the CCC, PSEUDO, PSF, or PST2 plot-requests in the same PLOTS= option. For example, all of the following are valid:
   PROC CLUSTER PLOTS=(CCC PST2);
   PROC CLUSTER PLOTS=(PSF);
   PROC CLUSTER PLOTS=PSF;
The first statement plots both the cubic clustering criterion and the pseudo t² statistic, while the second and third statements plot the pseudo F statistic only.
The names of the graphs that PROC CLUSTER generates are listed in Table 29.5, along with the required statements and options.
PRINT=n | P=n
specifies the number of generations of the cluster history to display. The PRINT= option displays the latest n generations; for example, PRINT=5 displays the cluster history from 1 cluster through 5 clusters. The value of PRINT= must be a nonnegative integer. The default is to display all generations. Specify PRINT=0 to suppress the cluster history.
PSEUDO
displays pseudo F and t² statistics. This option is effective only when the data are coordinates or when METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD is specified. See the section Miscellaneous Formulas for more information. The PSEUDO option is not appropriate with METHOD=SINGLE because of the method’s tendency to chop off tails of distributions.
R=n
specifies the radius of the sphere of support for uniform-kernel density estimation (Silverman 1986, pp. 11–13 and 75–94). The value of R= must be greater than zero.
Density estimation is used with the TRIM=, METHOD=DENSITY, and METHOD=TWOSTAGE options.
RMSSTD
displays the root mean square standard deviation of each cluster. This option is effective only when the data are coordinates or when METHOD=AVERAGE, METHOD=CENTROID, or METHOD=WARD is specified.
See the section Miscellaneous Formulas for more information.
RSQUARE
displays the R square and semipartial R square. This option is effective only when the data are coordinates or when METHOD=AVERAGE or METHOD=CENTROID is specified. The R square and semipartial R square statistics are always displayed with METHOD=WARD. See the section Miscellaneous Formulas for more information.
SIMPLE
displays means, standard deviations, skewness, kurtosis, and a coefficient of bimodality. The SIMPLE option applies only to coordinate data. See the section Miscellaneous Formulas for more information.
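The display options described above can be combined in a single PROC CLUSTER statement, as in the following sketch. The data set WORK.POINTS and its variables are hypothetical; PRINT=10 limits the cluster history to the last 10 generations.

```sas
/* Hypothetical coordinate data: three loose groups in two dimensions */
data work.points;
   call streaminit(2009);
   do i = 1 to 150;
      g = mod(i, 3);
      x = 4*g + rand('normal');
      y = 2*g + rand('normal');
      output;
   end;
   keep x y;
run;

/* Request summary statistics and the criteria used to judge
   the number of clusters, for the last 10 generations only */
proc cluster data=work.points method=average
             simple ccc pseudo rsquare rmsstd print=10;
   var x y;
run;
```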
STANDARD
standardizes the variables to mean 0 and standard deviation 1. The STANDARD option applies only to coordinate data.
TRIM=p
omits points with low estimated probability densities from the analysis. Valid values for the TRIM= option are 0 ≤ p < 100. If p < 1, then p is the proportion of observations omitted. If p ≥ 1, then p is interpreted as a percentage. A specification of TRIM=10, which trims 10% of the points, is a reasonable value for many data sets. Densities are estimated by the kth-nearest-neighbor or uniform-kernel method. Trimmed points are indicated by a negative value of the _FREQ_ variable in the OUTTREE= data set.
You must use either the K= or R= option when you use TRIM=. You cannot use the HYBRID option in combination with TRIM=, so you might want to use the DIM= option instead. If you specify the STANDARD option in combination with TRIM=, the variables are standardized both before and after trimming.
The TRIM= option is useful for removing outliers and reducing chaining. Trimming is highly recommended with METHOD=WARD or METHOD=COMPLETE because clusters from these methods can be severely distorted by outliers. Trimming is also valuable with METHOD=SINGLE since single linkage is the method most susceptible to chaining. Most other methods also benefit from trimming. However, trimming is unnecessary with METHOD=TWOSTAGE or METHOD=DENSITY when kth-nearest-neighbor density estimation is used.
Use of the TRIM= option can spuriously inflate the cubic clustering criterion and the pseudo F and t² statistics. Trimming only outliers improves the accuracy of the statistics, but trimming saddle regions between clusters yields excessively large values.
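A minimal sketch of trimming with Ward's method follows. TRIM=10 omits the 10% of points with the lowest estimated densities, and K=5 selects kth-nearest-neighbor density estimation, which TRIM= requires. The data set, the variable names, and the specific values are hypothetical.

```sas
/* Hypothetical coordinate data: three loose groups in two dimensions */
data work.points;
   call streaminit(2009);
   do i = 1 to 150;
      g = mod(i, 3);
      x = 4*g + rand('normal');
      y = 2*g + rand('normal');
      output;
   end;
   keep x y;
run;

/* Trim low-density points before applying Ward's method;
   K= (or R=) must accompany TRIM= */
proc cluster data=work.points method=ward trim=10 k=5
             outtree=work.tree_trim;
   var x y;
run;
```

Trimmed observations can be identified afterward by a negative value of _FREQ_ in WORK.TREE_TRIM.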
Copyright © 2009 by SAS Institute Inc., Cary, NC, USA. All rights reserved.