The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data,
not defined a priori, such that objects in a given cluster tend to be similar to each other in some
sense, and objects in different clusters tend to be dissimilar. You can also use cluster analysis to
summarize data rather than to find "natural" or "real" clusters; this use of clustering is sometimes
called dissection. The SAS/STAT procedures for clustering are oriented toward disjoint or hierarchical
clusters from coordinate data, distance data, or a correlation or covariance matrix.
The SAS/STAT cluster analysis procedures include the following:
 ACECLUS Procedure: Obtains approximate estimates of the pooled within-cluster covariance matrix when the clusters
are assumed to be multivariate normal with equal covariance matrices
 CLUSTER Procedure: Hierarchically clusters the observations in a SAS data set
 DISTANCE Procedure: Computes various measures of distance, dissimilarity, or similarity between the observations
(rows) of a SAS data set. Proximity measures are stored as a lower triangular matrix or a square matrix in an output
data set that can then be used as input to the CLUSTER, MDS, and MODECLUS procedures.
 FASTCLUS Procedure: Performs a disjoint cluster analysis on the basis of distances computed from one or more
quantitative variables
 MODECLUS Procedure: Clusters observations in a SAS data set by using algorithms based on nonparametric density estimates
 TREE Procedure: Produces a tree diagram, also known as a dendrogram or phenogram, from a data set created by the
CLUSTER or VARCLUS procedure
 VARCLUS Procedure: Divides a set of numeric variables into disjoint or hierarchical clusters
ACECLUS Procedure
The ACECLUS procedure obtains approximate estimates of the pooled within-cluster covariance matrix when the clusters are assumed to
be multivariate normal with equal covariance matrices. Neither cluster membership nor the number of clusters needs to be known.
PROC ACECLUS is useful for preprocessing data to be subsequently clustered by the CLUSTER or FASTCLUS procedure.
The procedure enables you to do the following:
 choose between the following clustering methods:
    use a number of pairs, m, with the smallest distances to form the estimate at each iteration
    use all pairs closer than a given cutoff value to form the estimate at each iteration
 specify the metric in which computations are performed
 specify the number or proportion of pairs for estimating the within-cluster covariance
 specify the threshold for including pairs in the estimation of the within-cluster covariance

 perform BY group processing, which enables you to obtain separate analyses on grouped observations
 perform weighted analysis
 create a data set that corresponds to any output table
 create a data set that contains means, standard deviations, number of observations, covariances,
estimated within-cluster covariances, eigenvalues, and canonical coefficients
 produce a P-P plot of distances between pairs from the last iteration
 produce a Q-Q plot of a power transformation of distances between pairs from the last iteration

For further details, see ACECLUS Procedure.
CLUSTER Procedure
The CLUSTER procedure hierarchically clusters the observations in a SAS data set by using one of 11 methods.
The data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly squared) Euclidean distances.
The following are highlights of the CLUSTER procedure's features:
 supports the following clustering methods:
    average linkage
    centroid method
    complete linkage
    density linkage (including Wong's hybrid and kth-nearest-neighbor methods)
    maximum likelihood for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions
    flexible-beta method
    McQuitty's similarity analysis
    median method
    single linkage
    two-stage density linkage
    Ward's minimum-variance method

 displays a history of the clustering process, showing statistics useful for estimating
the number of clusters in the population from which the data are sampled
 creates a data set that can be used by the TREE procedure to draw a tree diagram of the
cluster hierarchy or to output the cluster membership at any desired level
 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 creates a data set that corresponds to any output table
 automatically produces graphs by using ODS Graphics
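
To make the idea of hierarchical (agglomerative) clustering concrete, the following is a minimal sketch of single linkage, one of the methods listed above. It is written in Python rather than SAS and is not how PROC CLUSTER is implemented; the function name `single_linkage` and the sample points are assumptions made for illustration. The returned history plays a role loosely similar to the clustering history that the procedure displays.

```python
from math import dist

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    # Start with every observation in its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(points))]
    history = []  # (members_a, members_b, merge_distance), one entry per merge

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        d, ia, ib = min(
            (linkage(clusters[ia], clusters[ib]), ia, ib)
            for ia in range(len(clusters))
            for ib in range(ia + 1, len(clusters))
        )
        history.append((tuple(clusters[ia]), tuple(clusters[ib]), d))
        merged = sorted(clusters[ia] + clusters[ib])
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return clusters, history

# Illustrative coordinate data (not from the SAS documentation).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, history = single_linkage(points, n_clusters=3)
```

Other linkage methods differ mainly in how the distance between two clusters is defined, so in this sketch only the inner `linkage` function would change.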

For further details, see CLUSTER Procedure.
DISTANCE Procedure
The DISTANCE procedure computes various measures of distance, dissimilarity, or similarity between the observations (rows) of an
input SAS data set, which can contain numeric or character variables, or both, depending on which proximity measure is used.
The proximity measures are stored as a lower triangular matrix or a square matrix in an output data set that can then be used as
input to the CLUSTER, MDS, and MODECLUS procedures.
The following are highlights of the DISTANCE procedure's features:
 provides various nonparametric and parametric methods for standardizing variables
 provides proximity measures that accept four levels of measurement: nominal, ordinal, interval, and ratio

 supports BY group processing, which enables you to obtain separate analyses on grouped observations
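
As an illustration of what a lower triangular proximity matrix looks like, here is a small Python sketch (not SAS code) that standardizes each variable and then computes Euclidean distances between rows. The helper names `standardize` and `lower_triangle` and the sample data are hypothetical, and standardizing to mean 0 and standard deviation 1 is only one of many possible choices.

```python
from math import sqrt
from statistics import mean, stdev

def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]  # assumes no constant column
    return [
        [(value - c) / s for value, c, s in zip(row, centers, scales)]
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lower_triangle(rows):
    """Return distances as a lower triangular matrix: row i has i + 1 entries."""
    return [
        [euclidean(rows[i], rows[j]) for j in range(i + 1)]
        for i in range(len(rows))
    ]

# Illustrative data: three observations, two interval-scaled variables.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
proximities = lower_triangle(standardize(data))
```

Because distance is symmetric and each diagonal entry is zero, the lower triangle carries the same information as the full square matrix.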

For further details, see DISTANCE Procedure.
FASTCLUS Procedure
The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.
The observations are divided into clusters such that every observation belongs to one and only one cluster.
The following are highlights of the procedure's features:
 uses Euclidean distances, so the cluster centers are based on least squares estimation (k-means model)
 is designed to find good clusters (but not necessarily the best possible clusters) with only two or three
passes through the data set
 can be an effective procedure for detecting outliers because outliers often appear as clusters with only one member
 can use an L_p (least pth powers) clustering criterion
 is intended for use with large data sets, with 100 or more observations

 uses algorithms that place a larger influence on variables with larger variance
 produces brief summaries of the clusters
 produces an output data set containing a cluster membership variable
 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 computes weighted cluster means
 creates a SAS data set that corresponds to any output table
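
The k-means model mentioned above can be sketched in a few lines of Python (not SAS). This is a generic assign-and-update loop under simplifying assumptions; it does not reproduce the seed selection or update rules of PROC FASTCLUS, and the function name `kmeans`, the seeds, and the data are made up for illustration.

```python
from math import dist

def kmeans(points, seeds, max_iter=10):
    """Generic k-means: assign to the nearest center, then recompute means."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation joins exactly one cluster.
        new_assignment = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: the least squares center of a cluster is its mean.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

# Illustrative data: two tight groups and one distant observation.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (50.0, 50.0)]
assignment, centers = kmeans(points, seeds=[points[0], points[2], points[4]])
```

In this example the distant observation ends up alone in its own cluster, which is the pattern the outlier-detection highlight describes.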

For further details, see FASTCLUS Procedure.
MODECLUS Procedure
The MODECLUS procedure clusters observations in a SAS data set by using any of several algorithms based on nonparametric density estimates.
The following are highlights of the MODECLUS procedure's features:
 accepts input data as numeric coordinates or distances
 performs approximate significance tests for the number of clusters and can hierarchically join nonsignificant clusters
 produces output data sets containing:
    density estimates and cluster membership
    various cluster statistics including approximate p-values
    a summary of the number of clusters generated by various algorithms
    smoothing parameters
    significance levels

 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 creates a SAS data set that corresponds to any output table
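
The general idea of clustering from a nonparametric density estimate can be sketched as follows. This Python example is a deliberately simplified illustration and is not one of the MODECLUS algorithms: it assumes a uniform kernel with a fixed radius (a smoothing parameter), estimates density by counting neighbors, and lets each observation climb to a local density mode. The name `mode_clusters` and the data are hypothetical.

```python
from math import dist

def mode_clusters(points, radius):
    """Assign each point to a local density mode found by hill climbing."""
    n = len(points)
    # Neighborhoods: indices of all points within `radius` (including itself).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Uniform-kernel density estimate, up to a constant: the neighbor count.
    density = [len(nb) for nb in neighbors]
    # Each point links to its densest neighbor; ties go to the lowest index.
    parent = [max(nb, key=lambda j: (density[j], -j)) for nb in neighbors]

    def mode_of(i):
        # Follow links uphill until reaching a point that links to itself.
        while parent[i] != i:
            i = parent[i]
        return i

    return [mode_of(i) for i in range(n)], density

# Illustrative one-dimensional data with two well-separated groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels, density = mode_clusters(points, radius=1.5)
```

Each label is the index of the mode that the observation reaches, so the number of distinct labels is the number of clusters found for that radius.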

For further details, see MODECLUS Procedure.
TREE Procedure
The TREE procedure reads a data set created by the CLUSTER or VARCLUS procedure and produces a tree diagram (also known as a
dendrogram or phenogram), which displays the results of a hierarchical clustering analysis as a tree structure.
The following are highlights of the TREE procedure's features:
 produces a tree diagram that is oriented either vertically, with the root at the top, or horizontally,
with the root at the left
 provides options that enable you to control the displayed diagram

 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 creates a SAS data set that corresponds to any output table
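
A tree from a hierarchical clustering can also be cut at a chosen level to recover cluster membership. The Python sketch below (not SAS) shows that idea with a small union-find structure; the merge history format, the function name `cut_tree`, and the heights are assumptions for illustration and do not reflect the layout of the data sets that the CLUSTER or VARCLUS procedure creates.

```python
def cut_tree(n_leaves, merges, max_height):
    """Label leaves by applying only the merges at or below `max_height`."""
    parent = list(range(n_leaves))

    def find(i):
        # Walk up to the representative of i's group, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, height in merges:
        if height <= max_height:
            parent[find(b)] = find(a)  # join the two groups
    return [find(i) for i in range(n_leaves)]

# Hypothetical merge history: (leaf, leaf, height at which their groups join).
merges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 6.4), (3, 4, 7.8)]
labels = cut_tree(5, merges, max_height=2.0)
all_joined = cut_tree(5, merges, max_height=10.0)
```

Cutting low in the tree gives many small clusters, and cutting above the highest merge puts every observation in one cluster.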

For further details, see TREE Procedure.
VARCLUS Procedure
The VARCLUS procedure divides a set of numeric variables into disjoint or hierarchical clusters.
Associated with each cluster is a linear combination of the variables in the cluster. This linear combination can be
either the first principal component or the centroid component.
The following are highlights of the VARCLUS procedure's features:
 attempts to maximize the variance that is explained by the cluster components, summed over all the clusters
 produces cluster components that are oblique (correlated), not orthogonal
 enables you to analyze either the correlation or the covariance matrix
 creates an output data set that can be used with the SCORE procedure to compute component scores for each cluster
 creates an output data set that can be used with the TREE procedure to draw a tree diagram of hierarchical clusters

 enables you to base the clustering on partial correlations
 performs BY group processing, which enables you to obtain separate analyses on grouped observations
 performs weighted analysis
 creates a SAS data set that corresponds to any output table
 automatically displays a dendrogram (tree diagram of hierarchical clusters) by using ODS Graphics

For further details, see VARCLUS Procedure.