Cluster Analysis

The purpose of cluster analysis is to place objects into groups, or clusters, suggested by the data, not defined a priori, such that objects in a given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. You can also use cluster analysis to summarize data rather than to find "natural" or "real" clusters; this use of clustering is sometimes called dissection. The SAS/STAT procedures for clustering are oriented toward disjoint or hierarchical clusters from coordinate data, distance data, or a correlation or covariance matrix.

The SAS/STAT cluster analysis procedures include the following:

ACECLUS Procedure


The ACECLUS procedure obtains approximate estimates of the pooled within-cluster covariance matrix when the clusters are assumed to be multivariate normal with equal covariance matrices. Neither cluster membership nor the number of clusters needs to be known. PROC ACECLUS is useful for preprocessing data to be subsequently clustered by the CLUSTER or FASTCLUS procedure. The procedure enables you to do the following:

  • choose between the following methods for selecting the pairs of observations used in the estimate:
    • use a number of pairs, m, with the smallest distances to form the estimate at each iteration
    • use all pairs closer than a given cutoff value to form the estimate at each iteration
  • specify the metric in which computations are performed
  • specify the number or proportion of pairs for estimating within-cluster covariance
  • specify the threshold for including pairs in the estimation of the within-cluster covariance
  • perform BY group processing, which enables you to obtain separate analyses on grouped observations
  • perform weighted analysis
  • create a data set that corresponds to any output table
  • create a data set that contains means, standard deviations, number of observations, covariances, estimated within-cluster covariances, eigenvalues, and canonical coefficients
  • produce a PP-plot of distances between pairs from the last iteration
  • produce a QQ-plot of a power transformation of distances between pairs from the last iteration
For further details, see ACECLUS Procedure

CLUSTER Procedure


The CLUSTER procedure hierarchically clusters the observations in a SAS data set by using one of 11 methods. The data can be coordinates or distances. If the data are coordinates, PROC CLUSTER computes (possibly squared) Euclidean distances. The following are highlights of the CLUSTER procedure's features:

  • supports the following clustering methods:
    • average linkage
    • centroid method
    • complete linkage
    • density linkage (including Wong's hybrid and kth-nearest-neighbor methods)
    • maximum likelihood for mixtures of spherical multivariate normal distributions with equal variances but possibly unequal mixing proportions
    • flexible-beta method
    • McQuitty's similarity analysis
    • median method
    • single linkage
    • two-stage density linkage
    • Ward's minimum-variance method
  • displays a history of the clustering process, showing statistics useful for estimating the number of clusters in the population from which the data are sampled
  • creates a data set that can be used by the TREE procedure to draw a tree diagram of the cluster hierarchy or to output the cluster membership at any desired level
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • creates a data set that corresponds to any output table
  • automatically produces graphs by using ODS Graphics
For further details, see CLUSTER Procedure
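
To make the idea of hierarchical clustering concrete, the following is a minimal sketch of single linkage, one of the methods listed above, written in plain Python rather than SAS. It is an illustration under stated assumptions, not PROC CLUSTER's implementation: the function name `single_linkage`, the sample points, and the shape of the returned history are all hypothetical. At each step the two clusters whose closest members are nearest to each other are merged, and the merge distance is recorded.

```python
import math

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, history). Each history entry records the two
    clusters that were merged and the distance at which they merged.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]  # start with singletons
    history = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters], history

# Hypothetical coordinate data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, history = single_linkage(points, 3)
```

The recorded merge distances play a role loosely similar to the cluster history that the procedure displays: a large jump between successive merge distances is one informal hint about where to stop merging.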

DISTANCE Procedure


The DISTANCE procedure computes various measures of distance, dissimilarity, or similarity between the observations (rows) of an input SAS data set, which can contain numeric or character variables, or both, depending on which proximity measure is used. The proximity measures are stored as a lower triangular matrix or a square matrix in an output data set that can then be used as input to the CLUSTER, MDS, and MODECLUS procedures. The following are highlights of the DISTANCE procedure's features:

  • provides various nonparametric and parametric methods for standardizing variables
  • provides proximity measures that accept four levels of measurement: nominal, ordinal, interval, and ratio
  • supports BY group processing, which enables you to obtain separate analyses on grouped observations
For further details, see DISTANCE Procedure
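
As a rough illustration of what a lower triangular matrix of proximities looks like, the sketch below computes Euclidean distances between observations in plain Python. It is not SAS syntax and does not reproduce PROC DISTANCE's output layout; the function name `lower_triangular_distances` and the sample observations are assumptions made for the example.

```python
import math

def lower_triangular_distances(rows):
    """Return Euclidean distances as a lower triangular matrix.

    Row i holds the distances from observation i to observations 0..i,
    so the last entry of each row (the diagonal) is 0.0.
    """
    matrix = []
    for i, a in enumerate(rows):
        matrix.append([math.dist(a, b) for b in rows[: i + 1]])
    return matrix

# Hypothetical observations with two numeric variables each.
observations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = lower_triangular_distances(observations)
```

Because a distance matrix is symmetric, storing only the lower triangle keeps every pairwise value while roughly halving the storage.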

FASTCLUS Procedure


The FASTCLUS procedure performs a disjoint cluster analysis on the basis of distances computed from one or more quantitative variables. The observations are divided into clusters such that every observation belongs to one and only one cluster. The following are highlights of the procedure's features:

  • uses Euclidean distances, so the cluster centers are based on least squares estimation (k-means model)
  • is designed to find good clusters (but not necessarily the best possible clusters) with only two or three passes through the data set
  • can be an effective procedure for detecting outliers because outliers often appear as clusters with only one member
  • can use an Lp (least pth powers) clustering criterion
  • is intended for use with large data sets, with 100 or more observations
  • uses algorithms that place a larger influence on variables with larger variance
  • produces brief summaries of the clusters
  • produces an output data set containing a cluster membership variable
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • computes weighted cluster means
  • creates a SAS data set that corresponds to any output table
For further details, see FASTCLUS Procedure
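
The k-means model mentioned above can be sketched in a few lines of plain Python. This is a generic assign-and-update loop, not PROC FASTCLUS's actual seed selection or update algorithm; the function name `k_means`, the choice of seeds, and the sample data are assumptions for illustration. Each pass assigns every observation to its nearest center by Euclidean distance and then replaces each center with the mean of its members, which is the least squares estimate.

```python
import math

def k_means(points, seeds, passes=3):
    """Assign points to the nearest center, then recompute centers as means."""
    centers = [tuple(s) for s in seeds]
    labels = [0] * len(points)
    for _ in range(passes):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(len(centers)),
                      key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: the mean is the least squares center of each cluster.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels, centers

# Hypothetical data: two small groups and one far-away observation.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0), (50.0, 50.0)]
labels, centers = k_means(points, seeds=[points[0], points[2], points[4]])
```

In this toy data set the distant observation ends up alone in its own cluster, which mirrors the point above that outliers often appear as clusters with only one member.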

MODECLUS Procedure


The MODECLUS procedure clusters observations in a SAS data set by using any of several algorithms based on nonparametric density estimates. The following are highlights of the MODECLUS procedure's features:

  • accepts data as numeric coordinates or distances
  • performs approximate significance tests for the number of clusters and can hierarchically join nonsignificant clusters
  • produces output data sets containing:
    • density estimates and cluster membership
    • various cluster statistics including approximate p-values
    • a summary of the number of clusters generated by various algorithms
    • smoothing parameters
    • significance levels
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • creates a SAS data set that corresponds to any output table
For further details, see MODECLUS Procedure

TREE Procedure


The TREE procedure reads a data set created by the CLUSTER or VARCLUS procedure and produces a tree diagram (also known as a dendrogram or phenogram), which displays the results of a hierarchical clustering analysis as a tree structure. The following are highlights of the TREE procedure's features:

  • produces a tree diagram that is oriented either vertically, with the root at the top, or horizontally, with the root at the left
  • provides options that enable you to control the displayed diagram
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • creates a SAS data set that corresponds to any output table
For further details, see TREE Procedure

VARCLUS Procedure


The VARCLUS procedure divides a set of numeric variables into disjoint or hierarchical clusters. Associated with each cluster is a linear combination of the variables in the cluster. This linear combination can be either the first principal component or the centroid component. The following are highlights of the VARCLUS procedure's features:

  • attempts to maximize the variance that is explained by the cluster components, summed over all the clusters
  • produces cluster components that are oblique (that is, correlated rather than orthogonal)
  • enables you to analyze either the correlation or the covariance matrix
  • creates an output data set that can be used with the SCORE procedure to compute component scores for each cluster
  • creates an output data set that can be used with the TREE procedure to draw a tree diagram of hierarchical clusters
  • enables you to base the clustering on partial correlations
  • performs BY group processing, which enables you to obtain separate analyses on grouped observations
  • performs weighted analysis
  • creates a SAS data set that corresponds to any output table
  • automatically displays a dendrogram (tree diagram of hierarchical clusters) by using ODS Graphics
For further details, see VARCLUS Procedure