Introduction to Clustering Procedures


Clustering Observations

PROC CLUSTER is easier to use than PROC FASTCLUS because a single run produces solutions for every number of clusters, from one up to as many as you like. With PROC FASTCLUS, you must run the procedure once for each number of clusters.
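
As a minimal sketch of what running PROC FASTCLUS once per number of clusters can look like, the following macro loops over a range of cluster counts. The macro name FASTCLUS_RANGE, the input data set MYDATA, the variables x, y, and z, and the output data set names CLUS2, CLUS3, and so on are all hypothetical and chosen only for illustration:

/* Hypothetical macro: one PROC FASTCLUS run per candidate number of clusters. */
/* MYDATA and the variables x y z are assumed names, not taken from a real data set. */
%macro fastclus_range(kmin=2, kmax=6);
   %do k = &kmin %to &kmax;
      /* OUT=CLUS&k holds the cluster assignments for the k-cluster solution */
      proc fastclus data=mydata maxclusters=&k out=clus&k noprint;
         var x y z;
      run;
   %end;
%mend fastclus_range;

%fastclus_range(kmin=2, kmax=6)

Each iteration is an independent analysis, so the solutions for different values of k are not necessarily nested the way the levels of a PROC CLUSTER hierarchy are.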

The time required by PROC FASTCLUS is roughly proportional to the number of observations, whereas the time required by PROC CLUSTER with most methods varies with the square or cube of the number of observations. Therefore, you can use PROC FASTCLUS with much larger data sets than PROC CLUSTER.

If you want to hierarchically cluster a data set that is too large to use with PROC CLUSTER directly, you can have PROC FASTCLUS produce, for example, 50 clusters, and let PROC CLUSTER analyze these 50 clusters instead of the entire data set. The MEAN= data set produced by PROC FASTCLUS contains two special variables:

  • The variable _FREQ_ gives the number of observations in the cluster.

  • The variable _RMSSTD_ gives the root mean square across variables of the cluster standard deviations.
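
To see these variables, you can print the MEAN= data set. The following sketch assumes that an earlier PROC FASTCLUS step wrote its cluster means to a data set named TEMP, that the analysis variables are x, y, and z, and that the cluster identifier has its default name CLUSTER:

/* Assumes a prior step such as: proc fastclus maxclusters=50 mean=temp; */
/* One row per cluster: cluster number, size, pooled spread, and variable means. */
proc print data=temp;
   var cluster _freq_ _rmsstd_ x y z;
run;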

These variables are automatically used by PROC CLUSTER to give the correct results when clustering clusters. For example, you could specify Ward’s minimum variance method (Ward, 1963):

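/* Preliminary step: reduce the data to 50 cluster means, saved in TEMP */
proc fastclus maxclusters=50 mean=temp;
   var x y z;
run;

ods graphics on;
/* Hierarchical step: cluster the 50 means; _FREQ_ and _RMSSTD_ are read from TEMP */
proc cluster data=temp method=ward outtree=tree;
   var x y z;
run;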

Or you could specify Wong’s hybrid method (Wong, 1982):

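/* Preliminary step: reduce the data to 50 cluster means, saved in TEMP */
proc fastclus maxclusters=50 mean=temp;
   var x y z;
run;

ods graphics on;
/* Hybrid step: density linkage on the preliminary clusters in TEMP */
proc cluster data=temp method=density hybrid outtree=tree;
   var x y z;
run;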
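
In either case, the OUTTREE= data set can be passed to PROC TREE to extract a specific number of clusters from the hierarchy. The following is only a sketch: it assumes the OUTTREE= data set is named TREE as in the preceding steps, the choice of five clusters is arbitrary, and the output data set name TREECLUS is hypothetical:

/* Assumes TREE was created by OUTTREE=TREE in a preceding PROC CLUSTER step. */
/* NCLUSTERS=5 is an arbitrary illustrative value; TREECLUS is a hypothetical name. */
proc tree data=tree nclusters=5 out=treeclus noprint;
run;

proc print data=treeclus;
run;

Each observation in TREECLUS corresponds to one of the preliminary clusters from PROC FASTCLUS, not to an observation in the original data set.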

More detailed examples are given in Chapter 33: The CLUSTER Procedure.