See the "Number of Clusters" section of the "Introduction to Clustering Procedures" chapter in the SAS/STAT User's Guide. Also see the papers by Milligan and Cooper and by Cooper and Milligan referenced there. If you specify the PSEUDO and CCC options, you can examine several statistics in the printed output to determine the number of clusters. To interpret these statistics:
For the pseudo t-squared statistic, start at the top of the printed output and look for the first relatively large value, then move back up one cluster.
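The scan-and-step-back rule above can be sketched in a few lines of Python. This is only an illustration: the function name `pick_by_pseudo_t2`, the `factor` threshold used to decide what counts as "relatively large", and the sample values are all assumptions and are not part of SAS. The rows are assumed to be in printed order, with the largest number of clusters first.

```python
def pick_by_pseudo_t2(rows, factor=2.0):
    """Return the cluster count one row above the first 'relatively large' value.

    rows   -- list of (number_of_clusters, pseudo_t2) in printed order,
              i.e. the largest number of clusters first
    factor -- hypothetical cutoff: a value counts as 'relatively large' when
              it is at least `factor` times the value on the row above it
    """
    for i in range(1, len(rows)):
        prev_k, prev_t2 = rows[i - 1]
        _, t2 = rows[i]
        if prev_t2 > 0 and t2 >= factor * prev_t2:
            # "Move back up one cluster": take the row just above the jump.
            return prev_k
    return None  # no clear jump found


# Made-up values for illustration only.
history = [(6, 4.1), (5, 5.0), (4, 6.2), (3, 31.7), (2, 12.4), (1, 48.9)]
suggested = pick_by_pseudo_t2(history)
print(suggested)  # 4: the jump occurs at 3 clusters, so step back up to 4
```

In practice "relatively large" is a visual judgment, so a fixed ratio such as `factor=2.0` should be treated as a starting point rather than a rule.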
For the pseudo F statistic, look for a relatively large value.
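To make "relatively large" concrete, it can help to see what a pseudo F value measures. The note itself does not define the statistic; the sketch below assumes the standard form, the ratio of between-cluster to within-cluster mean squares, and applies it to made-up one-dimensional data. The function name `pseudo_f` is hypothetical.

```python
def pseudo_f(clusters):
    """Pseudo F for a partition of 1-D data (assumed standard form).

    clusters -- list of lists of numbers, one inner list per cluster.
    Needs at least two clusters, more observations than clusters,
    and a nonzero within-cluster sum of squares.
    """
    points = [x for c in clusters for x in c]
    n, g = len(points), len(clusters)
    grand_mean = sum(points) / n
    total_ss = sum((x - grand_mean) ** 2 for x in points)
    within_ss = 0.0
    for c in clusters:
        mean_c = sum(c) / len(c)
        within_ss += sum((x - mean_c) ** 2 for x in c)
    between_ss = total_ss - within_ss
    # Between-cluster mean square divided by within-cluster mean square.
    return (between_ss / (g - 1)) / (within_ss / (n - g))


# Made-up data with two obvious groups.
two = [[1, 2, 3], [11, 12, 13]]
three = [[1, 2], [3], [11, 12, 13]]
print(pseudo_f(two) > pseudo_f(three))  # True: the two-cluster split scores higher
```

Here the partition that matches the visible grouping gives the larger value, which is the pattern to look for in the printed output.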
The best way to use the CCC is to plot its value against the number of clusters, ranging from one cluster up to about one-tenth the number of observations. The CCC may not behave well if the average number of observations per cluster is less than ten. The following guidelines should be used for interpreting the CCC:

* Peaks on the plot with the CCC greater than 2 or 3 indicate good clusterings.
* Peaks with the CCC between 0 and 2 indicate possible clusters but should be interpreted cautiously.
* There may be several peaks if the data has a hierarchical structure.
* Very distinct nonhierarchical spherical clusters usually show a sharp rise before the peak followed by a gradual decline.
* Very distinct nonhierarchical elliptical clusters often show a sharp rise to the correct number of clusters followed by a further gradual increase and eventually a gradual decline.
* If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed.
* Very negative values of the CCC, say, -30, may be due to outliers. Outliers generally should be removed before clustering.
* If the CCC increases continually as the number of clusters increases, the distribution may be grainy or the data may have been excessively rounded or recorded with just a few digits.

A final and very important warning: neither the CCC nor R-square is an appropriate criterion for clusters that are highly elongated or irregularly shaped. If you do not have prior substantive reasons for expecting compact clusters, use a nonparametric clustering method such as those provided in PROC CLUSTER or PROC MODECLUS rather than Ward's method or K-means.
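Several of the CCC guidelines above are mechanical enough to sketch in code. The following Python is a rough illustration, not a SAS facility: the function names, the default cutoff of 2 for a "good" peak (the text says "2 or 3"), and the sample CCC values are assumptions. It only covers the rules about peaks, all-negative decreasing curves, very negative values, and continually increasing curves; the shape-based rules still need a plot.

```python
def max_clusters_to_plot(n_obs):
    """Upper end of the plotting range: about one-tenth of the observations."""
    return max(1, n_obs // 10)


def interpret_ccc(ccc_by_k, good_threshold=2.0):
    """Apply some of the CCC guidelines to a {number_of_clusters: CCC} mapping.

    Returns a list of (number_of_clusters or None, note) tuples.
    """
    ks = sorted(ccc_by_k)
    values = [ccc_by_k[k] for k in ks]
    notes = []

    # Peaks: interior points higher than both neighbours.
    for i in range(1, len(ks) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            if values[i] > good_threshold:
                notes.append((ks[i], "good clustering"))
            elif values[i] > 0:
                notes.append((ks[i], "possible clustering, interpret cautiously"))

    # All negative and decreasing for two or more clusters.
    multi = [v for k, v in zip(ks, values) if k >= 2]
    if (multi and all(v < 0 for v in multi)
            and all(b < a for a, b in zip(multi, multi[1:]))):
        notes.append((None, "probably unimodal or long-tailed"))

    # Very negative values (the text suggests about -30) may signal outliers.
    if any(v <= -30 for v in values):
        notes.append((None, "very negative CCC: check for outliers"))

    # Continual increase over the whole range.
    if len(values) > 1 and all(b > a for a, b in zip(values, values[1:])):
        notes.append((None, "continually increasing: data may be grainy or rounded"))

    return notes


# Made-up CCC values for illustration only.
ccc = {1: 0.0, 2: -1.2, 3: 1.5, 4: 4.8, 5: 3.9, 6: 3.1, 7: 2.6}
print(max_clusters_to_plot(250))  # 25
print(interpret_ccc(ccc))         # [(4, 'good clustering')]
```

The sample curve rises sharply to a peak at four clusters and then declines gradually, which matches the description of distinct spherical clusters in the guidelines.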
For additional information on the CCC, please refer to SAS Technical Report A-108 Cubic Clustering Criterion.
For R-square, look for a value that explains as much variance as you think appropriate. Milligan and Cooper demonstrated that changes in R-square are not very useful for estimating the number of clusters, but R-square may be useful if you are interested solely in data reduction.
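For the data-reduction use of R-square, the choice reduces to finding the smallest number of clusters that reaches the proportion of variance you consider sufficient. A minimal sketch follows; the function name `smallest_k_reaching`, the 0.80 target, and the R-square values are hypothetical.

```python
def smallest_k_reaching(rsq_by_k, target):
    """Smallest number of clusters whose R-square is at least `target`.

    rsq_by_k -- {number_of_clusters: R-square}
    target   -- proportion of variance you consider sufficient (your choice)
    """
    for k in sorted(rsq_by_k):
        if rsq_by_k[k] >= target:
            return k
    return None  # target not reached in the range examined


# Made-up R-square values for illustration only.
rsq = {1: 0.0, 2: 0.52, 3: 0.71, 4: 0.83, 5: 0.88, 6: 0.91}
chosen = smallest_k_reaching(rsq, 0.80)
print(chosen)  # 4
```

This picks a level of data reduction only; as noted above, it is not a reliable estimate of the true number of clusters.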
| Product Family | Product | System | SAS Release Reported | SAS Release Fixed |
| --- | --- | --- | --- | --- |
| SAS System | SAS/STAT | All | n/a | |