Usage Note 22540: How can I tell how many clusters are in my data set?
See the section "The Number of Clusters" in the "Introduction to Clustering
Procedures" chapter of the SAS/STAT User's Guide, and the papers by Milligan
and Cooper and by Cooper and Milligan referenced there. If you specify the
PSEUDO and CCC options in the PROC CLUSTER statement, the printed output
includes several statistics that you can examine to determine the number of
clusters. To interpret:
- pseudo t-squared statistic:
Start at the top of the printed output and look for the first relatively
large value, then move back up one cluster.
- pseudo F:
Look for a relatively large value.
- CCC:
The best way to use the CCC is to plot its value against the number of
clusters, ranging from one cluster up to about one-tenth the number of
observations. The CCC may not behave well if the average number of observations
per cluster is less than ten. Use the following guidelines to interpret the
CCC:
* Peaks on the plot with the CCC greater than 2 or 3 indicate good clusterings.
* Peaks with the CCC between 0 and 2 indicate possible clusters but should be
interpreted cautiously.
* There may be several peaks if the data has a hierarchical structure.
* Very distinct nonhierarchical spherical clusters usually show a sharp rise
before the peak followed by a gradual decline.
* Very distinct nonhierarchical elliptical clusters often show a sharp rise to
the correct number of clusters followed by a further gradual increase and
eventually a gradual decline.
* If all values of the CCC are negative and decreasing for two or more
clusters, the distribution is probably unimodal or long-tailed.
* Very negative values of the CCC, say, -30, may be due to outliers. Outliers
generally should be removed before clustering.
* If the CCC increases continually as the number of clusters increases, the
distribution may be grainy or the data may have been excessively rounded or
recorded with just a few digits.
A final and very important warning: neither the CCC nor R-square is an
appropriate criterion for clusters that are highly elongated or irregularly
shaped. If you do not have prior substantive reasons for expecting compact
clusters, use a nonparametric clustering method such as those provided in
PROC CLUSTER or PROC MODECLUS rather than Ward's method or K-means.
For additional information on the CCC, refer to SAS Technical Report A-108,
"Cubic Clustering Criterion."
- R-Square:
Look for a value that explains as much variance as you think appropriate.
Milligan and Cooper demonstrated that changes in the R-Square are not very
useful for estimating the number of clusters, but the R-Square itself may be
useful if you are interested solely in data reduction.
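As a rough illustration of what the pseudo F and R-Square entries measure, the
following Python sketch computes both for a single candidate partition. It is
not SAS code and does not claim to reproduce how PROC CLUSTER works internally.
It assumes the usual definitions: R-Square = 1 - W/T and
pseudo F = ((T - W)/(G - 1)) / (W/(n - G)), where T is the total sum of
squares, W is the pooled within-cluster sum of squares, G is the number of
clusters, and n is the number of observations. The function name and the
sample points are hypothetical.

```python
import math


def pseudo_f_and_r2(points, labels):
    """Return (pseudo_f, r_square) for one partition of the data.

    points: list of equal-length numeric tuples (one per observation)
    labels: list of cluster identifiers, same length as points
    """
    n = len(points)
    if n != len(labels):
        raise ValueError("points and labels must have the same length")
    dims = len(points[0])
    clusters = sorted(set(labels))
    g = len(clusters)
    if g < 2 or g >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sum_sq(rows, center):
        # Sum of squared deviations from the given center over all coordinates.
        return sum((r[d] - center[d]) ** 2 for r in rows for d in range(dims))

    total_ss = sum_sq(points, mean(points))          # T
    if total_ss == 0:
        raise ValueError("all points are identical; the statistics are undefined")

    within_ss = 0.0                                   # W, pooled over clusters
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        within_ss += sum_sq(members, mean(members))

    between_ss = total_ss - within_ss
    r_square = between_ss / total_ss
    if within_ss == 0:
        # Every cluster is a single repeated point: a perfect fit.
        return math.inf, r_square
    pseudo_f = (between_ss / (g - 1)) / (within_ss / (n - g))
    return pseudo_f, r_square


if __name__ == "__main__":
    # Hypothetical data: two tight, well-separated groups of four points.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    good = [0, 0, 0, 0, 1, 1, 1, 1]   # matches the visible grouping
    bad = [0, 1, 0, 1, 0, 1, 0, 1]    # cuts across both groups
    print("good split:", pseudo_f_and_r2(pts, good))
    print("bad split: ", pseudo_f_and_r2(pts, bad))
```

On these points the split that matches the two groups yields a much larger
pseudo F and an R-Square close to 1, while the split that cuts across the
groups yields values near 0. That is the sense in which a "relatively large"
pseudo F points to a reasonable number of clusters.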
Type: Usage Note
Priority: low
Topic: SAS Reference ==> Procedures ==> CLUSTER; SAS Reference ==> Procedures ==> FASTCLUS; Analytics ==> Multivariate Analysis; Analytics ==> Data Mining; Analytics ==> Cluster Analysis
Date Modified: 2009-08-05 13:50:39
Date Created: 2002-12-16 10:56:39