This section illustrates how PROC MODECLUS can be used to examine the clusters of data in the following artificial data set.
data example; input x y @@; datalines; 18 18 20 22 21 20 12 23 17 12 23 25 25 20 16 27 20 13 28 22 80 20 75 19 77 23 81 26 55 21 64 24 72 26 70 35 75 30 78 42 18 52 27 57 41 61 48 64 59 72 69 72 80 80 31 53 51 69 72 81 ;
It is a good practice to plot the data to check for obvious clusters or pathologies prior to the analysis. In this example, with only two variables and a small sample size, the SGPLOT procedure in the following statements produces a scatter plot:
proc sgplot; scatter y=y x=x; run;
Figure 78.1 suggests three clusters. Of these clusters, the one in the lower-left corner is the most compact, while the lower-right cluster is more dispersed.
The upper cluster is elongated and would be difficult for most clustering algorithms to identify as a single cluster. The plot also suggests that a Euclidean distance of 10 or 20 is a good initial guess for the neighborhood size in density estimation and clustering.
Figure 78.1: Scatter Plot of Data
To obtain a cluster analysis in PROC MODECLUS, you must specify the METHOD= option; for most purposes, METHOD=1 is recommended. The cluster analysis can be performed with a list of radii (R=10 15 35), as shown in the following PROC MODECLUS statement. An output data set containing the cluster membership is created with the OUT= option. The following statements produce Figure 78.2 through Figure 78.5:
proc modeclus data=example method=1 r=10 15 35 out=out; run;
For each cluster solution, PROC MODECLUS produces a table of cluster statistics including the cluster number, the number of observations in the cluster, the maximum estimated density within the cluster, the number of observations in the cluster having a neighbor that belongs to a different cluster, and the estimated saddle density of the cluster. The results are displayed in Figure 78.2, Figure 78.3, and Figure 78.4 for three different radii. A smaller radius (R=10) yields a larger number of clusters (6), as displayed in Figure 78.2; a larger radius (R=35) includes all observations in a single cluster, as displayed in Figure 78.4. Note that all clusters in these three figures are "isolated" since their corresponding boundary frequencies are all zeros. Consequently, all the estimated saddle densities are missing. A table summarizing each cluster solution is then produced at the end, as displayed in Figure 78.5.
Figure 78.2: Results from PROC MODECLUS for METHOD=1 and R=10
Figure 78.3: Results from PROC MODECLUS for METHOD=1 and R=15
Figure 78.4: Results from PROC MODECLUS for METHOD=1 and R=35
Figure 78.5: Summary Table
The OUT= data set contains a complete copy of the input data set for each cluster solution. By using a BY statement in the following PROC SGPLOT statement, you can examine the differences in cluster memberships for each radius as shown in Figure 78.6 through Figure 78.8:
proc sgplot data=out noautolegend; scatter y=y x=x / group=cluster markerchar=cluster; by _r_; run;
Figure 78.6: Scatter Plots of Cluster Memberships with _R_=10
Figure 78.7: Scatter Plots of Cluster Memberships with _R_=15
Figure 78.8: Scatter Plots of Cluster Memberships with _R_=35