Suppose you develop a clustering solution for a set of data, and you have new data whose observations you want to assign to those clusters. Clustering of the original data can be done using various procedures including PROC CLUSTER and PROC FASTCLUS. After developing and saving the clustering solution for the original data, there are several ways that the new observations can be assigned to the previously generated clusters. Approaches using PROC FASTCLUS and PROC DISCRIM are described here.
Preparing data for clustering and scoring
PROC CLUSTER and PROC FASTCLUS are used below to create cluster solutions for the Sashelp.Class data set. In each case, the clustering solution is saved for later use in assigning new observations to the clusters (also called scoring the new data).
The variables involved in developing the cluster solutions are first standardized. Here, PROC STDIZE is used with the range method. The standardized variables are saved in the StdClass output data set. The location and scale measures are saved in the Standardizing_Info output data set.
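A minimal sketch of the standardization step described above. The data set names come from the text; the exact option list is an assumption, not necessarily the statement used in the original note.

```sas
/* Standardize Age, Height, and Weight with the range method.      */
/* OUT= holds the standardized data; OUTSTAT= holds the location   */
/* and scale measures so they can be reused on new data later.     */
proc stdize data=Sashelp.Class method=range
            out=StdClass outstat=Standardizing_Info;
   var Age Height Weight;
run;
```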
Suppose you have a set of new measurements taken on the same variables: Age, Height, and Weight. You want to assign the new observations to the previously derived clusters. The new data set, New, is first standardized using the same method and the same location and scale information as the original Class data set.
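The following sketch shows one way to do this. The values in New are made up for illustration only, and the sketch assumes that Standardizing_Info was saved with the OUTSTAT= option of PROC STDIZE. METHOD=IN(...) tells PROC STDIZE to read the location and scale measures from that data set instead of computing them from New.

```sas
/* Hypothetical new observations; these values are illustrative only. */
data New;
   input Age Height Weight;
   datalines;
12 58.0 90.0
14 63.5 105.0
15 67.0 125.0
;
run;

/* Apply the location and scale measures saved from the original data. */
proc stdize data=New method=in(Standardizing_Info) out=StdNew;
   var Age Height Weight;
run;
```

Reusing the saved measures matters: standardizing New on its own range would put the new observations on a different scale from the clusters.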
The PROC FASTCLUS method is suitable if the original clusters were obtained by PROC CLUSTER with METHOD=WARD or by PROC FASTCLUS. With other methods, there is no formal justification for using FASTCLUS, although results are probably reasonable with METHOD=AVERAGE or CENTROID. For solutions obtained from the MEDIAN or FLEXIBLE methods, use this scoring method with caution. For METHOD=COMPLETE, use FASTCLUS with LEAST=MAX.
Example 1: Assigning new observations to clusters with PROC FASTCLUS using a cluster solution from PROC CLUSTER
In this example, Ward's method is used in PROC CLUSTER to cluster the data. The results are saved in the OUTTREE= output data set. A three-cluster solution is saved in the OUT= data set from PROC TREE.
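A sketch of these two steps, assuming the standardized data are in StdClass. The OUTTREE= data set name Tree is a hypothetical choice, and ClassClus is the name that the text later uses for the saved cluster solution.

```sas
/* Hierarchical clustering with Ward's method; the tree is saved in Tree. */
proc cluster data=StdClass method=ward outtree=Tree noprint;
   var Age Height Weight;
run;

/* Cut the tree at three clusters. COPY carries the analysis variables   */
/* into the OUT= data set alongside the CLUSTER membership variable.     */
proc tree data=Tree nclusters=3 out=ClassClus noprint;
   copy Age Height Weight;
run;
```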
To use PROC FASTCLUS for scoring the new data, specify the existing cluster means as the seeds. The following MEANS step computes the variable means for each of the three clusters. The means are saved using the same names as the original variables.
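One possible form of the MEANS step, assuming the three-cluster solution is in ClassClus with a CLUSTER variable created by PROC TREE. Listing the original names after MEAN= is what keeps the variable names unchanged.

```sas
/* One row of variable means per cluster. NWAY suppresses the overall    */
/* (all-observations) row so that Means contains only the three seeds.   */
proc means data=ClassClus noprint nway;
   class Cluster;
   var Age Height Weight;
   output out=Means mean=Age Height Weight;
run;
```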
PROC FASTCLUS is used to assign the new observations to the clusters. Because the goal is to use the existing clusters, specify MAXITER=0 so that no iterations are done. REPLACE=NONE specifies that the seeds provided in the Means data set are kept at their input values. MAXC=3 keeps the maximum number of clusters at 3. With the existing cluster means as seeds, each new observation is assigned to the cluster whose mean is closest. The final cluster assignments are saved in the NewClusters data set.
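The options described above can be combined as in the following sketch, which assumes the standardized new data are in StdNew and the cluster means are in Means.

```sas
/* Score the new data against fixed seeds: no iterations, no seed         */
/* replacement, three clusters. OUT= receives the CLUSTER and DISTANCE    */
/* variables for each new observation.                                    */
proc fastclus data=StdNew seed=Means maxc=3 maxiter=0 replace=none
              out=NewClusters;
   var Age Height Weight;
run;
```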
Example 2: Assigning new observations to clusters with PROC FASTCLUS using a cluster solution from PROC FASTCLUS
A three-cluster solution for the original data is found using the following PROC FASTCLUS step. The cluster solution statistics are saved in the OUTSTAT= data set, FastclusStats.
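A sketch of such a step, assuming the standardized data are in StdClass. Options other than MAXCLUSTERS= and OUTSTAT= are left at their defaults here, which may differ from the original note.

```sas
/* k-means style clustering into three clusters; the cluster statistics  */
/* are saved in FastclusStats for later scoring.                         */
proc fastclus data=StdClass maxclusters=3 outstat=FastclusStats;
   var Age Height Weight;
run;
```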
The OUTSTAT= data set with the original cluster information is used as the INSTAT= data set to assign the new observations to clusters. The final cluster assignments are saved in the NewClusters data set.
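A sketch of the scoring step, assuming the standardized new data are in StdNew. When INSTAT= is specified, PROC FASTCLUS performs no clustering iterations and only assigns observations while creating the OUT= data set.

```sas
/* Assign new observations by using the saved cluster statistics. */
proc fastclus data=StdNew instat=FastclusStats out=NewClusters;
   var Age Height Weight;
run;
```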
Example 3: Assigning new observations to clusters with PROC DISCRIM using a cluster solution from PROC CLUSTER
PROC DISCRIM with METHOD=NPAR is suitable if the original clusters were obtained by PROC MODECLUS or by PROC CLUSTER with METHOD=DENSITY or TWOSTAGE. Be sure to specify the same density-estimation method in PROC DISCRIM as was used for clustering. PROC DISCRIM using nearest neighbor (K=1) is the natural choice for clusters obtained using METHOD=SINGLE. PROC DISCRIM with METHOD=NORMAL can be problematic because the distribution assumptions in PROC DISCRIM do not correspond to any of the clustering methods.
In this example, the original cluster solution from PROC CLUSTER and PROC TREE is used. PROC DISCRIM uses the original cluster information in the ClassClus data set to assign each observation in StdNEW to the cluster that has the highest posterior probability for that observation. The final cluster assignments are saved in the NewClusters data set.
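The following sketch shows the general shape of such a step. The text does not say which discriminant method the original example uses, so METHOD=NPAR with K=3 is an assumption made purely for illustration; choose the method according to the guidance above.

```sas
/* ClassClus supplies the training data, with CLUSTER as the class        */
/* variable. TESTDATA= names the data to score, and TESTOUT= saves the    */
/* posterior probabilities and the assigned cluster in _INTO_.            */
proc discrim data=ClassClus testdata=StdNew testout=NewClusters
             method=npar k=3 noprint;   /* assumed method, see lead-in */
   class Cluster;
   var Age Height Weight;
run;
```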
Product Family | Product | System | SAS Release Reported | SAS Release Fixed* |
SAS System | SAS/STAT | All | n/a | |
Type: | Usage Note |
Priority: | low |
Topic: | SAS Reference ==> Procedures ==> MODECLUS; SAS Reference ==> Procedures ==> DISCRIM; Analytics ==> Discriminant Analysis; Analytics ==> Nonparametric Analysis; Analytics ==> Cluster Analysis; Analytics ==> Data Mining; Analytics ==> Multivariate Analysis; SAS Reference ==> Procedures ==> FASTCLUS; SAS Reference ==> Procedures ==> CLUSTER |
Date Modified: | 2020-12-08 14:53:28 |
Date Created: | 2002-12-16 10:56:37 |