22544 - Assigning new observations to clusters defined using previous data

Usage Note 22544: Assigning new observations to clusters defined using previous data

Suppose you develop a clustering solution for a set of data, and you have new data whose observations you want to assign to those clusters. Clustering of the original data can be done using various procedures including PROC CLUSTER and PROC FASTCLUS. After developing and saving the clustering solution for the original data, there are several ways that the new observations can be assigned to the previously generated clusters. Approaches using PROC FASTCLUS and PROC DISCRIM are described here.

Preparing data for clustering and scoring

PROC CLUSTER and PROC FASTCLUS are used below to create cluster solutions for the Sashelp.Class data set. In each case, the clustering solution is saved for later use in assigning new observations to the clusters (also called scoring the new data).

The variables involved in developing the cluster solutions are first standardized. Here, PROC STDIZE is used with the range method, which subtracts each variable's minimum and divides by its range so that the standardized values lie between 0 and 1. The standardized variables are saved in the StdClass output data set. The location and scale measures are saved in the Standardizing_Info output data set.

proc stdize data=sashelp.class out=StdClass method=range
            outstat=standardizing_info;
  var age height weight;
run;

Suppose you have a set of new measurements taken on the same variables: Age, Height, and Weight. You want to assign the new observations to the previously derived clusters. The new data set, New, is first standardized using the same method and the same location and scale information as the original Class data set.

proc stdize data=New out=StdNew method=in(standardizing_info);
  var age height weight;
run;
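The note does not show the New data set itself. As a minimal sketch, assuming the new measurements are typed in directly, a DATA step like the one below could create New before the PROC STDIZE step above is run. The three rows are hypothetical values made up for illustration; they are not part of the original note.

```sas
/* Hypothetical new observations: illustrative values only. */
/* Run this step before standardizing New with PROC STDIZE. */
data New;
  input Age Height Weight;
  datalines;
12 58.0  90.0
14 64.5 110.0
16 70.0 140.0
;
run;
```

Because METHOD=IN applies the location and scale values from the original data, a new observation that falls outside the range of the original Class data gets a standardized value below 0 or above 1.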

Assigning observations to clusters with PROC FASTCLUS

The PROC FASTCLUS method is suitable if the original clusters were obtained by PROC CLUSTER with METHOD=WARD or by PROC FASTCLUS. With other methods, there is no formal justification for using FASTCLUS, although results are probably reasonable with METHOD=AVERAGE or CENTROID. For solutions obtained from the MEDIAN or FLEXIBLE methods, use this scoring method with caution. For METHOD=COMPLETE, use FASTCLUS with LEAST=MAX.

Example 1: Assigning new observations to clusters with PROC FASTCLUS using a cluster solution from PROC CLUSTER

In this example, Ward's method is used in PROC CLUSTER to cluster the data. The results are saved in the OUTTREE= output data set. A three-cluster solution is saved in the OUT= data set from PROC TREE.

proc cluster data=StdClass outtree=tree method=ward;
  var age height weight;
  copy name sex;
  id name;
run;

proc tree data=tree out=ClassClus(drop=clusname _name_) n=3;
  copy age height weight name sex;
run;

To use PROC FASTCLUS for scoring the new data, specify the existing cluster means as the seeds. The following MEANS step computes the variable means for each of the three clusters. The means are saved using the same names as the original variables.

proc means data=ClassClus;
  class cluster;
  var age height weight;
  output out=Means(drop=_type_ _freq_ where=(cluster ne .)) mean=;
run;

PROC FASTCLUS is used to assign the new observations to the clusters. Because the objective is to use the existing clusters, specify MAXITER=0 so that no iterations are done. REPLACE=NONE specifies that the seeds provided in the Means data set are kept at their input values. MAXC=3 keeps the maximum number of clusters at 3. With the existing cluster means as seeds, each new observation is assigned to the cluster whose mean is closest to it. The final cluster assignments are saved in the NewClusters data set.

proc fastclus data=StdNew seed=Means maxiter=0 replace=none maxc=3 out=NewClusters;
  var age height weight;
run;
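To review the assignments, the OUT= data set can be printed. The sketch below assumes the step above ran successfully; CLUSTER and DISTANCE are the variables that PROC FASTCLUS adds to the OUT= data set, with DISTANCE holding the distance from each observation to its assigned seed.

```sas
/* List each new observation with its assigned cluster and distance to the seed. */
proc print data=NewClusters;
  var age height weight cluster distance;
run;

/* Count how many new observations fall in each cluster. */
proc freq data=NewClusters;
  tables cluster;
run;
```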

Example 2: Assigning new observations to clusters with PROC FASTCLUS using a cluster solution from PROC FASTCLUS

A three-cluster solution for the original data is found using the following PROC FASTCLUS step. The cluster solution statistics are saved in the OUTSTAT= data set, FastclusStats.

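/* MAXITER=100 allows the seeds to be refined; with MAXITER=0 the */
/* final seeds would simply be the initial seeds.                 */
proc fastclus data=StdClass outstat=FastclusStats maxclusters=3 maxiter=100;
  var age height weight;
run;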

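The OUTSTAT= data set with the original cluster information is used as the INSTAT= data set to assign the new observations to clusters. When INSTAT= is specified, PROC FASTCLUS performs no clustering iterations and only assigns observations to the existing clusters. The final cluster assignments are saved in the NewClusters data set.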

proc fastclus data=StdNew instat=FastclusStats out=NewClusters;
  var age height weight;
run;
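One way to see what is carried from the first step to the second is to print the OUTSTAT= data set. This is only an inspection sketch: it assumes FastclusStats was created by the step above, and it relies on the _TYPE_ variable in that data set to label each row.

```sas
/* Show the statistics that the INSTAT= step reads; _TYPE_ labels each row. */
proc print data=FastclusStats;
run;
```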

Assigning observations to clusters with PROC DISCRIM

Example 3: Assigning new observations to clusters with PROC DISCRIM using a cluster solution from PROC CLUSTER

PROC DISCRIM with METHOD=NPAR is suitable if the original clusters were obtained by PROC MODECLUS or by PROC CLUSTER with METHOD=DENSITY or TWOSTAGE. Be sure to specify the same density-estimation method in PROC DISCRIM as was used for clustering. PROC DISCRIM using nearest neighbor (K=1) is the natural choice for clusters obtained using METHOD=SINGLE. PROC DISCRIM with METHOD=NORMAL can be problematic because the distribution assumptions in PROC DISCRIM do not correspond to any of the clustering methods.

In this example, PROC CLUSTER with METHOD=DENSITY and R=1 (uniform-kernel density estimation) creates a new cluster solution for the standardized data, and PROC TREE saves the three-cluster solution in the ClassClus data set. PROC DISCRIM, using the same uniform kernel and radius, then uses the cluster information in ClassClus to assign each observation in StdNew to the cluster having the highest posterior probability for that observation. The final cluster assignments are saved in the NewClusters data set.

proc cluster data=StdClass outtree=tree method=den r=1;
  var age height weight;
  copy name sex;
  id name;
run;

proc tree data=tree out=ClassClus(drop=clusname _name_) n=3;
  copy age height weight name sex;
run;

proc discrim data=ClassClus testdata=StdNew testout=NewClusters method=npar kernel=uniform r=1;
  class cluster;
  var age height weight;
run;
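The TESTOUT= data set from PROC DISCRIM records the assigned class in the _INTO_ variable rather than in CLUSTER. As a small sketch, assuming the step above has run, the assignments can be tabulated as follows.

```sas
/* Count how many new observations PROC DISCRIM assigned to each cluster. */
proc freq data=NewClusters;
  tables _INTO_;
run;
```

Note that this example writes to NewClusters again, so it replaces the data set created in the earlier examples; a different TESTOUT= name avoids that.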

Operating System and Release Information

Product Family	Product	System	SAS Release
			Reported	Fixed*
SAS System	SAS/STAT	All	n/a

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:	low
Topic:	SAS Reference ==> Procedures ==> MODECLUS
	SAS Reference ==> Procedures ==> DISCRIM
	Analytics ==> Discriminant Analysis
	Analytics ==> Nonparametric Analysis
	Analytics ==> Cluster Analysis
	Analytics ==> Data Mining
	Analytics ==> Multivariate Analysis
	SAS Reference ==> Procedures ==> FASTCLUS
	SAS Reference ==> Procedures ==> CLUSTER

Date Modified:	2020-12-08 14:53:28
Date Created:	2002-12-16 10:56:37
