This example demonstrates how you can use PROC VARCLUS to cluster variables.
The following data are job ratings of police officers. The officers were rated by their supervisors on 13 job skills on a scale from 1 to 9. There is also an overall rating that is not used in this analysis. The following DATA step creates the SAS data set JobRat:
data JobRat;
   input (Communication_Skills Problem_Solving Learning_Ability
          Judgement_under_Pressure Observational_Skills
          Willingness_to_Confront_Problems Interest_in_People
          Interpersonal_Sensitivity Desire_for_Self_Improvement
          Appearance Dependability Physical_Ability Integrity
          Overall_Rating) (1.);
   datalines;
26838853879867
74758876857667
56757863775875
67869777988997
... more lines ...
99997899799799
99899899899899
76656399567486
;
The following statements cluster the variables:
proc varclus data=JobRat maxclusters=3;
   var Communication_Skills--Integrity;
run;
The DATA= option specifies the SAS data set JobRat as input.
The MAXCLUSTERS=3 option specifies that no more than three clusters be computed. By default, PROC VARCLUS splits and optimizes clusters until all clusters have a second eigenvalue less than one. In this example, the default setting would produce only two clusters, but going to three clusters produces a more interesting result.
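To see the default stopping rule at work, the MAXCLUSTERS= option can be dropped. The following is a minimal sketch that assumes the same JobRat data set. The second step writes the threshold out explicitly with MAXEIGEN=1, which is the default when the correlation matrix is analyzed and neither MAXCLUSTERS= nor PROPORTION= is specified, so both steps would be expected to stop at the same solution.

```sas
/* Default behavior: split until every cluster has a second eigenvalue below 1 */
proc varclus data=JobRat;
   var Communication_Skills--Integrity;
run;

/* Same stopping rule, with the threshold stated explicitly */
proc varclus data=JobRat maxeigen=1;
   var Communication_Skills--Integrity;
run;
```

According to the discussion above, either step should stop at two clusters for these data.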
The VAR statement lists the numeric variables (Communication_Skills -- Integrity) to be used in the analysis. The overall rating is omitted from the list of variables.
Although PROC VARCLUS displays output for one cluster, two clusters, and three clusters, the following figures display only the final analysis for three clusters.
For each cluster, Figure 96.1 displays the number of variables in the cluster, the cluster variation, the total explained variation, and the proportion of the total variance explained by the variables in the cluster. The variance explained by the variables in a cluster is similar to the variance explained by a factor in common factor analysis, but it includes contributions only from the variables in the cluster rather than from all variables.
The line labeled "Total variation explained" in Figure 96.1 gives the sum of the explained variation over all clusters. The final "Proportion" represents the total explained variation divided by the sum of cluster variation. This value, 0.6715, indicates that about 67% of the total variation in the data can be accounted for by the three cluster components.
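The final "Proportion" can be checked by hand. This sketch assumes the default analysis of the correlation matrix, in which each of the 13 standardized variables contributes one unit of variation, so the cluster variation sums to 13. The value 8.729286 is the three-cluster total variation explained that is reported in the cluster history.

```latex
\[
\text{Proportion}
  = \frac{\text{total variation explained}}{\text{sum of cluster variation}}
  = \frac{8.729286}{13}
  \approx 0.6715
\]
```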
Figure 96.1: Cluster Summary for 3 Clusters
Figure 96.2 shows how the variables are clustered. Figure 96.2 also displays the R-square value of each variable with its own cluster and the R-square value with its nearest cluster. The R-square value for a variable with the nearest cluster should be low if the clusters are well separated. The last column displays the ratio $(1 - R^2_{\text{own}}) / (1 - R^2_{\text{nearest}})$ for each variable. Small values of this ratio indicate good clustering.
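A small worked example may help in reading the ratio column, which compares one minus the R-square with a variable's own cluster to one minus the R-square with its nearest cluster. The R-square values below are hypothetical, chosen only for illustration; they are not taken from Figure 96.2.

```latex
% Hypothetical variable A: strongly tied to its own cluster, weakly to the nearest one
\[
\frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{nearest}}}
  = \frac{1 - 0.80}{1 - 0.20}
  = \frac{0.20}{0.80}
  = 0.25
\]
% Hypothetical variable B: about equally related to both clusters
\[
\frac{1 - R^2_{\text{own}}}{1 - R^2_{\text{nearest}}}
  = \frac{1 - 0.50}{1 - 0.40}
  = \frac{0.50}{0.60}
  \approx 0.83
\]
```

The first variable sits cleanly in its cluster and gets a small ratio. The second is almost as close to the neighboring cluster and gets a ratio near one.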
Figure 96.2: 3 Clusters (column headings: R-squared with, 1-R**2)
Figure 96.3 displays the standardized scoring coefficients that are used to compute the first principal component of each cluster. Since each variable is assigned to one and only one cluster, each row of the scoring coefficients contains only one nonzero value.
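The scoring coefficients can also be applied to the data to obtain a cluster component score for each officer. The following is a sketch rather than part of the original example. It relies on the OUTSTAT= option of PROC VARCLUS and on PROC SCORE, and the data set names VarclusStat, Coef3, and ClusterScores are hypothetical names chosen for illustration.

```sas
/* Save the statistics, including scoring coefficients, without printed output */
proc varclus data=JobRat maxclusters=3 outstat=VarclusStat noprint;
   var Communication_Skills--Integrity;
run;

/* Keep the overall statistics (_NCL_ missing) and the three-cluster solution */
data Coef3;
   set VarclusStat;
   if _ncl_ = . or _ncl_ = 3;
   drop _ncl_;
run;

/* Apply the standardized scoring coefficients to the ratings */
proc score data=JobRat score=Coef3 out=ClusterScores;
   var Communication_Skills--Integrity;
run;
```

Because the OUTSTAT= data set holds results for every number of clusters that was tried, the intermediate DATA step is needed so that PROC SCORE sees only one solution.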
Figure 96.3: Standardized Scoring Coefficients
Figure 96.4 displays the cluster structure and the intercluster correlations. The structure table displays the correlation of each variable with each cluster component. The table of intercorrelations contains the correlations between the cluster components.
PROC VARCLUS next displays the summary table of statistics for the cluster history (Figure 96.5). The first three columns give the number of clusters, the total variation explained by clusters, and the proportion of variation explained by clusters, respectively.
As displayed in the first row of Figure 96.5, the variation explained by the first principal component of all the variables is 6.547402, and the proportion of variation explained is 0.5036.
When the number of clusters is two, the total variation explained is 7.96775 and the proportion of variation explained by the two clusters is 0.6129. The largest second eigenvalue of the clusters is 0.937902, which is less than one, so by default PROC VARCLUS would stop splitting clusters at this point. Because the MAXCLUSTERS=3 option is specified in this example, PROC VARCLUS continues to the three-cluster solution.
When the number of clusters increases to three, the total variation explained is 8.729286 and the proportion of variation explained by the three clusters is 0.6715. The largest second eigenvalue of the clusters is 0.709323. The statistical improvement from increasing the number of clusters from two to three seems modest, but the interpretability of the three clusters argues for the three-cluster solution.
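The size of each improvement can be read directly from the figures quoted above. The differences below use only the values reported in Figure 96.5; the symbol $\Delta$ is introduced here purely as shorthand for the gain in total variation explained.

```latex
\begin{align*}
\Delta_{1 \to 2} &= 7.96775 - 6.547402 = 1.420348
  \quad (\text{proportion: } 0.6129 - 0.5036 = 0.1093) \\
\Delta_{2 \to 3} &= 8.729286 - 7.96775 = 0.761536
  \quad (\text{proportion: } 0.6715 - 0.6129 = 0.0586)
\end{align*}
```

The third cluster adds roughly half as much explained variation as the second did, which is the sense in which the gain is modest.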
Figure 96.5 also displays the minimum proportion of variance explained by a cluster, the minimum R-square for a variable, and the maximum $1 - R^2$ ratio for a variable. The last quantity is the maximum ratio of the value $1 - R^2$ for a variable’s own cluster to the value $1 - R^2$ for its nearest cluster.