
 The STDIZE Procedure

## Example 82.1 Standardization of Variables in Cluster Analysis

To illustrate the effect of standardization in cluster analysis, this example uses the Fish data set described in the "Getting Started" section of Chapter 34, The FASTCLUS Procedure. The numbers are measurements taken on 159 fish caught from the same lake (Laengelmavesi) near Tampere in Finland (Puranen 1917). The complete data set is displayed in Chapter 83, The STEPDISC Procedure.

The species (bream, parkki, pike, perch, roach, smelt, and whitefish), weight, three different length measurements (measured from the nose of the fish to the beginning of its tail, the notch of its tail, and the end of its tail), height, and width of each fish are recorded. The height and width are recorded as percentages of the third length variable.

Several new variables are created in the Fish data set: Weight3, Height, Width, and logLengthRatio. The weight of a fish indicates its size: a heavier pike tends to be larger than a lighter pike. To get a one-dimensional measure of the size of a fish, take the cube root of the weight (Weight3). The variables Height, Width, Length1, Length2, and Length3 are divided by Weight3 in order to adjust for dimensionality. The logLengthRatio variable, the logarithm of the ratio of Length3 to Length1, measures the tail length.

Because the new variables Weight3 through logLengthRatio depend on the variable Weight, observations with missing values for Weight are not added to the data set. Consequently, there are 157 observations in the SAS data set Fish.
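As a quick check of these definitions, the derived variables can be computed by hand for a single fish. The following sketch applies the same formulas as the DATA step in this example to the first record in the data lines (a bream that weighs 242.0). The use of `data _null_` and the PUT statement to write the values to the log is a choice made here for illustration only.

```sas
/*--- illustrative check: derived variables for one fish ---*/
data _null_;
   /* raw measurements copied from the first record in the data lines */
   Weight=242.0; Length1=23.2; Length3=30.0; HtPct=38.4; WidthPct=13.4;

   Weight3=Weight**(1/3);                 /* cube root of the weight          */
   Height=HtPct*Length3/(Weight3*100);    /* percentage -> length, rescaled   */
   Width=WidthPct*Length3/(Weight3*100);
   logLengthRatio=log(Length3/Length1);   /* ratio is unaffected by rescaling */

   put Weight3= Height= Width= logLengthRatio=;
run;
```

In the DATA step for the Fish data set, Length1 and Length3 are divided by Weight3 before logLengthRatio is computed. Because both lengths are divided by the same quantity, their ratio, and therefore logLengthRatio, is the same as in this sketch.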

Before you perform a cluster analysis on coordinate data, it is necessary to consider scaling or transforming the variables since variables with large variances tend to have a larger effect on the resulting clusters than variables with small variances do.

This example uses three different approaches to standardize or transform the data prior to the cluster analysis. The first approach uses several standardization methods provided in the STDIZE procedure. However, since standardization is not always appropriate prior to the clustering (see Milligan and Cooper (1987) for a Monte Carlo study on various methods of variable standardization), the second approach performs the cluster analysis with no standardization. The third approach invokes the ACECLUS procedure to transform the data by using an estimated within-cluster covariance matrix.

The clustering is performed by the FASTCLUS procedure to find seven clusters. Note that the variables Length2 and Length3 are eliminated from this analysis since they both are significantly and highly correlated with the variable Length1. The correlation coefficients are 0.9958 and 0.9604, respectively. An output data set is created, and the FREQ procedure is invoked to compare the clusters with the species classification.

The following PROC FORMAT step and DATA step create the Fish data set:

```
proc format;
   value specfmt
      1='Bream'
      2='Roach'
      3='Whitefish'
      4='Parkki'
      5='Perch'
      6='Pike'
      7='Smelt';
run;
```
```
data Fish (drop=HtPct WidthPct);
   title 'Fish Measurement Data';
   input Species Weight Length1 Length2 Length3 HtPct
         WidthPct @@;

   /*--- skip fish whose weight is missing or not positive ---*/
   if Weight <= 0 or Weight=. then delete;

   /*--- one-dimensional size measure and rescaled dimensions ---*/
   Weight3=Weight**(1/3);
   Height=HtPct*Length3/(Weight3*100);
   Width=WidthPct*Length3/(Weight3*100);
   Length1=Length1/Weight3;
   Length2=Length2/Weight3;
   Length3=Length3/Weight3;
   logLengthRatio=log(Length3/Length1);

   format Species specfmt.;
   symbol = put(Species, specfmt.);
   datalines;
1  242.0 23.2 25.4 30.0 38.4 13.4 1  290.0 24.0 26.3 31.2 40.0 13.8
1  340.0 23.9 26.5 31.1 39.8 15.1 1  363.0 26.3 29.0 33.5 38.0 13.3
1  430.0 26.5 29.0 34.0 36.6 15.1 1  450.0 26.8 29.7 34.7 39.2 14.2
1  500.0 26.8 29.7 34.5 41.1 15.3 1  390.0 27.6 30.0 35.0 36.2 13.4
1  450.0 27.6 30.0 35.1 39.9 13.8 1  500.0 28.5 30.7 36.2 39.3 13.7
1  475.0 28.4 31.0 36.2 39.4 14.1 1  500.0 28.7 31.0 36.2 39.7 13.3
1  500.0 29.1 31.5 36.4 37.8 12.0 1     .  29.5 32.0 37.3 37.3 13.6
1  600.0 29.4 32.0 37.2 40.2 13.9 1  600.0 29.4 32.0 37.2 41.5 15.0
1  700.0 30.4 33.0 38.3 38.8 13.8 1  700.0 30.4 33.0 38.5 38.8 13.5
1  610.0 30.9 33.5 38.6 40.5 13.3 1  650.0 31.0 33.5 38.7 37.4 14.8
1  575.0 31.3 34.0 39.5 38.3 14.1 1  685.0 31.4 34.0 39.2 40.8 13.7
1  620.0 31.5 34.5 39.7 39.1 13.3 1  680.0 31.8 35.0 40.6 38.1 15.1
1  700.0 31.9 35.0 40.5 40.1 13.8 1  725.0 31.8 35.0 40.9 40.0 14.8

... more lines ...

7   19.7 13.2 14.3 15.2 18.9 13.6 7   19.9 13.8 15.0 16.2 18.1 11.6
;
```
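The correlations cited earlier as the reason for dropping Length2 and Length3 could be checked with PROC CORR once the Fish data set exists. The following is a minimal sketch. It uses the rescaled length variables as stored in Fish, and the source does not state whether the quoted coefficients were computed on the rescaled or the raw lengths, so the values it prints might not match them exactly.

```sas
/*--- illustrative check of the correlations among the length variables ---*/
proc corr data=Fish nosimple;
   var Length1 Length2 Length3;
run;
```

The NOSIMPLE option only suppresses the table of simple descriptive statistics so that the correlation matrix is easier to find in the output.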

The following macro, Std, standardizes the Fish data. The macro reads a single argument, mtd, which selects the METHOD= specification to be used in PROC STDIZE.

```
/*--- macro for standardization ---*/

%macro Std(mtd);
   title2 "Data are Standardized by PROC STDIZE with METHOD= &mtd";
   proc stdize data=fish out=sdzout method=&mtd;
      var Length1 logLengthRatio Height Width Weight3;
   run;
%mend Std;
```
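To see what a given method does to the variables, the standardized data set can be summarized with PROC MEANS. The sketch below assumes the Std macro and the Fish data set defined above. With METHOD=STD, each variable is centered at its mean and divided by its standard deviation, so the reported means should be close to 0 and the standard deviations close to 1.

```sas
/*--- illustrative check of the standardized variables ---*/
%Std(STD);

proc means data=sdzout n mean std min max maxdec=4;
   var Length1 logLengthRatio Height Width Weight3;
run;
```

Calling `%Std(RANGE)` instead and rerunning the PROC MEANS step should show minimums of 0 and maximums of 1, because METHOD=RANGE subtracts the minimum and divides by the range.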

The following macro, FastFreq, includes a PROC FASTCLUS statement for performing cluster analysis and a PROC FREQ statement for crosstabulating species with the cluster membership information that is derived from the previous PROC FASTCLUS statement. The macro reads a single argument, ds, which selects the input data set to be used in PROC FASTCLUS.

```
/*--- macro for clustering and crosstabulating ---*/
/*--- cluster membership with species          ---*/
%macro FastFreq(ds);
   proc fastclus data=&ds out=clust maxclusters=7 maxiter=100 noprint;
      var Length1 logLengthRatio Height Width Weight3;
   run;

   proc freq data=clust;
      tables species*cluster;
   run;
%mend FastFreq;
```

The following analysis (labeled ‘Approach 1’) includes 18 different methods of standardization followed by clustering. Since there is a large amount of output from this approach, only results from METHOD=STD, METHOD=RANGE, METHOD=AGK(0.14), and METHOD=SPACING(0.14) are shown. The following statements produce Output 82.1.1 through Output 82.1.4.

```
/**********************************************************/
/*                                                        */
/*     Approach 1: data are standardized by PROC STDIZE   */
/*                                                        */
/**********************************************************/

%Std(MEAN);
%FastFreq(sdzout);

%Std(MEDIAN);
%FastFreq(sdzout);

%Std(SUM);
%FastFreq(sdzout);

%Std(EUCLEN);
%FastFreq(sdzout);

%Std(USTD);
%FastFreq(sdzout);

%Std(STD);
%FastFreq(sdzout);

%Std(RANGE);
%FastFreq(sdzout);

%Std(MIDRANGE);
%FastFreq(sdzout);

%Std(MAXABS);
%FastFreq(sdzout);

%Std(IQR);
%FastFreq(sdzout);

%Std(MAD);
%FastFreq(sdzout);

%Std(AGK(.14));
%FastFreq(sdzout);

%Std(SPACING(.14));
%FastFreq(sdzout);

%Std(ABW(5));
%FastFreq(sdzout);

%Std(AWAVE(5));
%FastFreq(sdzout);

%Std(L(1));
%FastFreq(sdzout);

%Std(L(1.5));
%FastFreq(sdzout);

%Std(L(2));
%FastFreq(sdzout);
```
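The repeated pairs of macro calls could also be generated by a loop. The following sketch defines a hypothetical helper macro, RunAll, that is not part of the original example. It assumes that the Std and FastFreq macros are already defined and that the method names are separated by blanks. Methods that take a parenthesized argument, such as AGK(.14), are left out of the list to avoid macro quoting questions and can still be run with explicit calls.

```sas
/*--- hypothetical helper: run Std and FastFreq for each method in a list ---*/
%macro RunAll(methods);
   %local i mtd;
   %let i=1;
   %let mtd=%scan(&methods, &i, %str( ));    /* first blank-delimited name */
   %do %while(%length(&mtd) > 0);
      %Std(&mtd);
      %FastFreq(sdzout);
      %let i=%eval(&i + 1);
      %let mtd=%scan(&methods, &i, %str( )); /* empty when the list ends   */
   %end;
%mend RunAll;

%RunAll(MEAN MEDIAN SUM EUCLEN USTD STD RANGE MIDRANGE MAXABS IQR);
```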

Output 82.1.1 Data Are Standardized by PROC STDIZE with METHOD=STD
Fish Measurement Data

Data are Standardized by PROC STDIZE with METHOD= STD

The FREQ Procedure

Table of Species by CLUSTER. Columns 1 through 7 are values of CLUSTER (Cluster). Each cell lists Frequency / Percent / Row Pct / Col Pct; a lone 0 marks an empty cell in which all four values are 0. The Total column and row list Frequency / Percent.

| Species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Bream | 0 | 0 | 0 | 0 | 0 | 34 / 21.66 / 100 / 100 | 0 | 34 / 21.66 |
| Roach | 0 | 0 | 0 | 0 | 0 | 0 | 19 / 12.1 / 100 / 38 | 19 / 12.10 |
| Whitefish | 0 | 2 / 1.27 / 33.33 / 10.53 | 0 | 1 / 0.64 / 16.67 / 7.69 | 0 | 0 | 3 / 1.91 / 50 / 6 | 6 / 3.82 |
| Parkki | 0 | 0 | 0 | 0 | 11 / 7.01 / 100 / 100 | 0 | 0 | 11 / 7.01 |
| Perch | 0 | 17 / 10.83 / 30.36 / 89.47 | 0 | 12 / 7.64 / 21.43 / 92.31 | 0 | 0 | 27 / 17.2 / 48.21 / 54 | 56 / 35.67 |
| Pike | 17 / 10.83 / 100 / 100 | 0 | 0 | 0 | 0 | 0 | 0 | 17 / 10.83 |
| Smelt | 0 | 0 | 13 / 8.28 / 92.86 / 100 | 0 | 0 | 0 | 1 / 0.64 / 7.14 / 2 | 14 / 8.92 |
| Total | 17 / 10.83 | 19 / 12.1 | 13 / 8.28 | 13 / 8.28 | 11 / 7.01 | 34 / 21.66 | 50 / 31.85 | 157 / 100 |

Output 82.1.2 Data Are Standardized by PROC STDIZE with METHOD=RANGE
Fish Measurement Data

Data are Standardized by PROC STDIZE with METHOD= RANGE

The FREQ Procedure

Table of Species by CLUSTER. Columns 1 through 7 are values of CLUSTER (Cluster). Each cell lists Frequency / Percent / Row Pct / Col Pct; a lone 0 marks an empty cell in which all four values are 0. The Total column and row list Frequency / Percent.

| Species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Bream | 0 | 0 | 34 / 21.66 / 100 / 100 | 0 | 0 | 0 | 0 | 34 / 21.66 |
| Roach | 0 | 0 | 0 | 19 / 12.1 / 100 / 61.29 | 0 | 0 | 0 | 19 / 12.10 |
| Whitefish | 0 | 0 | 0 | 3 / 1.91 / 50 / 9.68 | 3 / 1.91 / 50 / 13.04 | 0 | 0 | 6 / 3.82 |
| Parkki | 0 | 0 | 0 | 0 | 0 | 11 / 7.01 / 100 / 100 | 0 | 11 / 7.01 |
| Perch | 0 | 0 | 0 | 9 / 5.73 / 16.07 / 29.03 | 20 / 12.74 / 35.71 / 86.96 | 0 | 27 / 17.2 / 48.21 / 100 | 56 / 35.67 |
| Pike | 17 / 10.83 / 100 / 100 | 0 | 0 | 0 | 0 | 0 | 0 | 17 / 10.83 |
| Smelt | 0 | 14 / 8.92 / 100 / 100 | 0 | 0 | 0 | 0 | 0 | 14 / 8.92 |
| Total | 17 / 10.83 | 14 / 8.92 | 34 / 21.66 | 31 / 19.75 | 23 / 14.65 | 11 / 7.01 | 27 / 17.2 | 157 / 100 |

Output 82.1.3 Data Are Standardized by PROC STDIZE with METHOD=AGK(0.14)
Fish Measurement Data

Data are Standardized by PROC STDIZE with METHOD= AGK(.14)

The FREQ Procedure

Table of Species by CLUSTER. Columns 1 through 7 are values of CLUSTER (Cluster). Each cell lists Frequency / Percent / Row Pct / Col Pct; a lone 0 marks an empty cell in which all four values are 0. The Total column and row list Frequency / Percent.

| Species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Bream | 0 | 0 | 34 / 21.66 / 100 / 100 | 0 | 0 | 0 | 0 | 34 / 21.66 |
| Roach | 0 | 0 | 0 | 17 / 10.83 / 89.47 / 73.91 | 0 | 0 | 2 / 1.27 / 10.53 / 5.71 | 19 / 12.10 |
| Whitefish | 0 | 0 | 0 | 3 / 1.91 / 50 / 13.04 | 0 | 3 / 1.91 / 50 / 13.04 | 0 | 6 / 3.82 |
| Parkki | 11 / 7.01 / 100 / 100 | 0 | 0 | 0 | 0 | 0 | 0 | 11 / 7.01 |
| Perch | 0 | 0 | 0 | 3 / 1.91 / 5.36 / 13.04 | 0 | 20 / 12.74 / 35.71 / 86.96 | 33 / 21.02 / 58.93 / 94.29 | 56 / 35.67 |
| Pike | 0 | 0 | 0 | 0 | 17 / 10.83 / 100 / 100 | 0 | 0 | 17 / 10.83 |
| Smelt | 0 | 14 / 8.92 / 100 / 100 | 0 | 0 | 0 | 0 | 0 | 14 / 8.92 |
| Total | 11 / 7.01 | 14 / 8.92 | 34 / 21.66 | 23 / 14.65 | 17 / 10.83 | 23 / 14.65 | 35 / 22.29 | 157 / 100 |

Output 82.1.4 Data Are Standardized by PROC STDIZE with METHOD=SPACING(0.14)
Fish Measurement Data

Data are Standardized by PROC STDIZE with METHOD= SPACING(.14)

The FREQ Procedure

Table of Species by CLUSTER. Columns 1 through 7 are values of CLUSTER (Cluster). Each cell lists Frequency / Percent / Row Pct / Col Pct; a lone 0 marks an empty cell in which all four values are 0. The Total column and row list Frequency / Percent.

| Species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Bream | 0 | 0 | 0 | 0 | 0 | 0 | 34 / 21.66 / 100 / 100 | 34 / 21.66 |
| Roach | 0 | 0 | 0 | 17 / 10.83 / 89.47 / 85 | 0 | 2 / 1.27 / 10.53 / 5.26 | 0 | 19 / 12.10 |
| Whitefish | 3 / 1.91 / 50 / 13.04 | 0 | 0 | 3 / 1.91 / 50 / 15 | 0 | 0 | 0 | 6 / 3.82 |
| Parkki | 0 | 0 | 11 / 7.01 / 100 / 100 | 0 | 0 | 0 | 0 | 11 / 7.01 |
| Perch | 20 / 12.74 / 35.71 / 86.96 | 0 | 0 | 0 | 0 | 36 / 22.93 / 64.29 / 94.74 | 0 | 56 / 35.67 |
| Pike | 0 | 17 / 10.83 / 100 / 100 | 0 | 0 | 0 | 0 | 0 | 17 / 10.83 |
| Smelt | 0 | 0 | 0 | 0 | 14 / 8.92 / 100 / 100 | 0 | 0 | 14 / 8.92 |
| Total | 23 / 14.65 | 17 / 10.83 | 11 / 7.01 | 20 / 12.74 | 14 / 8.92 | 38 / 24.2 | 34 / 21.66 | 157 / 100 |

The following analysis (labeled ‘Approach 2’) applies the cluster analysis directly to the original data. The following statements produce Output 82.1.5.

```
/**********************************************************/
/*                                                        */
/*         Approach 2: data are untransformed             */
/*                                                        */
/**********************************************************/

title2 'Data are Untransformed';
%FastFreq(fish);
```

Output 82.1.5 Untransformed Data
Fish Measurement Data

Data are Untransformed

The FREQ Procedure

Table of Species by CLUSTER. Columns 1 through 7 are values of CLUSTER (Cluster). Each cell lists Frequency / Percent / Row Pct / Col Pct; a lone 0 marks an empty cell in which all four values are 0. The Total column and row list Frequency / Percent.

| Species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Bream | 13 / 8.28 / 38.24 / 44.83 | 0 | 0 | 0 | 0 | 0 | 21 / 13.38 / 61.76 / 47.73 | 34 / 21.66 |
| Roach | 3 / 1.91 / 15.79 / 10.34 | 4 / 2.55 / 21.05 / 25 | 0 | 0 | 12 / 7.64 / 63.16 / 30.77 | 0 | 0 | 19 / 12.10 |
| Whitefish | 3 / 1.91 / 50 / 10.34 | 0 | 0 | 0 | 0 | 0 | 3 / 1.91 / 50 / 6.82 | 6 / 3.82 |
| Parkki | 2 / 1.27 / 18.18 / 6.9 | 3 / 1.91 / 27.27 / 18.75 | 0 | 0 | 6 / 3.82 / 54.55 / 15.38 | 0 | 0 | 11 / 7.01 |
| Perch | 8 / 5.1 / 14.29 / 27.59 | 9 / 5.73 / 16.07 / 56.25 | 0 | 1 / 0.64 / 1.79 / 6.67 | 20 / 12.74 / 35.71 / 51.28 | 0 | 18 / 11.46 / 32.14 / 40.91 | 56 / 35.67 |
| Pike | 0 | 0 | 10 / 6.37 / 58.82 / 100 | 0 | 1 / 0.64 / 5.88 / 2.56 | 4 / 2.55 / 23.53 / 100 | 2 / 1.27 / 11.76 / 4.55 | 17 / 10.83 |
| Smelt | 0 | 0 | 0 | 14 / 8.92 / 100 / 93.33 | 0 | 0 | 0 | 14 / 8.92 |
| Total | 29 / 18.47 | 16 / 10.19 | 10 / 6.37 | 15 / 9.55 | 39 / 24.84 | 4 / 2.55 | 44 / 28.03 | 157 / 100 |

The following analysis (labeled ‘Approach 3’) transforms the original data with the ACECLUS procedure and creates an output data set (ace) that is used as the input data set for the cluster analysis. The following statements produce Output 82.1.6.

```
/**********************************************************/
/*                                                        */
/*    Approach 3: data are transformed by PROC ACECLUS    */
/*                                                        */
/**********************************************************/

title2 'Data are Transformed by PROC ACECLUS';
proc aceclus data=fish out=ace p=.02 noprint;
var Length1 logLengthRatio Height Width Weight3;
run;
%FastFreq(ace);
```
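The OUT= data set from PROC ACECLUS holds the original variables together with the canonical variables, which are named Can1, Can2, and so on by default. The FastFreq macro names the original variables in its VAR statement, so a variant that clusters on the canonical variables themselves might look like the following sketch. The data set name clustcan is hypothetical, five canonical variables are assumed (one per analysis variable), and the results of this variant are not among the outputs shown in this example.

```sas
/*--- illustrative variant: cluster on the canonical variables ---*/
proc fastclus data=ace out=clustcan maxclusters=7 maxiter=100 noprint;
   var can1-can5;            /* canonical variables written by PROC ACECLUS */
run;

proc freq data=clustcan;
   tables Species*Cluster;   /* same crosstabulation as in FastFreq */
run;
```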

Output 82.1.6 Data Are Transformed by PROC ACECLUS
Fish Measurement Data

Data are Transformed by PROC ACECLUS

The FREQ Procedure

Table of Species by CLUSTER. Columns 1 through 7 are values of CLUSTER (Cluster). Each cell lists Frequency / Percent / Row Pct / Col Pct; a lone 0 marks an empty cell in which all four values are 0. The Total column and row list Frequency / Percent.

| Species | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Total |
|---|---|---|---|---|---|---|---|---|
| Bream | 13 / 8.28 / 38.24 / 44.83 | 0 | 0 | 0 | 0 | 0 | 21 / 13.38 / 61.76 / 47.73 | 34 / 21.66 |
| Roach | 3 / 1.91 / 15.79 / 10.34 | 4 / 2.55 / 21.05 / 25 | 0 | 0 | 12 / 7.64 / 63.16 / 30.77 | 0 | 0 | 19 / 12.10 |
| Whitefish | 3 / 1.91 / 50 / 10.34 | 0 | 0 | 0 | 0 | 0 | 3 / 1.91 / 50 / 6.82 | 6 / 3.82 |
| Parkki | 2 / 1.27 / 18.18 / 6.9 | 3 / 1.91 / 27.27 / 18.75 | 0 | 0 | 6 / 3.82 / 54.55 / 15.38 | 0 | 0 | 11 / 7.01 |
| Perch | 8 / 5.1 / 14.29 / 27.59 | 9 / 5.73 / 16.07 / 56.25 | 0 | 1 / 0.64 / 1.79 / 6.67 | 20 / 12.74 / 35.71 / 51.28 | 0 | 18 / 11.46 / 32.14 / 40.91 | 56 / 35.67 |
| Pike | 0 | 0 | 10 / 6.37 / 58.82 / 100 | 0 | 1 / 0.64 / 5.88 / 2.56 | 4 / 2.55 / 23.53 / 100 | 2 / 1.27 / 11.76 / 4.55 | 17 / 10.83 |
| Smelt | 0 | 0 | 0 | 14 / 8.92 / 100 / 93.33 | 0 | 0 | 0 | 14 / 8.92 |
| Total | 29 / 18.47 | 16 / 10.19 | 10 / 6.37 | 15 / 9.55 | 39 / 24.84 | 4 / 2.55 | 44 / 28.03 | 157 / 100 |

Table 82.4 summarizes the classification results. In this table, the first column represents the standardization method, the second column represents the number of clusters that the seven species are classified into, and the third column represents the total number of observations that are misclassified.

Table 82.4 Summary of Clustering Results

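| Method of Standardization | Number of Clusters | Misclassification |
|---|---|---|
| MEAN | 5 | 71 |
| MEDIAN | 5 | 71 |
| SUM | 6 | 51 |
| EUCLEN | 6 | 45 |
| USTD | 6 | 45 |
| STD | 5 | 33 |
| RANGE | 7 | 32 |
| MIDRANGE | 7 | 32 |
| MAXABS | 7 | 26 |
| IQR | 5 | 28 |
| MAD | 4 | 35 |
| ABW(5) | 6 | 34 |
| AWAVE(5) | 6 | 29 |
| AGK(0.14) | 7 | 28 |
| SPACING(0.14) | 7 | 25 |
| L(1) | 6 | 41 |
| L(1.5) | 5 | 33 |
| L(2) | 5 | 33 |
| untransformed | 5 | 71 |
| PROC ACECLUS | 5 | 71 |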

Consider the results displayed in Output 82.1.1. In that analysis, the method of standardization is STD, and the number of clusters and the number of misclassifications are computed as shown in Table 82.5.

Table 82.5 Computations of Numbers of Clusters and Misclassification When Standardization Method Is STD

| Species | Cluster Number | Misclassification in Each Species |
|---|---|---|
| Bream | 6 | 0 |
| Roach | 7 | 0 |
| Whitefish | 7 | 3 |
| Parkki | 5 | 0 |
| Perch | 7 | 29 |
| Pike | 1 | 0 |
| Smelt | 3 | 1 |

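In Output 82.1.1, the bream species is classified as cluster 6 since all 34 bream are categorized into cluster 6 with no misclassification. A similar pattern is seen with the roach, parkki, and pike species. The smelt species is classified as cluster 3 with one misclassification, since 13 of the 14 smelt are categorized into cluster 3.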

For the whitefish species, two fish are categorized into cluster 2, one fish is categorized into cluster 4, and three fish are categorized into cluster 7. Because the majority of this species is categorized into cluster 7, it is recorded in Table 82.5 as being classified as cluster 7 with 3 misclassifications. A similar pattern is seen with the perch species: it is classified as cluster 7 with 29 misclassifications.

In summary, when the standardization method is STD, seven species of fish are classified into only five clusters and the total number of misclassified observations is 33.
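The tally described above can also be computed from the clust data set that the FastFreq macro leaves behind. The following sketch is one possible way to do it and is not part of the original example. The data set names counts, tally, and majority are hypothetical, and each species is assigned to the cluster that holds most of its fish. When two clusters tie for a species, the misclassification count is unaffected, but the choice of majority cluster, and hence the number of distinct clusters, depends on the sort order.

```sas
/*--- illustrative tally of clusters and misclassifications ---*/
proc freq data=clust noprint;
   tables Species*Cluster / out=counts;   /* nonempty cells with COUNT */
run;

proc sort data=counts;
   by Species descending Count;           /* largest cell first within each species */
run;

data tally;
   set counts end=last;
   by Species;
   /* every cell after the first one lies outside the majority cluster */
   if not first.Species then Misclassified + Count;
   if last then output;
   keep Misclassified;
run;

data majority;
   set counts;
   by Species;
   if first.Species;                      /* one row per species: its majority cluster */
run;

proc sql;
   select count(distinct Cluster) as NumClusters from majority;
quit;

proc print data=tally noobs;
run;
```

If the clust data set comes from a run that reproduces Output 82.1.1, the tally should agree with the five clusters and 33 misclassified observations reported for METHOD=STD.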

The results of this analysis demonstrate that when variables are standardized by the STDIZE procedure with the methods RANGE, MIDRANGE, MAXABS, AGK(0.14), and SPACING(0.14), the FASTCLUS procedure produces the correct number of clusters (seven) and generally fewer misclassifications than it does when other standardization methods are used. The SPACING method attains the best result (25 misclassified observations), probably because the variables Length1 and Height both exhibit marked groupings (bimodality) in their distributions.
