The DISTANCE Procedure

Example 36.2 Financial Data – Stock Dividends

The following data set contains the average dividend yields for 15 utility stocks in the United States. Each observation is a company, and the variables are the annual dividend yields for the years 1986–1990. The objective is to group similar stocks into clusters.

Before the cluster analysis is performed, correlation is chosen as the similarity measure for the closeness between each pair of observations. Because PROC CLUSTER requires distance-type measures, METHOD=DCORR is specified in the PROC DISTANCE statement to transform the correlation measures into distance measures. Notice that in Output 36.2.1, all the values in the distance matrix are between 0 and 2; the largest is 1.35581.
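
To make the transformation concrete: the values in Output 36.2.1 are consistent with DCORR being the square root of one minus the Pearson correlation between two observations, taken across the five dividend variables. The symbols $r_{xy}$, $d_{xy}$, $S_{xy}$, $S_{xx}$, and $S_{yy}$ are notation introduced here for illustration, and the check below is a hand calculation for one pair of companies, not output from the procedure.

$$
d_{xy} = \sqrt{1 - r_{xy}}, \qquad -1 \le r_{xy} \le 1 \;\Longrightarrow\; 0 \le d_{xy} \le \sqrt{2} \approx 1.414
$$

For Cincinnati G&E ($x$, five-year mean 8.22) and Texas Utilities ($y$, five-year mean 8.88), the sums of cross-products and squares of the deviations from each company's own mean are $S_{xy} = 0.222$, $S_{xx} = 0.128$, and $S_{yy} = 3.608$, so

$$
r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{0.222}{\sqrt{0.128 \times 3.608}} \approx 0.3267, \qquad d_{xy} = \sqrt{1 - 0.3267} \approx 0.8206
$$

which agrees with the entry 0.82056 for this pair in Output 36.2.1.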

PROC CLUSTER performs hierarchical clustering by using agglomerative methods based on the distance data created by the preceding PROC DISTANCE step. Because the cubic clustering criterion is not suitable for distance data, the pseudo F statistic is used instead to identify the number of clusters.
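
For reference, the pseudo F statistic is commonly written in the following form. The symbols are introduced here for illustration and do not appear in the example: $n$ is the number of observations (15 in this example), $G$ is the number of clusters at a given level of the hierarchy, $T$ is the total sum of squares, and $P_G$ is the within-cluster sum of squares summed over the $G$ clusters. How the sums of squares are derived from distance input is not covered by this sketch.

$$
F_{\text{pseudo}} = \frac{(T - P_G)/(G - 1)}{P_G/(n - G)}
$$

Relatively large values of the statistic indicate well-separated clusters, which is why the plot against the number of clusters is inspected for peaks.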

Two clustering methods are used: Ward’s method and average linkage. Because the pseudo $t^2$ statistic contains many missing values under both methods, only the plot of the pseudo F statistic versus the number of clusters is requested, along with the dendrogram, by specifying PLOTS(ONLY)=(PSF DENDROGRAM) in the PROC CLUSTER statement.

Both Output 36.2.2 and Output 36.2.3 suggest four clusters. Both methods produce the same clustering result, as shown in Output 36.2.4 and Output 36.2.5. The four clusters are as follows:

  • Cincinnati G&E and Detroit Edison

  • Texas Utilities and Pennsylvania Power & Light

  • Union Electric, Iowa-Ill Gas & Electric, Oklahoma Gas & Electric, and Wisconsin Energy

  • Orange & Rockland Utilities, Kentucky Utilities, Kansas Power & Light, Allegheny Power, Green Mountain Power, Dominion Resources, and Minnesota Power & Light

title 'Stock Dividends';

data stock;
   length Company $ 27;
   input Company &$  Div_1986 Div_1987 Div_1988 Div_1989 Div_1990;
   datalines;
Cincinnati G&E               8.4    8.2    8.4    8.1    8.0
Texas Utilities              7.9    8.9   10.4    8.9    8.3
Detroit Edison               9.7   10.7   11.4    7.8    6.5
Orange & Rockland Utilities  6.5    7.2    7.3    7.7    7.9
Kentucky Utilities           6.5    6.9    7.0    7.2    7.5
Kansas Power & Light         5.9    6.4    6.9    7.4    8.0
Union Electric               7.1    7.5    8.4    7.8    7.7
Dominion Resources           6.7    6.9    7.0    7.0    7.4
Allegheny Power              6.7    7.3    7.8    7.9    8.3
Minnesota Power & Light      5.6    6.1    7.2    7.0    7.5
Iowa-Ill Gas & Electric      7.1    7.5    8.5    7.8    8.0
Pennsylvania Power & Light   7.2    7.6    7.7    7.4    7.1
Oklahoma Gas & Electric      6.1    6.7    7.4    6.7    6.8
Wisconsin Energy             5.1    5.7    6.0    5.7    5.9
Green Mountain Power         7.1    7.4    7.8    7.8    8.3
;
proc distance data=stock method=dcorr out=distdcorr;
   var interval(div_1986 div_1987 div_1988 div_1989 div_1990);
   id company;
run;
proc print data=distdcorr;
   id company;
   title2 'Distance Matrix for 15 Utility Stocks';
run;
title2;
ods graphics on;

/* Ward method: compute the pseudo F statistic for each number of clusters, then plot it with the dendrogram */
proc cluster data=distdcorr method=ward pseudo plots(only)=(psf dendrogram);
   id company;
run;
/* Average linkage: compute the pseudo F statistic for each number of clusters, then plot it with the dendrogram */
proc cluster data=distdcorr method=average pseudo plots(only)=(psf dendrogram);
   id company;
run;

ods graphics off;
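
As a possible follow-up to the steps above, the four-cluster solution could be written to a data set rather than read off the dendrogram. The following is a minimal sketch, not part of the original example: it assumes that the OUTTREE= option of PROC CLUSTER and the NCLUSTERS= and OUT= options of PROC TREE are used in the usual way, and the data set names wardtree and wardclus are hypothetical.

```sas
/* Sketch only: wardtree and wardclus are hypothetical data set names. */
/* Save the Ward tree built from the DCORR distance matrix.            */
proc cluster data=distdcorr method=ward outtree=wardtree noprint;
   id company;
run;

/* Cut the tree at four clusters; OUT= receives one row per company    */
/* with its cluster assignment in the variable CLUSTER.                */
proc tree data=wardtree nclusters=4 out=wardclus noprint;
   id company;
run;

/* List the companies grouped by assigned cluster. */
proc sort data=wardclus;
   by cluster;
run;

proc print data=wardclus;
   var company cluster;
   title2 'Four-Cluster Solution from the Ward Method';
run;
title2;
```

Replacing METHOD=WARD with METHOD=AVERAGE in the first step would give the corresponding assignment for average linkage, which the example reports to be the same grouping.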

Output 36.2.1: Distance Matrix Based on the DCORR Coefficient

Stock Dividends
Distance Matrix for 15 Utility Stocks

Company Cincinnati_G_E Texas_Utilities Detroit_Edison Orange___Rockland_Utilities Kentucky_Utilities Kansas_Power___Light Union_Electric Dominion_Resources Allegheny_Power Minnesota_Power___Light Iowa_Ill_Gas___Electric Pennsylvania_Power___Light Oklahoma_Gas___Electric Wisconsin_Energy Green_Mountain_Power
Cincinnati G&E 0.00000 . . . . . . . . . . . . . .
Texas Utilities 0.82056 0.00000 . . . . . . . . . . . . .
Detroit Edison 0.40511 0.65453 0.00000 . . . . . . . . . . . .
Orange & Rockland Utilities 1.35380 0.88583 1.27306 0.00000 . . . . . . . . . . .
Kentucky Utilities 1.35581 0.92539 1.29382 0.12268 0.00000 . . . . . . . . . .
Kansas Power & Light 1.34227 0.94371 1.31696 0.19905 0.12874 0.00000 . . . . . . . . .
Union Electric 0.98516 0.29043 0.89048 0.68798 0.71824 0.72082 0.00000 . . . . . . . .
Dominion Resources 1.32945 0.96853 1.29016 0.33290 0.21510 0.24189 0.76587 0.00000 . . . . . . .
Allegheny Power 1.30492 0.81666 1.24565 0.17844 0.15759 0.17029 0.58452 0.27819 0.00000 . . . . . .
Minnesota Power & Light 1.24069 0.74082 1.20432 0.32581 0.30462 0.27231 0.48372 0.35733 0.15615 0.00000 . . . . .
Iowa-Ill Gas & Electric 1.04924 0.43100 0.97616 0.61166 0.61760 0.61736 0.16923 0.63545 0.47900 0.36368 0.00000 . . . .
Pennsylvania Power & Light 0.74931 0.37821 0.44256 1.03566 1.08878 1.12876 0.63285 1.14354 1.02358 0.99384 0.75596 0.00000 . . .
Oklahoma Gas & Electric 1.00604 0.30141 0.86200 0.68021 0.70259 0.73158 0.17122 0.72977 0.58391 0.50744 0.19673 0.60216 0.00000 . .
Wisconsin Energy 1.17988 0.54830 1.03081 0.45013 0.47184 0.53381 0.37405 0.51969 0.37522 0.36319 0.30259 0.76085 0.28070 0.00000 .
Green Mountain Power 1.30397 0.88063 1.27176 0.26948 0.17909 0.15377 0.64869 0.17360 0.13958 0.19370 0.52083 1.09269 0.64175 0.44814 0



Output 36.2.2: Pseudo F versus Number of Clusters When METHOD=WARD

Output 36.2.3: Pseudo F versus Number of Clusters When METHOD=AVERAGE

Output 36.2.4: Dendrogram of Semipartial R-Square Values When METHOD=WARD

Output 36.2.5: Dendrogram of Average Distance between Clusters When METHOD=AVERAGE
