Example 33.1 Divorce Grounds – the Jaccard Coefficient
A wide variety of distance and similarity measures are used in cluster analysis (Anderberg 1973; Sneath and Sokal 1973). If your data are in coordinate form and you want to use a non-Euclidean distance for clustering, you can compute a distance matrix by using the DISTANCE procedure.
Similarity measures must be converted to dissimilarities before being used in PROC CLUSTER. Such conversion can be done in a variety of ways, such as taking reciprocals or subtracting from a large value. The choice of conversion method depends on the application and the similarity measure. If applicable, PROC DISTANCE provides a corresponding dissimilarity measure for each similarity measure.
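When a similarity measure has no built-in dissimilarity counterpart, the conversion can be done in a DATA step. The following is a minimal sketch and is not part of the original example. It assumes the divorce data set that is created later in this example, and the data set names simjacc and dissim are hypothetical. It computes Jaccard similarities with METHOD=JACCARD and subtracts each one from 1. For this measure the result should agree with METHOD=DJACCARD, which is what the example itself uses.

```sas
/* Sketch only: simjacc and dissim are hypothetical data set names */
proc distance data=divorce method=jaccard absent=0 out=simjacc;
   var anominal(Incompatibility--Separation);
   id state;
run;

data dissim;
   set simjacc;
   /* All numeric variables in simjacc are columns of the similarity matrix */
   array s{*} _numeric_;
   do i = 1 to dim(s);
      /* Leave the missing upper triangle untouched */
      if not missing(s{i}) then s{i} = 1 - s{i};
   end;
   drop i;
run;
```

Subtracting from 1 works here because the Jaccard similarity cannot exceed 1. A measure with a different range would need a different constant or a different transformation, such as a reciprocal.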
In the following example, the observations are states. Binary-valued variables correspond to various grounds for divorce and indicate whether the grounds for divorce apply in each of the U.S. states. A value of "1" indicates that the ground for divorce applies, and a value of "0" indicates the opposite. The 0-0 matches are treated as totally irrelevant; therefore, each variable has an asymmetric nominal level of measurement. The absence value is 0.
The DISTANCE procedure is used to compute the Jaccard coefficient (Anderberg 1973, pp. 89, 115, and 117) between each pair of states. The Jaccard coefficient is defined as the number of variables that are coded as 1 for both states divided by the number of variables that are coded as 1 for either or both states. Since dissimilarity measures are required by PROC CLUSTER, the DJACCARD coefficient is selected. Output 33.1.1 displays the distance matrix between the first 10 states.
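The definition above can be written in terms of counts, as in the following sketch. The symbols a, b, and c are notation introduced here for illustration. The relation D = 1 - J is an assumption about how the dissimilarity is formed, but it is consistent with the values in Output 33.1.1. The bit patterns for Alabama and Alaska are taken from the data lines in this example.

```latex
% a    = number of variables coded 1 for both states
% b, c = number of variables coded 1 for only the first / only the second state
\[
J = \frac{a}{a + b + c}, \qquad D = 1 - J = \frac{b + c}{a + b + c}
\]

% Alabama = 111111111, Alaska = 111011110  =>  a = 7, b = 2, c = 0
\[
J = \frac{7}{9} \approx 0.77778, \qquad D = \frac{2}{9} \approx 0.22222
\]
```

The value 0.22222 is the Alaska-Alabama entry in Output 33.1.1. Variables that are coded 0 for both states appear in neither the numerator nor the denominator, which is why the 0-0 matches are irrelevant.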
The CENTROID method is used to perform the cluster analysis, and the resulting tree diagram from PROC CLUSTER is saved into the tree output data set. Output 33.1.2 displays the cluster history.
The TREE procedure generates nine clusters in the output data set out. After being sorted by state, the out data set is merged with the input data set divorce. The merged data set is then sorted by cluster and printed to display the cluster membership, as shown in Output 33.1.3.
The following statements produce Output 33.1.1 through Output 33.1.3:
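data divorce;
   /* Each data line holds two states: a 15-column name, then nine 1-digit flags */
   input State $15.
         (Incompatibility Cruelty Desertion Non_Support Alcohol
          Felony Impotence Insanity Separation) (1.) @@;
   /* After the first state on a line, skip 4 columns; after the second, release the line */
   if mod(_n_,2) then input +4 @@; else input;
   datalines;
Alabama        111111111    Alaska         111011110
Arizona        100000000    Arkansas       011111111
California     100000010    Colorado       100000000
... more lines ...
Wisconsin      100000001    Wyoming        100000011
;

title 'Grounds for Divorce';

/* Jaccard dissimilarity between each pair of states; 0 is the absence value */
proc distance data=divorce method=djaccard absent=0 out=distjacc;
   var anominal(Incompatibility--Separation);
   id state;
run;

proc print data=distjacc(obs=10);
   id state; var alabama--georgia;
   title2 'First 10 States';
run;
title2;

/* Centroid clustering on the distance matrix; save the tree for PROC TREE */
proc cluster data=distjacc method=centroid
             pseudo outtree=tree;
   id state;
   var alabama--wyoming;
run;

/* Cut the tree at nine clusters */
proc tree data=tree noprint n=9 out=out;
   id state;
run;

proc sort data=out;
   by state;
run;

/* Attach cluster membership to the original grounds-for-divorce variables */
data clus;
   merge divorce out;
   by state;
run;

proc sort data=clus;
   by cluster;
run;

proc print data=clus;
   id state;
   var Incompatibility--Separation;
   by cluster;
run;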
Output 33.1.1
Distance Matrix Based on the Jaccard Coefficient
State            Alabama       Alaska      Arizona     Arkansas   California     Colorado  Connecticut     Delaware      Florida      Georgia
Alabama          0.00000            .            .            .            .            .            .            .            .            .
Alaska           0.22222      0.00000            .            .            .            .            .            .            .            .
Arizona          0.88889      0.85714      0.00000            .            .            .            .            .            .            .
Arkansas         0.11111      0.33333      1.00000      0.00000            .            .            .            .            .            .
California       0.77778      0.71429      0.50000      0.88889      0.00000            .            .            .            .            .
Colorado         0.88889      0.85714      0.00000      1.00000      0.50000      0.00000            .            .            .            .
Connecticut      0.11111      0.33333      0.87500      0.22222      0.75000      0.87500      0.00000            .            .            .
Delaware         0.77778      0.87500      0.50000      0.88889      0.66667      0.50000      0.75000      0.00000            .            .
Florida          0.77778      0.71429      0.50000      0.88889      0.00000      0.50000      0.75000      0.66667      0.00000            .
Georgia          0.22222      0.00000      0.85714      0.33333      0.71429      0.85714      0.33333      0.87500      0.71429            0
Output 33.1.2
Clustering History
The CLUSTER Procedure
Centroid Hierarchical Cluster Analysis
Number of Clusters  Clusters Joined                 Freq  Pseudo F  Pseudo t-Squared  Norm Centroid Distance  Tie
                49  Arizona         Colorado           2         .                 .                       0  T
                48  California      Florida            2         .                 .                       0  T
                47  Alaska          Georgia            2         .                 .                       0  T
                46  Delaware        Hawaii             2         .                 .                       0  T
                45  Connecticut     Idaho              2         .                 .                       0  T
                44  CL49            Iowa               3         .                 .                       0  T
                43  CL47            Kansas             3         .                 .                       0  T
                42  CL44            Kentucky           4         .                 .                       0  T
                41  CL42            Michigan           5         .                 .                       0  T
                40  CL41            Minnesota          6         .                 .                       0  T
                39  CL43            Mississippi        4         .                 .                       0  T
                38  CL40            Missouri           7         .                 .                       0  T
                37  CL38            Montana            8         .                 .                       0  T
                36  CL37            Nebraska           9         .                 .                       0  T
                35  North Dakota    Oklahoma           2         .                 .                       0  T
                34  CL36            Oregon            10         .                 .                       0  T
                33  Massachusetts   Rhode Island       2         .                 .                       0  T
                32  New Hampshire   Tennessee          2         .                 .                       0  T
                31  CL46            Washington         3         .                 .                       0  T
                30  CL31            Wisconsin          4         .                 .                       0  T
                29  Nevada          Wyoming            2         .                 .                       0
                28  Alabama         Arkansas           2      1561                 .                  0.1599  T
                27  CL33            CL32               4       479                 .                  0.1799  T
                26  CL39            CL35               6       265                 .                  0.1799  T
                25  CL45            West Virginia      3       231                 .                  0.1799
                24  Maryland        Pennsylvania       2       199                 .                  0.2399
                23  CL28            Utah               3       167               3.2                  0.2468
                22  CL27            Ohio               5       136               5.4                  0.2698
                21  CL26            Maine              7       111               8.9                  0.2998
                20  CL23            CL21              10      75.2               8.7                  0.3004
                19  CL25            New Jersey         4      71.8               6.5                  0.3053  T
                18  CL19            Texas              5      69.1               2.5                  0.3077
                17  CL20            CL22              15      48.7               9.9                  0.3219
                16  New York        Virginia           2      50.1                 .                  0.3598
                15  CL18            Vermont            6      49.4               2.9                  0.3797
                14  CL17            Illinois          16      47.0               3.2                  0.4425
                13  CL14            CL15              22      29.2              15.3                  0.4722
                12  CL48            CL29               4      29.5                 .                  0.4797  T
                11  CL13            CL24              24      27.6               4.5                  0.5042
                10  CL11            South Dakota      25      28.4               2.4                  0.5449
                 9  Louisiana       CL16               3      30.3               3.5                  0.5844
                 8  CL34            CL30              14      23.3                 .                  0.7196
                 7  CL8             CL12              18      19.3              15.0                  0.7175
                 6  CL10            South Carolina    26      21.4               4.2                  0.7384
                 5  CL6             New Mexico        27      24.0               4.7                  0.8303
                 4  CL5             Indiana           28      28.9               4.1                  0.8343
                 3  CL4             CL9               31      31.7              10.9                  0.8472
                 2  CL3             North Carolina    32      55.1               4.1                  1.0017
                 1  CL2             CL7               50         .              55.1                  1.0663
Output 33.1.3
Cluster Membership
CLUSTER=1
State            Incompatibility  Cruelty  Desertion  Non_Support  Alcohol  Felony  Impotence  Insanity  Separation
Arizona                        1        0          0            0        0       0          0         0           0
Colorado                       1        0          0            0        0       0          0         0           0
Iowa                           1        0          0            0        0       0          0         0           0
Kentucky                       1        0          0            0        0       0          0         0           0
Michigan                       1        0          0            0        0       0          0         0           0
Minnesota                      1        0          0            0        0       0          0         0           0
Missouri                       1        0          0            0        0       0          0         0           0
Montana                        1        0          0            0        0       0          0         0           0
Nebraska                       1        0          0            0        0       0          0         0           0
Oregon                         1        0          0            0        0       0          0         0           0
CLUSTER=2
State            Incompatibility  Cruelty  Desertion  Non_Support  Alcohol  Felony  Impotence  Insanity  Separation
California                     1        0          0            0        0       0          0         1           0
Florida                        1        0          0            0        0       0          0         1           0
Nevada                         1        0          0            0        0       0          0         1           1
Wyoming                        1        0          0            0        0       0          0         1           1
CLUSTER=3
State            Incompatibility  Cruelty  Desertion  Non_Support  Alcohol  Felony  Impotence  Insanity  Separation
Alabama                        1        1          1            1        1       1          1         1           1
Alaska                         1        1          1            0        1       1          1         1           0
Arkansas                       0        1          1            1        1       1          1         1           1
Connecticut                    1        1          1            1        1       1          0         1           1
Georgia                        1        1          1            0        1       1          1         1           0
Idaho                          1        1          1            1        1       1          0         1           1
Illinois                       0        1          1            0        1       1          1         0           0
Kansas                         1        1          1            0        1       1          1         1           0
Maine                          1        1          1            1        1       0          1         1           0
Maryland                       0        1          1            0        0       1          1         1           1
Massachusetts                  1        1          1            1        1       1          1         0           1
Mississippi                    1        1          1            0        1       1          1         1           0
New Hampshire                  1        1          1            1        1       1          1         0           0
New Jersey                     0        1          1            0        1       1          0         1           1
North Dakota                   1        1          1            1        1       1          1         1           0
Ohio                           1        1          1            0        1       1          1         0           1
Oklahoma                       1        1          1            1        1       1          1         1           0
Pennsylvania                   0        1          1            0        0       1          1         1           0
Rhode Island                   1        1          1            1        1       1          1         0           1
South Dakota                   0        1          1            1        1       1          0         0           0
Tennessee                      1        1          1            1        1       1          1         0           0
Texas                          1        1          1            0        0       1          0         1           1
Utah                           0        1          1            1        1       1          1         1           0
Vermont                        0        1          1            1        0       1          0         1           1
West Virginia                  1        1          1            0        1       1          0         1           1
CLUSTER=4
State            Incompatibility  Cruelty  Desertion  Non_Support  Alcohol  Felony  Impotence  Insanity  Separation
Delaware                       1        0          0            0        0       0          0         0           1
Hawaii                         1        0          0            0        0       0          0         0           1
Washington                     1        0          0            0        0       0          0         0           1
Wisconsin                      1        0          0            0        0       0          0         0           1
CLUSTER=5
State            Incompatibility  Cruelty  Desertion  Non_Support  Alcohol  Felony  Impotence  Insanity  Separation
Louisiana                      0        0          0            0        0       1          0         0           1
New York                       0        1          1            0        0       1          0         0           1
Virginia                       0        1          0            0        0       1          0         0           1
CLUSTER=6
CLUSTER=7
CLUSTER=8
CLUSTER=9