The CORRESP Procedure

Example 34.1 Simple and Multiple Correspondence Analysis of Automobiles and Their Owners

In this example, PROC CORRESP creates a contingency table from categorical data and performs a simple correspondence analysis. The data are from a sample of individuals who were asked to provide information about themselves and their automobiles. The questions included origin of the automobile (American, Japanese, or European) and family status (single, married, single living with children, or married living with children).

The first steps read the input data and assign formats:

title1 'Automobile Owners and Auto Attributes';
title2 'Simple Correspondence Analysis';

proc format;
   value Origin  1 = 'American' 2 = 'Japanese' 3 = 'European';
   value Size    1 = 'Small'    2 = 'Medium'   3 = 'Large';
   value Type    1 = 'Family'   2 = 'Sporty'   3 = 'Work';
   value Home    1 = 'Own'      2 = 'Rent';
   value Sex     1 = 'Male'     2 = 'Female';
   value Income  1 = '1 Income' 2 = '2 Incomes';
   value Marital 1 = 'Single with Kids' 2 = 'Married with Kids'
                 3 = 'Single'           4 = 'Married';
run;

data Cars;
   missing a;
   input (Origin Size Type Home Income Marital Kids Sex) (1.) @@;
   * Check for End of Line;
   if n(of Origin -- Sex) eq 0 then do; input; return; end;
   marital = 2 * (kids le 0) + marital;
   format Origin Origin. Size Size. Type Type. Home Home.
          Sex Sex. Income Income. Marital Marital.;
   output;
   datalines;
131112212121110121112201131211011211221122112121131122123211222212212201
121122023121221232211101122122022121110122112102131112211121110112311101
211112113211223121122202221122111311123131211102321122223221220221221101
122122022121220211212201221122021122110132112202213112111331226122221101

   ... more lines ...   

212122011211122131221101121211022212220212121101
;
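The assignment to Marital in the DATA step folds the Kids variable into the family-status code. A comparison such as (kids le 0) evaluates to 1 when true and 0 when false, so owners without children have 2 added to their code. The following minimal sketch isolates that recode on four made-up records. The data set name RecodeDemo, the variable Status, and the reading of raw codes 1 and 2 as single and married are assumptions inferred from the Marital format labels; they are not stated in the example.

```sas
* Hypothetical illustration of the Marital recode, not part of the example data;
data RecodeDemo;
   input Marital Kids;
   * (Kids le 0) is 1 when there are no children and 0 otherwise;
   Status = 2 * (Kids le 0) + Marital;
   format Status Marital.;
   datalines;
1 0
1 2
2 0
2 1
;

proc print data=RecodeDemo;
run;
```

With the Marital format defined earlier, the four records should print Status as Single, Single with Kids, Married, and Married with Kids, in that order.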

PROC CORRESP is used to perform the simple correspondence analysis. The ALL option displays all tables, including the contingency table, chi-square information, profiles, and all results of the correspondence analysis. The CHI2P option displays the chi-square p-value. The TABLES statement specifies the row and column categorical variables. The results are displayed with ODS Graphics.

The following statements produce Output 34.1.1:

ods graphics on;

* Perform Simple Correspondence Analysis;
proc corresp data=Cars all chi2p;
   tables Marital, Origin;
run;
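As an optional cross-check that is not part of the original example, the same two-way table can be requested from PROC FREQ. The sketch below assumes the Cars data set created earlier. The EXPECTED and CHISQ options are standard PROC FREQ options, and the remaining options only suppress percentages.

```sas
* Cross-check sketch: the same two-way table through PROC FREQ;
proc freq data=Cars;
   tables Marital * Origin / chisq expected norow nocol nopercent;
run;
```

The frequencies, expected counts, and Pearson chi-square statistic should agree with the corresponding tables that PROC CORRESP displays, because both procedures exclude observations with missing values by default.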

Correspondence analysis locates all the categories in a Euclidean space. The first two dimensions of this space are plotted to examine the associations among the categories.[28] Because the smallest dimension of this table is three, there is no loss of information when only two dimensions are plotted. The plot should be thought of as two different overlaid plots, one for each categorical variable. Distances between points within a variable have meaning, but distances between points from different variables do not.

Output 34.1.1: Simple Correspondence Analysis

Automobile Owners and Auto Attributes
Simple Correspondence Analysis

The CORRESP Procedure

Contingency Table
  American European Japanese Sum
Married 37 14 51 102
Married with Kids 52 15 44 111
Single 33 15 63 111
Single with Kids 6 1 8 15
Sum 128 45 166 339

Chi-Square Statistic Expected Values
  American European Japanese
Married 38.5133 13.5398 49.9469
Married with Kids 41.9115 14.7345 54.3540
Single 41.9115 14.7345 54.3540
Single with Kids 5.6637 1.9912 7.3451

Observed Minus Expected Values
  American European Japanese
Married -1.5133 0.4602 1.0531
Married with Kids 10.0885 0.2655 -10.3540
Single -8.9115 0.2655 8.6460
Single with Kids 0.3363 -0.9912 0.6549

Contributions to the Total Chi-Square Statistic
  American European Japanese Sum
Married 0.05946 0.01564 0.02220 0.09730
Married with Kids 2.42840 0.00478 1.97235 4.40553
Single 1.89482 0.00478 1.37531 3.27492
Single with Kids 0.01997 0.49337 0.05839 0.57173
Sum 4.40265 0.51858 3.42825 8.34947

Row Profiles
  American European Japanese
Married 0.362745 0.137255 0.500000
Married with Kids 0.468468 0.135135 0.396396
Single 0.297297 0.135135 0.567568
Single with Kids 0.400000 0.066667 0.533333

Column Profiles
  American European Japanese
Married 0.289063 0.311111 0.307229
Married with Kids 0.406250 0.333333 0.265060
Single 0.257813 0.333333 0.379518
Single with Kids 0.046875 0.022222 0.048193




Row Coordinates
  Dim1 Dim2
Married -0.0278 0.0134
Married with Kids 0.1991 0.0064
Single -0.1716 0.0076
Single with Kids -0.0144 -0.1947

Summary Statistics for the Row Points
  Quality Mass Inertia
Married 1.0000 0.3009 0.0117
Married with Kids 1.0000 0.3274 0.5276
Single 1.0000 0.3274 0.3922
Single with Kids 1.0000 0.0442 0.0685

Partial Contributions to Inertia for the Row Points
  Dim1 Dim2
Married 0.0102 0.0306
Married with Kids 0.5678 0.0076
Single 0.4217 0.0108
Single with Kids 0.0004 0.9511

Indices of the Coordinates That Contribute Most to Inertia for the Row Points
  Dim1 Dim2 Best
Married 0 0 2
Married with Kids 1 0 1
Single 1 0 1
Single with Kids 0 2 2

Squared Cosines for the Row Points
  Dim1 Dim2
Married 0.8121 0.1879
Married with Kids 0.9990 0.0010
Single 0.9980 0.0020
Single with Kids 0.0054 0.9946

Column Coordinates
  Dim1 Dim2
American 0.1847 -0.0166
European 0.0013 0.1073
Japanese -0.1428 -0.0163

Summary Statistics for the Column Points
  Quality Mass Inertia
American 1.0000 0.3776 0.5273
European 1.0000 0.1327 0.0621
Japanese 1.0000 0.4897 0.4106

Partial Contributions to Inertia for the Column Points
  Dim1 Dim2
American 0.5634 0.0590
European 0.0000 0.8672
Japanese 0.4366 0.0737

Indices of the Coordinates That Contribute Most to Inertia for the Column Points
  Dim1 Dim2 Best
American 1 0 1
European 0 2 2
Japanese 1 0 1

Squared Cosines for the Column Points
  Dim1 Dim2
American 0.9920 0.0080
European 0.0001 0.9999
Japanese 0.9871 0.0129



To interpret the plot, start by interpreting the row points separately from the column points. The European point is near and to the left of the centroid. Because it is near the centroid, it makes a relatively small contribution to the chi-square statistic. It contributes almost nothing to the inertia of dimension one, because its coordinate on dimension one has a small absolute value relative to the other column points, and it makes a relatively large contribution to the inertia of dimension two, because its coordinate on dimension two has a large absolute value relative to the other column points. Its squared cosines for dimensions one and two, approximately 0 and 1, respectively, indicate that its position is almost completely determined by its location on dimension two. Its quality of display is 1.0, indicating perfect quality, because the table is two-dimensional after the centering.

The American and Japanese points are far from the centroid, and they lie along dimension one. They make relatively large contributions to the chi-square statistic and the inertia of dimension one. The horizontal dimension seems to be largely determined by Japanese versus American automobile ownership.

Among the row points, the Married point is near the centroid, and the Single with Kids point has a coordinate on dimension one that is near zero. The horizontal dimension seems to be largely determined by the Single versus the Married with Kids points. Taken together, the two interpretations of dimension one show an association between being married with kids and owning an American auto, and between being single and owning a Japanese auto. The closeness of the Married with Kids point to the American point, and of the Single point to the Japanese point, should be ignored, because distances between row and column points are not defined. The plot shows that more people who are married with kids drive an American auto than you would expect if the rows and columns were independent, and more people who are single drive a Japanese auto than you would expect under independence.
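The comparison with independence can be verified by hand from the tables in Output 34.1.1. As a worked sketch, write $O_{ij}$ for an observed count, $r_i$ and $c_j$ for the row and column sums, and $n = 339$ for the grand total (this notation is introduced here for illustration only):

```latex
\[
E_{ij} = \frac{r_i \, c_j}{n},
\qquad
\chi^2_{ij} = \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
% Married with Kids, American
\[
E = \frac{111 \times 128}{339} \approx 41.9115,
\qquad
\frac{(52 - 41.9115)^2}{41.9115} \approx 2.4284
\]
% Single, Japanese
\[
E = \frac{111 \times 166}{339} \approx 54.3540,
\qquad
\frac{(63 - 54.3540)^2}{54.3540} \approx 1.3753
\]
```

These values match the corresponding cells of the expected-value and chi-square contribution tables. Together with the Married with Kids by Japanese cell (1.97235) and the Single by American cell (1.89482), these four cells account for about 92 percent of the total chi-square statistic of 8.34947.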

In the second part of this example, PROC CORRESP creates a Burt table from categorical data and performs a multiple correspondence analysis. The variables used in this example are Origin, Size, Type, Income, Home, Marital, and Sex. The MCA option specifies multiple correspondence analysis, and the OBSERVED option displays the Burt table. A TABLES statement with a single variable list and no comma creates the Burt table.

The following statements produce Output 34.1.2:

title2 'Multiple Correspondence Analysis';

* Perform Multiple Correspondence Analysis;
proc corresp mca observed data=Cars;
   tables Origin Size Type Income Home Marital Sex;
run;
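Separately from the statements above, the structure of a Burt table is easier to see on a smaller problem. The following sketch, which is not part of the original example, requests the Burt table for only two of the variables. The choice of Origin and Home is arbitrary, and the SHORT option is added only to reduce the amount of point statistics that are displayed.

```sas
* Sketch: Burt table for just two of the variables;
proc corresp data=Cars mca observed short;
   tables Origin Home;
run;
```

In the resulting table, each diagonal block is a diagonal matrix of category counts for one variable, and each off-diagonal block is the two-way contingency table for a pair of variables. The full Burt table in Output 34.1.2 has the same layout with seven variables.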

Output 34.1.2: Multiple Correspondence Analysis

Automobile Owners and Auto Attributes
Multiple Correspondence Analysis

The CORRESP Procedure

Burt Table
  American European Japanese Large Medium Small Family Sporty Work 1 Income 2 Incomes Own Rent Married Married with Kids Single Single with Kids Female Male
American 125 0 0 36 60 29 81 24 20 58 67 93 32 37 50 32 6 58 67
European 0 44 0 4 20 20 17 23 4 18 26 38 6 13 15 15 1 21 23
Japanese 0 0 165 2 61 102 76 59 30 74 91 111 54 51 44 62 8 70 95
Large 36 4 2 42 0 0 30 1 11 20 22 35 7 9 21 11 1 17 25
Medium 60 20 61 0 141 0 89 39 13 57 84 106 35 42 51 40 8 70 71
Small 29 20 102 0 0 151 55 66 30 73 78 101 50 50 37 58 6 62 89
Family 81 17 76 30 89 55 174 0 0 69 105 130 44 50 79 35 10 83 91
Sporty 24 23 59 1 39 66 0 106 0 55 51 71 35 35 12 57 2 44 62
Work 20 4 30 11 13 30 0 0 54 26 28 41 13 16 18 17 3 22 32
1 Income 58 18 74 20 57 73 69 55 26 150 0 80 70 10 27 99 14 47 103
2 Incomes 67 26 91 22 84 78 105 51 28 0 184 162 22 91 82 10 1 102 82
Own 93 38 111 35 106 101 130 71 41 80 162 242 0 76 106 52 8 114 128
Rent 32 6 54 7 35 50 44 35 13 70 22 0 92 25 3 57 7 35 57
Married 37 13 51 9 42 50 50 35 16 10 91 76 25 101 0 0 0 53 48
Married with Kids 50 15 44 21 51 37 79 12 18 27 82 106 3 0 109 0 0 48 61
Single 32 15 62 11 40 58 35 57 17 99 10 52 57 0 0 109 0 35 74
Single with Kids 6 1 8 1 8 6 10 2 3 14 1 8 7 0 0 0 15 13 2
Female 58 21 70 17 70 62 83 44 22 47 102 114 35 53 48 35 13 149 0
Male 67 23 95 25 71 89 91 62 32 103 82 128 57 48 61 74 2 0 185




Column Coordinates
  Dim1 Dim2
American -0.4035 0.8129
European -0.0568 -0.5552
Japanese 0.3208 -0.4678
Large -0.6949 1.5666
Medium -0.2562 0.0965
Small 0.4326 -0.5258
Family -0.4201 0.3602
Sporty 0.6604 -0.6696
Work 0.0575 0.1539
1 Income 0.8251 0.5472
2 Incomes -0.6727 -0.4461
Own -0.3887 -0.0943
Rent 1.0225 0.2480
Married -0.4169 -0.7954
Married with Kids -0.8200 0.3237
Single 1.1461 0.2930
Single with Kids 0.4373 0.8736
Female -0.3365 -0.2057
Male 0.2710 0.1656

Summary Statistics for the Column Points
  Quality Mass Inertia
American 0.4925 0.0535 0.0521
European 0.0473 0.0188 0.0724
Japanese 0.3141 0.0706 0.0422
Large 0.4224 0.0180 0.0729
Medium 0.0548 0.0603 0.0482
Small 0.3825 0.0646 0.0457
Family 0.3330 0.0744 0.0399
Sporty 0.4112 0.0453 0.0569
Work 0.0052 0.0231 0.0699
1 Income 0.7991 0.0642 0.0459
2 Incomes 0.7991 0.0787 0.0374
Own 0.4208 0.1035 0.0230
Rent 0.4208 0.0393 0.0604
Married 0.3496 0.0432 0.0581
Married with Kids 0.3765 0.0466 0.0561
Single 0.6780 0.0466 0.0561
Single with Kids 0.0449 0.0064 0.0796
Female 0.1253 0.0637 0.0462
Male 0.1253 0.0791 0.0372

Partial Contributions to Inertia for the Column Points
  Dim1 Dim2
American 0.0268 0.1511
European 0.0002 0.0248
Japanese 0.0224 0.0660
Large 0.0268 0.1886
Medium 0.0122 0.0024
Small 0.0373 0.0764
Family 0.0405 0.0413
Sporty 0.0610 0.0870
Work 0.0002 0.0023
1 Income 0.1348 0.0822
2 Incomes 0.1099 0.0670
Own 0.0482 0.0039
Rent 0.1269 0.0103
Married 0.0232 0.1169
Married with Kids 0.0967 0.0209
Single 0.1889 0.0171
Single with Kids 0.0038 0.0209
Female 0.0223 0.0115
Male 0.0179 0.0093

Indices of the Coordinates That Contribute Most to Inertia for the Column Points
  Dim1 Dim2 Best
American 0 2 2
European 0 0 2
Japanese 0 2 2
Large 0 2 2
Medium 0 0 1
Small 0 2 2
Family 2 0 2
Sporty 2 2 2
Work 0 0 2
1 Income 1 1 1
2 Incomes 1 1 1
Own 1 0 1
Rent 1 0 1
Married 0 2 2
Married with Kids 1 0 1
Single 1 0 1
Single with Kids 0 0 2
Female 0 0 1
Male 0 0 1

Squared Cosines for the Column Points
  Dim1 Dim2
American 0.0974 0.3952
European 0.0005 0.0468
Japanese 0.1005 0.2136
Large 0.0695 0.3530
Medium 0.0480 0.0068
Small 0.1544 0.2281
Family 0.1919 0.1411
Sporty 0.2027 0.2085
Work 0.0006 0.0046
1 Income 0.5550 0.2441
2 Incomes 0.5550 0.2441
Own 0.3975 0.0234
Rent 0.3975 0.0234
Married 0.0753 0.2742
Married with Kids 0.3258 0.0508
Single 0.6364 0.0416
Single with Kids 0.0090 0.0359
Female 0.0912 0.0341
Male 0.0912 0.0341



Multiple correspondence analysis locates all the categories in a Euclidean space. The first two dimensions of this space are plotted to examine the associations among the categories. The top-right quadrant of the plot shows that the categories Single, Single with Kids, 1 Income, and Rent are associated. Proceeding clockwise, the categories Sporty, Small, and Japanese are associated. The bottom-left quadrant shows the association between being married, owning your own home, and having two incomes. Having children is associated with owning a large American family auto. Such information could be used in market research to identify target audiences for advertisements.

This interpretation is based on points found in approximately the same direction from the origin and in approximately the same region of the space. Distances between points do not have a straightforward interpretation in multiple correspondence analysis. The geometry of multiple correspondence analysis is not a simple generalization of the geometry of simple correspondence analysis (Greenacre and Hastie 1987; Greenacre 1988).

If you want to perform a multiple correspondence analysis and get scores for the individuals, you can specify the BINARY option to analyze the binary table, as in the following statements. In the interest of space, only the first 10 rows of coordinates are printed in Output 34.1.3.

title2 'Binary Table';

* Perform Multiple Correspondence Analysis;
proc corresp data=Cars binary;
   ods select RowCoors;
   tables Origin Size Type Income Home Marital Sex;
run;

Output 34.1.3: Correspondence Analysis of a Binary Table

Automobile Owners and Auto Attributes
Binary Table
 
The CORRESP Procedure
 
Row Coordinates

  Dim1 Dim2
1 -0.4093 1.0878
2 0.8198 -0.2221
3 -0.2193 -0.5328
4 0.4382 1.1799
5 -0.6750 0.3600
6 -0.1778 0.1441
7 -0.9375 0.6846
8 -0.7405 -0.1539
9 -0.3027 -0.2749
10 -0.7263 -0.0803
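If the individual scores are needed for later steps rather than only for display, one option is to write them to an output data set with the OUTC= option. The following is a minimal sketch under that assumption. The data set name Coor is hypothetical, and the WHERE= filter relies on the _TYPE_ variable that the OUTC= data set uses to distinguish row (observation) coordinates from column coordinates.

```sas
* Sketch: save the binary-table coordinates instead of displaying them;
proc corresp data=Cars binary noprint outc=Coor;
   tables Origin Size Type Income Home Marital Sex;
run;

* Print the first 10 row (individual) coordinates;
proc print data=Coor(where=(_type_ = 'OBS') obs=10);
   var _name_ Dim1 Dim2;
run;
```

The Coor data set also contains the column coordinates, which are stored with a different _TYPE_ value, so the same data set can feed a custom plot of categories and individuals.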





[28] In this analysis, the chi-square statistic is not significantly different from 0. Hence, you would not reject the null hypothesis that the rows and columns are independent. If your goal is hypothesis testing, you might stop at this point and not proceed to interpret the graphical and tabular results. If your goal is exploratory data analysis, you might proceed and interpret the results. This example will proceed, because it is intended to be a small, simple teaching example.
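The significance statement in the footnote can be reproduced from the total in the chi-square contributions table. The sketch below assumes the usual degrees of freedom for a two-way table, (rows - 1) times (columns - 1), and uses the PROBCHI function. The variable names are illustrative.

```sas
* Sketch: p-value for the chi-square statistic reported in Output 34.1.1;
data _null_;
   ChiSq = 8.34947;
   * Four row categories and three column categories give 6 degrees of freedom;
   DF = (4 - 1) * (3 - 1);
   P = 1 - probchi(ChiSq, DF);
   put ChiSq= DF= P=;
run;
```

The resulting p-value is roughly 0.21, which is consistent with not rejecting the hypothesis of independence at conventional significance levels.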