Example 31.2 Simple Correspondence Analysis of U.S. Population

In this example, PROC CORRESP reads an existing contingency table with supplementary observations and performs a simple correspondence analysis. The data are populations of the 50 U.S. states, grouped into regions, for each of the census years from 1920 to 1970 (U.S. Bureau of the Census; 1979). Alaska and Hawaii are treated as supplementary regions, because they were not states during this entire period and are not physically connected to the other 48 states. Consequently, it is reasonable to expect that population changes in these two states operate differently from population changes in the other states. The correspondence analysis is performed giving the supplementary points negative weight, and then the coordinates for the supplementary points are computed in the solution defined by the other points.

The initial DATA step reads the table, provides labels for the years, flags the supplementary rows with negative weights, and specifies absolute weights of 1000 for all observations, because the data were originally reported in units of 1000 people.

In the PROC CORRESP statement, PRINT=PERCENT together with the display options produces the table of cell percentages (OBSERVED), the cell contributions to the total chi-square scaled to sum to 100 (CELLCHI2), the row profiles scaled so that each row sums to 100 (RP), and the column profiles scaled so that each column sums to 100 (CP). The SHORT option specifies that the correspondence analysis summary statistics, contributions to inertia, and squared cosines should not be displayed. The option OUTC=COOR creates the output coordinate data set. Because the data are already in table form, a VAR statement is used to read the table. Row labels are specified with the ID statement, and column labels come from the variable labels. The WEIGHT statement flags the supplementary observations and restores the table values to populations.

The following statements produce Output 31.2.1:

title 'United States Population, 1920-1970';

data USPop;

   * Regions:
   * New England     - ME, NH, VT, MA, RI, CT.
   * Great Lakes     - OH, IN, IL, MI, WI.
   * South Atlantic  - DE, MD, DC, VA, WV, NC, SC, GA, FL.
   * Mountain        - MT, ID, WY, CO, NM, AZ, UT, NV.
   * Pacific         - WA, OR, CA.
   *
   * Note: Multiply data values by 1000 to get populations.;

   input Region $14. y1920 y1930 y1940 y1950 y1960 y1970;

   label y1920 = '1920'    y1930 = '1930'    y1940 = '1940'
         y1950 = '1950'    y1960 = '1960'    y1970 = '1970';

   if region = 'Hawaii' or region = 'Alaska'
      then w = -1000;       /* Flag Supplementary Observations */
      else w =  1000;

   datalines;
New England        7401  8166  8437  9314 10509 11842
NY, NJ, PA        22261 26261 27539 30146 34168 37199
Great Lakes       21476 25297 26626 30399 36225 40252
Midwest           12544 13297 13517 14061 15394 16319
South Atlantic    13990 15794 17823 21182 25972 30671
KY, TN, AL, MS     8893  9887 10778 11447 12050 12803
AR, LA, OK, TX    10242 12177 13065 14538 16951 19321
Mountain           3336  3702  4150  5075  6855  8282
Pacific            5567  8195  9733 14486 20339 25454
Alaska               55    59    73   129   226   300
Hawaii              256   368   423   500   633   769
;
ods graphics on;

* Perform Simple Correspondence Analysis;
proc corresp data=uspop print=percent observed cellchi2 rp cp
     short outc=Coor plots(flip);
   var y1920 -- y1970;
   id Region;
   weight w;
run;
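The coordinates saved by OUTC=Coor can be inspected after the procedure runs. The following is a minimal sketch, assuming the PROC CORRESP step above has completed and created the Coor data set. No VAR statement is used, so every variable that the procedure wrote is listed, and the step does not depend on any particular variable names.

```sas
* Sketch: list everything that OUTC=Coor saved.
  Assumes the PROC CORRESP step above has already run.;
proc print data=Coor;
run;
```

The listing is expected to contain the Dim1 and Dim2 coordinates for the active rows, the supplementary rows, and the columns, which can be compared with the coordinate tables in the displayed output.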

The contingency table shows that the population of all regions increased over this time period. The row profiles show that population increased at different rates in different regions. There was a small increase in population in the Midwest, for example, but the population more than quadrupled in the Pacific region over the same period. The column profiles show that in 1920, the U.S. population was concentrated in four regions: NY, NJ, PA; Great Lakes; Midwest; and South Atlantic. With time, the population shifted more to the South Atlantic, Mountain, and Pacific regions. This is also clear from the correspondence analysis. The inertia and chi-square decomposition table shows that there are five nontrivial dimensions in the table, but the association between the rows and columns is almost entirely one-dimensional.
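The row profiles described above can also be recomputed directly from the input data, as a check on the RP output. The following is a minimal sketch, assuming the USPop data set created earlier. The names RowProf, total, and j are introduced here for illustration only. Each row is divided by its own total and multiplied by 100, so the weights are not needed, and the Alaska and Hawaii rows yield the supplementary row profiles.

```sas
* Sketch: recompute the row profiles from the USPop data set.
  RowProf, total, and j are illustrative names, not part of the original example.;
data RowProf;
   set USPop;
   array y{6} y1920 y1930 y1940 y1950 y1960 y1970;
   total = sum(of y{*});            /* row total, in thousands       */
   do j = 1 to 6;
      y{j} = 100 * y{j} / total;    /* cell as percent of row total  */
   end;
   drop j total w;
run;

proc print data=RowProf noobs;
   format y1920--y1970 8.4;
run;
```

For the Pacific region, for example, the row total is 83774 (in thousands), so the 1920 entry is 100 × 5567 / 83774, or about 6.6453, and the ratio 25454 / 5567 of about 4.6 is the more-than-fourfold growth noted above.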

Output 31.2.1 United States Population, 1920–1970
United States Population, 1920-1970

The CORRESP Procedure

Contingency Table
Percents             1920     1930     1940     1950     1960     1970      Sum
New England         0.830    0.916    0.946    1.045    1.179    1.328    6.245
NY, NJ, PA          2.497    2.946    3.089    3.382    3.833    4.173   19.921
Great Lakes         2.409    2.838    2.987    3.410    4.064    4.516   20.224
Midwest             1.407    1.492    1.516    1.577    1.727    1.831    9.550
South Atlantic      1.569    1.772    1.999    2.376    2.914    3.441   14.071
KY, TN, AL, MS      0.998    1.109    1.209    1.284    1.352    1.436    7.388
AR, LA, OK, TX      1.149    1.366    1.466    1.631    1.902    2.167    9.681
Mountain            0.374    0.415    0.466    0.569    0.769    0.929    3.523
Pacific             0.625    0.919    1.092    1.625    2.282    2.855    9.398
Sum                11.859   13.773   14.771   16.900   20.020   22.677  100.000

Supplementary Rows
Percents              1920      1930      1940      1950      1960      1970
Alaska            0.006170  0.006619  0.008189  0.014471  0.025353  0.033655
Hawaii            0.028719  0.041283  0.047453  0.056091  0.071011  0.086268

Contributions to the Total Chi-Square Statistic
Percents             1920     1930     1940     1950     1960     1970      Sum
New England         0.937    0.314    0.054    0.009    0.352    0.469    2.135
NY, NJ, PA          0.665    1.287    0.633    0.006    0.521    2.265    5.378
Great Lakes         0.004    0.085    0.000    0.001    0.005    0.094    0.189
Midwest             5.749    2.039    0.684    0.072    1.546    4.472   14.563
South Atlantic      0.509    1.231    0.259    0.000    0.285    1.688    3.973
KY, TN, AL, MS      1.454    0.711    1.098    0.087    0.946    2.945    7.242
AR, LA, OK, TX      0.000    0.069    0.077    0.001    0.059    0.030    0.238
Mountain            0.391    0.868    0.497    0.098    0.498    1.834    4.187
Pacific            18.591    9.380    5.458    0.074    7.346   21.248   62.096
Sum                28.302   15.986    8.761    0.349   11.558   35.046  100.000

Row Profiles
Percents             1920     1930     1940     1950     1960     1970
New England       13.2947  14.6688  15.1557  16.7310  18.8777  21.2722
NY, NJ, PA        12.5362  14.7888  15.5085  16.9766  19.2416  20.9484
Great Lakes       11.9129  14.0325  14.7697  16.8626  20.0943  22.3281
Midwest           14.7348  15.6193  15.8777  16.5167  18.0825  19.1691
South Atlantic    11.1535  12.5917  14.2093  16.8872  20.7060  24.4523
KY, TN, AL, MS    13.5033  15.0126  16.3655  17.3813  18.2969  19.4403
AR, LA, OK, TX    11.8687  14.1111  15.1401  16.8471  19.6433  22.3897
Mountain          10.6242  11.7898  13.2166  16.1624  21.8312  26.3758
Pacific            6.6453   9.7823  11.6182  17.2918  24.2784  30.3841

Supplementary Row Profiles
Percents             1920     1930     1940     1950     1960     1970
Alaska             6.5321   7.0071   8.6698  15.3207  26.8409  35.6295
Hawaii             8.6809  12.4788  14.3438  16.9549  21.4649  26.0766

Column Profiles
Percents             1920     1930     1940     1950     1960     1970
New England        7.0012   6.6511   6.4078   6.1826   5.8886   5.8582
NY, NJ, PA        21.0586  21.3894  20.9155  20.0109  19.1457  18.4023
Great Lakes       20.3160  20.6042  20.2221  20.1788  20.2983  19.9126
Midwest           11.8664  10.8303  10.2660   9.3337   8.6259   8.0730
South Atlantic    13.2343  12.8641  13.5363  14.0606  14.5532  15.1729
KY, TN, AL, MS     8.4126   8.0529   8.1857   7.5985   6.7521   6.3336
AR, LA, OK, TX     9.6888   9.9181   9.9227   9.6503   9.4983   9.5581
Mountain           3.1558   3.0152   3.1519   3.3688   3.8411   4.0971
Pacific            5.2663   6.6748   7.3921   9.6158  11.3968  12.5921

Inertia and Chi-Square Decomposition
Singular  Principal      Chi-             Cumulative
   Value    Inertia    Square   Percent      Percent     20   40   60   80  100
                                                      ----+----+----+----+----+---
 0.10664    0.01137   1.014E7     98.16        98.16  *************************
 0.01238    0.00015    136586      1.32        99.48
 0.00658    0.00004     38540      0.37        99.85
 0.00333    0.00001    9896.6      0.10        99.95
 0.00244    0.00001    5309.9      0.05       100.00
   Total    0.01159   1.033E7    100.00
Degrees of Freedom = 40
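The columns of the decomposition table are tied together by a few simple relationships, which the following worked arithmetic illustrates for the first dimension. This is a rough check rather than output from the procedure. Here \lambda_k denotes the k-th singular value, and n is the grand total of the nine active rows after the weights of 1000 are applied (891,408 thousand, summed from the input data). Both symbols are notation introduced for this sketch, and small discrepancies come from rounding in the displayed values.

```latex
\begin{align*}
\text{principal inertia}_1 &= \lambda_1^2 = 0.10664^2 \approx 0.01137 \\
\chi^2_1 &= n\,\lambda_1^2 \approx 891{,}408{,}000 \times 0.01137 \approx 1.01 \times 10^{7} \\
\text{percent}_1 &= 100 \times \frac{\chi^2_1}{\chi^2_{\text{total}}}
   \approx 100 \times \frac{1.014 \times 10^{7}}{1.033 \times 10^{7}} \approx 98.2 \\
\text{degrees of freedom} &= (9 - 1)(6 - 1) = 40
\end{align*}
```

The second singular value gives 0.01238^2, or about 0.00015, which is why the remaining four dimensions together account for less than 2% of the inertia.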

Row Coordinates
                     Dim1     Dim2
New England        0.0611   0.0132
NY, NJ, PA         0.0546  -0.0117
Great Lakes        0.0074  -0.0028
Midwest            0.1315   0.0186
South Atlantic    -0.0553   0.0105
KY, TN, AL, MS     0.1044  -0.0144
AR, LA, OK, TX     0.0131  -0.0067
Mountain          -0.1121   0.0338
Pacific           -0.2766  -0.0070

Supplementary Row Coordinates
                     Dim1     Dim2
Alaska            -0.4152   0.0912
Hawaii            -0.1198  -0.0321

Column Coordinates
                     Dim1     Dim2
1920               0.1642   0.0263
1930               0.1149  -0.0089
1940               0.0816  -0.0108
1950              -0.0046  -0.0125
1960              -0.0815  -0.0007
1970              -0.1335   0.0086

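[Plot: correspondence analysis of the regions and census years, with the first dimension on the vertical axis]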

ODS Graphics is used to plot the results. The data are essentially one-dimensional. For data such as these, it is better to plot the first dimension vertically rather than horizontally, which is the default, because the vertical orientation leaves fewer opportunities for label collisions. Specifying PLOTS(FLIP) in the PROC CORRESP statement switches the vertical and horizontal axes to improve the graphical display.

The plot shows that the first dimension correctly orders the years. There is nothing in the correspondence analysis that forces this to happen; the analysis has no information about the inherent ordering of the column categories. The ordering of the regions and the ordering of the years reflect the shift over time of the U.S. population from the Northeast quadrant of the country to the South and to the West. The results show that the West and Southeast grew faster than the rest of the contiguous 48 states during this period.

The plot also shows that the growth pattern for Hawaii was similar to the growth pattern for the Mountain region and that Alaska’s growth was even more extreme than the Pacific region’s growth. The row profiles confirm this interpretation.

The Pacific region is farther from the origin than all other active points. The Midwest is the extreme region in the other direction. The table of contributions to the total chi-square shows that 62% of the total chi-square statistic is contributed by the Pacific region, followed by the Midwest at over 14%. Similarly, the two extreme years, 1920 and 1970, together contribute over 63% of the total chi-square, whereas the years nearer the origin of the plot contribute less.
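These percentages can be traced back to the input data. The following is a rough check, assuming that the CELLCHI2 table reports the usual Pearson cell contributions as a percentage of the total chi-square. The symbols n_{ij} (cell count), n_{i+} and n_{+j} (row and column totals), n (grand total of the active rows), e_{ij}, and c_{ij} are notation introduced for this sketch. All counts are in thousands, so the total chi-square of 1.033E7 becomes about 10330, and the factor of 1000 cancels in the percentage. The Pacific, 1970 cell is used as the example.

```latex
\begin{align*}
e_{ij} &= \frac{n_{i+}\, n_{+j}}{n}, \qquad
c_{ij} = 100 \times \frac{(n_{ij} - e_{ij})^2 / e_{ij}}{\chi^2_{\text{total}}} \\
e_{\text{Pacific},\,1970} &= \frac{83774 \times 202143}{891408} \approx 18997.3 \\
\frac{(25454 - 18997.3)^2}{18997.3} &\approx 2194 \\
c_{\text{Pacific},\,1970} &\approx 100 \times \frac{2194}{10330} \approx 21.2 \\
28.302 + 35.046 &= 63.348 \quad \text{(column sums for 1920 and 1970)}
\end{align*}
```

The result of about 21.2 agrees with the displayed 21.248 up to rounding in the total chi-square, and the last line is the "over 63%" quoted for the two extreme years.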