The following data provide crime rates per 100,000 people in seven categories for each of the 50 states in 1977. Since there are seven numeric variables, it is impossible to plot all the variables simultaneously. Principal components can be used to summarize the data in two or three dimensions, and they help to visualize the data. The following statements produce Figure 72.1 through Figure 72.5.
title 'Crime Rates per 100,000 Population by State';

data Crime;
   input State $1-15 Murder Rape Robbery Assault
         Burglary Larceny Auto_Theft;
   datalines;
Alabama        14.2 25.2  96.8 278.3 1135.5 1881.9  280.7
Alaska         10.8 51.6  96.8 284.0 1331.7 3369.8  753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4  439.5
Arkansas        8.8 27.6  83.2 203.4  972.6 1862.1  183.4
California     11.5 49.4 287.0 358.0 2139.4 3499.8  663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2  477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7  593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4  467.0
Florida        10.2 39.6 187.9 449.1 1859.9 3840.5  351.4
Georgia        11.7 31.1 140.5 256.5 1351.1 2170.2  297.9
Hawaii          7.2 25.5 128.0  64.1 1911.5 3920.4  489.4
Idaho           5.5 19.4  39.6 172.5 1050.8 2599.6  237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5  528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7  377.4
Iowa            2.3 10.6  41.2  89.8  812.5 2685.1  219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3  244.3
Kentucky       10.1 19.1  81.1 123.3  872.2 1662.1  245.4
Louisiana      15.5 30.9 142.9 335.5 1165.5 2469.9  337.7
Maine           2.4 13.5  38.7 170.0 1253.1 2350.7  246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7  428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0  545.5
Minnesota       2.7 19.5  85.9  85.8 1134.7 2559.3  343.1
Mississippi    14.3 19.6  65.7 189.1  915.6 1239.9  144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2  378.4
Montana         5.4 16.7  39.2 156.8  804.9 2773.2  309.2
Nebraska        3.9 18.1  64.7 112.7  760.0 2316.1  249.1
Nevada         15.8 49.1 323.1 355.0 2453.1 4212.6  559.2
New Hampshire   3.2 10.7  23.2  76.0 1041.7 2343.9  293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5  511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6  259.5
New York       10.7 29.4 472.6 319.1 1728.0 2782.0  745.8
North Carolina 10.6 17.0  61.3 318.3 1154.1 2037.8  192.1
North Dakota    0.9  9.0  13.3  43.8  446.1 1843.0  144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8  400.4
Oklahoma        8.6 29.2  73.8 205.0 1288.2 2228.1  326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1  388.9
Pennsylvania    5.6 19.0 130.3 128.0  877.5 1624.1  333.2
Rhode Island    3.6 10.5  86.5 201.0 1489.5 2844.1  791.4
South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4  245.1
South Dakota    2.0 13.5  17.9 155.7  570.5 1704.4  147.5
Tennessee      10.1 29.7 145.8 203.9 1259.7 1776.5  314.0
Texas          13.3 33.8 152.4 208.2 1603.1 2988.7  397.6
Utah            3.5 20.3  68.8 147.3 1171.6 3004.6  334.5
Vermont         1.4 15.9  30.8 101.2 1348.2 2201.0  265.2
Virginia        9.0 23.3  92.1 165.7  986.2 2521.2  226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9  360.3
West Virginia   6.0 13.2  42.2  90.9  597.4 1341.7  163.3
Wisconsin       2.8 12.9  52.2  63.7  846.9 2614.2  220.7
Wyoming         5.4 21.9  39.7 173.9  811.6 2772.2  282.0
;
ods graphics on;

proc princomp out=Crime_Components plots=score(ellipse ncomp=3);
   id State;
run;
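The OUT= data set holds the original variables together with the component scores. As a minimal sketch, assuming the PROC PRINCOMP step above has already run and created Crime_Components, you could list the first few scores as follows (the choice of five observations and three components is only for illustration):

```sas
/* Assumes Crime_Components was created by the PROC PRINCOMP step above. */
/* Prin1-Prin3 are the score variables that PROC PRINCOMP adds to OUT=.  */
proc print data=Crime_Components(obs=5) noobs;
   var State Prin1-Prin3;
run;
```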
Figure 72.1 displays the PROC PRINCOMP output, beginning with simple statistics followed by the correlation matrix. By default, the PROC PRINCOMP statement computes principal components from the correlation matrix, so the total variance is equal to the number of variables, 7.
Crime Rates per 100,000 Population by State

| Observations | 50 |
|---|---|
| Variables | 7 |


Simple Statistics

| | Murder | Rape | Robbery | Assault | Burglary | Larceny | Auto_Theft |
|---|---|---|---|---|---|---|---|
| Mean | 7.444000000 | 25.73400000 | 124.0920000 | 211.3000000 | 1291.904000 | 2671.288000 | 377.5260000 |
| StD | 3.866768941 | 10.75962995 | 88.3485672 | 100.2530492 | 432.455711 | 725.908707 | 193.3944175 |


Correlation Matrix

| | Murder | Rape | Robbery | Assault | Burglary | Larceny | Auto_Theft |
|---|---|---|---|---|---|---|---|
| Murder | 1.0000 | 0.6012 | 0.4837 | 0.6486 | 0.3858 | 0.1019 | 0.0688 |
| Rape | 0.6012 | 1.0000 | 0.5919 | 0.7403 | 0.7121 | 0.6140 | 0.3489 |
| Robbery | 0.4837 | 0.5919 | 1.0000 | 0.5571 | 0.6372 | 0.4467 | 0.5907 |
| Assault | 0.6486 | 0.7403 | 0.5571 | 1.0000 | 0.6229 | 0.4044 | 0.2758 |
| Burglary | 0.3858 | 0.7121 | 0.6372 | 0.6229 | 1.0000 | 0.7921 | 0.5580 |
| Larceny | 0.1019 | 0.6140 | 0.4467 | 0.4044 | 0.7921 | 1.0000 | 0.4442 |
| Auto_Theft | 0.0688 | 0.3489 | 0.5907 | 0.2758 | 0.5580 | 0.4442 | 1.0000 |

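
The total variance of 7 can be read directly from the correlation matrix: each standardized variable has variance 1, so the diagonal entries sum to the number of variables. In the notation below, which is introduced here only for illustration, R is the correlation matrix, r_jj are its diagonal entries, and λ_k are its eigenvalues:

```latex
\operatorname{tr}(R) \;=\; \sum_{j=1}^{7} r_{jj} \;=\; 7 \;=\; \sum_{k=1}^{7} \lambda_k
```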
Figure 72.2 displays the eigenvalues. The first principal component explains about 58.8% of the total variance, the second principal component explains about 17.7%, and the third principal component explains about 10.4%. Note that the eigenvalues sum to the total variance.
The eigenvalues indicate that two or three components provide a good summary of the data, two components accounting for 76% of the total variance and three components explaining 87%. Subsequent components contribute less than 5% each.
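
A scree plot is one common way to judge how many components to keep. The following is a minimal sketch, assuming the Crime data set from the DATA step above; it requests the scree and variance-explained plots in place of the score plots:

```sas
ods graphics on;

/* PLOTS=SCREE plots the eigenvalues and the proportion of variance explained */
proc princomp data=Crime plots=scree;
run;
```

A visible flattening of the eigenvalue curve after the second or third component would be consistent with the summary above.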

Eigenvalues of the Correlation Matrix

| | Eigenvalue | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 4.11495951 | 2.87623768 | 0.5879 | 0.5879 |
| 2 | 1.23872183 | 0.51290521 | 0.1770 | 0.7648 |
| 3 | 0.72581663 | 0.40938458 | 0.1037 | 0.8685 |
| 4 | 0.31643205 | 0.05845759 | 0.0452 | 0.9137 |
| 5 | 0.25797446 | 0.03593499 | 0.0369 | 0.9506 |
| 6 | 0.22203947 | 0.09798342 | 0.0317 | 0.9823 |
| 7 | 0.12405606 | | 0.0177 | 1.0000 |

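
The Proportion and Cumulative columns follow from dividing each eigenvalue by the total variance of 7. As a worked check on the first three rows (the subscripted names are shorthand used only here):

```latex
\begin{aligned}
\text{Proportion}_1 &= \frac{4.11495951}{7} \approx 0.5879 \\
\text{Proportion}_2 &= \frac{1.23872183}{7} \approx 0.1770 \\
\text{Cumulative}_2 &= \frac{4.11495951 + 1.23872183}{7} \approx 0.7648 \\
\text{Cumulative}_3 &= \frac{4.11495951 + 1.23872183 + 0.72581663}{7} \approx 0.8685
\end{aligned}
```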
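Figure 72.3 displays the eigenvectors. From the eigenvectors matrix, you can represent the first principal component Prin1 as a linear combination of the original variables:

$$
\begin{aligned}
\mathrm{Prin1} = {} & 0.300279\,(\mathrm{Murder}) + 0.431759\,(\mathrm{Rape}) + 0.396875\,(\mathrm{Robbery}) + 0.396652\,(\mathrm{Assault}) \\
& + 0.440157\,(\mathrm{Burglary}) + 0.357360\,(\mathrm{Larceny}) + 0.295177\,(\mathrm{Auto\_Theft})
\end{aligned}
$$

Similarly, the second principal component Prin2 is

$$
\begin{aligned}
\mathrm{Prin2} = {} & -0.629174\,(\mathrm{Murder}) - 0.169435\,(\mathrm{Rape}) + 0.042247\,(\mathrm{Robbery}) - 0.343528\,(\mathrm{Assault}) \\
& + 0.203341\,(\mathrm{Burglary}) + 0.402319\,(\mathrm{Larceny}) + 0.502421\,(\mathrm{Auto\_Theft})
\end{aligned}
$$

where the variables are standardized.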

Eigenvectors

| | Prin1 | Prin2 | Prin3 | Prin4 | Prin5 | Prin6 | Prin7 |
|---|---|---|---|---|---|---|---|
| Murder | 0.300279 | -.629174 | 0.178245 | -.232114 | 0.538123 | 0.259117 | 0.267593 |
| Rape | 0.431759 | -.169435 | -.244198 | 0.062216 | 0.188471 | -.773271 | -.296485 |
| Robbery | 0.396875 | 0.042247 | 0.495861 | -.557989 | -.519977 | -.114385 | -.003903 |
| Assault | 0.396652 | -.343528 | -.069510 | 0.629804 | -.506651 | 0.172363 | 0.191745 |
| Burglary | 0.440157 | 0.203341 | -.209895 | -.057555 | 0.101033 | 0.535987 | -.648117 |
| Larceny | 0.357360 | 0.402319 | -.539231 | -.234890 | 0.030099 | 0.039406 | 0.601690 |
| Auto_Theft | 0.295177 | 0.502421 | 0.568384 | 0.419238 | 0.369753 | -.057298 | 0.147046 |

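
To see how the Prin1 column of the eigenvectors table turns into a score, you can apply it by hand to standardized data. The following is a sketch only: the data set names Crime_Std and Prin1_Check and the variable Prin1_By_Hand are hypothetical, and the coefficients are the rounded values printed in the table, so the result should agree with Prin1 in the OUT= data set only up to rounding.

```sas
/* Standardize each variable to mean 0 and standard deviation 1 */
proc standard data=Crime mean=0 std=1 out=Crime_Std;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;

/* Apply the Prin1 eigenvector (rounded values from the table) */
data Prin1_Check;
   set Crime_Std;
   Prin1_By_Hand = 0.300279*Murder   + 0.431759*Rape
                 + 0.396875*Robbery  + 0.396652*Assault
                 + 0.440157*Burglary + 0.357360*Larceny
                 + 0.295177*Auto_Theft;
   keep State Prin1_By_Hand;
run;

proc print data=Prin1_Check(obs=5) noobs;
run;
```

Comparing these values with Prin1 in Crime_Components is a quick way to confirm that the scores are computed from standardized, not raw, crime rates.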
The first component is a measure of the overall crime rate since the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on variables Auto_Theft and Larceny and high negative loadings on variables Murder and Assault. There is also a small positive loading on Burglary and a small negative loading on Rape. This component seems to measure the preponderance of property crime over violent crime. The interpretation of the third component is not obvious.
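
One way to quantify "approximately equal loadings" is to compare the Prin1 column with the value that every entry would take if a unit-length vector of seven entries had exactly equal loadings. The check below uses the printed Prin1 loadings, whose squares sum to 1 up to rounding:

```latex
\begin{aligned}
\frac{1}{\sqrt{7}} &\approx 0.378, \qquad
0.295 \;\le\; \text{Prin1 loadings} \;\le\; 0.440 \\[4pt]
\sum_{j=1}^{7} (\text{Prin1 loading}_j)^2
&\approx 0.0902 + 0.1864 + 0.1575 + 0.1573 + 0.1937 + 0.1277 + 0.0871 \approx 1.000
\end{aligned}
```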
The ODS GRAPHICS statement enables the PRINCOMP procedure to produce statistical graphs by using ODS Graphics. See Chapter 21, Statistical Graphics Using ODS, for more information. The PLOTS=SCORE(ELLIPSE NCOMP=3) option in the PROC PRINCOMP statement requests pairwise component score plots for the first three components, with a 95% prediction ellipse overlaid on each scatter plot.

Figure 72.4 shows the plot of the first two components. It is possible to identify regional trends on this plot. Nevada and California are at the extreme right, with high overall crime rates but an average ratio of property crime to violent crime. North and South Dakota are at the extreme left, with low overall crime rates. Southeastern states tend to be at the bottom of the plot, with a higher-than-average ratio of violent crime to property crime. New England states tend to be in the upper part of the plot, with a higher-than-average ratio of property crime to violent crime. Assuming the first two components are from a bivariate normal distribution, the ellipse identifies Nevada as a possible outlier.
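
If you want more control over a score plot than the PLOTS= option offers, a similar display can be drawn from the OUT= data set. The following is a rough sketch, assuming Crime_Components exists; it uses PROC SGPLOT rather than the procedure's built-in graph, so the appearance will differ from Figure 72.4.

```sas
/* Assumes Crime_Components was created by the PROC PRINCOMP step above. */
proc sgplot data=Crime_Components;
   /* Label each point with its state name */
   scatter x=Prin1 y=Prin2 / datalabel=State;
   /* 95% prediction ellipse (ALPHA=0.05) */
   ellipse x=Prin1 y=Prin2 / type=predicted alpha=0.05;
run;
```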
Figure 72.5 shows the plot of the first and third components. Assuming the first and the third components are from a bivariate normal distribution, the ellipse identifies Nevada, Massachusetts, and New York as possible outliers.
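
To look at the states behind the extreme points on the third component, you could sort the scores and print the largest values. This is a minimal sketch, assuming Crime_Components exists; By_Prin3 is a hypothetical data set name, and the cutoff of five observations is arbitrary.

```sas
/* Order the states by their Prin3 score, largest first */
proc sort data=Crime_Components out=By_Prin3;
   by descending Prin3;
run;

proc print data=By_Prin3(obs=5) noobs;
   var State Prin1 Prin3;
run;
```

Sorting in ascending order instead would list the states at the opposite end of the third component.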