The HPPRINCOMP Procedure

Getting Started: HPPRINCOMP Procedure

The following data provide crime rates per 100,000 people in seven categories for each of the 50 US states in 1977:

title 'Crime Rates per 100,000 Population by State';

data Crime;
   input State $1-15 Murder Rape Robbery Assault
         Burglary Larceny Auto_Theft;
   datalines;
Alabama        14.2 25.2  96.8 278.3 1135.5 1881.9 280.7
Alaska         10.8 51.6  96.8 284.0 1331.7 3369.8 753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
Arkansas        8.8 27.6  83.2 203.4  972.6 1862.1 183.4
California     11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
Florida        10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
Georgia        11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
Hawaii          7.2 25.5 128.0  64.1 1911.5 3920.4 489.4
Idaho           5.5 19.4  39.6 172.5 1050.8 2599.6 237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7 377.4
Iowa            2.3 10.6  41.2  89.8  812.5 2685.1 219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3 244.3
Kentucky       10.1 19.1  81.1 123.3  872.2 1662.1 245.4
Louisiana      15.5 30.9 142.9 335.5 1165.5 2469.9 337.7
Maine           2.4 13.5  38.7 170.0 1253.1 2350.7 246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7 428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0 545.5
Minnesota       2.7 19.5  85.9  85.8 1134.7 2559.3 343.1
Mississippi    14.3 19.6  65.7 189.1  915.6 1239.9 144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2 378.4
Montana         5.4 16.7  39.2 156.8  804.9 2773.2 309.2
Nebraska        3.9 18.1  64.7 112.7  760.0 2316.1 249.1
Nevada         15.8 49.1 323.1 355.0 2453.1 4212.6 559.2
New Hampshire   3.2 10.7  23.2  76.0 1041.7 2343.9 293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5 511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6 259.5
New York       10.7 29.4 472.6 319.1 1728.0 2782.0 745.8
North Carolina 10.6 17.0  61.3 318.3 1154.1 2037.8 192.1
North Dakota    0.9  9.0  13.3  43.8  446.1 1843.0 144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8 400.4
Oklahoma        8.6 29.2  73.8 205.0 1288.2 2228.1 326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1 388.9
Pennsylvania    5.6 19.0 130.3 128.0  877.5 1624.1 333.2
Rhode Island    3.6 10.5  86.5 201.0 1489.5 2844.1 791.4
South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4 245.1
South Dakota    2.0 13.5  17.9 155.7  570.5 1704.4 147.5
Tennessee      10.1 29.7 145.8 203.9 1259.7 1776.5 314.0
Texas          13.3 33.8 152.4 208.2 1603.1 2988.7 397.6
Utah            3.5 20.3  68.8 147.3 1171.6 3004.6 334.5
Vermont         1.4 15.9  30.8 101.2 1348.2 2201.0 265.2
Virginia        9.0 23.3  92.1 165.7  986.2 2521.2 226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9 360.3
West Virginia   6.0 13.2  42.2    .  597.4 1341.7 163.3
Wisconsin       2.8 12.9  52.2  63.7  846.9 2614.2 220.7
Wyoming          .  21.9  39.7 173.9  811.6 2772.2 282.0
;

The following statements invoke the HPPRINCOMP procedure, which requests a principal component analysis of the data and produces Figure 11.1 through Figure 11.4:

proc hpprincomp data=Crime;
run;

Figure 11.1 displays the Performance Information, Number of Observations, Number of Variables, and Simple Statistics tables.

The Performance Information table shows that the procedure executes in single-machine mode; that is, the data reside and the computation is performed on the machine where the SAS session executes. This run of the HPPRINCOMP procedure took place on a multicore machine with four CPUs; one computational thread was spawned per CPU.
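The thread count reported in the Performance Information table can also be requested explicitly. The following is a minimal sketch, assuming the PERFORMANCE statement and its NTHREADS= option are available here as in other SAS high-performance procedures; the value 2 is an arbitrary illustration, not a recommendation:

```sas
/* Sketch: ask for two computational threads instead of one per CPU. */
/* NTHREADS=2 is an illustrative value chosen for this example.      */
proc hpprincomp data=Crime;
   performance nthreads=2;
run;
```

With this request, the Number of Threads row of the Performance Information table would be expected to change accordingly.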

The Number of Observations table shows that of the 50 observations in the input data, only 48 are used in the analysis. The other two observations have incomplete data: the value of Assault is missing for West Virginia, and the value of Murder is missing for Wyoming.
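To see which observations are excluded, you can list the rows that have at least one missing analysis variable. The following sketch uses only a DATA step and PROC PRINT; the data set name Incomplete is a hypothetical name introduced for this example:

```sas
/* Keep only observations that have one or more missing crime rates. */
/* "Incomplete" is a hypothetical data set name for illustration.    */
data Incomplete;
   set Crime;
   if nmiss(Murder, Rape, Robbery, Assault,
            Burglary, Larceny, Auto_Theft) > 0;
run;

proc print data=Incomplete noobs;
run;
```

The NMISS function counts the missing numeric arguments, so the subsetting IF statement retains exactly the observations that an analysis restricted to complete cases would drop.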

The Number of Variables table indicates that there are seven variables to be analyzed. By default, if the VAR statement is omitted, all numeric variables that are not listed in other statements are used in the analysis.
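The default variable selection can be made explicit by listing the analysis variables in a VAR statement. For the Crime data set, the following sketch should request the same seven-variable analysis as the statements shown earlier, because State is a character variable and is not eligible by default:

```sas
/* Same analysis, with the seven numeric variables listed explicitly. */
proc hpprincomp data=Crime;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;
```

Listing the variables is also the way to analyze only a subset, for example only the property crime variables.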

The Simple Statistics table displays the mean and standard deviation of the analysis variables.

Figure 11.1: Performance Information and Simple Statistics

Crime Rates per 100,000 Population by State

The HPPRINCOMP Procedure

Performance Information
Execution Mode Single-Machine
Number of Threads 4

Number of Observations Read 50
Number of Observations Used 48

Number of Variables 7

Simple Statistics
                    Mean  Standard Deviation
Murder           7.51667             3.93059
Rape            26.07500            10.81304
Robbery        127.55625            88.49374
Assault        214.58750           100.64360
Burglary      1316.37917           423.31261
Larceny       2696.88542           714.75023
Auto_Theft     383.97917           194.37033


Figure 11.2 displays the Correlation Matrix table. By default, the PROC HPPRINCOMP statement requests that principal components be computed from the correlation matrix, so the total variance is equal to the number of variables, 7.
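If you want components computed from the covariance matrix instead, the request might look like the following sketch. It assumes that the COV option in the PROC HPPRINCOMP statement behaves as it does in PROC PRINCOMP; no output for this variant is shown in this section:

```sas
/* Sketch: base the analysis on the covariance matrix (COV option). */
/* Without COV, the correlation matrix is used, as in Figure 11.2.  */
proc hpprincomp data=Crime cov;
run;
```

In a covariance-based analysis the total variance is the sum of the variable variances rather than the number of variables. Because the standard deviations in the Simple Statistics table range from about 3.9 (Murder) to about 715 (Larceny), the high-variance variables would be expected to dominate the leading components, which is one reason the correlation matrix is a sensible default for these data.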

Figure 11.2: Correlation Matrix Table

Correlation Matrix
              Murder    Rape  Robbery  Assault  Burglary  Larceny  Auto_Theft
Murder        1.0000  0.6000   0.4768   0.6485    0.3778   0.0925      0.0555
Rape          0.6000  1.0000   0.5817   0.7316    0.7038   0.6009      0.3282
Robbery       0.4768  0.5817   1.0000   0.5452    0.6200   0.4371      0.5787
Assault       0.6485  0.7316   0.5452   1.0000    0.6082   0.3791      0.2520
Burglary      0.3778  0.7038   0.6200   0.6082    1.0000   0.7932      0.5390
Larceny       0.0925  0.6009   0.4371   0.3791    0.7932   1.0000      0.4246
Auto_Theft    0.0555  0.3282   0.5787   0.2520    0.5390   0.4246      1.0000


Figure 11.3 displays the Eigenvalues table. The first principal component accounts for about 57.8% of the total variance, the second principal component accounts for about 18.1%, and the third principal component accounts for about 10.7%. Note that the eigenvalues sum to the total variance.
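These figures can be checked directly from the Eigenvalues table. In the following worked arithmetic, the symbol \lambda_i for the ith eigenvalue is notation introduced here for illustration; the values are those reported in Figure 11.3:

\begin{align*}
\sum_{i=1}^{7} \lambda_i &= 4.045824 + 1.264030 + 0.747500 + 0.326325 + 0.265207 + 0.228364 + 0.122750 = 7.000000 \\
\frac{\lambda_1}{7} &= \frac{4.045824}{7} \approx 0.5780 \\
\frac{\lambda_2}{7} &= \frac{1.264030}{7} \approx 0.1806 \\
\frac{\lambda_3}{7} &= \frac{0.747500}{7} \approx 0.1068
\end{align*}

Each proportion is an eigenvalue divided by the total variance of 7, and the Cumulative column is the running sum of these proportions.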

The eigenvalues indicate that two or three components provide a good summary of the data: two components account for 76% of the total variance, and three components account for 87%. Subsequent components account for less than 5% each.
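If you decide to keep only the leading components, you can limit how many are computed. The following sketch assumes the N= option in the PROC HPPRINCOMP statement, which specifies the number of principal components to compute; the value 2 reflects the two-component summary discussed above:

```sas
/* Sketch: compute only the first two principal components. */
proc hpprincomp data=Crime n=2;
run;
```

Specifying N=3 instead would correspond to the three-component summary that accounts for about 87% of the total variance.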

Figure 11.3: Eigenvalues Table

Eigenvalues of the Correlation Matrix
     Eigenvalue  Difference  Proportion  Cumulative
1      4.045824    2.781795      0.5780      0.5780
2      1.264030    0.516529      0.1806      0.7586
3      0.747500    0.421175      0.1068      0.8653
4      0.326325    0.061119      0.0466      0.9120
5      0.265207    0.036843      0.0379      0.9498
6      0.228364    0.105613      0.0326      0.9825
7      0.122750                  0.0175      1.0000


Figure 11.4 displays the Eigenvectors table. From the eigenvectors matrix, you can represent the first principal component, Prin1, as a linear combination of the original variables:

\begin{align*}
\mathrm{Prin1} ={} & 0.302888 \times (\mathrm{Murder}) \\
& + 0.434103 \times (\mathrm{Rape}) \\
& + 0.397055 \times (\mathrm{Robbery}) \\
& + \cdots \\
& + 0.288343 \times (\mathrm{Auto\_Theft})
\end{align*}

Similarly, the second principal component, Prin2, is

\begin{align*}
\mathrm{Prin2} ={} & - 0.618929 \times (\mathrm{Murder}) \\
& - 0.170526 \times (\mathrm{Rape}) \\
& + 0.047125 \times (\mathrm{Robbery}) \\
& + \cdots \\
& + 0.504003 \times (\mathrm{Auto\_Theft})
\end{align*}

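In both expressions, the variables are standardized (centered at their means and scaled by their standard deviations), because the components are computed from the correlation matrix.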
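As a worked illustration of the standardization, the following uses the mean and standard deviation of Murder from the Simple Statistics table and the Alabama value from the input data. The symbols z, x, \bar{x}, and s are notation introduced here, not taken from the procedure output:

\begin{align*}
z &= \frac{x - \bar{x}}{s} \\
z_{\mathrm{Murder,\,Alabama}} &= \frac{14.2 - 7.51667}{3.93059} \approx 1.700
\end{align*}

Alabama's murder rate is therefore about 1.7 standard deviations above the mean of the 48 observations used, and this standardized value, not the raw rate of 14.2, is what the coefficient 0.302888 multiplies in the expression for Prin1.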

Figure 11.4: Eigenvectors Table

Eigenvectors
                 Prin1     Prin2     Prin3     Prin4     Prin5     Prin6     Prin7
Murder         0.30289  -0.61893   0.17353  -0.23308   0.54896   0.26371   0.26428
Rape           0.43410  -0.17053  -0.23539   0.06540   0.18075  -0.78232  -0.27946
Robbery        0.39705   0.04713   0.49208  -0.57470  -0.50808  -0.09452  -0.02497
Assault        0.39622  -0.35142  -0.05343   0.61743  -0.51525   0.17395   0.19921
Burglary       0.44164   0.20861  -0.22454  -0.02750   0.11273   0.52340  -0.65085
Larceny        0.35634   0.40570  -0.53681  -0.23231   0.02172   0.04085   0.60346
Auto_Theft     0.28834   0.50400   0.57524   0.41853   0.35939  -0.06024   0.15487


The first component is a measure of the overall crime rate, because the first eigenvector shows approximately equal loadings on all variables. The second eigenvector has high positive loadings on the variables Auto_Theft and Larceny and high negative loadings on the variables Murder and Assault. There is also a small positive loading on the variable Burglary and a small negative loading on the variable Rape. This component seems to measure the preponderance of property crime compared to violent crime. The interpretation of the third component is not obvious.
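One way to check these interpretations is to look at the component scores of individual states. The following is a sketch, not output from this section. It assumes that the OUT= option in the PROC HPPRINCOMP statement writes observationwise scores, that the ID statement carries State into that data set, and that the score variables take the default names Prin1 and Prin2. The data set name CrimeScores is hypothetical:

```sas
/* Sketch: write scores for the first two components, keeping State. */
/* "CrimeScores" is a hypothetical output data set name.             */
proc hpprincomp data=Crime n=2 out=CrimeScores;
   id State;
run;

/* Order states from highest to lowest score on the first component. */
proc sort data=CrimeScores;
   by descending Prin1;
run;

proc print data=CrimeScores noobs;
run;
```

If the first component does measure the overall crime rate, states with high rates in most categories should appear near the top of this listing. Sorting by Prin2 instead would separate states where property crime predominates from those where violent crime predominates.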