The HPGENSELECT Procedure

Getting Started: HPGENSELECT Procedure

This example illustrates how you can use PROC HPGENSELECT to perform Poisson regression for count data. The following DATA step contains 100 observations for a count response variable (Y), a continuous variable (Total) to be used in a later analysis, and five categorical variables (C1–C5), each of which has four numerical levels:

data getStarted;
  input C1-C5 Y Total;
  datalines;
0 3 1 1 3 2 28.361 
2 3 0 3 1 2 39.831 
1 3 2 2 2 1 17.133 
1 2 0 0 3 2 12.769 
0 2 1 0 1 1 29.464 
0 2 1 0 2 1 4.152 
1 2 1 0 1 0 0.000 
0 2 1 1 2 1 20.199 
1 2 0 0 1 0 0.000 
0 1 1 3 3 2 53.376 
2 2 2 2 1 1 31.923 
0 3 2 0 3 2 37.987 
2 2 2 0 0 1 1.082 
0 2 0 2 0 1 6.323 
1 3 0 0 0 0 0.000 
1 2 1 2 3 2 4.217 
0 1 2 3 1 1 26.084 
1 1 0 0 1 0 0.000 
1 3 2 2 2 0 0.000 
2 1 3 1 1 2 52.640 
1 3 0 1 2 1 3.257 
2 0 2 3 0 5 88.066 
2 2 2 1 0 1 15.196 
3 1 3 1 0 1 11.955 
3 1 3 1 2 3 91.790 
3 1 1 2 3 7 232.417 
3 1 1 1 0 1 2.124 
3 1 0 0 0 2 32.762 
3 1 2 3 0 1 25.415 
2 2 0 1 2 1 42.753 
3 3 2 2 3 1 23.854 
2 0 0 2 3 2 49.438 
1 0 0 2 3 4 105.449 
0 0 2 3 0 6 101.536 
0 3 1 0 0 0 0.000 
3 0 1 0 1 1 5.937 
2 0 0 0 3 2 53.952 
1 0 1 0 3 2 23.686 
1 1 3 1 1 1 0.287 
2 1 3 0 3 7 281.551 
1 3 2 1 1 0 0.000 
2 1 0 0 1 0 0.000 
0 0 1 1 2 3 93.009 
0 1 0 1 0 2 25.055 
1 2 2 2 3 1 1.691 
0 3 2 3 1 1 10.719 
3 3 0 3 3 1 19.279 
2 0 0 2 1 2 40.802 
2 2 3 0 3 3 72.924 
0 2 0 3 0 1 10.216 
3 0 1 2 2 2 87.773 
2 1 2 3 1 0 0.000 
3 2 0 3 1 0 0.000 
3 0 3 0 0 2 62.016 
1 3 2 2 1 3 36.355 
2 3 2 0 3 1 23.190 
1 0 1 2 1 1 11.784 
2 1 2 2 2 5 204.527 
3 0 1 1 2 5 115.937 
0 1 1 3 2 1 44.028 
2 2 1 3 1 4 52.247 
1 1 0 0 1 1 17.621 
3 3 1 2 1 2 10.706 
2 2 0 2 3 3 81.506 
0 1 0 0 2 2 81.835 
0 1 2 0 1 2 20.647 
3 2 2 2 0 1 3.110 
2 2 3 0 0 1 13.679 
1 2 2 3 2 1 6.486 
3 3 2 2 1 2 30.025 
0 0 3 1 3 6 202.172 
3 2 3 1 2 3 44.221 
0 3 0 0 0 1 27.645 
3 3 3 0 3 2 22.470 
2 3 2 0 2 0 0.000 
1 3 0 2 0 1 1.628 
1 3 1 0 2 0 0.000 
3 2 3 3 0 1 20.684 
3 1 0 2 0 4 108.000 
0 1 2 2 1 1 4.615 
0 2 3 2 2 1 12.461 
0 3 2 0 1 3 53.798 
2 1 1 2 0 1 36.320 
1 0 3 0 0 0 0.000 
0 0 3 2 0 1 19.902 
0 2 3 1 0 0 0.000 
2 2 2 1 3 2 31.815 
3 3 3 0 0 0 0.000 
2 2 1 3 3 2 17.915 
0 2 3 2 3 2 69.315 
1 3 1 2 1 0 0.000 
3 0 1 1 1 4 94.050 
2 1 1 1 3 6 242.266 
0 2 0 3 2 1 40.885 
2 0 1 1 2 2 74.708 
2 2 2 2 3 2 50.734 
1 0 2 2 1 3 35.950 
1 3 3 1 1 1 2.777 
3 1 2 1 3 5 118.065 
0 3 2 1 2 0 0.000 
;
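
Before fitting the model, you can confirm that each of C1–C5 takes the four values 0–3 and that Y is a nonnegative count. The following sketch uses the Base SAS procedures FREQ and MEANS (not part of the HPGENSELECT analysis itself) for a quick check:

proc freq data=getStarted;
   /* One-way frequency tables show the observed levels of each classification
      variable and the distribution of the count response */
   tables C1-C5 Y;
run;

proc means data=getStarted n min max mean;
   /* Summary statistics for the count response and the continuous variable */
   var Y Total;
run;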

The following statements fit a log-linked Poisson model to these data by using classification effects for variables C1–C5:

proc hpgenselect data=getStarted;
   class C1-C5;
   model Y = C1-C5 /  Distribution=Poisson Link=Log;
run;

The default output from this analysis is presented in Figure 4.1 through Figure 4.8.

The Performance Information table in Figure 4.1 shows that the procedure executed in single-machine mode (that is, on the server where SAS is installed). When high-performance procedures run in single-machine mode, they use concurrently scheduled threads. In this case, four threads were used.

Figure 4.1: Performance Information

The HPGENSELECT Procedure

Performance Information
Execution Mode Single-Machine
Number of Threads 4
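
The number of threads is chosen by the procedure, but you can override it and request timing output through the PERFORMANCE statement that the SAS high-performance procedures share. A minimal sketch, assuming the NTHREADS= and DETAILS options are available in your release:

proc hpgenselect data=getStarted;
   /* Request two threads and a timing table for this run */
   performance nthreads=2 details;
   class C1-C5;
   model Y = C1-C5 / Distribution=Poisson Link=Log;
run;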


Figure 4.2 displays the Model Information table. The variable Y is an integer-valued variable that is modeled by using a Poisson probability distribution, and the mean of Y is modeled by using a log link function. The HPGENSELECT procedure uses a Newton-Raphson algorithm with ridging to fit the model. The CLASS variables C1–C5 are parameterized by using GLM parameterization, which is the default.

Figure 4.2: Model Information

Model Information
Data Source WORK.GETSTARTED
Response Variable Y
Class Parameterization GLM
Distribution Poisson
Link Function Log
Optimization Technique Newton-Raphson with Ridging
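
GLM parameterization is the default for the CLASS variables. If the PARAM= option of the CLASS statement is available in your release (an assumption to verify), you could request a full-rank reference-cell coding instead, as in this sketch:

proc hpgenselect data=getStarted;
   /* PARAM=REF is assumed here to request full-rank reference-cell coding */
   class C1-C5 / param=ref;
   model Y = C1-C5 / Distribution=Poisson Link=Log;
run;

With a full-rank coding, the design matrix contains no singularities, so the number of columns equals the number of parameters.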


Each of the CLASS variables C1–C5 has four unique formatted levels, which are displayed in the Class Level Information table in Figure 4.3.

Figure 4.3: Class Level Information

Class Level Information
Class Levels Values
C1 4 0 1 2 3
C2 4 0 1 2 3
C3 4 0 1 2 3
C4 4 0 1 2 3
C5 4 0 1 2 3


Figure 4.4 displays the Number of Observations table. All 100 observations in the data set are used in the analysis.

Figure 4.4: Number of Observations

Number of Observations Read 100
Number of Observations Used 100


Figure 4.5 displays the Dimensions table for this model. This table summarizes some important sizes of various model components. For example, it shows that there are 21 columns in the design matrix X: one column for the intercept and 20 columns for the effects that are associated with the classification variables C1–C5. However, the rank of the crossproducts matrix is only 16. Because the classification variables C1–C5 use GLM parameterization and because the model contains an intercept, there is one singularity in the crossproducts matrix of the model for each classification variable. Consequently, only 16 parameters enter the optimization.

Figure 4.5: Dimensions in Poisson Regression

Dimensions
Number of Effects 6
Number of Parameters 16
Columns in X 21
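
The counts in Figure 4.5 can be reproduced by hand: the intercept contributes one column, each of the five GLM-coded classification variables contributes four columns, and each classification variable introduces one singularity:

\begin{align*}
\text{Columns in } X &= 1 + 5 \times 4 = 21, \\
\text{Parameters}    &= 21 - 5 = 16.
\end{align*}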


Figure 4.6 displays the final convergence status of the Newton-Raphson algorithm. The GCONV= relative convergence criterion is satisfied.

Figure 4.6: Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.


The Fit Statistics table is shown in Figure 4.7. The –2 log likelihood at the converged estimates is 290.16169. You can use this value to compare the model to nested model alternatives by means of a likelihood-ratio test. To compare models that are not nested, information criteria such as AIC (Akaike’s information criterion), AICC (Akaike’s bias-corrected information criterion), and BIC (Schwarz Bayesian information criterion) are used. These criteria penalize the –2 log likelihood for the number of parameters.

Figure 4.7: Fit Statistics

Fit Statistics
-2 Log Likelihood 290.16169
AIC (smaller is better) 322.16169
AICC (smaller is better) 328.71590
BIC (smaller is better) 363.84441
Pearson Chi-Square 77.76937
Pearson Chi-Square/DF 0.92583
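
The information criteria in Figure 4.7 follow from the –2 log likelihood together with the k = 16 parameters reported in Figure 4.5 and the n = 100 observations used; the usual forms of these criteria reproduce the tabulated values:

\begin{align*}
\mathrm{AIC}  &= -2\log L + 2k = 290.16169 + 2(16) = 322.16169, \\
\mathrm{AICC} &= \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} = 322.16169 + \frac{544}{83} = 328.71590, \\
\mathrm{BIC}  &= -2\log L + k \ln n = 290.16169 + 16 \ln 100 = 363.84441.
\end{align*}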


The Parameter Estimates table in Figure 4.8 shows that many parameters have fairly large p-values, indicating that one or more of the model effects might not be necessary.

Figure 4.8: Parameter Estimates

Parameter Estimates
Parameter DF Estimate Standard Error Chi-Square Pr > ChiSq
Intercept 1 0.881903 0.382730 5.3095 0.0212
C1 0 1 -0.196002 0.211482 0.8590 0.3540
C1 1 1 -0.605161 0.263508 5.2742 0.0216
C1 2 1 -0.068458 0.210776 0.1055 0.7453
C1 3 0 0 . . .
C2 0 1 0.961117 0.255485 14.1521 0.0002
C2 1 1 0.708188 0.246768 8.2360 0.0041
C2 2 1 0.161741 0.266365 0.3687 0.5437
C2 3 0 0 . . .
C3 0 1 -0.227016 0.252561 0.8079 0.3687
C3 1 1 -0.094775 0.229519 0.1705 0.6797
C3 2 1 0.044801 0.238127 0.0354 0.8508
C3 3 0 0 . . .
C4 0 1 -0.280476 0.263589 1.1322 0.2873
C4 1 1 0.028157 0.249652 0.0127 0.9102
C4 2 1 0.047803 0.240378 0.0395 0.8424
C4 3 0 0 . . .
C5 0 1 -0.817936 0.219901 13.8351 0.0002
C5 1 1 -0.710596 0.206265 11.8684 0.0006
C5 2 1 -0.602080 0.217724 7.6471 0.0057
C5 3 0 0 . . .
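
Because several effects in Figure 4.8 contribute parameters with large p-values, a natural follow-up is to let PROC HPGENSELECT choose a subset of effects by adding a SELECTION statement. A minimal sketch, assuming backward elimination (METHOD=BACKWARD) is available in your release:

proc hpgenselect data=getStarted;
   class C1-C5;
   model Y = C1-C5 / Distribution=Poisson Link=Log;
   /* Remove effects one at a time until every remaining effect meets the stay criterion */
   selection method=backward;
run;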