The HPGENSELECT Procedure

Getting Started: HPGENSELECT Procedure

This example illustrates how you can use PROC HPGENSELECT to perform Poisson regression for count data. The following DATA step contains 100 observations for a count response variable (Y), a continuous variable (Total) to be used in a later analysis, and five categorical variables (C1–C5), each of which has four numerical levels:

data getStarted;
  input C1-C5 Y Total;
  datalines;
0 3 1 1 3 2 28.361 
2 3 0 3 1 2 39.831 
1 3 2 2 2 1 17.133 
1 2 0 0 3 2 12.769 
0 2 1 0 1 1 29.464 
0 2 1 0 2 1 4.152 
1 2 1 0 1 0 0.000 
0 2 1 1 2 1 20.199 
1 2 0 0 1 0 0.000 
0 1 1 3 3 2 53.376 
2 2 2 2 1 1 31.923 
0 3 2 0 3 2 37.987 
2 2 2 0 0 1 1.082 
0 2 0 2 0 1 6.323 
1 3 0 0 0 0 0.000 
1 2 1 2 3 2 4.217 
0 1 2 3 1 1 26.084 
1 1 0 0 1 0 0.000 
1 3 2 2 2 0 0.000 
2 1 3 1 1 2 52.640 
1 3 0 1 2 1 3.257 
2 0 2 3 0 5 88.066 
2 2 2 1 0 1 15.196 
3 1 3 1 0 1 11.955 
3 1 3 1 2 3 91.790 
3 1 1 2 3 7 232.417 
3 1 1 1 0 1 2.124 
3 1 0 0 0 2 32.762 
3 1 2 3 0 1 25.415 
2 2 0 1 2 1 42.753 
3 3 2 2 3 1 23.854 
2 0 0 2 3 2 49.438 
1 0 0 2 3 4 105.449 
0 0 2 3 0 6 101.536 
0 3 1 0 0 0 0.000 
3 0 1 0 1 1 5.937 
2 0 0 0 3 2 53.952 
1 0 1 0 3 2 23.686 
1 1 3 1 1 1 0.287 
2 1 3 0 3 7 281.551 
1 3 2 1 1 0 0.000 
2 1 0 0 1 0 0.000 
0 0 1 1 2 3 93.009 
0 1 0 1 0 2 25.055 
1 2 2 2 3 1 1.691 
0 3 2 3 1 1 10.719 
3 3 0 3 3 1 19.279 
2 0 0 2 1 2 40.802 
2 2 3 0 3 3 72.924 
0 2 0 3 0 1 10.216 
3 0 1 2 2 2 87.773 
2 1 2 3 1 0 0.000 
3 2 0 3 1 0 0.000 
3 0 3 0 0 2 62.016 
1 3 2 2 1 3 36.355 
2 3 2 0 3 1 23.190 
1 0 1 2 1 1 11.784 
2 1 2 2 2 5 204.527 
3 0 1 1 2 5 115.937 
0 1 1 3 2 1 44.028 
2 2 1 3 1 4 52.247 
1 1 0 0 1 1 17.621 
3 3 1 2 1 2 10.706 
2 2 0 2 3 3 81.506 
0 1 0 0 2 2 81.835 
0 1 2 0 1 2 20.647 
3 2 2 2 0 1 3.110 
2 2 3 0 0 1 13.679 
1 2 2 3 2 1 6.486 
3 3 2 2 1 2 30.025 
0 0 3 1 3 6 202.172 
3 2 3 1 2 3 44.221 
0 3 0 0 0 1 27.645 
3 3 3 0 3 2 22.470 
2 3 2 0 2 0 0.000 
1 3 0 2 0 1 1.628 
1 3 1 0 2 0 0.000 
3 2 3 3 0 1 20.684 
3 1 0 2 0 4 108.000 
0 1 2 2 1 1 4.615 
0 2 3 2 2 1 12.461 
0 3 2 0 1 3 53.798 
2 1 1 2 0 1 36.320 
1 0 3 0 0 0 0.000 
0 0 3 2 0 1 19.902 
0 2 3 1 0 0 0.000 
2 2 2 1 3 2 31.815 
3 3 3 0 0 0 0.000 
2 2 1 3 3 2 17.915 
0 2 3 2 3 2 69.315 
1 3 1 2 1 0 0.000 
3 0 1 1 1 4 94.050 
2 1 1 1 3 6 242.266 
0 2 0 3 2 1 40.885 
2 0 1 1 2 2 74.708 
2 2 2 2 3 2 50.734 
1 0 2 2 1 3 35.950 
1 3 3 1 1 1 2.777 
3 1 2 1 3 5 118.065 
0 3 2 1 2 0 0.000 
;
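
Before you fit the model, it can be useful to confirm that the data were read as intended. The following statements are a minimal sketch and are not part of the original example. They use PROC FREQ to tabulate the levels of C1–C5 and PROC MEANS to summarize Y and Total; the choice of statistics is an assumption made for illustration.

```sas
/* Tabulate each classification variable; each should show the levels 0 through 3 */
proc freq data=getStarted;
   tables C1-C5 / nocum;
run;

/* Summarize the count response and the continuous variable */
proc means data=getStarted n mean var min max;
   var Y Total;
run;
```

Comparing the sample mean and variance of Y gives only a rough first impression of whether a Poisson distribution is plausible, because the model describes the distribution of Y conditional on C1–C5.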

The following statements fit a log-linked Poisson model to these data by using classification effects for variables C1–C5:

proc hpgenselect data=getStarted;
   class C1-C5;                                      /* treat C1-C5 as classification variables */
   model Y = C1-C5 / Distribution=Poisson Link=Log;  /* Poisson response with a log link */
run;

The default output from this analysis is presented in Figure 8.1 through Figure 8.8.

The "Performance Information" table in Figure 8.1 shows that the procedure executed in single-machine mode (that is, on the server where SAS is installed). When high-performance procedures run in single-machine mode, they use concurrently scheduled threads. In this case, four threads were used.

Figure 8.1: Performance Information

The HPGENSELECT Procedure

Performance Information
Execution Mode	Single-Machine
Number of Threads	4
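
If you want to control the number of threads rather than accept the default, you can add a PERFORMANCE statement. The following statements are a sketch under the assumption that the NTHREADS= and DETAILS options of the PERFORMANCE statement are available in your installation; the value 2 is arbitrary and chosen only for illustration.

```sas
proc hpgenselect data=getStarted;
   class C1-C5;
   model Y = C1-C5 / Distribution=Poisson Link=Log;
   performance nthreads=2 details;   /* request two threads and a timing table */
run;
```

With this request, the "Performance Information" table would be expected to report the requested thread count instead of the default.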

Figure 8.2 displays the "Model Information" table. The variable Y is an integer-valued variable that is modeled by using a Poisson probability distribution, and the mean of Y is modeled by using a log link function. The HPGENSELECT procedure uses a Newton-Raphson algorithm with ridging to fit the model. The CLASS variables C1–C5 are parameterized by using GLM parameterization, which is the default.

Figure 8.2: Model Information

Model Information
Data Source	WORK.GETSTARTED
Response Variable	Y
Class Parameterization	GLM
Distribution	Poisson
Link Function	Log
Optimization Technique	Newton-Raphson with Ridging
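
In symbols, the model that is summarized in Figure 8.2 can be sketched as follows. The notation is an assumption introduced here for illustration: $\mu_i$ denotes the mean of Y for observation $i$, $\mathbf{x}_i$ denotes the corresponding row of the design matrix, and $\boldsymbol{\beta}$ denotes the parameter vector.

```latex
\begin{align*}
\Pr(Y_i = y) &= \frac{\mu_i^{\,y} e^{-\mu_i}}{y!}, \qquad y = 0, 1, 2, \ldots \\
\log(\mu_i) &= \mathbf{x}_i' \boldsymbol{\beta}
\end{align*}
```

Because of the log link, each coefficient acts multiplicatively on the mean of Y.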

Each of the CLASS variables C1–C5 has four unique formatted levels, which are displayed in the "Class Level Information" table in Figure 8.3.

Figure 8.3: Class Level Information

Class Level Information
Class	Levels	Values
C1	4	0 1 2 3
C2	4	0 1 2 3
C3	4	0 1 2 3
C4	4	0 1 2 3
C5	4	0 1 2 3

Figure 8.4 displays the "Number of Observations" table. All 100 observations in the data set are used in the analysis.

Figure 8.4: Number of Observations

Number of Observations Read	100
Number of Observations Used	100

Figure 8.5 displays the "Dimensions" table for this model, which summarizes the sizes of the main model components. For example, it shows that there are 21 columns in the design matrix $\mathbf{X}$: one column for the intercept and 20 columns for the effects that are associated with the classification variables C1–C5. However, the rank of the crossproducts matrix is only 16. Because the classification variables C1–C5 use GLM parameterization and because the model contains an intercept, there is one singularity in the crossproducts matrix of the model for each classification variable. Consequently, only 16 parameters enter the optimization.

Figure 8.5: Dimensions in Poisson Regression

Dimensions
Number of Effects	6
Number of Parameters	16
Columns in X	21
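
The counts in Figure 8.5 can be reproduced by simple arithmetic: one intercept column plus four level columns for each of the five classification variables, less one singularity per classification variable.

```latex
\begin{align*}
\text{Columns in } \mathbf{X} &= 1 + 5 \times 4 = 21 \\
\text{Number of Parameters} &= 21 - 5 = 16
\end{align*}
```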

Figure 8.6 displays the final convergence status of the Newton-Raphson algorithm. The GCONV= relative convergence criterion is satisfied.

Figure 8.6: Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

The "Fit Statistics" table is shown in Figure 8.7. The –2 log likelihood at the converged estimates is 290.16169 (displayed as 290.16 in the table). You can use this value to compare the model to nested model alternatives by means of a likelihood-ratio test. To compare models that are not nested, you can use information criteria such as AIC (Akaike’s information criterion), AICC (Akaike’s bias-corrected information criterion), and BIC (Schwarz Bayesian information criterion). These criteria penalize the –2 log likelihood for the number of parameters.

Figure 8.7: Fit Statistics

Fit Statistics
-2 Log Likelihood	290.16
AIC (smaller is better)	322.16
AICC (smaller is better)	328.72
BIC (smaller is better)	363.84
Pearson Chi-Square	77.7694
Pearson Chi-Square/DF	0.9258
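
As a worked check, the values in Figure 8.7 are consistent with the usual definitions of these criteria. The formulas below are a sketch that assumes those standard definitions, with $p = 16$ parameters, $n = 100$ observations, and $n - p = 84$ degrees of freedom for the Pearson statistic; the symbols $p$, $n$, and $L$ are introduced here for illustration.

```latex
\begin{align*}
\mathrm{AIC}  &= -2\log L + 2p = 290.16169 + 32 \approx 322.16 \\
\mathrm{AICC} &= -2\log L + \frac{2pn}{n - p - 1} = 290.16169 + \frac{3200}{83} \approx 328.72 \\
\mathrm{BIC}  &= -2\log L + p \log(n) = 290.16169 + 16 \log(100) \approx 363.84 \\
\text{Pearson Chi-Square/DF} &= \frac{77.7694}{100 - 16} \approx 0.9258
\end{align*}
```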

The "Parameter Estimates" table in Figure 8.8 shows that many parameters have fairly large p-values. For example, none of the estimated level effects for C3 and C4 has a p-value below 0.28. This indicates that one or more of the model effects might not be necessary.

Figure 8.8: Parameter Estimates

Parameter Estimates
Parameter	DF	Estimate	Standard Error	Chi-Square	Pr > ChiSq
Intercept	1	0.881903	0.382730	5.3095	0.0212
C1 0	1	-0.196002	0.211482	0.8590	0.3540
C1 1	1	-0.605161	0.263508	5.2742	0.0216
C1 2	1	-0.068458	0.210776	0.1055	0.7453
C1 3	0	0	.	.	.
C2 0	1	0.961117	0.255485	14.1521	0.0002
C2 1	1	0.708188	0.246768	8.2360	0.0041
C2 2	1	0.161741	0.266365	0.3687	0.5437
C2 3	0	0	.	.	.
C3 0	1	-0.227016	0.252561	0.8079	0.3687
C3 1	1	-0.094775	0.229519	0.1705	0.6797
C3 2	1	0.044801	0.238127	0.0354	0.8508
C3 3	0	0	.	.	.
C4 0	1	-0.280476	0.263589	1.1322	0.2873
C4 1	1	0.028157	0.249652	0.0127	0.9102
C4 2	1	0.047803	0.240378	0.0395	0.8424
C4 3	0	0	.	.	.
C5 0	1	-0.817936	0.219901	13.8351	0.0002
C5 1	1	-0.710596	0.206265	11.8684	0.0006
C5 2	1	-0.602080	0.217724	7.6471	0.0057
C5 3	0	0	.	.	.
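
The entries in the Chi-Square column are consistent with Wald statistics, that is, with the squared ratio of each estimate to its standard error. The following sketch checks this for the intercept and, as an interpretation aid under the log link, exponentiates the estimate for C2 0 to obtain the estimated ratio of the mean of Y at C2=0 to the mean at the reference level C2=3, with the other variables held fixed.

```latex
\begin{align*}
\chi^2_{\text{Intercept}} &= \left(\frac{0.881903}{0.382730}\right)^2 \approx 5.3095 \\
\exp(0.961117) &\approx 2.61
\end{align*}
```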