This example illustrates how you can use PROC HPGENSELECT to perform Poisson regression for count data. The following DATA step contains 100 observations for a count response variable (Y), a continuous variable (Total) to be used in a later analysis, and five categorical variables (C1–C5), each of which has four numerical levels:
data getStarted;
   input C1-C5 Y Total;
   datalines;
0 3 1 1 3 2 28.361
2 3 0 3 1 2 39.831
1 3 2 2 2 1 17.133
1 2 0 0 3 2 12.769
0 2 1 0 1 1 29.464
0 2 1 0 2 1 4.152
1 2 1 0 1 0 0.000
0 2 1 1 2 1 20.199
1 2 0 0 1 0 0.000
0 1 1 3 3 2 53.376
2 2 2 2 1 1 31.923
0 3 2 0 3 2 37.987
2 2 2 0 0 1 1.082
0 2 0 2 0 1 6.323
1 3 0 0 0 0 0.000
1 2 1 2 3 2 4.217
0 1 2 3 1 1 26.084
1 1 0 0 1 0 0.000
1 3 2 2 2 0 0.000
2 1 3 1 1 2 52.640
1 3 0 1 2 1 3.257
2 0 2 3 0 5 88.066
2 2 2 1 0 1 15.196
3 1 3 1 0 1 11.955
3 1 3 1 2 3 91.790
3 1 1 2 3 7 232.417
3 1 1 1 0 1 2.124
3 1 0 0 0 2 32.762
3 1 2 3 0 1 25.415
2 2 0 1 2 1 42.753
3 3 2 2 3 1 23.854
2 0 0 2 3 2 49.438
1 0 0 2 3 4 105.449
0 0 2 3 0 6 101.536
0 3 1 0 0 0 0.000
3 0 1 0 1 1 5.937
2 0 0 0 3 2 53.952
1 0 1 0 3 2 23.686
1 1 3 1 1 1 0.287
2 1 3 0 3 7 281.551
1 3 2 1 1 0 0.000
2 1 0 0 1 0 0.000
0 0 1 1 2 3 93.009
0 1 0 1 0 2 25.055
1 2 2 2 3 1 1.691
0 3 2 3 1 1 10.719
3 3 0 3 3 1 19.279
2 0 0 2 1 2 40.802
2 2 3 0 3 3 72.924
0 2 0 3 0 1 10.216
3 0 1 2 2 2 87.773
2 1 2 3 1 0 0.000
3 2 0 3 1 0 0.000
3 0 3 0 0 2 62.016
1 3 2 2 1 3 36.355
2 3 2 0 3 1 23.190
1 0 1 2 1 1 11.784
2 1 2 2 2 5 204.527
3 0 1 1 2 5 115.937
0 1 1 3 2 1 44.028
2 2 1 3 1 4 52.247
1 1 0 0 1 1 17.621
3 3 1 2 1 2 10.706
2 2 0 2 3 3 81.506
0 1 0 0 2 2 81.835
0 1 2 0 1 2 20.647
3 2 2 2 0 1 3.110
2 2 3 0 0 1 13.679
1 2 2 3 2 1 6.486
3 3 2 2 1 2 30.025
0 0 3 1 3 6 202.172
3 2 3 1 2 3 44.221
0 3 0 0 0 1 27.645
3 3 3 0 3 2 22.470
2 3 2 0 2 0 0.000
1 3 0 2 0 1 1.628
1 3 1 0 2 0 0.000
3 2 3 3 0 1 20.684
3 1 0 2 0 4 108.000
0 1 2 2 1 1 4.615
0 2 3 2 2 1 12.461
0 3 2 0 1 3 53.798
2 1 1 2 0 1 36.320
1 0 3 0 0 0 0.000
0 0 3 2 0 1 19.902
0 2 3 1 0 0 0.000
2 2 2 1 3 2 31.815
3 3 3 0 0 0 0.000
2 2 1 3 3 2 17.915
0 2 3 2 3 2 69.315
1 3 1 2 1 0 0.000
3 0 1 1 1 4 94.050
2 1 1 1 3 6 242.266
0 2 0 3 2 1 40.885
2 0 1 1 2 2 74.708
2 2 2 2 3 2 50.734
1 0 2 2 1 3 35.950
1 3 3 1 1 1 2.777
3 1 2 1 3 5 118.065
0 3 2 1 2 0 0.000
;
The following statements fit a log-linked Poisson model to these data by using classification effects for variables C1–C5:
proc hpgenselect data=getStarted;
   class C1-C5;
   model Y = C1-C5 / Distribution=Poisson Link=Log;
run;
The default output from this analysis is presented in Figure 4.1 through Figure 4.8.
The “Performance Information” table in Figure 4.1 shows that the procedure executed in single-machine mode (that is, on the server where SAS is installed). When high-performance procedures run in single-machine mode, they use concurrently scheduled threads. In this case, four threads were used.
Figure 4.1: Performance Information
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
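If you want to control the thread count rather than accept the default, the PERFORMANCE statement is the usual place to do so in high-performance procedures. The following is a minimal sketch, not part of the original example; the value 2 is an arbitrary illustration, and the DETAILS option is included on the assumption that a timing breakdown is wanted:

```sas
/* Sketch: same model as above, with an explicit thread count.  */
/* NTHREADS=2 is an arbitrary illustrative value.               */
proc hpgenselect data=getStarted;
   class C1-C5;
   model Y = C1-C5 / Distribution=Poisson Link=Log;
   performance nthreads=2 details;   /* DETAILS requests a timing table */
run;
```

With this change, the "Number of Threads" row in the "Performance Information" table should report the requested value instead of 4.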
Figure 4.2 displays the “Model Information” table. The variable Y is an integer-valued variable that is modeled by using a Poisson probability distribution, and the mean of Y is modeled by using a log link function. The HPGENSELECT procedure uses a Newton-Raphson algorithm to fit the model. The CLASS variables C1–C5 are parameterized by using GLM parameterization, which is the default.
Figure 4.2: Model Information
Model Information | |
---|---|
Data Source | WORK.GETSTARTED |
Response Variable | Y |
Class Parameterization | GLM |
Distribution | Poisson |
Link Function | Log |
Optimization Technique | Newton-Raphson with Ridging |
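The model summarized in this table can be written out explicitly. The notation below is introduced here for illustration only (the symbols $\mu_i$, $\beta_0$, $\beta_{jk}$, and the indicator function do not appear in the original text): $i$ indexes observations, $j$ indexes the classification variables C1–C5, and $k$ indexes the four levels 0–3.

```latex
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \sum_{j=1}^{5} \sum_{k=0}^{3} \beta_{jk}\, \mathbf{1}\{C_{j,i} = k\}
```

Under GLM parameterization every level of every CLASS variable receives its own indicator column, so the model as written is overparameterized. The estimates reported in Figure 4.8 resolve this by setting the coefficient of the last level of each variable to zero.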
Each of the CLASS variables C1–C5 has four unique formatted levels, which are displayed in the “Class Level Information” table in Figure 4.3.
Figure 4.3: Class Level Information
Class Level Information | ||
---|---|---|
Class | Levels | Values |
C1 | 4 | 0 1 2 3 |
C2 | 4 | 0 1 2 3 |
C3 | 4 | 0 1 2 3 |
C4 | 4 | 0 1 2 3 |
C5 | 4 | 0 1 2 3 |
Figure 4.4 displays the “Number of Observations” table. All 100 observations in the data set are used in the analysis.
Figure 4.5 displays the “Dimensions” table for this model. This table summarizes some important sizes of various model components. For example, it shows that there are 21 columns in the design matrix: one column for the intercept and 20 columns for the effects that are associated with the classification variables C1–C5. However, the rank of the crossproducts matrix is only 16. Because the classification variables C1–C5 use GLM parameterization and because the model contains an intercept, there is one singularity in the crossproducts matrix of the model for each classification variable. Consequently, only 16 parameters enter the optimization.
Figure 4.5: Dimensions in Poisson Regression
Dimensions | |
---|---|
Number of Effects | 6 |
Number of Parameters | 16 |
Columns in X | 21 |
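The counts in this table follow from simple arithmetic on the model structure: one intercept column plus four indicator columns for each of the five classification variables, less one redundant column per variable.

```latex
\begin{aligned}
\text{Columns in } X &= 1 + 5 \times 4 = 21 \\
\text{Number of Parameters} &= 21 - 5 = 1 + 5 \times (4 - 1) = 16
\end{aligned}
```

The six effects are the intercept and the five main effects C1–C5.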
Figure 4.6 displays the final convergence status of the Newton-Raphson algorithm. The GCONV= relative convergence criterion is satisfied.
The “Fit Statistics” table is shown in Figure 4.7. The –2 log likelihood at the converged estimates is 290.16169. You can use this value to compare the model to nested model alternatives by means of a likelihood-ratio test. To compare models that are not nested, information criteria such as AIC (Akaike’s information criterion), AICC (Akaike’s bias-corrected information criterion), and BIC (Schwarz Bayesian information criterion) are used. These criteria penalize the –2 log likelihood for the number of parameters.
Figure 4.7: Fit Statistics
Fit Statistics | |
---|---|
-2 Log Likelihood | 290.16169 |
AIC (smaller is better) | 322.16169 |
AICC (smaller is better) | 328.71590 |
BIC (smaller is better) | 363.84441 |
Pearson Chi-Square | 77.76937 |
Pearson Chi-Square/DF | 0.92583 |
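The penalized criteria can be checked by hand from the –2 log likelihood. The DATA step below is a sketch that assumes the common textbook definitions AIC = –2LL + 2p, AICC = –2LL + 2pn/(n – p – 1), and BIC = –2LL + p log(n), with p = 16 parameters and n = 100 observations; these formulas are an assumption here rather than a quotation from the procedure documentation, although they agree with the tabled values to within rounding of the displayed –2 log likelihood. The variable names are arbitrary.

```sas
/* Sketch: recompute the fit statistics in Figure 4.7 by hand. */
data _null_;
   m2ll    = 290.16169;   /* -2 log likelihood from Figure 4.7       */
   pearson = 77.76937;    /* Pearson chi-square from Figure 4.7      */
   p       = 16;          /* parameters in the optimization          */
   n       = 100;         /* observations used                       */

   aic  = m2ll + 2*p;                   /* assumed definition */
   aicc = m2ll + 2*p*n / (n - p - 1);   /* assumed definition */
   bic  = m2ll + p*log(n);              /* LOG is the natural logarithm */

   df          = n - p;                 /* residual degrees of freedom */
   pearson_df  = pearson / df;

   put aic= 12.5 aicc= 12.5 bic= 12.5 pearson_df= 8.5;
run;
```

A Pearson chi-square/DF ratio near 1, as here, gives no strong indication of overdispersion relative to the Poisson assumption.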
The “Parameter Estimates” table in Figure 4.8 shows that many parameters have fairly large p-values, indicating that one or more of the model effects might not be necessary.
Figure 4.8: Parameter Estimates
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | DF | Estimate | Standard Error | Chi-Square | Pr > ChiSq |
Intercept | 1 | 0.881903 | 0.382730 | 5.3095 | 0.0212 |
C1 0 | 1 | -0.196002 | 0.211482 | 0.8590 | 0.3540 |
C1 1 | 1 | -0.605161 | 0.263508 | 5.2742 | 0.0216 |
C1 2 | 1 | -0.068458 | 0.210776 | 0.1055 | 0.7453 |
C1 3 | 0 | 0 | . | . | . |
C2 0 | 1 | 0.961117 | 0.255485 | 14.1521 | 0.0002 |
C2 1 | 1 | 0.708188 | 0.246768 | 8.2360 | 0.0041 |
C2 2 | 1 | 0.161741 | 0.266365 | 0.3687 | 0.5437 |
C2 3 | 0 | 0 | . | . | . |
C3 0 | 1 | -0.227016 | 0.252561 | 0.8079 | 0.3687 |
C3 1 | 1 | -0.094775 | 0.229519 | 0.1705 | 0.6797 |
C3 2 | 1 | 0.044801 | 0.238127 | 0.0354 | 0.8508 |
C3 3 | 0 | 0 | . | . | . |
C4 0 | 1 | -0.280476 | 0.263589 | 1.1322 | 0.2873 |
C4 1 | 1 | 0.028157 | 0.249652 | 0.0127 | 0.9102 |
C4 2 | 1 | 0.047803 | 0.240378 | 0.0395 | 0.8424 |
C4 3 | 0 | 0 | . | . | . |
C5 0 | 1 | -0.817936 | 0.219901 | 13.8351 | 0.0002 |
C5 1 | 1 | -0.710596 | 0.206265 | 11.8684 | 0.0006 |
C5 2 | 1 | -0.602080 | 0.217724 | 7.6471 | 0.0057 |
C5 3 | 0 | 0 | . | . | . |
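As a worked reading of one row, take the parameter C1 1. The chi-square column is consistent with a Wald statistic, the squared ratio of the estimate to its standard error, and exponentiating the estimate gives a rate ratio relative to the reference level C1 = 3. The symbols $\hat\beta$ and $\mathrm{SE}$ are notation introduced here, and the rate-ratio interpretation is the standard one for a log link rather than something stated in the original text.

```latex
\begin{aligned}
\chi^2_{\text{Wald}} &= \left( \frac{\hat\beta}{\mathrm{SE}(\hat\beta)} \right)^{2}
  = \left( \frac{-0.605161}{0.263508} \right)^{2} \approx 5.274 \\[4pt]
\exp(\hat\beta) &= \exp(-0.605161) \approx 0.546
\end{aligned}
```

With the other variables held fixed, the expected count at C1 = 1 is therefore estimated to be roughly 0.55 times the expected count at C1 = 3. Rows with DF = 0 are the reference levels whose coefficients are fixed at zero.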