The HPGENSELECT Procedure

Example 7.3 Tweedie Model

The following HPGENSELECT statements examine the data set getStarted used in the section Getting Started: HPGENSELECT Procedure, but they request that a Tweedie model be fit by using the continuous variable Total as the response instead of the count variable Y. The statements fit a log-linked Tweedie model to these data by using classification effects for variables C1–C5. In an insurance underwriting context, Y represents the total number of claims in each category that is defined by C1–C5, and Total represents the total cost of the claims (that is, the sum of costs for individual claims). The CODE statement requests that a text file named ScoringParameters.txt be created. This file contains a SAS program, built from the parameter estimates of the current model, that you can use to score a new data set.

proc hpgenselect data=getStarted;
   class C1-C5;
   model Total = C1-C5 / Distribution=Tweedie Link=Log;
   code File='ScoringParameters.txt';
run;

The Optimization Stage Details table in Output 7.3.1 shows the stages used in computing the maximum likelihood estimates of the parameters of the Tweedie model. Stage 1 uses quasi-likelihood and all of the data to compute starting values for stage 2, which uses all of the data and the Tweedie log-likelihood to compute the final estimates.

Output 7.3.1: Optimization Stage Details

The HPGENSELECT Procedure

Optimization Stage Details

Optimization  Optimization     Sampling    Observations
Stage         Type             Percentage  Used
1             Quasilikelihood  100.00      100
2             Full Likelihood  100.00      100


The Parameter Estimates table in Output 7.3.2 shows the resulting regression model parameter estimates, the estimated Tweedie dispersion parameter, and the estimated Tweedie power.

Output 7.3.2: Parameter Estimates

Parameter Estimates

Parameter   DF   Estimate  Standard Error  Chi-Square  Pr > ChiSq
Intercept    1   3.888904        0.435325     79.8044      <.0001
C1 0         1  -0.072400        0.240613      0.0905      0.7635
C1 1         1  -1.358456        0.324363     17.5400      <.0001
C1 2         1   0.154711        0.237394      0.4247      0.5146
C1 3         0   0                      .           .           .
C2 0         1   1.350591        0.289897     21.7050      <.0001
C2 1         1   1.159242        0.275459     17.7106      <.0001
C2 2         1   0.033921        0.303204      0.0125      0.9109
C2 3         0   0                      .           .           .
C3 0         1  -0.217763        0.272474      0.6387      0.4242
C3 1         1  -0.289425        0.259751      1.2415      0.2652
C3 2         1  -0.131961        0.276723      0.2274      0.6335
C3 3         0   0                      .           .           .
C4 0         1  -0.258069        0.288840      0.7983      0.3716
C4 1         1  -0.057042        0.287566      0.0393      0.8428
C4 2         1   0.219697        0.272064      0.6521      0.4194
C4 3         0   0                      .           .           .
C5 0         1  -1.314657        0.257806     26.0038      <.0001
C5 1         1  -0.996980        0.236881     17.7138      <.0001
C5 2         1  -0.481185        0.235614      4.1708      0.0411
C5 3         0   0                      .           .           .
Dispersion   1   5.296966        0.773401           .           .
Power        1   1.425625        0.048981           .           .
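
The last two rows of the table can be read through the Tweedie mean-variance relationship. As a brief reminder, written in standard Tweedie notation with $\phi$ for the dispersion and $p$ for the power (symbols introduced here, not used elsewhere in this example), the variance of the response is a power function of its mean:

$$\mathrm{Var}(Y) = \phi\,\mu^{p}, \qquad \hat{\phi} = 5.296966, \quad \hat{p} = 1.425625$$

Because the estimated power lies between 1 and 2, the fitted distribution is of the compound Poisson-gamma type, which places positive probability on a response of exactly zero and is continuous on positive values. That is consistent with Total being a sum of costs over a random number of claims.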


Now suppose you want to compute predicted values for some different data. If $\mathbf{x}$ is a vector of explanatory variables that might not be in the original data and $\hat{\boldsymbol{\beta}}$ is the vector of estimated regression parameters from the model, then $\mu = g^{-1}(\mathbf{x}^\prime \hat{\boldsymbol{\beta}})$ is the predicted value of the mean, where $g$ is the log link function in this case, so $g^{-1}$ is the exponential function. The following data contain new values of the regression variables C1–C5, from which you can compute predicted values based on information in the SAS program that is created by the CODE statement. This is called scoring the new data set.

data ScoringData;
   input C1-C5;
   datalines; 
3 3 1 0 2 
1 1 2 2 0 
3 2 2 2 0 
1 1 2 3 2 
1 1 2 3 3 
3 1 1 0 1 
0 2 1 0 0 
2 1 3 1 3 
3 2 3 2 0 
3 0 2 0 1 
;
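
Because the model uses a log link and only classification effects, the prediction formula above reduces to exponentiating a sum of one estimate per variable. A sketch of that expansion follows, where $c_j$ (a symbol introduced here for illustration) denotes the level of variable C$j$ in the new observation and $\hat{\beta}_{j,c_j}$ denotes the matching estimate from Output 7.3.2:

$$\hat{\mu} = \exp\Bigl(\hat{\beta}_0 + \sum_{j=1}^{5} \hat{\beta}_{j,c_j}\Bigr)$$

The reference level of each variable (level 3 in this example) has an estimate of 0 and therefore contributes nothing to the sum.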

The following SAS DATA step creates the new data set Scores, which contains a variable P_Total that represents the predicted values of Total, along with the variables C1–C5. The resulting data are shown in Output 7.3.3.

data Scores;
   set ScoringData;
   /* Include the scoring code written by the CODE statement; it computes P_Total */
   %inc 'ScoringParameters.txt';
run;

proc print data=Scores;
run;

Output 7.3.3: Predicted Values for Scoring Data

Obs  C1  C2  C3  C4  C5  P_Total
  1   3   3   1   0   2   17.465
  2   1   1   2   2   0   11.737
  3   3   2   2   2   0   14.819
  4   1   1   2   3   2   21.683
  5   1   1   2   3   3   35.083
  6   3   1   1   0   1   33.237
  7   0   2   1   0   0    7.303
  8   2   1   3   1   3  171.711
  9   3   2   3   2   0   16.909
 10   3   0   2   0   1   47.110
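
As a hand check of two rows, the predicted values can be reproduced from the estimates in Output 7.3.2 by adding the intercept to the estimates for the observed levels of C1 through C5 (in that order) and then applying the inverse of the log link. The symbol $\eta$ for the linear predictor is introduced here for illustration, and the results are rounded to the displayed precision.

$$\begin{aligned}
\text{Obs 1:}\quad \eta_1 &= 3.888904 + 0 + 0 - 0.289425 - 0.258069 - 0.481185 = 2.860225, & \exp(\eta_1) &\approx 17.465 \\
\text{Obs 8:}\quad \eta_8 &= 3.888904 + 0.154711 + 1.159242 + 0 - 0.057042 + 0 = 5.145815, & \exp(\eta_8) &\approx 171.711
\end{aligned}$$

Observation 8 has by far the largest prediction because it combines levels with positive estimates for C1 and C2 with the reference levels of C3 and C5, so none of the large negative C5 estimates enters the sum.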