![]() | ![]() | ![]() |
For any model fit in PROC GENMOD (in SAS/STAT software) or PROC COUNTREG (in SAS/ETS software), including zero-inflated models, the PRED= option in the OUTPUT statement provides the estimated mean for each observation. However, the standard error and confidence limits are not available for the mean count or rate in zero-inflated models. Currently, neither the LSMEANS statement nor the ESTIMATE statement in PROC GENMOD provides mean count or rate estimates for zero-inflated models. To obtain estimated mean counts or rates, you can fit the zero-inflated model in PROC NLMIXED and use the ESTIMATE or PREDICT statement as illustrated below.
The following statements create a hypothetical insurance claims data set in which claims are classified by age group with two levels and car type with three levels.
data insure;
input c car age;
datalines;
10 1 1
37 2 1
100 3 1
0 1 2
0 2 2
80 3 2
;
The GENMOD step below fits a zero-inflated Poisson (ZIP) model to the insurance claims data. Note that the same model can be fit with PROC COUNTREG using very similar syntax or with PROC FMM as a finite mixture model with two components. Zero-inflated negative binomial (ZINB) models can be fit in GENMOD or COUNTREG using the DIST=ZINB option, or in FMM.NOTE1
The LSMEANS statement requests LS-mean estimates for each AGE level. The E option displays the linear combination of model parameters which produces the LS-mean estimates. The ILINK option applies the inverse link function (exponentiation for log-linked models) to the estimates to provide estimates on the mean count scale. The ESTIMATE statement specifies the same linear combination of model parameters as the LSMEANS statement for AGE=1 and also estimates the zero-inflation parameter. The coefficients following @ZERO in the ESTIMATE statement estimate the zero-inflation parameter for AGE=1, CAR=2. The P= option in the OUTPUT statement saves the predicted mean count for each observation to a data set. The PZERO= option saves the estimated zero-inflation parameter for each observation.
proc genmod data=insure;
class age;
model c = age car / dist=zip;
zeromodel car;
lsmeans age / e ilink;
estimate 'age=1, car=2 LSmean' intercept 1 age 1 0 car 2
@zero intercept 1 car 2;
output out=gout pred=p pzero=p0;
run;
Notice that the table of LS-means coefficients does not include any coefficients for the parameters in the zero-inflation portion of the model. The LSMEANS statement provides estimates based only on the Poisson portion of the model, not for the zero-inflated model. A Note indicating this is displayed in the SAS log. The ESTIMATE statement provides separate estimates for the Poisson and zero-inflation portions of the model. The estimate for the Poisson portion of the model (34.0898) matches the LS-mean estimate for AGE=1, CAR=2.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
These statements display the OUT= data set produced by the OUTPUT statement.
proc print data=gout noobs label;
run;
|
The values in variable P are the mean counts estimated by the ZIP model. Note the difference between the estimated mean from the zero-inflated model (24.0563) provided by the OUTPUT statement and the estimated mean using only the Poisson part of the model (34.0898) provided by the ESTIMATE and LSMEANS statements. Note also that the value for the zero-inflation parameter for AGE=1, CAR=2 (PZERO=0.2943) is identical to the value provided by the ESTIMATE statement.
By using PROC NLMIXED to fit the zero-inflated model, standard errors and confidence limits for mean counts can be obtained. The following statements fit the same ZIP model as in PROC GENMOD above. The zero-inflation part of the model, which is a logistic model, is defined in the first two assignment statements. INFPROB is the zero-inflation parameter, ω. The Poisson portion of the model is defined in the next assignment statement, and LAMBDA is the Poisson mean parameter, λ. The parameters to be estimated are inf0 and inf1 in the zero inflation model, and b0, b1, and b2 in the Poisson model. The next lines define the ZIP log likelihood. The PREDICT statement requests a data set containing a mean count estimate and confidence interval for each observation. As noted in "Zero-Inflated Models" in the Details section of the GENMOD documentation, the mean count for zero-inflated models is (1-ω)λ, where ω is the zero-inflation parameter and λ is the mean of the Poisson or negative binomial portion of the model.
proc nlmixed data=insure;
/* define zero inflation model */
zeromodel = inf0 + inf1*car;
infprob = 1/(1+exp(-zeromodel));
/* define poisson model */
lambda = exp(b0 + b1*(age=1) + b2*car);
/* define ZIP log likelihood */
if c=0 then ll = log(infprob + (1-infprob)*exp(-lambda));
else ll = log((1-infprob)) + c*log(lambda) - lgamma(c+1) - lambda;
model c ~ general(ll);
/* request predicted counts */
predict (1-infprob)*lambda out=nout;
run;
Note that the estimated model parameters match those from PROC GENMOD. Zero-inflated negative binomial models require different IF-ELSE statements to define the ZINB log likelihoodNOTE2.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
These statements display the OUT= data set produced by the PREDICT statement.
proc print data=nout noobs label;
run;
Note that the predicted mean counts (Predicted Value) match those from PROC GENMOD, and the data set also includes standard errors and confidence limits for the predicted counts.
|
Comparisons among the population mean counts can be done using the ESTIMATE statement. For example, suppose you want a test and confidence interval for the difference in mean counts between the AGE=1, CAR=2 population and the AGE=2, CAR=2 population. The difference in these means is (1-ω12)λ12 - (1-ω22)λ22. Adding these statements before the RUN statement in the above PROC NLMIXED step produces the estimate, test, and confidence interval. Note that the entire expression could be written in the ESTIMATE statement if desired, eliminating the preceding assignment statements, but that is not done here to clarify the process.
/* AGE=1, CAR=2 */
zeromodel12 = inf0 + inf1*2;
infprob12 = 1/(1+exp(-zeromodel12));
lambda12 = exp(b0 + b1 + b2*2);
/* AGE=2, CAR=2 */
zeromodel22 = inf0 + inf1*2;
infprob22 = 1/(1+exp(-zeromodel22));
lambda22 = exp(b0 + b2*2);
/* estimate and test difference in mean counts */
estimate "Mean Count Difference" (1-infprob12)*lambda12 - (1-infprob22)*lambda22;
Notice that the estimated difference (5.0876) is the difference of the two estimated means above (24.0563 - 18.9687). The means do not differ significantly (p=0.1778).
| |||||||||||||||||||||||||||
By including an offset in a Poisson or negative binomial model, you can model a rate rather than a count. This is explained in this note for models that are not zero-inflated. The same applies to zero-inflated models. The offset is the log of the rate denominator.
These statements create another hypothetical data set of insurance claims. In each observation, the number of claims (C) and the number of policyholders (N) is recorded for each car type and age group. Since the number of policyholders varies across populations, the rates (number of claims per policyholder, or per 100 policyholders) will be modeled rather than the number of claims. The log of the number of policyholders (LN) is computed for use as an offset in the ZIP model.
data insure;
input n c car$ age;
ln = log(n);
datalines;
500 0 small 1
1200 37 medium 1
100 0 large 1
400 101 small 2
500 73 medium 2
300 0 large 2
;
PROC GENMOD fits the ZIP model to the claims data and saves a data set of predicted values. In this example, the zero-inflation model contains only an intercept. Since the offset is not included in the computation of predicted values by the PRED= option in the OUTPUT statement, the predicted values are mean counts rather than rates. Again, this model can also be fit using the COUNTREG or FMM procedure.NOTE3
proc genmod data=insure;
model c = age / dist=zip offset=ln;
zeromodel ;
output out=gout pred=p;
run;
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The following DATA step computes the estimated rates by dividing the estimated counts by the number of policyholders.
data gout;
set gout;
prate=p/n;
run;
proc print noobs;
run;
|
By using PROC NLMIXED to fit the zero-inflated model you can use its PREDICT statement to estimate the rates and provide these additional statistics. The following statements fit the same ZIP model as above. Notice the inclusion of LN without a coefficient in the Poisson part of the model since an offset has a coefficient fixed at 1. In the PREDICT statement, the ZIP mean is divided by the rate denominator as was done in the DATA step above. The predicted rates and confidence limits are saved in data set NOUT.
proc nlmixed data=insure;
/* define zero inflation model */
zeromodel = a0;
infprob = 1/(1+exp(-zeromodel));
/* define poisson model */
lambda = exp(b0 + ln + b1*age);
/* define ZIP log likelihood */
if c=0 then ll = log(infprob + (1-infprob)*exp(-lambda));
else ll = log((1-infprob)) + c*log(lambda) - lgamma(c+1) - lambda;
model c ~ general(ll);
/* request predicted rates */
predict (1-infprob)*lambda/n out=nout;
run;
The fitted model is identical to the model as fit in PROC GENMOD above.
| ||||||||||||||||||||||||||||||||||||||||||||||||||
The predicted rates with tests and confidence intervals as computed by the PREDICT statement are displayed below.
proc print data=nout noobs label;
run;
The estimated rate for AGE=1 is 0.015592, or about 1.6 claims per 100 policyholders. For AGE=2, the rate is 0.098136, or about 9.8 claims per 100 policyholders.
|
A similar approach as used for comparing mean counts can be used for comparing rates. However, since the expression in the ESTIMATE statement cannot involve variables in the input data set, the PREDICT statement is used instead. Its results are written to data set RATEDIFF. Adding the following statements in the NLMIXED step compares the rates of the two AGE levels.
/* AGE=1 */
zeromodel1 = a0;
infprob1 = 1/(1+exp(-zeromodel1));
lambda1 = exp(b0 + ln + b1*1);
/* AGE=2 */
zeromodel2 = a0;
infprob2 = 1/(1+exp(-zeromodel2));
lambda2 = exp(b0 + ln + b1*2);
/* estimate and test difference in rates */
predict (1-infprob1)*lambda1/n - (1-infprob2)*lambda2/n out=ratediff;
These statements display the estimated rate difference including a test and confidence interval. Only one observation is displayed since the results are the same in every observation of the RATEDIFF data set.
proc print data=ratediff(obs=1) noobs label;
var pred -- upper;
run;
The difference in rates is as expected from the rates above (0.015592 - 0.098136 = -0.082544) and this difference is not quite significant at the 0.05 level (p=0.0546).
|
_______
NOTE1: The following statements fit the ZIP model for counts in the COUNTREG and FMM procedures. The OUTPUT statements also save the predicted values and the inflation probabilities from the ZIP model.
proc countreg data=insure;
class age;
model c = age car / dist=zip;
zeromodel c ~ car;
output out=cout pred=p probzero=p0;
run;
proc fmm data=insure;
class age;
model c = / dist=constant;
model + age car / dist=poisson;
probmodel car;
output out=fout pred=p mixprob=p0;
run;
NOTE2: These statements can be used to define the zero-inflated negative binomial (ZINB) log likelihood in PROC NLMIXED:
/* define ZINB log likelihood */
if c=0 then ll = log( infprob + (1-infprob)/(1+k*lambda)**(1/k) );
else ll = log((1-infprob)) + c*log(k*lambda) - (c+(1/k))*log(1+k*lambda) +
lgamma(c+1/k) - lgamma(1/k) - lgamma(c+1);
NOTE3: These statements fit the ZIP model for rates in the COUNTREG and FMM procedures.
proc countreg data=insure;
model c = age / dist=zip offset=ln;
zeromodel c ~;
output out=cout pred=p;
run;
proc fmm data=insure;
model c= / dist=constant;
model + age / dist=poisson offset=ln;
output out=fout pred=p;
run;
| Product Family | Product | System | SAS Release | |
| Reported | Fixed* | |||
| SAS System | SAS/STAT | z/OS | ||
| OpenVMS VAX | ||||
| Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
| Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
| Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
| Microsoft Windows XP 64-bit Edition | ||||
| Microsoft® Windows® for x64 | ||||
| OS/2 | ||||
| Microsoft Windows 95/98 | ||||
| Microsoft Windows 2000 Advanced Server | ||||
| Microsoft Windows 2000 Datacenter Server | ||||
| Microsoft Windows 2000 Server | ||||
| Microsoft Windows 2000 Professional | ||||
| Microsoft Windows NT Workstation | ||||
| Microsoft Windows Server 2003 Datacenter Edition | ||||
| Microsoft Windows Server 2003 Enterprise Edition | ||||
| Microsoft Windows Server 2003 Standard Edition | ||||
| Microsoft Windows Server 2003 for x64 | ||||
| Microsoft Windows Server 2008 | ||||
| Microsoft Windows Server 2008 for x64 | ||||
| Microsoft Windows XP Professional | ||||
| Windows 7 Enterprise 32 bit | ||||
| Windows 7 Enterprise x64 | ||||
| Windows 7 Home Premium 32 bit | ||||
| Windows 7 Home Premium x64 | ||||
| Windows 7 Professional 32 bit | ||||
| Windows 7 Professional x64 | ||||
| Windows 7 Ultimate 32 bit | ||||
| Windows 7 Ultimate x64 | ||||
| Windows Millennium Edition (Me) | ||||
| Windows Vista | ||||
| Windows Vista for x64 | ||||
| 64-bit Enabled AIX | ||||
| 64-bit Enabled HP-UX | ||||
| 64-bit Enabled Solaris | ||||
| ABI+ for Intel Architecture | ||||
| AIX | ||||
| HP-UX | ||||
| HP-UX IPF | ||||
| IRIX | ||||
| Linux | ||||
| Linux for x64 | ||||
| Linux on Itanium | ||||
| OpenVMS Alpha | ||||
| OpenVMS on HP Integrity | ||||
| Solaris | ||||
| Solaris for x64 | ||||
| Tru64 UNIX | ||||
| Type: | Usage Note |
| Priority: | |
| Topic: | Analytics ==> Regression SAS Reference ==> Procedures ==> GENMOD SAS Reference ==> Procedures ==> COUNTREG SAS Reference ==> Procedures ==> FMM SAS Reference ==> Procedures ==> NLMIXED |
| Date Modified: | 2011-10-05 16:27:27 |
| Date Created: | 2011-09-20 16:27:37 |



