For any model fit in PROC GENMOD (in SAS/STAT® software) or PROC COUNTREG (in SAS/ETS® software), including zero-inflated models, the PRED= option in the OUTPUT statement provides the estimated mean for each observation. However, the standard error and confidence limits are not available for the mean count or rate in zero-inflated models. Currently, neither the LSMEANS statement nor the ESTIMATE statement in PROC GENMOD provides mean count or rate estimates for zero-inflated models. To obtain estimated mean counts or rates along with confidence limits, you can use the NLEstimate macro or fit the model in PROC NLMIXED and use its ESTIMATE or PREDICT statement. An easier approach, available beginning in SAS 9.4 TS1M1, is to use the SCORE statement in PROC PLM. With the NLEstimate macro or PROC NLMIXED you can also make comparisons of mean counts or rates. The use of the macro and both procedures is illustrated below.
The following statements create a hypothetical insurance claims data set in which claims are classified by age group with two levels and car type with three levels.
data insure; input c car age; datalines; 10 1 1 37 2 1 100 3 1 0 1 2 0 2 2 80 3 2 ;
The GENMOD step below fits a zero-inflated Poisson (ZIP) model to the insurance claims data. Note that the same model can be fit with PROC COUNTREG using very similar syntax or with PROC FMM as a finite mixture model with two components. Zero-inflated negative binomial (ZINB) models can be fit in GENMOD or COUNTREG using the DIST=ZINB option, or in FMM.NOTE1
The LSMEANS statement requests LS-mean estimates for each AGE level. The E option displays the linear combination of model parameters which produces the LS-mean estimates. The ILINK option applies the inverse link function (exponentiation for log-linked models) to the estimates to provide estimates on the mean count scale. The ESTIMATE statement specifies the same linear combination of model parameters as the LSMEANS statement for AGE=1 and also estimates the zero-inflation parameter. The coefficients following @ZERO in the ESTIMATE statement estimate the zero-inflation parameter for AGE=1, CAR=2. The P= option in the OUTPUT statement saves the predicted mean count for each observation to a data set. The PZERO= option saves the estimated zero-inflation parameter for each observation. The STORE statement saves the fitted model for later use with PROC PLM. The COVB option and ODS OUTPUT statement create and save model information for later use by the NLEstimate macro.
proc genmod data=insure; class age; model c = age car / dist=zip covb; zeromodel car; lsmeans age / e ilink; estimate 'age=1, car=2 LSmean' intercept 1 age 1 0 car 2 @zero intercept 1 car 2; output out=gout pred=p pzero=p0; store out=zip1; ods output parameterestimates=mpe zeroparameterestimates=zpe covb=covb; run;
Notice that the table of LS-means coefficients does not include any coefficients for the parameters in the zero-inflation portion of the model. The LSMEANS statement provides estimates based only on the Poisson portion of the model, not for the zero-inflated model. A Note indicating this is displayed in the SAS log. The ESTIMATE statement provides separate estimates for the Poisson and zero-inflation portions of the model. The estimate for the Poisson portion of the model (34.0898) matches the LS-mean estimate for AGE=1, CAR=2.
|
These statements display the OUT= data set produced by the OUTPUT statement.
proc print data=gout noobs label; run;
The values in variable P (labeled "Predicted Value") are the mean counts estimated by the ZIP model. Note the difference between the estimated mean from the zero-inflated model (24.0563) provided by the OUTPUT statement and the estimated mean using only the Poisson part of the model (34.0898) provided by the ESTIMATE and LSMEANS statements. Also notice that the value for the zero-inflation parameter (labeled "Zero Inflation Probability") for AGE=1, CAR=2 is identical to the value provided by the ESTIMATE statement (0.2943).
|
Using PROC PLM
Beginning in SAS 9.4 TS1M1, PROC PLM can provide the mean counts along with standard errors and confidence limits. The SOURCE= option reads the saved ZIP model. The SCORE statement requests predicted means, standard errors, and confidence limits and stores them in data set PLMOUT. The ILINK option requests that these values be on the mean (count) scale rather than the linked (log count) scale.
proc plm source=zip1; score data=insure out=plmout pred stderr lclm uclm / ilink; run; proc print noobs label; run;
Note that the predicted counts match those from GENMOD.
|
Using the NLEstimate macro
As shown in "Zero-Inflated Models" in the Details section of the GENMOD documentation, the mean count for zero-inflated models is (1-ω)λ, where ω is the zero-inflation parameter and λ is the mean of the Poisson or negative binomial portion of the model. Note that a binary response model (by default, a logistic model) is used to estimate ω. The model for estimating λ is typically a log-linked model. Since the resulting mean estimator is a nonlinear function of the model parameters, it cannot be estimated by the ESTIMATE statement in PROC GENMOD, but it can be estimated by the NLEstimate macro or in PROC NLMIXED.
The NLEstimate macro uses the fitted model information (parameter estimates and covariance matrix) saved by the ODS OUTPUT statement in PROC GENMOD. It then uses PROC NLMIXED to estimate the function and the delta method to obtain confidence limits. When estimating several functions, it is useful to specify them in a data set with appropriate labels. This is done in the following statements and the resulting data set is specified in the fdata= NLEstimate macro parameter. You write the functions to be estimated using the parameter names. See the description of the NLEstimate macro for details about displaying parameter names and using the macro.
Since PROC GENMOD produces separate tables for the two parts of the zero-inflated model, the two tables are saved to separate data sets by the ODS OUTPUT statement above. The following statements combine them into a single data set for use in the NLEstimate macro. The IF statement is needed to remove rows for parameters with zero degrees of freedom since such parameters are not included in the covariance matrix. Removing the rows is needed to make the resulting vector of model parameter estimates compatible with the covariance matrix.
data pe; set mpe zpe; if df ne 0; run;
These statements produce a data set of the nonlinear functions to estimate using the formula for the mean shown above. The NLEstimate macro is then called to perform the estimation for each population in the data.
data fd; length label f $32767; infile datalines delimiter=','; input label f; datalines; car=1 age=1,(1-logistic(b_p4+b_p5))*exp(b_p1+b_p2+b_p3) car=2 age=1,(1-logistic(b_p4+2*b_p5))*exp(b_p1+b_p2+2*b_p3) car=3 age=1,(1-logistic(b_p4+3*b_p5))*exp(b_p1+b_p2+3*b_p3) car=1 age=2,(1-logistic(b_p4+b_p5))*exp(b_p1+b_p3) car=2 age=2,(1-logistic(b_p4+2*b_p5))*exp(b_p1+2*b_p3) car=3 age=2,(1-logistic(b_p4+3*b_p5))*exp(b_p1+3*b_p3) ; %NLEstimate(inest=pe, incovb=covb, fdata=fd, df=6)
Note that the predicted mean counts (Predicted Value) match those from PROC GENMOD. The data set also includes standard errors and confidence limits for the predicted counts. These differ from the PLM results because of the use of different methods. Note that the confidence intervals provided by the NLEstimate macro are symmetric around the predicted means while the intervals from PROC PLM are not. See "Scoring Data Sets for Zero-Inflated Models" in the Details section of the PLM documentation for formulas showing the method used by PROC PLM. The NLEstimate macro uses the delta method to estimate the standard error and forms a confidence interval using a quantile from the t distribution.
|
Using PROC NLMIXED
Mean estimates, standard errors and confidence limits can also be obtained by fitting the zero-inflated model directly in PROC NLMIXED. The following statements fit the same ZIP model as in PROC GENMOD above. The zero-inflation part of the model, which is a logistic model, is defined in the first two assignment statements. INFPROB is the zero-inflation parameter, ω. The Poisson portion of the model is defined in the next assignment statement, and LAMBDA is the Poisson mean parameter, λ. The parameters to be estimated are inf0 and inf1 in the zero inflation model, and b0, b1, and b2 in the Poisson model. The next lines define the ZIP log likelihood. The PREDICT statement requests a data set containing a mean count estimate and confidence interval for each observation. Zero-inflated negative binomial models require different IF-ELSE statements to define the ZINB log likelihood.NOTE2
proc nlmixed data=insure; /* define zero inflation model */ zeromodel = inf0 + inf1*car; infprob = logistic(zeromodel); /* define Poisson model */ lambda = exp(b0 + b1*(age=1) + b2*car); /* define ZIP log likelihood */ if c=0 then ll = log(infprob + (1-infprob)*exp(-lambda)); else ll = log((1-infprob)) + c*log(lambda) - lgamma(c+1) - lambda; model c ~ general(ll); /* request predicted counts */ predict (1-infprob)*lambda out=nout; run;
The results in the OUT= data set are the same as those from the NLEstimate macro.
Comparing mean counts among populations
Comparisons among the population mean counts can be done using the ESTIMATE statement. For example, suppose you want a test and confidence interval for the difference in mean counts between the AGE=1, CAR=2 population and the AGE=2, CAR=2 population. The difference in these means is (1-ω12)λ12 - (1-ω22)λ22.
This estimate can be produced in the NLEstimate macro by adding this function to the fdata= data set.
Mean Count Difference,(1-logistic(b_p4+2*b_p5))*exp(b_p1+b_p2+2*b_p3)-(1-logistic(b_p4+2*b_p5))*exp(b_p1+2*b_p3)
Equivalently, in the PROC NLMIXED step above, add these statements before the RUN statement to produce the estimate, test, and confidence interval. Note that the entire expression could be written in the ESTIMATE statement if desired, eliminating the preceding assignment statements, but that is not done here to clarify the process.
/* AGE=1, CAR=2 */ zeromodel12 = inf0 + inf1*2; infprob12 = logistic(zeromodel12); lambda12 = exp(b0 + b1 + b2*2); /* AGE=2, CAR=2 */ zeromodel22 = inf0 + inf1*2; infprob22 = logistic(zeromodel22); lambda22 = exp(b0 + b2*2); /* estimate and test difference in mean counts */ estimate "Mean Count Difference" (1-infprob12)*lambda12 - (1-infprob22)*lambda22;
Notice that the estimated difference (5.0876) is the difference of the two estimated means above (24.0563 - 18.9687). The means do not differ significantly (p=0.1778).
|
By including an offset in a Poisson or negative binomial model, you can model a rate rather than a count. This is explained in this note for models that are not zero-inflated. The same applies to zero-inflated models. The offset is the log of the rate denominator.
These statements create another hypothetical data set of insurance claims. In each observation, the number of claims (C) and the number of policyholders (N) is recorded for each car type and age group. Since the number of policyholders varies across populations, the rates (number of claims per policyholder, or per 100 policyholders) will be modeled rather than the number of claims. The log of the number of policyholders (LN) is computed for use as an offset in the ZIP model.
data insure; input n c car$ age; ln = log(n); datalines; 500 0 small 1 1200 37 medium 1 100 0 large 1 400 101 small 2 500 73 medium 2 300 0 large 2 ;
PROC GENMOD fits a ZIP model to the claims data and saves a data set of predicted values. In this example, the mean model is a function only of age and the zero-inflation model contains only an intercept. Since the offset is included in the computation of predicted values by the PRED= option in the OUTPUT statement, the predicted values are mean counts rather than rates. Again, this model can also be fit using the COUNTREG or FMM procedure.NOTE3 The STORE statement saves the fitted model. The COVB option and ODS OUTPUT statement save the model information for use in the NLEstimate macro later.
proc genmod data=insure; model c = age / dist=zip offset=ln covb; zeromodel ; output out=gout pred=p; store out=zip2; ods output parameterestimates=mpe zeroparameterestimates=zpe covb=covb; run;
These are the parameter estimates of the fitted zero-inflated Poisson model.
|
The following DATA step computes the estimated rates by dividing the estimated counts by the number of policyholders.
data gout; set gout; prate=p/n; run; proc print noobs; run;
|
Using PROC PLM
These statements use PROC PLM to produce the predicted rates as well as standard errors and confidence limits for the rates. The addition of the NOOFFSET option excludes the offset from the computations so that the predicted rates are produced rather than predicted counts.
proc plm source=zip2; score data=insure out=plm2 pred=p stderr=s lclm=l uclm=u / ilink nooffset; run; proc print noobs label; run;
The predicted rates from PLM match those above and include standard errors and confidence limits.
|
Using the NLEstimate macro
As described in earlier, the separate tables of parameter estimates must be combined and rows for parameters with zero degrees of freedom must be removed so the parameter vector and covariance matrix are compatible.
data pe; set mpe zpe; if df ne 0; run;
Predicted rates for the two ages are then obtained from the NLEstimate macro.
data fd; length label f $32767; infile datalines delimiter=','; input label f; datalines; age=1,(1-logistic(b_p3))*exp(b_p1+b_p2) age=2,(1-logistic(b_p3))*exp(b_p1+2*b_p2) ; %NLEstimate(inest=pe, incovb=covb, fdata=fd, df=6)
The estimated rate for AGE=1 is 0.01559, or about 1.6 claims per 100 policyholders. For AGE=2, the rate is 0.09814, or about 9.8 claims per 100 policyholders. Differences with the PLM results are for reasons described above.
|
Using PROC NLMIXED
By using PROC NLMIXED to fit the zero-inflated model you can use its PREDICT statement to estimate the rates and provide these additional statistics. The following statements fit the same ZIP model as above. Notice the inclusion of LN without a coefficient in the Poisson part of the model since an offset has a coefficient fixed at 1. In the PREDICT statement, the ZIP mean is divided by the rate denominator as was done in the DATA step above. The predicted rates and confidence limits are saved in data set NLMOUT.
proc nlmixed data=insure; /* define zero inflation model */ zeromodel = a0; infprob = logistic(zeromodel); /* define Poisson model */ lambda = exp(b0 + ln + b1*age); /* define ZIP log likelihood */ if c=0 then ll = log(infprob + (1-infprob)*exp(-lambda)); else ll = log((1-infprob)) + c*log(lambda) - lgamma(c+1) - lambda; model c ~ general(ll); /* request predicted rates */ predict (1-infprob)*lambda/n out=nlmout; run;
The results in the OUT= data set match those from the NLEstimate macro.
Comparing rates among populations
A similar approach as used for comparing mean counts can be used for comparing rates. The rate difference is (1-ω)λ1/n1 - (1-ω)λ2/n2 = (1-ω)exp(b0+b1+log(n1))/n1 - (1-ω)exp(b0+2b1+log(n2))/n2 = (1-ω)exp(b0+b1) - (1-ω)exp(b0+2b1).
The estimate of this difference is produced by the NLEstimate macro by adding this function to the fdata= data set.
Mean Rate Difference,(1-logistic(b_p3))*exp(b_p1+b_p2)-(1-logistic(b_p3))*exp(b_p1+2*b_p2)
Comparison of the AGE level rates can also be done in PROC NLMIXED by adding the following statement in the NLMIXED step above.
estimate 'diff' (1-infprob)*exp(b0 + b1*1) - (1-infprob)*exp(b0 + b1*2);
The difference in rates is as expected from the rates above (0.015592 - 0.098136 = -0.082544) and this difference is not quite significant at the 0.05 level (p=0.0546). The results from the NLEstimate macro and PROC NLMIXED are the same.
|
_______
NOTE1: The following statements fit the ZIP model for counts in the COUNTREG and FMM procedures. The OUTPUT statements also save the predicted values and the inflation probabilities from the ZIP model.
proc countreg data=insure; class age; model c = age car / dist=zip; zeromodel c ~ car; output out=cout pred=p probzero=p0; run; proc fmm data=insure; class age; model c = / dist=constant; model + age car / dist=poisson; probmodel car; output out=fout pred=p mixprob=p0; run;
NOTE2: These statements can be used to define the zero-inflated negative binomial (ZINB) log likelihood in PROC NLMIXED:
/* define ZINB log likelihood */ if c=0 then ll = log( infprob + (1-infprob)/(1+k*lambda)**(1/k) ); else ll = log((1-infprob)) + c*log(k*lambda) - (c+(1/k))*log(1+k*lambda) + lgamma(c+1/k) - lgamma(1/k) - lgamma(c+1);
NOTE3: These statements fit the ZIP model for rates in the COUNTREG and FMM procedures.
proc countreg data=insure; model c = age / dist=zip offset=ln; zeromodel c ~; output out=cout pred=p; run; proc fmm data=insure; model c= / dist=constant; model + age / dist=poisson offset=ln; output out=fout pred=p; run;
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | SAS/STAT | z/OS | ||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows Server 2003 for x64 | ||||
Microsoft Windows Server 2008 | ||||
Microsoft Windows Server 2008 for x64 | ||||
Microsoft Windows XP Professional | ||||
Windows 7 Enterprise 32 bit | ||||
Windows 7 Enterprise x64 | ||||
Windows 7 Home Premium 32 bit | ||||
Windows 7 Home Premium x64 | ||||
Windows 7 Professional 32 bit | ||||
Windows 7 Professional x64 | ||||
Windows 7 Ultimate 32 bit | ||||
Windows 7 Ultimate x64 | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
Windows Vista for x64 | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: | |
Topic: | Analytics ==> Regression SAS Reference ==> Procedures ==> GENMOD SAS Reference ==> Procedures ==> COUNTREG SAS Reference ==> Procedures ==> FMM SAS Reference ==> Procedures ==> NLMIXED SAS Reference ==> Procedures ==> PLM SAS Reference ==> Macro |
Date Modified: | 2011-10-05 16:27:27 |
Date Created: | 2011-09-20 16:27:37 |