SUPPORT / SAMPLES & SAS NOTES
 

Support

Usage Note 44354: Estimating and comparing counts and rates (with confidence intervals) in zero-inflated models

DetailsAboutRate It

For any model fit in PROC GENMOD (in SAS/STAT® software) or PROC COUNTREG (in SAS/ETS® software), including zero-inflated models, the PRED= option in the OUTPUT statement provides the estimated mean for each observation. However, the standard error and confidence limits are not available for the mean count or rate in zero-inflated models. Currently, neither the LSMEANS statement nor the ESTIMATE statement in PROC GENMOD provides mean count or rate estimates for zero-inflated models. To obtain estimated mean counts or rates along with confidence limits, you can use the NLEstimate macro or fit the model in PROC NLMIXED and use its ESTIMATE or PREDICT statement. An easier approach, available beginning in SAS 9.4 TS1M1, is to use the SCORE statement in PROC PLM. With the NLEstimate macro or PROC NLMIXED you can also make comparisons of mean counts or rates. The use of the macro and both procedures is illustrated below.

 

Estimating and comparing mean counts (with confidence intervals) in zero-inflated models

The following statements create a hypothetical insurance claims data set in which claims are classified by age group with two levels and car type with three levels.

      data insure;
        input c car age;
        datalines;
      10  1  1
      37  2  1
      100 3  1
      0   1  2
      0   2  2
      80  3  2
      ;

The GENMOD step below fits a zero-inflated Poisson (ZIP) model to the insurance claims data. Note that the same model can be fit with PROC COUNTREG using very similar syntax or with PROC FMM as a finite mixture model with two components. Zero-inflated negative binomial (ZINB) models can be fit in GENMOD or COUNTREG using the DIST=ZINB option, or in FMM.NOTE1

The LSMEANS statement requests LS-mean estimates for each AGE level. The E option displays the linear combination of model parameters which produces the LS-mean estimates. The ILINK option applies the inverse link function (exponentiation for log-linked models) to the estimates to provide estimates on the mean count scale. The ESTIMATE statement specifies the same linear combination of model parameters as the LSMEANS statement for AGE=1 and also estimates the zero-inflation parameter. The coefficients following @ZERO in the ESTIMATE statement estimate the zero-inflation parameter for AGE=1, CAR=2. The P= option in the OUTPUT statement saves the predicted mean count for each observation to a data set. The PZERO= option saves the estimated zero-inflation parameter for each observation. The STORE statement saves the fitted model for later use with PROC PLM. The COVB option and ODS OUTPUT statement create and save model information for later use by the NLEstimate macro.

      proc genmod data=insure;
        class age;
        model c = age car / dist=zip covb;
        zeromodel car;
        lsmeans age / e ilink;
        estimate 'age=1, car=2 LSmean' intercept 1 age 1 0 car 2 
                                  @zero intercept 1 car 2;
        output out=gout pred=p pzero=p0;
        store out=zip1;
        ods output parameterestimates=mpe zeroparameterestimates=zpe covb=covb;
        run;

Notice that the table of LS-means coefficients does not include any coefficients for the parameters in the zero-inflation portion of the model. The LSMEANS statement provides estimates based only on the Poisson portion of the model, not for the zero-inflated model. A Note indicating this is displayed in the SAS log. The ESTIMATE statement provides separate estimates for the Poisson and zero-inflation portions of the model. The estimate for the Poisson portion of the model (34.0898) matches the LS-mean estimate for AGE=1, CAR=2.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter   DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept   1 1.1101 0.4102 0.3062 1.9141 7.32 0.0068
age 1 1 0.2376 0.1480 -0.0525 0.5277 2.58 0.1084
age 2 0 0.0000 0.0000 0.0000 0.0000 . .
car   1 1.0906 0.1316 0.8328 1.3485 68.73 <.0001
Scale   0 1.0000 0.0000 1.0000 1.0000    
 
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 1.7087 2.4584 -3.1096 6.5271 0.48 0.4870
car 1 -1.2916 1.3096 -3.8585 1.2752 0.97 0.3240
 
Contrast Estimate Results
Label Mean Estimate Mean L'Beta Estimate Standard Error Alpha L'Beta Chi-Square Pr > ChiSq
Confidence Limits Confidence Limits
age=1, car=2 LSmean 34.0898 27.1954 42.7320 3.5290 0.1153 0.05 3.3030 3.7549 937.09 <.0001
age=1, car=2 LSmean (Zero Inflation) 0.2943 0.0528 0.7574 -0.8745 1.0270 0.05 -2.8873 1.1384 0.73 0.3945
 
Coefficients for age Least
Squares Means
Parameter age Row1 Row2
Intercept   1 1
age 1 1 1  
age 2 2   1
car   2 2
 
age Least Squares Means
age Estimate Standard Error z Value Pr > |z| Mean Standard Error
of Mean
1 3.5290 0.1153 30.61 <.0001 34.0898 3.9299
2 3.2914 0.1727 19.06 <.0001 26.8803 4.6412

These statements display the OUT= data set produced by the OUTPUT statement.

      proc print data=gout noobs label; 
        run;

The values in variable P (labeled "Predicted Value") are the mean counts estimated by the ZIP model. Note the difference between the estimated mean from the zero-inflated model (24.0563) provided by the OUTPUT statement and the estimated mean using only the Poisson part of the model (34.0898) provided by the ESTIMATE and LSMEANS statements. Also notice that the value for the zero-inflation parameter (labeled "Zero Inflation Probability") for AGE=1, CAR=2 is identical to the value provided by the ESTIMATE statement (0.2943).

c car age Predicted Value Zero Inflation
Probability
10 1 1 4.5497 0.60280
37 2 1 24.0563 0.29432
100 3 1 91.0223 0.10284
0 1 2 3.5875 0.60280
0 2 2 18.9687 0.29432
80 3 2 71.7723 0.10284

Using PROC PLM

Beginning in SAS 9.4 TS1M1, PROC PLM can provide the mean counts along with standard errors and confidence limits. The SOURCE= option reads the saved ZIP model. The SCORE statement requests predicted means, standard errors, and confidence limits and stores them in data set PLMOUT. The ILINK option requests that these values be on the mean (count) scale rather than the linked (log count) scale.

      proc plm source=zip1; 
        score data=insure out=plmout pred stderr lclm uclm / ilink;
        run;
      proc print noobs label;
        run;

Note that the predicted counts match those from GENMOD.

c car age Predicted Value Standard Error Lower 95% Confidence
Limit
Upper 95% Confidence
Limit
10 1 1 4.5497 3.9471 0.8308 24.914
37 2 1 24.0563 7.8271 12.7139 45.518
100 3 1 91.0223 20.1406 58.9932 140.441
0 1 2 3.5875 3.2138 0.6198 20.765
0 2 2 18.9687 6.6765 9.5156 37.813
80 3 2 71.7723 16.3951 45.8686 112.305

Using the NLEstimate macro

As shown in "Zero-Inflated Models" in the Details section of the GENMOD documentation, the mean count for zero-inflated models is (1-ω)λ, where ω is the zero-inflation parameter and λ is the mean of the Poisson or negative binomial portion of the model. Note that a binary response model (by default, a logistic model) is used to estimate ω. The model for estimating λ is typically a log-linked model. Since the resulting mean estimator is a nonlinear function of the model parameters, it cannot be estimated by the ESTIMATE statement in PROC GENMOD, but it can be estimated by the NLEstimate macro or in PROC NLMIXED.

The NLEstimate macro uses the fitted model information (parameter estimates and covariance matrix) saved by the ODS OUTPUT statement in PROC GENMOD. It then uses PROC NLMIXED to estimate the function and the delta method to obtain confidence limits. When estimating several functions, it is useful to specify them in a data set with appropriate labels. This is done in the following statements and the resulting data set is specified in the fdata= NLEstimate macro parameter. You write the functions to be estimated using the parameter names. See the description of the NLEstimate macro for details about displaying parameter names and using the macro.

Since PROC GENMOD produces separate tables for the two parts of the zero-inflated model, the two tables are saved to separate data sets by the ODS OUTPUT statement above. The following statements combine them into a single data set for use in the NLEstimate macro. The IF statement is needed to remove rows for parameters with zero degrees of freedom since such parameters are not included in the covariance matrix. Removing the rows is needed to make the resulting vector of model parameter estimates compatible with the covariance matrix.

      data pe;
        set mpe zpe;
        if df ne 0;
        run;

These statements produce a data set of the nonlinear functions to estimate using the formula for the mean shown above. The NLEstimate macro is then called to perform the estimation for each population in the data.

      data fd; 
        length label f $32767; 
        infile datalines delimiter=',';
        input label f; 
        datalines;
      car=1 age=1,(1-logistic(b_p4+b_p5))*exp(b_p1+b_p2+b_p3)
      car=2 age=1,(1-logistic(b_p4+2*b_p5))*exp(b_p1+b_p2+2*b_p3)
      car=3 age=1,(1-logistic(b_p4+3*b_p5))*exp(b_p1+b_p2+3*b_p3)
      car=1 age=2,(1-logistic(b_p4+b_p5))*exp(b_p1+b_p3)
      car=2 age=2,(1-logistic(b_p4+2*b_p5))*exp(b_p1+2*b_p3)
      car=3 age=2,(1-logistic(b_p4+3*b_p5))*exp(b_p1+3*b_p3)
      ;
      %NLEstimate(inest=pe, incovb=covb, fdata=fd, df=6)

Note that the predicted mean counts (Predicted Value) match those from PROC GENMOD. The data set also includes standard errors and confidence limits for the predicted counts. These differ from the PLM results because of the use of different methods. Note that the confidence intervals provided by the NLEstimate macro are symmetric around the predicted means while the intervals from PROC PLM are not. See "Scoring Data Sets for Zero-Inflated Models" in the Details section of the PLM documentation for formulas showing the method used by PROC PLM. The NLEstimate macro uses the delta method to estimate the standard error and forms a confidence interval using a quantile from the t distribution.

c car age Predicted Value Standard Error
of Prediction
Degrees
of Freedom
t Value Pr > |t| Alpha Lower Confidence
Limit
Upper Confidence
Limit
10 1 1 4.5497 3.8556 6 1.18004 0.28264 0.05 -4.8845 13.984
37 2 1 24.0564 7.7820 6 3.09127 0.02135 0.05 5.0144 43.098
100 3 1 91.0223 20.0644 6 4.53651 0.00395 0.05 41.9265 140.118
0 1 2 3.5875 3.1028 6 1.15621 0.29155 0.05 -4.0048 11.180
0 2 2 18.9688 6.6027 6 2.87287 0.02832 0.05 2.8125 35.125
80 3 2 71.7724 16.3178 6 4.39841 0.00458 0.05 31.8442 111.701

Using PROC NLMIXED

Mean estimates, standard errors and confidence limits can also be obtained by fitting the zero-inflated model directly in PROC NLMIXED. The following statements fit the same ZIP model as in PROC GENMOD above. The zero-inflation part of the model, which is a logistic model, is defined in the first two assignment statements. INFPROB is the zero-inflation parameter, ω. The Poisson portion of the model is defined in the next assignment statement, and LAMBDA is the Poisson mean parameter, λ. The parameters to be estimated are inf0 and inf1 in the zero inflation model, and b0, b1, and b2 in the Poisson model. The next lines define the ZIP log likelihood. The PREDICT statement requests a data set containing a mean count estimate and confidence interval for each observation. Zero-inflated negative binomial models require different IF-ELSE statements to define the ZINB log likelihood.NOTE2

      proc nlmixed data=insure;
        /* define zero inflation model */
        zeromodel = inf0 + inf1*car;
        infprob  = logistic(zeromodel);
        
        /* define Poisson model */
        lambda   = exp(b0 + b1*(age=1) + b2*car);
        
        /* define ZIP log likelihood */
        if c=0 then ll = log(infprob + (1-infprob)*exp(-lambda));
        else ll = log((1-infprob)) + c*log(lambda) - lgamma(c+1) - lambda;
        model c ~ general(ll);
        
        /* request predicted counts */
        predict (1-infprob)*lambda out=nout;
        run;

The results in the OUT= data set are the same as those from the NLEstimate macro.

Comparing mean counts among populations

Comparisons among the population mean counts can be done using the ESTIMATE statement. For example, suppose you want a test and confidence interval for the difference in mean counts between the AGE=1, CAR=2 population and the AGE=2, CAR=2 population. The difference in these means is (1-ω1212 - (1-ω2222.

This estimate can be produced in the NLEstimate macro by adding this function to the fdata= data set.

      Mean Count Difference,(1-logistic(b_p4+2*b_p5))*exp(b_p1+b_p2+2*b_p3)-(1-logistic(b_p4+2*b_p5))*exp(b_p1+2*b_p3)

Equivalently, in the PROC NLMIXED step above, add these statements before the RUN statement to produce the estimate, test, and confidence interval. Note that the entire expression could be written in the ESTIMATE statement if desired, eliminating the preceding assignment statements, but that is not done here to clarify the process.

        /* AGE=1, CAR=2 */
        zeromodel12 = inf0 + inf1*2;
        infprob12   = logistic(zeromodel12);
        lambda12    = exp(b0 + b1 + b2*2);

        /* AGE=2, CAR=2 */
        zeromodel22 = inf0 + inf1*2;
        infprob22   = logistic(zeromodel22);
        lambda22    = exp(b0 + b2*2);

        /* estimate and test difference in mean counts */
        estimate "Mean Count Difference" (1-infprob12)*lambda12 - (1-infprob22)*lambda22;

Notice that the estimated difference (5.0876) is the difference of the two estimated means above (24.0563 - 18.9687). The means do not differ significantly (p=0.1778).

Additional Estimates
Label Estimate Standard Error DF t Value Pr > |t| Alpha Lower Upper
Mean Count Difference 5.0876 3.3334 6 1.53 0.1778 0.05 -3.0690 13.2441

 

Estimating and comparing rates (with confidence intervals) in zero-inflated models

By including an offset in a Poisson or negative binomial model, you can model a rate rather than a count. This is explained in this note for models that are not zero-inflated. The same applies to zero-inflated models. The offset is the log of the rate denominator.

These statements create another hypothetical data set of insurance claims. In each observation, the number of claims (C) and the number of policyholders (N) is recorded for each car type and age group. Since the number of policyholders varies across populations, the rates (number of claims per policyholder, or per 100 policyholders) will be modeled rather than the number of claims. The log of the number of policyholders (LN) is computed for use as an offset in the ZIP model.

      data insure;
        input n c car$ age;
        ln = log(n);
        datalines;
      500    0  small  1
      1200  37  medium 1
      100    0  large  1
      400  101  small  2
      500   73  medium 2
      300    0  large  2
      ;

PROC GENMOD fits a ZIP model to the claims data and saves a data set of predicted values. In this example, the mean model is a function only of age and the zero-inflation model contains only an intercept. Since the offset is included in the computation of predicted values by the PRED= option in the OUTPUT statement, the predicted values are mean counts rather than rates. Again, this model can also be fit using the COUNTREG or FMM procedure.NOTE3 The STORE statement saves the fitted model. The COVB option and ODS OUTPUT statement save the model information for use in the NLEstimate macro later.

      proc genmod data=insure;
        model c = age / dist=zip offset=ln covb;
        zeromodel ;
        output out=gout pred=p;
        store out=zip2;
        ods output parameterestimates=mpe zeroparameterestimates=zpe covb=covb;
        run;

These are the parameter estimates of the fitted zero-inflated Poisson model.

Analysis Of Maximum Likelihood Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 -5.3226 0.3393 -5.9875 -4.6576 246.12 <.0001
age 1 1.8396 0.1819 1.4831 2.1961 102.28 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000    
 
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Wald Chi-Square Pr > ChiSq
Intercept 1 -0.0304 0.8288 -1.6549 1.5941 0.00 0.9707

The following DATA step computes the estimated rates by dividing the estimated counts by the number of policyholders.

      data gout; 
        set gout; 
        prate=p/n; 
        run;
      proc print noobs;
        run;
n c car age ln p prate
500 0 small 1 6.21461 7.7959 0.015592
1200 37 medium 1 7.09008 18.7101 0.015592
100 0 large 1 4.60517 1.5592 0.015592
400 101 small 2 5.99146 39.2543 0.098136
500 73 medium 2 6.21461 49.0679 0.098136
300 0 large 2 5.70378 29.4408 0.098136

Using PROC PLM

These statements use PROC PLM to produce the predicted rates as well as standard errors and confidence limits for the rates. The addition of the NOOFFSET option excludes the offset from the computations so that the predicted rates are produced rather than predicted counts.

      proc plm source=zip2;
        score data=insure out=plm2 pred=p stderr=s lclm=l uclm=u / ilink nooffset;
        run;
      proc print noobs label;
        run;

The predicted rates from PLM match those above and include standard errors and confidence limits.

n c car age ln Predicted Value Standard Error Lower 95% Confidence
Limit
Upper 95% Confidence
Limit
500 0 small 1 6.21461 0.015592 0.006903 0.006547 0.03713
1200 37 medium 1 7.09008 0.015592 0.006903 0.006547 0.03713
100 0 large 1 4.60517 0.015592 0.006903 0.006547 0.03713
400 101 small 2 5.99146 0.098136 0.040850 0.043402 0.22189
500 73 medium 2 6.21461 0.098136 0.040850 0.043402 0.22189
300 0 large 2 5.70378 0.098136 0.040850 0.043402 0.22189

Using the NLEstimate macro

As described in earlier, the separate tables of parameter estimates must be combined and rows for parameters with zero degrees of freedom must be removed so the parameter vector and covariance matrix are compatible.

      data pe;
        set mpe zpe;
        if df ne 0;
        run;

Predicted rates for the two ages are then obtained from the NLEstimate macro.

      data fd; 
        length label f $32767; 
        infile datalines delimiter=',';
        input label f; 
        datalines;
      age=1,(1-logistic(b_p3))*exp(b_p1+b_p2)
      age=2,(1-logistic(b_p3))*exp(b_p1+2*b_p2)
      ;
      %NLEstimate(inest=pe, incovb=covb, fdata=fd, df=6)

The estimated rate for AGE=1 is 0.01559, or about 1.6 claims per 100 policyholders. For AGE=2, the rate is 0.09814, or about 9.8 claims per 100 policyholders. Differences with the PLM results are for reasons described above.

Additional Estimates
Label Estimate Standard
Error
DF t Value Pr > |t| Alpha Lower Upper
age=1 0.01559 0.006822 6 2.29 0.0623 0.05 -0.00110 0.03228
age=2 0.09814 0.04074 6 2.41 0.0526 0.05 -0.00154 0.1978

Using PROC NLMIXED

By using PROC NLMIXED to fit the zero-inflated model you can use its PREDICT statement to estimate the rates and provide these additional statistics. The following statements fit the same ZIP model as above. Notice the inclusion of LN without a coefficient in the Poisson part of the model since an offset has a coefficient fixed at 1. In the PREDICT statement, the ZIP mean is divided by the rate denominator as was done in the DATA step above. The predicted rates and confidence limits are saved in data set NLMOUT.

      proc nlmixed data=insure;
        /* define zero inflation model */
        zeromodel = a0;
        infprob  = logistic(zeromodel);

        /* define Poisson model */
        lambda   = exp(b0 + ln + b1*age);

        /* define ZIP log likelihood */
        if c=0 then ll = log(infprob + (1-infprob)*exp(-lambda));
        else ll = log((1-infprob)) + c*log(lambda) - lgamma(c+1) - lambda;
        model c ~ general(ll);

        /* request predicted rates */
        predict (1-infprob)*lambda/n out=nlmout;
        run;

The results in the OUT= data set match those from the NLEstimate macro.

Comparing rates among populations

A similar approach as used for comparing mean counts can be used for comparing rates. The rate difference is (1-ω)λ1/n1 - (1-ω)λ2/n2 = (1-ω)exp(b0+b1+log(n1))/n1 - (1-ω)exp(b0+2b1+log(n2))/n2 = (1-ω)exp(b0+b1) - (1-ω)exp(b0+2b1).

The estimate of this difference is produced by the NLEstimate macro by adding this function to the fdata= data set.

      Mean Rate Difference,(1-logistic(b_p3))*exp(b_p1+b_p2)-(1-logistic(b_p3))*exp(b_p1+2*b_p2)

Comparison of the AGE level rates can also be done in PROC NLMIXED by adding the following statement in the NLMIXED step above.

        estimate 'diff' (1-infprob)*exp(b0 + b1*1) - (1-infprob)*exp(b0 + b1*2);

The difference in rates is as expected from the rates above (0.015592 - 0.098136 = -0.082544) and this difference is not quite significant at the 0.05 level (p=0.0546). The results from the NLEstimate macro and PROC NLMIXED are the same.

Additional Estimates
Label Estimate Standard
Error
DF t Value Pr > |t| Alpha Lower Upper
Mean Rate Difference -0.08254 0.03464 6 -2.38 0.0546 0.05 -0.1673 0.002222

_______

NOTE1: The following statements fit the ZIP model for counts in the COUNTREG and FMM procedures. The OUTPUT statements also save the predicted values and the inflation probabilities from the ZIP model.

      proc countreg data=insure;
         class age;
         model c = age car / dist=zip;
         zeromodel c ~ car;
         output out=cout pred=p probzero=p0;
         run;
         
      proc fmm data=insure;
         class age;
         model c = / dist=constant;
         model + age car / dist=poisson;
         probmodel car;
         output out=fout pred=p mixprob=p0;
         run;

NOTE2: These statements can be used to define the zero-inflated negative binomial (ZINB) log likelihood in PROC NLMIXED:

        /* define ZINB log likelihood */
        if c=0 then ll = log( infprob + (1-infprob)/(1+k*lambda)**(1/k) );
        else ll = log((1-infprob)) + c*log(k*lambda) - (c+(1/k))*log(1+k*lambda) +
        lgamma(c+1/k) - lgamma(1/k) - lgamma(c+1);

NOTE3: These statements fit the ZIP model for rates in the COUNTREG and FMM procedures.

      proc countreg data=insure;
         model c = age / dist=zip offset=ln;
         zeromodel c ~;
         output out=cout pred=p;
         run;

      proc fmm data=insure;
         model c= / dist=constant;
         model + age / dist=poisson offset=ln;
         output out=fout pred=p;
         run;


Operating System and Release Information

Product FamilyProductSystemSAS Release
ReportedFixed*
SAS SystemSAS/STATz/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2003 for x64
Microsoft Windows Server 2008
Microsoft Windows Server 2008 for x64
Microsoft Windows XP Professional
Windows 7 Enterprise 32 bit
Windows 7 Enterprise x64
Windows 7 Home Premium 32 bit
Windows 7 Home Premium x64
Windows 7 Professional 32 bit
Windows 7 Professional x64
Windows 7 Ultimate 32 bit
Windows 7 Ultimate x64
Windows Millennium Edition (Me)
Windows Vista
Windows Vista for x64
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.