SUPPORT / SAMPLES & SAS NOTES
 

Support

Usage Note 43522: Fitting truncated Poisson and negative binomial models

DetailsAboutRate It

Count data in which zero counts cannot be observed is called truncated count data. Such data can be modeled using truncated versions of the Poisson or negative binomial distributions.

Models using the truncated Poisson or truncated negative binomial distribution can be fit in PROC NLMIXED by specifying the log likelihood function. Beginning in SAS 9.3 TS1M0, models using the truncated Poisson distribution can more easily be fit in PROC FMM by specifying the DIST=TRUNCPOISSON option. Beginning in SAS 9.3 TS1M2, PROC FMM can also fit the truncated negative binomial model with the DIST=TRUNCNEGBIN option. Note that these truncated distributions are used to fit hurdle models. Fitting models with both truncated distributions using PROC FMM and PROC NLMIXED is illustrated below.

Truncated Poisson Model

The following statements create the skunk capture data presented in Table 7.1 of Zelterman. Y is the number of times each individual skunk was captured over several trappings. Individual skunks never trapped could not be recorded, so only counts of one or more appear in the data. One goal is to estimate the number of skunks in the area for each sex in each of two years.

      data skunk;
         input y freq year sex $ @@;
         female=1; if sex='M' then female=0;
         y77=1; if year=78 then y77=0;
         fem77=female*y77;
         datalines;
         1  1 77  F    2  2  77  F    3  4  77  F   4  2  77  F   5  1  77  F
         1  3 77  M    2  0  77  M    3  3  77  M   4  3  77  M   5  2  77  M  6  2 77  M
         1  7 78  F    2  7  78  F    3  3  78  F   4  1  78  F   5  2  78  F
         1  4 78  M    2  3  78  M    3  1  78  M
         ;

The statements below fit the following truncated Poisson model,

E(y) = λ = exp[β0 + βf*I(Sex=female) + β77*I(Year=1977) + βf77*I(Sex=female and year=1977)]

where I(·) is an indicator variable equaling 1 when the condition is true and 0 otherwise. The variables FEMALE, Y77, and FEM77 in data set SKUNK are these indicator variables. β0, βf, β77, and βf77 are the model parameters to be estimated.

The model is easily fit in PROC FMM using the DIST=TRUNCPOISSON option in the MODEL statement. The COV option and the ODS OUTPUT statement save the model information for use in the NLEstimate macro later.

      proc fmm data=skunk cov;
         freq freq;
         model y = female y77 fem77 / dist=truncpoisson;
         ods output parameterestimates=pe cov=covb;
         run;

Below are the estimated parameters of the model.

Parameter Estimates for Truncated Poisson Model
Effect Estimate Standard
Error
z Value Pr > |z|
Intercept 0.06255 0.4183 0.15 0.8811
female 0.5560 0.4579 1.21 0.2246
y77 1.1680 0.4467 2.61 0.0089
fem77 -0.7493 0.5242 -1.43 0.1529

In PROC NLMIXED, zero initial values for the model parameters are specified in the PARMS statement. The above model for the expected count, λ, is specified next. The log likelihood, LL, for the truncated Poisson is specified in the following line. Finally, the MODEL statement species that the observed counts, Y, are distributed according to the log likelihood previously defined.

      proc nlmixed data=skunk;
         parms b0=0 bf=0 b77=0 bf77=0;
         lambda=exp(b0+bf*female+b77*y77+bf77*fem77);
         LL=freq*(-lambda+y*log(lambda)-lgamma(y+1)-log(1-exp(-lambda)));
         model y ~ general(LL);
         run;

The fitted model is the same as from PROC FMM above.

To determine the size of the population for each sex in each year, you need to estimate the probability, pi, of capturing a skunk at least once in each of the four populations, i=1, 2, 3, or 4. The probability of at least one capture is 1- p 0 i , the probability of zero captures in population i. The product of the capture probability, pi, and the population size, Ni, is the number of skunks captured at least once (Nipi = ni). Consequently, an estimate of the population size is the number of skunks captured at least once divided by the probability. The PDF function can be used to evaluate the probability mass or density function of many distributions. Specifying pdf('poisson',value,lambda), returns the probability of the given integer value in a Poisson distribution with parameter λ. Then, an estimate of the population size is n/(1-pdf('poisson',0,lambda)), where lambda is the estimated Poisson parameter, λ, for a given population. Since this is a nonlinear function of the model parameters, it can be estimated using the NLEstimate macro.

These statements save the observed number, ni, of skunks captured at least once in each population in data set OBS in variable COUNT.

      proc freq data=skunk;
         weight freq;
         table y77*female / out=obs;
         run;

The fitted model information from PROC FMM is provided to the NLEstimate macro using the inest= and incovb= macro parameters. The data set containing the covariance matrix from PROC FMM has additional numeric variables besides those that contain the matrix. These must be removed. The following statements keep only the variables containing the covariance matrix.

      data covb;
        set covb;
        keep col:;
        run;

With the fitted model information, the NLEstimate macro then uses PROC NLMIXED to estimate the population sizes and the delta method to obtain confidence limits. You write the function to be estimated using the parameter names and specify it in the f= macro parameter. A label can be provided in the label= parameter. See the description of the NLEstimate macro for details about displaying parameter names and using the macro. The population sizes are estimated by specifying data set OBS in the score= macro parameter. The NLEstimate macro evaluates the f= function discussed above for each observation in the score= data set. The estimated values are saved in the outscore= data set, Pr0, in a variable named PRED. Standard errors, a test statistic, p-value, and confidence limits are also saved.

      %NLEstimate(inest=pe, incovb=covb, 
                  f=count/(1-pdf('poisson',0, exp(b_p1+b_p2*female+b_p3*y77+b_p4*female*y77))), 
                  score=obs, outscore=Pr0)

These statements display the estimated population sizes.

      proc print data=Pr0;
         id y77 female;
         var count pred lower upper;
         run;

For the population of males (FEMALE=0) in 1978 (Y77=0), the estimated population size is 13 when rounded to the next largest integer. Eight were captured. A large-sample 95% confidence interval for the population size is approximately (6.5, 18). Similarly for the other three populations as shown.

y77 female COUNT Pred Lower Upper
0 0 8 12.2116 6.6004 17.8227
0 1 20 23.7040 20.7317 26.6763
1 0 13 13.4383 12.9618 13.9147
1 1 10 10.6329 9.8833 11.3825

Truncated Negative Binomial Model

The data set by Long, analyzed in the PROC COUNTREG documentation, is from a study examining how several factors affect the number of articles (ART) published by scientists, including the number of articles published by a scientist's mentor (MENT). While the actual data set includes scientists who did not publish, suppose that those scientists were not reported. Then the data would contain no observations with zero counts resulting in a truncated distribution. The following statements fit this truncated negative binomial model to the nonzero data.

E(y) = mean = exp[β0 + β1*MENT]

Beginning in SAS 9.3 TS1M2, the truncated negative binomial model can easily be fit using PROC FMM. The WHERE statement removes the observations with zero counts.

      proc fmm data=long97data; 
         where art ne 0;
         model art = ment / dist=truncnegbin;
         run;

As with the truncated Poisson model, PROC NLMIXED can also fit the truncated negative binomial model.

      proc nlmixed data=long97data;
         where art ne 0;
         y=art;
         parms b0=1 b1=0 k=1;
         mean = exp(b0 + b1*ment);
         ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) -
              (y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k));
         model y ~ general(ll);
         run;

The estimate of the β1 parameter (0.02483) indicates that the number of published articles increases significantly as the number of articles published by the mentor increases (p<.0001). Also, the negative binomial dispersion parameter, k, is significantly different from zero (p<.0001) indicating that a Poisson model would not be appropriate.

Parameter Estimates
Parameter Estimate Standard
Error
DF t Value Pr > |t| 95% Confidence Limits Gradient
b0 0.2141 0.09159 640 2.34 0.0197 0.03425 0.3940 0.000080
b1 0.02483 0.004285 640 5.80 <.0001 0.01642 0.03325 0.002145
k 0.5965 0.1337 640 4.46 <.0001 0.3339 0.8592 0.000052

Since overdispersion in truncated models can result in biased parameter estimates, it may be desirable to model the dispersion as well as the mean. Modeling data exhibiting overdispersion is extensively discussed by Morel and Neerchal. This can be done by simply including the desired model for the dispersion parameter, k. In the following NLMIXED statements, k is allowed to differ between females (FEM=1) and males (FEM=0).

      proc nlmixed data=long97data;
         where art ne 0;
         y=art;
         parms b0=1 b1=0 k0=1 k1=0;
         mean = exp(b0 + b1*ment);
         k=k0+k1*fem;
         ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) -
              (y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k));
         model y ~ general(ll);
         run;

The significant k1 parameter (p=0.0109) indicates that there is more dispersion for males than for females.

Parameter Estimates
Parameter Estimate Standard
Error
DF t Value Pr > |t| 95% Confidence Limits Gradient
b0 0.2481 0.08768 640 2.83 0.0048 0.07588 0.4202 0.000374
b1 0.02273 0.004306 640 5.28 <.0001 0.01428 0.03119 0.012982
k0 0.7590 0.1739 640 4.36 <.0001 0.4175 1.1006 0.000065
k1 -0.4404 0.1724 640 -2.55 0.0109 -0.7790 -0.1019 7.575E-6

 



Operating System and Release Information

Product FamilyProductSystemSAS Release
ReportedFixed*
SAS SystemSAS/STATz/OS
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2003 for x64
Microsoft Windows Server 2008
Microsoft Windows Server 2008 for x64
Microsoft Windows XP Professional
Windows 7 Enterprise 32 bit
Windows 7 Enterprise x64
Windows 7 Home Premium 32 bit
Windows 7 Home Premium x64
Windows 7 Professional 32 bit
Windows 7 Professional x64
Windows 7 Ultimate 32 bit
Windows 7 Ultimate x64
Windows Millennium Edition (Me)
Windows Vista
Windows Vista for x64
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.