43522 - Fitting truncated Poisson and negative binomial models

Usage Note 43522: Fitting truncated Poisson and negative binomial models

Count data in which zero counts cannot be observed is called truncated count data. Such data can be modeled using truncated versions of the Poisson or negative binomial distributions.

Models using the truncated Poisson or truncated negative binomial distribution can be fit in PROC NLMIXED by specifying the log likelihood function. Beginning in SAS 9.3 TS1M0, models using the truncated Poisson distribution can more easily be fit in PROC FMM by specifying the DIST=TRUNCPOISSON option. Beginning in SAS 9.3 TS1M2, PROC FMM can also fit the truncated negative binomial model with the DIST=TRUNCNEGBIN option. Note that these truncated distributions are used to fit hurdle models. Fitting models with both truncated distributions using PROC FMM and PROC NLMIXED is illustrated below.

Truncated Poisson Model

The following statements create the skunk capture data presented in Table 7.1 of Zelterman. Y is the number of times each individual skunk was captured over several trappings. Individual skunks never trapped could not be recorded, so only counts of one or more appear in the data. One goal is to estimate the number of skunks in the area for each sex in each of two years.

      data skunk;
         input y freq year sex $ @@;
         female=1; if sex='M' then female=0;
         y77=1; if year=78 then y77=0;
         fem77=female*y77;
         datalines;
         1  1 77  F    2  2  77  F    3  4  77  F   4  2  77  F   5  1  77  F
         1  3 77  M    2  0  77  M    3  3  77  M   4  3  77  M   5  2  77  M  6  2 77  M
         1  7 78  F    2  7  78  F    3  3  78  F   4  1  78  F   5  2  78  F
         1  4 78  M    2  3  78  M    3  1  78  M
         ;

The statements below fit the following truncated Poisson model,

E(y) = λ = exp[β₀ + β_f*I(Sex=female) + β₇₇*I(Year=1977) + β_f77*I(Sex=female and year=1977)]

where I(·) is an indicator variable equaling 1 when the condition is true and 0 otherwise. The variables FEMALE, Y77, and FEM77 in data set SKUNK are these indicator variables. β₀, β_f, β₇₇, and β_f77 are the model parameters to be estimated.

The model is easily fit in PROC FMM using the DIST=TRUNCPOISSON option in the MODEL statement. The COV option and the ODS OUTPUT statement save the model information for use in the NLEstimate macro later.

      proc fmm data=skunk cov;
         freq freq;
         model y = female y77 fem77 / dist=truncpoisson;
         ods output parameterestimates=pe cov=covb;
         run;

Below are the estimated parameters of the model.

Parameter Estimates for Truncated Poisson Model
Effect	Estimate	Standard Error	z Value	Pr > \|z\|
Intercept	0.06255	0.4183	0.15	0.8811
female	0.5560	0.4579	1.21	0.2246
y77	1.1680	0.4467	2.61	0.0089
fem77	-0.7493	0.5242	-1.43	0.1529

In PROC NLMIXED, zero initial values for the model parameters are specified in the PARMS statement. The above model for the expected count, λ, is specified next. The log likelihood, LL, for the truncated Poisson is specified in the following line. Finally, the MODEL statement species that the observed counts, Y, are distributed according to the log likelihood previously defined.

      proc nlmixed data=skunk;
         parms b0=0 bf=0 b77=0 bf77=0;
         lambda=exp(b0+bf*female+b77*y77+bf77*fem77);
         LL=freq*(-lambda+y*log(lambda)-lgamma(y+1)-log(1-exp(-lambda)));
         model y ~ general(LL);
         run;

The fitted model is the same as from PROC FMM above.

To determine the size of the population for each sex in each year, you need to estimate the probability, p_i, of capturing a skunk at least once in each of the four populations, i=1, 2, 3, or 4. The probability of at least one capture is 1- p 0 i , the probability of zero captures in population i. The product of the capture probability, p_i, and the population size, N_i, is the number of skunks captured at least once (N_ip_i = n_i). Consequently, an estimate of the population size is the number of skunks captured at least once divided by the probability. The PDF function can be used to evaluate the probability mass or density function of many distributions. Specifying pdf('poisson',value,lambda), returns the probability of the given integer value in a Poisson distribution with parameter λ. Then, an estimate of the population size is n/(1-pdf('poisson',0,lambda)), where lambda is the estimated Poisson parameter, λ, for a given population. Since this is a nonlinear function of the model parameters, it can be estimated using the NLEstimate macro.

These statements save the observed number, n_i, of skunks captured at least once in each population in data set OBS in variable COUNT.

      proc freq data=skunk;
         weight freq;
         table y77*female / out=obs;
         run;

The fitted model information from PROC FMM is provided to the NLEstimate macro using the inest= and incovb= macro parameters. The data set containing the covariance matrix from PROC FMM has additional numeric variables besides those that contain the matrix. These must be removed. The following statements keep only the variables containing the covariance matrix.

      data covb;
        set covb;
        keep col:;
        run;

With the fitted model information, the NLEstimate macro then uses PROC NLMIXED to estimate the population sizes and the delta method to obtain confidence limits. You write the function to be estimated using the parameter names and specify it in the f= macro parameter. A label can be provided in the label= parameter. See the description of the NLEstimate macro for details about displaying parameter names and using the macro. The population sizes are estimated by specifying data set OBS in the score= macro parameter. The NLEstimate macro evaluates the f= function discussed above for each observation in the score= data set. The estimated values are saved in the outscore= data set, Pr0, in a variable named PRED. Standard errors, a test statistic, p-value, and confidence limits are also saved.

      %NLEstimate(inest=pe, incovb=covb, 
                  f=count/(1-pdf('poisson',0, exp(b_p1+b_p2*female+b_p3*y77+b_p4*female*y77))), 
                  score=obs, outscore=Pr0)

These statements display the estimated population sizes.

      proc print data=Pr0;
         id y77 female;
         var count pred lower upper;
         run;

For the population of males (FEMALE=0) in 1978 (Y77=0), the estimated population size is 13 when rounded to the next largest integer. Eight were captured. A large-sample 95% confidence interval for the population size is approximately (6.5, 18). Similarly for the other three populations as shown.

y77	female	COUNT	Pred	Lower	Upper
0	0	8	12.2116	6.6004	17.8227
0	1	20	23.7040	20.7317	26.6763
1	0	13	13.4383	12.9618	13.9147
1	1	10	10.6329	9.8833	11.3825

Truncated Negative Binomial Model

The data set by Long, analyzed in the PROC COUNTREG documentation, is from a study examining how several factors affect the number of articles (ART) published by scientists, including the number of articles published by a scientist's mentor (MENT). While the actual data set includes scientists who did not publish, suppose that those scientists were not reported. Then the data would contain no observations with zero counts resulting in a truncated distribution. The following statements fit this truncated negative binomial model to the nonzero data.

E(y) = mean = exp[β₀ + β₁*MENT]

Beginning in SAS 9.3 TS1M2, the truncated negative binomial model can easily be fit using PROC FMM. The WHERE statement removes the observations with zero counts.

      proc fmm data=long97data; 
         where art ne 0;
         model art = ment / dist=truncnegbin;
         run;

As with the truncated Poisson model, PROC NLMIXED can also fit the truncated negative binomial model.

      proc nlmixed data=long97data;
         where art ne 0;
         y=art;
         parms b0=1 b1=0 k=1;
         mean = exp(b0 + b1*ment);
         ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) -
              (y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k));
         model y ~ general(ll);
         run;

The estimate of the β₁ parameter (0.02483) indicates that the number of published articles increases significantly as the number of articles published by the mentor increases (p<.0001). Also, the negative binomial dispersion parameter, k, is significantly different from zero (p<.0001) indicating that a Poisson model would not be appropriate.

Parameter Estimates
Parameter	Estimate	Standard Error	DF	t Value	Pr > \|t\|	95% Confidence Limits		Gradient
b0	0.2141	0.09159	640	2.34	0.0197	0.03425	0.3940	0.000080
b1	0.02483	0.004285	640	5.80	<.0001	0.01642	0.03325	0.002145
k	0.5965	0.1337	640	4.46	<.0001	0.3339	0.8592	0.000052

Since overdispersion in truncated models can result in biased parameter estimates, it may be desirable to model the dispersion as well as the mean. Modeling data exhibiting overdispersion is extensively discussed by Morel and Neerchal. This can be done by simply including the desired model for the dispersion parameter, k. In the following NLMIXED statements, k is allowed to differ between females (FEM=1) and males (FEM=0).

      proc nlmixed data=long97data;
         where art ne 0;
         y=art;
         parms b0=1 b1=0 k0=1 k1=0;
         mean = exp(b0 + b1*ment);
         k=k0+k1*fem;
         ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) -
              (y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k));
         model y ~ general(ll);
         run;

The significant k1 parameter (p=0.0109) indicates that there is more dispersion for males than for females.

Parameter Estimates
Parameter	Estimate	Standard Error	DF	t Value	Pr > \|t\|	95% Confidence Limits		Gradient
b0	0.2481	0.08768	640	2.83	0.0048	0.07588	0.4202	0.000374
b1	0.02273	0.004306	640	5.28	<.0001	0.01428	0.03119	0.012982
k0	0.7590	0.1739	640	4.36	<.0001	0.4175	1.1006	0.000065
k1	-0.4404	0.1724	640	-2.55	0.0109	-0.7790	-0.1019	7.575E-6

Operating System and Release Information

Product Family	Product	System	SAS Release
Product Family	Product	System	Reported	Fixed*
SAS System	SAS/STAT	z/OS
		OpenVMS VAX
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows Server 2003 for x64
		Microsoft Windows Server 2008
		Microsoft Windows Server 2008 for x64
		Microsoft Windows XP Professional
		Windows 7 Enterprise 32 bit
		Windows 7 Enterprise x64
		Windows 7 Home Premium 32 bit
		Windows 7 Home Premium x64
		Windows 7 Professional 32 bit
		Windows 7 Professional x64
		Windows 7 Ultimate 32 bit
		Windows 7 Ultimate x64
		Windows Millennium Edition (Me)
		Windows Vista
		Windows Vista for x64
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:
Topic:	Analytics ==> Categorical Data Analysis Analytics ==> Regression SAS Reference ==> Procedures ==> NLMIXED SAS Reference ==> Procedures ==> FMM

Date Modified:	2018-11-12 13:12:00
Date Created:	2011-06-14 13:09:02

Support

Usage Note 43522: Fitting truncated Poisson and negative binomial models

Truncated Poisson Model

Truncated Negative Binomial Model

Operating System and Release Information