Count data in which zero counts cannot be observed is called truncated count data. Such data can be modeled using truncated versions of the Poisson or negative binomial distributions.
Models using the truncated Poisson or truncated negative binomial distribution can be fit in PROC NLMIXED by specifying the log likelihood function. Beginning in SAS 9.3 TS1M0, models using the truncated Poisson distribution can more easily be fit in PROC FMM by specifying the DIST=TRUNCPOISSON option. Beginning in SAS 9.3 TS1M2, PROC FMM can also fit the truncated negative binomial model with the DIST=TRUNCNEGBIN option. Note that these truncated distributions are used to fit hurdle models. Fitting models with both truncated distributions using PROC FMM and PROC NLMIXED is illustrated below.
The following statements create the skunk capture data presented in Table 7.1 of Zelterman. Y is the number of times each individual skunk was captured over several trappings. Individual skunks never trapped could not be recorded, so only counts of one or more appear in the data. One goal is to estimate the number of skunks in the area for each sex in each of two years.
data skunk; input y freq year sex $ @@; female=1; if sex='M' then female=0; y77=1; if year=78 then y77=0; fem77=female*y77; datalines; 1 1 77 F 2 2 77 F 3 4 77 F 4 2 77 F 5 1 77 F 1 3 77 M 2 0 77 M 3 3 77 M 4 3 77 M 5 2 77 M 6 2 77 M 1 7 78 F 2 7 78 F 3 3 78 F 4 1 78 F 5 2 78 F 1 4 78 M 2 3 78 M 3 1 78 M ;
The statements below fit the following truncated Poisson model,
E(y) = λ = exp[β0 + βf*I(Sex=female) + β77*I(Year=1977) + βf77*I(Sex=female and year=1977)]
where I(·) is an indicator variable equaling 1 when the condition is true and 0 otherwise. The variables FEMALE, Y77, and FEM77 in data set SKUNK are these indicator variables. β0, βf, β77, and βf77 are the model parameters to be estimated.
The model is easily fit in PROC FMM using the DIST=TRUNCPOISSON option in the MODEL statement. The COV option and the ODS OUTPUT statement save the model information for use in the NLEstimate macro later.
proc fmm data=skunk cov; freq freq; model y = female y77 fem77 / dist=truncpoisson; ods output parameterestimates=pe cov=covb; run;
Below are the estimated parameters of the model.
|
In PROC NLMIXED, zero initial values for the model parameters are specified in the PARMS statement. The above model for the expected count, λ, is specified next. The log likelihood, LL, for the truncated Poisson is specified in the following line. Finally, the MODEL statement species that the observed counts, Y, are distributed according to the log likelihood previously defined.
proc nlmixed data=skunk; parms b0=0 bf=0 b77=0 bf77=0; lambda=exp(b0+bf*female+b77*y77+bf77*fem77); LL=freq*(-lambda+y*log(lambda)-lgamma(y+1)-log(1-exp(-lambda))); model y ~ general(LL); run;
The fitted model is the same as from PROC FMM above.
To determine the size of the population for each sex in each year, you need to estimate the probability, pi, of capturing a skunk at least once in each of the four populations, i=1, 2, 3, or 4. The probability of at least one capture is 1- p 0 i , the probability of zero captures in population i. The product of the capture probability, pi, and the population size, Ni, is the number of skunks captured at least once (Nipi = ni). Consequently, an estimate of the population size is the number of skunks captured at least once divided by the probability. The PDF function can be used to evaluate the probability mass or density function of many distributions. Specifying pdf('poisson',value,lambda), returns the probability of the given integer value in a Poisson distribution with parameter λ. Then, an estimate of the population size is n/(1-pdf('poisson',0,lambda)), where lambda is the estimated Poisson parameter, λ, for a given population. Since this is a nonlinear function of the model parameters, it can be estimated using the NLEstimate macro.
These statements save the observed number, ni, of skunks captured at least once in each population in data set OBS in variable COUNT.
proc freq data=skunk; weight freq; table y77*female / out=obs; run;
The fitted model information from PROC FMM is provided to the NLEstimate macro using the inest= and incovb= macro parameters. The data set containing the covariance matrix from PROC FMM has additional numeric variables besides those that contain the matrix. These must be removed. The following statements keep only the variables containing the covariance matrix.
data covb; set covb; keep col:; run;
With the fitted model information, the NLEstimate macro then uses PROC NLMIXED to estimate the population sizes and the delta method to obtain confidence limits. You write the function to be estimated using the parameter names and specify it in the f= macro parameter. A label can be provided in the label= parameter. See the description of the NLEstimate macro for details about displaying parameter names and using the macro. The population sizes are estimated by specifying data set OBS in the score= macro parameter. The NLEstimate macro evaluates the f= function discussed above for each observation in the score= data set. The estimated values are saved in the outscore= data set, Pr0, in a variable named PRED. Standard errors, a test statistic, p-value, and confidence limits are also saved.
%NLEstimate(inest=pe, incovb=covb, f=count/(1-pdf('poisson',0, exp(b_p1+b_p2*female+b_p3*y77+b_p4*female*y77))), score=obs, outscore=Pr0)
These statements display the estimated population sizes.
proc print data=Pr0; id y77 female; var count pred lower upper; run;
For the population of males (FEMALE=0) in 1978 (Y77=0), the estimated population size is 13 when rounded to the next largest integer. Eight were captured. A large-sample 95% confidence interval for the population size is approximately (6.5, 18). Similarly for the other three populations as shown.
|
The data set by Long, analyzed in the PROC COUNTREG documentation, is from a study examining how several factors affect the number of articles (ART) published by scientists, including the number of articles published by a scientist's mentor (MENT). While the actual data set includes scientists who did not publish, suppose that those scientists were not reported. Then the data would contain no observations with zero counts resulting in a truncated distribution. The following statements fit this truncated negative binomial model to the nonzero data.
E(y) = mean = exp[β0 + β1*MENT]
Beginning in SAS 9.3 TS1M2, the truncated negative binomial model can easily be fit using PROC FMM. The WHERE statement removes the observations with zero counts.
proc fmm data=long97data; where art ne 0; model art = ment / dist=truncnegbin; run;
As with the truncated Poisson model, PROC NLMIXED can also fit the truncated negative binomial model.
proc nlmixed data=long97data; where art ne 0; y=art; parms b0=1 b1=0 k=1; mean = exp(b0 + b1*ment); ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) - (y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k)); model y ~ general(ll); run;
The estimate of the β1 parameter (0.02483) indicates that the number of published articles increases significantly as the number of articles published by the mentor increases (p<.0001). Also, the negative binomial dispersion parameter, k, is significantly different from zero (p<.0001) indicating that a Poisson model would not be appropriate.
|
Since overdispersion in truncated models can result in biased parameter estimates, it may be desirable to model the dispersion as well as the mean. Modeling data exhibiting overdispersion is extensively discussed by Morel and Neerchal. This can be done by simply including the desired model for the dispersion parameter, k. In the following NLMIXED statements, k is allowed to differ between females (FEM=1) and males (FEM=0).
proc nlmixed data=long97data; where art ne 0; y=art; parms b0=1 b1=0 k0=1 k1=0; mean = exp(b0 + b1*ment); k=k0+k1*fem; ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) - (y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k)); model y ~ general(ll); run;
The significant k1 parameter (p=0.0109) indicates that there is more dispersion for males than for females.
|
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | SAS/STAT | z/OS | ||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows Server 2003 for x64 | ||||
Microsoft Windows Server 2008 | ||||
Microsoft Windows Server 2008 for x64 | ||||
Microsoft Windows XP Professional | ||||
Windows 7 Enterprise 32 bit | ||||
Windows 7 Enterprise x64 | ||||
Windows 7 Home Premium 32 bit | ||||
Windows 7 Home Premium x64 | ||||
Windows 7 Professional 32 bit | ||||
Windows 7 Professional x64 | ||||
Windows 7 Ultimate 32 bit | ||||
Windows 7 Ultimate x64 | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
Windows Vista for x64 | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: | |
Topic: | Analytics ==> Categorical Data Analysis Analytics ==> Regression SAS Reference ==> Procedures ==> NLMIXED SAS Reference ==> Procedures ==> FMM |
Date Modified: | 2018-11-12 13:12:00 |
Date Created: | 2011-06-14 13:09:02 |