![]() | ![]() | ![]() |
Count data in which zero counts cannot be observed is called truncated count data. Such data can be modeled using truncated versions of the poisson or negative binomial distributions.
Models using the truncated poisson or truncated negative binomial distribution can be fit in PROC NLMIXED by specifying the log likelihood function. Beginning in SAS 9.3 TS1M0, models using the truncated poisson distribution can be fit in PROC FMM by simply specifying the DIST=TRUNCPOISSON option. Beginning in SAS 9.3 TS1M2, PROC FMM can also fit the truncated negative binomial model with the DIST=TRUNCNEGBIN option. Note that these truncated distributions are used to fit hurdle models. Also, Zelterman shows how a model can be fit to truncated poisson data using PROC GENMOD and macros defining the truncated poisson deviance, link, and variance functions. Fitting both models using PROC FMM and PROC NLMIXED is illustrated below.
The following statements create the skunk capture data presented in Table 7.1 of Zelterman. Y is the number of times each individual skunk was captured over several trappings. Individual skunks never trapped could not be recorded, so only counts of one or more appear in the data. One goal is to estimate the number of skunks in the area for each sex in each of two years.
data skunk;
input y freq year sex $ @@;
female=1; if sex='M' then female=0;
y77=1; if year=78 then y77=0;
sexyear=female*y77;
do i=1 to freq;
output;
end;
datalines;
1 1 77 F 2 2 77 F 3 4 77 F 4 2 77 F 5 1 77 F
1 3 77 M 2 0 77 M 3 3 77 M 4 3 77 M 5 2 77 M 6 2 77 M
1 7 78 F 2 7 78 F 3 3 78 F 4 1 78 F 5 2 78 F
1 4 78 M 2 3 78 M 3 1 78 M
;
The statements below fit the following truncated poisson model,
E(y) = λ = exp[β0 + βf*I(Sex=female) + β77*I(Year=1977) + βf77*I(Sex=female and year=1977)]
where I(·) is an indicator variable equaling 1 when the condition is true and 0 otherwise. The variables FEMALE, Y77, and SEXYEAR in data set SKUNK are these indicator variables. β0, βf, β77, and βf77 are the model parameters to be estimated.
The model is easily fit in PROC FMM using the DIST=TRUNCPOISSON option in the MODEL statement.
proc fmm;
model y = female|y77 / dist=truncpoisson;
run;
In PROC NLMIXED, initial values (zero) for the model parameters are specified in the PARMS statement. The above model for the expected count, λ, is specified next. The log likelihood, ll, for the truncated poisson is specified in the following line. Finally, the MODEL statement species that the observed counts, Y, are distributed according to the log likelihood previously defined. The ODS OUTPUT statement saves the parameter estimates of the fitted model in a data set.
proc nlmixed;
parms b0=0 bf=0 b77=0 bf77=0;
lambda=exp(b0+bf*female+b77*y77+bf77*sexyear);
ll=-lambda+y*log(lambda)-lgamma(y+1)-log(1-exp(-lambda));
model y ~ general(ll);
ods output parameterestimates=pe;
run;
Following is the resulting table of parameter estimates for the fitted model.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
To determine the size of the population for each sex in each year, you need to estimate the probability, pi, of capturing a skunk at least once in each of the four populations. The product of this probability and the population size, Ni, is the number of skunks captured at least once (Nipi = ni). Consequently, an estimate of the population size is the number of skunks captured at least once divided by the probability. Begin by computing the probabilities of capturing a skunk 0, 1, 2, ... , or 6 times using the estimated parameters from the model. These can be obtained using the PROBCOUNTS macro. To use this macro with the parameter estimates from a model fitted using PROC NLMIXED requires naming the parameters using the associated variable names and using Intercept as the name for the model intercept. The following DATA step does this renaming.
data pe;
length parameter $20;
set pe;
if parameter="b0" then parameter="Intercept";
if parameter="bf" then parameter="female";
if parameter="b77" then parameter="y77";
if parameter="bf77" then parameter="sexyear";
run;
You can now call the PROBCOUNTS macro specifying the data set of saved parameter estimates in the INMODEL= option and requesting count probabilities for counts 0 through 6.
%probcounts(data=skunk, inmodel=pe, counts=%str(0 to 6) )
The PROBCOUNTS macro produces a copy of the skunk data set, named PROBCOUNTS, which includes variables Pr0, Pr1, ... , Pr6 containing the predicted probabilities of each count. Note that Pr0 is the estimated probability of never capturing a skunk in each population. The following DATA step computes the probability, pGE1, of capturing each skunk at least once by summing probabilities Pr1 through Pr6. Since all skunks in the same population have the same probabilities, only one observation from each population is saved.
data pop;
set probcounts;
by year sex;
pGE1=sum(of pr1-pr6);
if first.year or first.sex;
keep year sex pGE1 pr0;
run;
This step computes ni, the observed number of skunks in each population. These counts are saved in a variable named COUNT in data set OBS.
proc freq data=skunk;
table year*sex / out=obs;
run;
The population sizes, Ni, are computed as discussed above by dividing the observed counts, ni, by the estimated count probabilities, pi. The CEIL function rounds the computed value to the next largest integer. The estimated number of uncaptured skunks is estimated by multiplying the estimated probability of zero captures, Pr0, by the estimated population size.
data pop2;
merge pop obs;
by year sex;
popsize=ceil(count/pGE1);
zerocount=ceil(popsize*pr0);
run;
proc print noobs;
var year sex pGE1 count popsize zerocount;
run;
Note that in the population of males in 1977 the observed number of captured skunks, 13, and the estimated number of uncaptured skunks, 1, does not equal the estimated population size, 15. This is because the estimated probability of more than six captures multiplied by this population's size would account for another skunk.
|
The data set by Long, analyzed in the PROC COUNTREG documentation, is from a study examining how several factors affect the number of articles (ART) published by scientists, including the number of articles published by a scientist's mentor (MENT). While the actual data set includes scientists who did not publish, suppose that those scientists were not reported. Then the data would contain no observations with zero counts resulting in a truncated distribution. The following statements fit this truncated negative binomial model to the nonzero data.
E(y) = mean = exp[β0 + β1*MENT]
Beginning in SAS 9.3 TS1M2, the truncated negative binomial model can easily be fit using PROC FMM. The WHERE statement removes the observations with zero counts.
proc fmm data=data.article;
where art ne 0;
model art = ment / dist=truncnegbin;
run;
As with the truncated poisson model, PROC NLMIXED can also fit the truncated negative binomial model.
proc nlmixed data=long97data;
where art ne 0;
y=art;
parms b0=1 b1=0 k=1;
mean = exp(b0 + b1*ment);
ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) -
(y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k));
model y ~ general(ll);
run;
The estimate of the β1 parameter (0.02483) indicates that the number of published articles increases significantly as the number of articles published by the mentor increases (p<.0001). Also, the negative binomial dispersion parameter, k, is significantly different from zero (p<.0001) indicating that a poisson model would not be appropriate.
| ||||||||||||||||||||||||||||||||||||||||||||||||||
Since overdispersion in truncated models can result in biased parameter estimates, it may be desirable to model the dispersion as well as the mean. Modeling data exhibiting overdispersion is extensively discussed by Morel and Neerchal. This can be done by simply including the desired model for the dispersion parameter, k. In the following NLMIXED statements, k is allowed to differ between females (FEM=1) and males (FEM=0).
proc nlmixed data=long97data;
where art ne 0;
y=art;
parms b0=1 b1=0 k0=1 k1=0;
mean = exp(b0 + b1*ment);
k=k0+k1*fem;
ll = lgamma(y+1/k) - lgamma(1/k) - lgamma(y+1) + y*log(k*mean) -
(y+(1/k))*log(1+k*mean) - log(1-(1+k*mean)**(-1/k));
model y ~ general(ll);
run;
The significant k1 parameter (p=0.0109) indicates that there is more dispersion for males than for females.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
References
| Product Family | Product | System | SAS Release | |
| Reported | Fixed* | |||
| SAS System | SAS/STAT | z/OS | ||
| OpenVMS VAX | ||||
| Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
| Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
| Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
| Microsoft Windows XP 64-bit Edition | ||||
| Microsoft® Windows® for x64 | ||||
| OS/2 | ||||
| Microsoft Windows 95/98 | ||||
| Microsoft Windows 2000 Advanced Server | ||||
| Microsoft Windows 2000 Datacenter Server | ||||
| Microsoft Windows 2000 Server | ||||
| Microsoft Windows 2000 Professional | ||||
| Microsoft Windows NT Workstation | ||||
| Microsoft Windows Server 2003 Datacenter Edition | ||||
| Microsoft Windows Server 2003 Enterprise Edition | ||||
| Microsoft Windows Server 2003 Standard Edition | ||||
| Microsoft Windows Server 2003 for x64 | ||||
| Microsoft Windows Server 2008 | ||||
| Microsoft Windows Server 2008 for x64 | ||||
| Microsoft Windows XP Professional | ||||
| Windows 7 Enterprise 32 bit | ||||
| Windows 7 Enterprise x64 | ||||
| Windows 7 Home Premium 32 bit | ||||
| Windows 7 Home Premium x64 | ||||
| Windows 7 Professional 32 bit | ||||
| Windows 7 Professional x64 | ||||
| Windows 7 Ultimate 32 bit | ||||
| Windows 7 Ultimate x64 | ||||
| Windows Millennium Edition (Me) | ||||
| Windows Vista | ||||
| Windows Vista for x64 | ||||
| 64-bit Enabled AIX | ||||
| 64-bit Enabled HP-UX | ||||
| 64-bit Enabled Solaris | ||||
| ABI+ for Intel Architecture | ||||
| AIX | ||||
| HP-UX | ||||
| HP-UX IPF | ||||
| IRIX | ||||
| Linux | ||||
| Linux for x64 | ||||
| Linux on Itanium | ||||
| OpenVMS Alpha | ||||
| OpenVMS on HP Integrity | ||||
| Solaris | ||||
| Solaris for x64 | ||||
| Tru64 UNIX | ||||
| Type: | Usage Note |
| Priority: | |
| Topic: | Analytics ==> Categorical Data Analysis Analytics ==> Regression SAS Reference ==> Procedures ==> NLMIXED SAS Reference ==> Procedures ==> FMM |
| Date Modified: | 2011-06-23 16:48:54 |
| Date Created: | 2011-06-14 13:09:02 |




