The following example applies the Pearson goodness of fit test to assess the fit of the negative binomial distribution to a set of count data after estimating the parameters of the distribution. For general information on testing the fit of distributions, see this note.
This example appears in Morel and Neerchal (2012). The data are counts of eggs produced by a species of snail found in samples taken from 926 individuals. These statements create data set EGGS containing the data.
data eggs;
   input y freq;
   datalines;
0 603
1 112
2 93
3 53
4 19
5 21
6 7
7 6
8 5
9 2
10 1
11 2
14 2
;
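As a quick sanity check on the data, the following Python sketch (not part of the original SAS example) confirms the sample size and computes the frequency-weighted mean and variance. The dictionary `eggs` simply restates the DATALINES values; the variable names are illustrative.

```python
# Observed egg counts (y) mapped to their frequencies, copied from the EGGS data set.
eggs = {0: 603, 1: 112, 2: 93, 3: 53, 4: 19, 5: 21, 6: 7,
        7: 6, 8: 5, 9: 2, 10: 1, 11: 2, 14: 2}

n = sum(eggs.values())                              # number of individuals
total_eggs = sum(y * f for y, f in eggs.items())    # sum of y * freq
mean = total_eggs / n

# Frequency-weighted sample variance (divisor n - 1).
variance = sum(f * (y - mean) ** 2 for y, f in eggs.items()) / (n - 1)

print(f"n = {n}, mean = {mean:.4f}, variance = {variance:.4f}")
```

The sample variance is well above the sample mean, which is the kind of overdispersion that motivates a negative binomial rather than a Poisson model.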
The following statements fit the negative binomial model to the data using PROC COUNTREG in SAS/ETS® software. Since the model contains only an intercept (no covariates), the data are considered a sample from a single population. The intercept estimates the link-transformed negative binomial mean. If desired, the PRED= option in the OUTPUT statement could be specified to apply the inverse link function and provide an estimate of the negative binomial mean. The negative binomial dispersion parameter is also estimated. The PROB= option provides the expected probabilities for each observed count. Note that the model can also be fit using PROC GENMOD in SAS/STAT® software, but GENMOD does not provide an option to save the expected probabilities of the observed counts.
proc countreg data=eggs;
   model y = / dist=negbin;
   output out=exp_nb prob=prob_nb;
   freq freq;
run;
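To make the role of the expected probabilities concrete, here is a minimal Python sketch of a negative binomial probability function. It assumes the common mean/dispersion parameterization in which the variance is mu + alpha*mu^2 and the mean is the exponentiated intercept under a log link; whether this matches the parameterization PROC COUNTREG uses in your release should be checked against the SAS/ETS documentation. The values of `mu` and `alpha` are hypothetical placeholders, not the estimates produced by PROC COUNTREG.

```python
import math

def nb_pmf(y, mu, alpha):
    """P(Y = y) for a negative binomial with mean mu and variance mu + alpha*mu**2."""
    r = 1.0 / alpha  # "size" parameter implied by the dispersion
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu))
             + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

# Hypothetical parameter values for illustration only (NOT the COUNTREG estimates).
mu, alpha = 0.9, 2.5

# Probabilities for the counts that actually occur in the EGGS data,
# analogous to what a PROB= output variable would hold.
observed_counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14]
prob_nb = {y: nb_pmf(y, mu, alpha) for y in observed_counts}

for y, p in prob_nb.items():
    print(y, round(p, 5))
```

Because only the observed counts are evaluated, these probabilities sum to less than one; the unobserved counts carry the remainder.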
Notice that two parameters are estimated by PROC COUNTREG as described above.
Because of the small expected probabilities for the larger counts, some counts are combined before conducting the test. The FORMAT and MEANS steps below combine counts 8 and 9 into one category and counts 10 and larger into another category. This results in a total of 10 categories of Y. The variable containing the expected probabilities is named _TESTP_ as explained below.
proc format;
   value yfmt 8-9     = "8 or 9"
              10-high = ">=10";
run;

proc means sum nway data=exp_nb;
   class y;
   var prob_nb;
   format y yfmt.;
   output out=exp_nb sum=_testp_;
run;
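The grouping performed by the format can be sketched outside SAS as follows. This Python fragment applies the same category rule to the observed frequencies; the identical rule would apply to the expected probabilities. The function name `pool` and the dictionary names are illustrative.

```python
def pool(y):
    """Map a raw count to its pooled category label, mirroring the YFMT format."""
    if y >= 10:
        return ">=10"
    if y >= 8:
        return "8 or 9"
    return str(y)

# Observed counts and frequencies from the EGGS data set.
eggs = {0: 603, 1: 112, 2: 93, 3: 53, 4: 19, 5: 21, 6: 7,
        7: 6, 8: 5, 9: 2, 10: 1, 11: 2, 14: 2}

# Sum the frequencies within each pooled category.
pooled_obs = {}
for y, f in eggs.items():
    label = pool(y)
    pooled_obs[label] = pooled_obs.get(label, 0) + f

print(pooled_obs)
```

Pooling leaves 10 categories and does not change the total number of individuals.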
Note that counts of 12, 13, and 15 or higher did not occur in this data set. However, they would be expected to occur in larger or repeated samples. Consequently, the frequencies for these counts may be regarded as sampling zeros, and their expected probabilities should be included when computing the goodness-of-fit statistics. The expected probabilities over all possible counts sum to one, but a finite sample never contains every possible count, so the probabilities saved for the observed counts sum to less than one. The expected probability of all counts 10 or greater equals one minus the probability of all counts less than 10. The following DATA step accumulates the expected probabilities (in SUMEXP) and sets the expected probability for the last category (Y=10, which represents counts of 10 or more) to this difference.
data exp_nb;
   set exp_nb;
   sumexp + _testp_;
   sumtolast = lag(sumexp);
   if y=10 then _testp_ = 1 - sumtolast;
run;
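The tail adjustment made by this DATA step amounts to one subtraction. The Python sketch below shows the same idea on a hypothetical vector of pooled probabilities; the numbers are made up for illustration and are not the probabilities from the fitted model.

```python
def close_tail(pooled_probs):
    """Replace the last category's probability with 1 minus the sum of the others."""
    head = pooled_probs[:-1]
    return head + [1.0 - sum(head)]

# Hypothetical pooled probabilities for the 10 categories (0..7, "8 or 9", ">=10").
# The last entry covers only the counts seen in the sample (10, 11, and 14),
# so the vector sums to slightly less than one before the adjustment.
testp = [0.62, 0.15, 0.08, 0.05, 0.03, 0.02, 0.015, 0.01, 0.012, 0.004]

adjusted = close_tail(testp)
print(adjusted[-1], sum(adjusted))
```

After the adjustment the last category absorbs the probability of every count of 10 or more, including counts that were never observed, and the vector sums to one.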
The CHISQ option in the PROC FREQ step which follows requests that the goodness of fit chi-square test be performed for the one-way table of Y counts. The WEIGHT statement provides the observed frequencies. When the TESTP= suboption specifies a data set name (EXP_NB in this case), PROC FREQ looks for a variable named _TESTP_ in the data set. This variable provides the expected probabilities. With the observed and expected values, the CHISQ option is able to compute the goodness of fit statistics. The likelihood ratio chi-square test is included by specifying the LRCHISQ suboption. However, to provide proper tests the usual degrees of freedom for the tests (k-1, where k is the number of Y categories) must be reduced by the number of parameters that were estimated. Since PROC COUNTREG was used to estimate the two parameters of the negative binomial distribution (mean and dispersion), the DF=-2 option is specified to reduce the degrees of freedom by two. Alternatively, you could provide the correct degrees of freedom by specifying DF=7 since k=10 and two parameters are estimated (10-1-2 = 7). The capability to specify a data set of expected probabilities or frequencies and to adjust degrees of freedom was added in the 9.3 TS1M2 release of SAS.
proc freq data=eggs;
   table y / chisq(testp=exp_nb df=-2 lrchisq);
   format y yfmt.;
   weight freq;
run;
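For readers who want to see the arithmetic behind the two statistics, the following Python sketch computes the Pearson and likelihood ratio chi-square statistics and the adjusted degrees of freedom from observed frequencies and expected probabilities. It is an independent illustration using the textbook formulas, not a reproduction of PROC FREQ internals; the function name `gof_statistics` and the three-cell example are made up.

```python
import math

def gof_statistics(observed, probs, n_estimated):
    """Return (Pearson chi-square, LR chi-square, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Cells with a zero observed frequency contribute nothing to the LR statistic.
    lr = 2.0 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)
    # k - 1 degrees of freedom, reduced by the number of estimated parameters.
    df = len(observed) - 1 - n_estimated
    return pearson, lr, df

# Small made-up example with fully specified probabilities (no estimated parameters).
pearson, lr, df = gof_statistics([10, 20, 30], [0.25, 0.25, 0.5], n_estimated=0)
print(pearson, lr, df)

# Degrees of freedom for the egg data: 10 categories, 2 estimated parameters.
k, n_params = 10, 2
eggs_df = k - 1 - n_params
print(eggs_df)
```

Each statistic would then be referred to a chi-square distribution with the adjusted degrees of freedom to obtain a p-value.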
The results of the tests indicate that the negative binomial model does not fit the data well (p=0.0016).
Product Family | Product | System | SAS Release Reported | SAS Release Fixed* |
SAS System | SAS/STAT | z/OS | | |
 | | OpenVMS VAX | | |
 | | Microsoft® Windows® for 64-Bit Itanium-based Systems | | |
 | | Microsoft Windows Server 2003 Datacenter 64-bit Edition | | |
 | | Microsoft Windows Server 2003 Enterprise 64-bit Edition | | |
 | | Microsoft Windows XP 64-bit Edition | | |
 | | Microsoft® Windows® for x64 | | |
 | | OS/2 | | |
 | | Microsoft Windows 8 | | |
 | | Microsoft Windows 95/98 | | |
 | | Microsoft Windows 2000 Advanced Server | | |
 | | Microsoft Windows 2000 Datacenter Server | | |
 | | Microsoft Windows 2000 Server | | |
 | | Microsoft Windows 2000 Professional | | |
 | | Microsoft Windows 2012 | | |
 | | Microsoft Windows NT Workstation | | |
 | | Microsoft Windows Server 2003 Datacenter Edition | | |
 | | Microsoft Windows Server 2003 Enterprise Edition | | |
 | | Microsoft Windows Server 2003 Standard Edition | | |
 | | Microsoft Windows Server 2003 for x64 | | |
 | | Microsoft Windows Server 2008 | | |
 | | Microsoft Windows Server 2008 for x64 | | |
 | | Microsoft Windows XP Professional | | |
 | | Windows 7 Enterprise 32 bit | | |
 | | Windows 7 Enterprise x64 | | |
 | | Windows 7 Home Premium 32 bit | | |
 | | Windows 7 Home Premium x64 | | |
 | | Windows 7 Professional 32 bit | | |
 | | Windows 7 Professional x64 | | |
 | | Windows 7 Ultimate 32 bit | | |
 | | Windows 7 Ultimate x64 | | |
 | | Windows Millennium Edition (Me) | | |
 | | Windows Vista | | |
 | | Windows Vista for x64 | | |
 | | 64-bit Enabled AIX | | |
 | | 64-bit Enabled HP-UX | | |
 | | 64-bit Enabled Solaris | | |
 | | ABI+ for Intel Architecture | | |
 | | AIX | | |
 | | HP-UX | | |
 | | HP-UX IPF | | |
 | | IRIX | | |
 | | Linux | | |
 | | Linux for x64 | | |
 | | Linux on Itanium | | |
 | | OpenVMS Alpha | | |
 | | OpenVMS on HP Integrity | | |
 | | Solaris | | |
 | | Solaris for x64 | | |
 | | Tru64 UNIX | | |
Type: | Usage Note |
Priority: | |
Topic: | Analytics ==> Categorical Data Analysis; Analytics ==> Distribution Analysis; SAS Reference ==> Procedures ==> FREQ; SAS Reference ==> Procedures ==> COUNTREG; SAS Reference ==> Procedures ==> GENMOD |
Date Modified: | 2019-05-03 15:51:08 |
Date Created: | 2012-09-21 14:38:29 |