47956 - Estimating parameters and testing fit of the negative binomial distribution

Usage Note 47956: Estimating parameters and testing fit of the negative binomial distribution

The following example applies the Pearson goodness of fit test to assess the fit of the negative binomial distribution to a set of count data after estimating the parameters of the distribution. For general information on testing the fit of distributions, see this note.

This example appears in Morel and Neerchal (2012). The data are counts of eggs produced by a species of snail found in samples taken from 926 individuals. The following statements create the data set EGGS, which contains the data.

      data eggs;
        input y freq;
        datalines;
        0  603
        1  112
        2  93
        3  53
        4  19
        5  21
        6  7
        7  6
        8  5
        9  2
        10 1
        11 2
        14 2
        ;

The following statements fit the negative binomial model to the data using PROC COUNTREG in SAS/ETS® software. Because the model contains only an intercept (no covariates), the data are treated as a sample from a single population. The intercept estimates the link-transformed negative binomial mean. If desired, you can specify the PRED= option in the OUTPUT statement to apply the inverse link function and obtain an estimate of the negative binomial mean. The negative binomial dispersion parameter is also estimated. The PROB= option provides the expected probability of each observed count. The model can also be fit using PROC GENMOD in SAS/STAT® software, but GENMOD does not provide an option to save the expected probabilities of the observed counts.

      proc countreg data=eggs;
         model y = / dist=negbin;
         output out=exp_nb prob=prob_nb;
         freq freq;
         run;
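For comparison, the same intercept-only model might be fit with PROC GENMOD, as mentioned above. This is only a minimal sketch that assumes the EGGS data set created earlier. GENMOD labels the dispersion parameter differently in its output, and it does not save the expected probabilities, so it cannot replace the COUNTREG step in this example.

```sas
proc genmod data=eggs;
   freq freq;                   /* each row stands for FREQ individuals      */
   model y = / dist=negbin;     /* intercept-only model; log link by default */
run;
```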

Notice that two parameters are estimated by PROC COUNTREG as described above.

Parameter Estimates
Parameter	DF	Estimate	Standard Error	t Value	Approx Pr > |t|
Intercept	1	-0.097472	0.066381	-1.47	0.1420
_Alpha	1	2.977951	0.280185	10.63	<.0001
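To see how the two estimates determine the fitted distribution, the following sketch converts the intercept to a mean and evaluates the negative binomial probabilities directly. It assumes that DIST=NEGBIN uses a log link and the quadratic-variance form (variance = mu + alpha*mu**2). The data set name NB_CHECK and the variables B0, ALPHA, MU, K, LOGP, and P are introduced here for illustration only and are not part of the original example.

```sas
data nb_check;
   b0    = -0.097472;     /* Intercept estimate from the table above     */
   alpha =  2.977951;     /* _Alpha estimate from the table above        */
   mu    = exp(b0);       /* inverse of the log link gives the mean      */
   k     = 1/alpha;       /* size parameter implied by the dispersion    */
   do y = 0 to 9;
      /* log of the negative binomial probability mass at y */
      logp = lgamma(y + k) - lgamma(k) - lgamma(y + 1)
             + k*log(k/(k + mu)) + y*log(mu/(k + mu));
      p = exp(logp);
      output;
   end;
   keep y mu p;
run;

proc print data=nb_check noobs;
run;
```

Under these assumptions the mean is roughly 0.907 and the probability of a zero count is roughly 0.644, which is in line with the Test Percent that PROC FREQ reports for Y=0 later in this note.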

Because of the small expected probabilities for the larger counts, some counts are combined before conducting the test. The FORMAT and MEANS steps below combine counts 8 and 9 into one category and counts 10 and larger into another category. This results in a total of 10 categories of Y. The variable containing the expected probabilities is named _TESTP_ as explained below.

      proc format;
         value yfmt     
           8-9     = "8 or 9" 
           10-high = ">=10"; 
         run;
       
      proc means sum nway data=exp_nb;
         class y; 
         var prob_nb;
         format y yfmt.;
         output out=exp_nb sum=_testp_;
         run;

Note that counts of 12, 13, 15, and higher did not occur in this data set. However, they would be expected to occur in larger or repeated samples. Consequently, the frequencies for these counts can be regarded as sampling zeros, and their expected probabilities should be included when computing the goodness of fit statistics. The expected probabilities over all possible counts sum to one, but a sample never contains every possible count. The expected probability of all counts 10 or greater therefore equals one minus the probability of all counts less than 10. The following DATA step accumulates the expected probabilities (in SUMEXP), uses the LAG function to obtain the total through the preceding category (in SUMTOLAST), and sets the expected probability for the last category (Y=10) to one minus that total.

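      data exp_nb;
         set exp_nb;
         sumexp + _testp_;                       /* running total of expected probabilities */
         sumtolast = lag(sumexp);                /* total through the preceding category    */
         if y=10 then _testp_ = 1 - sumtolast;   /* last category gets the remaining tail   */
         run;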
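As a quick check on the step above, the adjusted expected probabilities can be summed to confirm that they now total one. This is a small sketch and not part of the original example. MAXDEC=6 is an arbitrary display choice.

```sas
proc means data=exp_nb sum maxdec=6;
   var _testp_;        /* the sum should be 1 after the tail adjustment */
run;
```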

The CHISQ option in the PROC FREQ step that follows requests the goodness of fit chi-square test for the one-way table of Y counts. The WEIGHT statement provides the observed frequencies. When the TESTP= suboption specifies a data set name (EXP_NB in this case), PROC FREQ looks for a variable named _TESTP_ in that data set and uses it as the expected probabilities. With the observed and expected values, the CHISQ option can compute the goodness of fit statistics. The LRCHISQ suboption adds the likelihood ratio chi-square test.

To provide proper tests, the usual degrees of freedom (k-1, where k is the number of Y categories) must be reduced by the number of parameters that were estimated. Because PROC COUNTREG estimated the two parameters of the negative binomial distribution (mean and dispersion), the DF=-2 option is specified to reduce the degrees of freedom by two. Alternatively, you could provide the correct degrees of freedom directly by specifying DF=7, since k=10 and two parameters are estimated (10-1-2 = 7). The capability to specify a data set of expected probabilities or frequencies and to adjust the degrees of freedom was added in the 9.3 TS1M2 release of SAS.

      proc freq data=eggs; 
         table y / chisq(testp=exp_nb df=-2 lrchisq);
         format y yfmt.;
         weight freq;
         run;
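The alternative described above, which states the degrees of freedom directly instead of as a reduction, would look like the following. It is the same step with DF=7 in place of DF=-2, and it should produce the same tests provided that the table still has 10 categories.

```sas
proc freq data=eggs;
   table y / chisq(testp=exp_nb df=7 lrchisq);   /* 10 - 1 - 2 = 7 */
   format y yfmt.;
   weight freq;
run;
```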

The results of the tests indicate that the negative binomial model does not fit the data well (p=0.0016).

y	Frequency	Percent	Test Percent	Cumulative Frequency	Cumulative Percent
0	603	65.12	64.44	603	65.12
1	112	12.10	15.79	715	77.21
2	93	10.04	7.70	808	87.26
3	53	5.72	4.37	861	92.98
4	19	2.05	2.66	880	95.03
5	21	2.27	1.69	901	97.30
6	7	0.76	1.09	908	98.06
7	6	0.65	0.72	914	98.70
8 or 9	7	0.76	0.81	921	99.46
>=10	5	0.54	0.72	926	100.00

Chi-Square Test for Specified Proportions
Chi-Square	23.2151
DF	7
Pr > ChiSq	0.0016

Likelihood Ratio Chi-Square Test for Specified Proportions
Chi-Square	23.0880
DF	7
Pr > ChiSq	0.0016
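
The degrees of freedom and p-values above can be checked by hand from the reported statistics. The following sketch uses the SDF function for the upper tail of the chi-square distribution. The variable names are illustrative, and the results should agree with the reported p-values up to rounding.

```sas
data _null_;
   k      = 10;                 /* categories of Y after combining counts */
   nparms = 2;                  /* mean and dispersion were estimated     */
   df     = k - 1 - nparms;     /* adjusted degrees of freedom            */
   p_pearson = sdf('CHISQUARE', 23.2151, df);   /* Pearson chi-square     */
   p_lr      = sdf('CHISQUARE', 23.0880, df);   /* likelihood ratio       */
   put df= p_pearson= p_lr=;    /* writes the values to the SAS log       */
run;
```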

Operating System and Release Information

Product Family	Product	System	SAS Release Reported	SAS Release Fixed*
SAS System	SAS/STAT	z/OS
		OpenVMS VAX
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		Microsoft Windows 8
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows 2012
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows Server 2003 for x64
		Microsoft Windows Server 2008
		Microsoft Windows Server 2008 for x64
		Microsoft Windows XP Professional
		Windows 7 Enterprise 32 bit
		Windows 7 Enterprise x64
		Windows 7 Home Premium 32 bit
		Windows 7 Home Premium x64
		Windows 7 Professional 32 bit
		Windows 7 Professional x64
		Windows 7 Ultimate 32 bit
		Windows 7 Ultimate x64
		Windows Millennium Edition (Me)
		Windows Vista
		Windows Vista for x64
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:
Topic:	Analytics ==> Categorical Data Analysis
	Analytics ==> Distribution Analysis
	SAS Reference ==> Procedures ==> FREQ
	SAS Reference ==> Procedures ==> COUNTREG
	SAS Reference ==> Procedures ==> GENMOD

Date Modified:	2019-05-03 15:51:08
Date Created:	2012-09-21 14:38:29
