48914 - Testing the fit of a discrete distribution

Usage Note 48914: Testing the fit of a discrete distribution

The Pearson and likelihood ratio goodness of fit tests provide tests of the fit of a distribution or model to the observed values of a variable. The test compares the expected values from the distribution or model to the observed values. For a categorical variable, the comparison is done at each of its levels, but the fit to a continuous variable can also be tested if it is categorized. The test statistics are approximately chi-square distributed with k-1 degrees of freedom, where k is the number of categories. However, when parameters of the hypothesized distribution or model must be estimated, the degrees of freedom of the test statistic must be decreased by the number of estimated parameters.

When expected values are obtained without estimating parameters, the test can be done using PROC FREQ by specifying a one-way table and the CHISQ option in the TABLES statement. The list of expected values can be specified directly in the TESTF= or TESTP= option in the TABLES statement. The TESTF=data-set-name or TESTP=data-set-name option allows you to read expected values from a variable in a data set. This is illustrated in "Binary or multinomial distribution with predefined expected values" below.

When expected values are obtained using estimated parameters of the hypothesized distribution, the DF= option in PROC FREQ allows you to appropriately adjust the degrees of freedom of the test. The examples in "Expected values based on estimated distribution parameters" illustrate this.

For tables with a small total count, the CHISQ option in the EXACT statement can be specified to obtain an exact test.

Binary or multinomial distribution with predefined expected values

You can test the fit of a binary (Bernoulli) or multinomial distribution if you have expected probabilities or frequencies for the levels of the distribution. Assuming that the expected values were not determined by estimating parameters of the distribution, a test of fit can be obtained by using PROC FREQ with the CHISQ option and specifying the expected probabilities in the TESTP= option or the expected frequencies in the TESTF= option.

For example, the following statements test the fit of a multinomial distribution with equal probabilities to a set of observed frequencies. The Pearson chi-square test is provided by default. The LRCHISQ option requests the likelihood ratio chi-square test.

      data Observed;
        input y count;
        datalines;
      1 10 
      2 18 
      3 38 
      4 34
      ;

      proc freq data=Observed;
        weight count;
        table y / chisq(testp=(0.25 0.25 0.25 0.25) lrchisq);
        run;

The Pearson and likelihood ratio tests are both significant (p<0.0001) indicating that the multinomial distribution with equal probabilities does not fit the observed data. Note that the tests have three degrees of freedom since the distribution has four possible levels and no distribution parameters were estimated.

The plot of deviations shows how each observed probability deviates from the probability expected under the hypothesized equal-probability distribution. The deviations are relative to the expected probabilities and are computed as (observed - expected)/expected. For example, the deviation at level 1 is (0.10 - 0.25)/0.25 = -0.60 indicating that the deviation is more than half its expected value.

The SAS System

The FREQ Procedure

y	Frequency	Percent	Test Percent	Cumulative Frequency	Cumulative Percent
1	10	10.00	25.00	10	10.00
2	18	18.00	25.00	28	28.00
3	38	38.00	25.00	66	66.00
4	34	34.00	25.00	100	100.00

Chi-Square Test for Specified Proportions
Chi-Square	20.9600
DF	3
Pr > ChiSq	0.0001

Likelihood Ratio Chi-Square Test for Specified Proportions
Chi-Square	22.5790
DF	3
Pr > ChiSq	<.0001

Sample Size = 100

You can also specify expected values in a data set. The following statements produce the same tests and plot as above. Note that the variable containing the expected probabilities must be named _TESTP_. If expected frequencies are used, the variable must be named _TESTF_. The variable of expected values must be in a different data set than the observed values.

      data ExpProbs;
        do y = 1 to 4;
          _testp_ = 0.25; 
          output;
        end;
        run;
      
      proc freq data=Observed;
        weight count;
        table y / chisq(testp=ExpProbs lrchisq);
        run;

Expected values based on estimated distribution parameters

If the expected values are determined after estimating the parameter(s) of the hypothesized distribution, the DF= option must also be used in PROC FREQ to properly set the degrees of freedom of the test. The degrees of freedom must be reduced by the number of estimated parameters. You typically need to estimate distribution parameters for discrete distributions such as the Poisson, negative binomial, geometric, and zero-inflated distributions such as the zero-inflated Poisson and zero-inflated negative binomial. You first estimate the parameter(s) of the distribution using a modeling procedure such as PROC GENMOD (shown here), PROC COUNTREG (shown here), or PROC NLMIXED. Expected values for each discrete level of the distribution can either be computed by the modeling procedure or computed in a DATA step after parameter estimation. The expected values are again specified in the TESTP= (or TESTF=) option of PROC FREQ, but you also specify the number of parameters estimated as a negative value in the DF= option to reduce the degrees of freedom of the test by that number.

This note shows an example which tests the fit of the negative binomial distribution. Below are additional examples that appear in Friendly (2000).

Testing the fit of a Poisson distribution

The following statements read the observed counts for each level of count variable Y and creates an expanded data set of individual values.

      data Madison;
        input y freq;
        do i=1 to freq; 
          output; 
        end;
        datalines;
      0  156
      1   63
      2   29
      3    8
      4    4
      5    1
      6    1
      ;

PROC COUNTREG is used to estimate the mean of the Poisson distribution from the observed data. A plot of the observed and expected probabilities is provided by the PLOTS=PREDPROB option. The desired range of Y count values is specified in the COUNTS= option. The PROB= option in the OUTPUT statement saves expected probabilities and they are displayed with PROC PRINT.

      proc countreg data=Madison plots(counts(0 to 6))=predprob;
        model y = / dist=poisson;
        output out=exp_poi prob=prob_poi;
        run;

The plot shows that more zeros are observed than expected and fewer ones are observed than expected, but there is good agreement at the larger counts.

To ensure that the Pearson and likelihood ratio chi-square statistics are approximately chi-square distributed so that their p-values are trustworthy, levels with small expected values should be merged. Note that if the Pearson and likelihood ratio chi-square statistics differ considerably, this is an indication that some of the categories are too sparse and the statistics may not be valid. The following statements combine the expected values of levels three and above before conducting the tests.

      proc format;
        value yfmt 3-high = "3+";
        run;
      proc means mean nway data=exp_poi;
        class y; 
        var prob_poi;
        output out=AvgProbs mean=AvgProb;
        run;
      proc means sum nway data=AvgProbs;
        class y; 
        var AvgProb;
        format y yfmt.;
        output out=ExpProbs sum=_testp_;
        run;

Since the Poisson has only one parameter to estimate, DF=-1 is specified in PROC FREQ to reduce the degrees of freedom of the tests by one. The small p-values of the Pearson and likelihood ratio tests indicate that the Poisson distribution does not fit well.

      proc freq data=Madison; 
        table y / chisq(testp=ExpProbs df=-1 lrchisq);
        format y yfmt.;
        run;

Chi-Square Test for Specified Proportions
Chi-Square	16.0381
DF	2
Pr > ChiSq	0.0003

Likelihood Ratio Chi-Square Test for Specified Proportions
Chi-Square	15.7189
DF	2
Pr > ChiSq	0.0004

Testing the fit of a binomial distribution

The following counts of male births in families with twelve offspring are hypothesized to follow a binomial distribution.

      data Saxony;
        n=12;
        do males=0 to n;
          input freq @@;
          output;
        end;
        datalines;
      3 24 104 286 670 1033 1343 1112 829 478 181 45 7
      ;

These statements estimate the binomial parameter and save it in the P variable in the OUT= data set.

      proc genmod;
        freq freq;
        model males/n = / dist=binomial;
        output out=predbin p=p;
        run;

The following statements use the PDF function to compute the expected numbers of male births for each possible count, 0 to 12, using the estimated binomial parameter. Since only one pass through the DATA step is needed to compute all expected values, the STOP statement is used at the end of the step.

      data exp_bin;
        set predbin;
        do males=0 to 12;
          prob_bin=pdf("binomial",males,p,12);
          output;
        end;
        stop;
        run;
      proc print;
        id males; 
        var prob_bin;
        run;

Again, notice that the large and small counts have small expected probabilities.

males	prob_bin
0	0.00015
1	0.00198
2	0.01174
3	0.04227
4	0.10271
5	0.17747
6	0.22359
7	0.20697
8	0.13970
9	0.06705
10	0.02172
11	0.00427
12	0.00038

These statements combine counts 0-2 and counts 10-12.

      proc format;
        value malefmt low-2 = "<=2" 10-high="10+";
        run;
      proc means sum nway data=exp_bin;
        class males; 
        var prob_bin;
        format males malefmt.;
        output out=exp_bin sum=_testp_;
        run;

Finally, PROC FREQ conducts the tests using one fewer degrees of freedom due to estimation of the binomial parameter. The small p-values indicate that the binomial distribution does not fit the observed counts well.

      proc freq data=Saxony;
        table males / chisq(testp=exp_bin df=-1 lrchisq);
        format males malefmt.;
        weight freq;
        run;

Chi-Square Test for Specified Proportions
Chi-Square	96.3889
DF	7
Pr > ChiSq	<.0001

Likelihood Ratio Chi-Square Test for Specified Proportions
Chi-Square	88.9937
DF	7
Pr > ChiSq	<.0001

__________

Friendly, M. (2000), Visualizing Categorical Data, Cary, NC: SAS Institute Inc.

Operating System and Release Information

Product Family	Product	System	SAS Release
Product Family	Product	System	Reported	Fixed*
SAS System	SAS/STAT	z/OS
		Z64
		OpenVMS VAX
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		Microsoft Windows 8 Pro
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows Server 2003 for x64
		Microsoft Windows Server 2008
		Microsoft Windows Server 2008 for x64
		Microsoft Windows Server 2012
		Microsoft Windows XP Professional
		Windows 7 Enterprise 32 bit
		Windows 7 Enterprise x64
		Windows 7 Home Premium 32 bit
		Windows 7 Home Premium x64
		Windows 7 Professional 32 bit
		Windows 7 Professional x64
		Windows 7 Ultimate 32 bit
		Windows 7 Ultimate x64
		Windows Millennium Edition (Me)
		Windows Vista
		Windows Vista for x64
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:
Topic:	Analytics ==> Categorical Data Analysis Analytics ==> Distribution Analysis SAS Reference ==> Procedures ==> COUNTREG SAS Reference ==> Procedures ==> FREQ SAS Reference ==> Procedures ==> GENMOD

Date Modified:	2022-01-13 12:18:40
Date Created:	2013-01-14 10:50:29

Support

Usage Note 48914: Testing the fit of a discrete distribution

Binary or multinomial distribution with predefined expected values

Expected values based on estimated distribution parameters

Operating System and Release Information