The Pearson and likelihood ratio goodness of fit tests provide tests of the fit of a distribution or model to the observed values of a variable. The test compares the expected values from the distribution or model to the observed values. For a categorical variable, the comparison is done at each of its levels, but the fit to a continuous variable can also be tested if it is categorized. The test statistics are approximately chi-square distributed with k-1 degrees of freedom, where k is the number of categories. However, when parameters of the hypothesized distribution or model must be estimated, the degrees of freedom of the test statistic must be decreased by the number of estimated parameters.
When expected values are obtained without estimating parameters, the test can be done using PROC FREQ by specifying a one-way table and the CHISQ option in the TABLES statement. The list of expected values can be specified directly in the TESTF= or TESTP= option in the TABLES statement. The TESTF=data-set-name or TESTP=data-set-name option allows you to read expected values from a variable in a data set. This is illustrated in "Binary or multinomial distribution with predefined expected values" below.
When expected values are obtained using estimated parameters of the hypothesized distribution, the DF= option in PROC FREQ allows you to appropriately adjust the degrees of freedom of the test. The examples in "Expected values based on estimated distribution parameters" illustrate this.
For tables with a small total count, the CHISQ option in the EXACT statement can be specified to obtain an exact test.
You can test the fit of a binary (Bernoulli) or multinomial distribution if you have expected probabilities or frequencies for the levels of the distribution. Assuming that the expected values were not determined by estimating parameters of the distribution, a test of fit can be obtained by using PROC FREQ with the CHISQ option and specifying the expected probabilities in the TESTP= option or the expected frequencies in the TESTF= option.
For example, the following statements test the fit of a multinomial distribution with equal probabilities to a set of observed frequencies. The Pearson chi-square test is provided by default. The LRCHISQ option requests the likelihood ratio chi-square test.
data Observed; input y count; datalines; 1 10 2 18 3 38 4 34 ; proc freq data=Observed; weight count; table y / chisq(testp=(0.25 0.25 0.25 0.25) lrchisq); run;
The Pearson and likelihood ratio tests are both significant (p<0.0001) indicating that the multinomial distribution with equal probabilities does not fit the observed data. Note that the tests have three degrees of freedom since the distribution has four possible levels and no distribution parameters were estimated.
The plot of deviations shows how each observed probability deviates from the probability expected under the hypothesized equal-probability distribution. The deviations are relative to the expected probabilities and are computed as (observed - expected)/expected. For example, the deviation at level 1 is (0.10 - 0.25)/0.25 = -0.60 indicating that the deviation is more than half its expected value.
You can also specify expected values in a data set. The following statements produce the same tests and plot as above. Note that the variable containing the expected probabilities must be named _TESTP_. If expected frequencies are used, the variable must be named _TESTF_. The variable of expected values must be in a different data set than the observed values.
data ExpProbs; do y = 1 to 4; _testp_ = 0.25; output; end; run; proc freq data=Observed; weight count; table y / chisq(testp=ExpProbs lrchisq); run;
If the expected values are determined after estimating the parameter(s) of the hypothesized distribution, the DF= option must also be used in PROC FREQ to properly set the degrees of freedom of the test. The degrees of freedom must be reduced by the number of estimated parameters. You typically need to estimate distribution parameters for discrete distributions such as the Poisson, negative binomial, geometric, and zero-inflated distributions such as the zero-inflated Poisson and zero-inflated negative binomial. You first estimate the parameter(s) of the distribution using a modeling procedure such as PROC GENMOD (shown here), PROC COUNTREG (shown here), or PROC NLMIXED. Expected values for each discrete level of the distribution can either be computed by the modeling procedure or computed in a DATA step after parameter estimation. The expected values are again specified in the TESTP= (or TESTF=) option of PROC FREQ, but you also specify the number of parameters estimated as a negative value in the DF= option to reduce the degrees of freedom of the test by that number.
This note shows an example which tests the fit of the negative binomial distribution. Below are additional examples that appear in Friendly (2000).
Testing the fit of a Poisson distribution
The following statements read the observed counts for each level of count variable Y and creates an expanded data set of individual values.
data Madison; input y freq; do i=1 to freq; output; end; datalines; 0 156 1 63 2 29 3 8 4 4 5 1 6 1 ;
PROC COUNTREG is used to estimate the mean of the Poisson distribution from the observed data. A plot of the observed and expected probabilities is provided by the PLOTS=PREDPROB option. The desired range of Y count values is specified in the COUNTS= option. The PROB= option in the OUTPUT statement saves expected probabilities and they are displayed with PROC PRINT.
proc countreg data=Madison plots(counts(0 to 6))=predprob; model y = / dist=poisson; output out=exp_poi prob=prob_poi; run;
The plot shows that more zeros are observed than expected and fewer ones are observed than expected, but there is good agreement at the larger counts.
To ensure that the Pearson and likelihood ratio chi-square statistics are approximately chi-square distributed so that their p-values are trustworthy, levels with small expected values should be merged. Note that if the Pearson and likelihood ratio chi-square statistics differ considerably, this is an indication that some of the categories are too sparse and the statistics may not be valid. The following statements combine the expected values of levels three and above before conducting the tests.
proc format; value yfmt 3-high = "3+"; run; proc means mean nway data=exp_poi; class y; var prob_poi; output out=AvgProbs mean=AvgProb; run; proc means sum nway data=AvgProbs; class y; var AvgProb; format y yfmt.; output out=ExpProbs sum=_testp_; run;
Since the Poisson has only one parameter to estimate, DF=-1 is specified in PROC FREQ to reduce the degrees of freedom of the tests by one. The small p-values of the Pearson and likelihood ratio tests indicate that the Poisson distribution does not fit well.
proc freq data=Madison; table y / chisq(testp=ExpProbs df=-1 lrchisq); format y yfmt.; run;
|
Testing the fit of a binomial distribution
The following counts of male births in families with twelve offspring are hypothesized to follow a binomial distribution.
data Saxony; n=12; do males=0 to n; input freq @@; output; end; datalines; 3 24 104 286 670 1033 1343 1112 829 478 181 45 7 ;
These statements estimate the binomial parameter and save it in the P variable in the OUT= data set.
proc genmod; freq freq; model males/n = / dist=binomial; output out=predbin p=p; run;
The following statements use the PDF function to compute the expected numbers of male births for each possible count, 0 to 12, using the estimated binomial parameter. Since only one pass through the DATA step is needed to compute all expected values, the STOP statement is used at the end of the step.
data exp_bin; set predbin; do males=0 to 12; prob_bin=pdf("binomial",males,p,12); output; end; stop; run; proc print; id males; var prob_bin; run;
Again, notice that the large and small counts have small expected probabilities.
|
These statements combine counts 0-2 and counts 10-12.
proc format; value malefmt low-2 = "<=2" 10-high="10+"; run; proc means sum nway data=exp_bin; class males; var prob_bin; format males malefmt.; output out=exp_bin sum=_testp_; run;
Finally, PROC FREQ conducts the tests using one fewer degrees of freedom due to estimation of the binomial parameter. The small p-values indicate that the binomial distribution does not fit the observed counts well.
proc freq data=Saxony; table males / chisq(testp=exp_bin df=-1 lrchisq); format males malefmt.; weight freq; run;
|
__________
Friendly, M. (2000), Visualizing Categorical Data, Cary, NC: SAS Institute Inc.
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | SAS/STAT | z/OS | ||
Z64 | ||||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 8 Pro | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows Server 2003 for x64 | ||||
Microsoft Windows Server 2008 | ||||
Microsoft Windows Server 2008 for x64 | ||||
Microsoft Windows Server 2012 | ||||
Microsoft Windows XP Professional | ||||
Windows 7 Enterprise 32 bit | ||||
Windows 7 Enterprise x64 | ||||
Windows 7 Home Premium 32 bit | ||||
Windows 7 Home Premium x64 | ||||
Windows 7 Professional 32 bit | ||||
Windows 7 Professional x64 | ||||
Windows 7 Ultimate 32 bit | ||||
Windows 7 Ultimate x64 | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
Windows Vista for x64 | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: | |
Topic: | Analytics ==> Categorical Data Analysis Analytics ==> Distribution Analysis SAS Reference ==> Procedures ==> COUNTREG SAS Reference ==> Procedures ==> FREQ SAS Reference ==> Procedures ==> GENMOD |
Date Modified: | 2022-01-13 12:18:40 |
Date Created: | 2013-01-14 10:50:29 |