For generalized linear models, including but not limited to the logit, probit, and Poisson models, Pearson and deviance goodness-of-fit statistics are computed by the GENMOD procedure. For binomial and multinomial models (whether ordered or unordered), they can also be computed by the LOGISTIC procedure (specify the SCALE=NONE option) and the PROBIT procedure (specify the LACKFIT option). For the multinomial model in PROC GENMOD, specify the AGGREGATE= option. A rough indicator of fit is provided by either of these statistics divided by its degrees of freedom, which should be approximately equal to one. When it deviates substantially from one, some form of lack of fit or, for binomial or Poisson models, overdispersion is indicated. Lindsey (1999) suggests that overdispersion is possible if the deviance is at least twice the degrees of freedom. Overdispersion occurs when there is more variability than expected given the response distribution. Overdispersion is discussed later in this note.
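As a minimal sketch of how these statistics might be requested, the steps below use hypothetical data sets and variables (CLAIMS, TRIAL, ASSAY, NCLAIMS, RESPONSE, DOSE, and so on), none of which come from this note. In the PROC GENMOD output, the ratio discussed above is the Value/DF column of the goodness-of-fit table.

```sas
/* Hypothetical data: CLAIMS has a count response NCLAIMS.             */
/* PROC GENMOD reports the deviance and Pearson chi-square, each with  */
/* its Value/DF ratio, for any model it fits.                          */
proc genmod data=claims;
   class region;
   model nclaims = age region / dist=poisson link=log;
run;

/* Hypothetical data: TRIAL has a binary response RESPONSE (0/1).      */
/* SCALE=NONE requests the Pearson and deviance statistics without     */
/* applying any dispersion adjustment; AGGREGATE defines the           */
/* subpopulations from the model covariates.                           */
proc logistic data=trial;
   class trt;
   model response(event='1') = trt dose / scale=none aggregate;
run;

/* Hypothetical summarized data: ASSAY has R responders out of N.      */
/* LACKFIT requests the goodness-of-fit tests in PROC PROBIT.          */
proc probit data=assay;
   model r/n = dose / lackfit;
run;
```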
A model might fit some observations well but not others. Some observations might not be well fit because additional predictors or higher-order terms might be needed in the model. Or there might be outliers in the data that no reasonable change to the specification of the model can accommodate. For assessing fit of the binary response model at the observation level, use the residuals that are available via the OBSTATS option (in PROC GENMOD) or the options in the OUTPUT statement (in PROC GENMOD and PROC LOGISTIC). For the multinomial model, predicted probabilities for each observation are available via options in the OUTPUT statements of all three procedures.
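The observation-level statistics mentioned above could be saved along these lines. The data set TRIAL, its variables, and the output variable names (PHAT, PEARSON_RES, DEVIANCE_RES) are assumptions made for illustration only.

```sas
/* PROC LOGISTIC: save predicted probabilities and residuals           */
/* through the OUTPUT statement.                                       */
proc logistic data=trial;
   model response(event='1') = dose;
   output out=logit_diag predicted=phat
          reschi=pearson_res resdev=deviance_res;
run;

/* PROC GENMOD: OBSTATS prints observation-level statistics, and the   */
/* OUTPUT statement saves them. DESCENDING makes the procedure model   */
/* the probability of the higher response level (1 in this sketch).    */
proc genmod data=trial descending;
   model response = dose / dist=binomial link=logit obstats;
   output out=genmod_diag pred=phat
          reschi=pearson_res resdev=deviance_res;
run;
```

Sorting the saved data set by the absolute value of a residual is one simple way to find the observations that the model fits least well.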
Another test of fit, available only for the binary logistic model, is the Hosmer-Lemeshow test, which PROC LOGISTIC provides via the LACKFIT option. Statistics that can be used to compare competing models are the AIC, SC (also called BIC), and R² statistics. In PROC LOGISTIC, AIC and SC are provided by default, and R² is requested by specifying the RSQUARE option in the MODEL statement.
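A sketch of a PROC LOGISTIC step that requests both the Hosmer-Lemeshow test and the R² statistics follows. The data set TRIAL and its variables are hypothetical.

```sas
/* LACKFIT requests the Hosmer-Lemeshow test; RSQUARE requests the     */
/* R-square statistics. AIC and SC appear in the model fit statistics  */
/* table by default.                                                   */
proc logistic data=trial;
   model response(event='1') = dose age / lackfit rsquare;
run;
```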
For binary and multinomial response models, the Pearson and deviance statistics are computed by grouping observations into subpopulations.[Note] For details and formulas for these statistics, see the documentation for the procedure. You can define the subpopulations by using the AGGREGATE= option in any of the procedures. This is necessary if the data was collected in subpopulations defined more precisely than by the covariates in the model, as further described in this note. For binary response data that is already summarized and for which the events/trials syntax is used in the MODEL statement, subpopulations can be defined as the binomial results that are contained in each observation by omitting the AGGREGATE= option.
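The two ways of defining subpopulations described above might look like the following in PROC LOGISTIC. The data sets (TRIAL, SUMMARY) and the variables (DOSE, CLINIC, EVENTS, TRIALS) are assumptions for illustration; CLINIC stands in for a grouping variable that is not a covariate in the model.

```sas
/* Single-trial syntax: the subpopulations are defined by DOSE and     */
/* CLINIC, which is finer than the model covariates alone because      */
/* CLINIC is not in the model.                                         */
proc logistic data=trial;
   model response(event='1') = dose / scale=none aggregate=(dose clinic);
run;

/* Events/trials syntax on summarized data: with AGGREGATE= omitted,   */
/* each observation's binomial result is its own subpopulation.        */
proc logistic data=summary;
   model events/trials = dose / scale=none;
run;
```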
The Pearson and deviance statistics are known to be chi-square distributed only in certain cases. In general, their distribution is not known, and for this reason PROC GENMOD does not present p-values for these statistics. For more details, see McCullagh and Nelder (1989), the "Overdispersion" sections in the PROC LOGISTIC and PROC GENMOD documentation, and the "Lack of Fit Tests" section in the PROC PROBIT documentation. Generally, the larger the scaled statistics, the poorer the fit. In the binary response case, sufficient replication is required in all subpopulations for these statistics to be chi-square distributed. Otherwise, the data is sparse and neither statistic is a reliable indicator of goodness of fit. One sign of sparseness is a large difference between the two statistics.
Overdispersion can also be detected by these statistics. For example, in Poisson models the variance should equal the mean because the two are identical in this distribution. When the data is more variable than that, it is said to be overdispersed. One way to adjust for overdispersion is to estimate the dispersion parameter (called the heterogeneity factor in PROC PROBIT) and inflate the covariance matrix of the parameter estimates by this factor. The dispersion parameter can be estimated by the Pearson or deviance statistic divided by its degrees of freedom (use the SCALE=P or SCALE=D option). As noted earlier, because this ratio also indicates lack of fit, it is useful to rule out lack of fit before relying on this adjustment.
Another way of dealing with overdispersion is to use the generalized estimating equations (GEE) method to estimate the model. While GEE is typically used when subjects or objects are repeatedly measured, it can be used even when there is only a single measurement per subject. The robust standard errors of the GEE method provide an adjustment for overdispersion, as discussed in Stokes et al. (2000). Other options include Williams' method for overdispersed binomial data (use the SCALE=WILLIAMS option in PROC LOGISTIC) and the negative binomial distribution for overdispersed Poisson data (use the DIST=NEGBIN option in PROC GENMOD).
Beginning with SAS 9, you can test for overdispersion in a Poisson model by using these options together in the MODEL statement of PROC GENMOD: DIST=NEGBIN SCALE=0 NOSCALE. These options test whether the variance has the overdispersed form μ + kμ² by testing whether the negative binomial dispersion parameter, k, is zero. When k=0, the negative binomial distribution is equivalent to the Poisson distribution. For binary response or normal response models, the variance can be modeled separately from the mean by using the HETERO statement in the QLIM procedure in SAS/ETS.
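The adjustments described above could be sketched as follows for a Poisson model. The data set CLAIMS and the variables NCLAIMS, AGE, and ID are hypothetical; ID is assumed to identify each subject uniquely.

```sas
/* 1. Scale adjustment: estimate the dispersion parameter from the     */
/*    Pearson statistic divided by its degrees of freedom and inflate  */
/*    the covariance matrix of the estimates accordingly.              */
proc genmod data=claims;
   model nclaims = age / dist=poisson link=log scale=p;
run;

/* 2. Negative binomial model for overdispersed counts.                */
proc genmod data=claims;
   model nclaims = age / dist=negbin link=log;
run;

/* 3. Test of overdispersion: hold the negative binomial dispersion    */
/*    parameter k at zero and test whether k=0 is tenable.             */
proc genmod data=claims;
   model nclaims = age / dist=negbin link=log scale=0 noscale;
run;

/* 4. GEE with a single measurement per subject: the REPEATED          */
/*    statement yields robust (empirical) standard errors.             */
proc genmod data=claims;
   class id;
   model nclaims = age / dist=poisson link=log;
   repeated subject=id / type=ind;
run;
```

The first step changes only the standard errors and leaves the parameter estimates as they are, so it is worth checking for missing predictors or outliers before treating a large ratio as overdispersion.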
For GEE models, which are available in PROC GENMOD via the REPEATED statement, there is no overall fit statistic currently available within PROC GENMOD. Note that the Pearson and deviance statistics apply only to the initial model that begins the GEE estimation, not to the final GEE model. A comparative statistic similar to AIC, known as QIC, has been proposed and is available in PROC GENMOD beginning in SAS 9.2. Prior to that, QIC can be obtained via the QIC macro. You can assess fit of the GEE model by using the observation-level statistics that are available via the OBSTATS option or options in the OUTPUT statement. Beginning in SAS 9, the ASSESS statement can be used to determine adequacy of the link function or whether the functional form of a predictor in the model is correct. Starting in SAS 9.2, deletion diagnostics and plots are provided for GEE models to assess the effects of deleting entire clusters. Deletion diagnostics are available via the OBSTATS option or options in the OUTPUT statement. Plots are available via the PLOTS= option in the PROC GENMOD statement.
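A sketch of a GEE fit that combines the REPEATED, ASSESS, and OUTPUT statements follows. The data set VISITS, the variables (ID, Y, X, TIME), the exchangeable working correlation, and the seed value are all assumptions chosen for illustration.

```sas
/* Hypothetical longitudinal data: several rows per subject ID.        */
proc genmod data=visits;
   class id;
   model y = x time / dist=poisson link=log;

   /* Fit by GEE with an exchangeable working correlation. */
   repeated subject=id / type=exch;

   /* Check the functional form of X; RESAMPLE requests a              */
   /* simulation-based p-value and SEED makes it reproducible.         */
   assess var=(x) / resample seed=12345;

   /* Save observation-level statistics for further inspection. */
   output out=gee_diag pred=mu_hat reschi=pearson_res;
run;
```

Replacing VAR=(X) with LINK in the ASSESS statement checks the link function instead of the functional form of a predictor.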
_____
Note that in the PROBIT procedure in SAS 8.1 TS1M0 or earlier, the input data set must be sorted into subpopulations. Prior to SAS 8.2, PROC PROBIT forms the subpopulations as it reads the data set, starting a new subpopulation whenever the covariate values change. So, if the data set is not sorted into the proper subpopulations, more subpopulations than actually exist are defined. The goodness-of-fit statistics are then not based on the correct subpopulation structure, and the data usually appears sparse.
_____
Lindsey, J. K. (1999), "On the Use of Corrections for Overdispersion," Applied Statistics, 48(4), 553-561.
Stokes, M. E., Davis, C. S., and Koch, G. G. (2000), Categorical Data Analysis Using the SAS System, Second Edition, Cary, NC: SAS Institute Inc.
| Product Family | Product | System | SAS Release Reported | SAS Release Fixed |
| SAS System | SAS/STAT | All | n/a | |
| Type: | Usage Note |
| Priority: | low |
| Topic: | SAS Reference ==> Procedures ==> GENMOD; SAS Reference ==> Procedures ==> PROBIT; Analytics ==> Regression; Analytics ==> Categorical Data Analysis; SAS Reference ==> Procedures ==> LOGISTIC |
| Date Modified: | 2008-07-22 17:33:29 |
| Date Created: | 2002-12-16 10:56:38 |



