22599 - Understanding and correcting complete or quasi-complete separation problems

Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems

This is a common problem with logistic models. Unlike in ordinary least-squares regression for modeling a normally distributed response, when a logistic model perfectly or nearly perfectly predicts the response (that is, separates the response levels), unique maximum likelihood estimates do not exist. Some model parameters are actually infinite. This is a common result of the data being sparse, meaning that not all response levels are observed in each of the predictor settings, which often happens with small data sets, when the event is rare, or with an over-specified model. See "Existence of Maximum Likelihood Estimates" in the Details section of the LOGISTIC documentation.

When the separation condition is detected, the procedure terminates the maximum likelihood iteration process and reports the results at the last iteration. The resulting model is often quite good at classifying observations that are not in the region between events and nonevents (where the shape of the model is not identified) and might be usable for this purpose, but inferences about the parameters based on Wald chi-square statistics should be avoided. For details about the stopping rule, see the "Existence of Maximum Likelihood Estimates" section of the LOGISTIC documentation and So (1993).

Additional discussion and illustration of separation can be found in Allison (2012).

One possible solution to the separation problem is to make the model fit less perfectly by reducing the number of variables or effects in the model, categorizing continuous variables, or merging categories of categorical (CLASS) variables. It is often difficult to know exactly which variables cause the separation, but variables that exhibit large parameter estimates or standard errors are likely candidates.

When developing a model, it can be very helpful to use forward or stepwise (not backward) selection via the SELECTION= option in the MODEL statement. These methods build a model that might avoid separation, or if separation occurs in a model fit at a particular step, you might consider using the model successfully fit at the previous step. The backward method is not recommended since this method begins by fitting the most complex model which maximizes the sparseness of the data and is the most likely model to exhibit separation.

Valid tests of model effects can be obtained by using likelihood ratio (LR) tests rather than the Wald tests that are presented by PROC LOGISTIC. For models that do not involve survey data, the easiest way to do this is to fit the model in PROC GENMOD (use the DIST=BINOMIAL option in the MODEL statement) and use the TYPE3 option or the CONTRAST statement to obtain LR tests.

Because separation often happens with small data sets, another possible solution is to use exact logistic regression, which is available via the EXACT statement in PROC LOGISTIC. However, exact methods are very computationally intensive and can take great amounts of time and/or memory even for moderate sized problems. A penalized likelihood method by Firth can often converge to a solution even though a unique solution by maximum likelihood does not exist in separation cases. Firth's method is available by specifying the FIRTH option in the MODEL statement of PROC LOGISTIC. Neither the FIRTH option nor the EXACT statement can be used with the SELECTION= option.

If interest lies exclusively in testing the association of one predictor with the response while adjusting for the effects of the other predictors, then you could use the CMH option in PROC FREQ. All predictors other than the one of interest are used as stratifying variables. Since no model is fit by PROC FREQ, the problem of infinite parameters caused by separation is not an issue. For example, consider the final model involving Sex, Age, and Treatment in the example titled "Logistic Modeling with Categorical Predictors" in the LOGISTIC documentation. The following statements test the association of Treatment with the response, Pain, while adjusting for Sex and Age. The NOPRINT option is used to avoid displaying the large multi-way table. The CMH option computes the test. Since Treatment is an unordered categorical variable with three levels, the second CMH statistic labeled "Row Mean Scores Differ" is an appropriate test.

   proc freq data=Neuralgia;
      table Sex * Age * Treatment * Pain / noprint cmh;
      run;

The resulting p-value (0.0040) is close to the p-value of the Treatment Type3 test from the logistic model (0.0018)

Summary Statistics for Treatment by Pain
Controlling for Sex and Age

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic	Alternative Hypothesis	DF	Value	Prob
1	Nonzero Correlation	1	7.1064	0.0077
2	Row Mean Scores Differ	2	11.0497	0.0040
3	General Association	2	11.0497	0.0040

Operating System and Release Information

Product Family	Product	System	SAS Release
			Reported	Fixed*
SAS System	SAS/STAT	All	n/a

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:	low
Topic:	Analytics ==> Survey Sampling and Analysis SAS Reference ==> Procedures ==> SURVEYLOGISTIC Analytics ==> Categorical Data Analysis SAS Reference ==> Procedures ==> LOGISTIC

Date Modified:	2021-02-05 14:49:10
Date Created:	2002-12-16 10:56:38

Support

Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems

Operating System and Release Information