Exact Logistic Regression and Exact Poisson Regression

Overview

SAS/STAT® software provides facilities in the LOGISTIC and GENMOD procedures for performing exact logistic regression and in the GENMOD procedure for performing exact Poisson regression. Exact logistic regression and exact Poisson regression have become important analytical techniques, especially in the pharmaceutical industry, because the usual asymptotic methods are unreliable for small, skewed, or sparse data sets. Exact inference is based on the distributions of the sufficient statistics for the parameters of interest in a logistic or Poisson regression model, conditional on the sufficient statistics for the remaining parameters. Enumerating these distributions directly is computationally infeasible for many problems. Hirji, Mehta, and Patel (1987) developed an efficient algorithm for generating the required conditional distributions, which makes these methods computationally practical.

Introduction

Many clinical trials deal with the comparison of populations of subjects with categorical responses. Historically, statistical inference for such studies involves large-sample approximations, and fitting logistic or Poisson regression models to such data is performed through the unconditional likelihood function. However, asymptotic methods might be inadequate when sample sizes are small or the data are sparse or skewed. Exact conditional inference remains valid in such situations.
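The idea of conditioning can be sketched with the standard construction for a binary logistic model. The notation in this sketch ($\beta_N$, $\beta_I$, $T_N$, $T_I$, $C$) is introduced here for illustration only and is not part of the procedures' syntax. Split the parameters into nuisance parameters $\beta_N$ and parameters of interest $\beta_I$, with sufficient statistics $T_N$ and $T_I$ of the form $T_j = \sum_i y_i x_{ij}$. The conditional distribution of $T_I$ given the observed value $t_N$ is then

$$
f(t_I \mid T_N = t_N;\ \beta_I) = \frac{C(t_N, t_I)\,\exp(t_I'\beta_I)}{\sum_{u} C(t_N, u)\,\exp(u'\beta_I)}
$$

where $C(t_N, u)$ is the number of response vectors that give $T_N = t_N$ and $T_I = u$. The nuisance parameters $\beta_N$ cancel from this ratio, so tests and estimates for $\beta_I$ rest on a fully specified discrete distribution rather than on a large-sample approximation. An analogous construction applies to Poisson regression.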

The LOGISTIC, GENMOD, GLIMMIX, PROBIT, CATMOD, HPLOGISTIC, and HPGENSELECT procedures perform unconditional likelihood inference for logit models, and the PHREG procedure can perform asymptotic conditional likelihood inference for logit models. The GENMOD, GAM, and GLIMMIX procedures perform unconditional likelihood inference for Poisson regression models. SAS users requested the ability to perform exact tests for logistic and Poisson regression modeling, and so SAS/STAT software introduced the EXACT statement in the LOGISTIC and GENMOD procedures that allows users to obtain these tests.

Capabilities

The exact conditional logistic and Poisson regression facilities in the LOGISTIC and GENMOD procedures provide numerous features, including:

  • parameter estimates for each variable in the EXACT statement conditional on the values of all the other variables in the model. For each parameter estimate, the procedure displays either the exact maximum conditional likelihood estimate or the median unbiased estimate in addition to the exponential of the estimate, one- or two-sided confidence limits, and a one- or two-sided p-value for testing that the parameter is equal to zero.

  • two tests for the null hypothesis that the parameters for the specified effects are zero: the exact probability test and the exact conditional score test. For each test, the procedure displays a test statistic, an exact p-value, and a mid p-value that adjusts for the discreteness of the distribution. Individual hypothesis tests for the parameter of each continuous effect, and main effects tests for the parameters of classification variables, are generated by default.

  • a joint test that all specified effects are simultaneously equal to zero. You can request this test by using the JOINT option in the EXACT statement. The JOINT option affects neither the parameter estimates nor the individual hypothesis tests for the specified effects.

  • output data sets that contain the derived distributions and summary statistics
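As a minimal sketch of the last capability, the following statements use the OUTDIST= option in the EXACT statement to save the exact conditional distribution to a data set. The counts are the dose-response data from Example 1, and the data set name ExactDist is an arbitrary choice made for this illustration.

```sas
/* Dose-response counts from Example 1 */
data dose;
  input Dose Deaths Total @@;
  datalines;
0 0 3  1 0 3  2 0 3  3 0 3
4 1 3  5 2 3
;
run;

/* OUTDIST= names the output data set for the exact conditional   */
/* distribution; ExactDist is a hypothetical name for this sketch */
proc logistic data=dose;
  model Deaths/Total = Dose;
  exact Dose / estimate=both outdist=ExactDist;
run;

/* Inspect the saved distribution */
proc print data=ExactDist;
run;
```

Printing the saved data set is a convenient way to see how few distinct values the conditional distribution takes for a small study.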

Example 1: Exact Logistic Regression - Dose-Response Study

Consider a small dose-response study in which researchers are interested in analyzing how mortality rates change with respect to dosage of a drug. The Dose data set contains life/death outcomes for six levels of drug dosage (0 to 5). Three subjects are given each specific dose of the drug, and the number of deaths is recorded.

data dose;
  input Dose Deaths Total @@;
  datalines;
0 0 3  1 0 3  2 0 3  3 0 3
4 1 3  5 2 3
;
run;

All of the cells have counts that are less than 5, which makes the applicability of large-sample theory questionable. For each subject $i$ receiving dosage $x_i$, $i = 1, \ldots, 18$, let $Y_i = 1$ if the subject died and $Y_i = 0$ otherwise, and let $\pi_i = P(Y_i = 1 \mid x_i)$. Then the linear logistic model for this problem is $\mathrm{logit}(\pi_i) = \log(\pi_i / (1-\pi_i)) = \alpha + x_i\beta$, which fits a common intercept and slope for all 18 subjects. In the PROC LOGISTIC invocation below, the EXACT statement requests an exact analysis of Dose, and the ESTIMATE=BOTH option produces exact parameter estimates and exact odds ratios.

proc logistic data=dose descending;
  model Deaths/Total = Dose;
  exact Dose / estimate = both;
run;
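The model is written per subject, whereas the Dose data set stores one row per dose level in events/trials form. As a sketch of how the two views relate, the following statements expand the data to one row per subject and fit the same model by using single-trial syntax. The data set name dose_long and the variables Subject and Y are hypothetical names introduced for this illustration, and the statements assume that the Dose data set created above is available. Because the sufficient statistics are unchanged, the exact results should agree with those from the events/trials specification.

```sas
/* Expand events/trials data to one row per subject (18 rows) */
data dose_long;
  set dose;
  do Subject = 1 to Total;
    Y = (Subject <= Deaths);   /* 1 = died, 0 = survived */
    output;
  end;
  keep Dose Y;
run;

/* Same exact analysis with a binary response */
proc logistic data=dose_long;
  model Y(event='1') = Dose;
  exact Dose / estimate=both;
run;
```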

Output 1 displays some of the unconditional asymptotic results that are produced by default. The likelihood ratio and score tests reject the null hypothesis that the slope $\beta$ is zero. However, the Wald test does not reject this null hypothesis. The seemingly conflicting conclusions of these tests are a sign that the large-sample approximation is unreliable. The estimates for the intercept $\alpha$ and the slope $\beta$ both have p-values greater than 0.05, indicating marginal influence. The 95% Wald confidence limits for the odds ratio (0.677 to 94.679) include the value 1, so the asymptotic results do not let you conclude that a change in dosage causes a change in mortality.

Output 1: Asymptotic Analysis Results

The LOGISTIC Procedure

Testing Global Null Hypothesis: BETA=0
Test              Chi-Square   DF   Pr > ChiSq
Likelihood Ratio      8.1478    1       0.0043
Score                 5.7943    1       0.0161
Wald                  2.7249    1       0.0988

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -9.4745           5.5677            2.8958       0.0888
Dose         1     2.0804           1.2603            2.7249       0.0988

Odds Ratio Estimates
Effect   Point Estimate   95% Wald Confidence Limits
Dose              8.007        0.677          94.679


Output 2 displays the results produced by the EXACT statement. The p-values in the "Exact Conditional Tests" table lead you to reject the null hypothesis that the slope $\beta $ is equal to zero (no conclusions can be made about the intercept $\alpha $ since it is "conditioned" away). Note that the p-values for the asymptotic estimates are larger than the p-values for the exact estimates. The "Exact Parameter Estimates" table shows that the slope $\beta $ is estimated to be 1.8 with an exact p-value of 0.0245. Note that the confidence interval for the odds ratio does not include the value 1; thus the odds of death increase with dosage.

Output 2: Exact Analysis Results

The LOGISTIC Procedure

Exact Conditional Analysis

Exact Conditional Tests
Effect   Test          Statistic   Exact p-Value   Mid p-Value
Dose     Score            5.4724          0.0245        0.0190
         Probability      0.0110          0.0245        0.0190

Exact Parameter Estimates
Parameter   Estimate   Standard Error   95% Confidence Limits   Two-sided p-Value
Dose          1.8000           1.0784     0.1157       5.8665              0.0245

Exact Odds Ratios
Parameter   Estimate   95% Confidence Limits   Two-sided p-Value
Dose           6.049      1.123      353.000              0.0245
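Because the study is so small, the test results in Output 2 can be checked by hand. The following sketch assumes the usual definitions: the probability test sums the null probabilities of all outcomes that are no more likely than the observed one, and the mid p-value subtracts half of the observed outcome's probability. These definitions are consistent with the displayed values. The symbols $t_0$, $t_1$, $C(u)$, and $S$ are introduced here for illustration. Under $\beta = 0$ and conditional on the total number of deaths $t_0 = 3$, each of the $\binom{18}{3} = 816$ ways of assigning the three deaths to the 18 subjects is equally likely, and $C(u)$ denotes the number of assignments whose doses sum to $u$.

$$
\begin{aligned}
t_0 &= \sum_i y_i = 3, \qquad t_1 = \sum_i x_i y_i = 4(1) + 5(2) = 14,\\
\Pr(T_1 = 14 \mid T_0 = 3) &= \frac{C(14)}{816} = \frac{\binom{3}{2}\binom{3}{1}}{816} = \frac{9}{816} \approx 0.0110,\\
p_{\text{exact}} &= \frac{C(0) + C(1) + C(14) + C(15)}{816} = \frac{1 + 9 + 9 + 1}{816} = \frac{20}{816} \approx 0.0245,\\
p_{\text{mid}} &= \frac{20}{816} - \frac{1}{2}\cdot\frac{9}{816} = \frac{15.5}{816} \approx 0.0190,\\
S &= \frac{(t_1 - \mathrm{E}[T_1])^2}{\mathrm{Var}(T_1)} = \frac{(14 - 7.5)^2}{3 \cdot \frac{35}{12} \cdot \frac{15}{17}} \approx \frac{42.25}{7.7206} \approx 5.4724.
\end{aligned}
$$

Only $u = 0, 1, 14, 15$ have $C(u) \le 9$; every other value of $u$ has $C(u) \ge 18$. The same four values are the only ones at least as far from the null mean 7.5 as the observed 14, which is why the score and probability tests share the exact p-value 0.0245.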


Exact Logistic Regression by Using PROC GENMOD

The following SAS statements replicate the previous analysis by using PROC GENMOD:

proc genmod data=dose descending;
  model Deaths/Total = Dose;
  exact Dose / estimate = both;
run;

The output generated by the EXACT statement in GENMOD is identical to that from LOGISTIC. For more information about PROC GENMOD, refer to the chapter "The GENMOD Procedure" in SAS/STAT User's Guide.

For more information about performing exact logistic regression using the LOGISTIC procedure, refer to the paper "Performing Exact Logistic Regression with the SAS System-Revised 2009" by Robert Derr.

Example 2: Exact Poisson Regression - Ingots Study

The following data, taken from Cox and Snell (1989, pp. 9–10), consist of the number, NotReady, of ingots that are not ready for rolling, out of Total tested, for several combinations of heating time (Heat) and soaking time (Soak):

data ingots;
   input Heat Soak NotReady Total @@;
   lnTotal= log(Total);
   datalines;
7 1.0 0 10  14 1.0 0 31  27 1.0 1 56  51 1.0 3 13
7 1.7 0 17  14 1.7 0 43  27 1.7 4 44  51 1.7 0  1
7 2.2 0  7  14 2.2 2 33  27 2.2 0 21  51 2.2 0  1
7 2.8 0 12  14 2.8 0 31  27 2.8 1 22  51 4.0 0  1
7 4.0 0  9  14 4.0 0 19  27 4.0 1 16
;
run;

The following invocation of PROC GENMOD fits an asymptotic (unconditional) Poisson regression model to the data. The variable NotReady is specified as the response variable, and the continuous predictors Heat and Soak are defined in the CLASS statement as categorical predictors that use reference coding. Specifying the offset variable as lnTotal enables you to model the ratio NotReady/Total.

proc genmod data=ingots;
   class Heat Soak / param=ref;
   model NotReady=Heat Soak / offset=lnTotal dist=poisson link=log;
   exact Heat Soak / joint estimate;
run;

The EXACT statement is specified to additionally fit an exact conditional Poisson regression model. As in the unconditional fit, the lnTotal offset variable models the ratio NotReady/Total; in this case, the Total variable contains the largest possible response value for each observation. The JOINT option produces a joint test for the significance of the covariates, along with the usual marginal tests. The ESTIMATE option produces exact parameter estimates for the covariates.
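The role of the offset can be written out explicitly. In this sketch, $\mu_i$ denotes the expected value of NotReady for observation $i$ and $x_i'\beta$ denotes the linear predictor for the Heat and Soak effects; both symbols are introduced here for illustration. An offset enters the linear predictor with its coefficient fixed at 1, so

$$
\log(\mu_i) = \log(\mathrm{Total}_i) + x_i'\beta
\quad\Longleftrightarrow\quad
\log\!\left(\frac{\mu_i}{\mathrm{Total}_i}\right) = x_i'\beta .
$$

The regression parameters therefore describe the log of the expected proportion of ingots that are not ready, rather than the log of the raw count.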

From the "Analysis of Maximum Likelihood Parameter Estimates" table in Output 3 you can see that only two of the Heat parameters are significantly different from the reference level. The standard errors show that the unconditional analysis had convergence difficulties with the Heat=7 parameter (Standard Error=264324.6). Every observation with Heat=7 has NotReady=0, so the likelihood keeps increasing as this parameter decreases and the maximum likelihood estimate does not exist. The regression parameter for Heat=7 is therefore not estimable in this model, which precludes asymptotic inference about it.

Output 3: Unconditional Maximum Likelihood Estimates

The GENMOD Procedure

Analysis Of Maximum Likelihood Parameter Estimates
Parameter   DF   Estimate   Standard Error   Wald 95% Confidence Limits   Wald Chi-Square   Pr > ChiSq
Intercept    1    -1.5700           1.1657      -3.8548         0.7147              1.81       0.1780
Heat 7       1   -27.6129         264324.6      -518094       518039.0              0.00       0.9999
Heat 14      1    -3.0107           1.0025      -4.9756        -1.0458              9.02       0.0027
Heat 27      1    -1.7180           0.7691      -3.2253        -0.2106              4.99       0.0255
Soak 1       1    -0.2454           1.1455      -2.4906         1.9998              0.05       0.8304
Soak 1.7     1     0.5572           1.1217      -1.6412         2.7557              0.25       0.6193
Soak 2.2     1     0.4079           1.2260      -1.9951         2.8109              0.11       0.7394
Soak 2.8     1    -0.1301           1.4234      -2.9199         2.6597              0.01       0.9272
Scale        0     1.0000           0.0000       1.0000         1.0000

Note: The scale parameter was held fixed.


The "Exact Parameter Estimates" table in Output 4 displays parameter estimates and tests of significance for the levels of the CLASS variables versus the reference level. Again, the Heat=7 parameter has some difficulties; however, in the exact analysis, a median unbiased estimate is computed for the parameter instead of a maximum likelihood estimate. The confidence limits show that the Heat variable contains some explanatory power, while the categorical Soak variable is not significant: its confidence intervals are wide and include zero, and its p-values are all equal to 1.0000.

Output 4: Exact Parameter Estimates

Exact Parameter Estimates
Parameter   Estimate     Standard Error   95% Confidence Limits   Two-sided p-Value
Heat 7       -2.7552 *                .   -Infinity     -0.7864              0.0199
Heat 14      -3.0255             1.0128     -5.7450     -0.6194              0.0113
Heat 27      -1.7846             0.8065     -3.6779      0.2260              0.0844
Soak 1       -0.3231             1.1717     -2.8673      3.6754              1.0000
Soak 1.7      0.5375             1.1284     -1.8056      4.4588              1.0000
Soak 2.2      0.4035             1.2347     -2.5785      4.5054              1.0000
Soak 2.8     -0.1661             1.4214     -4.5490      4.2168              1.0000

Note: * indicates a median unbiased estimate.


The Joint test shown in the "Exact Conditional Tests" table in Output 5 is produced by specifying the JOINT option in the EXACT statement. The p-values for this test indicate that the parameters for Heat and Soak are jointly significant as explanatory effects in the model.

The rows of this table labeled "Heat" show the joint significance of all the Heat effect parameters conditional on Soak being in the model. In this case, the Heat parameters explain a significant amount of the variability; however, you can see that the Soak parameters, conditional on Heat being in the model, are not significant. A reasonable next step in this modeling process would be to remove the Soak variable from the model because it does not appear to be a significant predictor of the outcome variable NotReady.

Output 5: Exact Tests

The GENMOD Procedure

Exact Conditional Analysis

Exact Conditional Tests
Effect   Test          Statistic   Exact p-Value   Mid p-Value
Joint    Score           18.3665          0.0137        0.0137
         Probability    1.294E-6          0.0471        0.0471
Heat     Score           15.8259          0.0023        0.0022
         Probability    0.000175          0.0063        0.0062
Soak     Score            1.4612          0.8683        0.8646
         Probability     0.00735          0.8176        0.8139
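The suggested next step of dropping Soak could be sketched as follows. This is an illustrative refit, not an analysis reported in this example, and it assumes that the ingots data set created earlier is available. With a single effect in the EXACT statement, the JOINT option is omitted.

```sas
/* Reduced model: Heat only, same offset and distribution */
proc genmod data=ingots;
   class Heat / param=ref;
   model NotReady=Heat / offset=lnTotal dist=poisson link=log;
   exact Heat / estimate;
run;
```

Note that the exact test for Heat in this reduced model is no longer conditional on the Soak sufficient statistics, so its p-values need not match the Heat rows in Output 5.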


Note: If you want to make predictions from the exact results, you can obtain an estimate for the intercept parameter by specifying the INTERCEPT keyword in the EXACT statement. You can also remove the JOINT option to reduce the amount of time and memory consumed.
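A minimal sketch of the variation described in this note, assuming that the ingots data set created earlier is available: the INTERCEPT keyword is listed before the effects in the EXACT statement, and the JOINT option is left out.

```sas
/* Request an exact estimate for the intercept as well;      */
/* omitting JOINT reduces the time and memory that are used  */
proc genmod data=ingots;
   class Heat Soak / param=ref;
   model NotReady=Heat Soak / offset=lnTotal dist=poisson link=log;
   exact Intercept Heat Soak / estimate;
run;
```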

References

  • Cox, D. R., and Snell, E. J. (1989). The Analysis of Binary Data. 2nd ed. London: Chapman & Hall.

  • Hirji, K. F., Mehta, C. R., and Patel, N. R. (1987). “Computing Distributions for Exact Logistic Regression.” Journal of the American Statistical Association 82:1110–1117.

  • Stokes, M. E., Davis, C. S., and Koch, G. G. (2000). Categorical Data Analysis Using the SAS System. 2nd ed. Cary, NC: SAS Institute Inc.