The LOGISTIC Procedure

Conditional Logistic Regression

Subsections:

Computational Details
Regression Diagnostic Details

The method of maximum likelihood described in the preceding sections relies on large-sample asymptotic normality for the validity of estimates and especially of their standard errors. When you do not have a large sample size compared to the number of parameters, this approach might be inappropriate and might result in biased inferences. This situation typically arises when your data are stratified and you fit intercepts to each stratum so that the number of parameters is of the same order as the sample size. For example, in a $1 \colon 1$ matched pairs study with n pairs and p covariates, you would estimate $n-1$ intercept parameters and p slope parameters. Taking the stratification into account by "conditioning out" (and not estimating) the stratum-specific intercepts gives consistent and asymptotically normal MLEs for the slope coefficients. See Breslow and Day (1980) and Stokes, Davis, and Koch (2012) for more information. If your nuisance parameters are not just stratum-specific intercepts, you can perform an exact conditional logistic regression .

Computational Details

For each stratum h, $h=1,\ldots ,H$ , number the observations as $i=1,\ldots ,n_ h$ so that hi indexes the ith observation in stratum h. Denote the p covariates for the hith observation as $\mb{x}_{hi}$ and its binary response as $y_{hi}$ , and let $\mb{y} = (y_{11},\ldots ,y_{1n_1},\ldots ,y_{H1},\ldots ,y_{Hn_ H})’$ , $\bX _ h=(\mb{x}_{h1}\ldots \mb{x}_{hn_ h})’$ , and $\bX =(\bX _1’\ldots \bX _ H’)’$ . Let the dummy variables $z_ h, h=1,\ldots ,H$ , be indicator functions for the strata ( $z_ h=1$ if the observation is in stratum h), and denote $\mb{z}_{hi}=(z_1,\ldots ,z_ H)$ for the hith observation, $\bZ _ h=(\mb{z}_{h1}\ldots \mb{z}_{hn_ h})’$ , and $\bZ =(\bZ _1’\ldots \bZ _ H’)’$ . Denote $\bX ^{^*} = (\bZ | \bX$ ) and $\mb{x}^*_{hi}=(\mb{z}_{hi}’ | \mb{x}_{hi}’)’$ . Arrange the observations in each stratum h so that $y_{hi}=1$ for $i=1,\ldots ,m_ h$ , and $y_{hi}=0$ for $i=m_{h+1},\ldots ,n_ h$ . Suppose all observations have unit frequency.

Consider the binary logistic regression model written as

$\mbox{logit}(\bpi ) = \bX ^{^*}\btheta$

where the parameter vector $\btheta =(\balpha ’,\bbeta ’)’$ consists of $\balpha =(\alpha _{1},\ldots ,\alpha _{H})’$ , $\alpha _{h}$ is the intercept for stratum $h, h=1,\ldots ,H$ , and $\bbeta$ is the parameter vector for the p covariates.

From the section Determining Observations for Likelihood Contributions, you can write the likelihood contribution of observation $hi, i=1,\ldots ,n_ h, h=1,\ldots ,H,$ as

$L_{hi}(\btheta ) = \frac{e^{y_{hi}{\mb{x}_{hi}^*}'\btheta }}{1+e^{ {\mb{x}_{hi}^*}'\btheta } }$

where $y_{hi}=1$ when the response takes Ordered Value 1, and $y_{hi}=0$ otherwise.

The full likelihood is

$L(\btheta ) = \prod _{h=1}^{H} \prod _{i=1}^{n_ h} L_{hi}(\btheta ) = \frac{e^{\mb{y}'{\bX ^{^*}}\btheta }}{\prod _{h=1}^{H} \prod _{i=1}^{n_ h} \left(1+e^{{\mb{x}^*_{hi}}'\btheta }\right)}$

Unconditional likelihood inference is based on maximizing this likelihood function.

When your nuisance parameters are the stratum-specific intercepts $(\alpha _{1},\ldots ,\alpha _{H})’$ , and the slopes $\bbeta$ are your parameters of interest, "conditioning out" the nuisance parameters produces the conditional likelihood (Lachin, 2000)

$L(\bbeta ) = \prod _{h=1}^ H L_ h(\bbeta ) = \prod _{h=1}^ H\frac{\prod _{i=1}^{m_ h} \exp (\mb{x}_{hi}'\bbeta )}{\sum \prod _{j=j_1}^{j_{m_ h}}\exp (\mb{x}_{hj}'\bbeta )}$

where the summation is over all ${n_ h}\choose {m_ h}$ subsets $\{ j_1,\ldots ,j_{m_ h}\}$ of $m_ h$ observations chosen from the $n_ h$ observations in stratum h. Note that the nuisance parameters have been factored out of this equation.

For conditional asymptotic inference, maximum likelihood estimates ${\widehat{\bbeta }}$ of the regression parameters are obtained by maximizing the conditional likelihood, and asymptotic results are applied to the conditional likelihood function and the maximum likelihood estimators. A relatively fast method of computing this conditional likelihood and its derivatives is given by Gail, Lubin, and Rubinstein (1981) and Howard (1972). The optimization techniques can be controlled by specifying the NLOPTIONS statement.

Sometimes the log likelihood converges but the estimates diverge. This condition is flagged by having inordinately large standard errors for some of your parameter estimates, and can be monitored by specifying the ITPRINT option. Unfortunately, broad existence criteria such as those discussed in the section Existence of Maximum Likelihood Estimates do not exist for this model. It might be possible to circumvent such a problem by standardizing your independent variables before fitting the model.

Regression Diagnostic Details

Diagnostics are used to indicate observations that might have undue influence on the model fit or that might be outliers. Further investigation should be performed before removing such an observation from the data set.

The derivations in this section use an augmentation method described by Storer and Crowley (1985), which provides an estimate of the "one-step" DFBETAS estimates advocated by Pregibon (1984). The method also provides estimates of conditional stratum-specific predicted values, residuals, and leverage for each observation. The augmentation method can take a lot of time and memory.

Following Storer and Crowley (1985), the log-likelihood contribution can be written as

$\begin{eqnarray*} l_ h & =& \log (L_ h) = \mb{y}_ h’\bgamma _ h - a(\bgamma _ h)\quad \mbox{where} \\ a(\bgamma _ h) & =& \log \left[\sum \prod _{j=j_1}^{j_{m_ h}}\exp (\bgamma _{hj})\right] \end{eqnarray*}$

and the h subscript on matrices indicates the submatrix for the stratum, $\bgamma _ h=(\gamma _{h1},\ldots ,\gamma _{hn_ h})’$ , and $\gamma _{hi} = \mb{x}_{hi}’\bbeta$ . Then the gradient and information matrix are

$\begin{eqnarray*} \mb{g}({\bbeta }) & =& \left\{ \frac{\partial l_ h}{\partial \bbeta }\right\} _{h=1}^ H = \bX ’(\mb{y}-\bpi ) \\ \bLambda ({\bbeta }) & =& \left\{ \frac{\partial ^2 l_ h}{\partial \bbeta ^2}\right\} _{h=1}^ H = \bX ’\mbox{diag}(\bU _1,\ldots ,\bU _ H)\bX \end{eqnarray*}$

where

$\begin{eqnarray*} \pi _{hi} & =& \frac{\partial a(\bgamma _ h)}{\partial \gamma _{hi}} = \frac{\sum _{{j(i)}}\prod _{j=j_1}^{j_{m_ h}}\exp (\gamma _{hj})}{\sum \prod _{j=j_1}^{j_{m_ h}}\exp (\gamma _{hj})} \\ \bpi _ h & =& (\pi _{h1},\ldots ,\pi _{hn_ h}) \\ \bU _ h & =& \frac{\partial ^2 a(\bgamma _ h)}{\partial \bgamma _ h^2} = \left\{ \frac{\partial ^2\bm {a}(\bgamma _ h)}{\partial \gamma _{hi}\partial \gamma _{hj}}\right\} = \{ a_{ij}\} \\ a_{ij} & =& \frac{\sum _{{k(i,j)}}\prod _{k=k_1}^{k_{m_ h}}\exp (\gamma _{hk})}{\sum \prod _{k=k_1}^{k_{m_ h}}\exp (\gamma _{hk})} - \frac{\partial a(\bgamma _ h)}{\partial \gamma _{hi}}\frac{\partial a(\bgamma _ h)}{\partial \gamma _{hj}} = \pi _{hij} - \pi _{hi}\pi _{hj} \end{eqnarray*}$

and where $\pi _{hi}$ is the conditional stratum-specific probability that subject i in stratum h is a case, the summation on ${j(i)}$ is over all subsets from $\{ 1,\ldots ,n_ h\}$ of size $m_ h$ that contain the index i, and the summation on ${k(i,j)}$ is over all subsets from $\{ 1,\ldots ,n_ h\}$ of size $m_ h$ that contain the indices i and j.

To produce the true one-step estimate ${\bbeta }^1_{hi}$ , start at the MLE ${\widehat{\bbeta }}$ , delete the hith observation, and use this reduced data set to compute the next Newton-Raphson step. Note that if there is only one event or one nonevent in a stratum, deletion of that single observation is equivalent to deletion of the entire stratum. The augmentation method does not take this into account.

The augmented model is

$\mbox{logit}(\Pr (y_{hi}=1|\mb{x}_{hi})) = \mb{x}_{hi}’\bbeta + \mb{z}_{hi}’\gamma$

where $\mb{z}_{hi} = (0,\ldots ,0,1,0,\ldots ,0)’$ has a 1 in the hith coordinate, and use $\bbeta ^0 = ({\widehat{\bbeta }}’,0)’$ as the initial estimate for $(\bbeta ’,\gamma )’$ . The gradient and information matrix before the step are

$\begin{eqnarray*} \mb{g}(\bbeta ^0) & =& \left[\begin{array}{c}{\bX ’}\\ {\mb{z}_{hi}’}\end{array}\right](\mb{y}-{\bpi }) = \left[\begin{array}{c}{\bm {0}}\\ {y_{hi}-{\pi }_{hi}}\end{array}\right]\\ \bLambda (\bbeta ^0) & =& \left[\begin{array}{c}{\bX ’}\\ {\mb{z}_{hi}’}\end{array}\right]{\bU }\left[{\bX }\quad {\mb{z}_{hi}}\right] = \left[\begin{array}{cc}\bLambda ({\bbeta }) & \bX ’{\bU }\mb{z}_{hi}\\ \mb{z}_{hi}’{\bU }\bX & \mb{z}_{hi}’{\bU }\mb{z}_{hi} \end{array}\right] \end{eqnarray*}$

Inserting the $\bbeta ^0$ and $(\bX ’,\mb{z}_{hi}’)’$ into the Gail, Lubin, and Rubinstein (1981) algorithm provides the appropriate estimates of $\mb{g}(\bbeta ^0)$ and $\bLambda (\bbeta ^0)$ . Indicate these estimates with ${\widehat{\bpi }}=\bpi ({\widehat{\bbeta }})$ , $\widehat{\bU }=\bU ({\widehat{\bbeta }})$ , $\widehat{\mb{g}}$ , and $\widehat{\bLambda }$ .

DFBETA is computed from the information matrix as

$\begin{eqnarray*} \Delta _{hi}\bbeta & =& {\bbeta }^0 - {\bbeta }^1_{hi} \\ & =& -\widehat{\bLambda }^{-1}({\bbeta }^0)\widehat{\mb{g}}({\bbeta }^0) \\ & =& -\widehat{\bLambda }^{-1}({\widehat{\bbeta }})(\bX ’\widehat{\bU }\mb{z}_{hi}) \bM ^{-1} \mb{z}_{hi}’(\mb{y}-{\widehat{\bpi }}) \end{eqnarray*}$

where

$\begin{eqnarray*} \bM & =& (\mb{z}_{hi}’\widehat{\bU }\mb{z}_{hi}) - (\mb{z}_{hi}’\widehat{\bU }\bX )\widehat{\bLambda }^{-1} ({\widehat{\bbeta }})(\bX ’\widehat{\bU }\mb{z}_{hi}) \end{eqnarray*}$

For each observation in the data set, a DFBETA statistic is computed for each parameter $\beta _ j$ , $1\le j\le p$ , and standardized by the standard error of $\beta _ j$ from the full data set to produce the estimate of DFBETAS.

The estimated leverage is defined as

$h_{hi} = \frac{ \mbox{trace}\{ (\mb{z}_{hi}'\widehat{\bU }\bX )\widehat{\bLambda }^{-1} ({\widehat{\bbeta }})(\bX '\widehat{\bU }\mb{z}_{hi})\} }{ \mbox{trace}\{ \mb{z}_{hi}'\widehat{\bU }\mb{z}_{hi}\} }$

This definition of leverage produces different values from those defined by Pregibon (1984); Moolgavkar, Lustbader, and Venzon (1985); Hosmer and Lemeshow (2000); however, it has the advantage that no extra computations beyond those for the DFBETAS are required.

The estimated residuals $e_{hi}=y_{hi}-{\widehat{\pi }}_{hi}$ are obtained from $\widehat{\mb{g}}(\bbeta ^0)$ , and the weights, or predicted probabilities, are then ${\widehat{\pi }}_{hi} = y_{hi}-e_{hi}$ . The residuals are standardized and reported as (estimated) Pearson residuals:

$\frac{r_{hi}-n_{hi}{\widehat{\pi }}_{hi}}{\sqrt {n_{hi}{\widehat{\pi }}_{hi}(1-{\widehat{\pi }}_{hi})}}$

where $r_{hi}$ is the number of events in the observation and $n_{hi}$ is the number of trials.

The STDRES option in the INFLUENCE option computes the standardized Pearson residual:

$e_{s,hi}=\frac{e_{hi}}{\sqrt {1-h_{hi}}}$

For events/trials MODEL statement syntax, treat each observation as two observations (the first for the nonevents and the second for the events) with frequencies $f_{h,2i-1}=n_{hi}-r_{hi}$ and $f_{h,2i}=r_{hi}$ , and augment the model with a matrix $\bZ _{hi} = [\mb{z}_{h,2i-1} \mb{z}_{h,2i}]$ instead of a single $\mb{z}_{hi}$ vector. Writing $\gamma _{hi} = \mb{x}_{hi}’\bbeta f_{hi}$ in the preceding section results in the following gradient and information matrix:

$\begin{eqnarray*} \mb{g}(\bbeta ^0) & =& \left[\begin{array}{c}{\bm {0}}\\ {f_{h,2i-1}(y_{h,2i-1}-{\pi }_{h,2i-1})}\\ {f_{h,2i}(y_{h,2i}-{\pi }_{h,2i})}\\ \end{array}\right]\\ \bLambda (\bbeta ^0) & =& \left[\begin{array}{cc}\bLambda ({\bbeta }) & \bX ’\mbox{diag}(\mb{f}){\bU }\mbox{diag}(\mb{f})\bZ _{hi}\\ \bZ _{hi}’\mbox{diag}(\mb{f}){\bU }\mbox{diag}(\mb{f})\bX & \bZ _{hi}’\mbox{diag}(\mb{f}){\bU }\mbox{diag}(\mb{f})\bZ _{hi} \end{array}\right] \end{eqnarray*}$

The predicted probabilities are then ${\widehat{\pi }}_{hi} = y_{h,2i}-e_{h,2i}/r_{h,2i}$ , while the leverage and the DFBETAS are produced from $\bLambda (\bbeta ^0)$ in a fashion similar to that for the preceding single-trial equations.