The SURVEYLOGISTIC Procedure

Logistic Regression Models and Parameters

The SURVEYLOGISTIC procedure fits a logistic regression model and estimates the corresponding regression parameters. Each model uses the link function that you specify in the LINK= option in the MODEL statement. You can fit four types of models with the procedure: the cumulative logit model, the complementary log-log model, the probit model, and the generalized logit model.

Notation

Let Y be the response variable with categories $1,2, \ldots , D, D+1$. The k covariates are denoted by a k-dimensional row vector $\mb{x}$.

For a stratified clustered sample design, each observation is represented by a row vector, $ (w_{hij}, \mb{y}_{hij}’, y_{hij(D+1)}, \mb{x}_{hij}) $, where

  • $h=1, 2, \ldots , H$ is the stratum index

  • $i=1, 2, \ldots , n_ h$ is the cluster index within stratum h

  • $j=1, 2, \ldots , m_{hi}$ is the unit index within cluster i of stratum h

  • $w_{hij}$ denotes the sampling weight

  • $\mb{y}_{hij}$ is a D-dimensional column vector whose elements are indicator variables for the first D categories for variable Y. If the response of the jth unit of the ith cluster in stratum h falls in category d, the dth element of the vector is one, and the remaining elements of the vector are zero, where $d=1, 2, \ldots , D$.

  • $y_{hij(D+1)}$ is the indicator variable for the $(D+1)$ category of variable Y

  • $\mb{x}_{hij}$ denotes the k-dimensional row vector of explanatory variables for the jth unit of the ith cluster in stratum h. If there is an intercept, then $x_{hij1}\equiv 1$.

  • $\tilde n=\sum _{h=1}^ H n_ h$ is the total number of clusters in the sample

  • $n=\sum _{h=1}^ H \sum _{i=1}^{n_ h} {m_{hi}}$ is the total sample size

The following notation is also used:

  • $f_ h$ denotes the sampling rate for stratum h

  • $\bpi _{hij}$ is the expected vector of the response variable:

    \begin{eqnarray*} {\bpi }_{hij} & =& E(\mb{y}_{hij}|\mb{x}_{hij}) \\ & =& (\pi _{hij1}, \pi _{hij2}, \ldots , \pi _{hijD})’ \\ \pi _{hij(D+1)} & =& E(y_{hij(D+1)}|\mb{x}_{hij}) \end{eqnarray*}

Note that $\pi _{hij(D+1)}=1-\mb{1}’ {\bpi }_{hij}$, where $\mb{1}$ is a D-dimensional column vector whose elements are 1.
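The indicator-vector notation above can be illustrated with a minimal Python sketch; the function name and argument layout are illustrative, not part of the procedure:

```python
def indicator_vector(response, D):
    # One-hot encoding of a response in categories 1, ..., D+1:
    # element d is 1 if the response falls in category d, else 0.
    # The first D elements form y_hij; the last is y_hij(D+1).
    return [1 if response == d else 0 for d in range(1, D + 2)]
```

For example, a response in category 2 of a four-category variable (D = 3) yields the vector (0, 1, 0, 0), whose elements always sum to one.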

Logistic Regression Models

If the response categories of the response variable Y can be restricted to a number of ordinal values, you can fit cumulative probabilities of the response categories with a cumulative logit model, a complementary log-log model, or a probit model. Details of cumulative logit models (also known as proportional odds models) can be found in McCullagh and Nelder (1989). If the response categories of Y are nominal responses without natural ordering, you can fit the response probabilities with a generalized logit model. Formulation of generalized logit models for nominal response variables can be found in Agresti (2002). For each model, the procedure estimates the model parameter $\btheta $ by using a pseudo-log-likelihood function. The procedure obtains the pseudo-maximum likelihood estimator $\hat{\btheta }$ by using the iterations described in the section Iterative Algorithms for Model Fitting and estimates its variance as described in the section Variance Estimation.

Cumulative Logit Model

A cumulative logit model uses the logit function

\[ g(t)=\log \left( \frac{t}{1-t} \right) \]

as the link function.

Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

\[ F_{hijd}=\sum _{r=1}^ d \pi _{hijr} \]

for $d=1, 2, \ldots , D.$ Then the cumulative logit model can be written as

\[ \log \left(\frac{F_{hijd}}{1-F_{hijd}}\right) = \alpha _ d+\mb{x}_{hij}\bbeta \]

with the model parameters

\begin{eqnarray*} \bbeta & = & (\beta _1, \beta _2, \ldots , \beta _ k)’\\ \balpha & =& (\alpha _1, \alpha _2, \ldots , \alpha _ D)’, \, \, \, \alpha _1<\alpha _2<\cdots <\alpha _ D \\ \btheta & = & (\balpha ’,\bbeta ’)’ \end{eqnarray*}
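The mapping from the cumulative logit parameters back to the individual category probabilities $\pi _{hijd} = F_{hijd} - F_{hij(d-1)}$ can be sketched in Python (a minimal sketch with hypothetical function and argument names; the procedure performs this computation internally):

```python
import math

def cumulative_logit_probs(alpha, beta, x):
    # alpha: increasing intercepts (alpha_1 < ... < alpha_D)
    # beta:  slope vector shared by all categories
    # x:     covariate row vector (intercepts enter through alpha)
    eta = sum(b * xv for b, xv in zip(beta, x))
    # Cumulative probabilities F_d = expit(alpha_d + x beta)
    F = [1.0 / (1.0 + math.exp(-(a + eta))) for a in alpha]
    # Recover individual category probabilities pi_d = F_d - F_{d-1}
    probs = [F[0]] + [F[d] - F[d - 1] for d in range(1, len(F))]
    probs.append(1.0 - F[-1])  # category D+1
    return probs
```

Because the intercepts are ordered ($\alpha _1<\alpha _2<\cdots <\alpha _ D$) and the slope vector is shared, the cumulative probabilities are increasing in d and all category probabilities are positive.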
Complementary Log-Log Model

A complementary log-log model uses the complementary log-log function

\[ g(t)=\log (-\log (1-t)) \]

as the link function. Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

\[ F_{hijd}=\sum _{r=1}^ d \pi _{hijr} \]

for $d=1, 2, \ldots , D.$ Then the complementary log-log model can be written as

\[ \log (-\log (1-F_{hijd}))=\alpha _ d+\mb{x}_{hij}\bbeta \]

with the model parameters

\begin{eqnarray*} \bbeta & = & (\beta _1, \beta _2, \ldots , \beta _ k)’\\ \balpha & =& (\alpha _1, \alpha _2, \ldots , \alpha _ D)’, \, \, \, \alpha _1<\alpha _2<\cdots <\alpha _ D \\ \btheta & = & (\balpha ’,\bbeta ’)’ \end{eqnarray*}
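Inverting the complementary log-log link gives $F_{hijd}=1-\exp (-\exp (\alpha _ d+\mb{x}_{hij}\bbeta ))$, which can be sketched as follows (a minimal Python sketch with hypothetical names):

```python
import math

def cloglog_cumulative(alpha, beta, x):
    # Inverse of the complementary log-log link:
    # F_d = 1 - exp(-exp(alpha_d + x beta)) for d = 1, ..., D
    eta = sum(b * xv for b, xv in zip(beta, x))
    return [1.0 - math.exp(-math.exp(a + eta)) for a in alpha]
```

Applying the link $g(t)=\log (-\log (1-t))$ to each $F_ d$ recovers the linear predictor $\alpha _ d+\mb{x}\bbeta $ exactly.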
Probit Model

A probit model uses the probit (or normit) function, which is the inverse of the cumulative standard normal distribution function,

\[ g(t)=\Phi ^{-1}(t) \]

as the link function, where

\[ \Phi (t)=\frac{1}{\sqrt {2\pi }}\int _{-\infty }^ t e^{-\frac{1}{2} z^2} dz \]

Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

\[ F_{hijd}=\sum _{r=1}^ d \pi _{hijr} \]

for $d=1, 2, \ldots , D.$ Then the probit model can be written as

\[ F_{hijd}=\Phi (\alpha _ d+\mb{x}_{hij}\bbeta ) \]

with the model parameters

\begin{eqnarray*} \bbeta & = & (\beta _1, \beta _2, \ldots , \beta _ k)’\\ \balpha & =& (\alpha _1, \alpha _2, \ldots , \alpha _ D)’, \, \, \, \alpha _1<\alpha _2<\cdots <\alpha _ D \\ \btheta & = & (\balpha ’,\bbeta ’)’ \end{eqnarray*}
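The cumulative probabilities of the probit model can be sketched with the standard normal CDF expressed through the error function, $\Phi (t)=\tfrac {1}{2}(1+\mr{erf}(t/\sqrt {2}))$ (a minimal Python sketch; function names are hypothetical):

```python
import math

def normal_cdf(t):
    # Phi(t) expressed through the error function:
    # Phi(t) = (1 + erf(t / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def probit_cumulative(alpha, beta, x):
    # F_d = Phi(alpha_d + x beta) for d = 1, ..., D
    eta = sum(b * xv for b, xv in zip(beta, x))
    return [normal_cdf(a + eta) for a in alpha]
```

As with the other cumulative models, the ordering of the intercepts guarantees $F_{hij1}<F_{hij2}<\cdots <F_{hijD}$.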
Generalized Logit Model

For a nominal response, a generalized logit model fits the ratio of the expected proportion for each response category to the expected proportion of a reference category by using a logit link function.

Without loss of generality, let category $D+1$ be the reference category for the response variable Y. Denote the expected proportion for the dth category by $\pi _{hijd}$ as in the section Notation. Then the generalized logit model can be written as

\[ \log \left(\frac{\pi _{hijd}}{\pi _{hij(D+1)}}\right) =\mb{x}_{hij}\bbeta _ d \]

for $d=1, 2, \ldots , D,$ with the model parameters

\begin{eqnarray*} \bbeta _ d & = & (\beta _{d1}, \beta _{d2}, \ldots , \beta _{dk})’\\ \btheta & = & ( \bbeta _1’, \bbeta _2’, \ldots , \bbeta _ D’)’ \end{eqnarray*}
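Solving the generalized logit equations for the category probabilities gives a softmax-type expression in which the reference category $D+1$ has linear predictor zero. A minimal Python sketch (with hypothetical names):

```python
import math

def generalized_logit_probs(betas, x):
    # betas: D slope vectors, one per non-reference category
    # x:     covariate row vector; the reference category D+1
    #        has linear predictor 0 by construction
    etas = [sum(b * xv for b, xv in zip(beta_d, x)) for beta_d in betas]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1.0 / denom)  # reference category D+1
    return probs
```

Taking $\log (\pi _ d/\pi _{D+1})$ of the returned probabilities recovers the linear predictor $\mb{x}\bbeta _ d$, which is exactly the model equation above.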

Likelihood Function

Let $\mb{g}(\cdot )$ be a link function such that

\[ \bpi =\mb{g}(\mb{x}, \btheta ) \]

where $\btheta $ is the column vector of regression coefficients. The pseudo-log-likelihood is

\[ l(\btheta ) = \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} w_{hij} \left( (\log (\bpi _{hij}))’\mb{y}_{hij}+ \log (\pi _{hij(D+1)})y_{hij(D+1)} \right) \]
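With the design indices h, i, j flattened into a single observation list, the weighted sum above can be sketched as follows (a minimal Python sketch; the function name is hypothetical):

```python
import math

def pseudo_log_likelihood(weights, ys, probs):
    # weights: sampling weight w_hij for each observation
    # ys:      one-hot response vector over all D+1 categories per observation
    # probs:   model probabilities over all D+1 categories per observation
    ll = 0.0
    for w, y, p in zip(weights, ys, probs):
        # Only the log probability of the observed category contributes,
        # scaled by the observation's sampling weight.
        ll += w * sum(yd * math.log(pd) for yd, pd in zip(y, p))
    return ll
```

With all weights equal to one, this reduces to the ordinary multinomial log likelihood.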

Denote the pseudo-maximum likelihood estimator by $\hat{\btheta }$; it is a solution to the estimating equations

\[ \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} w_{hij}\mb{D}_{hij} \left(\mr{diag}(\bpi _{hij})-\bpi _{hij}\bpi _{hij}’\right)^{-1} (\mb{y}_{hij}-\bpi _{hij})= \mb{0} \]

where $\mb{D}_{hij}$ is the matrix of partial derivatives of the link function $\mb{g}$ with respect to $\btheta $.
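For the binary case ($D=1$) with a logit link, the derivative matrix is $\mb{D}_{hij}=\pi _{hij}(1-\pi _{hij})\mb{x}_{hij}’$ and the inverse variance term is $1/(\pi _{hij}(1-\pi _{hij}))$, so the estimating equations collapse to $\sum w_{hij}\mb{x}_{hij}’(y_{hij}-\pi _{hij})=\mb{0}$. A minimal Python sketch of this special case (function and argument names are hypothetical):

```python
import math

def score_binary_logit(weights, X, ys, theta):
    # For D = 1 with a logit link, the estimating equations reduce to
    # sum_i w_i x_i' (y_i - pi_i), because D_i = pi_i (1 - pi_i) x_i'
    # cancels the inverse variance term pi_i (1 - pi_i).
    k = len(theta)
    score = [0.0] * k
    for w, x, y in zip(weights, X, ys):
        eta = sum(t * xv for t, xv in zip(theta, x))
        pi = 1.0 / (1.0 + math.exp(-eta))
        for r in range(k):
            score[r] += w * x[r] * (y - pi)
    return score
```

The pseudo-maximum likelihood estimate is the parameter value at which this score vector is zero.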

To obtain the pseudo-estimator $\hat{\btheta }$, the procedure uses iterations with a starting value $\btheta ^{(0)}$ for $\btheta $. See the section Iterative Algorithms for Model Fitting for more details.