The SURVEYLOGISTIC Procedure

Logistic Regression Models and Parameters

Subsections:

Notation
Logistic Regression Models
Likelihood Function

The SURVEYLOGISTIC procedure fits a logistic regression model and estimates the corresponding regression parameters. Each model uses the link function that you specify in the LINK= option in the MODEL statement. The procedure supports four types of model: the cumulative logit model, the complementary log-log model, the probit model, and the generalized logit model.

Notation

Let Y be the response variable with categories $1,2, \ldots , D, D+1$ . The k covariates are denoted by a k-dimensional row vector $\mb{x}$ .

For a stratified clustered sample design, each observation is represented by a row vector, $(w_{hij}, \mb{y}_{hij}', y_{hij(D+1)}, \mb{x}_{hij})$ , where

$h=1, 2, \ldots , H$ is the stratum index
$i=1, 2, \ldots , n_ h$ is the cluster index within stratum h
$j=1, 2, \ldots , m_{hi}$ is the unit index within cluster i of stratum h
$w_{hij}$ denotes the sampling weight
$\mb{y}_{hij}$ is a D-dimensional column vector whose elements are indicator variables for the first D categories for variable Y. If the response of the jth unit of the ith cluster in stratum h falls in category d, the dth element of the vector is one, and the remaining elements of the vector are zero, where $d=1, 2, \ldots , D$ .
$y_{hij(D+1)}$ is the indicator variable for the $(D+1)$th category of variable Y
$\mb{x}_{hij}$ denotes the k-dimensional row vector of explanatory variables for the jth unit of the ith cluster in stratum h. If there is an intercept, then $x_{hij1}\equiv 1$ .
$\tilde n=\sum _{h=1}^ H n_ h$ is the total number of clusters in the sample
$n=\sum _{h=1}^ H \sum _{i=1}^{n_ h} {m_{hi}}$ is the total sample size

The following notation is also used:

$f_ h$ denotes the sampling rate for stratum h
$\bpi _{hij}$ is the expected vector of the response variable:

$\begin{eqnarray*} {\bpi }_{hij} & =& E(\mb{y}_{hij}|\mb{x}_{hij}) \\ & =& (\pi _{hij1}, \pi _{hij2}, \ldots , \pi _{hijD})' \\ \pi _{hij(D+1)} & =& E(y_{hij(D+1)}|\mb{x}_{hij}) \end{eqnarray*}$

Note that $\pi _{hij(D+1)}=1-\mb{1}' {\bpi }_{hij}$ , where $\mb{1}$ is a D-dimensional column vector whose elements are all 1.
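As a small illustration of this notation, the following Python sketch encodes one unit's response as the indicator vector $\mb{y}_{hij}$ together with $y_{hij(D+1)}$, and recovers $\pi _{hij(D+1)}$ from the first D expected proportions. The function names are hypothetical and are not part of the procedure.

```python
def indicator_vector(category, D):
    """Encode a response in category 1..D+1 as (y, y_last).

    y is the D-element indicator list for the first D categories;
    y_last is the indicator for category D+1.
    """
    if not 1 <= category <= D + 1:
        raise ValueError("category must be between 1 and D+1")
    y = [1 if d == category else 0 for d in range(1, D + 1)]
    y_last = 1 if category == D + 1 else 0
    return y, y_last


def last_category_probability(pi):
    """Return pi_(D+1) = 1 - 1'pi for a list pi of the first D proportions."""
    return 1.0 - sum(pi)


# A response in category 2 of D + 1 = 4 categories.
y, y_last = indicator_vector(2, D=3)
```

A response in category D+1 yields a zero vector for `y` and `y_last = 1`, which matches the definitions above.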

Logistic Regression Models

If the categories of the response variable Y are ordinal, you can fit the cumulative probabilities of the response categories with a cumulative logit model, a complementary log-log model, or a probit model. Details of cumulative logit models (or proportional odds models) can be found in McCullagh and Nelder (1989). If the response categories of Y are nominal, with no natural ordering, you can fit the response probabilities with a generalized logit model. The formulation of generalized logit models for nominal response variables can be found in Agresti (2002). For each model, the procedure estimates the model parameter $\btheta$ by using a pseudo-log-likelihood function. The procedure obtains the pseudo-maximum likelihood estimator $\hat{\btheta }$ by using the iterations described in the section Iterative Algorithms for Model Fitting and estimates its variance as described in the section Variance Estimation.

Cumulative Logit Model

A cumulative logit model uses the logit function

$g(t)=\log \left( \frac{t}{1-t} \right)$

as the link function.

Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

$F_{hijd}=\sum _{r=1}^ d \pi _{hijr}$

for $d=1, 2, \ldots , D.$ Then the cumulative logit model can be written as

$\log \left(\frac{F_{hijd}}{1-F_{hijd}}\right) = \alpha _ d+\mb{x}_{hij}\bbeta$

with the model parameters

$\begin{eqnarray*} \bbeta & = & (\beta _1, \beta _2, \ldots , \beta _ k)'\\ \balpha & =& (\alpha _1, \alpha _2, \ldots , \alpha _ D)', \, \, \, \alpha _1<\alpha _2<\cdots <\alpha _ D \\ \btheta & = & (\balpha ',\bbeta ')' \end{eqnarray*}$
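To make the model concrete, the following Python sketch inverts the logit link to obtain the cumulative proportions $F_{hijd}$ and then differences them to get the individual category proportions. The parameter values are made up for illustration, and the sketch assumes that the covariate vector carries no separate intercept because the $\alpha _ d$ play that role.

```python
import math


def expit(t):
    """Inverse of the logit link g(t) = log(t / (1 - t))."""
    return 1.0 / (1.0 + math.exp(-t))


def cumulative_logit_probs(alpha, beta, x):
    """Return [pi_1, ..., pi_(D+1)] under the cumulative logit model.

    alpha: increasing intercepts alpha_1 < ... < alpha_D
    beta:  slope coefficients (hypothetical values)
    x:     covariate values, same length as beta
    """
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # F_d = expit(alpha_d + x beta) for d = 1..D, and F_(D+1) = 1.
    cumulative = [expit(a + eta) for a in alpha] + [1.0]
    # pi_1 = F_1 and pi_d = F_d - F_(d-1) for d > 1.
    probs = [cumulative[0]]
    for d in range(1, len(cumulative)):
        probs.append(cumulative[d] - cumulative[d - 1])
    return probs


probs = cumulative_logit_probs(alpha=[-1.0, 0.5], beta=[0.8], x=[0.25])
```

The ordering constraint on the $\alpha _ d$ is what keeps every differenced proportion positive.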

Complementary Log-Log Model

A complementary log-log model uses the complementary log-log function

$g(t)=\log (-\log (1-t))$

as the link function. Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

$F_{hijd}=\sum _{r=1}^ d \pi _{hijr}$

for $d=1, 2, \ldots , D.$ Then the complementary log-log model can be written as

$\log (-\log (1-F_{hijd}))=\alpha _ d+\mb{x}_{hij}\bbeta$

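with the model parameters

$\begin{eqnarray*} \bbeta & = & (\beta _1, \beta _2, \ldots , \beta _ k)'\\ \balpha & =& (\alpha _1, \alpha _2, \ldots , \alpha _ D)', \, \, \, \alpha _1<\alpha _2<\cdots <\alpha _ D \\ \btheta & = & (\balpha ',\bbeta ')' \end{eqnarray*}$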
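The complementary log-log link can be inverted in closed form as $F = 1-\exp (-\exp (\eta ))$. The following Python sketch, with hypothetical function names and made-up parameter values, applies that inverse to compute the cumulative proportions.

```python
import math


def cloglog(t):
    """Complementary log-log link g(t) = log(-log(1 - t)), 0 < t < 1."""
    return math.log(-math.log(1.0 - t))


def inv_cloglog(eta):
    """Inverse link: F = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def cloglog_cumulative(alpha, beta, x):
    """Return [F_1, ..., F_D] where F_d = inv_cloglog(alpha_d + x beta)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return [inv_cloglog(a + eta) for a in alpha]


# Illustrative values only; alpha must be increasing.
F = cloglog_cumulative(alpha=[-1.0, 0.0, 1.0], beta=[0.5], x=[1.0])
```

Unlike the logit link, this link is not symmetric about 0.5, so reversing the order of the response categories generally gives a different fit.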

Probit Model

A probit model uses the probit (or normit) function, which is the inverse of the cumulative standard normal distribution function,

$g(t)=\Phi ^{-1}(t)$

as the link function, where

$\Phi (t)=\frac{1}{\sqrt {2\pi }}\int _{-\infty }^ t e^{-\frac{1}{2} z^2} dz$

Denote the cumulative sum of the expected proportions for the first d categories of variable Y by

$F_{hijd}=\sum _{r=1}^ d \pi _{hijr}$

for $d=1, 2, \ldots , D.$ Then the probit model can be written as

$F_{hijd}=\Phi (\alpha _ d+\mb{x}_{hij}\bbeta )$

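with the model parameters

$\begin{eqnarray*} \bbeta & = & (\beta _1, \beta _2, \ldots , \beta _ k)'\\ \balpha & =& (\alpha _1, \alpha _2, \ldots , \alpha _ D)', \, \, \, \alpha _1<\alpha _2<\cdots <\alpha _ D \\ \btheta & = & (\balpha ',\bbeta ')' \end{eqnarray*}$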
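The following Python sketch evaluates the probit model numerically. It computes $\Phi$ through the error function, uses the standard library's `statistics.NormalDist` for the inverse, and differences the cumulative proportions to get category proportions. The helper names and parameter values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def std_normal_cdf(t):
    """Phi(t), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def probit(p):
    """Probit link g(p) = Phi^{-1}(p), 0 < p < 1."""
    return NormalDist().inv_cdf(p)


def probit_category_probs(alpha, beta, x):
    """Return [pi_1, ..., pi_(D+1)] where F_d = Phi(alpha_d + x beta)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    cumulative = [std_normal_cdf(a + eta) for a in alpha] + [1.0]
    return [cumulative[0]] + [
        cumulative[d] - cumulative[d - 1] for d in range(1, len(cumulative))
    ]


# Illustrative values only; alpha must be increasing.
probs = probit_category_probs(alpha=[-0.5, 0.5], beta=[0.3], x=[1.0])
```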

Generalized Logit Model

A generalized logit model for a nominal response uses a logit link function to model the ratio of the expected proportion for each response category to the expected proportion of a reference category.

Without loss of generality, let category $D+1$ be the reference category for the response variable Y. Denote the expected proportion for the dth category by $\pi _{hijd}$ as in the section Notation. Then the generalized logit model can be written as

$\log \left(\frac{\pi _{hijd}}{\pi _{hij(D+1)}}\right) =\mb{x}_{hij}\bbeta _ d$

for $d=1, 2, \ldots , D,$ with the model parameters

$\begin{eqnarray*} \bbeta _ d & = & (\beta _{d1}, \beta _{d2}, \ldots , \beta _{dk})'\\ \btheta & = & ( \bbeta _1', \bbeta _2', \ldots , \bbeta _ D')' \end{eqnarray*}$
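Solving the D log-ratio equations together with the constraint that the proportions sum to 1 gives $\pi _{hijd}=\exp (\mb{x}_{hij}\bbeta _ d)/(1+\sum _{r=1}^ D \exp (\mb{x}_{hij}\bbeta _ r))$ and $\pi _{hij(D+1)}=1/(1+\sum _{r=1}^ D \exp (\mb{x}_{hij}\bbeta _ r))$. The following Python sketch evaluates these expressions. The function name and coefficient values are hypothetical.

```python
import math


def generalized_logit_probs(betas, x):
    """Return [pi_1, ..., pi_D, pi_(D+1)] with category D+1 as reference.

    betas: list of D coefficient lists, one beta_d per non-reference category
    x:     covariate values (first entry 1.0 if the model has an intercept)
    """
    etas = [sum(xi * bi for xi, bi in zip(x, beta_d)) for beta_d in betas]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    probs = [math.exp(eta) / denom for eta in etas]
    probs.append(1.0 / denom)  # reference category D+1
    return probs


# Illustrative values only: D = 2, intercept plus one covariate.
x = [1.0, 2.0]
betas = [[0.2, -0.1], [-0.5, 0.3]]
probs = generalized_logit_probs(betas, x)
```

Taking `math.log(probs[d] / probs[-1])` recovers the linear predictor for category d+1, which is the model equation above.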

Likelihood Function

Let $\mb{g}(\cdot )$ be a link function such that

$\bpi =\mb{g}(\mb{x}, \btheta )$

where $\btheta$ is a column vector of regression coefficients. The pseudo-log-likelihood is

$l(\btheta ) = \sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} w_{hij} \left( (\log (\bpi _{hij}))'\mb{y}_{hij}+ \log (\pi _{hij(D+1)})y_{hij(D+1)} \right)$
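Because exactly one of the D+1 indicators equals 1 for each unit, the term in parentheses reduces to the log of the fitted proportion for the observed category. The following Python sketch computes the weighted sum on that basis. The flat list of observations stands in for the triple sum over strata, clusters, and units, and the data layout is an assumption made for illustration.

```python
import math


def pseudo_log_likelihood(observations):
    """Weighted log-likelihood over sampled units.

    observations: iterable of (weight, category, probs), where category
    is in 1..D+1 and probs lists the D+1 fitted proportions for that unit.
    """
    total = 0.0
    for weight, category, probs in observations:
        # Only the observed category's indicator is 1, so only its
        # log-proportion contributes for this unit.
        total += weight * math.log(probs[category - 1])
    return total


# Two units with made-up sampling weights and fitted proportions.
obs = [
    (2.0, 1, [0.5, 0.3, 0.2]),
    (1.5, 3, [0.1, 0.1, 0.8]),
]
value = pseudo_log_likelihood(obs)
```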

Denote the pseudo-estimator as $\hat{\btheta }$ , which is a solution to the estimating equations:

$\sum _{h=1}^ H\sum _{i=1}^{n_ h} \sum _{j=1}^{m_{hi}} w_{hij}\mb{D}_{hij} \left(\mr{diag}(\bpi _{hij})-\bpi _{hij}\bpi _{hij}'\right)^{-1} (\mb{y}_{hij}-\bpi _{hij})= \mb{0}$

where $\mb{D}_{hij}$ is the matrix of partial derivatives of the link function $\mb{g}$ with respect to $\btheta$ .
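For a binary response ($D=1$) with the logit link, $\mb{D}_{hij}=\pi _{hij1}(1-\pi _{hij1})\mb{x}_{hij}'$ and the middle factor is the scalar $\pi _{hij1}(1-\pi _{hij1})$, so the estimating equations reduce to $\sum w_{hij}\mb{x}_{hij}'(y_{hij1}-\pi _{hij1})=\mb{0}$. The following Python sketch evaluates that reduced form and checks it on an intercept-only model, where the solution is the logit of the weighted proportion of responses in category 1. This special case is worked out here for illustration only; the data and function names are made up.

```python
import math


def expit(t):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-t))


def binary_score(theta, data):
    """Left side of the reduced estimating equations for D = 1.

    data: list of (weight, y, x), where y is 1 for category 1 and 0
    otherwise, and x is the covariate list (1.0 first for an intercept).
    """
    score = [0.0] * len(theta)
    for weight, y, x in data:
        pi = expit(sum(xi * ti for xi, ti in zip(x, theta)))
        for r, xr in enumerate(x):
            score[r] += weight * xr * (y - pi)
    return score


# Intercept-only model with made-up weights and responses.
data = [(2.0, 1, [1.0]), (1.0, 0, [1.0]), (1.0, 1, [1.0])]
weighted_mean = sum(w * y for w, y, _ in data) / sum(w for w, _, _ in data)
theta_hat = [math.log(weighted_mean / (1.0 - weighted_mean))]
score_at_solution = binary_score(theta_hat, data)
```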

To obtain the pseudo-estimator $\hat{\btheta }$ , the procedure uses iterations with a starting value $\btheta ^{(0)}$ for $\btheta$ . See the section Iterative Algorithms for Model Fitting for more details.