The GLIMMIX Procedure

Maximum Likelihood Estimation Based on Laplace Approximation

Objective Function

Let $\bbeta$ denote the vector of fixed-effects parameters and $\btheta$ the vector of covariance parameters. For Laplace estimation in the GLIMMIX procedure, $\btheta$ includes the G-side parameters and a possible scale parameter $\phi$ , provided that the conditional distribution of the data contains such a scale parameter. $\btheta ^*$ is the vector of the G-side parameters.

The marginal distribution of the data in a mixed model can be expressed as

$\begin{align*} p(\mb{y}) & = \int p(\mb{y} | \bgamma ,\bbeta ,\phi ) \, p(\bgamma |\btheta ^*) \, \, d\bgamma \\ & = \int \exp \left\{ \log \{ p(\mb{y}|\bgamma ,\bbeta ,\phi ) \} + \log \{ p(\bgamma |\btheta ^*) \} \right\} \, \, d\bgamma \\ & = \int \exp \left\{ c_ l f(\mb{y},\bbeta ,\btheta ;\bgamma ) \right\} \, \, d\bgamma \end{align*}$

If the constant $c_ l$ is large, the Laplace approximation of this integral is

$L(\bbeta ,\btheta ;\widehat{\bgamma },\mb{y}) = \left( \frac{2\pi }{c_ l} \right) ^{n_\gamma / 2} | -f”(\mb{y},\bbeta ,\btheta ;\widehat{\bgamma }) |^{-1/2} \, e^{c_ l f(\mb{y},\bbeta ,\btheta ;\widehat{\bgamma })}$

where $n_\gamma$ is the number of elements in $\bgamma$ , $f”$ is the second derivative matrix

$f”(\mb{y},\bbeta ,\btheta ;\widehat{\bgamma }) = \frac{\partial ^2 f(\mb{y},\bbeta ,\btheta ;\bgamma )}{\partial \bgamma \partial \bgamma '} |_{\widehat{\bgamma }}$

and $\widehat{\bgamma }$ satisfies the first-order condition

$\frac{\partial f(\mb{y},\bbeta ,\btheta ;\bgamma )}{\partial \bgamma } = \mb{0}$

The objective function for Laplace parameter estimation in the GLIMMIX procedure is $-2 \log \{ L(\bbeta ,\btheta \; \widehat{\bgamma },\mb{y})\}$ . The optimization process is singly iterative, but because $\widehat{\bgamma }$ depends on $\widehat{\bbeta }$ and $\widehat{\btheta }$ , the GLIMMIX procedure solves a suboptimization problem to determine for given values of $\widehat{\bbeta }$ and $\widehat{\btheta }$ the random-effects solution vector that maximizes $f(\mb{y},\bbeta ,\btheta ;\bgamma )$ .

When you have longitudinal or clustered data with m independent subjects or clusters, the vector of observations can be written as $\mb{y} = [\mb{y}_1’,\ldots ,\mb{y}_ m’]’$ , where $\mb{y}_ i$ is an $n_ i \times 1$ vector of observations for subject (cluster) i ( $i=1,\ldots ,m$ ). In this case, assuming conditional independence such that

$p(\mb{y}_ i | \bgamma _ i) = \prod _{j=1}^{n_ i} p(y_{ij}|\bgamma _ i)$

the marginal distribution of the data can be expressed as

$\begin{align*} p(\mb{y}) = \prod _{i=1}^{m} p(\mb{y}_ i) & = \prod _{i=1}^{m} \int p(\mb{y}_ i|\bgamma _ i)p(\bgamma _ i)\, d\bgamma _ i \\ & = \prod _{i=1}^{m} \int \exp \left\{ n_ i f(\mb{y}_ i,\bbeta ,\btheta ;\bgamma _ i) \right\} \, d\bgamma _ i \end{align*}$

where

$\begin{align*} n_ i f(\mb{y}_ i,\bbeta ,\btheta ;\bgamma _ i) & = \log \left\{ p(\mb{y}_ i|\bgamma _ i) \, p(\bgamma _ i) \right\} \\ & = \sum _{j=1}^{n_ i} \log \left\{ p(y_{ij}|\bgamma _ i)\right\} + \log \left\{ p(\bgamma _ i)\right\} \end{align*}$

When the number of observations within a cluster, $n_ i$ , is large, the Laplace approximation to the ith individual’s marginal probability density function is

$\begin{align*} p(\mb{y}_ i|\bbeta ,\btheta ) & = \int \exp \left\{ n_ i f(\mb{y}_ i,\bbeta ,\btheta ;\bgamma _ i) \right\} \, d\bgamma _ i \\ & = \frac{(2\pi )^{n_\gamma }/2 }{|-n_ i f''(\mb{y}_ i,\bbeta ,\btheta ;\widehat{\bgamma }_ i)|^{1/2}} \exp \left\{ n_ i f(\mb{y}_ i,\bbeta ,\btheta ;\widehat{\bgamma }_ i) \right\} \end{align*}$

where $n_{\gamma i}$ is the common dimension of the random effects, $\bgamma _ i$ . In this case, provided that the constant $c_ l = \min \{ n_ i\}$ is large, the Laplace approximation to the marginal log likelihood is

$\begin{align*} \log \left\{ L(\bbeta ,\btheta ;\widehat{\bgamma },\mb{y})\right\} & = \sum _{i=1}^{m} \left\{ n_ i f(\mb{y},\bbeta ,\btheta ;\widehat{\bgamma }_ i) + \frac{n_{\gamma i}}{2}\log \{ 2\pi \} \right. \\ & - \left. \frac{1}{2} \log |-n_ i f”(\bbeta ,\btheta ;\widehat{\bgamma }_ i) | \right\} \end{align*}$

which serves as the objective function for the METHOD= LAPLACE estimator in PROC GLIMMIX.

The Laplace approximation implemented in the GLIMMIX procedure differs from that in Wolfinger (1993) and Pinheiro and Bates (1995) in important respects. Wolfinger (1993) assumed a flat prior for $\bbeta$ and expanded the integrand around $\bbeta$ and $\bgamma$ , leaving only the covariance parameters for the overall optimization. The "fixed" effects $\bbeta$ and the random effects $\bgamma$ are determined in a suboptimization that takes the form of a linear mixed model step with pseudo-data. The GLIMMIX procedure involves only the random effects vector $\bgamma$ in the suboptimization. Pinheiro and Bates (1995) and Wolfinger (1993) consider a modified Laplace approximation that replaces the second derivative $f”(\mb{y},\bbeta ,\btheta ;\widehat{\bgamma })$ with an (approximate) expected value, akin to scoring. The GLIMMIX procedure does not use an approximation to $f”(\mb{y},\bbeta ,\btheta ;\widehat{\bgamma })$ . The METHOD= RSPL estimates in PROC GLIMMIX are equivalent to the estimates obtained with the modified Laplace approximation in Wolfinger (1993). The objective functions of METHOD= RSPL and Wolfinger (1993) differ in a constant that depends on the number of parameters.

Asymptotic Properties and the Importance of Subjects

Suppose that the GLIMMIX procedure processes your data by subjects (see the section Processing by Subjects) and let $n_ i$ denote the number of observations per subject, $i=1,\ldots ,s$ . Arguments in Vonesh (1996) show that the maximum likelihood estimator based on the Laplace approximation is a consistent estimator to order $O_ p\{ \max \{ 1/\sqrt {s}\} ,1/\min \{ n_ i\} \}$ . In other words, as the number of subjects and the number of observations per subject grows, the small-sample bias of the Laplace estimator disappears. Note that the term involving the number of subjects in this maximum relates to standard asymptotic theory, and the term involving the number of observations per subject relates to the accuracy of the Laplace approximation (Vonesh 1996). In the case where random effects enter the model linearly, the Laplace approximation is exact and the requirement that $\min \{ n_ i\} \rightarrow \infty$ can be dropped.

If your model is not processed by subjects but is equivalent to a subject model, the asymptotics with respect to s still apply, because the Hessian matrix of the suboptimization for $\bgamma$ breaks into s separate blocks. For example, the following two models are equivalent with respect to s and $n_ i$ , although only for the first model does PROC GLIMMIX process the data explicitly by subjects:

proc glimmix method=laplace;
   class sub A;
   model y = A;
   random intercept / subject=sub;
run;

proc glimmix method=laplace;
   class sub A;
   model y = A;
   random sub;
run;

The same holds, for example, for models with independent nested random effects. The following two models are equivalent, and you can derive asymptotic properties related to s and $\min \{ n_ i\}$ from the model in the first run:

proc glimmix method=laplace;
   class A B block;
   model y = A B A*B;
   random intercept A / subject=block;
run;

proc glimmix method=laplace;
   class A B block;
   model y = A B A*B;
   random block a*block;
run;

The Laplace approximation requires that the dimension of the integral does not increase with the size of the sample. Otherwise the error of the likelihood approximation does not diminish with $n_ i$ . This is the case, for example, with exchangeable arrays (Shun and McCullagh 1995), crossed random effects (Shun 1997), and correlated random effects of arbitrary dimension (Raudenbush, Yang, and Yosef 2000). Results in Shun (1997), for example, show that even in this case the standard Laplace approximation has smaller bias than pseudo-likelihood estimates.