Pseudo-likelihood Estimation Based on Linearization

The Pseudo-model

Recall from the section Notation for the Generalized Linear Mixed Model that

$\mr{E}[\mb{Y} | \bgamma ] = g^{-1}(\mb{X}\bbeta + \mb{Z}\bgamma ) = g^{-1}(\bm {\eta }) = \bmu$

where $\bgamma \sim N(\mb{0},\mb{G})$ and $\mr{Var}[\mb{Y} | \bgamma ] = \mb{A}^{1/2}\mb{RA}^{1/2}$ . Following Wolfinger and O’Connell (1993), a first-order Taylor series of $\bmu$ about $\widetilde{\bbeta }$ and $\widetilde{\bgamma }$ yields

$g^{-1}(\bm {\eta }) \doteq g^{-1}(\widetilde{\bm {\eta }}) + \widetilde{\bDelta }\mb{X} (\bbeta - \widetilde{\bbeta }) + \widetilde{\bDelta }\mb{Z} (\bgamma - \widetilde{\bgamma })$

where

$\widetilde{\bDelta } = \left( \frac{\partial g^{-1}(\bm {\eta })}{\partial \bm {\eta }} \right)_{\widetilde{\bbeta },\widetilde{\bgamma }}$

is a diagonal matrix of derivatives of the conditional mean evaluated at the expansion locus. Rearranging terms yields the expression

$\widetilde{\bDelta }^{-1}(\bmu - g^{-1}(\widetilde{\bm {\eta }})) + \mb{X}\widetilde{\bbeta } + \mb{Z}\widetilde{\bgamma } \doteq \mb{X}\bbeta + \mb{Z}\bgamma$

The left side is the expected value, conditional on $\bgamma$ , of

$\widetilde{\bDelta }^{-1}(\mb{Y} -g^{-1}(\widetilde{\bm {\eta }})) + \mb{X}\widetilde{\bbeta } + \mb{Z}\widetilde{\bgamma } \equiv \mb{P}$

and

$\mr{Var}[\mb{P}|\bgamma ] = \widetilde{\bDelta }^{-1}\mb{A}^{1/2} \mb{RA}^{1/2}\widetilde{\bDelta }^{-1}$

You can thus consider the model

$\mb{P} = \mb{X}\bbeta + \mb{Z}\bgamma + \bepsilon$

which is a linear mixed model with pseudo-response $\mb{P}$ , fixed effects $\bbeta$ , random effects $\bgamma$ , and $\mr{Var}[\bepsilon ] = \mr{Var}[\mb{P}|\bgamma ]$ .

Objective Functions

Now define

$\mb{V}(\btheta ) = \mb{ZGZ}’ + \widetilde{\bDelta }^{-1}\mb{A}^{1/2}\mb{RA}^{1/2}\widetilde{\bDelta }^{-1}$

as the marginal variance in the linear mixed pseudo-model, where $\btheta$ is the $(q\times 1)$ parameter vector containing all unknowns in $\mb{G}$ and $\mb{R}$ . Based on this linearized model, an objective function can be defined, assuming that the distribution of $\mb{P}$ is known. The GLIMMIX procedure assumes that $\bepsilon$ has a normal distribution. The maximum log pseudo-likelihood (MxPL) and restricted log pseudo-likelihood (RxPL) for $\mb{P}$ are then

$\begin{align*} l(\btheta ,\mb{p}) & = -\frac12 \log |\mb{V}(\btheta )| -\frac{1}{2} \mb{r}’\mb{V}(\btheta )^{-1}\mb{r} -\frac{f}{2}\log \{ 2\pi \} \\ l_ R(\btheta ,\mb{p}) & = -\frac12 \log |\mb{V}(\btheta )| -\frac{1}{2} \mb{r}’\mb{V}(\btheta )^{-1}\mb{r} -\frac12 \log |\mb{X}’\mb{V}(\btheta )^{-1}\mb{X}| -\frac{f-k}{2}\log \{ 2\pi \} \end{align*}$

with $\mb{r} = \mb{p} - \bX (\bX ’\bV ^{-1}\bX )^{-}\bX ’\bV ^{-1}\mb{p}$ . f denotes the sum of the frequencies used in the analysis, and k denotes the rank of $\bX$ . The fixed-effects parameters $\bbeta$ are profiled from these expressions. The parameters in $\btheta$ are estimated by the optimization techniques specified in the NLOPTIONS statement. The objective function for minimization is $-2l(\btheta ,\mb{p})$ or $-2l_ R(\btheta ,\mb{p})$ . At convergence, the profiled parameters are estimated and the random effects are predicted as

$\begin{align*} \widehat{\bbeta } & = (\bX ’\bV (\widehat{\btheta })^{-1}\bX )^{-}\bX ’ \bV (\widehat{\btheta })^{-1}\mb{p} \\ \widehat{\bgamma }& = \widehat{\bG }\bZ ’\bV (\widehat{\btheta })^{-1}\widehat{\mb{r}} \end{align*}$

With these statistics, the pseudo-response and error weights of the linearized model are recomputed and the objective function is minimized again. The predictors $\widehat{\bgamma }$ are the estimated BLUPs in the approximated linear model. This process continues until the relative change between parameter estimates at two successive (outer) iterations is sufficiently small. See the PCONV= option in the PROC GLIMMIX statement for the computational details about how the GLIMMIX procedure compares parameter estimates across optimizations.

If the conditional distribution contains a scale parameter $\phi \not= 1$ (Table 45.20), the GLIMMIX procedure profiles this parameter in GLMMs from the log pseudo-likelihoods as well. To this end define

$\mb{V}(\btheta ^*) = \widetilde{\bDelta }^{-1}\mb{A}^{1/2}\mb{R}^*\mb{A}^{1/2}\widetilde{\bDelta }^{-1} + \mb{Z}\mb{G}^*\mb{Z}’$

where $\btheta ^*$ is the covariance parameter vector with q – 1 elements. The matrices $\bG ^*$ and $\bR ^*$ are appropriately reparameterized versions of $\bG$ and $\bR$ . For example, if $\bG$ has a variance component structure and $\bR =\phi \bI$ , then $\btheta ^*$ contains ratios of the variance components and $\phi$ , and $\bR ^* = \bI$ . The solution for $\widehat{\phi }$ is

$\widehat{\phi } = \widehat{\mb{r}}’\mb{V}(\widehat{\btheta }^*)^{-1}\widehat{\mb{r}}/m$

where m = f for MxPL and m = f – k for RxPL. Substitution into the previous functions yields the profiled log pseudo-likelihoods,

$\begin{align*} l(\btheta ^*,\mb{p}) = & -\frac12 \log |\mb{V}(\btheta ^*)| -\frac{f}{2} \log \left\{ \mb{r}’\mb{V}(\btheta ^*)^{-1}\mb{r}\right\} -\frac{f}{2}(1+\log \{ 2\pi /f\} ) \\ l_ R(\btheta ^*,\mb{p}) = & -\frac12 \log |\mb{V}(\btheta ^*)| -\frac{f-k}{2} \log \left\{ \mb{r}’\mb{V}(\btheta ^*)^{-1}\mb{r}\right\} \\ & -\frac12 \log |\mb{X}’\mb{V}(\btheta ^*)^{-1}\mb{X}| -\frac{f-k}{2}(1 + \log \{ 2\pi /(f-k)\} ) \end{align*}$

Profiling of $\phi$ can be suppressed with the NOPROFILE option in the PROC GLIMMIX statement.

Where possible, the objective function, its gradient, and its Hessian employ the sweep-based W-transformation ( Hemmerle and Hartley 1973; Goodnight 1979; Goodnight and Hemmerle 1979). Further details about the minimization process in the general linear mixed model can be found in Wolfinger, Tobias, and Sall (1994).

Estimated Precision of Estimates

The GLIMMIX procedure produces estimates of the variability of $\widehat{\bbeta }$ , $\widehat{\btheta }$ , and estimates of the prediction variability for $\widehat{\bgamma }$ , $\mr{Var}[\widehat{\bgamma }-\bgamma ]$ . Denote as $\mb{S}$ the matrix

$\mb{S} \equiv \widehat{\mr{Var}}[\bP |\bgamma ] = \widetilde{\bDelta }^{-1}\mb{A}^{1/2}\mb{R}\mb{A}^{1/2}\widetilde{\bDelta }^{-1}$

where all components on the right side are evaluated at the converged estimates. The mixed model equations (Henderson 1984) in the linear mixed (pseudo-)model are then

$\left[ \begin{array}{cc} \mb{X}’\mb{S}^{-1}\mb{X} & \mb{X}’\mb{S}^{-1}\mb{Z} \\ \mb{Z}’\mb{S}^{-1}\mb{X} & \mb{Z}’\mb{S}^{-1}\mb{Z} + \mb{G}(\widehat{\btheta })^{-1} \end{array} \right] \left[ \begin{array}{c} \widehat{\bbeta } \\ \widehat{\bgamma } \end{array} \right] = \left[ \begin{array}{c} \mb{X}’\mb{S}^{-1}\mb{p} \\ \mb{Z}’\mb{S}^{-1}\mb{p} \end{array} \right]$

and

$\begin{align*} \mb{C} & = \left[ \begin{array}{cc} \mb{X}’\mb{S}^{-1}\mb{X} & \mb{X}’\mb{S}^{-1}\mb{Z} \\ \mb{Z}’\mb{S}^{-1}\mb{X} & \mb{Z}’\mb{S}^{-1}\mb{Z} + \mb{G}(\widehat{\btheta })^{-1} \end{array} \right]^{-} \\ & = \left[ \begin{array}{cc} \widehat{\bOmega } & -\widehat{\bOmega }\mb{X}’\mb{V}(\widehat{\btheta })^{-1}\mb{Z}\mb{G}(\widehat{\btheta }) \\ -\mb{G}(\widehat{\btheta })\mb{Z}’\mb{V}(\widehat{\btheta })^{-1}\mb{X}\widehat{\bOmega } & \mb{M} + \mb{G}(\widehat{\btheta })\mb{Z}’\mb{V}(\widehat{\btheta })^{-1}\mb{X}\widehat{\bOmega } \mb{X}’\mb{V}(\widehat{\btheta })^{-1}\mb{ZG}(\widehat{\btheta }) \end{array} \right] \end{align*}$

is the approximate estimated variance-covariance matrix of $[\widehat{\bbeta }’,\widehat{\bgamma }’-\bgamma ’]’$ . Here, $\widehat{\bOmega } = (\bX ’\bV (\widehat{\btheta })^{-1}\bX )^{-}$ and $\bM = (\bZ ’\bS ^{-1}\bZ + \bG (\widehat{\btheta })^{-1})^{-1}$ .

The square roots of the diagonal elements of $\widehat{\bOmega }$ are reported in the Standard Error column of the "Parameter Estimates" table. This table is produced with the SOLUTION option in the MODEL statement. The prediction standard errors of the random-effects solutions are reported in the Std Err Pred column of the "Solution for Random Effects" table. This table is produced with the SOLUTION option in the RANDOM statement.

As a cautionary note, $\bC$ tends to underestimate the true sampling variability of [ $\widehat{\bbeta }’, \widehat{\bgamma }’]’$ , because no account is made for the uncertainty in estimating $\bG$ and $\bR$ . Although inflation factors have been proposed (Kackar and Harville 1984; Kass and Steffey 1989; Prasad and Rao 1990), they tend to be small for data sets that are fairly well balanced. PROC GLIMMIX does not compute any inflation factors by default. The DDFM= KENWARDROGER option in the MODEL statement prompts PROC GLIMMIX to compute a specific inflation factor (Kenward and Roger 1997), along with Satterthwaite-based degrees of freedom.

If $\bG (\widehat{\btheta })$ is singular, or if you use the CHOL option of the PROC GLIMMIX statement, the mixed model equations are modified as follows. Let $\bL$ denote the lower triangular matrix so that $\bL \bL ’ = \bG (\widehat{\btheta })$ . PROC GLIMMIX then solves the equations

$\left[ \begin{array}{cc} \mb{X}’\mb{S}^{-1}\mb{X} & \mb{X}’\mb{S}^{-1}\mb{Z}\mb{L} \\ \mb{L}’\mb{Z}’\bS ^{-1}\mb{X} & \mb{L}’\mb{Z}’\mb{S}^{-1}\mb{Z}\mb{L} + \mb{I} \end{array} \right] \left[ \begin{array}{c} \widehat{\bbeta } \\ \widehat{\btau } \end{array} \right] = \left[ \begin{array}{c} \mb{X}’\mb{S}^{-1}\mb{p} \\ \mb{L}’\mb{Z}’\mb{S}^{-1}\mb{p} \end{array} \right]$

and transforms $\widehat{\btau }$ and a generalized inverse of the left-side coefficient matrix by using $\bL$ .

The asymptotic covariance matrix of the covariance parameter estimator $\widehat{\btheta }$ is computed based on the observed or expected Hessian matrix of the optimization procedure. Consider first the case where the scale parameter $\phi$ is not present or not profiled. Because $\bbeta$ is profiled from the pseudo-likelihood, the objective function for minimization is $f(\btheta ) = -2l(\btheta ,\mb{p})$ for METHOD= MSPL and METHOD= MMPL and $f(\btheta ) = -2l_ R(\btheta ,\mb{p})$ for METHOD= RSPL and METHOD= RMPL. Denote the observed Hessian (second derivative) matrix as

$\bH = \frac{\partial ^2 f(\btheta )}{\partial \btheta \, \partial \btheta '}$

The GLIMMIX procedure computes the variance of $\widehat{\btheta }$ by default as $2\bH ^{-1}$ . If the Hessian is not positive definite, a sweep-based generalized inverse is used instead. When the EXPHESSIAN option of the PROC GLIMMIX statement is used, or when the procedure is in scoring mode at convergence (see the SCORING option in the PROC GLIMMIX statement), the observed Hessian is replaced with an approximated expected Hessian matrix in these calculations.

Following Wolfinger, Tobias, and Sall (1994), define the following components of the gradient and Hessian in the optimization:

$\begin{align*} \mb{g}_1 & = \frac{\partial }{\partial \btheta }\, \mb{r}’\mb{V}(\btheta )^{-1}\mb{r} \\ \mb{H}_1 & = \frac{\partial ^2}{\partial \btheta \, \partial \btheta '} \, \log \{ \bV (\btheta )\} \\ \mb{H}_2 & = \frac{\partial ^2}{\partial \btheta \, \partial \btheta '} \, \mb{r}’\bV (\btheta )^{-1}\mb{r} \\ \mb{H}_3 & = \frac{\partial ^2}{\partial \btheta \, \partial \btheta '} \, \log \{ |\bX ’\bV (\btheta )^{-1}\bX |\} \end{align*}$

Table 45.23 gives expressions for the Hessian matrix $\bH$ depending on estimation method, profiling, and scoring.

Table 45.23: Hessian Computation in GLIMMIX

Profiling	Scoring	MxPL	RxPL
No	No	$\mb{H}_1 + \mb{H}_2$	$\mb{H}_1 + \mb{H}_2 + \mb{H}_3$
No	Yes	$-\mb{H}_1$	$-\mb{H}_1 + \mb{H}_3$
No	Modified	$-\mb{H}_1$	$-\mb{H}_1 - \mb{H}_3$
Yes	No	$\left[\begin{array}{cc} \mb{H}_1 + \mb{H}_2/\phi & -\mb{g}_2/\phi ^2 \\ -\mb{g}_2’/\phi ^2 & f/\phi ^2 \end{array}\right]$	$\left[\begin{array}{cc} \mb{H}_1 + \mb{H}_2/\phi + \mb{H}_3 & -\mb{g}_2/\phi ^2 \\ -\mb{g}_2’/\phi ^2 & (f-k)/\phi ^2 \end{array}\right]$
Yes	Yes	$\left[\begin{array}{cc} -\mb{H}_1 & -\mb{g}_2/\phi ^2 \\ -\mb{g}_2’/\phi ^2 & f/\phi ^2 \end{array}\right]$	$\left[\begin{array}{cc} -\mb{H}_1 + \mb{H}_3 & -\mb{g}_2/\phi ^2 \\ -\mb{g}_2’/\phi ^2 & (f-k)/\phi ^2 \end{array}\right]$
Yes	Modified	$\left[\begin{array}{cc} -\mb{H}_1 & -\mb{g}_2/\phi ^2 \\ -\mb{g}_2’/\phi ^2 & f/\phi ^2 \end{array}\right]$	$\left[\begin{array}{cc} -\mb{H}_1 - \mb{H}_3 & -\mb{g}_2/\phi ^2 \\ -\mb{g}_2’/\phi ^2 & (f-k)/\phi ^2 \end{array}\right]$

The "Modified" expressions for the Hessian under scoring in RxPL estimation refer to a modified scoring method. In some cases, the modification leads to faster convergence than the standard scoring algorithm. The modification is requested with the SCOREMOD option in the PROC GLIMMIX statement.

Finally, in the case of a profiled scale parameter $\phi$ , the Hessian for the $(\btheta ^*,\phi )$ parameterization is converted into that for the $\btheta$ parameterization as

$\bH (\btheta ) = \bB \bH (\btheta ^*,\phi )\bB ’$

where

$\mb{B} = \left[ \begin{array}{ccccc} 1/\phi & 0 & \cdots & 0 & 0 \\ 0 & 1/\phi & \cdots & 0 & 0 \\ 0 & \cdots & \cdots & 1/\phi & 0 \\ -\theta ^*_1/\phi & -\theta ^*_2/\phi & \cdots & -\theta _{q-1}^*/\phi & 1 \end{array}\right]$

Subject-Specific and Population-Averaged (Marginal) Expansions

There are two basic choices for the expansion locus of the linearization. A subject-specific (SS) expansion uses

$\widetilde{\bbeta } = \widehat{\bbeta } \quad \widetilde{\bgamma } = \widehat{\bgamma }$

which are the current estimates of the fixed effects and estimated BLUPs. The population-averaged (PA) expansion expands about the same fixed effects and the expected value of the random effects

$\widetilde{\bbeta } = \widehat{\bbeta } \quad \widetilde{\bgamma } = \mb{0}$

To recompute the pseudo-response and weights in the SS expansion, the BLUPs must be computed every time the objective function in the linear mixed model is maximized. The PA expansion does not require any BLUPs. The four pseudo-likelihood methods implemented in the GLIMMIX procedure are the $2 \times 2$ factorial combination between two expansion loci and residual versus maximum pseudo-likelihood estimation. The following table shows the combination and the corresponding values of the METHOD= option (PROC GLIMMIX statement); METHOD= RSPL is the default.

Type of	Expansion Locus
PL	$\widehat{\bgamma }$	E $[\bgamma ]$
residual	RSPL	RMPL
maximum	MSPL	MMPL

The GLIMMIX Procedure

Pseudo-likelihood Estimation Based on Linearization

The Pseudo-model

Objective Functions

Estimated Precision of Estimates

Subject-Specific and Population-Averaged (Marginal) Expansions