The GLIMMIX Procedure

Maximum Likelihood Estimation Based on Adaptive Quadrature

Quadrature methods, like the Laplace approximation, approximate integrals. If you choose METHOD=QUAD for a generalized linear mixed model, the GLIMMIX procedure approximates the marginal log likelihood with an adaptive Gauss-Hermite quadrature rule. Gaussian quadrature is particularly well suited to numerically evaluate integrals against probability measures (Lange, 1999, Ch. 16). And Gauss-Hermite quadrature is appropriate when the density has kernel $\exp \{ -x^2\}$ and integration extends over the real line, as is the case for the normal distribution. Suppose that $p(x)$ is a probability density function and the function $f(x)$ is to be integrated against it. Then the quadrature rule is

$\int _{-\infty }^{\infty } f(x)p(x)\, dx \approx \sum _{i=1}^{N} w_ i f(x_ i)$

where N denotes the number of quadrature points, the $w_ i$ are the quadrature weights, and the $x_ i$ are the abscissas. The Gaussian quadrature chooses abscissas in areas of high density, and if $p(x)$ is continuous, the quadrature rule is exact if $f(x)$ is a polynomial of up to degree 2N – 1. In the generalized linear mixed model the roles of $f(x)$ and $p(x)$ are played by the conditional distribution of the data given the random effects, and the random-effects distribution, respectively. Quadrature abscissas and weights are those of the standard Gauss-Hermite quadrature (Golub and Welsch 1969; see also Table 25.10 of Abramowitz and Stegun 1972; Evans 1993).

A numerical integration rule is called adaptive when it uses a variable step size to control the error of the approximation. For example, an adaptive trapezoidal rule uses serial splitting of intervals at midpoints until a desired tolerance is achieved. The quadrature rule in the GLIMMIX procedure is adaptive in the following sense: if you do not specify the number of quadrature points (nodes) with the QPOINTS= suboption of the METHOD=QUAD option, then the number of quadrature points is determined by evaluating the log likelihood at the starting values at a successively larger number of nodes until a tolerance is met (for more details see the text under the heading “Starting Values” in the next section). Furthermore, the GLIMMIX procedure centers and scales the quadrature points by using the empirical Bayes estimates (EBEs) of the random effects and the Hessian (second derivative) matrix from the EBE suboptimization. This centering and scaling improves the likelihood approximation by placing the abscissas according to the density function of the random effects. It is not, however, adaptiveness in the previously stated sense.

Objective Function

Let $\bbeta$ denote the vector of fixed-effects parameters and $\btheta$ the vector of covariance parameters. For quadrature estimation in the GLIMMIX procedure, $\btheta$ includes the G-side parameters and a possible scale parameter $\phi$ , provided that the conditional distribution of the data contains such a scale parameter. $\btheta ^*$ is the vector of the G-side parameters. The marginal distribution of the data for subject i in a mixed model can be expressed as

$p(\mb {y}_ i) = \int \, \cdots \, \int p(\mb {y}_ i | \bgamma _ i,\bbeta ,\phi ) \, p(\bgamma _ i|\btheta ^*) \, \, d\bgamma _ i$

Suppose $N_ q$ denotes the number of quadrature points in each dimension (for each random effect) and r denotes the number of random effects. For each subject, obtain the empirical Bayes estimates of $\bgamma _ i$ as the vector $\widehat{\bgamma }_ i$ that minimizes

$-\log \left\{ p(\mb {y}_ i | \bgamma _ i,\bbeta ,\phi ) p(\bgamma _ i|\btheta ^*) \right\} = f(\mb {y}_ i,\bbeta ,\btheta ;\bgamma _ i)$

If $\mb {z} = [z_1,\cdots ,z_{N_ q}]$ are the standard abscissas for Gauss-Hermite quadrature, and $\mb {z}_ j^* = [z_{j_1},\cdots ,z_{j_ r}]$ is a point on the r-dimensional quadrature grid, then the centered and scaled abscissas are

$\mb {a}_ j^* = \widehat{\bgamma }_ i + 2^{1/2} f”(\mb {y}_ i,\bbeta ,\btheta ;\widehat{\bgamma }_ i)^{-1/2} \mb {z}_ j^*$

As for the Laplace approximation, $f”$ is the second derivative matrix with respect to the random effects,

$f”(\mb {y}_ i,\bbeta ,\btheta ;\widehat{\bgamma }_ i) = \frac{\partial ^2 f(\mb {y}_ i,\bbeta ,\btheta ;\bgamma _ i)}{\partial \bgamma _ i \partial \bgamma _ i} |_{\widehat{\bgamma }_ i}$

These centered and scaled abscissas, along with the Gauss-Hermite quadrature weights $\mb {w} = [w_1,\cdots ,w_{N_ q}]$ , are used to construct the r-dimensional integral by a sequence of one-dimensional rules

$\begin{align*} p(\mb {y}_ i) & = \int \, \cdots \, \int p(\mb {y}_ i | \bgamma _ i,\bbeta ,\phi ) \, p(\bgamma _ i|\btheta ^*) \, \, d\bgamma _ i \\ & \approx 2^{r/2} | f”(\mb {y}_ i,\bbeta ,\btheta ;\widehat{\bgamma }_ i)|^{-1/2} \\ & \mbox{ } \sum _{j_1=1}^{N_ q} \, \cdots \, \sum _{j_ r=1}^{N_ q} \left[ p(\mb {y}_ i|\mb {a}_ j^*,\bbeta ,\phi ) p(\mb {a}_ j^*|\btheta ^*) \prod _{k=1}^ r w_{j_ k}\exp {z^2_{j_ k}} \right] \end{align*}$

The right-hand side of this expression, properly accumulated across subjects, is the objective function for adaptive quadrature estimation in the GLIMMIX procedure.

Quadrature or Laplace Approximation

If you select the quadrature rule with a single quadrature point, namely

proc glimmix method=quad(qpoints=1);

the results will be identical to METHOD=LAPLACE. Computationally, the two methods are not identical, however. METHOD=LAPLACE can be applied to a considerably larger class of models. For example, crossed random effects, models without subjects, or models with non-nested subjects can be handled with the Laplace approximation but not with quadrature. Furthermore, METHOD=LAPLACE draws on a number of computational simplifications that can increase its efficiency compared to a quadrature algorithm with a single node. For example, the Laplace approximation is possible with unbounded covariance parameter estimates (NOBOUND option in the PROC GLIMMIX statement) and can permit certain types of negative definite or indefinite $\bG$ matrices. The adaptive quadrature approximation with scaled abscissas typically breaks down when $\bG$ is not at least positive semidefinite.

As the number of random effects grows—for example, if you have nested random effects—quadrature quickly becomes computationally infeasible, due to the high dimensionality of the integral. To this end it is worthwhile to clarify the issues of dimensionality and computational effort as related to the number of quadrature nodes. Suppose that the A effect has 4 levels and consider the following statements:

proc glimmix method=quad(qpoints=5);
   class A id;
   model y = / dist=negbin;
   random A / subject=id;
run;

For each subject, computing the marginal log likelihood requires the numerical evaluation of a four-dimensional integral. As part of this evaluation $5^4 = 625$ conditional log likelihoods need to be computed for each observation on each pass through the data. As the number of quadrature points or the number of random effects increases, this constitutes a sizable computational effort. Suppose, for example, that an additional random effect with b = 2 levels is added as an interaction. The following statements then require evaluation of $5^{(4 + 8)} = 244140625$ conditional log likelihoods for each observation one each pass through the data:

proc glimmix method=quad(qpoints=5);
   class A B id;
   model y = / dist=negbin;
   random A A*B / subject=id;
run;

As the number of random effects increases, Laplace approximation presents a computationally more expedient alternative.

If you wonder whether METHOD=LAPLACE would present a viable alternative to a model that you can fit with METHOD=QUAD, the “Optimization Information” table can provide some insights. The table contains as its last entry the number of quadrature points determined by PROC GLIMMIX to yield a sufficiently accurate approximation of the log likelihood (at the starting values). In many cases, a single quadrature node is sufficient, in which case the estimates are identical to those of METHOD=LAPLACE.