The HPMIXED Procedure

Computing and Maximizing the Likelihood

In computing the restricted likelihood function given previously, the determinants of the matrices $\bC$ and $\bG$ can be obtained effectively by using Cholesky decomposition. The quadratic term $\mb {y}’\bP \mb {y}$ can be expressed in terms of solutions of mixed model equations as follows:

$\mb {y}’\bP \mb {y} = \frac{1}{\sigma ^2} \left(\mb {y}’\mb {y} - \left[ \widehat{\bbeta }’, \widehat{\bgamma }’\right] \left[ \begin{array}{c} \bX ’\mb {y} \\ \bZ ’\mb {y} \end{array} \right]\right)$

By default, the HPMIXED procedure profiles out the residual variance $\sigma ^2$ from the parameter vector $\btheta$ . Let $\btheta ^*$ be the new parameter vector such that $\theta ^*_ i = \theta _ i/\sigma ^2$ . The profiled objective function becomes

$\begin{equation*} L(\btheta ^*,\sigma ^2) = (n-p) \log (2 \pi ) + \log |\bC ^*| + \log |\bG ^*| - (r_ C - r_ G -n)\log (\sigma ^2) + (n - p) \end{equation*}$

where $\bC ^* = \bC \sigma ^2$ and $\bG ^* = \bG \sigma ^2$ are the profiled versions of $\bC$ and $\bG$ , $r_ C$ and $r_ G$ are the ranks of $\bC$ and $\bG$ . Minimizing analytically for $\sigma ^2$ yields

$\widehat{\sigma }^2 = \frac{1}{n-p}\left(\mb {y}’\mb {y} - \left[ \widehat{\bbeta }’, \widehat{\bgamma }’\right] \left[ \begin{array}{c} \bX ’\mb {y} \\ \bZ ’\mb {y} \end{array} \right]\right)$

Optimizing the likelihood calls for derivatives with respect to the parameters. The first and second derivatives of the log-likelihood function L with respect to scalar variance components $\theta _ i$ and $\theta _ j$ are

$\frac{\partial L}{\partial \theta _ i} = \mbox{tr}\left(\frac{\partial \bV }{\partial \theta _ i}\bP \right) - \mb {y}’\bP \frac{\partial \bV }{\partial \theta _ i}\bP \mb {y}$

and

$\frac{\partial ^2 L}{\partial \theta _ i\theta _ j} = -\mbox{tr}\left(\frac{\partial \bV }{\partial \theta _ i}\bP \frac{\partial \bV }{\partial \theta _ j}\bP \right) + 2\mb {y}’\bP \frac{\partial \bV }{\partial \theta _ i}\bP \frac{\partial \bV }{\partial \theta _ j}\bP \mb {y}$

The default quasi-Newton method of optimization for the HPMIXED procedure requires only first derivatives of the log likelihood, and these are readily derived by solving the mixed model equations. For example, when $\bG = \bI \sigma _ a$ , the first derivative of the log likelihood with respect to the parameter $\sigma ^2_ a$ can be computed as follows:

$\frac{\partial L}{\partial \sigma ^2_ a} = \frac{q}{\sigma ^2_ a} - \frac{\mbox{tr}(\bC ^{aa})}{\sigma ^4_ a} - \frac{\widehat{\bgamma }\widehat{\bgamma }}{\sigma ^4_ a}$

where q is the size of $\bgamma$ vector and $\bC ^{aa}$ is the part of the g-inverse of the mixed model equation coefficient matrix $\bC$ corresponding to the random effect $\bgamma$ .

The second derivative of the log likelihood needs to be computed only if you specify certain nondefault optimization techniques in the NLOPTIONS statement, namely TECH=NEWRAP, TECH=NRRIDG, or TECH=TRUREG; see NLOPTIONS Statement in Chapter 19: Shared Concepts and Topics, for more information about optimization techniques. For these second-derivative-based optimization techniques, the HPMIXED procedure does not actually use the true second derivative matrix, or observed information matrix, as defined earlier. Instead, it uses an alternative matrix that is more efficient to compute for large problems and that can be more stable. This alternative is called the average information matrix, and it is defined as follows. The expected value of the second derivative is

$\bE (\frac{\partial ^2 L}{\partial \theta _ i\theta _ j}) = \mbox{tr}\left(\frac{\partial \bV }{\partial \theta _ i}\bP \frac{\partial \bV }{\partial \theta _ j}\bP \right)$

It is this trace that is computationally inefficient to evaluate. But if you average the expected information matrix defined by this formula with the observed information matrix defined by the preceding formula for the true second derivative, then the trace term cancels, leaving just a quadratic expression in $\mb {y}$ . This quadratic expression defines the average information (Johnson and Thompson, 1995) with respect to $\theta _ i$ and $\theta _ j$ :

$\mbox{AI}(\theta _ i,\theta _ j) = \mb {y}’\bP \frac{\partial \bV }{\partial \theta _ i}\bP \frac{\partial \bV }{\partial \theta _ j}\bP \mb {y}$