The SSM Procedure

Likelihood Computation and Model Fitting Phase

In view of the Gaussian nature of the response vector, the likelihood of $ \mb{Y}$ can be computed by using the prediction-error decomposition. In the diffuse case the definition of the likelihood depends on the treatment of the diffuse quantities—$\pmb {\delta }$, $\pmb {\beta }$, and $\pmb {\gamma }$. The SSM procedure uses a likelihood called the diffuse likelihood, $ \mb{L}_{d}( \mb{Y}, \pmb {\theta } )$, for parameter estimation. In the literature the diffuse likelihood is also called the restricted likelihood. The diffuse likelihood is computed by treating the diffuse quantities as zero-mean Gaussian random variables with infinite variance (that is, they have a diffuse distribution). In terms of the quantities described in Table 34.5, the diffuse likelihood is defined as follows:

\[ -2 \log \mb{L}_{d}( \mb{Y}, \pmb {\theta } ) = N_{0} \log 2 \pi + \sum _{t=1}^{n} \sum _{i=1}^{q*p_{t}} ( \log F_{t, i} + \frac{\nu _{t,i}^{2} }{ F_{t, i} } ) - \log ( | \mb{S}_{n, p_{n}}^{-1} | ) - \mb{b}_{n, p_{n}}^{'} \mb{S}_{n, p_{n}}^{-1} \mb{b}_{n, p_{n}} \]

where $N_{0} = (N - k - g - d)$, $ | \mb{S}_{n, p_{n}}^{-1} | $ denotes the determinant of $\mb{S}_{n, p_{n}}^{-1}$, and $ \mb{b}_{n, p_{n}}^{'} $ denotes the transpose of the column vector $ \mb{b}_{n, p_{n}}$. In the preceding formula, the terms that are associated with the missing response values $y_{t,i}$ are excluded, and $N$ denotes the total number of nonmissing response values in the sample. If $\mb{S}_{n, p_{n}}$ is not invertible, then a generalized inverse is used in place of $\mb{S}_{n, p_{n}}^{-1}$, and $ | \mb{S}_{n, p_{n}}^{-1} | $ is computed based on the nonzero eigenvalues of $\mb{S}_{n, p_{n}}$. Moreover, in this case $N_{0} = N - \mr{rank}(\mb{S}_{n, p_{n}})$. When $(d+k+g) = 0$, the terms that involve $ \mb{S}_{n, p_{n}}$ and $ \mb{b}_{n, p_{n}}$ are absent.
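As a purely illustrative aid (not part of PROC SSM), the following Python sketch evaluates $-2 \log \mb{L}_{d}$ from this formula. It assumes that the one-step-ahead prediction errors $\nu _{t,i}$, their variances $F_{t,i}$ (for the nonmissing responses only), and the accumulated quantities $\mb{b}_{n, p_{n}}$ and $\mb{S}_{n, p_{n}}$ have already been produced by a diffuse Kalman filter; the function and its interface are hypothetical.

    import numpy as np

    def neg2_diffuse_loglik(nu, F, b, S):
        """Sketch of -2 log L_d. nu and F are 1-D arrays over the
        nonmissing responses; b is the vector b_{n,p_n} and S is the
        matrix S_{n,p_n} (notation as in the text)."""
        nu = np.asarray(nu, dtype=float)
        F = np.asarray(F, dtype=float)
        b = np.asarray(b, dtype=float)
        S = np.atleast_2d(S)
        N = nu.size                               # nonmissing responses only
        N0 = N - np.linalg.matrix_rank(S)         # equals N - (d + k + g) when S is invertible
        S_inv = np.linalg.pinv(S)                 # generalized inverse if S is singular
        eig = np.linalg.eigvalsh(S)
        logdet_S = np.sum(np.log(eig[eig > 1e-12]))   # -log|S^{-1}|, from the nonzero eigenvalues
        quad = float(b @ S_inv @ b)               # b' S^{-1} b
        return (N0 * np.log(2 * np.pi)
                + np.sum(np.log(F) + nu**2 / F)
                + logdet_S
                - quad)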

In addition to reporting the diffuse likelihood, the SSM procedure reports a variant of the likelihood called the profile likelihood. The profile likelihood is computed by treating the diffuse quantities—$\pmb {\delta }$, $\pmb {\beta }$, and $\pmb {\gamma }$—as unknown parameters (similar to $\pmb {\theta }$). The quantities that are described in Table 34.5 also play a key role in the computation of this variant of the likelihood. It turns out that the filtering process yields the maximum likelihood (ML) estimates of $\pmb {\delta }$, $\pmb {\beta }$, and $\pmb {\gamma }$ conditional on the remaining parameters of the model, $\pmb {\theta }$. Moreover, the likelihood that is evaluated at the ML estimates of $\pmb {\delta }$, $\pmb {\beta }$, and $\pmb {\gamma }$—that is, the likelihood from which these parameters are profiled out—has the following expression:

\[ -2 \log \mb{L}_{p}( \mb{Y}, \pmb {\theta } ) = N \log 2 \pi + \sum _{t=1}^{n} \sum _{i=1}^{q*p_{t}} ( \log F_{t, i} + \frac{\nu _{t,i}^{2} }{ F_{t, i} } ) - \mb{b}_{n, p_{n}}^{'} \mb{S}_{n, p_{n}}^{-1} \mb{b}_{n, p_{n}} \]

Note that, computationally, the profile likelihood differs from the diffuse likelihood in only two respects: the constant term involves $N$—the total number of nonmissing response values—rather than $N_{0}$, and the log-determinant term $ \log ( | \mb{S}_{n, p_{n}}^{-1} | )$ is absent. However, in terms of theoretical considerations, the diffuse likelihood and the profile likelihood differ in an important way. It can be shown that the diffuse likelihood corresponds to the (nondiffuse) likelihood of a suitable transformation of $\mb{Y}$. The transformation is chosen so that the distribution of the transformed data no longer depends on the initial condition $\pmb {\delta }$ and the regression vectors $\pmb {\beta }$ and $\pmb {\gamma }$. In this sense, the diffuse likelihood is a pseudo-likelihood of the original data $\mb{Y}$. The profile likelihood, on the other hand, does not involve any data transformation and can be considered the likelihood of the original data $\mb{Y}$. Of course, if the state space model for $\mb{Y}$ does not involve any diffuse quantities, then the two likelihoods are the same.
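Continuing the illustrative sketch above (again with hypothetical inputs from a diffuse Kalman filter, not PROC SSM output), the profile likelihood evaluation differs only in the two respects just noted:

    import numpy as np

    def neg2_profile_loglik(nu, F, b, S):
        """Sketch of -2 log L_p: same ingredients as the diffuse
        likelihood, but the constant term uses N (not N0) and the
        log-determinant term is dropped."""
        nu = np.asarray(nu, dtype=float)
        F = np.asarray(F, dtype=float)
        b = np.asarray(b, dtype=float)
        N = nu.size                               # nonmissing responses only
        S_inv = np.linalg.pinv(np.atleast_2d(S))  # generalized inverse of S
        quad = float(b @ S_inv @ b)
        return (N * np.log(2 * np.pi)
                + np.sum(np.log(F) + nu**2 / F)
                - quad)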

As noted earlier, the SSM procedure does not use the profile likelihood for parameter estimation. When the model specification contains any unknown parameters $\pmb {\theta }$, they are estimated by maximizing the diffuse likelihood function. This is done by using a nonlinear optimization process that involves repeated evaluations of $ \mb{L}_{d}( \mb{Y}, \pmb {\theta } )$ at different values of $\pmb {\theta }$. The maximum likelihood (ML) estimate of $\pmb {\theta }$ is denoted by $\hat{\pmb {\theta }}$. Because the diffuse likelihood is also called the restricted likelihood, $\hat{\pmb {\theta }}$ is sometimes called the restricted maximum likelihood (REML) estimate. Approximate standard errors of $\hat{\pmb {\theta }}$ are computed by taking the square roots of the diagonal elements of its (approximate) covariance matrix. This covariance is computed as $-\mb{H}^{-1}$, where $\mb{H}$ is the Hessian (the matrix of second-order partial derivatives) of $\log \mb{L}_{d}( \mb{Y}, \pmb {\theta } )$, evaluated at the optimum $\hat{\pmb {\theta }}$. Under mild regularity assumptions, the ML (or REML) estimate of $\pmb {\theta }$ based on either the diffuse likelihood or the profile likelihood is consistent and efficient as the number of distinct time points tends to infinity. In addition, the estimate based on the diffuse likelihood is known to have smaller bias. For good discussions of diffuse and profile likelihoods, see Laird (2004) and Francke, Koopman, and de Vos (2010).
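The following Python sketch illustrates this estimation step in general terms; it is not PROC SSM's implementation. Here neg2_loglik stands for a hypothetical function that runs a diffuse Kalman filter for a given $\pmb {\theta }$ and returns $-2 \log \mb{L}_{d}( \mb{Y}, \pmb {\theta } )$, and the Hessian is approximated by finite differences.

    import numpy as np
    from scipy.optimize import minimize

    def numerical_hessian(f, x, eps=1e-5):
        """Central finite-difference Hessian of a scalar function f at x."""
        x = np.asarray(x, dtype=float)
        n = x.size
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                ei = np.zeros(n)
                ej = np.zeros(n)
                ei[i] = eps
                ej[j] = eps
                H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                           - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
        return H

    def fit(neg2_loglik, theta0):
        """Maximize the diffuse likelihood by minimizing -2 log L_d,
        then form approximate standard errors from the Hessian."""
        res = minimize(neg2_loglik, theta0, method="BFGS")
        theta_hat = res.x
        # The Hessian H of log L_d is -0.5 times the Hessian of -2 log L_d,
        # so cov(theta_hat) = -H^{-1} = 2 * inv(Hessian of -2 log L_d).
        H_neg2 = numerical_hessian(neg2_loglik, theta_hat)
        cov = 2.0 * np.linalg.inv(H_neg2)
        std_err = np.sqrt(np.diag(cov))
        return theta_hat, std_err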

Let $ \mr{dim}(\theta )$ denote the dimension of the parameter vector $\pmb {\theta }$. After the parameter estimation is completed, PROC SSM prints the "Likelihood Computation Summary" table, which summarizes the likelihood calculations at $\hat{\pmb {\theta }}$, as shown in Table 34.6.

Table 34.6: Likelihood Computation Summary

Quantity | Formula
Nonmissing response values used | $N$
Estimated parameters | $ \mr{dim}(\theta ) $
Initialized diffuse state elements | $\mr{rank}(\mb{S}_{n, p_{n}})$
Normalized residual sum of squares | $\sum _{t=1}^{n} \sum _{i=1}^{q*p_{t}} ( \frac{\nu _{t,i}^{2} }{ F_{t, i} } ) - \mb{b}_{n, p_{n}}^{'} \mb{S}_{n, p_{n}}^{-1} \mb{b}_{n, p_{n}}$
Diffuse log likelihood | $ \log \mb{L}_{d}( \mb{Y}, \hat{\pmb {\theta }} ) $
Profile log likelihood | $ \log \mb{L}_{p}( \mb{Y}, \hat{\pmb {\theta }} ) $


In addition, a variety of information criteria, computed from both the diffuse likelihood and the profile likelihood, are reported. All these criteria are functions of $-2 \log \mb{L}$, twice the negative log likelihood (where the likelihood can be either diffuse or profile); $N_{*}$, the effective sample size; and $\mi{nparm}$, the effective number of model parameters. For the information criteria based on the diffuse likelihood, the effective sample size is $N_{*} = N_{0}$ and the effective number of model parameters is $\mi{nparm} = \mr{dim}(\theta )$. For the information criteria based on the profile likelihood, the effective sample size is $N_{*} = N$ and the effective number of model parameters is $\mi{nparm} = \mr{dim}(\theta ) + d + k + g$. Table 34.7 summarizes the reported information criteria in smaller-is-better form; an illustrative computation follows the table.

Table 34.7: Information Criteria

Criterion | Formula | Reference
AIC | $-2 \log \mb{L} + 2 \mi{nparm} $ | Akaike (1974)
AICC | $-2 \log \mb{L} + 2 \mi{nparm} N_{*}/(N_{*} - \mi{nparm} -1)$ | Hurvich and Tsai (1989); Burnham and Anderson (1998)
HQIC | $-2 \log \mb{L} + 2 \mi{nparm} \log \log (N_{*} )$ | Hannan and Quinn (1979)
BIC | $-2 \log \mb{L} + \mi{nparm} \log ( N_{*} )$ | Schwarz (1978)
CAIC | $-2 \log \mb{L} + \mi{nparm} (\log ( N_{*} ) + 1)$ | Bozdogan (1987)
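As an illustration only (not PROC SSM code), the following Python sketch evaluates the formulas in Table 34.7 for a given value of $-2 \log \mb{L}$, an effective sample size $N_{*}$, and an effective parameter count $\mi{nparm}$; the function name and interface are hypothetical.

    import numpy as np

    def info_criteria(neg2_loglik, n_eff, nparm):
        """Smaller-is-better criteria from Table 34.7.
        neg2_loglik: -2 log L (diffuse or profile).
        n_eff: effective sample size N* (N0 for the diffuse likelihood,
               N for the profile likelihood).
        nparm: effective number of parameters (dim(theta) for the diffuse
               likelihood, dim(theta) + d + k + g for the profile likelihood)."""
        return {
            "AIC":  neg2_loglik + 2 * nparm,
            "AICC": neg2_loglik + 2 * nparm * n_eff / (n_eff - nparm - 1),
            "HQIC": neg2_loglik + 2 * nparm * np.log(np.log(n_eff)),
            "BIC":  neg2_loglik + nparm * np.log(n_eff),
            "CAIC": neg2_loglik + nparm * (np.log(n_eff) + 1),
        }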