The MI Procedure

MCMC Method for Arbitrary Missing Multivariate Normal Data

The Markov chain Monte Carlo (MCMC) method originated in physics as a tool for exploring equilibrium distributions of interacting molecules. In statistical applications, it is used to generate pseudorandom draws from multidimensional and otherwise intractable probability distributions via Markov chains. A Markov chain is a sequence of random variables in which the distribution of each element depends only on the value of the previous element.

In MCMC simulation, you constructs a Markov chain long enough for the distribution of the elements to stabilize to a stationary distribution, which is the distribution of interest. Repeatedly simulating steps of the chain simulates draws from the distribution of interest. See Schafer (1997) for a detailed discussion of this method.

In Bayesian inference, information about unknown parameters is expressed in the form of a posterior probability distribution. This posterior distribution is computed using Bayes’ theorem,

\[ p(\btheta | y) = \frac{p( y | \btheta ) p( \btheta )}{ \int p( y | \btheta ) p(\btheta ) d \btheta } \]

MCMC has been applied as a method for exploring posterior distributions in Bayesian inference. That is, through MCMC, you can simulate the entire joint posterior distribution of the unknown quantities and obtain simulation-based estimates of posterior parameters that are of interest.

In many incomplete-data problems, the observed-data posterior $p(\btheta | Y_{\mathit{obs}})$ is intractable and cannot easily be simulated. However, when $Y_{\mathit{obs}}$ is augmented by an estimated or simulated value of the missing data $Y_{\mathit{mis}}$, the complete-data posterior $p(\btheta | Y_{\mathit{obs}}, Y_{\mathit{mis}})$ is much easier to simulate. Assuming that the data are from a multivariate normal distribution, data augmentation can be applied to Bayesian inference with missing data by repeating the following steps:

1. The imputation I-step Given an estimated mean vector and covariance matrix, the I-step simulates the missing values for each observation independently. That is, if you denote the variables with missing values for observation i by $Y_{i(\mi{mis})}$ and the variables with observed values by $Y_{i(\mi{obs})}$, then the I-step draws values for $Y_{i(\mi{mis})}$ from a conditional distribution for $Y_{i(\mi{mis})}$ given $Y_{i(\mi{obs})}$.

2. The posterior P-step Given a complete sample, the P-step simulates the posterior population mean vector and covariance matrix. These new estimates are then used in the next I-step. Without prior information about the parameters, a noninformative prior is used. You can also use other informative priors. For example, a prior information about the covariance matrix can help to stabilize the inference about the mean vector for a near singular covariance matrix.

That is, with a current parameter estimate $\btheta ^{(t)}$ at the tth iteration, the I-step draws $Y_{\mathit{mis}}^{(t+1)}$ from $p( Y_{\mathit{mis}} | Y_{\mathit{obs}}, \btheta ^{(t)} )$ and the P-step draws $\btheta ^{(t+1)}$ from $p( \btheta | Y_{\mathit{obs}}, Y_{\mathit{mis}}^{(t+1)} )$. The two steps are iterated long enough for the results to reliably simulate an approximately independent draw of the missing values for a multiply imputed data set (Schafer 1997).

This creates a Markov chain $(Y_{\mathit{mis}}^{(1)}, \btheta ^{(1)})$ , $(Y_{\mathit{mis}}^{(2)}, \btheta ^{(2)})$ , …, which converges in distribution to $p( Y_{\mathit{mis}}, \btheta | Y_{\mathit{obs}})$. Assuming the iterates converge to a stationary distribution, the goal is to simulate an approximately independent draw of the missing values from this distribution.

To validate the imputation results, you should repeat the process with different random number generators and starting values based on different initial parameter estimates.

The next three sections provide details for the imputation step, Bayesian estimation of the mean vector and covariance matrix, and the posterior step.

Imputation Step

In each iteration, starting with a given mean vector $\bmu $ and covariance matrix $\bSigma $, the imputation step draws values for the missing data from the conditional distribution $Y_{\mathit{mis}}$ given $Y_{\mathit{obs}}$.

Suppose $ \bmu = \left[ \bmu _{1}’, \bmu _{2}’ \right] ’$ is the partitioned mean vector of two sets of variables, $Y_{\mathit{obs}}$ and $Y_{\mathit{mis}}$, where $\bmu _{1}$ is the mean vector for variables $Y_{\mathit{obs}}$ and $\bmu _{2}$ is the mean vector for variables $Y_{\mathit{mis}}$.

Also suppose

\begin{eqnarray*} \bSigma & = & \left[ \begin{array}{rrr} \bSigma _{11} & \bSigma _{12} \\ \bSigma _{12}’ & \bSigma _{22} \\ \end{array} \right] \end{eqnarray*}

is the partitioned covariance matrix for these variables, where $\bSigma _{11}$ is the covariance matrix for variables $Y_{\mathit{obs}}$, $\bSigma _{22}$ is the covariance matrix for variables $Y_{\mathit{mis}}$, and $\bSigma _{12}$ is the covariance matrix between variables $Y_{\mathit{obs}}$ and variables $Y_{\mathit{mis}}$.

By using the sweep operator (Goodnight 1979) on the pivots of the $\bSigma _{11}$ submatrix, the matrix becomes

\begin{eqnarray*} \left[ \begin{array}{rrr} \bSigma _{11}^{-1} & \bSigma _{11}^{-1} \bSigma _{12} \\ -\bSigma _{12}’ \bSigma _{11}^{-1} & \bSigma _{22.1} \\ \end{array} \right] \end{eqnarray*}

where $\bSigma _{22.1} = \bSigma _{22} - \bSigma _{12}’ \bSigma _{11}^{-1} \bSigma _{12}$ can be used to compute the conditional covariance matrix of $\mb{Y}_{\mathit{mis}}$ after controlling for $\mb{Y}_{\mathit{obs}}$.

For an observation with the preceding missing pattern, the conditional distribution of $Y_{\mathit{mis}}$ given $Y_{\mathit{obs}}=\mb{y}_{1}$ is a multivariate normal distribution with the mean vector

\[ \bmu _{2.1} = \bmu _{2} + \bSigma _{12}’ \bSigma _{11}^{-1} ( \mb{y}_{1} - \bmu _{1} ) \]

and the conditional covariance matrix

\[ \bSigma _{22.1} = \bSigma _{22} - \bSigma _{12}’ \bSigma _{11}^{-1} \bSigma _{12} \]

Bayesian Estimation of the Mean Vector and Covariance Matrix

Suppose that $\mb{Y}= (\,  \mb{y}_{1}’, \mb{y}_{2}’, \ldots , \mb{y}_{n}’ \, )’$ is an $(n{\times }p)$ matrix made up of n $(p{\times }1)$ independent vectors $\mb{y}_{i}$, each of which has a multivariate normal distribution with mean zero and covariance matrix $\bLambda $. Then the SSCP matrix

\[ \mb{A}= \mb{Y}’\mb{Y} = \sum _{i} { \mb{y}_{i} \mb{y}_{i}’ } \]

has a Wishart distribution $W( n, \bLambda )$.

When each observation $\mb{y}_{i}$ is distributed with a multivariate normal distribution with an unknown mean $\bmu $, then the CSSCP matrix

\[ \mb{A}= \sum _{i} { ( \mb{y}_{i} - \overline{\mb{y}} ) ( \mb{y}_{i} - \overline{\mb{y}} )’ } \]

has a Wishart distribution $W( n-1, \bLambda )$.

If $\mb{A}$ has a Wishart distribution $W( n, \bLambda )$, then $\mb{B}= \mb{A}^{-1}$ has an inverted Wishart distribution $W^{-1}( n, \bPsi )$, where n is the degrees of freedom and $\bPsi = \bLambda ^{-1}$ is the precision matrix (Anderson 1984).

Note that, instead of using the parameter $\bPsi =\bLambda ^{-1}$ for the inverted Wishart distribution, Schafer (1997) uses the parameter $\bLambda $.

Suppose that each observation in the data matrix $\mb{Y}$ has a multivariate normal distribution with mean $\bmu $ and covariance matrix $\bSigma $. Then with a prior inverted Wishart distribution for $\bSigma $ and a prior normal distribution for $\bmu $

\begin{eqnarray*} \bSigma \quad \sim & & W^{-1} \left( \, m, \, \bPsi \right) \\ \bmu | \bSigma \quad \sim & & N \left( \, \bmu _{0}, \, \, \frac{1}{\tau } \bSigma \right) \end{eqnarray*}

where $\tau > 0$ is a fixed number.

The posterior distribution (Anderson 1984, p. 270; Schafer 1997, p. 152) is

\begin{eqnarray*} \bSigma | \Strong{Y} \quad \sim & & W^{-1} \left( \, n+m, \, \, (n-1) \Strong{S} + \bPsi + \frac{n \tau }{n + \tau } (\overline{\Strong{y}} - \bmu _{0}) (\overline{\Strong{y}} - \bmu _{0})’ \right) \\ \bmu | ( \bSigma , \Strong{Y}) \quad \sim & & N \left( \, \, \frac{1}{n+\tau } \, (n \overline{\Strong{y}} + \tau \bmu _{0}), \, \, \frac{1}{n+\tau } \bSigma \right) \end{eqnarray*}

where $(n-1) \mb{S}$ is the CSSCP matrix.

Posterior Step

In each iteration, the posterior step simulates the posterior population mean vector $\bmu $ and covariance matrix $\bSigma $ from prior information for $\bmu $ and $\bSigma $, and the complete sample estimates.

You can specify the prior parameter information by using one of the following methods:

  • PRIOR=JEFFREYS, which uses a noninformative prior

  • PRIOR=INPUT=, which provides a prior information for $\bSigma $ in the data set. Optionally, it also provides a prior information for $\bmu $ in the data set.

  • PRIOR=RIDGE=, which uses a ridge prior

The next four subsections provide details of the posterior step for different prior distributions.

1. A Noninformative Prior

Without prior information about the mean and covariance estimates, you can use a noninformative prior by specifying the PRIOR=JEFFREYS option. The posterior distributions (Schafer 1997, p. 154) are

\begin{eqnarray*} \bSigma ^{(t+1)} | \Strong{Y} \quad \sim & & W^{-1} \left( \, n-1, \, (n-1)\Strong{S} \right) \\ \bmu ^{(t+1)} | ( \bSigma ^{(t+1)}, \Strong{Y}) \quad \sim & & N \left( \, \overline{\Strong{y}}, \, \, \frac{1}{\, n \, } \bSigma ^{(t+1)} \right) \end{eqnarray*}
2. An Informative Prior for $\bmu $ and $\bSigma $

When prior information is available for the parameters $\bmu $ and $\bSigma $, you can provide it with a SAS data set that you specify with the PRIOR=INPUT= option:

\begin{eqnarray*} \bSigma \quad \sim & & W^{-1} \left( \, d^{*}, \, d^{*}\Strong{S}^{*} \right) \\ \bmu | \bSigma \quad \sim & & N \left( \, \bmu _{0}, \, \, \frac{1}{n_{0}} \bSigma \right) \end{eqnarray*}

To obtain the prior distribution for $\bSigma $, PROC MI reads the matrix $\mb{S}^{*}$ from observations in the data set with _TYPE_=‘COV’, and it reads $n^{*}=d^{*}+1$ from observations with _TYPE_=‘N’.

To obtain the prior distribution for $\bmu $, PROC MI reads the mean vector $ \bmu _{0}$ from observations with _TYPE_=‘MEAN’, and it reads $n_{0}$ from observations with _TYPE_=‘N_MEAN’. When there are no observations with _TYPE_=‘N_MEAN’, PROC MI reads $n_{0}$ from observations with _TYPE_=‘N’.

The resulting posterior distribution, as described in the section Bayesian Estimation of the Mean Vector and Covariance Matrix, is given by

\begin{eqnarray*} \bSigma ^{(t+1)} | \Strong{Y} \quad \sim & & W^{-1} \left( \, n+d^{*}, \, \, (n-1)\Strong{S} + d^{*}\Strong{S}^{*} + \Strong{S}_{m} \right) \\ \bmu ^{(t+1)} \, | \, \left( \bSigma ^{(t+1)}, \Strong{Y} \right) \quad \sim & & N \left( \, \, \frac{1}{n+n_{0}} \, (n \overline{\Strong{y}} + n_{0} \bmu _{0}), \, \, \frac{1}{n+n_{0}} \bSigma ^{(t+1)} \right) \end{eqnarray*}

where

\[ \mb{S}_{m} = \frac{n n_{0}}{n + n_{0}} (\overline{\mb{y}} - \bmu _{0}) (\overline{\mb{y}} - \bmu _{0})’ \]
3. An Informative Prior for $\bSigma $

When the sample covariance matrix $\mb{S}$ is singular or near singular, prior information about $\bSigma $ can also be used without prior information about $\bmu $ to stabilize the inference about $\bmu $. You can provide it with a SAS data set that you specify with the PRIOR=INPUT= option.

To obtain the prior distribution for $\bSigma $, PROC MI reads the matrix $\mb{S}^{*}$ from observations in the data set with _TYPE_=‘COV’, and it reads $n^{*}$ from observations with _TYPE_=‘N’.

The resulting posterior distribution for $( \bmu , \bSigma )$ (Schafer 1997, p. 156) is

\begin{eqnarray*} \bSigma ^{(t+1)} | \Strong{Y} \quad \sim & & W^{-1} \left( \, n+d^*, \, \, (n-1) \Strong{S} + d^* \Strong{S}^{*} \right) \\ \bmu ^{(t+1)} \, | \, \left( \bSigma ^{(t+1)}, \Strong{Y} \right) \quad \sim & & N \left( \, \overline{\Strong{y}}, \, \, \frac{1}{\, \, n \, \, } \, \bSigma ^{(t+1)} \right) \end{eqnarray*}

Note that if the PRIOR=INPUT= data set also contains observations with _TYPE_=‘MEAN’, then a complete informative prior for both $\bmu $ and $\bSigma $ will be used.

4. A Ridge Prior

A special case of the preceding adjustment is a ridge prior with $\mb{S}^*$ = Diag $(\mb{S})$ (Schafer 1997, p. 156). That is, $\mb{S}^*$ is a diagonal matrix with diagonal elements equal to the corresponding elements in $\mb{S}$.

You can request a ridge prior by using the PRIOR=RIDGE= option. You can explicitly specify the number $d^{*} \geq 1$ in the PRIOR=RIDGE=$d^{*}$ option. Or you can implicitly specify the number by specifying the proportion p in the PRIOR=RIDGE=p option to request $d^{*}= (n-1) p$.

The posterior is then given by

\begin{eqnarray*} \bSigma ^{(t+1)} | \Strong{Y} \quad \sim & & W^{-1} \left( \, n + d^{*}, \, \, (n-1) \Strong{S} + d^{*} \textrm{Diag} (\Strong{S}) \, \right) \\ \bmu ^{(t+1)} \, \left| \, \left( \bSigma ^{(t+1)}, \Strong{Y} \right) \right. \quad \sim & & N \left( \, \overline{\Strong{y}}, \, \frac{1}{\, \, n \, \, } \, \bSigma ^{(t+1)} \right) \end{eqnarray*}