The MI Procedure

Monotone and FCS Regression Methods

The regression method is the default imputation method in the MONOTONE and FCS statements for continuous variables.
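For example, the following statements request the regression method within an FCS imputation model. This is a minimal sketch: the input data set Have, its variables, the seed, and the number of imputations are hypothetical placeholders.

   proc mi data=Have nimpute=5 seed=1305417 out=MiOut;
      fcs reg(Y3 = X1 X2 Y1 Y2);   /* regression model for imputing Y3 */
      var X1 X2 Y1 Y2 Y3;
   run;

Because regression is the default method for continuous variables, any other continuous variables in the VAR statement that have missing values are also imputed by the regression method, by default with the remaining VAR variables as their covariates.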

In the regression method, a regression model is fitted for a continuous variable with the covariates constructed from a set of effects. Based on the fitted regression model, a new regression model is simulated from the posterior predictive distribution of the parameters and is used to impute the missing values for each variable (Rubin 1987, pp. 166–167). That is, for a continuous variable $Y_{j}$ with missing values, a model

\[ Y_{j} = {\beta }_{0} + {\beta }_{1} \, X_{1} + {\beta }_{2} \, X_{2} + \ldots + {\beta }_{k} \, X_{k} \]

is fitted using observations with observed values for the variable $Y_{j}$ and its covariates $X_{1}$, $X_{2}$, …, $X_{k}$.

The fitted model includes the regression parameter estimates $\hat{\bbeta } = (\hat{\beta }_{0}, \hat{\beta }_{1}, \ldots , \hat{\beta }_{k})$ and the associated covariance matrix $\hat{\sigma }_{j}^2 \mb{V}_{j}$, where $\mb{V}_{j}$ is the usual $(\mb{X}'\mb{X})^{-1}$ matrix derived from the intercept and covariates $X_{1}$, $X_{2}$, …, $X_{k}$.
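Although PROC MI computes these quantities internally, the same $\hat{\bbeta }$ and $\hat{\sigma }_{j}^2 \mb{V}_{j}$ can be reproduced for intuition by fitting the regression on the complete cases. The following sketch does so with PROC REG, using a hypothetical data set Have in which Y3 plays the role of $Y_{j}$ and X1 and X2 are the covariates:

   proc reg data=Have outest=Est covout noprint;
      model Y3 = X1 X2;   /* only observations with Y3, X1, and X2 observed are used */
   run;
   quit;

The OUTEST= data set then contains the parameter estimates, and the COVOUT option adds the covariance matrix of the estimates, which corresponds to $\hat{\sigma }_{j}^2 \mb{V}_{j}$.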

The following steps are used to generate imputed values for each imputation (a sketch of these computations appears after the steps):

  1. New parameters $\bbeta _{*} = ({\beta }_{*0}, {\beta }_{*1}, \ldots , {\beta }_{*(k)})$ and ${\sigma }_{*j}^2$ are drawn from the posterior predictive distribution of the parameters. That is, they are simulated from $(\hat{\beta }_{0}, \hat{\beta }_{1}, \ldots , \hat{\beta }_{k})$, $\hat{\sigma }_{j}^2$, and $\mb{V}_{j}$. The variance is drawn as

    \[ {\sigma }_{*j}^2 = \hat{\sigma }_{j}^2 (n_{j}-k-1) / g \]

    where $g$ is a ${\chi }_{n_{j}-k-1}^{2}$ random variate and $n_{j}$ is the number of nonmissing observations for $Y_{j}$. The regression coefficients are drawn as

    \[ \bbeta _{*} = \hat{\bbeta } + {\sigma }_{*j} \mb{V}_{hj}' \mb{Z} \]

    where $\mb{V}_{hj}$ is the upper triangular matrix in the Cholesky decomposition, $\mb{V}_{j} = \mb{V}_{hj}' \mb{V}_{hj}$, and $\mb{Z}$ is a vector of $k+1$ independent random normal variates.

  2. The missing values are then replaced by

    \[ {\beta }_{*0} + {\beta }_{*1} \, x_{1} + {\beta }_{*2} \, x_{2} + \ldots + {\beta }_{*(k)} \, x_{k} + z_{i} \, {\sigma }_{*j} \]

    where $x_{1}, x_{2}, \ldots , x_{k}$ are the values of the covariates and $z_{i}$ is a simulated normal deviate.
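As a rough illustration of these two steps, the following SAS/IML sketch performs the draws for a single imputation of $Y_{j}$. It is not PROC MI's internal code: the values of beta_hat, sigma2_hat, V, nj, k, Xmis, and the seed are hypothetical placeholders for $\hat{\bbeta }$, $\hat{\sigma }_{j}^2$, $\mb{V}_{j}$, $n_{j}$, $k$, and the intercept and covariate values at the observations where $Y_{j}$ is missing.

   proc iml;
      call randseed(1305417);

      /* Hypothetical fitted-model inputs for illustration (k = 2 covariates) */
      k          = 2;
      nj         = 20;                      /* nonmissing observations for Y_j   */
      beta_hat   = {10, 0.5, -1.2};         /* (k+1) x 1 parameter estimates     */
      sigma2_hat = 4;                       /* estimate of sigma_j^2             */
      V          = {0.30 -0.02  0.01,
                   -0.02  0.05 -0.01,
                    0.01 -0.01  0.04};      /* (X'X) inverse matrix V_j          */
      Xmis       = {1 2.0 1.5,
                    1 3.5 0.8};             /* [1 x1 x2] rows where Y_j missing  */

      /* Step 1: draw sigma^2_{*j} and beta_{*} from the posterior */
      g = j(1, 1, .);
      call randgen(g, "Chisquare", nj - k - 1);
      sigma2_star = sigma2_hat * (nj - k - 1) / g;

      Vh = root(V);                         /* upper triangular, V = t(Vh)*Vh    */
      z = j(k + 1, 1, .);
      call randgen(z, "Normal");            /* k+1 independent N(0,1) variates   */
      beta_star = beta_hat + sqrt(sigma2_star) * t(Vh) * z;

      /* Step 2: simulated linear predictor plus a scaled normal deviate */
      zi = j(nrow(Xmis), 1, .);
      call randgen(zi, "Normal");
      y_imputed = Xmis * beta_star + sqrt(sigma2_star) # zi;

      print sigma2_star beta_star y_imputed;
   quit;

PROC MI repeats these draws independently for each requested imputation, so each imputed data set reflects a different set of simulated parameters.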