The MI Procedure

FCS Methods for Data Sets with Arbitrary Missing Patterns

For a data set with an arbitrary missing data pattern, you can use FCS methods to impute missing values for all variables, assuming the existence of a joint distribution for these variables (Brand, 1999; van Buuren, 2007). An FCS method involves two phases in each imputation: a preliminary filled-in phase followed by an imputation phase.

In the filled-in phase, the missing values for the variables are filled in sequentially, one variable at a time. The missing values for each variable are filled in by using the specified method (or the default method for the variable if no method is specified), with the preceding variables serving as covariates. These filled-in values provide starting values for the missing values in the imputation phase.

In the imputation phase, the missing values for each variable are imputed at each iteration by using the specified method and covariates; the default method for the variable is used if no method is specified, and the remaining variables are used as covariates if no set of covariates is specified. At each iteration, the missing values are imputed sequentially, one variable at a time. After the number of iterations specified in the NBITER= option, the imputed values at the final iteration form the imputed data set.

The MI procedure orders the variables as they are ordered in the VAR statement. For example, if the order of the p variables in the VAR statement is $Y_{1}$, $Y_{2}$, …, $Y_{p}$, then $Y_{1}$, $Y_{2}$, …, $Y_{p}$ (in that order) are used in the filled-in and imputation phases.

The filled-in phase replaces missing values with filled-in values for each variable. That is, with p variables $Y_{1}$, $Y_{2}$, …, $Y_{p}$ (in that order), the missing values are filled in by using the sequence,

\begin{align*}
\btheta^{(0)}_{1} & \sim P(\, \btheta_{1} \, |\, Y_{1(\mi{obs})} \,) \\
Y^{(0)}_{1(*)} & \sim P(\, Y_{1} \, |\, \btheta^{(0)}_{1} \,) \\
Y^{(0)}_{1} & = (Y_{1(\mi{obs})}, Y^{(0)}_{1(*)}) \\
& \;\;\vdots \\
\btheta^{(0)}_{p} & \sim P(\, \btheta_{p} \, |\, Y^{(0)}_{1}, \ldots, Y^{(0)}_{p-1}, Y_{p(\mi{obs})} \,) \\
Y^{(0)}_{p(*)} & \sim P(\, Y_{p} \, |\, \btheta^{(0)}_{p} \,) \\
Y^{(0)}_{p} & = (Y_{p(\mi{obs})}, Y^{(0)}_{p(*)})
\end{align*}

where $Y_{j(\mi {obs})}$ is the set of observed $Y_ j$ values, $Y^{(0)}_{j(*)}$ is the set of filled-in $Y_ j$ values, $Y^{(0)}_{j}$ is the set of both observed and filled-in $Y_ j$ values, and $\btheta ^{(0)}_{j}$ is the set of simulated parameters for the conditional distribution of $Y_{j}$ given variables $Y_{1}$, $Y_{2}$, …, $Y_{j-1}$.

For each variable $Y_{j}$ that has missing values, the corresponding imputation method is used to fit the model with covariates $Y_{1}, Y_{2}, \ldots , Y_{j-1}$. The observations that have observed $Y_{j}$ values, which might include filled-in values for $Y_{1}, Y_{2}, \ldots , Y_{j-1}$, are used in the model fitting. From this fitted model, new model parameters are drawn and then used to impute the missing values for $Y_{j}$.
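As a concrete illustration, the filled-in phase for continuous variables under the regression method can be sketched in Python. This is a simplified sketch, not PROC MI's exact algorithm: the draw of $\btheta ^{(0)}_{j}$ from its posterior is approximated here by adding residual noise to least-squares predictions, and the function name and structure are illustrative only.

```python
import numpy as np

def fill_in_phase(Y, rng):
    # Y: (n, p) array with np.nan marking missing entries.
    # Variables are processed in VAR-statement order: variable j is
    # regressed on variables 0..j-1, which are already complete
    # (observed or filled in) by the time j is reached.
    n, p = Y.shape
    Y = Y.copy()
    for j in range(p):
        miss = np.isnan(Y[:, j])
        if not miss.any():
            continue
        obs = ~miss
        X = np.column_stack([np.ones(n), Y[:, :j]])  # intercept + preceding vars
        # Fit the regression on rows where Y_j is observed
        beta, *_ = np.linalg.lstsq(X[obs], Y[obs, j], rcond=None)
        resid = Y[obs, j] - X[obs] @ beta
        sigma = resid.std()
        # Draw filled-in values from the fitted predictive distribution
        Y[miss, j] = X[miss] @ beta + rng.normal(0.0, sigma, miss.sum())
    return Y
```

After this pass, every variable is complete, which is what allows the imputation phase to condition each variable on all of the others.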

The imputation phase replaces the filled-in values $Y^{(0)}_{j(*)}$ with imputed values for each variable sequentially at each iteration. That is, with p variables $Y_{1}$, $Y_{2}$, …, $Y_{p}$ (in that order), the missing values at iteration t + 1 are imputed by using the sequence

\begin{align*}
\btheta^{(t+1)}_{1} & \sim P(\, \btheta_{1} \, |\, Y_{1(\mi{obs})}, Y^{(t)}_{2}, \ldots, Y^{(t)}_{p} \,) \\
Y^{(t+1)}_{1(*)} & \sim P(\, Y_{1} \, |\, \btheta^{(t+1)}_{1} \,) \\
Y^{(t+1)}_{1} & = (Y_{1(\mi{obs})}, Y^{(t+1)}_{1(*)}) \\
& \;\;\vdots \\
\btheta^{(t+1)}_{p} & \sim P(\, \btheta_{p} \, |\, Y^{(t+1)}_{1}, \ldots, Y^{(t+1)}_{p-1}, Y_{p(\mi{obs})} \,) \\
Y^{(t+1)}_{p(*)} & \sim P(\, Y_{p} \, |\, \btheta^{(t+1)}_{p} \,) \\
Y^{(t+1)}_{p} & = (Y_{p(\mi{obs})}, Y^{(t+1)}_{p(*)})
\end{align*}

where $Y_{j(\mi {obs})}$ is the set of observed $Y_ j$ values, $Y^{(t+1)}_{j(*)}$ is the set of imputed $Y_ j$ values at iteration t + 1, $Y^{(t)}_{j(*)}$ is the set of filled-in $Y_ j$ values (t = 0) or the set of imputed $Y_ j$ values at iteration t (t > 0), $Y^{(t+1)}_{j}$ is the set of both observed and imputed $Y_ j$ values at iteration t + 1, and $\btheta ^{(t+1)}_{j}$ is the set of simulated parameters for the conditional distribution of $Y_{j}$ given covariates constructed from $Y_{1}$, …, $Y_{j-1}$, $Y_{j+1}$, …, $Y_{p}$.

At each iteration, the specified model for each variable that has missing values is fitted by using the observations that have observed values for that variable, which might include imputed values for the other variables. From this fitted model, new model parameters are drawn and then used to impute the missing values for that variable.
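One iteration of this imputation phase can be sketched in the same simplified style as before (again a regression-method sketch with approximate parameter draws, not PROC MI's exact algorithm; the function name and structure are illustrative only):

```python
import numpy as np

def fcs_iteration(Y, missing_mask, rng):
    # Y: (n, p) array with no NaNs; originally missing entries hold the
    # current filled-in or imputed values. missing_mask marks which
    # entries were originally missing.
    n, p = Y.shape
    for j in range(p):
        miss = missing_mask[:, j]
        if not miss.any():
            continue
        obs = ~miss
        # All other variables, with their current imputed values,
        # serve as covariates for Y_j.
        others = [k for k in range(p) if k != j]
        X = np.column_stack([np.ones(n), Y[:, others]])
        # Fit on rows where Y_j was observed
        beta, *_ = np.linalg.lstsq(X[obs], Y[obs, j], rcond=None)
        resid = Y[obs, j] - X[obs] @ beta
        sigma = resid.std()
        # Replace the previous imputations for Y_j with new draws
        Y[miss, j] = X[miss] @ beta + rng.normal(0.0, sigma, miss.sum())
    return Y
```

Running this function repeatedly, starting from the output of the filled-in phase, corresponds to iterating the sequence above; the values held after the final iteration form the imputed data set.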

The steps are iterated long enough for the results to reliably simulate an approximately independent draw of the missing values for an imputed data set.

The imputation methods used in the filled-in and imputation phases are similar to the corresponding monotone methods for monotone missing data. You can use a regression method or a predictive mean matching method to impute missing values for a continuous variable, a logistic regression method for a classification variable with a binary or ordinal response, and a discriminant function method for a classification variable with a binary or nominal response. See the sections Monotone and FCS Regression Methods, Monotone and FCS Predictive Mean Matching Methods, Monotone and FCS Discriminant Function Methods, and Monotone and FCS Logistic Regression Methods for these methods.
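For example, the matching step of predictive mean matching can be sketched as follows. This is a simplified illustration of the matching step only, not PROC MI's implementation; `k`, the number of closest observed cases to draw from, is an assumed parameter here.

```python
import numpy as np

def pmm_draw(y_obs, yhat_obs, yhat_miss, rng, k=5):
    # For each missing case, find the k observed cases whose predicted
    # means are closest to the missing case's predicted mean, then draw
    # the imputed value at random from their observed y values.
    imputed = np.empty(len(yhat_miss))
    for i, m in enumerate(yhat_miss):
        nearest = np.argsort(np.abs(yhat_obs - m))[:k]
        imputed[i] = y_obs[nearest[rng.integers(len(nearest))]]
    return imputed
```

Because each imputed value is an actually observed value, predictive mean matching keeps imputations within the observed range of the variable.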

The FCS method requires fewer iterations than the MCMC method (van Buuren, 2007). Often, as few as five or ten iterations are enough to produce satisfactory results (Brand, 1999; van Buuren, 2007).