Most SAS statistical procedures exclude observations with any missing variable values from the analysis.
These observations are called incomplete cases. Although analyzing only complete cases has the advantage
of simplicity, the information contained in the incomplete cases is lost. This approach also ignores possible
systematic differences between the complete cases and the incomplete cases, and the resulting inference might
not be applicable to the population of all cases, especially with a small number of complete cases.
Another strategy for handling missing data is multiple imputation, which
replaces each missing value with a set of plausible values that represent the uncertainty about the right value
to impute. The multiply imputed data sets are then analyzed by using standard procedures for complete data,
and the results from these analyses are combined. No matter which complete-data analysis is used, the process of combining
results from different data sets is essentially the same.
The SAS/STAT missing data analysis procedures include the following:
- CALIS Procedure — fits structural equation models
- GEE Procedure — applies the generalized estimating equations approach to generalized linear models
- MCMC Procedure — a general-purpose Markov chain Monte Carlo (MCMC) simulation procedure that is designed to fit Bayesian models with arbitrary priors and likelihood functions
- MI Procedure — performs multiple imputation of missing data
- MIANALYZE Procedure — combines the results of the analyses of imputations and generates valid statistical inferences
- SURVEYIMPUTE Procedure — imputes missing values of an item in a data set by replacing them with observed values from the same item, and computes replicate weights (such as jackknife weights) that account for the imputation
CALIS Procedure
The CALIS procedure fits structural equation models, which express relationships among a system of variables
that can be either observed variables (manifest variables) or unobserved hypothetical variables (latent variables).
PROC CALIS enables you to do the following:
- estimate parameters and test hypotheses for constrained and unconstrained problems in the following:
  - multiple and multivariate linear regression
  - linear measurement-error models
  - path analysis and causal modeling
  - simultaneous equation models with reciprocal causation
  - exploratory and confirmatory factor analysis of any order
  - canonical correlation
  - a wide variety of other (non)linear latent variable models
- estimate parameters by using the following criteria:
  - unweighted least squares
  - generalized least squares
  - weighted least squares
  - diagonally weighted least squares
  - maximum likelihood
  - full information maximum likelihood
  - maximum likelihood with the Satorra-Bentler scaled model fit chi-square statistic and sandwich-type standard error estimation
  - robust estimation with maximum likelihood model evaluation
- specify models by using the following modeling languages:
  - FACTOR — supports the input of factor-variable relations
  - LINEQS — uses equations to describe variable relationships
  - LISMOD — utilizes LISREL model matrices for defining models
  - MSTRUCT — supports direct parameterization in the mean and covariance matrices
  - PATH — provides an intuitive causal path specification interface
  - RAM — utilizes the formulation of the reticular action model
  - REFMODEL — provides a quick way for model referencing and respecification
- choose among the following optimization algorithms:
  - Levenberg-Marquardt algorithm (Moré 1978)
  - trust-region algorithm (Gay 1983)
  - Newton-Raphson algorithm with line search
  - ridge-stabilized Newton-Raphson algorithm
  - various quasi-Newton and dual quasi-Newton algorithms: Broyden-Fletcher-Goldfarb-Shanno and Davidon-Fletcher-Powell, including a sequential quadratic programming algorithm for processing nonlinear equality and inequality constraints
  - various conjugate gradient algorithms: the automatic restart algorithm of Powell (1977), Fletcher-Reeves, Polak-Ribière, and the conjugate descent algorithm of Fletcher (1980)

- use the following methods to automatically generate initial values for the optimization process:
  - two-stage least squares estimation
  - instrumental variable factor analysis
  - approximate factor analysis
  - ordinary least squares estimation
  - McDonald's method (McDonald and Hartmann 1992)
- formulate general equality and inequality constraints by using programming statements
- specify free unnamed parameters in all models
- analyze linear dependencies in the information matrix (approximate Hessian matrix), which might be helpful in detecting unidentified models
- perform multiple-group analysis; multiple models can also be fitted to the groups simultaneously
- specify linear and nonlinear equality and inequality constraints on the parameters with several different statements, depending on the type of input
- compute Lagrange multiplier test indices for simple constant and equality parameter constraints and for active boundary constraints
- produce a SAS data set that contains information about the optimal parameter estimates (parameter estimates, gradient, Hessian, projected Hessian and Hessian of the Lagrange function for constrained optimization, the information matrix, and standard errors)
- produce a SAS data set that contains residuals and, for exploratory factor analysis, the rotated and unrotated factor loadings
- perform analysis of multiple samples with equal sample size by analyzing a moment supermatrix that contains the individual moment matrices as block diagonal submatrices
- perform residual analysis at the case level or observation level
- perform direct robust estimation based on residual weighting, or two-stage robust estimation based on analyzing the robust mean and covariance matrices
- input the model fit information of a customized baseline model of your choice; PROC CALIS then computes various fit indices (mainly the incremental fit indices) based on your customized model fit rather than the fit of the default uncorrelatedness model
- compute weighted covariances or correlations
- create a SAS data set that corresponds to any output table
- perform BY-group processing, which enables you to obtain separate analyses on grouped observations
- automatically create the following types of graphs:
  - distribution of residuals in the moment matrices
  - case-level residual diagnostics
  - path diagrams

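As a minimal sketch, a confirmatory factor model with one structural path might be specified in the PATH modeling language as follows. The data set Scores, its observed variables x1-x3 and y1-y2, and the latent factor names are hypothetical, and the identification constraint shown is only one of several possible choices:

```sas
/* Hypothetical data set Scores with observed variables x1-x3 and y1-y2 */
proc calis data=Scores;
   path
      x1 x2 x3 <--- Verbal,   /* latent factor Verbal measured by x1-x3 */
      y1 y2    <--- Quant,    /* latent factor Quant measured by y1-y2  */
      Quant    <--- Verbal;   /* structural path between the factors    */
   pvar
      Verbal = 1.0;           /* fix the exogenous factor variance for identification */
run;
```

The same model could equivalently be expressed in the LINEQS, RAM, or LISMOD language.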
For further details, see CALIS Procedure.
GEE Procedure
The GEE procedure fits generalized linear models for longitudinal data by using the generalized estimating equations (GEE)
estimation method of Liang and Zeger (1986). The GEE method fits a marginal model to longitudinal data and is commonly
used when the population-average effect is of interest.
The following are highlights of the GEE procedure's features:
- performs weighted GEE estimation when data are missing at random (MAR)
- supports the following response variable distributions:
  - binomial
  - gamma
  - inverse Gaussian
  - negative binomial
  - normal
  - Poisson
  - multinomial
- supports the following link functions:
  - complementary log-log
  - identity
  - log
  - logit
  - probit
  - reciprocal
  - power with exponent 2

- supports the following correlation structures:
  - first-order autoregressive
  - exchangeable
  - independent
  - m-dependent
  - unstructured
  - fixed (user-specified)
- performs alternating logistic regression analysis for ordinal and binary data
- supports the ESTIMATE, LSMEANS, and OUTPUT statements
- creates a SAS data set that corresponds to any output table
- automatically creates graphs by using ODS Graphics

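For example, a marginal Poisson model with an exchangeable working correlation structure might look like the following. The data set Visits and the variables id, visit, trt, and count are hypothetical:

```sas
/* Hypothetical longitudinal data: one observation per subject per visit */
proc gee data=Visits;
   class id visit trt;
   model count = trt visit / dist=poisson link=log;  /* marginal model */
   repeated subject=id / type=exch;                  /* exchangeable working correlation */
run;
```

Adding a MISSMODEL statement would request weighted GEE estimation under the MAR assumption.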
For further details, see GEE Procedure.
MCMC Procedure
The MCMC procedure is a general-purpose Markov chain Monte Carlo (MCMC) simulation procedure that
is designed to fit a wide range of Bayesian models.
PROC MCMC enables you to do the following:
- specify a likelihood function for the data, prior distributions for the parameters, and hyperprior distributions if you are fitting hierarchical models
- obtain samples from the corresponding posterior distributions, produce summary and diagnostic statistics, and save the posterior samples in an output data set that can be used for further analysis
- analyze data that have any likelihood, prior, or hyperprior, as long as these functions are programmable by using SAS DATA step functions
- enter parameters into a model linearly or in any nonlinear functional form
- fit dynamic linear models, state space models, autoregressive models, or other models that have a conditionally dependent structure on either the random-effects parameters or the response variable
- fit models that contain differential equations or models that require integration

- use an adaptive blocked random-walk Metropolis algorithm that uses a normal or t proposal distribution by default
- use a Hamiltonian Monte Carlo algorithm with a fixed step size and a predetermined number of steps
- use a No-U-Turn sampler with the Hamiltonian algorithm
- create a user-defined sampler as an alternative to the default algorithms
- create a data set that contains random samples from the posterior predictive distribution of the response variable
- perform BY-group processing, which enables you to obtain separate analyses on grouped observations
- take advantage of multiple processors
- create a SAS data set that corresponds to any output table
- automatically create graphs by using ODS Graphics

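As an illustration, a Bayesian simple linear regression with diffuse priors might be specified as follows. The data set Heights, the variables y and x, and the particular prior choices are hypothetical:

```sas
/* Hypothetical data set Heights with response y and predictor x */
proc mcmc data=Heights nmc=20000 seed=27513 outpost=PostOut;
   parms beta0 0 beta1 0;                           /* regression coefficients */
   parms sigma2 1;                                  /* residual variance */
   prior beta: ~ normal(0, var=1e6);                /* diffuse normal priors */
   prior sigma2 ~ igamma(shape=2.001, scale=1.001); /* inverse-gamma prior */
   mu = beta0 + beta1 * x;
   model y ~ normal(mu, var=sigma2);                /* likelihood */
run;
```

The OUTPOST= data set holds the posterior draws for further analysis.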
For further details, see MCMC Procedure.
MI Procedure
The MI procedure is a multiple imputation procedure that creates multiply imputed data sets for incomplete p-dimensional multivariate data.
It uses methods that incorporate appropriate variability across the m imputations. The imputation method of choice depends on the patterns
of missingness in the data and the type of the imputed variable. The following are highlights of the MI procedure's features:
- creates multiply imputed data sets for incomplete multivariate data
- applies parametric and nonparametric methods for data sets with monotone missing values:
  - continuous variables:
    - regression method
    - predictive mean matching method
    - propensity score method
  - classification variables:
    - logistic regression method
    - discriminant function method
- enables you to specify a multivariate imputation that uses fully conditional specification (FCS) methods
- applies a Markov chain Monte Carlo (MCMC) method for data sets with arbitrary missing patterns
- uses the EM algorithm to compute the MLE of (μ, Σ), the mean vector and covariance matrix of a multivariate normal distribution, from the input data set with missing values

- provides the following transformation methods for data that do not satisfy the assumption of multivariate normality:
  - Box-Cox
  - exponential
  - logarithmic
  - logit
  - power
- performs sensitivity analysis by generating multiple imputations for different scenarios under the assumption that the data are missing not at random
- performs BY-group processing, which enables you to obtain separate analyses on grouped observations
- creates a SAS data set that corresponds to any output table
- automatically creates graphs by using ODS Graphics

After the m complete data sets are analyzed by using standard SAS procedures, the MIANALYZE
procedure can be used to generate valid statistical inferences about the parameters of interest by
combining the results from the m analyses.
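A minimal invocation that creates m = 5 imputed data sets might look like the following. The data set Fitness and its variables are hypothetical; PROC MI chooses a default imputation method based on the missing-data pattern:

```sas
/* Impute missing values in three continuous variables, m = 5 imputations */
proc mi data=Fitness nimpute=5 seed=501213 out=FitMI;
   var Oxygen RunTime RunPulse;
run;
/* FitMI stacks the 5 completed data sets, indexed by _Imputation_ */
```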
For further details, see MI Procedure.
MIANALYZE Procedure
The MIANALYZE procedure combines the results of the analyses of imputations and generates
valid statistical inferences.
A companion procedure, PROC MI, creates multiply imputed data sets for incomplete multivariate
data. It uses methods that incorporate appropriate variability across the m imputations.
The analyses of imputations are obtained by using standard SAS procedures (such as PROC REG)
for complete data. No matter which complete-data analysis is used, the process of combining results
from different imputed data sets is essentially the same; the MIANALYZE procedure combines these
results to derive valid statistical inferences that properly reflect the uncertainty due to missing values.
The MIANALYZE procedure reads the parameter estimates and associated standard errors or covariance
matrices that are computed by a standard statistical procedure for each imputed data set. It then
derives valid univariate inference for these parameters. With an additional assumption about the
population between-imputation and within-imputation covariance matrices, multivariate inference
based on Wald tests can also be derived.
For some parameters of interest, you can use TEST statements to test linear hypotheses about the
parameters. For other parameters, such as correlation coefficients between two variables or ratios of
variable means, it is not straightforward to compute estimates and associated covariance matrices
with standard statistical SAS procedures, and special techniques are required.
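Continuing the sketch from the MI section, regression estimates from each of the m imputed data sets (stacked in a hypothetical data set FitMI indexed by _Imputation_) can be combined as follows:

```sas
/* Fit the same regression model to each imputed data set */
proc reg data=FitMI outest=RegEst covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;

/* Combine the m sets of estimates and covariance matrices */
proc mianalyze data=RegEst;
   modeleffects Intercept RunTime RunPulse;
run;
```

PROC MIANALYZE reports the combined estimates, their standard errors, and related multiple imputation diagnostics.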
For further details, see MIANALYZE Procedure.
SURVEYIMPUTE Procedure
The SURVEYIMPUTE procedure imputes missing values of an item in a data set by replacing them with observed
values from the same item. The principles by which the imputation is performed are particularly useful for survey data.
The following are highlights of the SURVEYIMPUTE procedure's features:
- performs fully efficient fractional hot-deck imputation
- performs traditional hot-deck imputation with the following donor selection methods:
  - approximate Bayesian bootstrap
  - simple random samples without replacement
  - simple random samples with replacement
  - probability proportional to respondent weights with replacement
- computes imputation-adjusted replicate weights:
  - balanced repeated replication (BRR) weights
  - jackknife weights
- provides a CELLS statement that names the variables that identify the imputation cells
- imputes variables jointly or independently for the fully efficient fractional imputation method
- creates a SAS data set that contains the imputed data

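For example, traditional hot-deck imputation of a single item within imputation cells might be requested as follows. The survey data set HouseSurvey and its variables are hypothetical:

```sas
/* Hot-deck imputation: donors drawn as simple random samples with replacement */
proc surveyimpute data=HouseSurvey method=hotdeck(selection=srswr)
                  seed=1234 out=HouseImp;
   var Income;              /* item to impute */
   cells Region;            /* donors come from the same imputation cell */
   weight SamplingWeight;   /* survey sampling weights */
run;
```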
For further details, see SURVEYIMPUTE Procedure.