Missing data are nearly always an issue in longitudinal studies. This example uses the MCMC procedure to fit Bayesian logistic regression models to analyze air pollution data with missing values. A typical missing data strategy is to eliminate observations with missing data, but the resulting inference might not be valid if the observations with missing covariates are substantially different from the observations with complete data.
Researchers studied the effects of air pollution on respiratory disease in children. The response variable (Y) represented whether a child exhibited wheezing symptoms; it was recorded as 1 for symptoms exhibited and 0 for no symptoms exhibited. City of residency (X1) and maternal smoking status (X2) were the explanatory variables. The variable X1 was coded as 1 if the child lived in the more polluted city, Steel City, and 0 if the child lived in Green Hills. The X2 variable was the number of cigarettes the mother reported that she smoked per day.
data air;
input y x1 x2;
datalines;
0 0 0
0 0 0
0 1 0
0 0 0
0 0 11
0 1 7
... more lines ...
1 0 0
1 1 10
0 . 4
1 1 16
0 . 13
;
The MEANS and FREQ procedures provide summary statistics for the response and proposed covariates.
Figure 1
displays frequencies for the city of residency variable, and
Figure 2
displays summary statistics for the number of cigarettes smoked per day by the child’s mother.
Figure 3
displays frequencies for the response Y. The covariates X1 and X2 are missing for
and
of the 390 subjects in the study, respectively.
The outputs from the FREQ and MEANS procedures enable you to see a simple summary of the subjects and also to quantify the missingness in the data.
Little and Rubin ( 2002 ) discuss many different classifications of missing data. This example illustrates nonignorable missing at random (NMAR) data. Data are said to be NMAR when the missingness depends on the value that would have been observed. NMAR is the most general type of missing data and occurs frequently in longitudinal studies. You should make inference only after the model that is assumed to be correct is specified for the missing data mechanism. The results are then sensitive to those assumptions.
Suppose you want to fit a Bayesian logistic regression model for whether the subject develops wheezing symptoms with density as follows:
|
|
|
|||
|
|
|
( 1 ) |
for the
subjects.
The complete data likelihood function for each of the subjects is
|
|
|
( 2 ) |
where
denotes a conditional probability density. The binary density is evaluated at the specified value of
and corresponding mean parameter
as defined in Equation
1
. The three parameters in the complete data likelihood are
, and
, which correspond to an intercept, adjustment for living in Steel City, and slope for maternal smoking, respectively.
Next, the covariates X1 and X2 are written in terms of whether they were missing (
and
) or observed (
and
). Your goal is to make inferences from the observed data likelihood
|
( 3 ) |
by multiplying the conditional distribution
by Equation
2
and integrating over the missing observations. To make inferences from Equation
, you need to specify a distribution for the missing covariates,
, where
represents the hyperparameters in the missing data distributions. In missing data problems, you can model the missing data mechanism to fit the true missing data mechanism any way you choose. Suppose you specify a joint distribution of X1 and X2 in terms of the product of a conditional and marginal distribution; that is
|
|
|
For this example, say
is a logistic regression and
is a Poisson distribution. You treat the missing covariates as parameters, and you place prior distributions on them and their hyperparameters.
Suppose you place the following prior distributions on the three regression parameters, the missing covariates, and hyperparameters:
|
|
|
|||
|
|
|
( 4 ) | ||
|
|
|
|||
|
|
|
( 5 ) | ||
|
|
|
( 6 ) | ||
|
|
|
where
indicates a prior distribution.
The following DATA step creates two variables that identify and index the number of missing covariates. NMISS_X1 increments each time X1 has a missing value, and NMISS_X2 increments each time X2 has a missing value. At the end of the data set, macro variables &NMISS_X1 and &NMISS_X2 are also created to store the total number of missing X1 and X2 covariates, respectively. The %PUT statements enable you to verify that the macro variables have been created successfully.
data pollution;
set air end=eof;
nmiss_X1 + nmiss(X1);
nmiss_X2 + nmiss(X2);
obs = _n_;
if eof then do;
call symputx('nmiss_X1',nmiss_X1);
call symputx('nmiss_X2',nmiss_X2);
end;
run;
%put &nmiss_X1; %put &nmiss_X2;
Using Bayes’ theorem, the observed data likelihood function and prior distributions determine the posterior distribution of
, and
as follows:
|
|
|
|||
|
|
|
|||
|
|
|
where
and
are the macro variables &NMISS_X1 and &NMISS_X2, respectively. PROC MCMC obtains samples from the desired posterior distribution. You do not need to specify the exact form of the posterior distribution.
The odds ratio for comparing Steel City to Green Hills can be written as follows:
|
The odds ratio is useful for interpreting how the odds of developing a wheeze change for a child living in the more polluted city. Similarly, the odds ratio for the maternal smoking effect is written as follows:
|
The following SAS statements use the complete data likelihood function, missing data model assumptions, and prior distributions to fit the Bayesian logistic regression model. The SEED= option specifies a seed for the random number generator (the seed guarantees the reproducibility of the random stream). The NMC= option specifies the number of posterior simulation iterations. The MISSING=AC specifies that observations with missing values should be included in the analysis. The MONITOR= option outputs analysis on selected symbols of interest in the program.
ods graphics on;
proc mcmc data=pollution seed=1181 nmc=10000 missing=ac
monitor=(beta0 beta1 beta2 orx1 orx2);
array x1m[&nmiss_x1];
array x2m[&nmiss_x2];
parms beta0 -1 beta1 0.1 beta2 0.01;
parms alpha10 0 alpha11 0 alpha20 0;
parms x1m1-x1m&nmiss_x1 0;
parms x2m1-x2m&nmiss_x2 0;
prior beta: alpha1: ~ normal(0,var=10);
prior alpha20 ~ normal(0,var=2);
prior x2m: ~ poisson(exp(alpha20));
if x2 = . then x2 = x2m[nmiss_x2];
if obs = 1 then lpm = 0;
if x1 = . then do;
x1 = x1m[nmiss_x1];
p_c = logistic(alpha10 + alpha11*x2);
lpm = lpm + lpdfbern(x1,p_c);
end;
prior x1m: ~ dgeneral(lpm);
p = logistic(beta0 + beta1*x1 + beta2*x2);
model y ~ binary(p);
beginprior;
orx1 = exp(beta1);
orx2 = exp(beta2);
endprior;
run;
ods graphics off;
Each of the two ARRAY statements associates a name with a list of variables and constants. The ARRAY statements enable easy referencing of the variables and corresponding values in the procedure. They also create the X1M and X2M parameters for the observations’ missing covariates in X1 and X2, respectively. The macro variables &NMISS_X1 and &NMISS_X2 specify the number of missing values for X1 and X2, respectively.
The PARMS statements specify the parameters in the model, create blocks with the respective parameters, and assign initial values to each of them. The PRIOR statements specify priors for all the parameters. The notations beta: , alpha: , X1M: , and X2M: in the PRIOR statements are shorthand for all variables that start with ' beta ,' ' alpha ,' ' X1M ,' and ' X2M ,' respectively. The shorthand notation is not necessary, but it keeps your code succinct.
The first IF statement checks whether the covariate X2 is missing. If missing, X2 is replaced with the parameter X2M, which is sampled from the Poisson distribution given in Equation
6
. The second IF statement resets the value of the prior log likelihood (LPM) to zero at the top of the data set (that is, when the data set variable OBS is 1). The third IF statement checks whether the value of X1 is missing for each observation. If it is missing, then the three statements inside the DO loop are executed: The missing values of X1 are replaced with the corresponding sampled values of X1M. Then the P_C assignment statement calculates
according to Equation
5
. As the MCMC procedure cycles through each observation and finds a missing value, the LPM statement cumulatively adds the logarithm of the probability mass function for the binary model given in Equation
4
. Note that X1M is conditional on X2, but not on whether X2 is the sampled or observed value.
The PRIOR statement for the X1M parameters uses the DGENERAL distribution. The letter "D" stands for discrete because the covariate X1 is binary. The GENERAL function indicates that you are using a SAS statement to construct a nonstandard density or distribution. The argument is an expression that takes the value of the logarithm function of the prior or likelihood distribution. The variable LPM is the expression for the sum of the logarithm of probability mass functions given in Equation 4 . The DGENERAL function assigns this prior distribution to the X1M missing value parameters.
The P assignment statement calculates
in the logistic model, as given in Equation
1
. The MODEL statement specifies the complete data likelihood function for Y, as given in Equation
2
.
The BEGINPRIOR and ENDPRIOR statements reduce unnecessary observation-level computations. The statements inside the BEGINPRIOR and ENDPRIOR statements create a block of statements that are run only once per iteration rather than for each observation at each iteration. This enables a quick update of the symbols enclosed in the statements. The statements within the BEGINPRIOR and ENDPRIOR block calculate the odds ratios for the two covariates in the model.
Figure 4 displays diagnostic plots to assess whether the Markov chains have converged.
, and
The trace plots show that the mean of the Markov chain is constant over the graph and is stabilized. The chain was able to traverse the support of the target distribution, and the mixing is good. The trace plots also show that the Markov chain appears to have reached stationary distributions.
The autocorrelation plots indicate low autocorrelation and efficient sampling. The kernel density plots show smooth, unimodal posterior marginal distributions for each parameter.
Figure 5 displays a number of convergence diagnostics, including Monte Carlo standard errors, autocorrelations at selected lags, Geweke diagnostics, and the effective sample sizes.
| Monte Carlo Standard Errors | |||
|---|---|---|---|
| Parameter | MCSE |
Standard
Deviation |
MCSE/SD |
| beta0 | 0.00634 | 0.1862 | 0.0341 |
| beta1 | 0.00736 | 0.2275 | 0.0324 |
| beta2 | 0.000826 | 0.0226 | 0.0366 |
| orx1 | 0.0126 | 0.3914 | 0.0321 |
| orx2 | 0.000834 | 0.0228 | 0.0366 |
| Posterior Autocorrelations | ||||
|---|---|---|---|---|
| Parameter | Lag 1 | Lag 5 | Lag 10 | Lag 50 |
| beta0 | 0.8390 | 0.4279 | 0.1762 | -0.0306 |
| beta1 | 0.8353 | 0.4088 | 0.1474 | -0.0301 |
| beta2 | 0.8492 | 0.4517 | 0.2147 | -0.0390 |
| orx1 | 0.8313 | 0.3983 | 0.1414 | -0.0234 |
| orx2 | 0.8492 | 0.4520 | 0.2155 | -0.0394 |
Figure 6 displays the "Number of Observations" table. It lists the number of observations read from the DATA= data set and the number of observations used in the analysis. If the MISSING=AC statement were omitted from the PROC MCMC statement, the number of observations used in the analysis would be less than the number of observations read because only complete cases would be used in the analysis.
Figure 7 displays summary and interval statistics for each parameter’s posterior distribution.
| Posterior Summaries | ||||||
|---|---|---|---|---|---|---|
| Parameter | N | Mean |
Standard
Deviation |
Percentiles | ||
| 25% | 50% | 75% | ||||
| beta0 | 10000 | -1.3257 | 0.1862 | -1.4543 | -1.3279 | -1.2006 |
| beta1 | 10000 | 0.4975 | 0.2275 | 0.3403 | 0.4999 | 0.6457 |
| beta2 | 10000 | 0.00859 | 0.0226 | -0.00619 | 0.00917 | 0.0233 |
| orx1 | 10000 | 1.6879 | 0.3914 | 1.4053 | 1.6486 | 1.9074 |
| orx2 | 10000 | 1.0089 | 0.0228 | 0.9938 | 1.0092 | 1.0236 |
The odds ratio for X1 is the multiplicative change in the odds of a child wheezing in Steel City compared to the odds of the child wheezing in Green Hills. The estimated odds ratio (ORX1) value is 1.6879 with a corresponding 95% credible interval of
. City of residency is a significant factor in a child’s wheezing status. The estimated odds ratio for X2 is the multiplicative change in the odds of developing a wheeze for each additional reported cigarette smoked per day. The odds ratio of ORX2 indicates that the odds of a child developing a wheeze is 1.0089 times higher for each reported cigarette a mother smokes. The corresponding 95% credible interval is
. Since this interval contains the value 1, maternal smoking is not considered to be a influential effect.
The MCMC procedure is best suited for models with relatively few parameters although there are no formal limitations on the number of parameters you can specify. Data sets with many missing observations will be computationally expensive with the MCMC procedure and might have a long run time.