PROC MCMC: Simple Linear Regression :: SAS/STAT(R) 9.2 User's Guide, Second Edition

The MCMC Procedure

Simple Linear Regression

This section illustrates some basic features of PROC MCMC by using a linear regression model. The model is as follows:

$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$

for the observations $i = 1, 2, \ldots, n$.

The following statements create a SAS data set with measurements of Height and Weight for a group of children:

   title 'Simple Linear Regression';
    
   data Class;
      input Name $ Height Weight @@;
      datalines;
   Alfred  69.0 112.5   Alice  56.5  84.0   Barbara 65.3  98.0
   Carol   62.8 102.5   Henry  63.5 102.5   James   57.3  83.0
   Jane    59.8  84.5   Janet  62.5 112.5   Jeffrey 62.5  84.0
   John    59.0  99.5   Joyce  51.3  50.5   Judy    64.3  90.0
   Louise  56.3  77.0   Mary   66.5 112.0   Philip  72.0 150.0
   Robert  64.8 128.0   Ronald 67.0 133.0   Thomas  57.5  85.0
   William 66.5 112.0
   ;

The equation of interest is as follows:

$\text{Weight}_i = \beta_0 + \beta_1 \text{Height}_i + \epsilon_i$

The observation errors, $\epsilon_i$, are assumed to be independent and identically distributed with a normal distribution with mean zero and variance $\sigma^2$:

$\epsilon_i \sim \text{normal}(0, \sigma^2)$

The likelihood function for each Weight observation, which is specified in the MODEL statement, is as follows:

$p(\text{Weight}_i \mid \beta_0, \beta_1, \sigma^2, \text{Height}_i) = \phi(\beta_0 + \beta_1 \text{Height}_i, \sigma^2)$

where $p(\cdot \mid \cdot)$ denotes a conditional probability density and $\phi$ is the normal density. There are three parameters in the likelihood: $\beta_0$, $\beta_1$, and $\sigma^2$. You use the PARMS statement to indicate that these are the parameters in the model.

Suppose that you want to use the following prior distributions on the three parameters:

$\pi(\beta_0) = \phi(0, \text{var} = 10^6)$
$\pi(\beta_1) = \phi(0, \text{var} = 10^6)$
$\pi(\sigma^2) = f_{i\Gamma}(\text{shape} = 3/10, \text{scale} = 10/3)$

where $\pi(\cdot)$ indicates a prior distribution and $f_{i\Gamma}$ is the density function for the inverse-gamma distribution. The normal priors on $\beta_0$ and $\beta_1$ have large variances, expressing your lack of knowledge about the regression coefficients. The priors correspond to equal-tail 95% credible intervals of approximately $(-2000, 2000)$ for $\beta_0$ and $\beta_1$. Priors of this type are often called vague or diffuse priors. See the section Prior Distributions for more information. Typically, diffuse prior distributions have little influence on the posterior distribution and are appropriate when stronger prior information about the parameters is not available.

A frequently used diffuse prior for the variance parameter $\sigma^2$ is the inverse-gamma distribution. With a shape parameter of $3/10$ and a scale parameter of $10/3$, this prior corresponds to an equal-tail 95% credible interval of approximately $(1.7, 10^6)$, with the mode at $2.5641$ for $\sigma^2$. Alternatively, you can use any other prior whose density has positive support for this variance component. For example, you can use the gamma prior.
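The prior summaries described above can be checked numerically. The following DATA step is a minimal sketch, not part of the original example. It assumes the inverse-gamma parameterization in which the reciprocal of a gamma(shape, scale 1) variable, multiplied by the scale, is inverse-gamma(shape, scale); the variable names are illustrative only.

```sas
/* Sketch: check the prior credible intervals and the inverse-gamma mode. */
data _null_;
   /* normal prior: mean 0, variance 1e6, so the standard deviation is 1000 */
   lo_beta = quantile('NORMAL', 0.025, 0, sqrt(1e6));
   hi_beta = quantile('NORMAL', 0.975, 0, sqrt(1e6));

   /* inverse-gamma prior: if G ~ gamma(shape, scale 1), then scale/G is   */
   /* inverse-gamma(shape, scale), so the tail probabilities swap          */
   shape = 3/10;
   scale = 10/3;
   lo_sigma2   = scale / quantile('GAMMA', 0.975, shape);
   hi_sigma2   = scale / quantile('GAMMA', 0.025, shape);
   mode_sigma2 = scale / (shape + 1);   /* mode of an inverse-gamma density */

   /* write the results to the SAS log */
   put lo_beta= hi_beta=;
   put lo_sigma2= hi_sigma2= mode_sigma2=;
run;
```

The normal limits come out near plus or minus 1960, which the text rounds to 2000, and the mode is (10/3)/(1.3), or about 2.5641.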

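According to Bayes’ theorem, the likelihood function and prior distributions determine the posterior (joint) distribution of $\beta_0$, $\beta_1$, and $\sigma^2$ as follows:

$\pi(\beta_0, \beta_1, \sigma^2 \mid \text{Weight}, \text{Height}) \propto \pi(\beta_0)\, \pi(\beta_1)\, \pi(\sigma^2) \prod_i p(\text{Weight}_i \mid \beta_0, \beta_1, \sigma^2, \text{Height}_i)$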

You do not need to know the form of the posterior distribution when you use PROC MCMC. PROC MCMC automatically obtains samples from the desired posterior distribution, which is determined by the prior and likelihood you supply.

The following statements fit this linear regression model with diffuse prior information:

   ods graphics on;
   proc mcmc data=class outpost=classout nmc=50000 thin=5 seed=246810;
      parms beta0 0 beta1 0;
      parms sigma2 1;
      prior beta0 beta1 ~ normal(mean = 0, var = 1e6);
      prior sigma2 ~ igamma(shape = 3/10, scale = 10/3);
      mu = beta0 + beta1*height;
      model weight ~ n(mu, var = sigma2);
   run;
   ods graphics off;

The ODS GRAPHICS ON statement invokes the ODS Graphics environment and displays the diagnostic plots, such as the trace and autocorrelation function plots of the posterior samples. For more information about ODS, see Chapter 21, Statistical Graphics Using ODS.

The PROC MCMC statement invokes the procedure and specifies the input data set class. The output data set classout contains the posterior samples for all of the model parameters. The NMC= option specifies the number of posterior simulation iterations. The THIN= option controls the thinning of the Markov chain and specifies that one of every 5 samples is kept. Thinning is often used to reduce the correlations among posterior sample draws. In this example, 10,000 simulated values are saved in the classout data set. The SEED= option specifies a seed for the random number generator, which guarantees the reproducibility of the random stream. For more information about Markov chain sample size, burn-in, and thinning, see the section Burn-in, Thinning, and Markov Chain Samples.
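With NMC=50000 and THIN=5, the number of saved draws is 50000/5 = 10,000. One way to confirm this and to inspect the saved draws is to summarize the OUTPOST= data set directly. The following is a minimal sketch that assumes the PROC MCMC step above has already run and created classout; it is not part of the original example.

```sas
/* Sketch: summarize the posterior draws saved in the OUTPOST= data set. */
proc means data=classout n mean std p25 median p75;
   var beta0 beta1 sigma2;   /* one column of draws per model parameter */
run;

/* Look at the first few saved iterations. */
proc print data=classout(obs=5);
run;
```

The N column should show 10,000 for each parameter, and the remaining statistics should correspond to the posterior summaries that PROC MCMC itself reports.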

The PARMS statements identify the three parameters in the model: beta0, beta1, and sigma2. Each statement also forms a block of parameters, where the parameters are updated simultaneously in each iteration. In this example, beta0 and beta1 are sampled jointly, conditional on sigma2, and sigma2 is sampled conditional on fixed values of beta0 and beta1. In simple regression models such as this, you expect the parameters beta0 and beta1 to have high posterior correlations, and placing them both in the same block improves the mixing of the chain (that is, the efficiency with which the Markov chain explores the posterior parameter space). For more information, see the section Blocking of Parameters. The PARMS statements also assign initial values to the parameters (see the section Initial Values of the Markov Chains). The regression parameters are given 0 as their initial values, and the variance parameter sigma2 starts at value 1. If you do not provide initial values, the procedure chooses starting values for every parameter.
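Because each PARMS statement defines one block, you change the blocking by regrouping parameters across PARMS statements. The following variation is a sketch for comparison only: it places all three parameters in a single block. The output data set name classout1 is hypothetical, and the Class data set created earlier is assumed to exist.

```sas
/* Sketch: same model and priors, but one block for all three parameters. */
proc mcmc data=class outpost=classout1 nmc=50000 thin=5 seed=246810;
   parms beta0 0 beta1 0 sigma2 1;   /* one PARMS statement = one block */
   prior beta0 beta1 ~ normal(mean = 0, var = 1e6);
   prior sigma2 ~ igamma(shape = 3/10, scale = 10/3);
   mu = beta0 + beta1*height;
   model weight ~ n(mu, var = sigma2);
run;
```

Comparing the acceptance rates and effective sample sizes of this run with those of the two-block run is one way to judge which blocking mixes better for your data.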

The PRIOR statements specify prior distributions for the parameters. The parameters beta0 and beta1 share the same prior: a normal prior with mean $0$ and variance $10^6$. The parameter sigma2 has an inverse-gamma distribution with a shape parameter of 3/10 and a scale parameter of 10/3. For a list of standard distributions that PROC MCMC supports, see the section Standard Distributions.

The mu assignment statement calculates the expected value of Weight as a linear function of Height. The MODEL statement uses the shorthand notation, n, for the normal distribution to indicate that the response variable, Weight, is normally distributed with parameters mu and sigma2. The functional argument MEAN= in the normal distribution is optional, but you have to indicate whether sigma2 is a variance (VAR=), a standard deviation (SD=), or a precision (PRECISION=) parameter. See Table 52.2 in the section MODEL Statement for distribution specifications.

The distribution parameters can contain expressions. For example, you can write the MODEL statement as follows:

   model weight ~ n(beta0 + beta1*height, var = sigma2);
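The same flexibility applies to the second argument of the normal distribution. As a sketch of the SD= parameterization mentioned above, the following step derives a standard deviation from sigma2 in a programming statement and passes it to the MODEL statement. The symbol sigma and the output data set name classout_sd are introduced here for illustration only.

```sas
/* Sketch: equivalent model that uses SD= instead of VAR=. */
proc mcmc data=class outpost=classout_sd nmc=50000 thin=5 seed=246810;
   parms beta0 0 beta1 0;
   parms sigma2 1;
   prior beta0 beta1 ~ normal(mean = 0, var = 1e6);
   prior sigma2 ~ igamma(shape = 3/10, scale = 10/3);
   mu = beta0 + beta1*height;
   sigma = sqrt(sigma2);               /* standard deviation from the variance */
   model weight ~ normal(mu, sd = sigma);
run;
```

The model parameter is still sigma2, so the prior and the posterior summaries remain on the variance scale; only the way the likelihood is written changes.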

Before you do any posterior inference, it is essential that you examine the convergence of the Markov chain (see the section Assessing Markov Chain Convergence). You cannot make valid inferences if the Markov chain has not converged. A very effective convergence diagnostic tool is the trace plot. Although PROC MCMC produces graphs at the end of the procedure output (see Figure 52.6), you should visually examine the convergence graph first.

The first table that PROC MCMC produces is the "Number of Observations" table, as shown in Figure 52.1. This table lists the number of observations read from the DATA= data set and the number of non-missing observations used in the analysis.

Figure 52.1 Observation Information

Simple Linear Regression

The MCMC Procedure

Number of Observations Read	19
Number of Observations Used	19

The "Parameters" table, shown in Figure 52.2, lists the names of the parameters, the blocking information (see the section Blocking of Parameters), the sampling method used, the starting values (see the section Initial Values of the Markov Chains), and the prior distributions. You should check this table to ensure that you have specified the parameters correctly, especially for complicated models.

Figure 52.2 Parameter Information

Parameters
Block	Parameter	Sampling Method	Initial Value	Prior Distribution
1	beta0	N-Metropolis	0	normal(mean = 0, var = 1e6)
1	beta1	N-Metropolis	0	normal(mean = 0, var = 1e6)
2	sigma2	N-Metropolis	1.0000	igamma(shape = 3/10, scale = 10/3)

The "Tuning History" table, shown in Figure 52.3, shows how the tuning stage progresses for the multivariate random walk Metropolis algorithm used by PROC MCMC to generate samples from the posterior distribution. An important aspect of the algorithm is the calibration of the proposal distribution. The tuning of the Markov chain is broken into a number of phases. In each phase, PROC MCMC generates trial samples and automatically modifies the proposal distribution as a result of the acceptance rate (see the section Tuning the Proposal Distribution). In this example, PROC MCMC found an acceptable proposal distribution after 7 phases, and this distribution is used in both the burn-in and sampling stages of the simulation.

The "Burn-In History" table shows the burn-in phase, and the "Sampling History" table shows the main phase sampling.

Figure 52.3 Tuning, Burn-In and Sampling History

Tuning History
Phase	Block	Scale	Acceptance Rate
1	1	2.3800	0.0420
	2	2.3800	0.8860
2	1	1.0938	0.2180
	2	15.5148	0.3720
3	1	0.8299	0.4860
	2	15.5148	0.1260
4	1	1.1132	0.4840
	2	9.4767	0.0880
5	1	1.4866	0.5420
	2	5.1914	0.2000
6	1	2.2784	0.4600
	2	3.7859	0.3900
7	1	2.8820	0.3360
	2	3.7859	0.4020

Burn-In History
Block	Scale	Acceptance Rate
1	2.8820	0.3400
2	3.7859	0.4150

Sampling History
Block	Scale	Acceptance Rate
1	2.8820	0.3284
2	3.7859	0.4008

For each posterior distribution, PROC MCMC also reports summary statistics (posterior means, standard deviations, and quantiles) and interval statistics (95% equal-tail and highest posterior density credible intervals), as shown in Figure 52.4. For more information about posterior statistics, see the section Summary Statistics.

Figure 52.4 MCMC Summary and Interval Statistics

Posterior Summaries
Parameter	N	Mean	Standard Deviation	25th Percentile	50th Percentile	75th Percentile
beta0	10000	-142.6	33.9390	-164.5	-142.4	-120.5
beta1	10000	3.8917	0.5427	3.5406	3.8906	4.2402
sigma2	10000	136.8	51.7417	101.8	126.0	159.9

Posterior Intervals
Parameter	Alpha	Equal-Tail Interval Lower	Equal-Tail Interval Upper	HPD Interval Lower	HPD Interval Upper
beta0	0.050	-209.3	-76.1692	-209.7	-77.1624
beta1	0.050	2.8317	4.9610	2.8280	4.9468
sigma2	0.050	69.2208	265.5	58.2627	233.8

By default, PROC MCMC also computes a number of convergence diagnostics to help you determine whether the chain has converged. These are the Monte Carlo standard errors, the autocorrelations at selected lags, the Geweke diagnostics, and the effective sample sizes. These statistics are shown in Figure 52.5. For details and interpretations of these diagnostics, see the section Assessing Markov Chain Convergence.

The "Monte Carlo Standard Errors" table indicates that the standard errors of the mean estimates for each of the parameters are small relative to the posterior standard deviations. The values in the "MCSE/SD" column (ratios of the standard errors to the standard deviations) are small, around 0.01. This means that only a small fraction of the posterior variability is due to the simulation. The "Posterior Autocorrelations" table shows that the autocorrelations among posterior samples decrease quickly and are almost nonexistent by lag 5. The "Geweke Diagnostics" table indicates that no parameter failed the test, and the "Effective Sample Sizes" table reports the effective sample sizes of the Markov chain.

Figure 52.5 MCMC Convergence Diagnostics

Monte Carlo Standard Errors
Parameter	MCSE	Standard Deviation	MCSE/SD
beta0	0.4576	33.9390	0.0135
beta1	0.00731	0.5427	0.0135
sigma2	0.7151	51.7417	0.0138

Posterior Autocorrelations
Parameter	Lag 1	Lag 5	Lag 10	Lag 50
beta0	0.2986	-0.0008	0.0162	0.0193
beta1	0.2971	0.0000	0.0135	0.0161
sigma2	0.2966	0.0062	0.0008	-0.0068

Geweke Diagnostics
Parameter	z	Pr > |z|
beta0	0.1105	0.9120
beta1	-0.1701	0.8649
sigma2	-0.2175	0.8278

Effective Sample Sizes
Parameter	ESS	Correlation Time	Efficiency
beta0	5501.1	1.8178	0.5501
beta1	5514.8	1.8133	0.5515
sigma2	5235.4	1.9101	0.5235
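The derived columns in Figure 52.5 can be reproduced from the other columns. As a worked check for beta0, assuming the usual definition of effective sample size as the number of saved draws (10,000 here) divided by the correlation time:

$\text{MCSE/SD} = 0.4576 / 33.9390 \approx 0.0135$

$\text{ESS} = 10000 / 1.8178 \approx 5501$

$\text{Efficiency} = \text{ESS} / 10000 \approx 0.55$

In other words, the 10,000 correlated draws carry roughly as much information about the posterior mean of beta0 as about 5,500 independent draws would.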

PROC MCMC produces a number of graphs, shown in Figure 52.6, which also aid convergence diagnostic checks. With the trace plots, there are two important aspects to examine. First, you want to check whether the mean of the Markov chain has stabilized and appears constant over the graph. Second, you want to check whether the chain has good mixing and is "dense," in the sense that it quickly traverses the support of the distribution to explore both the tails and the mode areas efficiently. The plots show that the chains appear to have reached their stationary distributions.

Next, you should examine the autocorrelation plots, which indicate the degree of autocorrelation for each of the posterior samples. High correlations usually imply slow mixing. Finally, the kernel density plots estimate the posterior marginal distributions for each parameter.

Figure 52.6 Diagnostic Plots for $\beta_0$, $\beta_1$, and $\sigma^2$

In regression models such as this, you expect the posterior estimates to be very similar to the maximum likelihood estimates when you use noninformative priors on the parameters. The REG procedure produces the following fitted model (code not shown):

$\text{Weight} = -143.0 + 3.9 \times \text{Height}$

These estimates are very similar to the posterior means shown in Figure 52.4. With PROC MCMC, you can carry out informative analysis that uses specifications to indicate prior knowledge on the parameters. Informative analysis is likely to produce different posterior estimates, which are the result of information from both the likelihood and the prior distributions. Incorporating additional information in the analysis is one major difference between the classical and Bayesian approaches to statistical inference.
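The source omits the PROC REG code and does not give an informative-prior example. The following is a sketch of both, assuming the Class data set created earlier. The prior mean and variance for beta1 and the output data set name classinf are hypothetical values chosen only to show the syntax; they do not represent real prior knowledge about these data.

```sas
/* Sketch: least squares fit for comparison with the posterior means. */
proc reg data=class;
   model weight = height;
run;
quit;

/* Sketch: informative analysis with a hypothetical, tighter prior on beta1. */
proc mcmc data=class outpost=classinf nmc=50000 thin=5 seed=246810;
   parms beta0 0 beta1 0;
   parms sigma2 1;
   prior beta0 ~ normal(mean = 0, var = 1e6);
   prior beta1 ~ normal(mean = 3, var = 0.25);   /* hypothetical prior values */
   prior sigma2 ~ igamma(shape = 3/10, scale = 10/3);
   mu = beta0 + beta1*height;
   model weight ~ n(mu, var = sigma2);
run;
```

With a prior this concentrated, you would expect the posterior for beta1 to be pulled toward the prior mean relative to the diffuse analysis; how far depends on how much information the data carry.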
