Introduction to Bayesian Analysis Procedures: Markov Chain Monte Carlo Method :: SAS/STAT(R) 9.2 User's Guide, Second Edition

Introduction to Bayesian Analysis Procedures

Markov Chain Monte Carlo Method

The Markov chain Monte Carlo (MCMC) method is a general simulation method for sampling from posterior distributions and computing posterior quantities of interest. MCMC methods sample successively from a target distribution. Each sample depends on the previous one, hence the notion of the Markov chain. A Markov chain is a sequence of random variables, $\theta^1$, $\theta^2$, $\ldots$, for which the random variable $\theta^t$ depends on all previous $\theta$s only through its immediate predecessor $\theta^{t-1}$. You can think of a Markov chain applied to sampling as a mechanism that traverses randomly through a target distribution without having any memory of where it has been. Where it moves next is entirely dependent on where it is now.

Monte Carlo, as in Monte Carlo integration, is mainly used to approximate an expectation by using the Markov chain samples. In the simplest version,

$\int_S g(\theta)\, p(\theta)\, d\theta \cong \frac{1}{n} \sum_{t=1}^{n} g(\theta^t)$

where $g(\cdot)$ is a function of interest and $\theta^t$ are samples from $p(\theta)$ on its support $S$. This approximates the expected value of $g(\theta)$.

The earliest reference to MCMC simulation occurs in the physics literature. Metropolis and Ulam (1949) and Metropolis et al. (1953) describe what is known as the Metropolis algorithm (see the section Metropolis and Metropolis-Hastings Algorithms). The algorithm can be used to generate sequences of samples from the joint distribution of multiple variables, and it is the foundation of MCMC. Hastings (1970) generalized their work, resulting in the Metropolis-Hastings algorithm. Geman and Geman (1984) analyzed image data by using what is now called Gibbs sampling (see the section Gibbs Sampler). These MCMC methods first appeared in the mainstream statistical literature in Tanner and Wong (1987).
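The Monte Carlo approximation of an expectation described above can be sketched in a few lines of Python. This is only an illustration: the function name `monte_carlo_expectation` is hypothetical, and independent standard normal draws stand in for Markov chain samples so that the exact answer is known.

```python
import random


def monte_carlo_expectation(g, samples):
    """Approximate E[g(theta)] by averaging g over the supplied samples."""
    samples = list(samples)
    return sum(g(theta) for theta in samples) / len(samples)


# Illustrative assumption: independent standard normal draws stand in for
# the Markov chain samples, so the exact answer E[theta^2] = 1 is known.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
second_moment = monte_carlo_expectation(lambda theta: theta * theta, draws)
print(f"estimate of E[theta^2]: {second_moment:.3f}")
```

With samples from a real Markov chain the same average applies, but the draws are correlated, so more of them are typically needed for the same precision.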

The Markov chain method has been quite successful in modern Bayesian computing. Only in the simplest Bayesian models can you recognize the analytical forms of the posterior distributions and summarize inferences directly. In moderately complex models, posterior densities are too difficult to work with directly. With the MCMC method, it is possible to generate samples from an arbitrary posterior density $\pi(\theta \mid \mathbf{y})$ and to use these samples to approximate expectations of quantities of interest. Several other aspects of the Markov chain method also contributed to its success. Most importantly, if the simulation algorithm is implemented correctly, the Markov chain is guaranteed to converge to the target distribution $\pi(\theta \mid \mathbf{y})$ under rather broad conditions, regardless of where the chain was initialized. In other words, a Markov chain is able to improve its approximation to the true distribution at each step in the simulation. Furthermore, if the chain is run for a very long time (often required), you can recover $\pi(\theta \mid \mathbf{y})$ to any precision. Also, the simulation algorithm is easily extensible to models with a large number of parameters or high complexity, although the "curse of dimensionality" often causes problems in practice.

Properties of Markov chains are discussed in Feller (1968), Breiman (1968), and Meyn and Tweedie (1993). Ross (1997) and Karlin and Taylor (1975) give a non-measure-theoretic treatment of stochastic processes, including Markov chains. For conditions that govern Markov chain convergence and rates of convergence, see Amit (1991), Applegate, Kannan, and Polson (1990), Chan (1993), Geman and Geman (1984), Liu, Wong, and Kong (1991a, 1991b), Rosenthal (1991a, 1991b), Tierney (1994), and Schervish and Carlin (1992). Besag (1974) describes conditions under which a set of conditional distributions gives a unique joint distribution. Tanner (1993), Gilks, Richardson, and Spiegelhalter (1996), Chen, Shao, and Ibrahim (2000), Liu (2001), Gelman et al. (2004), Robert and Casella (2004), and Congdon (2001, 2003, 2005) provide both theoretical and applied treatments of MCMC methods. You can also see the section A Bayesian Reading List for a list of books with varying levels of difficulty of treatment of the subject and its application to Bayesian statistics.

Metropolis and Metropolis-Hastings Algorithms

The Metropolis algorithm is named after its inventor, the American physicist and computer scientist Nicholas C. Metropolis. The algorithm is simple but practical, and it can be used to obtain random samples from any arbitrarily complicated target distribution of any dimension that is known up to a normalizing constant. The Bayesian procedures use a special case of the Metropolis algorithm called the Gibbs sampler to obtain posterior samples.

Suppose you want to obtain $T$ samples from a univariate distribution with probability density function $f(\theta \mid \mathbf{y})$. Suppose $\theta^t$ is the $t$th sample from $f$. To use the Metropolis algorithm, you need to have an initial value $\theta^0$ and a symmetric proposal density $q(\theta^{t+1} \mid \theta^t)$. For the $(t+1)$th iteration, the algorithm generates a sample from $q(\cdot \mid \cdot)$ based on the current sample $\theta^t$, and it makes a decision to either accept or reject the new sample. If the new sample is accepted, the algorithm repeats itself by starting at the new sample. If the new sample is rejected, the algorithm starts at the current point and repeats. The algorithm is self-repeating, so it can be carried out as long as required. In practice, you have to decide the total number of samples needed in advance and stop the sampler after that many iterations have been completed.

Suppose $q(\theta_{\text{new}} \mid \theta^t)$ is a symmetric distribution. The proposal distribution should be an easy distribution from which to sample, and it must be such that $q(\theta_{\text{new}} \mid \theta^t) = q(\theta^t \mid \theta_{\text{new}})$, meaning that the likelihood of jumping to $\theta_{\text{new}}$ from $\theta^t$ is the same as the likelihood of jumping back to $\theta^t$ from $\theta_{\text{new}}$. The most common choice of the proposal distribution is the normal distribution $N(\theta^t, \sigma)$ with a fixed $\sigma$. The Metropolis algorithm can be summarized as follows:

1. Set $t = 0$. Choose a starting point $\theta^0$. This can be an arbitrary point as long as $f(\theta^0 \mid \mathbf{y}) > 0$.
2. Generate a new sample, $\theta_{\text{new}}$, by using the proposal distribution $q(\cdot \mid \theta^t)$.
3. Calculate the following quantity:

   $r = \min\left\{ \frac{f(\theta_{\text{new}} \mid \mathbf{y})}{f(\theta^t \mid \mathbf{y})},\ 1 \right\}$
4. Sample $u$ from the uniform distribution $U(0, 1)$.
5. Set $\theta^{t+1} = \theta_{\text{new}}$ if $u < r$; otherwise set $\theta^{t+1} = \theta^t$.
6. Set $t = t + 1$. If $t < T$, the number of desired samples, return to step 2. Otherwise, stop.

Note that the iteration count increases regardless of whether a proposed sample is accepted.

This algorithm defines a chain of random variates whose distribution will converge to the desired distribution $f(\theta \mid \mathbf{y})$, and so from some point forward, the chain of samples is a sample from the distribution of interest. In Markov chain terminology, this distribution is called the stationary distribution of the chain, and in Bayesian statistics, it is the posterior distribution of the model parameters. The reason that the Metropolis algorithm works is beyond the scope of this documentation, but you can find more detailed descriptions and proofs in many standard textbooks, including Roberts (1996) and Liu (2001). The random-walk Metropolis algorithm is used in PROC MCMC.
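The steps of the random-walk Metropolis algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any SAS procedure: the function name `metropolis`, the standard normal target, the starting point, and the proposal scale 2.4 are all hypothetical choices. The ratio is computed on the log scale to avoid numerical overflow.

```python
import math
import random


def metropolis(log_f, theta0, n_samples, sigma, rng):
    """Random-walk Metropolis for a univariate target known up to a constant.

    log_f     : log of the unnormalized target density
    theta0    : starting point with positive target density
    n_samples : number of iterations to run
    sigma     : standard deviation of the normal proposal (an assumption)
    """
    theta, log_f_theta = theta0, log_f(theta0)
    chain, n_accepted = [], 0
    for _ in range(n_samples):
        theta_new = rng.gauss(theta, sigma)      # symmetric proposal
        log_f_new = log_f(theta_new)
        # Acceptance ratio min(f(new) / f(current), 1), computed on the log scale.
        r = math.exp(min(log_f_new - log_f_theta, 0.0))
        if rng.random() < r:                     # accept the proposal
            theta, log_f_theta = theta_new, log_f_new
            n_accepted += 1
        chain.append(theta)                      # a rejection repeats the current point
    return chain, n_accepted / n_samples


# Hypothetical target: a standard normal density without its normalizing constant.
rng = random.Random(1)
chain, acceptance_rate = metropolis(lambda th: -0.5 * th * th, 0.5, 20_000, 2.4, rng)
mean = sum(chain) / len(chain)
variance = sum((th - mean) ** 2 for th in chain) / len(chain)
print(f"acceptance rate {acceptance_rate:.2f}, mean {mean:.2f}, variance {variance:.2f}")
```

Only the unnormalized log density is needed, which mirrors the point that the target has to be known only up to a normalizing constant.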

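You are not limited to a symmetric random-walk proposal distribution in establishing a valid sampling algorithm. A more general form, the Metropolis-Hastings (MH) algorithm, was proposed by Hastings (1970). The MH algorithm allows an asymmetric proposal distribution: $q(\theta_{\text{new}} \mid \theta^t) \neq q(\theta^t \mid \theta_{\text{new}})$. The difference in its implementation comes in calculating the ratio of densities, where $f$ is the target density and $q$ is the proposal density:

$r = \min\left\{ \frac{f(\theta_{\text{new}} \mid \mathbf{y})\, q(\theta^t \mid \theta_{\text{new}})}{f(\theta^t \mid \mathbf{y})\, q(\theta_{\text{new}} \mid \theta^t)},\ 1 \right\}$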

Other steps remain the same.
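A small Python sketch can make the role of the proposal-density correction concrete. Everything named here is an assumption for illustration: the function `metropolis_hastings`, the unnormalized Gamma(3, 1) target, and the log-normal multiplicative proposal with scale `STEP`. The proposal is asymmetric, so the proposal densities do not cancel in the ratio.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_q, theta0, n_samples, rng):
    """Metropolis-Hastings with a possibly asymmetric proposal.

    propose(current, rng) draws a candidate given the current point;
    log_q(to, frm) is the log proposal density of moving from `frm` to `to`
    (additive constants that cancel in the ratio may be dropped).
    """
    theta = theta0
    chain = []
    for _ in range(n_samples):
        theta_new = propose(theta, rng)
        # log of f(new) q(current | new) / (f(current) q(new | current))
        log_ratio = (log_f(theta_new) + log_q(theta, theta_new)
                     - log_f(theta) - log_q(theta_new, theta))
        if rng.random() < math.exp(min(log_ratio, 0.0)):
            theta = theta_new
        chain.append(theta)
    return chain


STEP = 0.8  # hypothetical proposal scale on the log axis


def log_gamma_target(theta):
    # Unnormalized Gamma(shape=3, rate=1) density: theta^2 * exp(-theta), theta > 0.
    return 2.0 * math.log(theta) - theta


def propose_lognormal(current, rng):
    # Multiplicative random walk: always positive, but not symmetric.
    return current * math.exp(rng.gauss(0.0, STEP))


def log_q_lognormal(to, frm):
    # Log-normal proposal density up to a constant; note the -log(to) term.
    return -math.log(to) - (math.log(to) - math.log(frm)) ** 2 / (2.0 * STEP ** 2)


rng = random.Random(2)
chain = metropolis_hastings(log_gamma_target, propose_lognormal, log_q_lognormal,
                            1.0, 20_000, rng)
print(f"posterior mean estimate: {sum(chain) / len(chain):.2f}  (exact value 3)")
```

Dropping the two `log_q` terms would turn this into a plain Metropolis ratio, which is not valid for this proposal and would bias the chain.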

The extension of the Metropolis algorithm to a higher-dimensional $\boldsymbol{\theta}$ is straightforward. Suppose $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)'$ is the parameter vector. To start the Metropolis algorithm, select an initial value for each $\theta_i$ and use a multivariate version of proposal distribution $q(\cdot \mid \cdot)$, such as a multivariate normal distribution, to select a $k$-dimensional new parameter. Other steps remain the same as those previously described, and this Markov chain eventually converges to the target distribution of $\boldsymbol{\theta}$. Chib and Greenberg (1995) provide a useful tutorial on the algorithm.

Independence Sampler

Another type of Metropolis algorithm is the "independence" sampler. It is called the independence sampler because the proposal distribution in the algorithm does not depend on the current point as it does with the random-walk Metropolis algorithm. For this sampler to work well, you want to have a proposal distribution that mimics the target distribution and have the acceptance rate be as high as possible.

The independence sampler can be summarized as follows:

1. Set $t = 0$. Choose a starting point $\theta^0$. This can be an arbitrary point as long as $f(\theta^0 \mid \mathbf{y}) > 0$.
2. Generate a new sample, $\theta_{\text{new}}$, by using the proposal distribution $q(\cdot)$. The proposal distribution does not depend on the current value of $\theta^t$.
3. Calculate the following quantity:

   $r = \min\left\{ \frac{w(\theta_{\text{new}})}{w(\theta^t)},\ 1 \right\}$, where $w(\theta) = f(\theta \mid \mathbf{y}) / q(\theta)$
4. Sample $u$ from the uniform distribution $U(0, 1)$.
5. Set $\theta^{t+1} = \theta_{\text{new}}$ if $u < r$; otherwise set $\theta^{t+1} = \theta^t$.
6. Set $t = t + 1$. If $t < T$, the number of desired samples, return to step 2. Otherwise, stop.

A good proposal density should have thicker tails than those of the target distribution. This requirement can be difficult to satisfy, especially when you do not know what the target posterior distribution looks like. In addition, this sampler does not produce independent samples as the name seems to imply, and sample chains from independence samplers can get stuck in the tails of the posterior distribution if the proposal distribution is not chosen carefully. The independence sampler is used in PROC MCMC.
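The independence sampler can be sketched in Python in the same spirit. The names and settings below are assumptions for illustration only: a standard normal target known up to a constant, and a wider normal proposal with standard deviation 2, chosen so that its tails are thicker than those of the target.

```python
import math
import random


def independence_sampler(log_f, draw_q, log_q, theta0, n_samples, rng):
    """Independence sampler: proposals are drawn without reference to the current point."""
    def log_w(theta):
        # Importance weight w(theta) = f(theta) / q(theta) on the log scale.
        return log_f(theta) - log_q(theta)

    theta, log_w_theta = theta0, log_w(theta0)
    chain, n_accepted = [], 0
    for _ in range(n_samples):
        theta_new = draw_q(rng)
        log_w_new = log_w(theta_new)
        # Accept with probability min(w(new) / w(current), 1).
        if rng.random() < math.exp(min(log_w_new - log_w_theta, 0.0)):
            theta, log_w_theta = theta_new, log_w_new
            n_accepted += 1
        chain.append(theta)
    return chain, n_accepted / n_samples


# Hypothetical setup: standard normal target, N(0, 2^2) proposal (thicker tails).
rng = random.Random(3)
chain, acceptance_rate = independence_sampler(
    log_f=lambda th: -0.5 * th * th,
    draw_q=lambda r: r.gauss(0.0, 2.0),
    log_q=lambda th: -th * th / 8.0,
    theta0=0.0,
    n_samples=20_000,
    rng=rng,
)
mean = sum(chain) / len(chain)
variance = sum((th - mean) ** 2 for th in chain) / len(chain)
print(f"acceptance rate {acceptance_rate:.2f}, mean {mean:.2f}, variance {variance:.2f}")
```

Although each proposal is drawn independently, rejected proposals repeat the current value, so consecutive elements of `chain` remain correlated.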

Gibbs Sampler

The Gibbs sampler, named by Geman and Geman (1984) after the American physicist Josiah W. Gibbs, is a special case of the Metropolis sampler in which the proposal distributions exactly match the posterior conditional distributions and proposals are accepted 100% of the time. Gibbs sampling requires you to decompose the joint posterior distribution into full conditional distributions for each parameter in the model and then sample from them. The sampler can be efficient when the parameters are not highly dependent on each other and the full conditional distributions are easy to sample from. Some researchers favor this algorithm because it does not require an instrumental proposal distribution as Metropolis methods do. However, while deriving the conditional distributions can be relatively easy, it is not always possible to find an efficient way to sample from these conditional distributions.

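Suppose $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)'$ is the parameter vector, $p(\mathbf{y} \mid \boldsymbol{\theta})$ is the likelihood, and $\pi(\boldsymbol{\theta})$ is the prior distribution. The full posterior conditional distribution of $\theta_i$ is proportional to the joint posterior density; that is,

$\pi(\theta_i \mid \theta_j, j \neq i, \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\theta})\, \pi(\boldsymbol{\theta})$

For instance, the one-dimensional conditional distribution of $\theta_1$ given $\theta_j = \theta_j^*$, $2 \le j \le k$, is computed as follows:

$\pi(\theta_1 \mid \theta_j = \theta_j^*, 2 \le j \le k, \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\theta} = (\theta_1, \theta_2^*, \ldots, \theta_k^*)')\, \pi(\boldsymbol{\theta} = (\theta_1, \theta_2^*, \ldots, \theta_k^*)')$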

The Gibbs sampler works as follows:

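1. Set $t = 0$, and choose an arbitrary initial value of $\boldsymbol{\theta}^{(0)} = (\theta_1^{(0)}, \ldots, \theta_k^{(0)})$.
2. Generate each component of $\boldsymbol{\theta}$ as follows:
   - draw $\theta_1^{(t+1)}$ from $\pi(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_k^{(t)}, \mathbf{y})$
   - draw $\theta_2^{(t+1)}$ from $\pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)}, \mathbf{y})$
   - ...
   - draw $\theta_k^{(t+1)}$ from $\pi(\theta_k \mid \theta_1^{(t+1)}, \ldots, \theta_{k-1}^{(t+1)}, \mathbf{y})$
3. Set $t = t + 1$. If $t < T$, the number of desired samples, return to step 2. Otherwise, stop.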

The name "Gibbs" was introduced by Geman and Geman (1984). Gelfand et al. (1990) first used Gibbs sampling to solve problems in Bayesian inference. See Casella and George (1992) for a tutorial on the sampler. The GENMOD, LIFEREG, and PHREG procedures update parameters using the Gibbs sampler.
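As a concrete illustration of cycling through full conditional distributions, the following Python sketch runs a Gibbs sampler on a case where the conditionals are known in closed form. The example is an assumption chosen for simplicity (a standard bivariate normal with correlation 0.6, not a posterior from any SAS procedure), and the function name `gibbs_bivariate_normal` is hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is univariate normal:
    theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and symmetrically for theta2.
    """
    theta1, theta2 = start
    cond_sd = math.sqrt(1.0 - rho * rho)
    chain = []
    for _ in range(n_samples):
        theta1 = rng.gauss(rho * theta2, cond_sd)   # uses the current theta2
        theta2 = rng.gauss(rho * theta1, cond_sd)   # uses the freshly drawn theta1
        chain.append((theta1, theta2))
    return chain


rng = random.Random(4)
chain = gibbs_bivariate_normal(rho=0.6, n_samples=20_000, rng=rng)
n = len(chain)
mean1 = sum(t1 for t1, _ in chain) / n
mean2 = sum(t2 for _, t2 in chain) / n
cov = sum((t1 - mean1) * (t2 - mean2) for t1, t2 in chain) / n
var1 = sum((t1 - mean1) ** 2 for t1, _ in chain) / n
var2 = sum((t2 - mean2) ** 2 for _, t2 in chain) / n
correlation = cov / math.sqrt(var1 * var2)
print(f"means ({mean1:.2f}, {mean2:.2f}), correlation {correlation:.2f}")
```

There is no accept-reject step: every draw from a full conditional is kept. As the correlation approaches 1, the conditional standard deviation shrinks and the chain moves more slowly, which illustrates why the sampler works best when parameters are not highly dependent.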

Adaptive Rejection Sampling Algorithm

The GENMOD, LIFEREG, and PHREG procedures use the adaptive rejection sampling (ARS) algorithm to sample parameters sequentially from their univariate full conditional distributions. The ARS algorithm is a rejection algorithm that was originally proposed by Gilks and Wild (1992). Given a log-concave density (the log of the density is concave), you can construct an envelope to the density by using linear segments. You then use the linear segment envelope as a proposal density (it becomes a piecewise exponential density on the original scale and is easy to generate samples from) in the rejection sampling. The log-concavity condition is met in some of the models fit by the procedures. For example, the posterior densities for the regression parameters in the generalized linear models are log-concave under flat priors. When this condition fails, the ARS algorithm calls for an additional Metropolis-Hastings step (Gilks, Best, and Tan; 1995), and the modified algorithm becomes the adaptive rejection Metropolis sampling (ARMS) algorithm. The GENMOD, LIFEREG, and PHREG procedures can recognize whether a model is log-concave and select the appropriate sampler for the problem at hand.

The GENMOD, LIFEREG, and PHREG procedures implement the ARMS algorithm based on code kindly provided by Walter R. Gilks, University of Leeds (Gilks; 2003), to obtain posterior samples. For a detailed description and explanation of the algorithm, see Gilks and Wild (1992) and Gilks, Best, and Tan (1995).
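The envelope construction can be written out for the tangent-based form of ARS. The notation below ($h$, $u$, $s$, the abscissae $x_j$) is introduced here for illustration and is not taken from the procedures' documentation; the implementation in the procedures might build its linear segments differently, for example without derivatives.

```latex
\begin{align*}
h(x) &= \log f(x) \quad \text{(concave)}, \qquad x_1 < x_2 < \cdots < x_m \\
u(x) &= \min_{1 \le j \le m} \bigl\{ h(x_j) + h'(x_j)\,(x - x_j) \bigr\} \;\ge\; h(x) \\
s(x) &= \frac{\exp\{u(x)\}}{\int \exp\{u(x')\}\, dx'} \\
\text{accept } x^* \sim s(x) \ &\text{ if } \ w \le \exp\{h(x^*) - u(x^*)\}, \qquad w \sim U(0, 1)
\end{align*}
```

Because $h$ is concave, every tangent line lies above it, so $\exp\{u(x)\}$ is a valid envelope, and it is piecewise exponential because $u$ is piecewise linear. Adding evaluated points to the abscissae tightens the envelope as sampling proceeds, which is the adaptive part of the algorithm.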

Burn-in, Thinning, and Markov Chain Samples

Burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the effect of initial values on the posterior inference is minimized. For example, suppose the target distribution is $\text{[math]}$ and the Markov chain was started at the value $\text{[math]}$ . The chain might quickly travel to regions around 0 in a few iterations. However, including samples around the value $\text{[math]}$ in the posterior mean calculation can produce substantial bias in the mean estimate. In theory, if the Markov chain is run for an infinite amount of time, the effect of the initial values decreases to zero. In practice, you do not have the luxury of infinite samples, so you assume that after $\text{[math]}$ iterations the chain has reached its target distribution, and you discard the early portion and use the remaining samples for posterior inference. The value of $\text{[math]}$ is the burn-in number.
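The effect of burn-in on a posterior mean estimate can be illustrated with a short Python sketch. All settings are hypothetical: a standard normal target, a random-walk Metropolis chain deliberately started at 100, and a burn-in number of 1,000.

```python
import math
import random


def random_walk_chain(theta0, n_samples, rng, sigma=1.0):
    """Random-walk Metropolis chain for a standard normal target (illustrative)."""
    theta, chain = theta0, []
    for _ in range(n_samples):
        theta_new = rng.gauss(theta, sigma)
        log_ratio = -0.5 * (theta_new ** 2 - theta ** 2)
        if rng.random() < math.exp(min(log_ratio, 0.0)):
            theta = theta_new
        chain.append(theta)
    return chain


rng = random.Random(5)
chain = random_walk_chain(theta0=100.0, n_samples=6_000, rng=rng)  # poor start

n_burn_in = 1_000                      # hypothetical burn-in number
kept = chain[n_burn_in:]               # discard the early portion of the chain

mean_all = sum(chain) / len(chain)
mean_kept = sum(kept) / len(kept)
print(f"mean with all draws: {mean_all:.2f}; mean after burn-in: {mean_kept:.2f}")
```

The early draws near the starting value pull the full-chain mean well away from 0, while the mean of the retained draws is close to the target mean.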

With some models you might experience poor mixing (or slow convergence) of the Markov chain. This can happen, for example, when parameters are highly correlated with each other. Poor mixing means that the Markov chain slowly traverses the parameter space (see the section Visual Analysis via Trace Plots for examples of poorly mixed chains) and the chain has high dependence. High sample autocorrelation can result in biased Monte Carlo standard errors. A common strategy is to thin the Markov chain in order to reduce sample autocorrelations. You thin a chain by keeping every $k$th simulated draw from each sequence. You can safely use a thinned Markov chain for posterior inference as long as the chain converges. It is important to note that thinning a Markov chain can be wasteful because you are throwing away a $(k-1)/k$ fraction of all the posterior samples generated. MacEachern and Berliner (1994) show that you always get more precise posterior estimates if the entire Markov chain is used. However, other factors, such as computer storage or plotting time, might prevent you from keeping all samples.
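Thinning amounts to a strided slice of the chain. The Python sketch below uses an autoregressive series as a stand-in for a poorly mixing chain; that stand-in, the helper `lag1_autocorrelation`, and the thinning rate of 10 are assumptions made for illustration.

```python
import random


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


# Stand-in for a poorly mixing chain (an assumption): an AR(1) series with
# coefficient 0.9, whose stationary distribution is N(0, 1).
rng = random.Random(6)
phi, x, chain = 0.9, 0.0, []
for _ in range(20_000):
    x = phi * x + rng.gauss(0.0, (1.0 - phi * phi) ** 0.5)
    chain.append(x)

k = 10                                  # thinning rate
thinned = chain[k - 1::k]               # keep every k-th draw

print(len(chain), len(thinned))         # 20000 2000
print(f"lag-1 autocorrelation: full {lag1_autocorrelation(chain):.2f}, "
      f"thinned {lag1_autocorrelation(thinned):.2f}")
```

The thinned series is less autocorrelated but contains only a tenth of the draws, which is the trade-off between storage and precision described above.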

To use the GENMOD, LIFEREG, MCMC, and PHREG procedures, you need to determine the total number of samples to keep ahead of time. This number is not obvious and often depends on the type of inference you want to make. Mean estimates do not require nearly as many samples as small-tail percentile estimates. In most applications, you might find that keeping a few thousand iterations is sufficient for reasonably accurate posterior inference. In all four procedures, the relationship between the number of iterations requested, the number of iterations kept, and the amount of thinning is as follows:

$\text{kept} = \left[ \frac{\text{requested}}{\text{thinning}} \right]$

where $[\,\cdot\,]$ is the rounding operator.
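As a small arithmetic check of this relationship, the following Python sketch computes the number of retained draws. The function name `iterations_kept` is hypothetical, and round-to-nearest is an assumption: the text says only that a rounding operator is applied, and Python's `round` resolves exact ties to the even integer, which might differ from the procedures' behavior.

```python
def iterations_kept(requested, thinning):
    """Number of retained draws: requested / thinning, rounded to an integer.

    Assumption: round-to-nearest, since the text only says "rounding operator".
    """
    return round(requested / thinning)


print(iterations_kept(10_000, 1))   # 10000
print(iterations_kept(10_000, 3))   # 3333
```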
