This example demonstrates how to perform Bayesian quantile regression by using SAS/STAT software’s MCMC procedure.
Quantile regression is a technique for estimating conditional quantile functions. With quantile regression, you can model any location within a distribution, and you can estimate a set of quantiles and produce a complete picture of the covariate effect. These capabilities enable you to answer questions about how changes in the covariates affect the location, scale, and shape of a response variable’s distribution. For a Bayesian approach to quantile regression, you form the likelihood function based on the asymmetric Laplace distribution, regardless of the actual distribution of the data. In general, you can choose any prior for the quantile regression parameters, but it has been shown that the use of improper uniform priors produces a proper joint posterior distribution (Yu and Moyeed, 2001).
The pth regression quantile is defined as any solution, , to the quantile regression minimization problem
where the loss function
Minimization of the loss function is equivalent to the maximization of a likelihood function formed by combining independently distributed asymmetric Laplace densities (Yu and Moyeed, 2001).
The asymmetric Laplace distribution has the following probability density function (Yu and Zhang, 2005):
Given the observations , the posterior distribution of , is given by
where is the prior distribution of and is the likelihood function, written as
There is no standard conjugate prior distribution for the parameters of the pth regression quantile, but you can use Markov chain Monte Carlo (MCMC) methods to approximate the posterior distributions of the unknown parameters. Yu and Moyeed (2001) prove that if you choose the prior of to be improper uniform, then the resulting joint posterior distribution is proper.
This example from Yu and Moyeed (2001) explores the distribution of the serum concentration (grams per liter) of immunoglobulin-G (IgG) conditional on
age in a sample of 298 children. A detailed discussion of the data appears in Isaacs et al. (1983) and Royston and Altman
(1994). The following DATA step reads the data and creates the data set
data igg; input IgG Age; Age2=age*age; datalines; 1.5 .5 2.7 .5 1.9 .5 ... more lines ... 7.6 5.916667 3.1 6 6.8 6 ; run;
You can use the SGPLOT procedure as follows to generate a histogram of the response variable
IgG for a visual inspection of its marginal distribution:
ods graphics on; proc sgplot data=igg; histogram igg / nbins=50; run;
Figure 1 shows the distribution of
IgG. The distribution is asymmetric and skewed to the right.
You can also use PROC SGPLOT as follows to generate a scatterplot of
proc sgplot data=igg; scatter x=Age y=IgG; run;
Figure 2 shows the relationship between
Age. There is some evidence of an upward trend.
Yu and Moyeed (2001) fit a quantile regression model of the form
The following SAS statements demonstrate how to use the MCMC procedure to fit this quantile regression model for a single quantile. Because PROC MCMC does not recognize the asymmetric Laplace distribution, you must write out the log-likelihood function and use the GENERAL() function.
The PROPCOV option in the PROC MCMC statement specifies that the conjugate-gradient method be used in constructing the initial covariance matrix for the Metropolis-Hastings algorithm. The NTU= option requests 1,000 tuning iterations. Experience shows that increasing the number of tuning iterations often results in improved mixing of the Markov chains. The MINTUNE= option requests a minimum of 10 proposal tuning loops. Increasing the minimum number of proposal tuning loops above the default number of 2 is sometimes needed in order to achieve convergence of the Markov chain. The NMC= option requests 30,000 MCMC iterations, excluding the burn-in iterations.
The BEGINCNST and ENDCNST statements define a statement block within which PROC MCMC processes the programming statements only during the setup stage of the simulation. You can use the BEGINCNST and ENDCNST statement block to define the constant , the quantile that is being fitted.
The PARMS statement lists the model parameters, b0, b1, and b2, and specifies that their initial values be 0.
The PRIOR statement specifies that the parameters b0, b1, and b2 have independent, improper uniform prior distributions.
The next three statements define the log-likelihood of the quantile regression model. The first and second statements are for convenience in writing the third statement. The first statement defines
Mu, the location parameter of the asymmetric Laplace distribution. The second statement defines the random variable
u, which is equal to the response variable
IgG minus the location parameter
Mu. The third statement defines the log-likelihood function in terms of the quantile and the random variable
The MODEL statement specifies that the response variable
IgG have an asymmetric Laplace distribution, as defined by the log-likelihood function
proc mcmc data=igg seed=5263 propcov=congra ntu=1000 mintune=10 nmc=30000; begincnst; p=0.95; endcnst; parms (b0-b2) 0; prior b: ~ general(0); mu= b0 + b1*age + b2*age2; u = igg - mu; ll = log(p)+log(1-p) - 0.5*(abs(u)+(2*p-1)*u); model igg ~ general(ll); run;
Figure 3 shows the trace plots, autocorrelation plots, and kernel density estimates of the parameters b0, b1, and b2, respectively. All three trace plots indicate convergence of the Markov chain. The autocorrelation plots show evidence of autocorrelation in the posterior samples, which often indicates slow mixing. The kernel density estimates are unimodal and relatively smooth.
Output 1 shows that the Monte Carlo standard errors (MCSE) of each parameter are small relative to the posterior standard deviations (SD). A small MCSE/SD ratio indicates that the Markov chain has stabilized and the mean estimates do not vary much over time.
Output 1: Monte Carlo Standard Errors
|Monte Carlo Standard Errors|
Output 2 shows the “Effective Sample Sizes” table. The autocorrelation times for the three parameters range from 16.59 to 23.08, and the efficiency rates are low. These results account for the relatively small effective sample sizes, given a nominal sample size of 30,000.
Output 2: Effective Sample Size
|Effective Sample Sizes|
Output 3 and Output 4 show the posterior summaries of the estimated parameters b0, b1, and b2. The sample mean of the output chain for the intercept, b0, is 7.17. The 95% credible interval is strictly positive, indicating with high probability that the intercept is positive. The sample mean for b1, the coefficient of the variable
Age, is –0.33. The 95% credible interval contains 0, indicating that the direction of the effect of
Age on the 0.95 quantile is ambiguous. The sample mean for b2, the coefficient of
Age2, is 0.23. The 95% credible interval also contains 0, indicating ambiguity about the direction of the effect of
Age2 on the 0.95 quantile.
Output 3: Posterior Summaries
Output 4: Posterior Intervals
|Parameter||Alpha||Equal-Tail Interval||HPD Interval|
In many quantile regression problems, it is useful to examine the quantile process, or how the estimated regression parameter for each covariate changes as
varies over the interval
. This example, which is a variation on an example
from Yu and Moyeed (2001), explores the stack loss data from Brownlee (1965). The data are from the operation of a
plant where ammonia is oxidized to nitric acid; the oxidation is measured on 21 consecutive days. The response variable,
StackLoss, is the percentage of ammonia lost (times 10),
and there are three explanatory variables:
AirFlow, which measures the air flow to the plant;
WaterTemp, which measures the cooling water inlet
AcidConc, which measures the acid concentration. The following DATA step reads the data and creates a SAS data set:
data stackloss; input stackloss airflow watertemp acidconc; datalines; 42 80 27 89 37 80 27 88 37 75 25 90 28 62 24 87 18 62 22 87 18 62 23 87 19 62 24 93 20 62 24 93 15 58 23 87 14 58 18 80 14 58 18 89 13 58 17 88 11 58 18 82 12 58 19 93 8 50 18 89 7 50 18 86 8 50 19 72 8 50 19 79 9 50 20 80 15 56 20 82 15 70 20 91 ; run;
Figure 4 shows the distribution of
StackLoss and scatter plots of
The distribution of
StackLoss is bimodal and right-skewed, and it has a noticeable gap between the values of 20 and 28. These characteristics suggest the possibility that
the response variable has a mixture distribution.
The relationship between
AirFlow appears to be linear.
The relationship between
WaterTemp is approximately linear, but there is a suggestion of nonlinearity.
The relationship between
AcidConc might best be described as heteroscedastic. Again, there is a suggestion that the response
variable has a mixture distribution because of the apparent clustering of
Figure 4: Visual Inspection of the Data
Distribution of Stack Loss
Stack Loss by Air Flow
Stack Loss by Water Temperature
Stackloss by Acid Concentration
Yu and Moyeed (2001) fit the following quantile regression model to these data:
The following statements show an efficient way to use the MCMC procedure to request the estimated quantile processes for the quantile regression parameters. Suppose you span the range of in increments of 0.05. That is, you want to estimate the quantiles; there are 19 quantiles in all. First, you create a new data set that has 19 copies of your original data set. The new data set must be indexed and sorted by . Second, you use the MCMC procedure’s BY-group processing capabilities to estimate the model parameters for each of the 19 quantiles. The ODS OUTPUT statement saves the “Posterior Summaries” and “Posterior Intervals” tables as SAS data sets. These data sets are used later to produce plots of the quantile processes for each parameter. The DATA= option of the PROC MCMC statement specifies the newly created input data set. The BEGINCNST and ENDCNST statement block, where you previously specified a single value for , is replaced with a BY statement. < /p>
data by_stackloss; set stackloss; do p = .05 to .95 by .05; output; end; run;
proc sort data=by_stackloss; by p; run;
ods output postsummaries=by_ps postintervals=by_pi; proc mcmc data=by_stackloss seed=73625 propcov=congra ntu=1000 nmc=30000 mintune=17; by p; parms (b0-b3) 0; prior b: ~ general(0); mu= b0 + b1*airflow + b2*watertemp + b3*acidconc; u = stackloss - mu; ll = log(p)+log(1-p) - 0.5*(abs(u)+(2*p-1)*u); model stackloss ~ general(ll); run;
The contents of the “Posterior Summaries” and “Posterior Intervals” tables for
the whole quantile process are saved in the ODS OUTPUT data sets
By_pi, respectively. The following statements merge these two
data sets, sort the merged data set, and then use the SGPLOT and PRINT procedures to produce quantile process plots and tables for each model parameter:
data process; merge by_ps by_pi; run;
proc sort data=process out=process; by parameter p; run;
proc sgplot data=process(where=(parameter="b0")); title "Estimated Parameter by Quantile"; title2 "With 95% HPD Interval"; series x=p y=mean / markers legendlabel="Intercept (b0)"; band x=p lower=hpdlower upper=hpdupper / transparency=.5 legendlabel="HPD Interval"; yaxis label="Intercept (b0)"; xaxis label="Quantile"; refline 0 / axis=y; run;
proc print data=process(where=(parameter="b0")) noobs; title "Estimated Parameter b0 by Quantile"; title2 "with 95% HPD Interval"; var p mean hpdlower hpdupper; run;
Figure 5 and Output 5 show that the sample means for the intercept, b0, are negative throughout the process. They are increasing up to the 0.15 quantile and are then relatively constant up to the 0.60 quantile. The means then begin a steady decline until the 0.85 quantile, where they begin increasing. The 95% HPD intervals are negative throughout most the process; the exception is the 0.95 quantile, whose 95% HPD interval contains 0.
Figure 5: Quantile Process Plot for b0
Output 5: Quantile Process Table for b0
|Estimated Parameter b0 by Quantile|
|with 95% HPD Interval|
Figure 6 and Output 6 show that the sample means for the parameter b1, the coefficient for
AirFlow, are positive throughout the process. The means increase for most of the process, peak at the 0.65 quantile, decrease monotonically until the 0.90 quantile, and then
increase slightly. The 95% HPD intervals are all strictly positive except for the 0.05 quantile’s interval, which contains 0.
Figure 6: Quantile Process Plot for b1
Output 6: Quantile Process Table for b1
|Estimated Parameter b1 by Quantile|
|with 95% HPD Interval|
Figure 7 and Output 7 show that the sample means for the parameter b2, the coefficient for
WaterTemp, are positive throughout the process. The means decrease monotonically up to the 0.50 quantile and then increase monotonically, forming a quadratic shape.
The 95% HPD intervals are all strictly positive.
Figure 7: Quantile Process Plot for b2
Output 7: Quantile Process Table for b2
|Estimated Parameter b2 by Quantile|
|with 95% HPD Interval|
Figure 8 and Output 8 show that the sample means for the parameter b3, the coefficient for
AcidConc, are relatively constant and close to, but systematically below, 0. The 95% HPD intervals contain 0 throughout the process.
Figure 8: Quantile Process Plot for b3
Output 8: Quantile Process Table for b3
|Estimated Parameter b3 by Quantile|
|with 95% HPD Interval|
Isaacs, D., Altman, D. G., Tidmarsh, C. E., Valman, H. B., and Webster, A. D. B. (1983), “Serum Immunoglobulin Concentrations in Preschool Children Measured by Laser Nephelometry: Reference Ranges for IgG, IgA, IgM,” Journal of Clinical Pathology, 36, 1193–1196.