Sample 69692: Quantile-based prediction intervals for generalized linear models

Contents: Purpose / History / Requirements / Usage / Details / Limitations / Missing Values / See Also

 

PURPOSE:
The GLMPI macro computes 100(1-α)% prediction intervals using quantiles of the specified response distribution. A wide range of distributions is supported. Intervals that include uncertainty in the mean estimate are also available.
HISTORY:
 
The version of the GLMPI macro that you are using is displayed when you specify anything as the first argument. Here is an example:
%glmpi(v)

The GLMPI macro always attempts to check for a later version of itself. If it cannot do so (for example, because no internet connection is available), the macro issues the following message:

NOTE: Unable to check for newer version of GLMPI macro.

The computations performed by the macro are not affected by the appearance of this message. However, you can avoid this check by specifying nochk as the first macro argument. This action can be useful if your machine has no connection to the internet.

Version   Update Notes
2.0       Rewritten to provide quantile-based intervals.
REQUIREMENTS:
The GLMPI macro requires only Base SAS®. A modeling procedure in SAS/STAT® is required to fit the generalized linear model and save the information that is required by the macro.
USAGE:
Before running the GLMPI macro, you must first fit a generalized linear model and save the predicted values. This is typically done using the OUTPUT statement. See the examples on the Results tab.

Follow the instructions on the Downloads tab of this sample to save the GLMPI macro definition. Replace the text within quotation marks in the following statement with the location of the GLMPI macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the GLMPI macro and make it available for use:

   %inc "<location of your file containing the GLMPI macro>";

Following this statement, you can call the GLMPI macro. See the Results tab for examples.
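
As a minimal sketch of this workflow (not one of the examples from the Results tab), the following statements fit a Poisson model with PROC GENMOD, save the predicted means and their confidence limits, and then call the macro. The data set COUNTS, its values, and the variable names MU, LCL, and UCL are hypothetical and chosen only for illustration.

```sas
/* Hypothetical example data: a count response Y and one predictor X */
data counts;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20
;

/* Fit the model; save predicted means and confidence limits for the mean */
proc genmod data=counts;
   model y = x / dist=poisson;
   output out=pois_out p=mu lower=lcl upper=ucl;
run;

/* Define the macro (replace the text in quotation marks), then call it.   */
/* type= is omitted, so both interval types are requested, which is why    */
/* lower= and upper= are supplied.                                         */
%inc "<location of your file containing the GLMPI macro>";
%glmpi(data=pois_out, pred=mu, dist=poisson, lower=lcl, upper=ucl)
```

Because out= is omitted, the limits should appear in the default GLM_PI data set under the default variable names.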

The following are required when using the GLMPI macro:

pred=variable-name
Specifies the variable in the data= data set that contains the predicted values from the fitted model.
dist=distribution procedure
Specifies the name of the distribution of the response variable and the procedure that is used to fit the model. Distribution should be one of the following: POISSON, NEGBIN, GAMMA, NORMAL, BINOMIAL, IGAUSS, GENPOISSON, TWEEDIE, WEIBULL, BETA, or LOGNORMAL. Procedure should be one of the following: ADAPTIVEREG, FMM, GAMPL, GENMOD, GLIMMIX, HPFMM, HPGENSELECT, HPLOGISTIC, NLMIXED, PROBIT, or SURVEYLOGISTIC. Procedure can be omitted if the model was fit by PROC GENMOD. For the normal distribution, GLM, REG, and LIFEREG can also be specified as the procedure, but those procedures can provide prediction intervals directly as discussed in the Details section below.

For distributions that have an estimated scale or dispersion parameter, scale= must also be specified. Note that the geometric distribution is supported by specifying dist=negbin scale=1. Similarly, the exponential distribution is supported with dist=gamma scale=1. NLMIXED is supported, but only for the distributions available in its MODEL statement (not including GENERAL) and also assuming that the distribution scale parameter is not altered either in programming statements or in the distribution option in the MODEL statement.
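
One way to supply scale= without retyping the estimate is to capture it from the procedure output. The following is a sketch for a gamma model fit by PROC GENMOD. It assumes that the GLMPI macro has already been defined with %inc and that the macro expects the scale value as PROC GENMOD reports it in the "Scale" row of its ParameterEstimates table. The data set COST, its values, and the macro variable GSCALE are hypothetical.

```sas
/* Hypothetical positive, right-skewed response */
data cost;
   input x y @@;
   datalines;
1 1.2  2 1.9  3 2.2  4 3.8  5 3.1  6 5.5  7 4.9  8 7.4  9 6.0  10 9.3
;

/* Save the parameter estimates table along with the predicted values */
ods output ParameterEstimates=pe;
proc genmod data=cost;
   model y = x / dist=gamma link=log;
   output out=gam_out p=mu lower=lcl upper=ucl;
run;

/* Copy the estimated scale into a macro variable (name is illustrative) */
data _null_;
   set pe;
   if Parameter = 'Scale' then call symputx('gscale', Estimate);
run;

%glmpi(data=gam_out, pred=mu, dist=gamma, scale=&gscale, lower=lcl, upper=ucl)
```

For a negative binomial model, PROC GENMOD labels the corresponding row "Dispersion" rather than "Scale", so the IF condition would need to change accordingly.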

The following might be required:

lower=variable-name
Required if type=qu or type=all. Names the variable that contains the lower confidence limit for the mean in the data= data set.
upper=variable-name
Required if type=qu or type=all. Names the variable that contains the upper confidence limit for the mean in the data= data set.
response=variable-name
Required if options=cover. Specifies the response variable in the data= data set.
trials=variable-name
Required if dist=binomial. Specifies the variable in the data= data set that represents the number of trials for a binomial response. See the Details section below for information when the trials= variable equals 1.
power=value
Required if dist=tweedie. Specifies the value of the Tweedie power parameter that is estimated by the modeling procedure.
scale=value
Required for distributions that include a scale or dispersion parameter. Specifies the value of the scale or dispersion parameter that is estimated by the modeling procedure.

The following are optional:

data=data-set-name
Specifies the name of the data set that contains the predicted values of the fitted model. If omitted, the last-created data set is used.
type=Q | QU | ALL
Specifies the type of prediction limits. type=q requests only quantile-based limits. type=qu requests only quantile-based limits that include uncertainty in the mean. type=all requests both. The default is type=all.
out=data-set-name
Names the output data set that is created by the GLMPI macro. This data set contains all of the variables in the data= data set as well as the requested prediction limits. See lclq=, uclq=, lclqu=, uclqu=. If omitted, the data set is named GLM_PI.
alpha=α
Requests prediction intervals with 100(1-α)% nominal coverage. α must be between 0 and 1. By default, α=0.05, which produces 95% prediction intervals. Note that the coverage of intervals that include mean uncertainty generally exceeds the nominal coverage probability.
lclq=variable-name
Names the variable that contains the lower quantile-based prediction limit for a future observation in the out= data set. The default name is LCLQ.
uclq=variable-name
Names the variable that contains the upper quantile-based prediction limit for a future observation in the out= data set. The default name is UCLQ.
lclqu=variable-name
Names the variable that contains the lower quantile-based prediction limit that includes uncertainty in the mean for a future observation in the out= data set. The default name is LCLQU.
uclqu=variable-name
Names the variable that contains the upper quantile-based prediction limit that includes uncertainty in the mean for a future observation in the out= data set. The default name is UCLQU.
options=COVER
Computes and displays the prediction interval coverage proportion in the data= data set. If omitted, interval coverage is not computed.
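
The following call sketches how several of these options might be combined. It assumes that the macro is already defined and that a hypothetical data set POIS_OUT contains the observed response Y, the predicted mean MU, and the confidence limits LCL and UCL. The output data set name PI90 is also illustrative.

```sas
/* 90% intervals that include mean uncertainty, with coverage computed. */
/* nochk as the first argument skips the check for a newer version.     */
%glmpi(nochk, data=pois_out, pred=mu, dist=poisson, type=qu,
       lower=lcl, upper=ucl, response=y, alpha=0.1,
       out=pi90, options=cover)

/* Limits are in the default variables LCLQU and UCLQU */
proc print data=pi90;
run;

/* Coverage proportion, as saved by options=cover */
proc print data=cover;
run;
```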
DETAILS:
The GLMPI macro provides quantile-based prediction intervals for future, individual values in generalized linear models. While confidence intervals for the mean in generalized linear models can be obtained using options in many procedures, such as the GENMOD and GLIMMIX procedures, prediction intervals generally cannot.

To use the GLMPI macro, the desired model should first be fit using the appropriate modeling procedure. Use the available options in the procedure to save the predicted means of the observations in a data set, optionally along with confidence limits. Specify this data set in data= in the GLMPI macro. See the above descriptions of the other options available in the macro.

The prediction interval limits are quantiles of the specified response distribution. The macro transforms the mean and scale or dispersion parameter (if applicable) estimated by the modeling procedure into the parameters of the response distribution. These parameters are then used in the QUANTILE function to obtain the prediction limits for each observation in the data= data set. Additionally, prediction intervals that include uncertainty in the mean estimate can optionally be produced (type=qu). This is done by using the confidence limits of the mean rather than the estimated mean when obtaining the quantile-based limits.
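
To make the idea concrete, the following DATA step sketches the computation for a Poisson response, whose only parameter is the mean. It is an illustration under stated assumptions, not the macro's source code: it assumes equal-tailed limits (α/2 in each tail) and assumes that the lower and upper limits with mean uncertainty use the lower and upper confidence limits of the mean, respectively. The input values are hypothetical.

```sas
/* Hypothetical predicted means with confidence limits for the mean */
data pois_out;
   input mu lcl ucl;
   datalines;
2.5 1.8 3.4
10  8.1 12.3
;

data pi_sketch;
   set pois_out;
   alpha = 0.05;
   /* Quantile-based limits evaluated at the estimated mean */
   lclq  = quantile('POISSON', alpha/2,     mu);
   uclq  = quantile('POISSON', 1 - alpha/2, mu);
   /* Limits that include mean uncertainty: evaluate the quantiles */
   /* at the confidence limits of the mean instead                 */
   lclqu = quantile('POISSON', alpha/2,     lcl);
   uclqu = quantile('POISSON', 1 - alpha/2, ucl);
run;

proc print data=pi_sketch;
run;
```

Distributions with a scale or dispersion parameter additionally require converting the procedure's estimates into the parameterization used by the QUANTILE function, which is not shown here.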

In the case of a normally distributed response, prediction intervals are directly available using options in the OUTPUT statement in the GLM and REG procedures. Additionally, for a few response distributions including the normal, lognormal, and Weibull distributions, quantile-based prediction limits can be produced using the P= and Q= options in the OUTPUT statement of the LIFEREG procedure.
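
For example, the following PROC REG step saves prediction limits for individual values directly, using the LCL= and UCL= keywords in the OUTPUT statement. The model and the output variable names are illustrative. SASHELP.CLASS is a sample data set that ships with SAS.

```sas
proc reg data=sashelp.class;
   model weight = height;
   /* LCL= and UCL= are limits for an individual prediction;   */
   /* LCLM= and UCLM= would give confidence limits for the mean */
   output out=reg_pi p=pred lcl=lower_pi ucl=upper_pi;
run;
quit;
```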

The coverage proportion in the data= data set can be computed by creating an indicator variable that equals 1 if the prediction interval contains the observed response and 0 otherwise. The mean of this variable over the out= data set is the coverage proportion. This can be done for the intervals with and without including mean uncertainty. However, the coverage proportion for the intervals that include mean uncertainty will generally exceed the nominal coverage probability, 100(1-α)%.
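
A manual version of this computation might look like the following sketch. It assumes that the macro was run with type=all and the default output names (data set GLM_PI with variables LCLQ, UCLQ, LCLQU, and UCLQU) and that the observed response is a variable named Y, which is hypothetical.

```sas
data cover_check;
   set glm_pi;
   /* Indicators are left missing when the response is missing, */
   /* so those rows do not enter the means below                */
   if not missing(y) then do;
      in_q  = (lclq  <= y <= uclq);
      in_qu = (lclqu <= y <= uclqu);
   end;
run;

/* The mean of each indicator is the coverage proportion */
proc means data=cover_check mean;
   var in_q in_qu;
run;
```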

For binary response data, prediction intervals are not very useful when each observation in the data is a single Bernoulli (binary) response, coded 0 or 1. In such a case, a prediction interval for population i with event probability p_i will capture 100% of future observations if the limits contain the entire [0,1] range, or 0% of the observations if the limits are both within the [0,1] range, or 100p_i% or 100(1-p_i)% of the observations if one limit of the interval falls in the [0,1] range. Prediction intervals can be useful for binomial data in which each observation represents a set of independent Bernoulli trials. Such data are modeled using events/trials syntax in most procedures that fit binary response models. In this case, future observations can have observed event proportions across the [0,1] range, and a 100(1-α)% prediction interval might attain its nominal coverage when the limits fall in the [0,1] range.
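
The contrast can be seen with the QUANTILE function directly. The following sketch computes equal-tailed 95% binomial quantiles, on the count scale, for a single trial and for 20 trials at a hypothetical event probability of 0.3. Whether the macro computes binomial limits in exactly this way is an assumption. With one trial, the limits are 0 and 1, so the interval contains every possible future observation and conveys nothing.

```sas
data binom_demo;
   p = 0.3;                 /* hypothetical event probability */
   do n = 1, 20;            /* single Bernoulli trial versus 20 trials */
      lower = quantile('BINOMIAL', 0.025, p, n);
      upper = quantile('BINOMIAL', 0.975, p, n);
      output;
   end;
run;

proc print data=binom_demo;
run;
```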

Output data sets

The out= data set contains all observations in the data= data set plus variables that contain the limits of the requested prediction intervals. The prediction limit variables are named as specified in lclq= and uclq=, and/or in lclqu= and uclqu= if type=qu or type=all.

If options=cover is specified, the COVER data set is also created. It contains the coverage proportion(s) in the data= data set for either or both of the interval types, as specified by type=.

LIMITATIONS:
Confidence and prediction limits are not available for multinomial or zero-inflated models. They are also not available for GEE models fit by using the REPEATED statement. See the dist= description above for more details.
MISSING VALUES:
Quantile-based prediction interval limits can be computed for any observation in the data= data set that has a nonmissing value of the pred= variable. Most modeling procedures are able to provide a predicted value if the observed response is missing but none of the predictors or other variables defining the model are missing.
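
This behavior offers a simple way to obtain intervals for new settings of the predictors: add observations with a missing response before fitting the model. In the following sketch, the last observation (x=12) has a missing response, so it does not contribute to the fit but still receives a predicted mean and therefore prediction limits. The data and variable names are hypothetical, and the macro is assumed to be already defined.

```sas
/* The final observation has a missing response (.) */
data counts_plus;
   input x y @@;
   datalines;
1 2  2 3  3 6  4 7  5 8  6 9  7 10  8 12  9 15  10 20  12 .
;

proc genmod data=counts_plus;
   model y = x / dist=poisson;
   output out=pred_plus p=mu lower=lcl upper=ucl;
run;

%glmpi(data=pred_plus, pred=mu, dist=poisson, lower=lcl, upper=ucl)
```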
SEE ALSO:
See this note that discusses several response distributions and shows how parameters estimated by a modeling procedure can be transformed for use in the RAND, PDF, QUANTILE, and related functions.



These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.