Many nonparametric methods do not perform well when there are a large number of independent variables in the model. The sparseness of data in this setting inflates the variance of the estimates. The problem of rapidly increasing variance for increasing dimensionality is sometimes referred to as the "curse of dimensionality." Interpretability is another problem with nonparametric regression based on kernel and smoothing spline estimates (Hastie and Tibshirani 1990).
To overcome these difficulties, Stone (1985) proposed additive models. These models estimate an additive approximation to the multivariate regression function. The benefits of an additive approximation are at least twofold. First, because each of the individual additive terms is estimated using a univariate smoother, the curse of dimensionality is avoided, at the cost of not being able to approximate universally. Second, estimates of the individual terms explain how the dependent variable changes with the corresponding independent variables.
Hastie and Tibshirani (1990) proposed generalized additive models. These models assume that the mean of the dependent variable depends on an additive predictor through a nonlinear link function. Generalized additive models permit the response probability distribution to be a member of the exponential family of distributions. Many widely used statistical models belong to this general class, including additive models for Gaussian data, nonparametric logistic models for binary data, and nonparametric log-linear models for Poisson data.
Suppose that is a response random variable and is a set of predictor variables. A regression procedure can be viewed as a method of estimating how the value of depends on the values of . The standard linear regression model assumes that the expected value of has a linear form:
Given a sample of values for and , estimates of are often obtained by the least squares method.
The additive model generalizes the linear model by modeling the expected value of Y as
where , , are smooth functions. These functions are estimated in a nonparametric fashion.
Generalized additive models extend traditional linear models in another way, namely by allowing for a link between and the expected value of . This amounts to allowing for an alternative distribution for the underlying random variation besides just the normal distribution. Although Gaussian models can be used in many statistical applications, there are types of problems for which they are not appropriate. For example, the normal distribution might not be adequate for modeling discrete responses such as counts or bounded responses such as proportions.
Generalized additive models consist of a random component, an additive component, and a link function that relates these two components to each other. The response , the random component, is assumed to have a density in the exponential family
where is called the natural parameter and is the scale parameter. The normal, binomial, and Poisson distributions are all in this family. The quantity
where are smooth functions defines the additive component. Finally, the relationship between the mean of the response variable and is defined by a link function . The most commonly used link function is the canonical link, for which .
A combination of backfitting and local scoring algorithms is used in the actual fitting of the model.
Generalized additive models and generalized linear models can be applied in similar situations, but they serve different analytic purposes. Generalized linear models emphasize estimation and inference for the parameters of the model; generalized additive models focus on exploring data nonparametrically.
SAS/STAT software provides two procedures that fit generalized additive models: the GAM procedure and the GAMPL procedure. The features, comparative differences, and an example that uses both procedures are presented in the discussion that follows.
The GAM procedure implements the generalized additive model proposed by Hastie and Tibshirani (1990). PROC GAM does the following:
fits nonparametric or semiparametric additive models
supports the use of multidimensional data
supports multiple SCORE statements
enables you to specify the model degrees of freedom or smoothing parameter
PROC GAM can fit Gaussian, binomial, Poisson, gamma, and inverse Gaussian distributions. Although theoretically more than one link can exist for each distribution, PROC GAM always uses the canonical link. This is because the difference between link alternatives is less pronounced for nonparametric models, as a result of the flexibility of nonparametric model forms.
The GAMPL procedure is a high-performance procedure that fits generalized additive models that are based on low-rank regression splines (Wood 2006). Each spline term is constructed by the thin-plate regression spline technique (Wood 2003). A roughness penalty is applied to each spline term by a smoothing parameter that controls the balance between goodness of fit and the roughness of the spline curve. PROC GAMPL fits models for standard distributions in the exponential family, such as binomial, gamma, Gaussian, inverse Gaussian, Poisson, and negative binomial distributions.
PROC GAMPL runs in either single-machine mode or distributed mode. Note: Distributed mode requires SAS High-Performance .
PROC GAMPL offers the following basic features:
estimates the regression parameters of a generalized additive model that has fixed smoothing parameters by using penalized likelihood estimation
estimates the smoothing parameters of a generalized additive model by using either the performance iteration method or the outer iteration method
estimates the regression parameters of a generalized linear model by using maximum likelihood techniques
tests the total contribution of each spline term based on the Wald statistic
provides model-building syntax in the CLASS statement and effect-based parametric effects in the MODEL statement, which are used in other SAS/STAT analytic procedures (in particular, the GLM, LOGISTIC, GLIMMIX, and MIXED procedures)
provides response-variable options
enables you to construct a spline term by using multiple variables
provides control options for constructing a spline term, such as fixed degrees of freedom, initial smoothing parameter, fixed smoothing parameter, smoothing parameter search range, user-supplied knot values, and so on
provides multiple link functions for any distribution
provides a WEIGHT statement for weighted analysis
provides a FREQ statement for grouped analysis
provides an OUTPUT statement to produce a data set that has predicted values and other observationwise statistics
produces graphs via ODS Graphics
The GAMPL procedure is a high-performance analytical procedure. It does everything in the preceding list, plus it also does the following:
enables you to run in distributed mode on a cluster of machines that distribute the data and the computations
enables you to run in single-machine mode on the server where SAS is installed
exploits all the available cores and concurrent threads, regardless of execution mode
Both the GAMPL procedure and the GAM procedure in SAS/STAT software fit generalized additive models. However, the GAMPL procedure uses different approaches for constructing spline basis expansions, fitting generalized additive models, and testing smoothing components. The GAMPL procedure focuses on automatic smoothing parameter selection by using global model-evaluation criteria to find optimal models. The GAM procedure focuses on constructing models by fitting partial residuals against each smoothing term. In general, you should not expect similar results from the two procedures. The following sections summarize the differences.
The GAMPL procedure uses thin-plate regression splines to construct basis expansions for each spline term, and each term allows multiple variables. The GAM procedure uses univariate or bivariate smoothing splines to construct basis expansions, and each term allows only one or two variables. The thin-plate regression splines that PROC GAMPL uses are low-rank approximations of multivariate smoothing splines. PROC GAM also allows loess smoothers.
The GAMPL procedure fits a generalized additive model that has fixed smoothing parameters by using a global design matrix and a roughness penalty matrix. The GAM procedure uses partial residuals to fit against single smoothing terms. For models that have unknown smoothing parameters, PROC GAMPL estimates smoothing parameters simultaneously by optimizing global criteria such as generalized cross validation (GCV) and the unbiased risk estimator (UBRE). PROC GAM estimates each smoothing parameter by optimizing the local criterion GCV one spline term at a time.
The GAMPL procedure supports all the distribution families and all the link functions that the GAM procedure supports. In addition, PROC GAMPL fits models in the negative binomial family. PROC GAMPL supports any link function for each distribution, whereas PROC GAM supports only the canonical link for each distribution.
The GAMPL procedure tests the total contribution for a spline term, including both linear and nonlinear trends. The GAM procedure tests the existence of nonlinearity for a spline term beyond the linear trend.
A global Bayesian posterior covariance matrix is available for models that are fitted by the GAMPL procedure. The confidence limits for prediction of each observation are available, in addition to componentwise confidence limits. For generalized additive models that are fitted by the GAM procedure, only the componentwise confidence limits are available, and they are based on the partial residuals for each smoothing term. The degrees of freedom for generalized additive models that are fitted by PROC GAMPL is defined as the trace of the degrees-of-freedom matrix. The degrees of freedom for generalized additive models that are fitted by PROC GAM is approximated by summing the trace of the smoothing matrix for each smoothing term.
The data in this example are based on a study by Bell et al. (1994). Bell and his associates studied the
result of multiple-level thoracic and lumbar laminectomy, a corrective spinal surgery commonly performed on
children. The data in the study consist of retrospective measurements on 83 patients. The specific outcome of
interest is the presence (1) or absence (0) of kyphosis, defined as a forward flexion of the spine of at least
40 degrees from vertical. The available predictor variables are age in months at time of the operation
(Age
), the starting of vertebrae levels involved in the operation (StartVert
), and the number of levels involved (NumVert
). The
goal of this analysis is to identify risk factors for kyphosis.
The following statements fit a logistic additive model by using the GAM procedure. The model consists of the
binary dependent variable Kyphosis
and the ordinary independent variables
NumVert
, Age
, and StartVert
. Each term is fit by using a univariate smoothing spline with three degrees of
freedom. Of these three degrees of freedom, one is taken up by the linear portion of the fit and two are left
for the nonlinear spline portion. Although this might seem to be an unduly modest amount of flexibility, it is
better to be conservative with a data set this small.
ods graphics on; proc gam data=kyphosis plots=components(clm commonaxes); model kyphosis (event='1')= spline(NumVert,df=3) spline(Age,df=3) spline(StartVert,df=3) /dist=binary; run;
The PLOTS=COMPONENTS option in the PROC GAM statement produces plots of the individual smoothing components. The CLM suboption in the PLOTS=COMPONENTS option adds a curvewise Bayesian confidence band to each smoothing component, and the COMMONAXES suboption forces all three smoothing component plots to share the same vertical axis limits, allowing a visual judgment of the relative nonparametric effect sizes.
The MODEL statement requests an additive model with a univariate smoothing spline for each term. The response variable option EVENT= chooses Kyphosis=1 (presence) as the event so that the probability of the presence of kyphosis is modeled. The option DIST=BINARY with binary responses specifies a logistic model.
Output 1 displays the model summary statistics.
Output 1: Summary Statistics
Note: | PROC GAM is modeling the probability that Kyphosis=1. One way to change this to model the probability that Kyphosis=0 is to specify the response variable option EVENT='0'. |
Iteration Summary and Fit Statistics | |
---|---|
Number of local scoring iterations | 9 |
Local scoring convergence criterion | 2.6613916E-9 |
Final Number of Backfitting Iterations | 1 |
Final Backfitting Criterion | 5.2335391E-9 |
The Deviance of the Final Estimate | 46.610920737 |
The critical part of the PROC GAM results is the "Analysis of Deviance" table, shown in Output 2. For each smoothing effect in the model, this table gives a test that compares the deviance between the full model and the model without the nonparametric
component of this variable. The analysis of deviance results indicate that the nonparametric effect of
Age
is highly significant, the nonparametric effect of StartVert
is nearly significant, and the nonparametric effect of NumVert
is insignificant at the 5% level.
Output 2: Model Fit Statistics
The plots in Figure 1 show that the partial predictions corresponding
to both Age
and StartVert
have a quadratic
pattern, whereas NumVert
has a more complicated but ultimately nonsignificant
pattern.
Figure 1: Partial Prediction for Each Predictor
The following statements fit a generalized additive model to the kyphosis study data by using the GAMPL procedure:
proc gampl data=kyphosis plots seed=12345; model kyphosis (event='1')= spline(NumVert) spline(Age) spline(StartVert) /dist=binary; run;
The PLOTS option in the PROC GAMPL statement produces a panel of plots of partial prediction curves or surfaces of smoothing components. The SEED= option specifies an integer that is used to start the pseudorandom number generator for subset sampling from observations to form knots if necessary, and for truncated eigendecomposition.
The MODEL statement requests an additive model with a univariate smoothing spline for each term. PROC GAMPL uses an optimization algorithm to determine the optimal smoothing parameter. The response variable option EVENT= chooses Kyphosis=1 (presence) as the event so that the probability of the presence of kyphosis is modeled. The option DIST=BINARY with binary responses specifies a logistic model.
Output 3 displays the model summary statistics.
Output 3: Summary Statistics
Output 4 displays the model fit statistics, which include spline transformations of all variables.
Output 4: Fit Statistics
Output 5 displays the parameter estimates and test for the smoothing
components. PROC GAMPL reports that the effective degrees of freedom value for spline transformations of
NumVert
is quite close to 1. This suggests a strictly linear form for
NumVert
. For Age
and StartVert
, the degrees of freedom values are much larger than 1. This suggests a nonlinear
pattern in the dependency of the response on Age
and StartVert
.
The "Tests for Smoothing Components" table shows that the spline transformations for NumVert
and StartVert
are significant in predicting diabetes
testing results at the 5% level; the significance test for Age
is
borderline.
Output 5: Parameter Estimates and Tests
The smoothing component panel (which is produced by the PLOTS option and shown in Output 42.2.9) visualizes
the spline transformations for the three variables in addition to 95% Bayesian curvewise confidence bands. For
NumVert
, the spline transformation is almost a straight line. For Age
, the spline transformation shows a nonmonotonic nonlinear trend and that it is
unambiguously nonzero only within a fairly narrow range of values. For NumVert
,
the dependency is slightly nonlinear and monotonically decreasing.
Output 6: Smoothing Components Panel
For more information, see the following:
The GAM Procedure in SAS/STAT User's Guide
The GAMPL Procedure in SAS/STAT User's Guide: High-Performance Procedures or SAS/STAT User's Guide
Hastie, T. J., and Tibshirani, R. J. (1990). Generalized Additive Models. New York: Chapman & Hall.
Stone, C. J. (1985). “Additive Regression and Other Nonparametric Models.” Annals of Statistics 13:689–705.
Wood, S. (2003). “Thin Plate Regression Splines.” Journal of the Royal Statistical Society, Series B 65:95–114.
Wood, S. (2006). Generalized Additive Models. Boca Raton, FL: Chapman & Hall/CRC.