Fitting Algorithms :: SAS/STAT(R) 13.1 User's Guide

Fitting Algorithms

Subsections:

Forward Selection
Backward Selection
Variable Transformations
Goodness-of-Fit Criteria
Generalized Linear Models
Fast Algorithm

The multivariate adaptive regression splines algorithm (Friedman, 1991b) is a predictive modeling algorithm that combines nonparametric variable transformations with a recursive partitioning scheme.

The algorithm originates with Smith (1982), who proposes a nonparametric method that applies the model selection method (stepwise regression) to a large number of truncated power spline functions, which are evaluated at different knot values. This method constructs spline functions and selects relevant knot values automatically with the model selection method. However, the method is applicable only to problems in low dimensions. For multiple variables, the number of tensor products between spline basis functions is too large to fit even a single model. The multivariate adaptive regression splines algorithm avoids this situation by using forward selection to build the model gradually instead of using the full set of tensor products of spline basis functions.

Like the recursive partitioning algorithm, which has “growing” and “pruning” steps, the multivariate adaptive regression splines algorithm contains two stages: forward selection and backward selection. During the forward selection process, candidate bases are created from interactions between existing parent bases and nonparametric transformations of continuous or classification variables. After the model grows to a certain size, the backward selection process begins deleting selected bases one at a time. The deletion continues until the null model is reached, and then an overall best model is chosen based on a goodness-of-fit criterion. The next three subsections give details about the selection process and the methods of nonparametric transformation of variables. The fourth subsection describes the goodness-of-fit criteria. The fifth subsection describes how the multivariate adaptive regression splines algorithm is applied to fit generalized linear models. The sixth subsection describes the fast algorithm (Friedman, 1993) for speeding up the fitting process.

Forward Selection

The forward selection process in the multivariate adaptive regression splines algorithm is as follows:

Initialize by setting $\mb {B}_0=\mb {1}$ and $M=1$ .
Repeat the following steps until the maximum number of bases $M_{\max }$ has been reached or the model cannot be improved by any combination of $\mb {B}_ m$ , $\mb {v}$ , and t.
1. Set the lack-of-fit criterion $\mathrm{LOF}^{*}=\infty$ .
2. For each selected basis $\mb {B}_ m, m\in \{ 0,\dots ,M-1\}$ , do the following for each variable $\mb {v}$ that is not already a factor of $\mb {B}_ m$ , that is, $\mb {v}\notin \{ \mb {v}(k,m)|1\le k\le K_ m\}$ :
  1. For each knot value (or a subset of categories) t of $\mb {v}: t\in \{ \mb {v}\}$ , form a model with all currently selected bases $\sum _{i=0}^{M-1}\mb {B}_ i$ and two new bases: $\mb {B}_ m\mb {T}_1(\mb {v},t)$ and $\mb {B}_ m\mb {T}_2(\mb {v},t)$ .
  2. Compute the lack-of-fit criterion for the new model LOF.
  3. If $\mathrm{LOF}<\mathrm{LOF}^{*}$ , then update $\mathrm{LOF}^{*}=\mathrm{LOF}$ , $m^{*}=m$ , $\mb {v}^{*}=\mb {v}$ , and $t^{*}=t$ .
3. Update the model by adding the two bases that improve the fit the most: $\mb {B}_{m^{*}}\mb {T}_1(\mb {v}^{*},t^{*})$ and $\mb {B}_{m^{*}}\mb {T}_2(\mb {v}^{*},t^{*})$ .
4. Set $M=M+2$ .

The essential part of each iteration is to search for the combination of $\mb {B}_ m$ , $\mb {v}$ , and t such that adding the two corresponding bases improves the model the most. The objective of the forward selection step is to build a model that overfits the data. The lack-of-fit criterion for linear models is usually the residual sum of squares (RSS).
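The forward search can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the ADAPTIVEREG implementation: it handles continuous variables only, uses every interior data value as a candidate knot, uses RSS as the lack-of-fit criterion, and stops only when the basis limit is reached. The names `rss`, `forward_select`, and `max_bases`, and the tiny ridge term in the solver, are choices made for this sketch.

```python
def rss(columns, y):
    """Residual sum of squares of the least-squares fit of y on the columns."""
    p, n = len(columns), len(y)
    # Normal equations A b = c. The tiny ridge term (an assumption of this
    # sketch) keeps the system solvable when candidate columns are collinear.
    A = [[sum(ci[r] * cj[r] for r in range(n)) + (1e-8 if i == j else 0.0)
          for j, cj in enumerate(columns)] for i, ci in enumerate(columns)]
    c = [sum(ci[r] * y[r] for r in range(n)) for ci in columns]
    for k in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        c[k], c[piv] = c[piv], c[k]
        for r in range(k + 1, p):
            f = A[r][k] / A[k][k]
            for j in range(k, p):
                A[r][j] -= f * A[k][j]
            c[r] -= f * c[k]
    b = [0.0] * p
    for k in reversed(range(p)):  # back substitution
        b[k] = (c[k] - sum(A[k][j] * b[j] for j in range(k + 1, p))) / A[k][k]
    return sum((y[r] - sum(b[i] * columns[i][r] for i in range(p))) ** 2
               for r in range(n))


def forward_select(X, y, max_bases=3):
    """Greedy forward pass; returns the chosen (m*, v*, t*) triples."""
    n, p = len(X), len(X[0])
    bases = [([1.0] * n, frozenset())]  # B_0 = 1 involves no variables
    terms = []
    while len(bases) + 2 <= max_bases:
        best = None  # holds (LOF*, m*, v*, t*, new columns)
        for m, (parent, used) in enumerate(bases):
            for v in range(p):
                if v in used:  # skip variables already in the parent basis
                    continue
                xv = [row[v] for row in X]
                for t in sorted(set(xv))[1:-1]:  # interior knots only
                    pair = [[b * max(x - t, 0.0) for b, x in zip(parent, xv)],
                            [b * max(t - x, 0.0) for b, x in zip(parent, xv)]]
                    lof = rss([col for col, _ in bases] + pair, y)
                    if best is None or lof < best[0]:
                        best = (lof, m, v, t, pair)
        if best is None:
            break
        _, m, v, t, pair = best
        used = bases[m][1] | {v}
        bases += [(pair[0], used), (pair[1], used)]  # M = M + 2
        terms.append((m, v, t))
    return terms


# Toy data with a single kink at x = 3, so one pair of bases fits it exactly.
xs = [float(i) for i in range(7)]
X = [[x] for x in xs]
y = [2.0 * max(x - 3.0, 0.0) + 0.5 * max(3.0 - x, 0.0) for x in xs]
terms = forward_select(X, y, max_bases=3)
```

On this toy data the search should pick the parent $\mb {B}_0$, the only variable, and the knot at 3, because that pair of bases reproduces the response exactly.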

Backward Selection

The backward selection process in the multivariate adaptive regression splines algorithm is as follows:

Initialize by setting the overall lack-of-fit criterion: $\mathrm{LOF}^{*}=\infty$ .
Repeat the following steps until the null model is reached. The final model is the best one that is found during the backward deletion process.
1. For each selected basis $\mb {B}_ m, m\in \{ 1,\dots ,M\}$ :
  1. Compute the lack-of-fit criterion, LOF, for a model that excludes $\mb {B}_ m$ .
  2. If $\mathrm{LOF}<\mathrm{LOF}^{*}$ , update $\mathrm{LOF}^{*}=\mathrm{LOF}$ and save the model as the best one.
  3. If LOF is the smallest value found so far in this pass, let $m^{*}=m$ .
2. Delete $\mb {B}_{m^{*}}$ from the current model.
3. Set $M=M-1$ .

The objective of the backward selection is to “prune” back the overfitted model to find the model that has the best predictive performance. Lack-of-fit criteria that measure only the model's fidelity to the original data, such as the residual sum of squares, are therefore not appropriate. Instead, the multivariate adaptive regression splines algorithm uses a quantity similar to the generalized cross validation criterion. See the section Goodness-of-Fit Criteria for more information.
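The backward pass can be sketched as follows. The sketch assumes that each pass deletes the basis whose removal gives the smallest criterion value, that $\mb {B}_0$ is never a deletion candidate, and that the overall best model is tracked across all passes. The function name `backward_select`, the callable `lof`, and the toy criterion `toy_lof` (whose numbers are made up purely for illustration) are hypothetical.

```python
import math


def backward_select(bases, lof):
    """Prune bases one at a time; return the best model seen and its LOF.

    bases: list of basis labels, with the intercept basis first.
    lof:   callable that maps a list of labels to a criterion value.
    """
    current = list(bases)
    best_lof, best_model = math.inf, list(current)  # LOF* = infinity
    while len(current) > 1:  # stop once only the null model remains
        # Criterion of every model that excludes one non-intercept basis.
        trials = {b: lof([c for c in current if c != b]) for b in current[1:]}
        drop = min(trials, key=trials.get)  # m* for this pass
        current.remove(drop)                # M = M - 1
        if trials[drop] < best_lof:         # overall best model so far
            best_lof, best_model = trials[drop], list(current)
    return best_model, best_lof


def toy_lof(model, n=50, d=3.0):
    """Made-up criterion: a GCV-like penalty applied to a made-up RSS."""
    gain = {1: 50.0, 2: 30.0, 3: 0.5, 4: 0.1}  # hypothetical RSS reductions
    rss = 100.0 - sum(gain[b] for b in model if b != 0)
    M = len(model)
    cost = M + d * (M - 1) / 2.0
    return rss / (n * (1.0 - cost / n) ** 2)


best_model, best_lof = backward_select([0, 1, 2, 3, 4], toy_lof)
```

With these made-up gains, the two weak bases (3 and 4) are pruned first, and the model that keeps bases 0, 1, and 2 has the smallest penalized criterion of all models visited.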

Variable Transformations

The type of transformation depends on the variable type:

For a continuous variable, the transformation is a linear truncated power spline,

$\mb {T}_1(\mb {v},t) = (v-t)_{+} = \begin{cases} v-t, & \mbox{if }v>t\\ 0, & \mbox{otherwise} \end{cases}$

$\mb {T}_2(\mb {v},t) = [-(v-t)]_{+} = \begin{cases} 0, & \mbox{if }v>t\\ t-v, & \mbox{otherwise} \end{cases}$

where t is a knot value for variable $\mb {v}$ and v is an observed value for $\mb {v}$ . Instead of examining every unique value of $\mb {v}$ , the algorithm uses a series of knot values that are separated by a minimum span, on the assumption that the underlying function is smooth. Friedman (1991b) uses the following formulas to determine a reasonable number of observations between knots (span size). For interior knots, the span size is determined by

$-\frac{2}{5}\log _2\left[-\frac{\log (1-\alpha )}{pn_ m} \right]$

For boundary knots, the span size is determined by

$3-\log _2\frac{\alpha }{p}$

where $\alpha$ is the parameter that controls the knot density, p is the number of variables, and $n_ m$ is the number of observations for which the parent basis $\mb {B}_ m>0$ .

For a classification variable, the transformation is an indicator function,

$\mb {T}_1(\mb {v},t) = \begin{cases} 1, & \mbox{if }v\in \{ c_1,\dots ,c_ t\} \\ 0, & \mbox{otherwise} \end{cases}$

$\mb {T}_2(\mb {v},t) = \begin{cases} 0, & \mbox{if }v\in \{ c_1,\dots ,c_ t\} \\ 1, & \mbox{otherwise} \end{cases}$

where $\{ c_1,\dots ,c_ t\}$ is a subset of all categories of variable $\mb {v}$ . Smoothing is applied to classification variables by assuming that subsets of categories tend to have similar properties, analogous to the assumption for continuous variables that observations in a local neighborhood have similar predictions.
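The transformations and span formulas above translate directly into code. The following Python sketch evaluates both pairs of transformations for a single observed value and computes the two span sizes. The function names and the example inputs ($\alpha=0.05$, $p=10$, $n_m=100$) are assumptions chosen for illustration, and the inner logarithm of the interior formula is taken to be the natural logarithm.

```python
import math


def hinge_pair(v, t):
    """T_1 and T_2 for a continuous value v and knot t."""
    return max(v - t, 0.0), max(t - v, 0.0)


def indicator_pair(v, subset):
    """T_1 and T_2 for a category v and a subset of categories."""
    inside = v in subset
    return (1.0 if inside else 0.0), (0.0 if inside else 1.0)


def interior_span(alpha, p, n_m):
    """Span size for interior knots: -(2/5) log2[-log(1 - alpha) / (p n_m)]."""
    return -0.4 * math.log2(-math.log(1.0 - alpha) / (p * n_m))


def boundary_span(alpha, p):
    """Span size for boundary knots: 3 - log2(alpha / p)."""
    return 3.0 - math.log2(alpha / p)


right_of_knot = hinge_pair(5.0, 3.0)          # only T_1 is active
left_of_knot = hinge_pair(1.0, 3.0)           # only T_2 is active
in_subset = indicator_pair("a", {"a", "b"})   # T_1 = 1, T_2 = 0
span_in = interior_span(0.05, 10, 100)
span_end = boundary_span(0.05, 10)
```

For any value, exactly one member of a hinge pair can be nonzero, and the two indicators always sum to 1.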

If a categorical variable has k distinct categories, then there are a total of $2^{k-1}-1$ possible subsets to consider. The computation cost is comparable to that of all-subsets selection in regression, which is expensive for large k values. The multivariate adaptive regression splines algorithm uses the stepwise selection method to select categories to form the subset $\{ c_1,\dots ,c_ t\}$ . The method is still greedy, but it reduces computation and still yields reasonable final models.
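The count $2^{k-1}-1$ can be checked by enumeration. A subset and its complement define the same pair of indicator bases with $\mb {T}_1$ and $\mb {T}_2$ swapped, so the following Python sketch always places one fixed category in the subset and excludes the full set. The function name `candidate_splits` is hypothetical, and the exhaustive enumeration is shown only to illustrate the count; it is not the stepwise method that the algorithm uses.

```python
from itertools import combinations


def candidate_splits(categories):
    """Enumerate the distinct two-way splits (subset versus complement)."""
    cats = sorted(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    for r in range(len(rest)):  # r < len(rest) excludes the full set
        for combo in combinations(rest, r):
            splits.append({first, *combo})
    return splits


counts = {k: len(candidate_splits(range(k))) for k in (3, 4, 5, 10)}
```

Already for k = 10 there are 511 splits per parent basis, which is why a greedy stepwise search over categories is attractive.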

Goodness-of-Fit Criteria

Like other nonparametric regression procedures, the multivariate adaptive regression splines algorithm can yield complicated models that involve high-order interactions in which many knot values or subsets are considered. Besides the basis functions, both the forward selection and backward selection processes are also highly nonlinear. Because of the trade-off between bias and variance, the complicated models that contain many parameters tend to have low bias but high variance. To select models that achieve good prediction performance, Craven and Wahba (1979) propose the widely used generalized cross validation criterion (GCV),

$\mathrm{GCV} = \frac{1}{n}\sum _{i=1}^ n \left(\frac{y_ i-\hat{f}_ i}{1-\mathrm{trace}(\mb {S})/n}\right)^2= \frac{\mathrm{RSS}}{n(1-\mathrm{trace}(\mb {S})/n)^2}$

where y is the response, $\hat{f}$ is an estimate of the underlying smooth function, and $\mb {S}$ is the smoothing matrix such that $\hat{\mb {y}}=\mb {Sy}$ . The effective degrees of freedom for the smoothing spline can be defined as $\mathrm{trace}(\mb {S})$ . In the multivariate adaptive regression splines algorithm, Friedman (1991b) uses a similar quantity as the lack-of-fit criterion,

$\mathrm{LOF} = \frac{\mathrm{RSS}}{n(1-(M+d(M-1)/2)/n)^2}$

where d is the degrees-of-freedom cost for each nonlinear basis function and M is the total number of linearly independent bases in the model. Because any candidate model that is evaluated at each step of the multivariate adaptive regression splines algorithm is a linear model, M is actually the trace of the hat matrix. The only difference between the GCV criterion and the LOF criterion is the extra term $d(M-1)/2$ . The corresponding effective degrees of freedom is defined as $M+d(M-1)/2$ . The quantity d takes into account the extra nonlinearity in forming new bases, and it operates as a smoothing parameter. Larger values of d tend to result in smoother function estimates. Based on many practical experiments and some theoretical work (Owen, 1991), Friedman suggests that the value of d is typically in the range of $[2,4]$ . For data that have complicated structures, the value of d could be much larger.
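As a small numeric illustration, the following Python sketch computes the effective degrees of freedom and the LOF criterion, and compares them with the GCV criterion for a linear model whose hat matrix has trace M. The function names and the example values (RSS = 20, n = 50, M = 3) are hypothetical.

```python
def effective_df(M, d):
    """Effective degrees of freedom: M + d(M - 1)/2."""
    return M + d * (M - 1) / 2.0


def lof(rss, n, M, d):
    """LOF = RSS / (n (1 - (M + d(M - 1)/2)/n)^2)."""
    return rss / (n * (1.0 - effective_df(M, d) / n) ** 2)


def gcv_linear(rss, n, M):
    """GCV for a linear model, where trace(S) = M."""
    return rss / (n * (1.0 - M / n) ** 2)


# Same fit, increasing degrees-of-freedom cost per nonlinear basis.
values = [lof(20.0, 50, 3, d) for d in (0.0, 2.0, 3.0, 4.0)]
```

With d = 0 the LOF criterion reduces to the GCV criterion of the linear model, and the penalty grows as d increases, which is how d acts as a smoothing parameter.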

Alternatively, you can use cross validation as the goodness-of-fit criterion, or you can use a separate validation data set to select models and a separate testing data set to evaluate the selected models.

Generalized Linear Models

Friedman (1991b) applies the multivariate adaptive regression splines algorithm to a logistic model by using the squared error loss between the response and inversely linked values in the goodness-of-fit criterion:

$\sum _{i=1}^ n\left(y_ i-\frac{1}{1+\exp (-\mb {x}_ i\bbeta )}\right)^2$

When a final model is obtained, the ordinary logistic model is fitted on the selected bases. Some implementations of the multivariate adaptive regression splines algorithm ignore the distributional properties and derive model bases that are based on the least squares criterion. The reason to ignore the distributional properties or to use least squares approximations is that examining the lack-of-fit criterion for each combination of $\mb {B}_ m,\mb {v}$ , and t is computationally formidable, because one generalized linear model fit involves multiple steps of weighted least squares. The ADAPTIVEREG procedure extends the multivariate adaptive regression splines algorithm to generalized linear models as suggested by Buja et al. (1991).

In the forward selection process, the ADAPTIVEREG procedure extends the algorithm in the following way. Suppose there are $(2k+1)$ bases after the kth iteration. Then a generalized linear model is fitted against the data by using the selected bases. The working weights and working response from the last step of the iteratively reweighted least squares algorithm are then used as the weight and response in a weighted least squares search for new bases in the $(k+1)$th iteration. The score chi-square statistic is used to select the two new bases. This is similar to the forward selection scheme that the LOGISTIC procedure uses. For more information about the score chi-square statistic, see the section Testing Individual Effects Not in the Model in Chapter 58: The LOGISTIC Procedure.
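To make the working weights and working response concrete, the following Python sketch computes them for one common special case: a binary response with the logit link. In that case the usual iteratively reweighted least squares quantities are $w_i=\mu_i(1-\mu_i)$ and $z_i=\eta_i+(y_i-\mu_i)/w_i$, where $\eta_i$ is the current linear predictor and $\mu_i$ is the fitted probability. The function name and the example inputs are hypothetical, and other distributions and links lead to different formulas.

```python
import math


def working_quantities(eta, y):
    """Working weights and working response for a binary logit model.

    eta: current linear predictor values; y: observed 0/1 responses.
    """
    weights, response = [], []
    for e, yi in zip(eta, y):
        mu = 1.0 / (1.0 + math.exp(-e))  # inverse logit link
        w = mu * (1.0 - mu)              # working weight
        weights.append(w)
        response.append(e + (yi - mu) / w)  # working response
    return weights, response


w, z = working_quantities([0.0, 0.0], [1, 0])
```

A weighted least squares search that uses `w` as the weight and `z` as the response can then evaluate candidate bases without refitting the generalized linear model for every candidate.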

In the backward selection process, the ADAPTIVEREG procedure extends the algorithm in the following way. Suppose there are M bases in the selected model. The Wald chi-square statistic is used to determine which basis to delete. After one basis is selected for deletion, a generalized linear model is refitted with the remaining bases. This is similar to the backward deletion scheme that the LOGISTIC procedure uses. For more information about the Wald chi-square statistic, see the section Testing Linear Hypotheses about the Regression Coefficients in Chapter 58: The LOGISTIC Procedure.

Accordingly, the lack-of-fit criterion in the forward selection for generalized linear models is the score chi-square statistic. For the lack-of-fit criterion in the backward selection process for generalized linear models, the residual sum of squares term is replaced by the model deviance.

Fast Algorithm

The original multivariate adaptive regression splines algorithm is computationally expensive. To improve the computation speed, Friedman (1993) proposes the fast algorithm. The essential idea of the fast algorithm is to reduce the number of combinations of $\mb {B},\mb {v}$ , and t that are examined at each step of forward selection.

Suppose there are $(2k+1)$ bases that are formed after the kth iteration, where a parent basis $\mb {B}_ m$ is selected to construct two new bases. Consider a queue with bases as its elements. At the top of the queue are the selected parent $\mb {B}_ m$ and two newly constructed bases, $\mb {B}_{2k}$ and $\mb {B}_{2k+1}$ . The rest of the queue is sorted based on the minimum lack-of-fit criterion for each basis:

$J(\mb {B}_ i) = \min _{\substack{\mathrm{for~ all~ eligible~ }\mb {v}\\ \mathrm{for~ all~ knot~ }t}}\mathrm{LOF}(\mb {v},t|\mb {B}_ i),~ i=1,\dots ,2k-1$

When k is not small, there are a relatively large number of bases in the model, and adding more bases is unlikely to dramatically improve the fit. Thus the ranking of the bases in the priority queue is not likely to change much between adjacent iterations, and the candidate parent bases can be restricted to the top K bases in the queue for the $(k+1)$th iteration. After the kth iteration, the top bases have new $J(\mb {B}_ i)$ values, whereas the values of the bottom bases are unchanged. The queue is then reordered based on the $J(\mb {B}_ i)$ values. The number of candidate parent bases, K, corresponds to the K= option value for the FAST option in the MODEL statement.

To avoid losing the candidate bases that are ranked at the bottom of the queue and to allow them to rise back to the top, a natural “aging” factor is introduced into each basis. This is accomplished by defining the priority for each basis function to be

$P(\mb {B}_ i) = R(\mb {B}_ i)+\beta (k_ c-k_ r)$

where $R(\mb {B}_ i)$ is the rank of the ith basis in the queue, $k_ c$ is the current iteration number, and $k_ r$ is the number of the iteration where the $J(\mb {B}_ i)$ value was last computed. The top K candidate bases are then sorted again based on this priority. Larger $\beta$ values cause bases that had low improvement during previous iterations to rise faster to the top of the list. The parameter $\beta$ corresponds to the BETA= value for the FAST option in the MODEL statement.
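The effect of the aging factor can be seen in a small Python sketch. It assumes that a larger rank and a larger priority both mean "closer to the top of the queue"; this ordering convention, the function names, and the toy queue are assumptions of the sketch.

```python
def priority(rank, k_current, k_last, beta):
    """P(B_i) = R(B_i) + beta * (k_c - k_r)."""
    return rank + beta * (k_current - k_last)


def top_candidates(queue, k_current, beta, K):
    """Return the K basis labels with the highest priority.

    queue maps a label to (rank, iteration at which J was last computed).
    """
    ordered = sorted(
        queue,
        key=lambda b: priority(queue[b][0], k_current, queue[b][1], beta),
        reverse=True,
    )
    return ordered[:K]


# B3 has the lowest rank, but its J value is six iterations old.
queue = {"B1": (3, 10), "B2": (2, 10), "B3": (1, 4)}
no_aging = top_candidates(queue, k_current=10, beta=0.0, K=2)
with_aging = top_candidates(queue, k_current=10, beta=0.5, K=2)
```

Without aging, the stale basis B3 never re-enters the top two; with $\beta=0.5$ its priority becomes 4 and it moves to the front.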

For a candidate basis at the top of the priority queue, the minimum lack-of-fit criterion $J(\mb {B}_ i)$ is recomputed for all eligible variables $\mb {v}$ for the $(k+1)$th iteration. An optimal variable is likely to be the same as the one that was found during the previous iteration. So the fast multivariate adaptive regression splines algorithm introduces another factor, H, to save computation cost. The factor specifies how often $J(\mb {B}_ i)$ should be recomputed for all eligible variables. If H = 1, then optimization over all variables is done at each iteration when a parent basis is considered. If H = 5, the complete optimization is done after five iterations. For an iteration count less than the specified H, the optimization is done only for the optimal variable that was found in the last complete optimization. The only exceptions are the top three candidates: $\mb {B}_{2k-1}$ (which is the parent basis $\mb {B}_ m$ used to construct the two new bases) and the two new bases, $\mb {B}_{2k}$ and $\mb {B}_{2k+1}$ . The complete optimization for them is performed at each iteration. The factor H corresponds to the H= option value for the FAST option in the MODEL statement.
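One way to read this rule is as a per-basis decision about which variables to search. The following Python sketch is one possible interpretation, not the procedure's implementation: it assumes that a counter of iterations since the last complete optimization is kept for each parent basis. The function name and all parameter names are hypothetical.

```python
def variables_to_search(all_vars, last_best_var, iters_since_full, H,
                        always_full=False):
    """Choose the variables over which J(B_i) is re-optimized.

    always_full marks the three exceptions (the previous parent basis and
    the two newly created bases), which always get a complete search.
    """
    if always_full or last_best_var is None or iters_since_full >= H:
        return list(all_vars)  # complete optimization over all variables
    return [last_best_var]     # reuse the variable from the last full search


eligible = ["x1", "x2", "x3"]
every_time = variables_to_search(eligible, "x2", 1, H=1)
partial = variables_to_search(eligible, "x2", 2, H=5)
due_again = variables_to_search(eligible, "x2", 5, H=5)
new_basis = variables_to_search(eligible, "x2", 0, H=5, always_full=True)
```

Larger H values skip more complete searches, which trades a slightly less thorough search for fewer criterion evaluations.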
