The GLMSELECT Procedure

Building the SSCP Matrix

Traditional implementations of the FORWARD and STEPWISE selection methods start by computing the augmented crossproduct matrix for all the specified effects. This initial crossproduct matrix is updated as effects enter or leave the current model by sweeping the columns corresponding to the parameters of the entering or departing effects. Building the starting crossproduct matrix can be done with a single pass through the data and requires $O(m^2)$ storage and $O(n m^2)$ work, where $n$ is the number of observations and $m$ is the number of parameters. If $k$ selection steps are done, then the total work sweeping effects in and out of the model is $O(k m^2)$. When $n \gg m$, the work required is dominated by the time spent forming the crossproduct matrix. However, when $m$ is large (tens of thousands), just storing the crossproduct matrix becomes intractable even though the number of selected parameters might be small. Note also that when interactions of classification effects are considered, the number of parameters can be large even though the number of effects is much smaller.
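As a rough illustration of the storage cost, assume $m = 50{,}000$ parameters and 8-byte double-precision entries. Both figures are assumptions chosen for the arithmetic, not values taken from the procedure. A full $m \times m$ crossproduct matrix then needs

$$m^2 \times 8 \text{ bytes} = (5 \times 10^4)^2 \times 8 \text{ bytes} = 2 \times 10^{10} \text{ bytes} \approx 20 \text{ GB}.$$

Storing only one triangle of the symmetric matrix would roughly halve this to about 10 GB, which is still substantial for a model that might end up selecting only a few dozen parameters.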

When the number of selected parameters is smaller than the total number of parameters, it turns out that many of the crossproducts are not needed in the selection process. Let $\mb{y}$ denote the dependent variable, and suppose at some step of the selection process that $\bX $ denotes the $n \times p$ design matrix columns corresponding to the currently selected model. Let $\bZ = [\mb{z}_1, \mb{z}_2, \ldots, \mb{z}_{m-p}]$ denote the design matrix columns corresponding to the $m-p$ parameters of the effects not yet in the model. Then in order to compute the reduction in the residual sum of squares when $\mb{z}_j$ is added to the model, the only additional crossproducts needed are $\mb{z}_j'\mb{y}$, $\mb{z}_j'\bX $, and $\mb{z}_j'\mb{z}_j$. Note that it is not necessary to compute any of the crossproducts $\mb{z}_j'\mb{z}_i$ with $i \neq j$. If $p \ll m$, this yields a substantial saving in both the memory required and the computational work. Note, however, that this strategy does require a pass through the data at any step where adding an effect to the model is considered.
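To see why only these crossproducts are needed, consider the simplest case. The following is a sketch under two assumptions that are not stated above: the candidate $\mb{z}_j$ is a single column, and $\bX $ has full column rank so that $(\bX '\bX )^{-1}$ exists. Write $\mb{b} = (\bX '\bX )^{-1}\bX '\mb{y}$ for the current least squares estimates. The reduction in the residual sum of squares from adding $\mb{z}_j$ is then

$$\Delta \mathrm{RSS}_j = \frac{\left(\mb{z}_j'\mb{y} - \mb{z}_j'\bX \,\mb{b}\right)^2}{\mb{z}_j'\mb{z}_j - \mb{z}_j'\bX \,(\bX '\bX )^{-1}\,\bX '\mb{z}_j}.$$

The quantities $\bX '\bX $ and $\bX '\mb{y}$ are already available from the current model, so the only new terms are $\mb{z}_j'\mb{y}$, $\mb{z}_j'\bX $, and $\mb{z}_j'\mb{z}_j$. No crossproduct between two different candidates appears. For an effect with several parameters, the scalar numerator and denominator would become a vector and a matrix, but the same three kinds of crossproducts suffice.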

PROC GLMSELECT supports both of these strategies for building the crossproduct matrix. You can choose which of these strategies to use by specifying the BUILDSSCP=FULL or BUILDSSCP=INCREMENTAL option in the PERFORMANCE statement. If you request BACKWARD selection, then the full SSCP matrix is required. Similarly, if you request the BIC or CP criterion as the SELECT=, CHOOSE=, or STOP= criterion, or if you request the display of one or both of these criteria with the STATS=BIC, STATS=CP, or STATS=ALL option, then the full model needs to be computed. If you do not specify the BUILDSSCP= option, then PROC GLMSELECT switches to the incremental strategy if the number of effects is greater than 100. This default strategy is designed to give good performance when the number of selected parameters is less than about 20% of the total number of parameters. Hence if you choose options that you know will cause the selected model to contain a significantly higher percentage of the total number of candidate parameters, then you should consider specifying BUILDSSCP=FULL. Conversely, if you specify fewer than 100 effects in the MODEL statement but many of these effects have a large number of associated parameters, then specifying BUILDSSCP=INCREMENTAL might result in improved performance.
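As a minimal sketch of the last case, the following step requests the incremental strategy for a model with few effects but many parameters. The data set `work.candidates` and the variables `y`, `region`, `product`, and `x1`-`x20` are hypothetical names used only for illustration. The interaction of two classification variables with many levels is assumed to be what makes the parameter count large. The step uses FORWARD selection with the SBC criterion so that none of the options that require the full model are in effect.

```sas
/* Hypothetical data set and variable names, for illustration only */
proc glmselect data=work.candidates;
   /* Classification variables assumed to have many levels each */
   class region product;
   /* Few effects, but region*product can contribute many parameters */
   model y = region|product x1-x20
         / selection=forward(select=sbc);
   /* Request the incremental strategy explicitly */
   performance buildsscp=incremental;
run;
```

If you expect the selected model to retain a large share of the candidate parameters, replacing the PERFORMANCE statement option with BUILDSSCP=FULL is the alternative to try.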