The GAM Procedure

Computational Resources

Since PROC GAM implements a doubly iterative method (inner backfitting iterations within each local scoring iteration), data are accessed multiple times in performing a fit. To expedite the data access, PROC GAM keeps the data used in the analysis in memory.


\begin{eqnarray*}  n &  = &  \mr {number~  of~  observations~  used~  in~  the~  analysis} \\ p_ r &  = &  \mr {number~  of~  parametric~  variables} \\ p_ s &  = &  \mr {number~  of~  univariate~  spline~  smoothers} \\ p_ l &  = &  \mr {number~  of~  loess~  smoothers}\\ p_ b &  = &  \mr {number~  of~  bivariate~  \text {thin-plate}~  spline~  smoothers} \\ p &  = &  p_ r + p_ s + p_ l + p_ b \\ p_ n &  = &  p_ s + p_ l + p_ b \\ m &  = &  \mr {maximum~  number~  of~  iterations~  for~  the~  backfitting~  algorithm} \\ \end{eqnarray*}

In addition to the space to store the data ($8np$ bytes), the minimum working space (in bytes) needed for fitting a model using PROC GAM is

\[  (16+8p_ r)(n+2p_ r)+(160+48p+16p_ s+8p_ b+8p_ l)n+8p+32p_ b+32p_ s+8m+8n+(4n+4)p_ s+4.  \]

For fitting bivariate thin-plate smoothing spline variables, an extra $80+120n+8n^2+8p_ b$ bytes of memory are needed. For fitting loess variables, an extra $48n+16p_ l$ bytes of memory are needed. If model inference or confidence limits are requested, additional memory is required.

It is difficult to provide accurate estimates of the time required to fit a GAM model. Both the backfitting algorithm and the local scoring algorithm are iterative techniques whose convergence rates depend on the particular data being analyzed. Furthermore, the time required depends on the types of smoothers that you specify, as well as on the inferential information you request.

You can estimate the time required for problems with a larger number of observations by observing the time required for smaller problems and then using the following growth rules (obtained using by simulations) that show that the time required grows proportionally with the following:

  • $n^3$ when at least one bivariate thin-plate spline is used

  • $n^{3/2}$ when only loess smoothers are used

  • n when only univariate smoothing splines are used

For additive models (models with Gaussian response distribution) with a fixed number of observations, the time required is roughly proportional to $p_ n^{3/2}$. For generalized additive models (models with non-Gaussian distributions), the computation time grows more rapidly as $p_ n$ increases. This is harder to quantify as it depends on the distribution family and the number of iterations required for the local scoring algorithm to converge.

Figure 41.4: Feasible Problem Sizes for Different Smoothers

Figure 41.4 shows a rough estimation of feasible sizes for the smoothers that you can use, as a function of the number of observations and number of smoothing components. This figure depicts the regions where you can expect a single fit of an additive model to finish within a few minutes on a typical Pentium 4 system.

Note that the times reflected in Figure 41.4 are based on fitting additive models (no local scoring iterations) when no analysis of deviance or confidence limits are computed. The time required for fitting generalized additive models grows proportionally with the number of the local scoring iterations. Furthermore, analysis of deviance (if you do not request the fast approximations with the ANODEV option) requires fitting multiple GAM models as each smoothing component is omitted sequentially, and so the time estimates need to be multiplied by the number of smoothing components when analysis of deviance is performed. Finally computation of confidence limits for each individual smoother increases the time required, especially when loess smoothers are utilized.

For univariate spline smoothers, subject to the aforementioned caveats, problems that correspond to all shaded regions in Figure 41.4 can be completed within a few minutes. For univariate loess smoothers, the two darkest regions are feasible. For bivariate spline smoothers, problems that correspond to only the darkest shading can be completed in the order of a few minutes. The problems that correspond to the upper right unshaded region might be possible, but they require long computation times.