The GAM Procedure

Example 4.1: Generalized Additive Model with Binary Data

The following example illustrates the capabilities of the GAM procedure and compares it to the GENMOD procedure.

The data used in this example are based on a study by Bell et al. (1989). Bell and his associates studied the results of multiple-level thoracic and lumbar laminectomy, a corrective spinal surgery commonly performed on children. The data in the study consist of retrospective measurements on 83 patients. The specific outcome of interest is the presence (1) or absence (0) of kyphosis, defined as a forward flexion of the spine of at least 40 degrees from vertical. The available predictor variables are the age in months at the time of the operation (Age), the starting vertebra level involved in the operation (StartVert), and the number of levels involved (NumVert). The goal of this analysis is to identify risk factors for kyphosis. PROC GENMOD can be used to investigate the relationship between kyphosis and the predictors. The following DATA step creates the data set kyphosis, and the PROC GENMOD step fits a logistic regression model to it:

   title 'Comparing PROC GAM with PROC GENMOD';
   data kyphosis;
      input Age StartVert NumVert Kyphosis @@;
      datalines;
   71  5  3  0     158 14 3 0    128 5   4 1
   2   1  5  0     1   15 4 0    1   16  2 0
   61  17 2  0     37  16 3 0    113 16  2 0
   59  12 6  1     82  14 5 1    148 16  3 0
   18  2  5  0     1   12 4 0    243 8   8 0
   168 18 3  0     1   16 3 0    78  15  6 0
   175 13 5  0     80  16 5 0    27  9   4 0
   22  16 2  0     105 5  6 1    96  12  3 1
   131 3  2  0     15  2  7 1    9   13  5 0
   12  2 14  1     8   6  3 0    100 14  3 0
   4   16 3  0     151 16 2 0    31  16  3 0
   125 11 2  0     130 13 5 0    112 16  3 0
   140 11 5  0     93  16 3 0    1   9   3 0
   52  6  5  1     20  9  6 0    91  12  5 1
   73  1  5  1     35  13 3 0    143 3   9 0
   61  1  4  0     97  16 3 0    139 10  3 1
   136 15 4  0     131 13 5 0    121 3   3 1
   177 14 2  0     68  10 5 0    9   17  2 0
   139 6  10 1     2   17 2 0    140 15  4 0
   72  15 5  0     2   13 3 0    120 8   5 1
   51  9  7  0     102 13 3 0    130 1   4 1
   114 8  7  1     81  1  4 0    118 16  3 0
   118 16 4  0     17  10 4 0    195 17  2 0
   159 13 4  0     18  11 4 0    15  16  5 0
   158 15 4  0     127 12 4 0    87  16  4 0
   206 10 4  0     11  15 3 0    178 15  4 0
   157 13 3  1     26  13 7 0    120 13  2 0
   42  6  7  1     36  13 4 0
   ;

   proc genmod data=kyphosis;
      model Kyphosis = Age StartVert NumVert
                       / link=logit dist=binomial;
   run;

Output 4.1.1: GENMOD Analysis: Partial Output
 
Comparing PROC GAM with PROC GENMOD

The GENMOD Procedure

Analysis Of Parameter Estimates
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Chi-Square Pr > ChiSq
Intercept 1 1.2497 1.2424 -1.1853 3.6848 1.01 0.3145
Age 1 -0.0061 0.0055 -0.0170 0.0048 1.21 0.2713
StartVert 1 0.1972 0.0657 0.0684 0.3260 9.01 0.0027
NumVert 1 -0.3031 0.1790 -0.6540 0.0477 2.87 0.0904
Scale 0 1.0000 0.0000 1.0000 1.0000    

NOTE: The scale parameter was held fixed.


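The GENMOD analysis of the independent variable effects is shown in Output 4.1.1. Based on these results, the only significant factor is StartVert, with an estimated log odds ratio of 0.1972 (p = 0.0027). The variable NumVert has a p-value of 0.0904, with an estimated log odds ratio of -0.3031. These estimates are coefficients on the logit scale, not odds ratios; the corresponding odds ratios are exp(0.1972), about 1.22, and exp(-0.3031), about 0.74. The signs are opposite to those of the linear terms in Output 4.1.3, which suggests that PROC GENMOD is modeling the probability of Kyphosis=0 here.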
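
The estimates in Output 4.1.1 are on the logit scale, and their direction depends on which response level is modeled. The following statements are a minimal sketch of how both points might be checked. They assume that the DESCENDING option of the PROC GENMOD statement is available in your release, so that the probability of Kyphosis=1 is modeled. The variable names OR_StartVert and OR_NumVert are illustrative and do not appear elsewhere in this example.

   /* Sketch: refit so that Kyphosis=1 is the modeled event      */
   /* (assumes the DESCENDING option is supported)               */
   proc genmod data=kyphosis descending;
      model Kyphosis = Age StartVert NumVert
                       / link=logit dist=binomial;
   run;

   /* Convert the logit-scale estimates from Output 4.1.1        */
   /* to odds ratios by exponentiating them                      */
   data _null_;
      OR_StartVert = exp( 0.1972);
      OR_NumVert   = exp(-0.3031);
      put OR_StartVert= 6.4 OR_NumVert= 6.4;
   run;

If the refit behaves as assumed, the estimates should keep their magnitudes and reverse their signs, while the Wald chi-square statistics and p-values should be unchanged.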

The GENMOD model above assumes that each predictor has a strictly linear effect on the logit of the response probability. The following SAS statements use PROC GAM to investigate a less restrictive model, with moderately flexible spline terms for each of the predictors:

   title 'Comparing PROC GAM with PROC GENMOD';
   proc gam data=kyphosis;
      model Kyphosis=spline(Age,df=3) spline(StartVert,df=3) 
                       spline(NumVert,df=3) /dist = logist;
      output out=estimate p;
   run;

The MODEL statement requests an additive model using a univariate smoothing spline for each term. The option dist=logist specifies a logistic model. Each term is fitted with three degrees of freedom. Although this might seem to be an unduly modest amount of flexibility, it is better to be conservative with a data set this small. The OUTPUT statement requests an output data set, estimate, that contains the predicted values.
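
Because the choice of three degrees of freedom is a judgment call, it can be worth checking how sensitive the fit is to it. The following statements are a sketch of one way to do so, assuming that the METHOD=GCV option of the MODEL statement is supported in your release. With that option and no DF= value on the spline terms, the smoothing parameters are chosen by generalized cross validation rather than fixed in advance.

   /* Sketch: let GCV choose the amount of smoothing             */
   /* (assumes METHOD=GCV is available; DF= is omitted on        */
   /* purpose so that it does not override the GCV choice)       */
   proc gam data=kyphosis;
      model Kyphosis=spline(Age) spline(StartVert)
                       spline(NumVert) /dist = logist method=GCV;
   run;

With only 83 observations, a GCV-selected fit can be noticeably more flexible than the df=3 fit, so treat it as a diagnostic rather than a replacement. The outputs that follow refer to the df=3 model, not to this sketch.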

Output 4.1.2 and Output 4.1.3 list the output from PROC GAM.

Output 4.1.2: Summary Statistics
 
Comparing PROC GAM with PROC GENMOD

The GAM Procedure
Dependent Variable: Kyphosis
Smoothing Model Component: spline(Age) spline(StartVert) spline(NumVert)

Iteration Summary and Fit Statistics
Number of local score iterations 9
Local score convergence criterion 3.9786363E-9
Final number of backfitting iterations 1
Final backfitting criterion 5.2326788E-9
Final residual sum of squares 46.610928346
 
Summary of Input Data Set
Number of Observations 83
Number of Missing Observations 0
Distribution Binomial
Link Function Logit

Output 4.1.3: Analysis of Model
 
Comparing PROC GAM with PROC GENMOD

The GAM Procedure
Dependent Variable: Kyphosis
Smoothing Model Component: spline(Age) spline(StartVert) spline(NumVert)

Regression Model Analysis
Parameter Estimates
Parameter Parameter Estimate Standard Error t Value Pr > |t|
Intercept -2.01545 0.74274 -2.71 0.0083
L_Age 0.01213 0.00622 1.95 0.0552
L_StartVert -0.18615 0.06061 -3.07 0.0030
L_NumVert 0.38347 0.15264 2.51 0.0142
 
Smoothing Model Analysis
Fit Statistics of Smoothing Components
Component Smoothing Parameter DF GCV No. of Unique Obs.
spline(Age) 0.999996 3.000000 328.513619 66
spline(StartVert) 0.999551 3.000000 317.647039 16
spline(NumVert) 0.921758 3.000000 20.144078 10
 
Smoothing Model Analysis
Analysis of Deviance
Source DF Sum of Squares F Value Pr > F
spline(Age) 3.000000 10.494366 16.44 0.0009
spline(StartVert) 3.000000 5.494965 8.61 0.0135
spline(NumVert) 3.000000 2.184514 3.42 0.3311

The critical part of the GAM results is the Analysis of Deviance table, shown in Output 4.1.3. For each smoothing effect in the model, this table gives an F-test comparing the deviance between the full model and the model without this variable. In this case, the analysis of deviance results indicate that the effects of Age and StartVert are significant (p = 0.0009 and p = 0.0135), while the effect of NumVert is not (p = 0.3311). Plots of the partial predictions against each predictor can be used to investigate why PROC GAM and PROC GENMOD produce different results.

The OUTPUT statement requests that an output data set of predicted values be created. Since the estimate of the generalized additive model is the sum of functional estimates of individual predictors, plus a constant, the output data set contains a column of partial predictions for each predictor. If requested, a Bayesian confidence interval or a point-wise standard-error band, as defined in Hastie and Tibshirani (1990), can be produced in the output data set.
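
As a sketch of how the confidence limits might be requested, the following statements add the UCLM and LCLM keywords to the OUTPUT statement. This assumes that those keywords are available in your release. The data set name estband is illustrative. Rather than assuming how the added columns are named, the PROC CONTENTS step lists the variables that were actually created.

   /* Sketch: request partial predictions plus upper and lower   */
   /* confidence limits (assumes UCLM and LCLM are supported)    */
   proc gam data=kyphosis;
      model Kyphosis=spline(Age,df=3) spline(StartVert,df=3)
                       spline(NumVert,df=3) /dist = logist;
      output out=estband p uclm lclm;
   run;

   /* List the variables in the output data set to see the       */
   /* names given to the prediction and limit columns            */
   proc contents data=estband;
   run;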

Using the following statements, the data set estimate is plotted in Output 4.1.4:

   proc sort data=estimate(keep=StartVert P_StartVert)
             out =StartVert;
      by StartVert;
   proc sort data=estimate(keep=Age       P_Age      )
             out =Age;
      by Age;
   proc sort data=estimate(keep=NumVert   P_NumVert  )
             out =NumVert;
      by NumVert;
   data Plot; merge StartVert Age NumVert;
   proc standard m=0 s=1 data=Plot out=Plot;
      var StartVert Age NumVert;
   run;
   
   legend1 frame cframe=ligr cborder=black label=none 
           position=center;
   axis1 label=(angle=90 rotate=0 " ") minor=none 
         value=NONE major=NONE;
   axis2 minor=none label=(" ")  major=NONE value=NONE;  
   symbol1 color=red   interpol=join value=none line=1;
   symbol2 color=blue  interpol=join value=none line=2;
   symbol3 color=green interpol=join value=none line=3;
   proc gplot data=Plot;
      title;
      plot P_StartVert*StartVert=1
           P_Age      *Age      =2
           P_NumVert  *NumVert  =3 / overlay legend frame 
           cframe=ligr vaxis=axis1 haxis=axis2;
   run;

Output 4.1.4: Partial Prediction for Each Predictor
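[Plot: overlaid partial prediction curves for StartVert (red), Age (blue), and NumVert (green), each plotted against its standardized predictor]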

The plot shows that the partial predictions corresponding to both Age and StartVert have a strong quadratic pattern, while NumVert has a more complicated but weaker pattern. However, in the plot for NumVert, notice that about half the vertical range of the function is determined by the point at the upper extreme. It is a good idea, therefore, to rerun the analysis without this point to see how much it affects the conclusions. You can do this by adding a WHERE= data set option when specifying the data set for the GAM procedure, as in the following code:

   title 'Comparing PROC GAM with PROC GENMOD';
   proc gam data=kyphosis(where=(NumVert^=14));
      model Kyphosis=spline(Age,df=3) spline(StartVert,df=3) 
                       spline(NumVert, df=3) /dist = logist;
      output out=estimate p;
   run;
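
To confirm which observation the WHERE= condition excludes, a short check such as the following sketch can be run first. In the data lines above, only one patient has NumVert=14 (Age=12, StartVert=2, Kyphosis=1), and the next largest value of NumVert is 10.

   /* Sketch: show the observation that the re-analysis drops    */
   proc print data=kyphosis(where=(NumVert=14));
   run;

   /* Show the distribution of NumVert to see how isolated       */
   /* the extreme value is                                       */
   proc freq data=kyphosis;
      tables NumVert;
   run;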

The analysis of deviance table from this re-analysis is shown in Output 4.1.5, and Output 4.1.6 shows the re-computed partial predictor plots.

Output 4.1.5: Analysis After Removing NumVert=14
 
Comparing PROC GAM with PROC GENMOD

The GAM Procedure
Dependent Variable: Kyphosis
Smoothing Model Component: spline(Age) spline(StartVert) spline(NumVert)

Smoothing Model Analysis
Analysis of Deviance
Source DF Sum of Squares F Value Pr > F
spline(Age) 3.000000 10.587568 16.90 0.0007
spline(StartVert) 3.000000 5.477104 8.74 0.0126
spline(NumVert) 3.000000 3.209100 5.12 0.1630

Output 4.1.6: Partial Prediction After Removing NumVert=14
[Plot: overlaid partial prediction curves for StartVert, Age, and NumVert after removing the observation with NumVert=14]

After the observation with NumVert=14 is removed, the predictors Age and StartVert are still significant, and NumVert is still not significant. Based on the plot in Output 4.1.6, the removed point has almost no effect on the estimated curve shapes for Age and StartVert. The removal does, however, have a dramatic effect on the curve for NumVert, which now also appears quadratic, though much less pronounced than for the other two variables.

Having used the GAM procedure to discover an appropriate form of the dependence of Kyphosis on each of the three independent variables, you can use the GENMOD procedure to fit and assess the corresponding parametric model. The following code fits a GENMOD model with quadratic terms for all three variables, including tests for the joint linear and quadratic effects of each variable. The resulting contrast tests are shown in Output 4.1.7.

 
   title 'Comparing PROC GAM with PROC GENMOD';
   proc genmod data=kyphosis(where=(NumVert^=14));
      model kyphosis = Age       Age      *Age
                       StartVert StartVert*StartVert
                       NumVert   NumVert  *NumVert  
                       /link=logit  dist=binomial;
      contrast 'Age'       Age       1, Age*Age             1;
      contrast 'StartVert' StartVert 1, StartVert*StartVert 1;
      contrast 'NumVert'   NumVert   1, NumVert*NumVert     1;
   run;

Output 4.1.7: Joint Linear and Quadratic Tests
 
Comparing PROC GAM with PROC GENMOD

The GENMOD Procedure

Contrast Results
Contrast DF Chi-Square Pr > ChiSq Type
Age 2 13.63 0.0011 LR
StartVert 2 15.41 0.0005 LR
NumVert 2 3.56 0.1684 LR

The results for the quadratic GENMOD model are now quite consistent with the GAM results.
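
To inspect the fitted values of the quadratic model, predicted probabilities can be written to a data set. The following statements are a sketch only. They assume that the OUTPUT statement with the PRED= keyword and the DESCENDING option are available in PROC GENMOD in your release. The names quadfit and PhatQuad are illustrative.

   /* Sketch: save predicted probabilities of Kyphosis=1 from    */
   /* the quadratic model (DESCENDING is assumed to make         */
   /* Kyphosis=1 the modeled event)                              */
   proc genmod data=kyphosis(where=(NumVert^=14)) descending;
      model Kyphosis = Age       Age      *Age
                       StartVert StartVert*StartVert
                       NumVert   NumVert  *NumVert
                       /link=logit  dist=binomial;
      output out=quadfit pred=PhatQuad;
   run;

   /* Look at the first few fitted probabilities                 */
   proc print data=quadfit(obs=5);
      var Age StartVert NumVert Kyphosis PhatQuad;
   run;

These fitted probabilities can then be compared with the predicted values that PROC GAM writes to the data set estimate.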

From this example, you can see that PROC GAM is very useful for visualizing the data and for detecting nonlinear relationships between the response and the predictors.


Copyright © 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.