Example 4.2: Comparing PROC GAM with PROC TPSPLINE
This example compares the GAM procedure with the TPSPLINE procedure,
another nonparametric procedure that fits a smooth surface to
multivariate data. PROC TPSPLINE does not assume that the model is additive, and it uses
very general basis functions for model fitting, which makes it much
slower than the GAM procedure. For more details about the TPSPLINE procedure,
refer to "The TPSPLINE Procedure" in SAS/STAT User's
Guide, Version 8.
The data set used here is also analyzed in "The TPSPLINE
Procedure" in SAS/STAT User's Guide, Version 8. It
presents age-adjusted melanoma incidences for 37 years from the
Connecticut Tumor Registry (Houghton, Flannery, and Viola 1980):
title 'Comparing PROC GAM with PROC TPSPLINE';
data melanoma;
   input year incidences @@;
   datalines;
1936 0.9 1937 0.8 1938 0.8 1939 1.3
1940 1.4 1941 1.2 1942 1.7 1943 1.8
1944 1.6 1945 1.5 1946 1.5 1947 2.0
1948 2.5 1949 2.7 1950 2.9 1951 2.5
1952 3.1 1953 2.4 1954 2.2 1955 2.9
1956 2.5 1957 2.6 1958 3.2 1959 3.8
1960 4.2 1961 3.9 1962 3.7 1963 3.3
1964 3.7 1965 3.9 1966 4.1 1967 3.8
1968 4.7 1969 4.4 1970 4.8 1971 4.8
1972 4.8
;
run;
The variable incidences records the number of melanoma cases per
100,000 people for the years 1936 to 1972.
Four to five degrees of freedom for each nonparametric term in a
generalized additive model fit most data well. However, to select the DF
more objectively, you can specify METHOD=GCV in the MODEL statement to
minimize the generalized cross validation function, as shown in the
following PROC GAM code:
proc gam data=melanoma;
   model incidences = spline(year) / method=GCV;
   output out=gam p;
run;
The results are listed in Output 4.2.1 and Output 4.2.2.
Output 4.2.1: Summary Statistics
| Comparing PROC GAM with PROC TPSPLINE |
| The GAM Procedure |
| Dependent Variable: incidences |
| Smoothing Model Component: spline(year) |
| Iteration Summary and Fit Statistics |
| Final number of backfitting iterations | 2 |
| Final backfitting criterion | 0 |
| Final residual sum of squares | 1.2242517494 |
| Summary of Input Data Set |
| Number of Observations | 37 |
| Number of Missing Observations | 0 |
| Distribution | Gaussian |
| Link Function | Identity |
Output 4.2.2: Analysis of Model
| Comparing PROC GAM with PROC TPSPLINE |
| The GAM Procedure |
| Dependent Variable: incidences |
| Smoothing Model Component: spline(year) |
| Regression Model Analysis: Parameter Estimates |
| Parameter | Parameter Estimate | Standard Error | t Value | Pr > |t| |
| Intercept | -212.69706 | 7.00491 | -30.36 | <.0001 |
| L_year | 0.11029 | 0.00358 | 30.77 | <.0001 |
| Smoothing Model Analysis: Fit Statistics of Smoothing Components |
| Component | Smoothing Parameter | DF | GCV | No. of Unique Obs. |
| spline(year) | 0.634903 | 13.414936 | 0.088803 | 37 |
| Smoothing Model Analysis: Analysis of Deviance |
| Source | DF | Sum of Squares | F Value | Pr > F |
| spline(year) | 13.414936 | 2.736763 | 50.49 | <.0001 |
Based on the summary of the model, the spline term in the final model has
DF = 13.414936, and the nonparametric trend is highly
significant (F = 50.49, p < .0001). Note that this DF is much greater than the default value
of 4, indicating that there is a great deal of structure in the yearly
incidence rates of melanoma. A prediction plot should reveal the
nature of this structure:
legend1 frame cframe=ligr cborder=black label=none
position=center;
axis1 label=(angle=90 rotate=0);
axis2 minor=none;
symbol1 color=red interpol=join value=none line=1;
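/* Sort by year so that INTERPOL=JOIN connects the points in time order */
proc sort data=gam; by year;
run;
proc gplot data=gam;
   title;   /* clear the title set earlier */
   plot p_incidences*year = 1 /overlay legend
        frame cframe=ligr vaxis=axis1 haxis=axis2;
run;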
Output 4.2.3 shows the predicted melanoma rate over time. Two features
stand out on this plot:
- Melanoma incidence on the whole increases over the period of
the study.
- A strong periodic effect is evident, with a period of a little
more than a decade. This has been attributed to the 11-year
sunspot cycle: the more sunspots there are, the more melanoma
cases there are likely to be.
Output 4.2.3: Predicted Melanoma Incidence Rates By Year
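To judge how much of the periodic structure depends on the large DF that GCV selects, you could refit the model with the spline DF fixed at the default value of 4 and overlay the two prediction curves. The following is only a sketch under stated assumptions: the data set names gam4 and compare and the variable names p_gcv and p_df4 are hypothetical, and the gam data set is assumed to be the one created by the OUTPUT statement above.

```sas
/* Hypothetical comparison fit: spline DF fixed at 4 instead of chosen by GCV */
proc gam data=melanoma;
   model incidences = spline(year, df=4);
   output out=gam4 p;
run;

/* Both prediction data sets must be in year order for the BY merge */
proc sort data=gam;  by year; run;
proc sort data=gam4; by year; run;

/* Rename the predictions so that both fits can share one data set */
data compare;
   merge gam (keep=year p_incidences rename=(p_incidences=p_gcv))
         gam4(keep=year p_incidences rename=(p_incidences=p_df4));
   by year;
run;

symbol1 color=red  interpol=join value=none line=1;
symbol2 color=blue interpol=join value=none line=2;
proc gplot data=compare;
   plot p_gcv*year=1 p_df4*year=2 / overlay legend;
run;
```

With only 4 degrees of freedom the smoother has far less flexibility, so you would expect the DF=4 curve to follow the overall upward trend while flattening much of the roughly decade-long oscillation.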
Since PROC TPSPLINE also fits a nonparametric model, the PROC TPSPLINE and
PROC GAM fits should be very similar for this univariate case. The following
code produces the TPSPLINE analysis shown in Output 4.2.4:
title 'Comparing PROC GAM with PROC TPSPLINE';
proc tpspline data=melanoma;
   model incidences = (year);
   output out=tpspline p;
run;
Output 4.2.4: Analysis from PROC TPSPLINE
| Comparing PROC GAM with PROC TPSPLINE |
| The TPSPLINE Procedure |
| Dependent Variable: incidences |
| Summary of Input Data Set |
| Number of Non-Missing Observations | 37 |
| Number of Missing Observations | 0 |
| Unique Smoothing Design Points | 37 |
| Summary of Final Model |
| Number of Regression Variables | 0 |
| Number of Smoothing Variables | 1 |
| Order of Derivative in the Penalty | 2 |
| Dimension of Polynomial Space | 2 |
| Summary Statistics of Final Estimation |
| log10(n*Lambda) | -0.0607 |
| Smoothing Penalty | 0.5171 |
| Residual SS | 1.2243 |
| Tr(I-A) | 22.5852 |
| Model DF | 14.4148 |
| Standard Deviation | 0.2328 |
The TPSPLINE model analysis shows that the
DF for the model is 14.4148, about one more than the 13.414936
reported by PROC GAM. The two values are consistent because the DF that
PROC GAM reports for spline(year) excludes the one degree of freedom of
the linear L_year term.
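The relationship between the reported quantities can be checked by hand. The following DATA step is a minimal sketch that uses only values copied from the outputs above (n = 37, the PROC GAM residual sum of squares, and Tr(I-A) from PROC TPSPLINE). It assumes the usual form of the GCV criterion, GCV = n*RSS / (n - tr(A))**2, where A is the smoother matrix. The variable names are hypothetical.

```sas
/* Arithmetic check of the DF and GCV values reported above */
data _null_;
   n            = 37;             /* number of observations               */
   rss          = 1.2242517494;   /* residual sum of squares (PROC GAM)   */
   tr_i_minus_a = 22.5852;        /* Tr(I-A) from PROC TPSPLINE           */

   model_df = n - tr_i_minus_a;   /* TPSPLINE Model DF = tr(A)            */
   gam_df   = model_df - 1;       /* remove the linear L_year term        */
   gcv      = n * rss / tr_i_minus_a**2;   /* assumed GCV formula         */

   put model_df= gam_df= gcv=;    /* values are written to the SAS log    */
run;
```

The values written to the log should agree, up to rounding of the printed inputs, with the Model DF of 14.4148, the PROC GAM DF of 13.414936, and the GCV of 0.088803 shown in the outputs.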
The OUTPUT statements in the PROC GAM and PROC TPSPLINE code create
gam and tpspline data sets containing the predicted values
for the respective procedures. You can use the following code to look
at the values for the two procedures side by side:
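/* One-to-one merge (no BY statement): relies on both data sets */
/* being in the same year order                                 */
data both;
   merge gam     (rename=(p_incidences=gam))
         tpspline(rename=(p_incidences=tpspline));
run;
proc print data=both;
   var year gam tpspline;
run;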
The results, the first ten of which are displayed in Output 4.2.5,
show that PROC GAM and PROC TPSPLINE give essentially the same predictions for
this problem.
Output 4.2.5: Melanoma Predictions for First Ten Years
| Comparing PROC GAM with PROC TPSPLINE |
| Obs | year | gam | tpspline |
| 1 | 1936 | 0.82425 | 0.82424 |
| 2 | 1937 | 0.85580 | 0.85580 |
| 3 | 1938 | 0.96379 | 0.96379 |
| 4 | 1939 | 1.15046 | 1.15046 |
| 5 | 1940 | 1.31044 | 1.31044 |
| 6 | 1941 | 1.43881 | 1.43881 |
| 7 | 1942 | 1.58218 | 1.58218 |
| 8 | 1943 | 1.64382 | 1.64382 |
| 9 | 1944 | 1.60148 | 1.60148 |
| 10 | 1945 | 1.57498 | 1.57499 |
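Rather than comparing the two columns by eye, you can summarize how closely the predictions agree over all 37 years. The following sketch assumes the both data set created above. The data set name diff and the variable name absdiff are hypothetical.

```sas
/* Absolute difference between the two sets of predictions */
data diff;
   set both;
   absdiff = abs(gam - tpspline);
run;

/* Largest and average discrepancy across all years */
proc means data=diff n max mean;
   var absdiff;
run;
```

For the ten rows displayed in Output 4.2.5, the largest difference is 0.00001, so a maximum of about that size over the full data set would confirm that the two procedures give essentially the same fit.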
Copyright © 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.