This example shows how you can use PROC ADAPTIVEREG to fit a classification model for a data set with a binary response. It illustrates how you can use the PARTITION statement to create subsets of data for training and testing purposes. It also demonstrates how to use the OUTPUT statement. Finally, it shows how you can improve the modeling speed by changing some default settings.
This example concerns a study on classifying whether an email is junk email (coded as 1) or not (coded as 0). The data were collected in Hewlett-Packard labs and donated by George Forman. The data set contains 4,601 observations with 58 variables. The response variable is a binary indicator of whether an email is considered spam or not. The other 57 variables are continuous variables that record frequencies of some common words and characters in emails and lengths of uninterrupted sequences of capital letters. The data set is publicly available at the UCI Machine Learning Repository (Asuncion and Newman, 2007).
The following DATA step downloads the data from the UCI Machine Learning Repository and creates a SAS data set called spambase:
%let base = http://archive.ics.uci.edu/ml/machine-learning-databases;
data spambase;
   infile "&base/spambase/spambase.data" device=url dsd dlm=',';
   input Make Address All _3d Our Over Remove Internet Order Mail Receive
         Will People Report Addresses Free Business Email You Credit Your
         Font _000 Money Hp Hpl George _650 Lab Labs Telnet _857 Data _415
         _85 Technology _1999 Parts Pm Direct Cs Meeting Original Project
         Re Edu Table Conference Semicol Paren Bracket Bang Dollar Pound
         Cap_Avg Cap_Long Cap_Total Class;
run;
This example shows how you can use PROC ADAPTIVEREG to build a model with good predictive power and then use it to classify observations in independent data sets. PROC ADAPTIVEREG enables you to partition your data into subsets for training, validation, and testing. The training set is used to build models, the validation set is used to estimate prediction errors and select models, and the testing set is used independently to evaluate the final model. When the sample size is not large enough, sample-reusing approaches such as the bootstrap and cross validation are used instead. For this data set, the sample size is sufficient to support a random partitioning. Because the GCV model selection criterion itself serves as an estimate of prediction error, a separate validation set is not needed, and this data set is split into only two subsets. The training set is used to build the classification model, and the test set is used to evaluate the model. The PARTITION statement performs the random partitioning for you, as shown in the following statements:
proc adaptivereg data=spambase seed=10359;
   class Class;
   model class = _000 _85 _415 _650 _857 _1999 _3d address addresses all
                 bang bracket business cap_avg cap_long cap_total
                 conference credit cs data direct dollar edu email font
                 free george hp hpl internet lab labs mail make meeting
                 money order original our over paren parts people pm pound
                 project re receive remove report semicol table technology
                 telnet will you your / additive dist=binomial;
   partition fraction(test=0.333);
   output out=spamout p(ilink);
run;
The FRACTION option in the PARTITION statement specifies that 33.3% of observations in the spambase data set are randomly selected to form the testing set while the rest of the data form the training set. If you want to use the same partitioning for further analysis, you can specify the seed for the random number generator so that the exact same random number stream can be duplicated. For the preceding statements, the seed is 10359, which is specified in the PROC ADAPTIVEREG statement. The response variable is a two-level variable, which is specified in the CLASS statement. The ADDITIVE option specifies that this is an additive model without interactions between spline basis functions; this option makes the predictive model more interpretable. The DIST=BINOMIAL option specifies the distribution of the response variable. The ILINK option in the OUTPUT statement requests predicted probabilities for each observation.
The “Model Information” table in Output 24.3.1 includes the distribution, link function, and the random number seed.
Output 24.3.1: Model Information
Model Information  

Data Set  WORK.SPAMBASE 
Response Variable  Class 
Distribution  Binomial 
Link Function  Logit 
Random Number Seed  10359 
The “Number of Observations” table in Output 24.3.2 lists the total number of observations used. It also lists the number of observations in the training set and in the test set.
Output 24.3.2: Number of Observations
Number of Observations Read  4601 

Number of Observations Used  4601 
Number of Observations Used for Training  3028 
Number of Observations Used for Testing  1573 
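As a cross-check on the partition, the following sketch tabulates the _ROLE_ variable in the OUT= data set spamout that the preceding statements create. It is not part of the original analysis. It assumes that _ROLE_ identifies the partition to which each observation belongs (the value 'TEST' is the one this example uses later to select test observations); the second table request, which crosses _ROLE_ with Class, is an illustrative addition.

```sas
/* Sketch: count observations by partition role in the OUT= data set. */
proc freq data=spamout;
   tables _ROLE_;                /* sizes of the training and test sets       */
   tables _ROLE_*Class / nocol;  /* class balance within each role (optional) */
run;
```

The counts for each role should agree with the training and testing counts in the “Number of Observations” table.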
The response variable is a binary classification variable. PROC ADAPTIVEREG produces the “Response Profile” table in Output 24.3.3. The table shows the response level frequencies (which here total all 4,601 observations used) and the probability that PROC ADAPTIVEREG models.
Output 24.3.3: Response Profile
Response Profile  

Ordered Value  Class  Total Frequency 
1  0  2788 
2  1  1813 
The “Fit Statistics” table in Output 24.3.4 shows that the final model for the training set has a large number of effective degrees of freedom.
Output 24.3.4: Fit Statistics
Fit Statistics  

GCV  0.23427 
GCV R-Square  0.82508 
Effective Degrees of Freedom  173 
Log Likelihood  -315.30998 
Deviance (Train)  630.61996 
Deviance (Test)  806.74112 
To classify emails from the test set, the following rule is used. For each observation, the email is classified as ham (a good email) if the predicted probability of Class='0' is greater than the predicted probability of Class='1', and as spam otherwise. Because the response is binary, this is equivalent to classifying an email as spam if the predicted probability of Class='0' is less than 0.5. The following statements evaluate classification errors:
data test;
   set spamout(where=(_ROLE_='TEST'));
   /* Pred is the predicted probability of Class='0' */
   if ((pred>0.5 & class=0) | (pred<0.5 & class=1)) then Error=0;
   else Error=1;
run;

proc freq data=test;
   tables class*error / nocol;
run;
Output 24.3.5 shows the misclassification errors for all observations and observations of each response category. Compared to the results from other statistical learning algorithms that use different training subsets (Hastie, Tibshirani, and Friedman, 2001), these results from PROC ADAPTIVEREG are competitive.
Output 24.3.5: Crosstabulation Table for Test Set Prediction
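Because Error is coded as 0 or 1, its mean over the test set equals the overall misclassification rate. The following sketch, which is not part of the original example, summarizes the test data set created by the preceding DATA step in that way, first overall and then within each response category. It assumes only that the data set test and the variables Error and Class exist as defined above.

```sas
/* Sketch: Error is a 0/1 indicator, so its mean is the misclassification rate. */
proc means data=test n mean maxdec=4;
   var Error;
run;

/* Misclassification rate within each response category */
proc means data=test n mean maxdec=4;
   class Class;
   var Error;
run;
```

The row percentages that PROC FREQ reports in the crosstabulation carry the same per-category information; the PROC MEANS form is simply more compact when you compare several models.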


It takes approximately 3GB of memory and about five and a half minutes to fit the model on a workstation with a 12-way 2.6GHz AMD Opteron processor. The following analyses illustrate how you can change some default settings to improve the modeling speed without sacrificing much predictive capability. As discussed in the section Computational Resources, the computation cost for PROC ADAPTIVEREG grows with the maximum number of basis functions. Therefore, for the same data set, you can significantly increase the modeling speed by reducing the maximum number of basis functions that are allowed for the forward selection.
For this model, PROC ADAPTIVEREG uses 115 as the default maximum number of basis functions. Suppose you want to set the maximum number to 61, which is approximately half the default value. The following program fits a multivariate adaptive regression splines model with MAXBASIS=61. The same random number seed is used to get the exact same data partitioning.
proc adaptivereg data=spambase seed=10359;
   class Class;
   model class = _000 _85 _415 _650 _857 _1999 _3d address addresses all
                 bang bracket business cap_avg cap_long cap_total
                 conference credit cs data direct dollar edu email font
                 free george hp hpl internet lab labs mail make meeting
                 money order original our over paren parts people pm pound
                 project re receive remove report semicol table technology
                 telnet will you your / maxbasis=61 additive dist=binomial;
   partition fraction(test=0.333);
   output out=spamout2 p(ilink);
run;
The “Fit Statistics” table in Output 24.3.6 displays summary statistics for the second model. The log likelihood of the second model is smaller than that of the first model, which is expected because the effective degrees of freedom is 95, much smaller than the 173 effective degrees of freedom of the first model. This means that the fitted model is much simpler than the first model. Both the GCV and GCV R-square values show that the estimated prediction capability of the second model is slightly less than that of the first model.
Output 24.3.6: Fit Statistics
Fit Statistics  

GCV  0.27971 
GCV R-Square  0.79115 
Effective Degrees of Freedom  95 
Log Likelihood  -397.32916 
Deviance (Train)  794.65833 
Deviance (Test)  682.79427 
When it is used to predict observations in the test set, the second model has an overall misclassification error of 5.28%, which is slightly lower than that of the first model. This shows that the predictive power of the second model is actually greater than that of the first model because of its reduced model complexity. The computation takes around 80 seconds on the same workstation and consumes approximately 170MB of memory. This is a significant improvement in both computation speed and memory cost.
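The statements that produce the test-set error of the second model are not shown in this example. A minimal sketch, assuming that the same classification rule is applied to the spamout2 data set, follows; the data set name test2 is hypothetical and is introduced here only for illustration.

```sas
/* Sketch: apply the 0.5 cutoff rule to the second model's predictions. */
data test2;                               /* hypothetical data set name */
   set spamout2(where=(_ROLE_='TEST'));
   /* Pred is the predicted probability of Class='0' */
   if ((pred>0.5 & class=0) | (pred<0.5 & class=1)) then Error=0;
   else Error=1;
run;

proc freq data=test2;
   tables class*error / nocol;
run;
```

Because the same seed is used in both PROC ADAPTIVEREG steps, the two models are evaluated on the same test observations, so their error rates are directly comparable.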
You can further improve the modeling speed by using the FAST option in the MODEL statement. The FAST option avoids evaluating certain combinations of parent basis functions and variables. For example, you can specify the FAST(K=20) option so that in each forward selection iteration, PROC ADAPTIVEREG uses only the top 20 parent basis functions (based on their maximum improvement from the previous iteration) to construct and evaluate new basis functions. The underlying assumption, as discussed in the section Fast Algorithm, is that parent basis functions that offer low improvement at previous steps are less likely to yield new basis functions that offer large improvement at the current step. The following statements illustrate the FAST option:
proc adaptivereg data=spambase seed=10359;
   class Class;
   model class = _000 _85 _415 _650 _857 _1999 _3d address addresses all
                 bang bracket business cap_avg cap_long cap_total
                 conference credit cs data direct dollar edu email font
                 free george hp hpl internet lab labs mail make meeting
                 money order original our over paren parts people pm pound
                 project re receive remove report semicol table technology
                 telnet will you your / maxbasis=61 fast(k=20) additive
                 dist=binomial;
   partition fraction(test=0.333);
   output out=spamout3 p(ilink);
run;
The fitted model is the same as the second model. The computation time is reduced further to 70 seconds. You should tune the parameters for the FAST option with care because the underlying assumption does not always hold.
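One way to check whether two fits agree is to compare their predicted probabilities observation by observation. The following sketch is an illustration rather than part of the original example. It assumes that both OUT= data sets keep the observations in the same order as spambase and store the predictions in the variable Pred; the tolerance of 1E-8 is an arbitrary choice.

```sas
/* Sketch: compare predicted probabilities from the second and third fits. */
proc compare base=spamout2 compare=spamout3
             method=absolute criterion=1e-8;
   var Pred;
run;
```

If PROC COMPARE reports no unequal values for Pred, the two fits produce the same predictions for every observation, which is what you would expect when the selected models coincide.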
With this investigation, the second model can serve as a good classifier. It contains 26 variables. The “Variable Importance” table (Output 24.3.7) lists all variables and their importance values in descending order. Two variables in the model, George and Hp, are important factors in classifying emails as not spam. George Forman, the donor of the original data set, collected emails from filed work and personal emails at Hewlett-Packard labs. Thus these two variables are strong indicators of emails that are not spam. This confirms the results from the multivariate adaptive regression splines model fitted by PROC ADAPTIVEREG.
Output 24.3.7: Variable Importance
Variable Importance  

Variable  Number of Bases  Importance 
George  1  100.00 
Hp  1  78.35 
Edu  3  61.25 
Remove  2  49.21 
Bang  3  44.14 
Free  2  34.18 
Meeting  3  32.57 
_1999  2  29.71 
Dollar  2  28.30 
Money  3  26.39 
Cap_Long  3  24.41 
Our  2  19.46 
Semicol  2  14.98 
Re  2  13.52 
Business  3  13.48 
Over  3  12.63 
Cap_Total  3  12.50 
Will  1  10.81 
Pound  2  9.73 
Internet  1  5.88 
_000  1  4.57 
You  2  3.17 