The ADAPTIVEREG Procedure (Experimental)

Example 24.3 Predicting E-Mail Spam

This example shows how you can use PROC ADAPTIVEREG to fit a classification model for a data set with a binary response. It illustrates how you can use the PARTITION statement to create subsets of data for training and testing purposes. It also demonstrates how to use the OUTPUT statement. Finally, it shows how you can improve the modeling speed by changing some default settings.

This example concerns a study of classifying whether an e-mail is junk e-mail (coded as 1) or not (coded as 0). The data were collected in Hewlett-Packard labs and donated by George Forman. The data set contains 4,601 observations with 58 variables. The response variable is a binary indicator of whether an e-mail is considered spam. The other 57 variables are continuous variables that record the frequencies of some common words and characters in e-mails and the lengths of uninterrupted sequences of capital letters. The data set is publicly available at the UCI Machine Learning repository (Asuncion and Newman, 2007).

The following DATA step downloads the data from the UCI Machine Learning repository and creates a SAS data set called spambase:

%let base = http://archive.ics.uci.edu/ml/machine-learning-databases;
data spambase;
   infile "&base/spambase/spambase.data" device=url dsd dlm=',';
   input Make Address All _3d Our Over Remove Internet Order Mail Receive 
       Will People Report Addresses Free Business Email You Credit Your Font 
       _000 Money Hp Hpl George _650 Lab Labs Telnet _857 Data _415 _85 
       Technology _1999 Parts Pm Direct Cs Meeting Original Project Re Edu 
       Table Conference Semicol Paren Bracket Bang Dollar Pound Cap_Avg 
       Cap_Long Cap_Total Class; 
run;
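
If the download succeeds, the spambase data set should contain 4,601 observations and 58 variables. The following statements (a quick check that is not part of the original example) verify this:

/* Verify the download: expect 4,601 observations and 58 variables */
proc contents data=spambase;
run;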

This example shows how you can use PROC ADAPTIVEREG to build a model with good predictive power and then use it to classify observations in independent data sets. PROC ADAPTIVEREG enables you to partition your data into subsets for training, validation, and testing. The training set is used to build models, the validation set is used to estimate prediction errors and select models, and the test set is used to independently evaluate the final model. When the sample size is not large enough, sample reuse approaches such as the bootstrap and cross validation are used instead. For this data set, the sample size is sufficient to support a random partitioning. Because the GCV model selection criterion itself serves as an estimate of prediction error, the data set is split into only two subsets: the training set is used to build the classification model, and the test set is used to evaluate it. The PARTITION statement performs the random partitioning for you, as shown in the following statements:

proc adaptivereg data=spambase seed=10359;
   class Class;
   model class = _000         _85         _415         _650         _857
                 _1999        _3d          address      addresses    all
                 bang         bracket      business     cap_avg      cap_long
                 cap_total    conference   credit       cs           data
                 direct       dollar       edu          email        font
                 free         george       hp           hpl          internet      
                 lab          labs         mail         make         meeting
                 money        order        original     our          over
                 paren        parts        people       pm           pound
                 project      re           receive      remove       report
                 semicol      table        technology   telnet       will
                 you          your  / additive dist=binomial;
   partition fraction(test=0.333);
   output out=spamout p(ilink);
run;

The FRACTION option in the PARTITION statement specifies that 33.3% of the observations in the spambase data set are randomly selected to form the test set, while the rest of the data form the training set. If you want to use the same partitioning in further analyses, you can specify the seed for the random number generator so that the same random number stream, and hence the same partition, can be reproduced. For the preceding statements, the seed is 10359, which is specified in the PROC ADAPTIVEREG statement. The response variable is a two-level classification variable, which is specified in the CLASS statement. The ADDITIVE option specifies an additive model without interactions between spline basis functions; this option makes the predictive model more interpretable. The DIST=BINOMIAL option specifies the distribution of the response variable. The ILINK option in the OUTPUT statement requests predicted probabilities (rather than linear predictor values) for each observation.
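
For example, the following statements (not part of the original analysis) print the first few scored observations; Pred holds the predicted probability of Class='0', and _ROLE_ records whether an observation belongs to the training set or the test set:

/* Display a few scored observations from the OUTPUT data set */
proc print data=spamout(obs=5);
   var Class Pred _ROLE_;
run;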

The Model Information table in Output 24.3.1 includes the distribution, link function, and the random number seed.

Output 24.3.1: Model Information

The ADAPTIVEREG Procedure

Model Information

Data Set              WORK.SPAMBASE
Response Variable     Class
Distribution          Binomial
Link Function         Logit
Random Number Seed    10359


The Number of Observations table in Output 24.3.2 lists the total number of observations read and used. It also lists the number of observations in the training set and the test set.

Output 24.3.2: Number of Observations

Number of Observations Read                 4601
Number of Observations Used                 4601
Number of Observations Used for Training    3028
Number of Observations Used for Testing     1573


The response variable is a binary classification variable, so PROC ADAPTIVEREG produces the Response Profile table in Output 24.3.3. The table displays the frequency of each response level and indicates which probability PROC ADAPTIVEREG models.

Output 24.3.3: Response Profile

Response Profile

Ordered Value   Class   Total Frequency
            1       0              2788
            2       1              1813

Probability modeled is Class='0'.
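
Because the response is binary, the predicted probability of spam is simply the complement of the modeled probability. The following DATA step shows this relationship; the variable name ProbSpam is introduced here only for illustration:

data spamout;
   set spamout;
   /* ProbSpam (illustrative name): predicted probability of
      Class='1' (spam), the complement of the modeled probability */
   ProbSpam = 1 - Pred;
run;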



The Fit Statistics table in Output 24.3.4 shows that the final model for the training set has a large effective degrees of freedom.

Output 24.3.4: Fit Statistics

Fit Statistics

GCV                             0.23427
GCV R-Square                    0.82508
Effective Degrees of Freedom    173
Log Likelihood                  -315.30998
Deviance (Train)                630.61996
Deviance (Test)                 806.74112


To classify e-mails from the test set, the following rule is used. For each observation, the e-mail is classified as spam if the predicted probability of Class = '1' is greater than the predicted probability of Class = '0', and as ham (a good e-mail) otherwise. Because the response is binary, this is equivalent to classifying an e-mail as spam if the predicted probability of Class = '0' is less than 0.5. The following statements evaluate classification errors:

data test;
   set spamout(where=(_ROLE_='TEST'));   /* keep only test-set observations */
   /* Pred is the predicted probability of Class='0' (ham) */
   if ((pred>0.5 & class=0) | (pred<0.5 & class=1))
   then Error=0;                         /* correct classification */
   else Error=1;                         /* misclassification      */
run;
proc freq data=test;
   tables class*error/nocol;
run;

Output 24.3.5 shows the misclassification errors for all observations and for observations in each response category; the overall misclassification rate on the test set is 6.10%. Compared to the results from other statistical learning algorithms that use different training subsets (Hastie, Tibshirani, and Friedman, 2001), these results from PROC ADAPTIVEREG are competitive.

Output 24.3.5: Crosstabulation Table for Test Set Prediction

The FREQ Procedure

Frequency / Percent / Row Pct

Table of Class by Error

                     Error
Class            0          1      Total
0              885         59        944
             56.26       3.75      60.01
             93.75       6.25
1              592         37        629
             37.64       2.35      39.99
             94.12       5.88
Total         1477         96       1573
             93.90       6.10     100.00


It takes approximately 3GB of memory and about five and a half minutes to fit the model on a workstation with a 12-way 2.6GHz AMD Opteron processor. The following analyses illustrate how you can change some default settings to improve the modeling speed without sacrificing much predictive capability. As discussed in the section Computational Resources, the computation cost for PROC ADAPTIVEREG is proportional to $pNM_{\max }^3$. For the same data set, you can significantly increase the modeling speed by reducing the maximum number of basis functions that are allowed for the forward selection.
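
As a back-of-the-envelope estimate (assuming that $p$ and $N$ stay fixed and that the cubic term dominates), cutting the basis budget from the default of 115 to 61 should reduce the cost of the forward selection by a factor of roughly

$\left(\frac{115}{61}\right)^3 \approx 6.7$

which is consistent in magnitude with the speedup observed below.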

For this model, PROC ADAPTIVEREG uses 115 as the default maximum number of basis functions. Suppose you want to set the maximum to 61, which is approximately half the default value. The following statements fit a multivariate adaptive regression splines model with the MAXBASIS= option set to 61. The same random number seed is used to obtain the exact same data partitioning.

proc adaptivereg data=spambase seed=10359;
   class Class;
   model class = _000         _85         _415         _650         _857
                 _1999        _3d          address      addresses    all
                 bang         bracket      business     cap_avg      cap_long
                 cap_total    conference   credit       cs           data
                 direct       dollar       edu          email        font
                 free         george       hp           hpl          internet      
                 lab          labs         mail         make         meeting
                 money        order        original     our          over
                 paren        parts        people       pm           pound
                 project      re           receive      remove       report
                 semicol      table        technology   telnet       will
                 you          your  / maxbasis=61 additive dist=binomial;
   partition fraction(test=0.333);
   output out=spamout2 p(ilink);
run;

The Fit Statistics table in Output 24.3.6 displays summary statistics for the second model. The log likelihood of the second model is smaller than that of the first model. This is expected because the effective degrees of freedom is 95, much smaller than the 173 of the first model; in other words, the fitted model is much simpler. Both the GCV and GCV R-square values show that the estimated prediction capability of the second model is slightly less than that of the first model.

Output 24.3.6: Fit Statistics

The ADAPTIVEREG Procedure

Fit Statistics

GCV                             0.27971
GCV R-Square                    0.79115
Effective Degrees of Freedom    95
Log Likelihood                  -397.32916
Deviance (Train)                794.65833
Deviance (Test)                 682.79427


When the observations in the test set are scored, the second model has an overall misclassification error of 5.28%, which is slightly lower than that of the first model. This suggests that the reduced model complexity actually improves the predictive power of the second model. The computation takes around 80 seconds on the same workstation and consumes approximately 170MB of memory, a significant improvement in both computation time and memory cost.
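
These error rates can be computed with the same scoring step that was used for the first model, applied to the spamout2 data set:

data test2;
   set spamout2(where=(_ROLE_='TEST'));
   /* Pred is the predicted probability of Class='0' (ham) */
   if ((pred>0.5 & class=0) | (pred<0.5 & class=1))
   then Error=0;
   else Error=1;
run;
proc freq data=test2;
   tables class*error/nocol;
run;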

You can further improve the modeling speed by using the FAST option in the MODEL statement. The FAST option avoids evaluating certain combinations of parent basis functions and variables. For example, you can specify the FAST(K=20) option so that in each forward selection iteration, PROC ADAPTIVEREG uses only the top 20 parent basis functions (based on their maximum improvement from the previous iteration) to construct and evaluate new basis functions. The underlying assumption, as discussed in the section Fast Algorithm, is that parent basis functions that offer low improvement at previous steps are less likely to yield new basis functions that offer large improvement at the current step. The following statements illustrate the FAST option:

proc adaptivereg data=spambase seed=10359;
   class Class;
   model class = _000         _85         _415         _650         _857
                 _1999        _3d          address      addresses    all
                 bang         bracket      business     cap_avg      cap_long
                 cap_total    conference   credit       cs           data
                 direct       dollar       edu          email        font
                 free         george       hp           hpl          internet      
                 lab          labs         mail         make         meeting
                 money        order        original     our          over
                 paren        parts        people       pm           pound
                 project      re           receive      remove       report
                 semicol      table        technology   telnet       will
                 you          your  / maxbasis=61 fast(k=20) additive dist=binomial;
   partition fraction(test=0.333);
   output out=spamout3 p(ilink);
run;

The fitted model is the same as the second model. The computation time is further reduced to about 70 seconds. You should tune the parameters of the FAST option with care, because the underlying assumption does not always hold.
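
You can verify that the two runs select the same model by comparing their predicted probabilities; the following statements (a quick check that is not part of the original example) report any differences between the two output data sets:

/* Identical models produce identical predicted probabilities */
proc compare base=spamout2 compare=spamout3 criterion=1e-8;
   var Pred;
run;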

Based on this investigation, the second model can serve as a good classifier. It contains 26 variables. The Variable Importance table (Output 24.3.7) lists the selected variables and their importance values in descending order. Two variables in the model, George and Hp, are important factors in classifying e-mails as not spam. George Forman, the donor of the original data set, collected e-mails from work and personal e-mails at Hewlett-Packard labs, so these two variables are strong indicators of e-mails that are not spam. This agrees with the results of the multivariate adaptive regression splines model fitted by PROC ADAPTIVEREG.

Output 24.3.7: Variable Importance

Variable Importance

Variable      Number of Bases   Importance
George                      1       100.00
Hp                          1        78.35
Edu                         3        61.25
Remove                      2        49.21
Bang                        3        44.14
Free                        2        34.18
Meeting                     3        32.57
_1999                       2        29.71
Dollar                      2        28.30
Money                       3        26.39
Cap_Long                    3        24.41
Our                         2        19.46
Semicol                     2        14.98
Re                          2        13.52
Business                    3        13.48
Over                        3        12.63
Cap_Total                   3        12.50
Will                        1        10.81
Pound                       2         9.73
Internet                    1         5.88
_000                        1         4.57
You                         2         3.17