Junk E-Mail Data

The Sashelp.JunkMail data set comes from a study that classifies whether an e-mail is junk e-mail (coded as 1) or not (coded as 0). The data were collected in Hewlett-Packard labs and donated by George Forman. The data set contains 4,601 observations and 59 variables. The response variable, Class, is a binary indicator of whether an e-mail is considered junk. The variable Test indicates whether an observation belongs to the training set (0) or the test set (1). The remaining 57 variables are continuous predictors that record the frequencies of some common words and characters and the lengths of uninterrupted sequences of capital letters in e-mails. The following steps display information about the Sashelp.JunkMail data set and create Figure B.10:

title 'Junk E-Mail Data';
proc contents data=sashelp.JunkMail varnum;
   ods select position;
run;

title 'The First Five Observations Out of 4601';
proc print data=sashelp.JunkMail(obs=5) heading=horizontal;
run;

Figure B.10: Junk E-Mail Data

Junk E-Mail Data

Variables in Creation Order
# Variable Type Len Label
1 Test Num 8 0 - Training, 1 - Test
2 Make Num 8  
3 Address Num 8  
4 All Num 8  
5 _3D Num 8 3D
6 Our Num 8  
7 Over Num 8  
8 Remove Num 8  
9 Internet Num 8  
10 Order Num 8  
11 Mail Num 8  
12 Receive Num 8  
13 Will Num 8  
14 People Num 8  
15 Report Num 8  
16 Addresses Num 8  
17 Free Num 8  
18 Business Num 8  
19 Email Num 8  
20 You Num 8  
21 Credit Num 8  
22 Your Num 8  
23 Font Num 8  
24 _000 Num 8 000
25 Money Num 8  
26 HP Num 8  
27 HPL Num 8  
28 George Num 8  
29 _650 Num 8 650
30 Lab Num 8  
31 Labs Num 8  
32 Telnet Num 8  
33 _857 Num 8 857
34 Data Num 8  
35 _415 Num 8 415
36 _85 Num 8 85
37 Technology Num 8  
38 _1999 Num 8 1999
39 Parts Num 8  
40 PM Num 8  
41 Direct Num 8  
42 CS Num 8  
43 Meeting Num 8  
44 Original Num 8  
45 Project Num 8  
46 RE Num 8  
47 Edu Num 8  
48 Table Num 8  
49 Conference Num 8  
50 Semicolon Num 8  
51 Paren Num 8  
52 Bracket Num 8  
53 Exclamation Num 8  
54 Dollar Num 8  
55 Pound Num 8  
56 CapAvg Num 8 Capital Run Length Average
57 CapLong Num 8 Capital Run Length Longest
58 CapTotal Num 8 Capital Run Length Total
59 Class Num 8 0 - Not Junk, 1 - Junk


The First Five Observations Out of 4601

Obs Test Make Address All _3D Our Over Remove Internet Order Mail Receive Will People Report Addresses Free Business Email You Credit Your Font _000 Money HP HPL George _650 Lab Labs Telnet _857 Data _415 _85 Technology _1999 Parts PM Direct CS Meeting Original Project RE Edu Table Conference Semicolon Paren Bracket Exclamation Dollar Pound CapAvg CapLong CapTotal Class
1 1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.000 0 0.778 0.000 0.000 3.756 61 278 1
2 0 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43 0.43 0 0 0 0 0 0 0 0 0 0 0 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.132 0 0.372 0.180 0.048 5.114 101 1028 1
3 1 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16 0.06 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06 0 0 0.01 0.143 0 0.276 0.184 0.010 9.821 485 2259 1
4 0 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.137 0 0.137 0.000 0.000 3.537 40 191 1
5 0 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00 0 0 0.00 0.135 0 0.135 0.000 0.000 3.537 40 191 1
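
As a possible follow-up to the steps above, the following sketch cross-tabulates the Test and Class variables to show how junk and non-junk e-mails are distributed across the training and test partitions. It is an illustrative addition, not part of the original example; the title text and the choice of the NOCOL and NOPERCENT options are assumptions made here to keep the output compact.

```sas
/* Illustrative sketch: cross-tabulate the partition indicator (Test) */
/* against the response (Class). Title and options are assumptions.   */
title 'Junk E-Mail Data: Class by Test';
proc freq data=sashelp.JunkMail;
   /* Rows: 0 = training, 1 = test; columns: 0 = not junk, 1 = junk */
   tables Test*Class / nocol nopercent;
run;
title;  /* clear the title so that it does not carry over to later steps */
```

The resulting table reports cell frequencies and row percentages, so the proportion of junk e-mail can be compared between the two partitions.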