The GAMPL Procedure

Example 7.2 Nonparametric Logistic Regression

This example shows how you can use PROC GAMPL to build a nonparametric logistic regression model for a data set that contains a binary response and then use that model to classify observations.

The example uses the Pima Indian Diabetes data set, which can be obtained from the UCI Machine Learning Repository (Asuncion and Newman 2007). It is extracted from a larger database that was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. Data are for female patients who are at least 21 years old, are of Pima Indian heritage, and live near Phoenix, Arizona. The objective of this study is to investigate the relationship between a diabetes diagnosis and variables that represent physiological measurements and medical attributes. Some missing or invalid observations are removed from the analysis. The reduced data set contains 532 records. The following DATA step creates the data set Pima:

title 'Pima Indian Diabetes Study';
data Pima;
   input NPreg Glucose Pressure Triceps BMI Pedigree Age Diabetes Test@@;
   datalines;
 6  148  72  35  33.6  0.627  50  1  1    1   85  66  29  26.6  0.351  31  0  1
 1   89  66  23  28.1  0.167  21  0  0    3   78  50  32    31  0.248  26  1  0
 2  197  70  45  30.5  0.158  53  1  0    5  166  72  19  25.8  0.587  51  1  1
 0  118  84  47  45.8  0.551  31  1  1    1  103  30  38  43.3  0.183  33  0  1
 3  126  88  41  39.3  0.704  27  0  0    9  119  80  35    29  0.263  29  1  0

   ... more lines ...   

 1  128  48  45  40.5  0.613  24  1  0    2  112  68  22  34.1  0.315  26  0  0
 1  140  74  26  24.1  0.828  23  0  1    2  141  58  34  25.4  0.699  24  0  1
 7  129  68  49  38.5  0.439  43  1  1    0  106  70  37  39.4  0.605  22  0  0
 1  118  58  36  33.3  0.261  23  0  1    8  155  62  26    34  0.543  46  1  0
;

The data set contains nine variables, including the binary response variable Diabetes. Table 7.15 describes the variables.

Table 7.15: Variables in the Pima Data Set

Variable

Description

NPreg

Number of pregnancies

Glucose

Two-hour plasma glucose concentration in an oral glucose tolerance test

Pressure

Diastolic blood pressure (mm Hg)

Triceps

Triceps skin fold thickness (mm)

BMI

Body mass index (weight in kg/(height in m)${}^2$)

Pedigree

Diabetes pedigree function

Age

Age (years)

Diabetes

0 if test negative for diabetes, 1 if test positive

Test

0 for training role, 1 for test


The Test variable splits the data set into training and test subsets. The training observations (whose Test value is 0) hold approximately 59.4% of the data. To build a model that is based on the training data and evaluate its performance by predicting the test data, you use the following statements to create a new variable, Result, whose value is the same as that of the Diabetes variable for a training observation and is missing for a test observation:

data Pima;
   set Pima;
   Result = Diabetes;
   if Test=1 then Result=.;
run;

As a starting point of your analysis, you can build a parametric logistic regression model on the training data and predict the test data. The following statements use PROC HPLOGISTIC to perform the analysis:

proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps
                               BMI Pedigree Age;
   partition role=Test(test='1' train='0');
run;

Output 7.2.1 shows the summary statistics from the parametric logistic regression model.

Output 7.2.1: Fit Statistics

Pima Indian Diabetes Study

The HPLOGISTIC Procedure

Fit Statistics
Description Training Testing
-2 Log Likelihood 295.79 176.90
AIC (smaller is better) 311.79 192.90
AICC (smaller is better) 312.24 193.65
BIC (smaller is better) 342.19 219.37



Output 7.2.2 shows fit statistics for both training and test subsets of the data, including the misclassification error for the test data.

Output 7.2.2: Partition Fit Statistics

Partition Fit Statistics
Statistic Training Testing
Area under the ROCC 0.8607 0.8547
Average Square Error 0.1452 0.1401
Hosmer-Lemeshow Test 0.9726 0.1487
Misclassification Error 0.2061 0.2178
R-Square 0.3438 0.2557
Max-rescaled R-Square 0.4695 0.3706
McFadden's R-Square 0.3197 0.2522
Mean Difference 0.3780 0.3605
Somers' D 0.7213 0.7093
True Negative Fraction 0.8702 0.8435
True Positive Fraction 0.6639 0.6182



The parametric logistic regression model is restricted in the sense that all variables affect the response in strictly linear fashion. If you are uncertain that a variable is an important factor and its contribution is linear in predicting the response, you might want to choose a nonparametric logistic regression model to fit the data. You can use PROC GAMPL to form a nonparametric model by including the spline transformation of each explanatory variable, as shown in the following statements:

proc gampl data=Pima seed=12345;
   model Result(event='1') = spline(NPreg)    spline(Glucose)
                             spline(Pressure) spline(Triceps)
                             spline(BMI)      spline(Pedigree)
                             spline(Age) / dist=binary;
run;

Because the response variable Result is binary, the DIST=BINARY option in the MODEL statement specifies a binary distribution for the response variable. By default, PROC GAMPL models the probability of the first ordered response category, which is a negative diabetes testing result in this case. The EVENT= option specifically requests that PROC GAMPL model the probability of positive diabetes testing results. The “Response Profile” table in Output 7.2.3 shows the frequencies of the response in both categories.

Output 7.2.3: Response Profile

Pima Indian Diabetes Study

The GAMPL Procedure

Response Profile
Ordered
Value
Result Total
Frequency
1 0 208
2 1 122

You are modeling the probability that Result='1'.




Output 7.2.4 lists the summary statistics from the nonparametric logistic regression model, which include spline transformations of all variables.

Output 7.2.4: Fit Statistics

Fit Statistics
Penalized Log Likelihood -130.55095
Roughness Penalty 3.20011
Effective Degrees of Freedom 25.85896
Effective Degrees of Freedom for Error 302.97752
AIC (smaller is better) 309.61970
AICC (smaller is better) 314.20203
BIC (smaller is better) 407.86031
UBRE (smaller is better) 0.14225



The “Tests for Smoothing Components” table in Output 7.2.5 shows approximate tests results. Although some spline terms are significant, others are not. The null testing hypothesis is whether the total contribution from a variable is 0. So you can form a reduced model by removing those nonsignificant spline terms from the model. In this case, spline transformations for NPreg, Pressure, BMI, and Triceps are dropped from the model because their p-values are larger than the 0.1 nominal level.

Output 7.2.5: Tests for Smoothing Components

Tests for Smoothing Components
Component Effective
DF
Effective
DF for Test
Chi-Square Pr > ChiSq
Spline(NPreg) 1.00001 1 0.0758 0.7831
Spline(Glucose) 1.00000 1 38.4662 <.0001
Spline(Pressure) 7.80438 8 10.2605 0.2472
Spline(Triceps) 1.00000 1 0.6343 0.4258
Spline(BMI) 8.00000 8 11.8047 0.1601
Spline(Pedigree) 1.00002 1 8.5323 0.0035
Spline(Age) 5.05456 7 15.7671 0.0273



The following statements use PROC GAMPL to fit a reduced nonparametric logistic regression model. The OUTPUT statement requests predicted probabilities for both training and test data sets. The ID statement requests that the Diabetes and Test variables also be included in the output data set so that you can use them to identify test observations and compute misclassification errors.

ods graphics on;
proc gampl data=Pima plots seed=12345;
   model Result(event='1') = spline(Glucose)
                             spline(Pedigree) spline(Age) / dist=binary;
   output out=PimaOut;
   id Diabetes Test;
run;

Output 7.2.6 shows the summary statistics from the reduced nonparametric logistic regression model. The values of the information criteria are better than of the parametric logistic regression model.

Output 7.2.6: Fit Statistics

Pima Indian Diabetes Study

The GAMPL Procedure

Fit Statistics
Penalized Log Likelihood -149.85765
Roughness Penalty 2.85613
Effective Degrees of Freedom 8.05242
Effective Degrees of Freedom for Error 320.61181
AIC (smaller is better) 312.96402
AICC (smaller is better) 313.41826
BIC (smaller is better) 343.55593
UBRE (smaller is better) -0.00230



In the “Estimates for Smoothing Components” table in Output 7.2.7, PROC GAMPL reports that the effective degrees of freedom value for spline transformations of Glucose is quite close to 1. This suggests strictly linear form for Glucose. For Pedigree, the degrees of freedom value demonstrates a moderate amount of nonlinearity. For Age, the degrees of freedom value is much larger than 1. The measure suggests a nonlinear pattern in the dependency of the response on Age.

Output 7.2.7: Estimates for Smoothing Components

Estimates for Smoothing Components
Component Effective
DF
Smoothing
Parameter
Roughness
Penalty
Number of
Parameters
Rank of
Penalty
Matrix
Number of
Knots
Spline(Glucose) 1.00000 9.032E10 2.711E-7 9 10 110
Spline(Pedigree) 1.51071 0.4383 0.5086 9 10 283
Spline(Age) 4.54171 69.8810 2.3475 9 10 42



The “Tests for Smoothing Components” table in Output 7.2.8 shows that all spline transformations are significant in predicting diabetes testing results.

Output 7.2.8: Tests for Smoothing Components

Tests for Smoothing Components
Component Effective
DF
Effective
DF for Test
Chi-Square Pr > ChiSq
Spline(Glucose) 1.00000 1 53.0363 <.0001
Spline(Pedigree) 1.51071 2 9.9354 0.0070
Spline(Age) 4.54171 6 23.0661 0.0008



The smoothing component panel (which is produced by the PLOTS option and is shown in Output 7.2.9) visualizes the spline transformations for the four variables in addition to 95% Bayesian curvewise confidence bands. For Glucose, the spline transformation is almost a straight line. For Pedigree, the spline transformation shows a slightly nonlinear trend. For Age, the dependency is obviously nonlinear.

Output 7.2.9: Smoothing Components Panel

Smoothing Components Panel


The following statements compute the misclassification error on the test data set from the reduced nonparametric logistic regression model that PROC GAMPL produces:

data test;
   set PimaOut(where=(Test=1));
   if ((Pred>0.5 & Diabetes=1) | (Pred<0.5 & Diabetes=0))
   then Error=0;
   else Error=1;
run;

proc freq data=test;
   tables Diabetes*Error/nocol norow;
run;

Output 7.2.10 shows the misclassification errors for observations in the test set and observations of each response category. The error is smaller than the error from the parametric logistic regression model.

Output 7.2.10: Crosstabulation Table for Test Set Prediction

Pima Indian Diabetes Study

The FREQ Procedure

Frequency
Percent
Table of Diabetes by Error
Diabetes Error
0 1 Total
0
130
64.36
17
8.42
147
72.77
1
35
17.33
20
9.90
55
27.23
Total
165
81.68
37
18.32
202
100.00