The HPLOGISTIC Procedure

Example 54.4 Partitioning Data

This example uses the Pima Indian Diabetes data set, which can be obtained from the UCI Machine Learning Repository (Asuncion and Newman 2007). The data are extracted from a larger database that was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases. The patients are women who are at least 21 years old, are of Pima Indian heritage, and live near Phoenix, Arizona. The objective of the study is to investigate the relationship between a diabetes diagnosis and variables that represent physiological measurements and medical attributes. Some observations that have missing or invalid values are removed from the analysis, and the reduced data set contains 532 records. The following DATA step creates the data set Pima:

title 'Pima Indian Diabetes Study';
data Pima;
   input NPreg Glucose Pressure Triceps BMI Pedigree Age Diabetes Role@@;
   datalines;
 6  148   72  35  33.6  0.627  50  1  0   1   85   66  29  26.6  0.351  31  0  1    
 1   89   66  23  28.1  0.167  21  0  0   3   78   50  32    31  0.248  26  1  0    
 2  197   70  45  30.5  0.158  53  1  0   5  166   72  19  25.8  0.587  51  1  1    
 0  118   84  47  45.8  0.551  31  1  0   1  103   30  38  43.3  0.183  33  0  0    
 3  126   88  41  39.3  0.704  27  0  0   9  119   80  35    29  0.263  29  1  0    

   ... more lines ...   

 1  128   48  45  40.5  0.613  24  1  1   2  112   68  22  34.1  0.315  26  0  1    
 1  140   74  26  24.1  0.828  23  0  1   2  141   58  34  25.4  0.699  24  0  0    
 7  129   68  49  38.5  0.439  43  1  1   0  106   70  37  39.4  0.605  22  0  0    
 1  118   58  36  33.3  0.261  23  0  1   8  155   62  26    34  0.543  46  1  0    
;

The data set contains nine variables, including the binary response variable Diabetes. Table 54.11 describes the variables.

Table 54.11: Variables in the Pima Data Set

Variable   Description
NPreg      Number of pregnancies
Glucose    Two-hour plasma glucose concentration in an oral glucose tolerance test
Pressure   Diastolic blood pressure (mm Hg)
Triceps    Triceps skin fold thickness (mm)
BMI        Body mass index (weight in kg/(height in m)$^2$)
Pedigree   Diabetes pedigree function
Age        Age (years)
Diabetes   0 if test negative for diabetes, 1 if test positive
Role       0 for training role, 1 for test


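In the following program, the PARTITION statement divides the data into two parts according to the value of the Role variable. Observations that have a Role value of 0 form the training data and make up about 59% of the data (316 of 532 observations); the remaining observations, which have a Role value of 1, form the test data and are used to evaluate the fit. A stepwise selection method selects the best model based on the training observations.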

proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition role=Role(train='0' test='1');
   selection method=stepwise;
run;
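
Before fitting, it can help to confirm how the Role variable splits the observations. The following is a minimal sketch, assuming the Pima data set created by the DATA step earlier in this example; PROC FREQ is not part of the original example and is used here only to cross-tabulate Role and Diabetes. The resulting counts should agree with the training and testing columns that PROC HPLOGISTIC reports.

```sas
/* Cross-tabulate the partition role against the response.        */
/* NOROW, NOCOL, and NOPERCENT suppress the percentages so that   */
/* only the cell counts are displayed.                            */
proc freq data=Pima;
   tables Role*Diabetes / norow nocol nopercent;
run;
```

The row totals of this table give the size of each partition, and the cells give the number of events and nonevents within each partition.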

Selected results from the analysis are shown in Output 54.4.1 through Output 54.4.3.

The "Number of Observations" and "Response Profile" tables in Output 54.4.1 are divided into training and testing columns.

Output 54.4.1: Partitioned Counts

Pima Indian Diabetes Study

The HPLOGISTIC Procedure

Number of Observations
Description Total Training Testing
Number of Observations Read 532 316 216
Number of Observations Used 532 316 216

Response Profile
Ordered Value  Diabetes  Total Frequency  Training  Testing
1              0         355              204       151
2              1         177              112       65

You are modeling the probability that Diabetes='1'.




The standard likelihood-based fit statistics for the selected model are displayed in the "Fit Statistics" table, with a column for each of the training and testing subsets.

Output 54.4.2: Partitioned Fit Statistics

Fit Statistics
Description Training Testing
-2 Log Likelihood 297.07 182.60
AIC (smaller is better) 307.07 192.60
AICC (smaller is better) 307.26 192.88
BIC (smaller is better) 325.84 209.47
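
The information criteria in this table are consistent with the usual penalized-likelihood definitions. The following worked check for the training column is a sketch under stated assumptions: the difference between AIC and -2 Log Likelihood is 10 in both columns, which under the standard AIC definition implies $p = 5$ parameters (an inference from the table, not something the output shown here states), and $n = 316$ is the number of training observations.

```latex
\begin{align*}
\mathrm{AIC}  &= -2\log L + 2p
               = 297.07 + 2(5) = 307.07 \\
\mathrm{AICC} &= -2\log L + \frac{2pn}{n-p-1}
               = 297.07 + \frac{2(5)(316)}{310} \approx 307.26 \\
\mathrm{BIC}  &= -2\log L + p\log n
               = 297.07 + 5\log(316) \approx 325.85
\end{align*}
```

The BIC value agrees with the displayed 325.84 up to rounding of the displayed -2 Log Likelihood. The same formulas with $n = 216$ give values that match the testing column to within the same rounding.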



More fit statistics are displayed in the "Partition Fit Statistics" table shown in Output 54.4.3. These statistics are computed for both the training and testing data and should be very similar between the two groups when the training data are representative of the testing data. The statistics include the likelihood-based R-square statistics, as well as several prediction-based statistics that are described in the sections Model Fit and Assessment Statistics and The Hosmer-Lemeshow Goodness-of-Fit Test. For this model, the values of the statistics seem similar between the two disjoint subsets; for example, the misclassification error is 0.2310 for the training data and 0.1898 for the testing data.

Output 54.4.3: More Partitioned Fit Statistics

Partition Fit Statistics
Statistic Training Testing
Area under the ROCC 0.8397 0.8734
Average Square Error 0.1536 0.1327
Hosmer-Lemeshow Test 0.5868 0.4382
Misclassification Error 0.2310 0.1898
R-Square 0.3025 0.3147
Max-rescaled R-Square 0.4157 0.4459
McFadden's R-Square 0.2770 0.3089
Mean Difference 0.3302 0.3782
Somers' D 0.6794 0.7467
True Negative Fraction 0.8725 0.8675
True Positive Fraction 0.5804 0.6769
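
Several rows of this table are arithmetically related, which gives a quick consistency check. The following sketch uses the training column and assumes the standard identity between Somers' D and the area under the ROC curve, and it assumes that the misclassification error and the true negative and true positive fractions are all computed at the same classification cutoff. The counts $n_0 = 204$ and $n_1 = 112$ are the training frequencies of Diabetes=0 and Diabetes=1 from the "Response Profile" table.

```latex
\begin{align*}
\text{Somers' } D &= 2\,\mathrm{AUC} - 1 = 2(0.8397) - 1 = 0.6794 \\[4pt]
\text{Misclassification error}
  &= \frac{n_0\,(1 - \mathrm{TNF}) + n_1\,(1 - \mathrm{TPF})}{n_0 + n_1} \\
  &= \frac{204\,(1 - 0.8725) + 112\,(1 - 0.5804)}{316}
   = \frac{26.01 + 46.9952}{316} \approx 0.2310
\end{align*}
```

Both results match the displayed training values. Applying the same expressions to the testing column reproduces the displayed values to within rounding of the four-decimal inputs.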



If you want to display the "Partition Fit Statistics" table without partitioning your data set, you must identify all your data as training data. One way to do this is to define the fractions for the other roles to be zero:

proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition fraction(test=0 validation=0);
run;

Another way is to specify a constant variable as the training role:

data Pima;
   set Pima;
   Role=0;
run;
proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition role=Role(train='0');
run;
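
Note that the preceding DATA step overwrites the original Role values in Pima. If you want to keep the original partition available, one option is to write to a copy and use a separate constant variable. The following is a minimal sketch of that variation; the data set name PimaAll and the variable name AllTrain are hypothetical names introduced here for illustration and do not appear in the original example.

```sas
/* Copy Pima and add a constant role variable, leaving Role intact. */
/* PimaAll and AllTrain are illustrative (hypothetical) names.      */
data PimaAll;
   set Pima;
   AllTrain = 0;   /* every observation is assigned the training role */
run;

/* Fit the full model and treat all observations as training data. */
proc hplogistic data=PimaAll;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition role=AllTrain(train='0');
run;
```

Because the original Pima data set is unchanged, the partitioned analysis shown earlier in this example can still be rerun afterward.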

The resulting "Partition Fit Statistics" table is shown in Output 54.4.4.

Output 54.4.4: All Data Are Training Data

Pima Indian Diabetes Study

The HPLOGISTIC Procedure

Partition Fit Statistics
Statistic Training
Area under the ROCC 0.8598
Average Square Error 0.1415
Hosmer-Lemeshow Test 0.6473
Misclassification Error 0.2124
R-Square 0.3267
Max-rescaled R-Square 0.4539
McFadden's R-Square 0.3110
Mean Difference 0.3643
Somers' D 0.7196
True Negative Fraction 0.8930
True Positive Fraction 0.5763