This example uses the Pima Indian Diabetes data set, which can be obtained from the UCI Machine Learning Repository (Asuncion
and Newman 2007). It is extracted from a larger database that was originally owned by the National Institute of Diabetes and Digestive and
Kidney Diseases. Data are for female patients who are at least 21 years old, are of Pima Indian heritage, and live near Phoenix,
Arizona. The objective of this study is to investigate the relationship between a diabetes diagnosis and variables that represent
physiological measurements and medical attributes. Some missing or invalid observations are removed from the analysis. The
reduced data set contains 532 records. The following DATA step creates the data set Pima:

title 'Pima Indian Diabetes Study';
data Pima;
   input NPreg Glucose Pressure Triceps BMI Pedigree Age Diabetes Role @@;
   datalines;
6 148 72 35 33.6 0.627 50 1 0
1  85 66 29 26.6 0.351 31 0 1
1  89 66 23 28.1 0.167 21 0 0
3  78 50 32 31   0.248 26 1 0
2 197 70 45 30.5 0.158 53 1 0
5 166 72 19 25.8 0.587 51 1 1
0 118 84 47 45.8 0.551 31 1 0
1 103 30 38 43.3 0.183 33 0 0
3 126 88 41 39.3 0.704 27 0 0
9 119 80 35 29   0.263 29 1 0

   ... more lines ...

1 128 48 45 40.5 0.613 24 1 1
2 112 68 22 34.1 0.315 26 0 1
1 140 74 26 24.1 0.828 23 0 1
2 141 58 34 25.4 0.699 24 0 0
7 129 68 49 38.5 0.439 43 1 1
0 106 70 37 39.4 0.605 22 0 0
1 118 58 36 33.3 0.261 23 0 1
8 155 62 26 34   0.543 46 1 0
;
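Because the trailing @@ in the INPUT statement holds each data line so that several observations can be read from it, a quick look at the first few observations confirms that the nine variables were read as intended. This check is a minimal sketch and is not part of the original example:

proc print data=Pima(obs=5);
run;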
The data set contains nine variables, including the binary response variable Diabetes. Table 10.11 describes the variables.
Table 10.11: Variables in the Pima Data Set
Variable | Description
---|---
NPreg | Number of pregnancies
Glucose | Two-hour plasma glucose concentration in an oral glucose tolerance test
Pressure | Diastolic blood pressure (mm Hg)
Triceps | Triceps skin fold thickness (mm)
BMI | Body mass index (weight in kg/(height in m)²)
Pedigree | Diabetes pedigree function
Age | Age (years)
Diabetes | 0 if the diabetes test is negative, 1 if it is positive
Role | 0 for the training role, 1 for the test role
In the following program, the PARTITION statement divides the data into two parts. The training data have a Role
value of 0 and hold about 59% of the data; the rest of the data are used to evaluate the fit. A stepwise selection method
selects the best model based on the training observations.
proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition role=Role(train='0' test='1');
   selection method=stepwise;
run;
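If you want to confirm the approximate 59/41 split that the Role variable encodes before fitting, a simple frequency table suffices. This is a minimal check and is not part of the original program:

proc freq data=Pima;
   tables Role / nocum;   /* Role=0 (training) versus Role=1 (testing) counts and percentages */
run;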
Selected results from the analysis are shown in Output 10.4.1 through Output 10.4.3.
The "Number of Observations" and "Response Profile" tables in Output 10.4.1 are divided into training and testing columns.
Output 10.4.1: Partitioned Counts
The standard likelihood-based fit statistics for the selected model are displayed in the "Fit Statistics" table, with a column for each of the training and testing subsets.
Output 10.4.2: Partitioned Fit Statistics
More fit statistics are displayed in the "Partition Fit Statistics" table shown in Output 10.4.3. These statistics are computed for both the training and testing data, and they should be very similar for the two groups when the training data are representative of the testing data. The statistics include the likelihood-based R-square statistics as well as several prediction-based statistics that are described in the sections "Model Fit and Assessment Statistics" and "The Hosmer-Lemeshow Goodness-of-Fit Test." For this model, the values of the statistics are similar for the two disjoint subsets.
Output 10.4.3: More Partitioned Fit Statistics
Partition Fit Statistics

Statistic | Training | Testing
---|---|---
Area under the ROCC | 0.8397 | 0.8734 |
Average Square Error | 0.1536 | 0.1327 |
Hosmer-Lemeshow Test | 0.5868 | 0.4382 |
Misclassification Error | 0.2310 | 0.1898 |
R-Square | 0.3025 | 0.3147 |
Max-rescaled R-Square | 0.4157 | 0.4459 |
McFadden's R-Square | 0.2770 | 0.3089 |
Mean Difference | 0.3302 | 0.3782 |
Somers' D | 0.6794 | 0.7467 |
True Negative Fraction | 0.8725 | 0.8675 |
True Positive Fraction | 0.5804 | 0.6769 |
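Two of these statistics, the average square error and the misclassification error, can also be reproduced directly from the predicted probabilities. The following sketch is not part of the original example: the ID and OUTPUT statements write the predictions to a data set (the names Preds, PredProb, SqError, and MisClass are arbitrary), and a DATA step plus PROC MEANS averages the per-observation errors within each partition. The 0.5 classification cutoff is an assumption and might differ from the rule the procedure uses.

proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition role=Role(train='0' test='1');
   selection method=stepwise;
   id Diabetes Role;                /* carry these input variables to the output data set */
   output out=Preds pred=PredProb;  /* predicted probability that Diabetes=1 */
run;

data PredStats;
   set Preds;
   SqError  = (Diabetes - PredProb)**2;        /* squared error for one observation */
   MisClass = ((PredProb >= 0.5) ne Diabetes); /* 1 if the assumed 0.5 cutoff misclassifies */
run;

proc means data=PredStats mean;
   class Role;              /* Role=0 training, Role=1 testing */
   var SqError MisClass;    /* means approximate the ASE and misclassification error */
run;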
If you want to display the "Partition Fit Statistics" table without partitioning your data set, you must identify all your data as training data. One way to do this is to define the fractions for the other roles to be zero:
proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition fraction(test=0 validation=0);
run;
Another way is to specify a constant variable as the training role:
data Pima;
   set Pima;
   Role=0;
run;

proc hplogistic data=Pima;
   model Diabetes(event='1') = NPreg Glucose Pressure Triceps BMI Pedigree Age;
   partition role=Role(train='0');
run;
The resulting "Partition Fit Statistics" table is shown in Output 10.4.4.
Output 10.4.4: All Data Are Training Data
Pima Indian Diabetes Study

Partition Fit Statistics

Statistic | Training
---|---
Area under the ROCC | 0.8598 |
Average Square Error | 0.1415 |
Hosmer-Lemeshow Test | 0.6473 |
Misclassification Error | 0.2124 |
R-Square | 0.3267 |
Max-rescaled R-Square | 0.4539 |
McFadden's R-Square | 0.3110 |
Mean Difference | 0.3643 |
Somers' D | 0.7196 |
True Negative Fraction | 0.8930 |
True Positive Fraction | 0.5763 |