The assessment of a model can be optimistically biased if the data used to fit the model are also used to assess it. Two ways of dealing with this are discussed and illustrated below. The first is to split the available data into training and validation data sets. The model is fit (trained) using the training data set and then assessed by applying it to the validation data set. However, when the amount of data available is small, this can result in an unacceptably small training data set. Another option is crossvalidation, which provides an unbiased assessment of the model without reducing the size of the training data set.
The ROC (Receiver Operating Characteristic) curve and the area under the ROC curve (AUC) are commonly used to assess binary response models such as logistic models. This example illustrates the use of a validation data set and crossvalidation to produce an ROC curve and estimate its area.
See this note on comparing the areas under competing binary response models.
The DATA step below creates the data sets used in this example. Data were gathered from four blocks. Data set TRAIN contains the data from the first three blocks and will be used as the training data set. Data set VALID contains the fourth block and will be used as the validation data set. Data set ALLDATA contains the complete set of data and will be used to illustrate crossvalidation.
data alldata train valid;
   input block entry lat lng n r @@;
   /* Expand each events/trials record into one observation per trial */
   do i=1 to r;                    /* the r events */
      y=1;
      output alldata;
      if block<4 then output train;
      else output valid;
   end;
   do i=1 to n-r;                  /* the n-r nonevents */
      y=0;
      output alldata;
      if block<4 then output train;
      else output valid;
   end;
datalines;
1 14 1 1 8 2 1 16 1 2 9 1
1 7 1 3 13 9 1 6 1 4 9 9
1 13 2 1 9 2 1 15 2 2 14 7
1 8 2 3 8 6 1 5 2 4 11 8
1 11 3 1 12 7 1 12 3 2 11 8
1 2 3 3 10 8 1 3 3 4 12 5
1 10 4 1 9 7 1 9 4 2 15 8
1 4 4 3 19 6 1 1 4 4 8 7
2 15 5 1 15 6 2 3 5 2 11 9
2 10 5 3 12 5 2 2 5 4 9 9
2 11 6 1 20 10 2 7 6 2 10 8
2 14 6 3 12 4 2 6 6 4 10 7
2 5 7 1 8 8 2 13 7 2 6 0
2 12 7 3 9 2 2 16 7 4 9 0
2 9 8 1 14 9 2 1 8 2 13 12
2 8 8 3 12 3 2 4 8 4 14 7
3 7 1 5 7 7 3 13 1 6 7 0
3 8 1 7 13 3 3 14 1 8 9 0
3 4 2 5 15 11 3 10 2 6 9 7
3 3 2 7 15 11 3 9 2 8 13 5
3 6 3 5 16 9 3 1 3 6 8 8
3 15 3 7 7 0 3 12 3 8 12 8
3 11 4 5 8 1 3 16 4 6 15 1
3 5 4 7 12 7 3 2 4 8 16 12
4 9 5 5 15 8 4 4 5 6 10 6
4 12 5 7 13 5 4 1 5 8 15 9
4 15 6 5 17 6 4 6 6 6 8 2
4 14 6 7 12 5 4 7 6 8 15 8
4 13 7 5 13 2 4 8 7 6 13 9
4 3 7 7 9 9 4 10 7 8 6 6
4 2 8 5 12 8 4 11 8 6 9 7
4 5 8 7 11 10 4 16 8 8 15 7
;
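As an optional sanity check on the split, a simple frequency table of the expanded observations can be produced. This is just a sketch using PROC FREQ; block 4 (the validation block) should show a total of 193 trials, matching the total frequency that the FITSTAT option reports for the validation data later in this example.

/* Optional check: counts of expanded observations by block and outcome */
proc freq data=alldata;
   tables block*y / norow nocol nopercent;
run;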
ROC analysis using separate training and validation data sets
Begin by fitting the model to the training data set, TRAIN. Include a SCORE statement to apply the fitted model to the validation data set (VALID) and create a data set of predicted event probabilities (VALPRED). The OUTROC= options in the MODEL and SCORE statements save the ROC curve coordinates for the training and validation data sets and, when ODS Graphics is enabled, produce plots of the two ROC curves. The point estimates of the areas under the curves are displayed in the plots. If desired, adding the ROC statement produces a confidence interval for the area under the ROC curve for the training data. If the ROCCONTRAST statement is also added, a test is provided of the hypothesis that the AUC of the model applied to the training data equals 0.5, the AUC of an uninformative model.
ods graphics on;

proc logistic data=train;
   model y(event="1") = entry / outroc=troc;
   score data=valid out=valpred outroc=vroc;
   roc;
   roccontrast;
run;
The AUC for the fitted model applied to the training data set is 0.7193. When applied to the validation data set, the AUC is 0.6350. The significant ROC contrast test (p<0.0001) indicates that the fitted model is better than the uninformative model when applied to the training data.
ROC Association Statistics

                          Standard      95% Wald
ROC Model        Area      Error    Confidence Limits   Somers' D    Gamma    Tau-a
Model           0.7193     0.0217    0.6768   0.7618      0.4386    0.4626   0.2188
ROC1            0.5000     0         0.5000   0.5000      0         .        0
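The OUTROC= data sets saved above (TROC and VROC) contain the curve coordinates in the _1MSPEC_ and _SENSIT_ variables, so the training and validation curves can also be overlaid in a single plot. The following is a minimal sketch; the ROCBOTH data set and CURVE variable are illustrative names, not part of PROC LOGISTIC.

/* Combine the two sets of ROC coordinates and label each curve */
data rocboth;
   set troc(in=intrain) vroc;
   length curve $10;
   curve = ifc(intrain, "Training", "Validation");
run;

/* Overlay the curves with a diagonal reference line */
proc sgplot data=rocboth;
   series x=_1mspec_ y=_sensit_ / group=curve;
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
   xaxis label="1 - Specificity";
   yaxis label="Sensitivity";
run;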
To obtain a confidence interval and a test of the AUC for the validation data, a second PROC LOGISTIC step is needed. In this step, use the predicted event probabilities from scoring the validation data (in data set VALPRED) as the variable in the PRED= option in the ROC statement. Since Y=1 is specified as the event level in this example, the variable containing the predicted event probabilities is named P_1. Specifying no predictors in the MODEL statement fits a model containing only an intercept. The area under this model's ROC curve is 0.5, the area of an uninformative model. Including the ROCCONTRAST statement compares the fitted and uninformative models applied to the validation data.
proc logistic data=valpred;
   model y(event="1")=;
   roc pred=p_1;
   roccontrast;
run;
The AUC from applying the fitted model to the validation data set is again shown to be 0.6350. The significant ROC contrast test (p=0.0009) indicates that the fitted model is better than the uninformative model when applied to the validation data.
ROC Association Statistics

                          Standard      95% Wald
ROC Model        Area      Error    Confidence Limits   Somers' D    Gamma    Tau-a
Model           0.5000     0         0.5000   0.5000      0         .        0
ROC1            0.6350     0.0406    0.5554   0.7147      0.2700    0.2860   0.1341
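As a quick check of the interval, the Wald limits equal the estimate plus or minus 1.96 standard errors: 0.6350 ± 1.96 × 0.0406 ≈ (0.5554, 0.7146), agreeing with the table up to rounding.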
Beginning in SAS 9.3, the point estimate of the AUC for the validation data can be obtained by specifying the FITSTAT option in the SCORE statement. The following statements produce point estimates for both the training data set (as the c statistic) and the validation data set.
proc logistic data=train;
   model y(event="1") = entry;
   score data=valid out=valpred fitstat;
run;
Association of Predicted Probabilities and Observed Responses (training data)

Percent Concordant     69.3     Somers' D    0.439
Percent Discordant     25.5     Gamma        0.463
Percent Tied            5.2     Tau-a        0.219
Pairs                 73406     c            0.719

Fit Statistics for SCORE Data (data set VALID)

Total Frequency             193
Log Likelihood           -131.1
Misclassification Rate   0.3990
AIC                    266.2176
AICC                   266.2807
BIC                     272.743
SC                      272.743
R-Square               0.015667
Max-Rescaled R-Square  0.020973
AUC                    0.635025
Brier Score            0.241074
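Up to rounding, the two displays are consistent through the identity c = (Somers' D + 1)/2: the training data give Somers' D = (69.3 - 25.5)/100 = 0.438 and c = (0.439 + 1)/2 = 0.719, while the validation AUC of 0.635 corresponds to Somers' D = 2(0.635) - 1 = 0.270, the value seen in the earlier ROC table.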
ROC analysis using crossvalidation
Assessment via crossvalidation is done by fitting the model to the complete data set and using the crossvalidated predicted probabilities in an ROC analysis. The crossvalidated predicted probability for an observation approximates what would be obtained by omitting that observation, fitting the model to the remaining observations, and then using that fit to predict the omitted observation. In the first LOGISTIC step below, the model is fit to the complete data (data set ALLDATA). The PREDPROBS=CROSSVALIDATE option in the OUTPUT statement creates a data set containing the crossvalidated predicted probabilities. The second LOGISTIC step refits the model (labeled Model) and produces its ROC curve and area estimate. Since Y=1 is specified as the event level in this example, the variable containing the crossvalidated predicted event probabilities is named XP_1. The PRED=XP_1 option in the ROC statement produces a second ROC curve (labeled ROC1) and an area estimate based on the crossvalidated probabilities. The ROCCONTRAST statement tests the equality of the AUCs of the fitted model with and without crossvalidation.
proc logistic data=alldata;
   model y(event="1") = entry;
   output out=preds predprobs=crossvalidate;
run;

proc logistic data=preds;
   model y(event="1") = entry;
   roc pred=xp_1;
   roccontrast;
run;
Note that the AUC drops significantly (p<0.0001) from 0.697 to 0.670 when crossvalidation is used.
ROC Association Statistics

                          Standard      95% Wald
ROC Model        Area      Error    Confidence Limits   Somers' D    Gamma    Tau-a
Model           0.6970     0.0193    0.6592   0.7348      0.3939    0.4163   0.1961
ROC1            0.6700     0.0200    0.6308   0.7093      0.3400    0.3400   0.1693
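For intuition about what the crossvalidated probabilities represent, the following brute-force leave-one-out loop computes the same kind of predicted probability by actually refitting the model once per observation. This is a sketch for illustration only: the ALLDATA2, FIT, and LOOPREDS data set names, the OBSNUM counter, and the %LOO macro are ad hoc, and PROC LOGISTIC computes its crossvalidated probabilities far more efficiently than this.

data alldata2;                     /* add an observation counter */
   set alldata;
   obsnum = _n_;
run;

proc sql noprint;                  /* number of expanded observations */
   select count(*) into :nobs trimmed from alldata2;
quit;

%macro loo;
   %do i=1 %to &nobs;
      /* fit the model with observation &i withheld and save it */
      proc logistic data=alldata2(where=(obsnum ne &i)) outmodel=fit noprint;
         model y(event="1") = entry;
      run;
      /* score the withheld observation with the saved model */
      proc logistic inmodel=fit noprint;
         score data=alldata2(where=(obsnum = &i)) out=p&i;
      run;
   %end;
   data loopreds;                  /* one leave-one-out probability per observation */
      set p1-p&nobs;
   run;
%mend loo;
%loo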