Usage Note 52973: Plot and compare ROC curves from a fitted model used to score validation or other data

The Receiver Operating Characteristic (ROC) curve is a popular way to summarize the predictive ability of a binary logistic model. You can produce a plot of the ROC curve for the fitted model (and a data set containing the ROC plot data) by specifying the OUTROC= option in the MODEL statement. This ROC curve summarizes the model as applied to the data used to fit the model (the training data). ODS Graphics must be on in order for PROC LOGISTIC to produce the graph. In the same PROC LOGISTIC step, you can obtain the ROC curve for the fitted model as applied to a separate, validation data set. To do this, add a SCORE statement that also includes the OUTROC= option. If the model, trained on one data sample, is used to score multiple independent samples, the same can be done using multiple SCORE statements.

To make visual comparison easier, you might want to produce a single graph that overlays the two ROC curves. This can be done by combining the OUTROC= data sets and then producing the overlaid plot using PROC SGPLOT.

In the discussion below, a plot is produced that allows visual comparison of the ROC curves. If a statistical comparison among such independent ROC curves is desired, tests can be done as illustrated in this note.

The method shown below can also be used to overlay separate ROC curves produced when the BY statement is used as illustrated in this note.

Overlaid plot of ROC curves fit to training and validation data

The following uses the example shown in this note. As shown in that note, these statements fit the model and produce two separate ROC graphs – one for the training data and one for the validation data.

      proc logistic data=train;
        model y(event="1") = entry / outroc=troc;
        score data=valid out=valpred outroc=vroc;
        run;

The following statements concatenate the training and validation OUTROC= data sets. Because the OUTROC= data sets omit the (0,0) point of each curve, the Zero data set supplies those points and is concatenated as well. A character variable, DATA, is added to identify which observations come from the training data and which come from the validation data.

      data Zero;
        input data $ _1mspec_ _sensit_;
        datalines;
        train 0 0
        valid 0 0
        ;
      data Plotdata; 
        set zero troc(in=train) vroc(in=valid);
        if valid then data="valid"; 
        if train then data="train";
        run;
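
As a quick check (a sketch, not part of the original example), the combined data set can be summarized by the DATA variable. Because a (0,0) point was added for each curve, the minimum of both coordinates should be 0 in each group, and the N column shows how many points each curve has.

```sas
/* Sketch: summarize the ROC coordinates separately for each curve. */
/* Assumes Plotdata was created by the DATA step shown above.       */
proc means data=Plotdata n min max;
  class data;                 /* "train" and "valid" groups */
  var _1mspec_ _sensit_;      /* x and y coordinates of the ROC curves */
run;
```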

The graph of overlaid ROC curves is produced using the statements below. Because the sensitivity and 1-specificity axes both range from zero to one, an ROC plot is conventionally square. Beginning in SAS® 9.4 TS1M0, you can use the ASPECT=1 option to produce a square plot (for earlier releases, see the Note at the end of this usage note). The LINEPARM statement produces the diagonal line that represents a model with no predictive ability. The GROUP= option is specified in the SERIES statement to produce separate curves for the training and validation data as identified by the DATA variable produced above.

The code below also illustrates how a few alterations can be made to the appearance of the ROC graph produced by PROC LOGISTIC. The STYLEATTRS statement with the WALLCOLOR= option colors the plot area background; a light shade of gray is used here. Similarly, the color and pattern of the diagonal line are specified using the LINEATTRS= option in the LINEPARM statement, and the line is made semi-transparent using the TRANSPARENCY= option. Omitting the GRID option in the YAXIS and XAXIS statements removes the usual set of light grid lines. Other aspects of the axes could be altered using options in the axis statements. If the legend identifying the curves is not desired, add the NOAUTOLEGEND option in the PROC SGPLOT statement. The INSET statement writes the AUC (area under the ROC curve) values for the training and validation data inside the plot area; the values are entered as literal text. Finally, a custom title is specified in the TITLE statement.

      proc sgplot data=Plotdata aspect=1;
        styleattrs wallcolor=grayEE;
        xaxis values=(0 to 1 by 0.25) offsetmin=.05 offsetmax=.05; 
        yaxis values=(0 to 1 by 0.25) offsetmin=.05 offsetmax=.05;
        lineparm x=0 y=0 slope=1 / transparency=.5 lineattrs=(color=black pattern=longdash);
        series x=_1mspec_ y=_sensit_ / group=data;
        inset ("Training AUC" = "0.7193" "Validation AUC" = "0.6350") / 
              border position=bottomright;
        title "ROC curves for training and validation data";
        run;

[Figure: overlaid ROC curves for the training and validation data]
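
The AUC values in the INSET statement above are typed in by hand. As a minimal sketch, assuming the same train and valid data sets, they could instead be captured from the fit by using the FITSTAT option and the ScoreFitStat table, the same approach that the multiple-sample example later in this note uses. The extra SCORE statement for the training data, the macro variable names auc1 and auc2, and the 6.4 format are illustrative choices, not part of the original example. The sketch also assumes that the DataSet column of the ScoreFitStat table sorts the training data set before the validation data set.

```sas
/* Sketch: score both data sets with FITSTAT so their AUCs land in one table. */
proc logistic data=train;
  model y(event="1") = entry / outroc=troc;
  score data=train out=_null_ fitstat;                /* training AUC only */
  score data=valid out=valpred outroc=vroc fitstat;   /* validation ROC data and AUC */
  ods output scorefitstat=AUC;
run;

/* Store the two AUCs in macro variables, ordered by data set name */
/* (assumed order: the train data set, then the valid data set).   */
proc sql noprint;
  select AUC format=6.4
    into :auc1 - :auc2
    from AUC
    order by dataset;
quit;

/* Plotdata is assumed to be built by the DATA steps shown earlier. */
proc sgplot data=Plotdata aspect=1;
  lineparm x=0 y=0 slope=1 / transparency=.5 lineattrs=(color=black pattern=longdash);
  series x=_1mspec_ y=_sensit_ / group=data;
  inset ("Training AUC" = "&auc1" "Validation AUC" = "&auc2") /
        border position=bottomright;
run;
```

With this arrangement the inset stays in step with the fitted model if the data or the model change.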

Overlaid plot of ROC curves fit to multiple, independent samples

This same method can be used to overlay the ROC curves from multiple data sets scored by the same model. Add a SCORE statement in the PROC LOGISTIC step for each data set. Combine the data sets using a DATA step similar to the one above, adding a variable that identifies each block of ROC data. PROC SGPLOT can then be used to produce the plot of overlaid ROC curves.

In the DATA step below, a separate data set is created for each Block in the input data. Additionally, a categorized version of the ENTRY variable is created and named ECAT.

      data Block1 Block2 Block3 Block4;
        label Y = 'No. of damaged plants'
              n = 'No. of plants';
        input block entry lat lng n Y @@;
        ecat=1;                               /* entries 1-6 */
        if 6<entry<=10 then ecat=2;           /* entries 7-10 */
        if 10<entry<=16 then ecat=3;          /* entries 11-16 */
        if block=1 then output Block1;
        else if block=2 then output Block2;
        else if block=3 then output Block3;
        else if block=4 then output Block4;
        datalines;
        1 14 1 1  8 2    1 16 1 2  9 1
        1  7 1 3 13 9    1  6 1 4  9 9
        1 13 2 1  9 2    1 15 2 2 14 7
        1  8 2 3  8 6    1  5 2 4 11 8
        1 11 3 1 12 7    1 12 3 2 11 8
        1  2 3 3 10 8    1  3 3 4 12 5
        1 10 4 1  9 7    1  9 4 2 15 8
        1  4 4 3 19 6    1  1 4 4  8 7
        2 15 5 1 15 6    2  3 5 2 11 9
        2 10 5 3 12 5    2  2 5 4  9 9
        2 11 6 1 20 10   2  7 6 2 10 8
        2 14 6 3 12 4    2  6 6 4 10 7
        2  5 7 1  8 8    2 13 7 2  6 0
        2 12 7 3  9 2    2 16 7 4  9 0
        2  9 8 1 14 9    2  1 8 2 13 12
        2  8 8 3 12 3    2  4 8 4 14 7
        3  7 1 5  7 7    3 13 1 6  7 0
        3  8 1 7 13 3    3 14 1 8  9 0
        3  4 2 5 15 11   3 10 2 6  9 7
        3  3 2 7 15 11   3  9 2 8 13 5
        3  6 3 5 16 9    3  1 3 6  8 8
        3 15 3 7  7 0    3 12 3 8 12 8
        3 11 4 5  8 1    3 16 4 6 15 1
        3  5 4 7 12 7    3  2 4 8 16 12
        4  9 5 5 15 8    4  4 5 6 10 6
        4 12 5 7 13 5    4  1 5 8 15 9
        4 15 6 5 17 6    4  6 6 6  8 2
        4 14 6 7 12 5    4  7 6 8 15 8
        4 13 7 5 13 2    4  8 7 6 13 9
        4  3 7 7  9 9    4 10 7 8  6 6
        4  2 8 5 12 8    4 11 8 6  9 7
        4  5 8 7 11 10   4 16 8 8 15 7
      ;
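
Before fitting, it can be worth confirming which ENTRY values fall in each ECAT category. The following is a minimal sketch, not part of the original example. It cross-tabulates the two variables for one of the block data sets, and the same check applies to the other blocks.

```sas
/* Sketch: list each ENTRY value with the ECAT category assigned to it. */
proc freq data=Block1;
  tables entry*ecat / list nocum nopercent;
run;
```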

Using the ECAT categorical predictor, the PROC LOGISTIC statements below fit a logistic model to only the data in Block 1. This model is then used to score the data in each of the four Blocks, and the ROC curve is produced for each. Since the data set of predicted values is not needed, OUT=_NULL_ is specified in each SCORE statement to suppress creation of the OUT= data set, and the ROC data are saved using the OUTROC= option. The FITSTAT option is included in each SCORE statement to produce a table containing the areas (AUCs) under the four ROC curves. This table is saved in data set AUC by the ODS OUTPUT statement.

Three DATA steps follow. The first creates the (0,0) points for the curves (these are not included in the OUTROC= data sets). The second concatenates the ROC data sets from all of the Blocks into a single data set and adds the Block variable. The third interleaves the (0,0) points with the ROC data by Block so that each curve begins at the origin. The SQL step then extracts the AUCs from the AUC data set and stores them in macro variables.

Finally, the SGPLOT step produces the plot of overlaid ROC curves and also displays the associated AUC values.

      proc logistic data=Block1;
        class ecat / param=ref;
        model y/n = ecat;
        score data=Block1 out=_null_ outroc=Block1ROC fitstat;
        score data=Block2 out=_null_ outroc=Block2ROC fitstat;
        score data=Block3 out=_null_ outroc=Block3ROC fitstat;
        score data=Block4 out=_null_ outroc=Block4ROC fitstat;
        ods output scorefitstat=AUC;
        run;
      data Zero;
        do Block=1 to 4;
          _1mspec_=0; _sensit_=0;
          output;
        end;
        run;
      data Plotdata; 
        set Block1ROC(in=b1) Block2ROC(in=b2) Block3ROC(in=b3) Block4ROC(in=b4);
        if b1 then Block=1; 
        if b2 then Block=2;
        if b3 then Block=3;
        if b4 then Block=4;
        run;
      data PlotData;
        set zero PlotData;
        by Block;
        run;
      proc sql noprint;
        select distinct(AUC)
        into :auc1 - :auc4
        from AUC
        order by dataset;
        quit;
      proc sgplot data=PlotData aspect=1;
        xaxis values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05; 
        yaxis values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
        lineparm x=0 y=0 slope=1 / transparency=.5 lineattrs=(color=black);
        series x=_1mspec_ y=_sensit_ / group=Block;
        inset ("AUC block 1" = "&auc1"
               "AUC block 2" = "&auc2"
               "AUC block 3" = "&auc3"
               "AUC block 4" = "&auc4") / opaque position=bottomright;
        title "ROC Curves scored from Block 1 model";
        run;

[Figure: overlaid ROC curves for the four Blocks, scored from the Block 1 model]

______________

Note: In releases earlier than SAS 9.4 TS1M0, you can instead specify equal values in the HEIGHT= and WIDTH= options of the ODS GRAPHICS statement. Specify this statement before the PROC SGPLOT step. For example, the following produces a square graph that has 480 pixels on each side.

      ods graphics / height=480px width=480px;
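
Because the ODS GRAPHICS size settings stay in effect for later graphs, it can be convenient to restore the defaults after the ROC plot is drawn. A minimal sketch, assuming the Plotdata data set from the first example and omitting the ASPECT= option:

```sas
/* Sketch: square output via ODS GRAPHICS, then restore the default settings. */
ods graphics / height=480px width=480px;

proc sgplot data=Plotdata;                      /* no ASPECT= option here */
  series x=_1mspec_ y=_sensit_ / group=data;
run;

ods graphics / reset;                           /* back to default graph size */
```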


Operating System and Release Information

Product Family: SAS System
Product: SAS/STAT
System (SAS Release columns: Reported, Fixed*):
z/OS
Z64
OpenVMS VAX
Microsoft® Windows® for 64-Bit Itanium-based Systems
Microsoft Windows Server 2003 Datacenter 64-bit Edition
Microsoft Windows Server 2003 Enterprise 64-bit Edition
Microsoft Windows XP 64-bit Edition
Microsoft® Windows® for x64
OS/2
Microsoft Windows 8 Enterprise 32-bit
Microsoft Windows 8 Enterprise x64
Microsoft Windows 8 Pro 32-bit
Microsoft Windows 8 Pro x64
Microsoft Windows 8.1 Enterprise 32-bit
Microsoft Windows 8.1 Enterprise x64
Microsoft Windows 8.1 Pro
Microsoft Windows 8.1 Pro 32-bit
Microsoft Windows 95/98
Microsoft Windows 2000 Advanced Server
Microsoft Windows 2000 Datacenter Server
Microsoft Windows 2000 Server
Microsoft Windows 2000 Professional
Microsoft Windows NT Workstation
Microsoft Windows Server 2003 Datacenter Edition
Microsoft Windows Server 2003 Enterprise Edition
Microsoft Windows Server 2003 Standard Edition
Microsoft Windows Server 2003 for x64
Microsoft Windows Server 2008
Microsoft Windows Server 2008 R2
Microsoft Windows Server 2008 for x64
Microsoft Windows Server 2012 Datacenter
Microsoft Windows Server 2012 R2 Datacenter
Microsoft Windows Server 2012 R2 Std
Microsoft Windows Server 2012 Std
Microsoft Windows XP Professional
Windows 7 Enterprise 32 bit
Windows 7 Enterprise x64
Windows 7 Home Premium 32 bit
Windows 7 Home Premium x64
Windows 7 Professional 32 bit
Windows 7 Professional x64
Windows 7 Ultimate 32 bit
Windows 7 Ultimate x64
Windows Millennium Edition (Me)
Windows Vista
Windows Vista for x64
64-bit Enabled AIX
64-bit Enabled HP-UX
64-bit Enabled Solaris
ABI+ for Intel Architecture
AIX
HP-UX
HP-UX IPF
IRIX
Linux
Linux for x64
Linux on Itanium
OpenVMS Alpha
OpenVMS on HP Integrity
Solaris
Solaris for x64
Tru64 UNIX
* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.