25018 - Plot ROC curve with cutpoint labeling and optimal cutpoint analysis

SUPPORT / SAMPLES & SAS NOTES

Support

Sample 25018: Plot ROC curve with cutpoint labeling and optimal cutpoint analysis

Plot ROC curve with cutpoint labeling and optimal cutpoint analysis

Contents:

Purpose / History / Requirements / Usage / Details / Limitations / References

Beginning in SAS^® Viya^® 2022.10, ROCOPTIONS in the PROC LOGISTIC statement includes options to thin labels and find and display optimal cutpoints.

PURPOSE:

Produce a plot of the Receiver Operating Characteristic (ROC) curve associated with a fitted binary-response model. Label points on the ROC curve using statistic or input variable values. Identify optimal cutpoints on the ROC curve using several optimality criteria such as correct classification, efficiency, cost, and others. Plot optimality criteria against a selected variable.

When only an ROC plot with labeled points is needed, you can often produce the desired plot in PROC LOGISTIC without this macro. Using ODS graphics, PROC LOGISTIC can plot the ROC curve of a model whether applied to the data used to fit the model or to additional data scored using the fitted model. Specify the PLOTS=ROC option in the PROC LOGISTIC statement, or specify the OUTROC= option in the MODEL and/or SCORE statements. PROC LOGISTIC can label the points in the ROC curve with their observation numbers, predicted probabilities, values from one or more variables, or with values from a single statistic (such as the misclassification rate). Use the PLOTS=ROC(ID=keyword | ID) option and ID statement. For details, see the LOGISTIC documentation (SAS Note 22930).

For plotting the ROC curve with labeled points, the ROCPLOT macro provides these capabilities:

The macro enables you to label points with any number of statistics in addition to variable values.
Any given cutpoint can correspond to many observations in the data. When using variable levels to label points, PROC LOGISTIC uses the last value of each variable found within a set of observations having the same cutpoint value. For numeric variables, the ROCPLOT macro enables you to use any statistic available in PROC MEANS to label the points. For character variables, you can use either the first or last value in the set.
When PROC LOGISTIC is used to label the points in an ROC plot and there is a large number of cutpoints (that is, when the INROC= data set has a large number of observations), the point labels can overlap becoming unreadable. In extreme cases, the density of labels results in a solid blob. The ROCPLOT macro enables you to thin the labels using the THINSENS=, THINY=, and the combination of the THINVAR= and MINDIST= options.
The ROCPLOT macro provides more options to control the appearance of the plot (marker symbol, and various aspects of the plot line and label text).

HISTORY:

The version of the ROCPLOT macro that you are using is displayed when you specify version (or any string) as the first argument. For example:

    %rocplot(v)

The ROCPLOT macro always attempts to check for a later version of itself. If it is unable to do this (such as if there is no active internet connection available), the macro will issue the following message:

    NOTE: Unable to check for newer version

The computations performed by the macro are not affected by the appearance of this message.

Version	Update Notes
1.3	Added ability to display tables of optimal cutpoints using several criteria, to identify the optimal cutpoints on the ROC plot, and to plot the criteria against a variable. Improved default label thinning by requiring a proportion of the maximum Youden criterion to be exceeded (THINY= option). Replaced LINECOLOR=, LINEPATTERN=, and LINETHICKNESS= options with LINESTYLE= option. Replaced MARKERSYMBOL= option with MARKER= option and default marker is now CIRCLEFILLED. Added MARKERSTYLE= option to control appearance of markers. Markers no longer require SAS 9.4 TS1M1 or later.
1.2	Now uses PROC SGPLOT rather than PROC GPLOT to draw the graph. Additional computed statistics for use in labels. Added THINSENS= and THINVAR= options for controlling number of points labeled. Redefined how MINDIST= option operates. Added IDSTAT= option to compute statistics for numeric label variables within observations associated with the same cutpoint. Added OUTROCDATA= option to name the final data used for graphing. Added FORMAT= option specifying the format for numeric values in labels. Added CHARLEN= option specifying the maximum length for character values in labels. Replaced ROFFSET= option with OFFSETMIN= and OFFSETMAX= options to specify additional space at either end of horizontal axis. Renamed OUT= option to INPRED=, OUTROC= option to INROC=, PLOTCHAR= option to MARKERSYMBOL=. Replaced FONT=, SIZE=, and COLOR= options with LABELSTYLE=, LINECOLOR=, LINEPATTERN=, LINETHICKNESS=, and MARKERSYMBOL= options to control appearance of graph elements. Options from previous version are converted as best as possible to new syntax. Now displays markers only at labeled points by default. Added MARKERS= option to turn point markers on or off at labeled points (available in SAS 9.4 TS1M1 or later). Added SPLIT= option to specify a character to separate label values. Added ALTAXISLABEL= option to use true and false positive rates in axis labels. It is no longer necessary to use the ROCEPS= option in PROC LOGISTIC. Grid lines are displayed by default.
1.1	Added automatic check for new version. Added GRID= option to display a grid in the plot. Added version option to display current macro version. Added special variables for sensitivity, specificity, and observation number for use as ID= variables. Added some white space at both ends of each axis (controlled by ROFFSET=). Fixed observations with missing probabilities causing "no match" message. Added (0,0) point to plot to complete the ROC curve. For low-resolution plots, added second plot without labels to make it easier to see the shape of the curve. Fixed multiple observations with same cutpoint values from causing "no match" message. The message should occur only if cutpoints do not match in the OUT= and OUTROC= data sets, presumably because the ROCEPS= was not used as suggested in the message. Removed the position option. Macro controls positioning to avoid overlaps. Labels are now angled to improve clarity. Labels are vertical at the right edge and almost horizontal at the left edge. To reduce label clutter, only top and bottom points in a vertical stack are labeled. Label of top point is above-left. Label of bottom point is below-right. If two points have specificity values that differ by less than the MINDIST= value, one of the points is not labeled. Points more than the MINDIST= value apart are labeled.
1.0	Initial coding.

REQUIREMENTS:

SAS/STAT^® software is needed to create the required input data sets from a modeling procedure. The ROCPLOT macro requires only Base SAS^® software.

USAGE:

Before using the ROCPLOT macro, you must first fit a binary response model and then save the predicted probabilities of the training or scored data. Various procedures such as LOGISTIC, GAM, PROBIT, GENMOD, GLIMMIX can fit binary response models. An OUTROC= data set must also be produced using PROC LOGISTIC.

When using PROC LOGISTIC to fit the model, specify the OUTROC= option in the MODEL statement and the OUT= and P= options in the OUTPUT statement before using the ROCPLOT macro to plot the ROC curve for the training data. Before using the ROCPLOT macro to plot the ROC curve for a scored (or validation) data set, specify the OUT= and OUTROC= options in the SCORE statement.

When using another procedure to fit the model, save the predicted probabilities in a data set using the modeling procedure. Then in PROC LOGISTIC, specify the variable containing the predicted probabilities in the PRED= option of an ROC statement. Also specify the OUTROC= option in the MODEL statement with a WHERE clause. See the Examples tab for an example of plotting the ROC curve for a model fit using PROC GAM.

Follow the instructions in the Downloads tab of this sample to save the ROCPLOT macro definition. Replace the text within quotation marks in the following statement with the location of the ROCPLOT macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the ROCPLOT macro and make it available for use:

   %inc "<location of your file containing the ROCPLOT macro>";

Following this statement, you can call the ROCPLOT macro. See the Results tab for examples.

The following parameters are required when using the ROCPLOT macro:

inpred=data-set-name

Name of the data set containing a variable of predicted probabilities from a fitted binary model.

p=variable

Name of the variable in the INPRED= data set that contains the predicted probabilities.

inroc=data-set-name

Name of the OUTROC= data set from PROC LOGISTIC.

id=list

List of variables and/or statistics to label the cutpoints. At least one variable or statistic must be specified. Separate items with spaces. Note that the cutpoint labels can become quite long if you specify many items. Use the FORMAT= and/or CHARLEN= options to restrict the length of the values within the label. You can specify any of the variables in the INPRED= data set, but such variables should not begin or end with an underscore character (_). You can also specify any of the following statistics, which are displayed using the format specified in the FORMAT= option. The appearance of the label can be controlled with the LABELSTYLE= option.

_OBS_, the observation number
_CUTPT_, the cutpoint (applied to the predicted probabilities to produce predicted events and nonevents)
_SENS_, the sensitivity, or true positive rate (the proportion of observed event responses that were predicted as events)
_SPEC_, the specificity (the proportion of observed nonevent responses that were predicted as nonevents)
_CSPEC_, 1 - specificity, or false positive rate (the proportion of observed nonevent responses that were predicted as events)
_FPOS_, the proportion of false positives (the proportion of predicted event responses that were observed as nonevents)
_FNEG_, the proportion of false negatives (the proportion of predicted nonevent responses that were observed as events)
_PPRED_, the positive predictive value (1-_FPOS_, the proportion of predicted event responses that were observed as events)
_NPRED_, the negative predictive value (1-_FNEG_, the proportion of predicted nonevent responses that were observed as nonevents)
_MISCLASS_, the proportion of misclassified observations (1-_CORRECT_)
_CORRECT_, the proportion of correctly classified observations
_DIST01_, the distance from the cutpoint to the sensitivity=0, 1-specificity=1 point (upper-left corner of the ROC plot).
_YOUDEN_, the Youden index (the vertical distance from the uninformative diagonal to the cutpoint).
_SESPDIFF_, the absolute difference between sensitivity and specificity.

You can also specify any of the following which adds a character symbol in the label of the cutpoint that optimizes each criterion.

_OPTCORR_, adds the symbol C to the label when the cutpoint has the maximum correct classification rate.
_OPTDIST_, adds the symbol D to the label when the cutpoint has the minimum distance to the sensitivity=0, 1-specificity=0 point (upper-left corner of the ROC plot).
_OPTY_, adds the symbol Y to the label when the cutpoint has the maximum Youden index (the vertical distance from the uninformative diagonal to the cutpoint).
_OPTSESP_, adds the symbol = to the label when the cutpoint has the minimum absolute difference between sensitivity and specificity.
_OPTCOST_, adds the symbol $ to the label when the cutpoint has the minimum cost.
_OPTMCT_, adds the symbol M to the label when the cutpoint has the minimum misclassification-cost term.
_OPTEFF_, adds the symbol E to the label when the cutpoint has the maximum efficiency.
_OPTALL_, adds all of the above symbols to the label.

You can also specify the statistics in the OUTROC= data set from PROC LOGISTIC. The FORMAT= option does not apply to these statistics.

_PROB_, the cutpoint (same as _CUTPT_, but unformatted)
_SENSIT_, the sensitivity (same as _SENS_, but unformatted)
_1MSPEC_, the specificity (same as _SPEC_, but unformatted)
_POS_, the number of observed event responses predicted as events
_NEG_, the number of observed nonevent responses predicted as nonevents
_FALPOS_, the number of observed nonevent responses predicted as events
_FALNEG_, the number of observed event responses predicted as nonevents

The following are optional parameters except as noted:

Parameters affecting cutpoint labeling

idstat=keyword-list: Specifies a list of statistics associated with each variable in the ID= list. The computed statistic value is used in cutpoint labels. You can specify any statistic keyword available in PROC MEANS such as mean, median, mode, range, and so on. Each statistic in the list applies to the item in the same position in the ID= option list. For ID= option items that are statistics (such as _SENS_) or optimality symbols (such as _OPTCORR_), specify a missing value (.). If you specify _OPTALL_ in the ID= option, specify seven missing values. If the IDSTAT= option is omitted, the median is computed for numeric variables and the first value is used for character variables.
format=format: Specifies the format for all numeric values in the labels of cutpoints. This format is not applied to statistics in the OUTROC= data set from PROC LOGISTIC (_PROB_, _SENSIT_, _1MSPEC_, _POS_, _NEG_, _FALPOS_, and _FALNEG_). The default is FORMAT=BEST5. .
charlen=value: Specifies the maximum length for all character values in the labels of cutpoints. Value must be an integer greater than or equal to 1. The default is CHARLEN=5.
split=character: Specifies the character used to separate the values in the labels of cutpoints. For example, SPLIT=/ uses the slash character to separate the label values. Do not use quotation marks (' or "). For characters that could be interpreted as part of the macro call syntax (comma, parenthesis), specify the character in %str(). For example: SPLIT=%STR(,) uses the comma as a separator. The default character is a blank space.
thinsens=value: As the data are read in sensitivity order, cutpoints are labeled in the ROC plot when the sensitivity value changes by more than value. Note that labeling also requires a cutpoint to have a Youden index exceeding the proportion specified in the THINY= option. Value must be between 0 and 1. The default is THINSENS=0.05. Specifying THINSENS=0 does no thinning (reduction) of labels by the sensitivity change criterion. A note is printed in the SAS Log indicating the number of points labeled by this method.
thiny=value: Cutpoints are labeled in the ROC plot if their Youden indexes exceed (value×100)% of the maximum observed Youden criterion for all cutpoints. The Youden index is the vertical distance from the uninformative diagonal to the cutpoint. Cutpoints below the diagonal have negative Youden indexes. Value must be between -1 and 1. The default is THINY=0.5 (50% of the maximum). Specifying THINY=-1 does no thinning (reduction) of labels by the Youden criterion. A note is printed in the SAS Log indicating the number of points labeled by this method.
thinvar=variable: Specifies a numeric variable to be used in determining whether additional cutpoints should be labeled in the ROC plot. The THINVAR= variable can be any numeric variable in the INPRED= data set. MINDIST= is required when THINVAR= is specified. As the data are read in the order of the THINVAR= variable, cutpoints are labeled when the value of the THINVAR= variable changes by more than the MINDIST= value. If not specified, cutpoints are labeled only as determined by the THINSENS= and THINY= options. A note is printed in the SAS Log indicating the number of cutpoints labeled by this method.
mindist=value: Required if THINVAR= is specified. As the data are read in the order of the THINVAR= variable, cutpoints are labeled in the ROC plot when the value of the THINVAR= variable changes by value or more. Specifying MINDIST=0 labels every cutpoint.

Parameters for displaying optimal cutpoints

optcrit=correct | dist | youden | sespdiff | cost | mct | eff | all: Specifies one or more criteria for finding optimal cutpoints. Separate multiple criteria with spaces. A symbol is placed on the ROC curve at each optimal cutpoint. A table of optimal cutpoints is also provided. ALL requests all seven criteria. CORRECT maximizes the correct classification rate (symbol: C). DIST minimizes the distance from the ROC curve to the "perfect" point at the upper-left corner where sensitivity=1 and 1-specificity=0 (symbol: D). YOUDEN maximizes the Youden index — the vertical distance from the uninformative diagonal to ROC curve (symbol: Y). SESPDIFF minimizes the absolute difference between sensitivity and specificity (symbol: =). COST maximizes a criterion that minimizes the cost (symbol: $). MCT minimizes the misclassification-cost term (symbol: M). EFF maximizes the efficiency (symbol: E). For the cost and MCT criteria, the COSTRATIO= option must be specified. See the Details section below. If omitted, no optimal cutpoints are presented.
costratio=value | value-list: Required when the cost or MCT optimality criterion is requested. Specifies one or more cost ratios used in the computation of these criteria as discussed in the Details section below. Separate multiple values with spaces.
pevent=value | value-list: Specifies one or more prevalence (rate of event occurrence) values used in computing the cost, MCT, and efficiency optimality criteria. If omitted, the prevalence in the observed data is used.
optbyx=yes | panelall | paneleach | no: OPTBYX=YES produces separate plots of the optimality criteria requested in OPTCRIT= against the variable specified in the X= option. The X= option is required if the OPTBYX= option is specified. OPTBYX=PANELALL produces a single panel containing the criterion plots. OPTBYX=PANELEACH produces separate plots for each criterion like OPTBYX=YES, but for the cost, MCT, and efficiency criteria, a panel of plots for each cost ratio and/or prevalence combination is produced. OPTBYX=NO produces no plots. The default is OPTBYX=NO.
x=variable: Specifies one variable against which the optimality criteria requested in OPTCRIT= are plotted. The X= option is required when the OPTBYX= option is specified. The specified variable can be any variable in the INPRED= data set.
multoptplot=yes | panelall | no: Applies only to the cost, MCT, and efficiency optimality criteria. MULTOPTPLOT=YES produces separate plots of these optimality criteria. For the cost and MCT criteria, each plot graphs the criterion against the cost ratios with separate lines for each prevalence. For the efficiency criterion, the efficiency is plotted against prevalence. Each point in the plot is the optimal cutpoint at each cost ratio and prevalence combination (for cost and MCT criteria) or at each prevalence value (for the efficiency criterion). MULTOPTPLOT=PANELALL produces a single panel containing the criterion plots. The optimal cutpoints in these plots are labeled by the ID= variables. No plot is produced if there is only one cost ratio and one prevalence value. MULTOPTPLOT=NO produces no plots. The default is MULTOPTPLOT=NO.
multoptlist=yes | no: Applies only to the cost, MCT, and efficiency optimality criteria. MULTOPTLIST=YES produces a table of optimal cutpoints for each of these optimality criteria. Each table gives the optimal cutpoints for the criterion at each cost ratio and prevalence combination (for cost and MCT criteria) or at each prevalence value (for the efficiency criterion). For each optimal cutpoint in these tables, its value on the criterion and its values on the ID= variables are given. MULTOPTLIST=NO produces no tables. The default is MULTOPTLIST=NO.

Parameters affecting plot appearance

plottype=high | low | none: Affects only the display of the ROC plot. PLOTTYPE=HIGH produces a high-resolution ROC plot using PROC SGPLOT. PLOTTYPE=LOW produces a pair of low-resolution, character-based graphs using PROC PLOT. The first graph plots the ROC curve without labeled points to more clearly show the shape of the curve. The second graph plots the ROC curve with labeled points. None of the following appearance options affect low-resolution plots except the ALTAXISLABEL= option. PLOTTYPE=NONE causes no ROC plot to be produced. The default is PLOTTYPE=HIGH.
linestyle=attribute-option list: Use this option to specify attributes (COLOR=, PATTERN=, or THICKNESS=) of the line in plots containing a single line. Plots with multiple lines are not affected. Plots of the cost, MCT, and efficiency criteria can have multiple lines when multiple cost ratios and/or prevalences are specified. Separate multiple option=value attributes with spaces. For example: LINESTYLE=COLOR=RED THICKNESS=2. For more on specifying line attributes, see the "Commonly Used Attribute Options" section of the SAS ODS Graphics: Procedures Guide (SAS Note 22930).
labelstyle=attribute-option-list: Use this option to specify text attributes (COLOR=, FAMILY=, SIZE=, STYLE=, or WEIGHT=) of cutpoint labels. Separate multiple option=value attributes with spaces. For example: LABELSTYLE=COLOR=BLUE WEIGHT=BOLD. For more on specifying text attributes, see the "Commonly Used Attribute Options" section of the SAS ODS Graphics: Procedures Guide (SAS Note 22930).
optsymbolstyle=attribute-option-list: Use this option to specify text attributes (COLOR=, FAMILY=, SIZE=, STYLE=, or WEIGHT=) of the character symbols marking optimal cutpoints on the ROC plot. Separate multiple option=value attributes with spaces. For example: OPTSYMBOLSTYLE=SIZE=14 WEIGHT=BOLD. To prevent symbols being drawn on the ROC curve, specify SIZE=0. For more on specifying text attributes, see the "Commonly Used Attribute Options" section of the SAS ODS Graphics: Procedures Guide (SAS Note 22930).
markerstyle=attribute-option-list: Use this option to specify marker attributes (COLOR=, SIZE=, SYMBOL=) in plots containing a single line. Separate multiple option=value attributes with spaces. For example: MARKERSTYLE=COLOR=RED SIZE=4. Note that if you specify the SYMBOL= attribute, it overrides the value of the MARKER= option. For more on specifying marker attributes, see the "Commonly Used Attribute Options" section of the SAS ODS Graphics: Procedures Guide (SAS Note 22930).
markers=yes|no: Specifies whether a symbol is used to mark cutpoints in all plots. The default is MARKERS=YES.
marker=symbol-name: Specifies the symbol used to mark labeled points in all plots. The default is MARKER=CIRCLEFILLED. See the list of available symbol names in the "Commonly Used Attribute Options" section of the SAS ODS Graphics: Procedures Guide (SAS Note 22930).
offsetmin=value: Specifies the amount of extra space added to the left end of the horizontal (1-Specificity) axis in the ROC plot. Value should be a proportion of axis length between 0 and 1. The default is OFFSETMIN=0.1.
offsetmax=value: Specifies the amount of extra space added to the right end of the horizontal (1-Specificity) axis in the ROC plot. Value should be a proportion of axis length between 0 and 1. The default is OFFSETMAX=0.1.
grid=yes | no: Specifies whether a set of grid lines is added to the ROC plot. The default is GRID=YES.
altaxislabel=yes|no: Specifies whether to label the ROC plot axes as Sensitivity and 1-Specificity (ALTAXISLABEL=NO), or as True Positive Rate and False Positive Rate (ALTAXISLABEL=YES). The default is ALTAXISLABEL=NO.

Other parameters

outrocdata=data-set-name: Specifies the name of the data set used for plotting the ROC curve. The default is OUTROCDATA=_ROCPLOT.
round=value: Specifies to what position the probability values in the INPRED= and INROC= data sets are rounded. It also defines the maximum amount that a cutpoint optimality value can differ from the maximum or minimum observed value in order to be considered optimal. Increasing value increases the possibility of ties. The default is ROUND=1E-12.

DETAILS:

The receiver operating characteristic (ROC) curve is a diagnostic tool for assessing the ability of a binary response model to discriminate between events and nonevents. If the model discriminates perfectly, the ROC curve passes through the (0,1) point in the upper-left corner of the plot, and the area below the curve is one. If the model has no discriminating ability, the ROC curve follows a diagonal line from (0,0) to (1,1). Each point on the ROC curve provides the sensitivity (true positive) and 1-specificity (false positive) rates associated with a cutoff on the probability scale (called a cutpoint). Each cutpoint allows all observations to be classified as either predicted events or predicted nonevents.

The ROC plots and analyses available in PROC LOGISTIC and the ROCPLOT macro use the empirical ROC curve. The empirical curve has a finite set of distinct points. Each point corresponds to the predicted probability for an observation in the data set used to fit the model. Consequently, the optimal cutpoint on any criterion found the ROCPLOT macro is restricted to one of the observations. The advantage to this is that no distribution assumptions are made in the ROC analysis and curve construction. The disadvantage is that there could be multiple cutpoints for a given optimality criterion, and as stated, only observed values of the predictor can be selected as optimal. By assuming a distribution for the predictor (typically, a normal distribution), the ROC curve is then continuous and monotonic. Further, an optimal cutpoint is unique and can be associated with an unobserved value of the predictor. See Example 2 in the Results tab and the references provided there regarding this approach.

Cutpoint labeling and label thinning

When there is a large number of cutpoints (that is, when the INROC= data set has a large number of observations), the point labels can overlap and become unreadable. This can happen when the model includes continuous predictors or when there are many distinct predictor values or combinations of values. With a very large number of cutpoints, labeling all cutpoints would result in a solid blob of unreadable text. The ROCPLOT macro offers two methods for thinning (reducing) the labeling of cutpoints so that they are readable. Note that requested optimal cutpoints are labeled even when maximal thinning turns off all cutpoint labeling.

The default method has two labeling requirements. First, a cutpoint adjacent to a labeled cutpoint must have a minimum change in sensitivity (vertical separation) before it is labeled. This is controlled by the THINSENS= option, which requires a sensitivity change of 0.05 by default. THINSENS=0 turns off this requirement. Second, a cutpoint must have a Youden value greater than a specified proportion of the maximum observed Youden value. The maximum Youden index is the largest vertical distance from the uninformative diagonal to the ROC curve. This proportion is specified in the THINY= option, which requires a Youden value greater than 50% of the maximum by default. THINY=-1 turns off this requirement. The number of cutpoints labeled by this method is given in a note displayed in the SAS log. No cutpoints are labeled by this method when THINSENS=1 and THINY=1. When both requirements are turned off, the ROCPLOT macro typically labels the same cutpoints as PROC LOGISTIC when labeling is requested.

Another thinning method is available that can be used in addition to the above method (or instead of, by specifying THINSENS=1 and THINY=1). By specifying the THINVAR= and MINDIST= options, cutpoints are labeled when the value of the THINVAR= variable changes by the MINDIST= amount or more. This is in addition to any points labeled by the above method. You can request that all points be labeled by specifying any variable in the THINVAR= option and MINDIST=0. The number of additional cutpoints labeled by this method is given in a note displayed in the SAS log. When the THINVAR= option is not specified, no additional cutpoints are labeled by this method.

To increase the number of points that are labeled by default:

Decrease the value of THINSENS= and/or THINY=, and/or
Specify the THINVAR= and MINDIST= options. Smaller MINDIST= values causes more cutpoints to be labeled.

Some experimentation with these options might be needed to achieve the desired labeling pattern.

Duplicate labels and using statistics in labels

Note that many observations with different values of the ID= variables can be associated with the same cutpoint. To produce a single label for a cutpoint, a statistic is computed over the observations associated with the cutpoint. For a numeric ID= variable, you can specify any statistic keyword available in PROC MEANS (such as MEAN or MEDIAN). For a character ID= variable, you can specify FIRST or LAST to select the first or last value of the variable after sorting. When there are character ID= variables, the ROCPLOT macro sorts by all of the character variables in the order in which they are specified. For a given cutpoint, the first and last values found can depend on the specified order of the character variables in the ID= option.

It is also possible for two or more points on the ROC curve to have the same label. This can happen if multiple points have the same values of the ID= variables but different predicted probabilities. For example, suppose the fitted model has categorical predictors A and B, and ID=A is specified in the ROCPLOT macro. Observations with a given value of A have different predicted probabilities depending on their values for B. Since the points on the ROC curve are the distinct predicted probabilities for the observed data, there are points for each combination of A and B levels, such as A1B1, A1B2, A2B1, and A2B2. If the points are labeled using only the value of A, then the distinct points A1B1 and A1B2 are both labeled 1.

Optimal Cutpoints

For each observation in the data, a model for a binary response results in a predicted probability of the event (such as disease or death) occurring. While the predicted probability is a continuous value between 0 and 1, it is often desirable to provide a binary prediction of whether the event will occur or not. This amounts to choosing a cutpoint on the predicted probability scale. If the predicted probability exceeds the chosen cutpoint, then the event response is predicted. Otherwise, the predicted response is the nonevent.

The optimal choice of a cutpoint on the predicted probabilities can be made in many ways depending on the choice of criterion to optimize. Seven optimality criteria are available in the ROCPLOT macro. Specify the desired criteria in the OPTCRIT= option.

When a binary prediction is made there are four possible outcomes: correct prediction of an event (TP), correct prediction of a nonevent (TN), incorrect prediction of an event when the true response is a nonevent (FP), and incorrect prediction of a nonevent when the true response is an event (FN). Also, the prevalence, p, of the event can be taken into account if it might differ from what is observed in the data. If it is possible to assign cost or benefit values to these outcomes, then a cutpoint can be chosen that minimizes the cost at a given event prevalence. The following cost criterion, when maximized, minimizes the cost (Metz, 1978):

Sensitivity - m×(1-Specificity) ,

where m = costratio×(1-p)/p, costratio = (c_FP-c_TN)/(c_FN-c_TP), and c_FP, c_TN, c_FN, and c_TP are the above outcome costs. This criterion is requested by specifying OPTCRIT=COST and one or more cost ratio values in the COSTRATIO= option. One or more prevalence values might also be specified in the PEVENT= option, or the observed prevalence is used if not. "$" is added to the label of the cutpoint that optimizes the cost criterion when _OPTCOST_ is specified in the ID= option.

Another cost- and prevalence-based criterion is the misclassification-cost term (MCT):

(c_FN/c_FP)×p×(1-Sensitivity)+(1-p)×(1-Specificity))

Note that when computing MCT, the c_TN and c_TP components of the specified cost ratio(s) are assumed to be zero. Specifying OPTCRIT=MCT requests the MCT criterion, and adding _OPTMCT_ in the ID= option adds "M" to the MCT-optimal cutpoint label.

A criterion that involves the prevalence but not costs is the efficiency:

p×Sensitivity + (1-p)×Specificity

Efficiency is a weighted average of Sensitivity and Specificity. Specifying OPTCRIT=EFF requests the efficiency criterion, and adding _OPTEFF_ in the ID= option adds "E" to the label of the cutpoint with maximum efficiency.

When the prevalence observed in the data is used, the efficiency is equivalent to the correct classification rate. Specifying OPTCRIT=CORRECT requests the correct classification criterion, and adding _OPTCORR_ in the ID= option adds "C" to the label of the cutpoint with the maximum correct classification rate.

Some criteria do not depend on the cost ratio or the prevalence. These criteria weight sensitivity and specificity equally. They effectively assume the false positive to false negative cost ratio is 1 and the event prevalence is 0.5.

Youden index = Sensitivity + Specificity - 1
The Youden index is the height of the cutpoint above the diagonal line that represents an uninformative model. When the cost ratio equals 1 and the prevalence equals 0.5, the cost criterion reduces to the Youden index. When prevalence equals 0.5, the cutpoint that maximizes the Youden index also maximizes the efficiency. Specifying OPTCRIT=YOUDEN requests the Youden criterion, and adding _OPTY_ in the ID= option adds "Y" to the label of the cutpoint with the maximum Youden index.
Distance to (0,1) = √(1-Sensitivity)²+(1-Specificity)²
This is the distance from the "perfect" point at the upper-left corner of the ROC plot where 1-Specificity=0 and Sensitivity=1, . Specifying OPTCRIT=DIST requests the distance criterion, and adding _OPTDIST_ in the ID= option adds "D" to the label of the cutpoint closest to the (0,1) point of the ROC graph.
Sensitivity, Specificity equality = abs(Sensitivity - Specificity)
Specifying OPTCRIT=SESPDIFF requests this criterion, and adding _OPTSESP_ in the ID= option adds "=" to the label of the cutpoint with minimum absolute difference between sensitivity and specificity.

You can ensure that the cutpoint(s) optimal for a given criterion are labeled in the ROC plot by specifying the corresponding criterion variable in the ID= option. For example, adding _OPTY_ in the ID= option will ensure that the cutpoint(s) with maximum Youden index are labeled and that their labels include the symbol, Y. By default, specifying the criterion in the OPTCRIT= option also places the symbol directly on the cutpoint in the ROC curve.

For the cost and MCT criteria, there is an optimal cutpoint for each combination of cost ratio and prevalence. These cutpoints are presented in the "Cost-Based Optimal Cutpoints" and "MCT-Based Optimal Cutpoints" tables. Similarly for the efficiency criterion which has an optimal cutpoint at each prevalence, these cutpoints are presented in the "Efficiency-Based Optimal Cutpoints" table. Note that if multiple cutpoints are tied for the optimal criterion value at each cost ratio and/or prevalence, all such cutpoints are presented. For these criteria the single cutpoint (or multiple if tied) that optimizes the criterion over all cost ratios and/or prevalences is presented in the "Optimal Cutpoints" table that appears after the ROC plot. For the other criteria, the single cutpoint (or multiple if tied) that optimizes the criterion also appears in this table.

Each of the criteria requested in the OPTCRIT= option can be plotted against a variable in the INPRED= data set by specifying the OPTBYX=YES, OPTBYX=PANELALL, or OPTBYX=PANELEACH option. Specify the variable to plot the criteria against in the X= option.

Additionally for each of the cost, MCT and efficiency criteria, all of the optimal cutpoints for the specified cost ratio and/or prevalences can be plotted by specifying the MULTOPTPLOT=YES or MULTOPTPLOT=PANELALL option. Lists of all optimal cutpoints for these criteria are provided by specifying MULTOPTLIST=YES.

The OUTROCDATA= data set

The OUTROCDATA= data set contains all the information needed to produce the ROC plot. The _THINID variable in this data set contains the labels for the points that are labeled in the plot. The _ID variable contains labels for all points. To create only this data set and prevent production of the ROC plot, specify the PLOTTYPE=NONE option.

LIMITATIONS:

The ROCPLOT macro cannot handle input data resulting from the use of BY processing in the model fitting procedure. FOOTNOTE statements specified before calling the ROCPLOT macro are ignored.

REFERENCES:

DeLong, E.R., DeLong, D.M., and Clarke-Pearson, D.L. (1988), "Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach," Biometrics, 44, 837-845.

Greiner, M., Pfeiffer D., Smith R.D. (2000), "Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests," Preventive Veterinary Medicine, 45, 23-41.

Hanley, J.A., and McNeil, B.J. (1982), "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, 143, 29-36.

Hanley, J.A., and McNeil, B.J. (1983), "A method of comparing the areas under receiver operating characteristic curves derived from the same cases," Radiology, 148, 839-843.

Metz, C.E. (1978), "Basic principles of ROC analysis," Seminars in Nuclear Medicine, 8(4), 283-298.

Zweig, M.H., and Campbell, G. (1993), "Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine," Clinical Chemistry, 39(4), 561-577.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

Date Modified:	2022-12-04 08:12:24
Date Created:	2005-01-13 15:03:46

Product Family	Product	Host	SAS Release
Product Family	Product	Host	Starting	Ending
SAS System	SAS/STAT	z/OS	9.3 TS1M0
		Microsoft® Windows® for x64	9.3 TS1M0
		64-bit Enabled HP-UX	9.3 TS1M0
		Windows 7 Enterprise 32 bit	9.3 TS1M0
		Windows 7 Enterprise x64	9.3 TS1M0
		Windows 7 Home Premium 32 bit	9.3 TS1M0
		Windows 7 Home Premium x64	9.3 TS1M0
		Windows 7 Professional 32 bit	9.3 TS1M0
		Windows 7 Professional x64	9.3 TS1M0
		Windows 7 Ultimate 32 bit	9.3 TS1M0
		Windows 7 Ultimate x64	9.3 TS1M0
		Windows Vista	9.3 TS1M0
		Windows Vista for x64	9.3 TS1M0
		64-bit Enabled AIX	9.3 TS1M0
		Microsoft Windows XP Professional	9.3 TS1M0
		Microsoft Windows Server 2008 for x64	9.3 TS1M0
		Microsoft Windows Server 2008 R2	9.3 TS1M0
		Microsoft Windows Server 2008	9.3 TS1M0
		Microsoft Windows Server 2003 for x64	9.3 TS1M0
		Microsoft Windows Server 2003 Standard Edition	9.3 TS1M0
		Microsoft Windows Server 2003 Enterprise Edition	9.3 TS1M0
		Microsoft Windows Server 2003 Datacenter Edition	9.3 TS1M0
		64-bit Enabled Solaris	9.3 TS1M0
		HP-UX IPF	9.3 TS1M0
		Linux	9.3 TS1M0
		Linux for x64	9.3 TS1M0
		Solaris for x64	9.3 TS1M0

Support

Sample 25018: Plot ROC curve with cutpoint labeling and optimal cutpoint analysis

Plot ROC curve with cutpoint labeling and optimal cutpoint analysis

Operating System and Release Information

Follow Us

What is...