![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Contents: | Purpose / History / Requirements / Usage / Details / Limitations / References |
Beginning in SAS® Viya® 2022.10, ROCOPTIONS in the PROC LOGISTIC statement includes options to thin labels and find and display optimal cutpoints.
When only an ROC plot with labeled points is needed, you can often produce the desired plot in PROC LOGISTIC without this macro. Using ODS graphics, PROC LOGISTIC can plot the ROC curve of a model whether applied to the data used to fit the model or to additional data scored using the fitted model. Specify the PLOTS=ROC option in the PROC LOGISTIC statement, or specify the OUTROC= option in the MODEL and/or SCORE statements. PROC LOGISTIC can label the points in the ROC curve with their observation numbers, predicted probabilities, values from one or more variables, or with values from a single statistic (such as the misclassification rate). Use the PLOTS=ROC(ID=keyword | ID) option and ID statement. For details, see the LOGISTIC documentation (SAS Note 22930).
For plotting the ROC curve with labeled points, the ROCPLOT macro provides these capabilities:
The version of the ROCPLOT macro that you are using is displayed when you specify version (or any string) as the first argument. For example:
%rocplot(v)
The ROCPLOT macro always attempts to check for a later version of itself. If it is unable to do this (such as if there is no active internet connection available), the macro will issue the following message:
NOTE: Unable to check for newer version
The computations performed by the macro are not affected by the appearance of this message.
Version
|
Update Notes
|
1.3 |
|
1.2 |
|
1.1 |
|
1.0 | Initial coding. |
When using PROC LOGISTIC to fit the model, specify the OUTROC= option in the MODEL statement and the OUT= and P= options in the OUTPUT statement before using the ROCPLOT macro to plot the ROC curve for the training data. Before using the ROCPLOT macro to plot the ROC curve for a scored (or validation) data set, specify the OUT= and OUTROC= options in the SCORE statement.
When using another procedure to fit the model, save the predicted probabilities in a data set using the modeling procedure. Then in PROC LOGISTIC, specify the variable containing the predicted probabilities in the PRED= option of an ROC statement. Also specify the OUTROC= option in the MODEL statement with a WHERE clause. See the Examples tab for an example of plotting the ROC curve for a model fit using PROC GAM.
Follow the instructions in the Downloads tab of this sample to save the ROCPLOT macro definition. Replace the text within quotation marks in the following statement with the location of the ROCPLOT macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the ROCPLOT macro and make it available for use:
%inc "<location of your file containing the ROCPLOT macro>";
Following this statement, you can call the ROCPLOT macro. See the Results tab for examples.
The following parameters are required when using the ROCPLOT macro:
You can also specify any of the following which adds a character symbol in the label of the cutpoint that optimizes each criterion.
You can also specify the statistics in the OUTROC= data set from PROC LOGISTIC. The FORMAT= option does not apply to these statistics.
The following are optional parameters except as noted:
Parameters affecting cutpoint labeling
Parameters for displaying optimal cutpoints
Parameters affecting plot appearance
Other parameters
The ROC plots and analyses available in PROC LOGISTIC and the ROCPLOT macro use the empirical ROC curve. The empirical curve has a finite set of distinct points. Each point corresponds to the predicted probability for an observation in the data set used to fit the model. Consequently, the optimal cutpoint on any criterion found the ROCPLOT macro is restricted to one of the observations. The advantage to this is that no distribution assumptions are made in the ROC analysis and curve construction. The disadvantage is that there could be multiple cutpoints for a given optimality criterion, and as stated, only observed values of the predictor can be selected as optimal. By assuming a distribution for the predictor (typically, a normal distribution), the ROC curve is then continuous and monotonic. Further, an optimal cutpoint is unique and can be associated with an unobserved value of the predictor. See Example 2 in the Results tab and the references provided there regarding this approach.
Cutpoint labeling and label thinning
When there is a large number of cutpoints (that is, when the INROC= data set has a large number of observations), the point labels can overlap and become unreadable. This can happen when the model includes continuous predictors or when there are many distinct predictor values or combinations of values. With a very large number of cutpoints, labeling all cutpoints would result in a solid blob of unreadable text. The ROCPLOT macro offers two methods for thinning (reducing) the labeling of cutpoints so that they are readable. Note that requested optimal cutpoints are labeled even when maximal thinning turns off all cutpoint labeling.
The default method has two labeling requirements. First, a cutpoint adjacent to a labeled cutpoint must have a minimum change in sensitivity (vertical separation) before it is labeled. This is controlled by the THINSENS= option, which requires a sensitivity change of 0.05 by default. THINSENS=0 turns off this requirement. Second, a cutpoint must have a Youden value greater than a specified proportion of the maximum observed Youden value. The maximum Youden index is the largest vertical distance from the uninformative diagonal to the ROC curve. This proportion is specified in the THINY= option, which requires a Youden value greater than 50% of the maximum by default. THINY=-1 turns off this requirement. The number of cutpoints labeled by this method is given in a note displayed in the SAS log. No cutpoints are labeled by this method when THINSENS=1 and THINY=1. When both requirements are turned off, the ROCPLOT macro typically labels the same cutpoints as PROC LOGISTIC when labeling is requested.
Another thinning method is available that can be used in addition to the above method (or instead of, by specifying THINSENS=1 and THINY=1). By specifying the THINVAR= and MINDIST= options, cutpoints are labeled when the value of the THINVAR= variable changes by the MINDIST= amount or more. This is in addition to any points labeled by the above method. You can request that all points be labeled by specifying any variable in the THINVAR= option and MINDIST=0. The number of additional cutpoints labeled by this method is given in a note displayed in the SAS log. When the THINVAR= option is not specified, no additional cutpoints are labeled by this method.
To increase the number of points that are labeled by default:
Some experimentation with these options might be needed to achieve the desired labeling pattern.
Duplicate labels and using statistics in labels
Note that many observations with different values of the ID= variables can be associated with the same cutpoint. To produce a single label for a cutpoint, a statistic is computed over the observations associated with the cutpoint. For a numeric ID= variable, you can specify any statistic keyword available in PROC MEANS (such as MEAN or MEDIAN). For a character ID= variable, you can specify FIRST or LAST to select the first or last value of the variable after sorting. When there are character ID= variables, the ROCPLOT macro sorts by all of the character variables in the order in which they are specified. For a given cutpoint, the first and last values found can depend on the specified order of the character variables in the ID= option.
It is also possible for two or more points on the ROC curve to have the same label. This can happen if multiple points have the same values of the ID= variables but different predicted probabilities. For example, suppose the fitted model has categorical predictors A and B, and ID=A is specified in the ROCPLOT macro. Observations with a given value of A have different predicted probabilities depending on their values for B. Since the points on the ROC curve are the distinct predicted probabilities for the observed data, there are points for each combination of A and B levels, such as A1B1, A1B2, A2B1, and A2B2. If the points are labeled using only the value of A, then the distinct points A1B1 and A1B2 are both labeled 1.
Optimal Cutpoints
For each observation in the data, a model for a binary response results in a predicted probability of the event (such as disease or death) occurring. While the predicted probability is a continuous value between 0 and 1, it is often desirable to provide a binary prediction of whether the event will occur or not. This amounts to choosing a cutpoint on the predicted probability scale. If the predicted probability exceeds the chosen cutpoint, then the event response is predicted. Otherwise, the predicted response is the nonevent.
The optimal choice of a cutpoint on the predicted probabilities can be made in many ways depending on the choice of criterion to optimize. Seven optimality criteria are available in the ROCPLOT macro. Specify the desired criteria in the OPTCRIT= option.
When a binary prediction is made there are four possible outcomes: correct prediction of an event (TP), correct prediction of a nonevent (TN), incorrect prediction of an event when the true response is a nonevent (FP), and incorrect prediction of a nonevent when the true response is an event (FN). Also, the prevalence, p, of the event can be taken into account if it might differ from what is observed in the data. If it is possible to assign cost or benefit values to these outcomes, then a cutpoint can be chosen that minimizes the cost at a given event prevalence. The following cost criterion, when maximized, minimizes the cost (Metz, 1978):
Sensitivity - m×(1-Specificity) ,
where m = costratio×(1-p)/p, costratio = (cFP-cTN)/(cFN-cTP), and cFP, cTN, cFN, and cTP are the above outcome costs. This criterion is requested by specifying OPTCRIT=COST and one or more cost ratio values in the COSTRATIO= option. One or more prevalence values might also be specified in the PEVENT= option, or the observed prevalence is used if not. "$" is added to the label of the cutpoint that optimizes the cost criterion when _OPTCOST_ is specified in the ID= option.
Another cost- and prevalence-based criterion is the misclassification-cost term (MCT):
(cFN/cFP)×p×(1-Sensitivity)+(1-p)×(1-Specificity))
Note that when computing MCT, the cTN and cTP components of the specified cost ratio(s) are assumed to be zero. Specifying OPTCRIT=MCT requests the MCT criterion, and adding _OPTMCT_ in the ID= option adds "M" to the MCT-optimal cutpoint label.
A criterion that involves the prevalence but not costs is the efficiency:
p×Sensitivity + (1-p)×Specificity
Efficiency is a weighted average of Sensitivity and Specificity. Specifying OPTCRIT=EFF requests the efficiency criterion, and adding _OPTEFF_ in the ID= option adds "E" to the label of the cutpoint with maximum efficiency.
When the prevalence observed in the data is used, the efficiency is equivalent to the correct classification rate. Specifying OPTCRIT=CORRECT requests the correct classification criterion, and adding _OPTCORR_ in the ID= option adds "C" to the label of the cutpoint with the maximum correct classification rate.
Some criteria do not depend on the cost ratio or the prevalence. These criteria weight sensitivity and specificity equally. They effectively assume the false positive to false negative cost ratio is 1 and the event prevalence is 0.5.
The Youden index is the height of the cutpoint above the diagonal line that represents an uninformative model. When the cost ratio equals 1 and the prevalence equals 0.5, the cost criterion reduces to the Youden index. When prevalence equals 0.5, the cutpoint that maximizes the Youden index also maximizes the efficiency. Specifying OPTCRIT=YOUDEN requests the Youden criterion, and adding _OPTY_ in the ID= option adds "Y" to the label of the cutpoint with the maximum Youden index.
This is the distance from the "perfect" point at the upper-left corner of the ROC plot where 1-Specificity=0 and Sensitivity=1, . Specifying OPTCRIT=DIST requests the distance criterion, and adding _OPTDIST_ in the ID= option adds "D" to the label of the cutpoint closest to the (0,1) point of the ROC graph.
Specifying OPTCRIT=SESPDIFF requests this criterion, and adding _OPTSESP_ in the ID= option adds "=" to the label of the cutpoint with minimum absolute difference between sensitivity and specificity.
You can ensure that the cutpoint(s) optimal for a given criterion are labeled in the ROC plot by specifying the corresponding criterion variable in the ID= option. For example, adding _OPTY_ in the ID= option will ensure that the cutpoint(s) with maximum Youden index are labeled and that their labels include the symbol, Y. By default, specifying the criterion in the OPTCRIT= option also places the symbol directly on the cutpoint in the ROC curve.
For the cost and MCT criteria, there is an optimal cutpoint for each combination of cost ratio and prevalence. These cutpoints are presented in the "Cost-Based Optimal Cutpoints" and "MCT-Based Optimal Cutpoints" tables. Similarly for the efficiency criterion which has an optimal cutpoint at each prevalence, these cutpoints are presented in the "Efficiency-Based Optimal Cutpoints" table. Note that if multiple cutpoints are tied for the optimal criterion value at each cost ratio and/or prevalence, all such cutpoints are presented. For these criteria the single cutpoint (or multiple if tied) that optimizes the criterion over all cost ratios and/or prevalences is presented in the "Optimal Cutpoints" table that appears after the ROC plot. For the other criteria, the single cutpoint (or multiple if tied) that optimizes the criterion also appears in this table.
Each of the criteria requested in the OPTCRIT= option can be plotted against a variable in the INPRED= data set by specifying the OPTBYX=YES, OPTBYX=PANELALL, or OPTBYX=PANELEACH option. Specify the variable to plot the criteria against in the X= option.
Additionally for each of the cost, MCT and efficiency criteria, all of the optimal cutpoints for the specified cost ratio and/or prevalences can be plotted by specifying the MULTOPTPLOT=YES or MULTOPTPLOT=PANELALL option. Lists of all optimal cutpoints for these criteria are provided by specifying MULTOPTLIST=YES.
The OUTROCDATA= data set
The OUTROCDATA= data set contains all the information needed to produce the ROC plot. The _THINID variable in this data set contains the labels for the points that are labeled in the plot. The _ID variable contains labels for all points. To create only this data set and prevent production of the ROC plot, specify the PLOTTYPE=NONE option.
Greiner, M., Pfeiffer D., Smith R.D. (2000), "Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests," Preventive Veterinary Medicine, 45, 23-41.
Hanley, J.A., and McNeil, B.J. (1982), "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, 143, 29-36.
Hanley, J.A., and McNeil, B.J. (1983), "A method of comparing the areas under receiver operating characteristic curves derived from the same cases," Radiology, 148, 839-843.
Metz, C.E. (1978), "Basic principles of ROC analysis," Seminars in Nuclear Medicine, 8(4), 283-298.
Zweig, M.H., and Campbell, G. (1993), "Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine," Clinical Chemistry, 39(4), 561-577.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.