Contents: | Purpose / History / Requirements / Usage / Details / Limitations / References |
Beginning in SAS® Viya® 2022.10, ROCOPTIONS in the PROC LOGISTIC statement includes options to thin labels and find and display optimal cutpoints.
When only an ROC plot with labeled points is needed, you can often produce the desired plot in PROC LOGISTIC without this macro. Using ODS graphics, PROC LOGISTIC can plot the ROC curve of a model whether applied to the data used to fit the model or to additional data scored using the fitted model. Specify the PLOTS=ROC option in the PROC LOGISTIC statement, or specify the OUTROC= option in the MODEL and/or SCORE statements. PROC LOGISTIC can label the points in the ROC curve with their observation numbers, predicted probabilities, values from one or more variables, or with values from a single statistic (such as the misclassification rate). Use the PLOTS=ROC(ID=keyword | ID) option and ID statement. For details, see the LOGISTIC documentation (SAS Note 22930).
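As a minimal sketch of this built-in approach, the following statements fit a model and label each point on the ROC curve with its predicted probability. The data are the Example 1 data from the Results tab. The choice of the PROB keyword is only for illustration; other keywords label the points with other values.

```sas
/* Example 1 data: Disease events out of N subjects at each Age */
data Data1;
   input Disease N Age;
   datalines;
0 14 25
0 20 35
0 19 45
7 18 55
6 12 65
17 17 75
;

ods graphics on;

/* Plot only the ROC curve and label its points with the predicted probabilities */
proc logistic data=Data1 plots(only)=roc(id=prob);
   model Disease/N = Age;
run;
```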
For plotting the ROC curve with labeled points, the ROCPLOT macro provides these capabilities:
The version of the ROCPLOT macro that you are using is displayed when you specify version (or any string) as the first argument. For example:
%rocplot(v)
The ROCPLOT macro always attempts to check for a later version of itself. If it is unable to do this (such as if there is no active internet connection available), the macro will issue the following message:
NOTE: Unable to check for newer version
The computations performed by the macro are not affected by the appearance of this message.
Version | Update Notes
1.3 |
1.2 |
1.1 |
1.0 | Initial coding.
When using PROC LOGISTIC to fit the model, specify the OUTROC= option in the MODEL statement and the OUT= and P= options in the OUTPUT statement before using the ROCPLOT macro to plot the ROC curve for the training data. Before using the ROCPLOT macro to plot the ROC curve for a scored (or validation) data set, specify the OUT= and OUTROC= options in the SCORE statement.
When using another procedure to fit the model, save the predicted probabilities in a data set using the modeling procedure. Then in PROC LOGISTIC, specify the variable containing the predicted probabilities in the PRED= option of an ROC statement. Also specify the OUTROC= option in the MODEL statement with a WHERE clause. See the Examples tab for an example of plotting the ROC curve for a model fit using PROC GAM.
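The following is a minimal sketch of this second case, not the example from the Examples tab. The data set Preds, the response y, and the probability variable phat are hypothetical stand-ins for predictions saved by another modeling procedure. The NOFIT option is used here on the assumption that no model needs to be fit in PROC LOGISTIC when the probabilities are supplied in the PRED= option.

```sas
/* Hypothetical predictions from another procedure:        */
/* y = observed response, phat = predicted event probability */
data Preds;
   input y phat @@;
   datalines;
0 0.10  0 0.25  1 0.30  0 0.40
1 0.55  0 0.65  1 0.70  1 0.90
;

/* NOFIT suppresses model fitting; the ROC statement builds the  */
/* curve from the supplied probabilities and OUTROC= saves it.   */
/* If the OUTROC= data set holds more than one curve, subset it  */
/* with a WHERE clause on the variable that identifies the curve. */
proc logistic data=Preds;
   model y(event='1') = / nofit outroc=ExtROC;
   roc 'External model' pred=phat;
run;

proc print data=ExtROC noobs;
run;
```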
Follow the instructions in the Downloads tab of this sample to save the ROCPLOT macro definition. Replace the text within quotation marks in the following statement with the location of the ROCPLOT macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the ROCPLOT macro and make it available for use:
%inc "<location of your file containing the ROCPLOT macro>";
Following this statement, you can call the ROCPLOT macro. See the Results tab for examples.
The following parameters are required when using the ROCPLOT macro:
You can also specify any of the following, each of which adds a character symbol to the label of the cutpoint that optimizes the corresponding criterion.
You can also specify the statistics in the OUTROC= data set from PROC LOGISTIC. The FORMAT= option does not apply to these statistics.
The following are optional parameters except as noted:
Parameters affecting cutpoint labeling
Parameters for displaying optimal cutpoints
Parameters affecting plot appearance
Other parameters
The ROC plots and analyses available in PROC LOGISTIC and the ROCPLOT macro use the empirical ROC curve. The empirical curve has a finite set of distinct points. Each point corresponds to the predicted probability for an observation in the data set used to fit the model. Consequently, the optimal cutpoint on any criterion found by the ROCPLOT macro is restricted to one of the observations. The advantage of this is that no distributional assumptions are made in the ROC analysis and curve construction. The disadvantage is that there could be multiple cutpoints for a given optimality criterion, and as stated, only observed values of the predictor can be selected as optimal. By assuming a distribution for the predictor (typically, a normal distribution), the ROC curve is then continuous and monotonic. Further, an optimal cutpoint is unique and can be associated with an unobserved value of the predictor. See Example 2 in the Results tab and the references provided there regarding this approach.
Cutpoint labeling and label thinning
When there is a large number of cutpoints (that is, when the INROC= data set has a large number of observations), the point labels can overlap and become unreadable. This can happen when the model includes continuous predictors or when there are many distinct predictor values or combinations of values. With a very large number of cutpoints, labeling all cutpoints would result in a solid blob of unreadable text. The ROCPLOT macro offers two methods for thinning (reducing) the labeling of cutpoints so that they are readable. Note that requested optimal cutpoints are labeled even when maximal thinning turns off all cutpoint labeling.
The default method has two labeling requirements. First, a cutpoint adjacent to a labeled cutpoint must have a minimum change in sensitivity (vertical separation) before it is labeled. This is controlled by the THINSENS= option, which requires a sensitivity change of 0.05 by default. THINSENS=0 turns off this requirement. Second, a cutpoint must have a Youden value greater than a specified proportion of the maximum observed Youden value. The maximum Youden index is the largest vertical distance from the uninformative diagonal to the ROC curve. This proportion is specified in the THINY= option, which requires a Youden value greater than 50% of the maximum by default. THINY=-1 turns off this requirement. The number of cutpoints labeled by this method is given in a note displayed in the SAS log. No cutpoints are labeled by this method when THINSENS=1 and THINY=1. When both requirements are turned off, the ROCPLOT macro typically labels the same cutpoints as PROC LOGISTIC when labeling is requested.
Another thinning method is available that can be used in addition to the above method (or instead of, by specifying THINSENS=1 and THINY=1). By specifying the THINVAR= and MINDIST= options, cutpoints are labeled when the value of the THINVAR= variable changes by the MINDIST= amount or more. This is in addition to any points labeled by the above method. You can request that all points be labeled by specifying any variable in the THINVAR= option and MINDIST=0. The number of additional cutpoints labeled by this method is given in a note displayed in the SAS log. When the THINVAR= option is not specified, no additional cutpoints are labeled by this method.
To increase the number of points that are labeled by default:
Some experimentation with these options might be needed to achieve the desired labeling pattern.
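The thinning options described above might be used as in the following sketch. It assumes that the ROC1 and OUT data sets have been created as in Example 1 of the Results tab and that the ROCPLOT macro has already been defined with a %INC statement. The values 0.01 and 0.25 are arbitrary illustrations, not recommended settings.

```sas
/* Assumes ROC1 and OUT exist (see Example 1) and that the */
/* ROCPLOT macro has been defined with %INC.               */

/* Turn off both default thinning requirements */
%rocplot(inroc = roc1, inpred = out, p = phat, id = age,
         thinsens = 0, thiny = -1)

/* Relax, rather than turn off, the defaults: label cutpoints that  */
/* differ by at least 0.01 in sensitivity and whose Youden value    */
/* exceeds 25% of the maximum (both values chosen for illustration) */
%rocplot(inroc = roc1, inpred = out, p = phat, id = age,
         thinsens = 0.01, thiny = 0.25)
```

The note in the SAS log reporting the number of labeled cutpoints is a quick way to compare the effect of different settings.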
Duplicate labels and using statistics in labels
Note that many observations with different values of the ID= variables can be associated with the same cutpoint. To produce a single label for a cutpoint, a statistic is computed over the observations associated with the cutpoint. For a numeric ID= variable, you can specify any statistic keyword available in PROC MEANS (such as MEAN or MEDIAN). For a character ID= variable, you can specify FIRST or LAST to select the first or last value of the variable after sorting. When there are character ID= variables, the ROCPLOT macro sorts by all of the character variables in the order in which they are specified. For a given cutpoint, the first and last values found can depend on the specified order of the character variables in the ID= option.
It is also possible for two or more points on the ROC curve to have the same label. This can happen if multiple points have the same values of the ID= variables but different predicted probabilities. For example, suppose the fitted model has categorical predictors A and B, and ID=A is specified in the ROCPLOT macro. Observations with a given value of A have different predicted probabilities depending on their values for B. Since the points on the ROC curve are the distinct predicted probabilities for the observed data, there are points for each combination of A and B levels, such as A1B1, A1B2, A2B1, and A2B2. If the points are labeled using only the value of A, then the distinct points A1B1 and A1B2 are both labeled 1.
Optimal Cutpoints
For each observation in the data, a model for a binary response results in a predicted probability of the event (such as disease or death) occurring. While the predicted probability is a continuous value between 0 and 1, it is often desirable to provide a binary prediction of whether the event will occur or not. This amounts to choosing a cutpoint on the predicted probability scale. If the predicted probability exceeds the chosen cutpoint, then the event response is predicted. Otherwise, the predicted response is the nonevent.
The optimal choice of a cutpoint on the predicted probabilities can be made in many ways depending on the choice of criterion to optimize. Seven optimality criteria are available in the ROCPLOT macro. Specify the desired criteria in the OPTCRIT= option.
When a binary prediction is made there are four possible outcomes: correct prediction of an event (TP), correct prediction of a nonevent (TN), incorrect prediction of an event when the true response is a nonevent (FP), and incorrect prediction of a nonevent when the true response is an event (FN). Also, the prevalence, p, of the event can be taken into account if it might differ from what is observed in the data. If it is possible to assign cost or benefit values to these outcomes, then a cutpoint can be chosen that minimizes the cost at a given event prevalence. The following cost criterion, when maximized, minimizes the cost (Metz, 1978):
Sensitivity - m×(1-Specificity),
where m = costratio×(1-p)/p, costratio = (cFP-cTN)/(cFN-cTP), and cFP, cTN, cFN, and cTP are the above outcome costs. This criterion is requested by specifying OPTCRIT=COST and one or more cost ratio values in the COSTRATIO= option. You can also specify one or more prevalence values in the PEVENT= option; otherwise, the observed prevalence is used. "$" is added to the label of the cutpoint that optimizes the cost criterion when _OPTCOST_ is specified in the ID= option.
Another cost- and prevalence-based criterion is the misclassification-cost term (MCT):
(cFN/cFP)×p×(1-Sensitivity) + (1-p)×(1-Specificity)
Note that when computing MCT, the cTN and cTP components of the specified cost ratio(s) are assumed to be zero. Specifying OPTCRIT=MCT requests the MCT criterion, and adding _OPTMCT_ in the ID= option adds "M" to the MCT-optimal cutpoint label.
A criterion that involves the prevalence but not costs is the efficiency:
p×Sensitivity + (1-p)×Specificity
Efficiency is a weighted average of Sensitivity and Specificity. Specifying OPTCRIT=EFF requests the efficiency criterion, and adding _OPTEFF_ in the ID= option adds "E" to the label of the cutpoint with maximum efficiency.
When the prevalence observed in the data is used, the efficiency is equivalent to the correct classification rate. Specifying OPTCRIT=CORRECT requests the correct classification criterion, and adding _OPTCORR_ in the ID= option adds "C" to the label of the cutpoint with the maximum correct classification rate.
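The cost, MCT, and efficiency formulas above can be evaluated directly from an OUTROC= data set, as in the following sketch. It is an illustration of the formulas, not necessarily how the ROCPLOT macro computes them. The cost ratio of 0.5 is a hypothetical value, and the prevalence of 0.3 is the observed prevalence in the Example 1 data (30 events among 100 subjects), so the efficiency computed here equals the correct classification rate. Because MCT takes cTN and cTP to be zero, the cost ratio reduces to cFP/cFN, and cFN/cFP is computed as 1/costratio.

```sas
/* Example 1 data: Disease events out of N subjects at each Age */
data Data1;
   input Disease N Age;
   datalines;
0 14 25
0 20 35
0 19 45
7 18 55
6 12 65
17 17 75
;

proc logistic data=Data1;
   model Disease/N = Age / outroc=roc1;
run;

%let costratio = 0.5;   /* hypothetical cost ratio               */
%let pevent    = 0.3;   /* prevalence; equals the observed value */

data CostCrit;
   set roc1;
   /* _SENSIT_ = sensitivity, _1MSPEC_ = 1 - specificity */
   m    = &costratio * (1 - &pevent) / &pevent;
   cost = _SENSIT_ - m * _1MSPEC_;                      /* maximize */
   mct  = (1 / &costratio) * &pevent * (1 - _SENSIT_)
          + (1 - &pevent) * _1MSPEC_;                   /* minimize */
   eff  = &pevent * _SENSIT_
          + (1 - &pevent) * (1 - _1MSPEC_);             /* maximize */
run;

proc print data=CostCrit noobs;
   var _PROB_ _SENSIT_ _1MSPEC_ cost mct eff;
run;
```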
Some criteria do not depend on the cost ratio or the prevalence. These criteria weight sensitivity and specificity equally. They effectively assume the false positive to false negative cost ratio is 1 and the event prevalence is 0.5.
The Youden index is the height of the cutpoint above the diagonal line that represents an uninformative model. When the cost ratio equals 1 and the prevalence equals 0.5, the cost criterion reduces to the Youden index. When prevalence equals 0.5, the cutpoint that maximizes the Youden index also maximizes the efficiency. Specifying OPTCRIT=YOUDEN requests the Youden criterion, and adding _OPTY_ in the ID= option adds "Y" to the label of the cutpoint with the maximum Youden index.
The distance criterion is the distance from the "perfect" point at the upper-left corner of the ROC plot, where 1-Specificity=0 and Sensitivity=1. Specifying OPTCRIT=DIST requests the distance criterion, and adding _OPTDIST_ in the ID= option adds "D" to the label of the cutpoint closest to the (0,1) point of the ROC graph.
Specifying OPTCRIT=SESPDIFF requests the criterion that minimizes the absolute difference between sensitivity and specificity, and adding _OPTSESP_ in the ID= option adds "=" to the label of the cutpoint with the minimum absolute difference.
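The criteria that do not depend on cost or prevalence can be sketched from an OUTROC= data set as follows. This is an illustration, not the macro's own code. It assumes the usual definitions: the Youden index as Sensitivity + Specificity - 1, and the distance to the (0,1) point as the Euclidean distance. The data are the Example 1 data from the Results tab.

```sas
/* Example 1 data: Disease events out of N subjects at each Age */
data Data1;
   input Disease N Age;
   datalines;
0 14 25
0 20 35
0 19 45
7 18 55
6 12 65
17 17 75
;

proc logistic data=Data1;
   model Disease/N = Age / outroc=roc1;
run;

data Criteria;
   set roc1;
   spec     = 1 - _1MSPEC_;
   youden   = _SENSIT_ + spec - 1;                    /* maximize */
   dist     = sqrt((1 - _SENSIT_)**2 + _1MSPEC_**2);  /* minimize; Euclidean distance assumed */
   sespdiff = abs(_SENSIT_ - spec);                   /* minimize */
   /* correct classification rate from the classification counts */
   correct  = (_POS_ + _NEG_)
              / (_POS_ + _NEG_ + _FALPOS_ + _FALNEG_); /* maximize */
run;

proc print data=Criteria noobs;
   var _PROB_ _SENSIT_ spec youden dist sespdiff correct;
run;
```

Scanning the printed columns for the largest Youden index and correct classification rate, and the smallest distance and sensitivity-specificity difference, identifies the corresponding optimal cutpoints among the observed values.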
You can ensure that the cutpoint(s) optimal for a given criterion are labeled in the ROC plot by specifying the corresponding criterion variable in the ID= option. For example, adding _OPTY_ in the ID= option will ensure that the cutpoint(s) with maximum Youden index are labeled and that their labels include the symbol, Y. By default, specifying the criterion in the OPTCRIT= option also places the symbol directly on the cutpoint in the ROC curve.
For the cost and MCT criteria, there is an optimal cutpoint for each combination of cost ratio and prevalence. These cutpoints are presented in the "Cost-Based Optimal Cutpoints" and "MCT-Based Optimal Cutpoints" tables. Similarly, the efficiency criterion has an optimal cutpoint at each prevalence, and these cutpoints are presented in the "Efficiency-Based Optimal Cutpoints" table. Note that if multiple cutpoints are tied for the optimal criterion value at a given cost ratio and/or prevalence, all such cutpoints are presented. For these criteria, the single cutpoint (or multiple if tied) that optimizes the criterion over all cost ratios and/or prevalences is presented in the "Optimal Cutpoints" table that appears after the ROC plot. For the other criteria, the single cutpoint (or multiple if tied) that optimizes the criterion also appears in this table.
Each of the criteria requested in the OPTCRIT= option can be plotted against a variable in the INPRED= data set by specifying the OPTBYX=YES, OPTBYX=PANELALL, or OPTBYX=PANELEACH option. Specify the variable to plot the criteria against in the X= option.
Additionally, for each of the cost, MCT, and efficiency criteria, all of the optimal cutpoints for the specified cost ratios and/or prevalences can be plotted by specifying the MULTOPTPLOT=YES or MULTOPTPLOT=PANELALL option. Lists of all optimal cutpoints for these criteria are provided by specifying MULTOPTLIST=YES.
The OUTROCDATA= data set
The OUTROCDATA= data set contains all the information needed to produce the ROC plot. The _THINID variable in this data set contains the labels for the points that are labeled in the plot. The _ID variable contains labels for all points. To create only this data set and prevent production of the ROC plot, specify the PLOTTYPE=NONE option.
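For instance, a call along the following lines would save the plot data without drawing the plot. This sketch assumes that the ROC1 and OUT data sets have been created as in Example 1 of the Results tab and that the ROCPLOT macro has already been defined with a %INC statement. The data set name RocData is a hypothetical choice.

```sas
/* Assumes ROC1 and OUT exist (see Example 1) and that the */
/* ROCPLOT macro has been defined with %INC.               */

/* Save the plot data in RocData and suppress the ROC plot */
%rocplot(inroc = roc1, inpred = out, p = phat,
         id = age _cutpt_,
         outrocdata = RocData, plottype = none)

/* Compare the labels for all points (_ID) with the labels */
/* that survive thinning (_THINID)                         */
proc print data=RocData;
   var _ID _THINID;
run;
```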
Greiner, M., Pfeiffer, D., and Smith, R.D. (2000), "Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests," Preventive Veterinary Medicine, 45, 23-41.
Hanley, J.A., and McNeil, B.J. (1982), "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, 143, 29-36.
Hanley, J.A., and McNeil, B.J. (1983), "A method of comparing the areas under receiver operating characteristic curves derived from the same cases," Radiology, 148, 839-843.
Metz, C.E. (1978), "Basic principles of ROC analysis," Seminars in Nuclear Medicine, 8(4), 283-298.
Zweig, M.H., and Campbell, G. (1993), "Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine," Clinical Chemistry, 39(4), 561-577.
These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.
This example appears in the LOGISTIC documentation. The probability of disease is modeled as a function of Age. The following statements fit the binary logistic model and save the predicted probabilities (variable PHAT) in data set OUT and the ROC information in data set ROC1.
data Data1;
   input Disease N Age;
   datalines;
0 14 25
0 20 35
0 19 45
7 18 55
6 12 65
17 17 75
;

proc logistic data=data1;
   model Disease/N = Age / outroc=roc1;
   output out=out p=phat;
run;
PROC LOGISTIC can plot the ROC curve and label points using Age or using a statistic like the misclassification rate, but it cannot label points using both. The following call of the ROCPLOT macro plots the ROC curve and labels the cutpoints on the curve using the corresponding value of Age, cutpoint value, and indicator symbols for several optimality criteria. Before calling the macro, the %INC statement defines the macro and makes it available for use. It is needed only once in a SAS session. By including _OPTCORR_, _OPTDIST_, _OPTY_, and _OPTSESP_ in the ID= option, a symbol is added to the label of the cutpoints that optimize the correct classification rate, the distance to the (0,1) point, the Youden index, and the absolute difference between sensitivity and specificity. These optimality criteria do not require specification of a cost ratio or event prevalence rate.
Additional plots of the optimality criteria are produced by specifying the OPTCRIT=, OPTBYX=, and X= options. The same four optimality criteria as above are specified in the OPTCRIT= option making them available for plotting. The X=Age option requests plots of these optimality criteria against Age. The OPTBYX=PANELALL option requests a single panel of the four plots.
By default, a symbol is placed on the ROC curve for each criterion specified in the OPTCRIT= option. When the same cutpoint optimizes multiple criteria, multiple symbols are overlaid and are hard to distinguish. In such cases, it is desirable to include the symbols in the cutpoint labels as requested above, and omit the symbols on the ROC curve. Specifying SIZE=0 in the OPTSYMBOLSTYLE= option effectively turns off the symbols on the ROC curve.
%inc "<location of your file containing the ROCPLOT macro>";

title "ROC plot for Disease = Age";
title2 " ";
%rocplot( inroc = roc1, inpred = out, p = phat,
          id = age _cutpt_ _optcorr_ _optdist_ _opty_ _optsesp_,
          optcrit = correct dist youden sespdiff,
          optsymbolstyle = size=0,
          optbyx = panelall, x = Age)
Note that the cutpoints at Age 65 and 75 (0.716 and 0.952) give the highest correct classification rate (87%). The cutpoint at Age 65 also minimizes the difference between the sensitivity and specificity (0.148). The cutpoint at Age 55 (0.242) minimizes the distance to the "perfect" point at the upper-left of the plot (0.243) and maximizes the height above the uninformative diagonal (0.757).
When the model involves a single predictor and the cutpoint that is optimal under a given criterion is unique among the observations in your data, then the following shows how to obtain a confidence interval for the optimal cutpoint on the scale of the predictor. In Example 1, for instance, you might want to know the range of Age values that can be expected to yield the cutpoint that is optimal on the correct classification criterion. But under that criterion, note that there are two cutpoints that produce the maximum (see the discussion in the Details section). As a result, there is no single value of Age, and no single interval, that produces the highest correct classification rate. Similarly, if the model involves multiple predictors, then even if an observation's cutpoint is uniquely optimal under a given criterion, it is a function of more than just a single predictor so a confidence interval around one of the predictors is not helpful.
One way to obtain a confidence interval for optimal cutpoints on the scale of the predictor is with the INVERSECL(PROB=) option in PROC PROBIT.† Simply specify one or more cutpoint values (on the probability scale). Be careful to reproduce the original model. If the original model is a logistic model, specify the DIST=LOGISTIC option in the MODEL statement.
The following statements refit the logistic model from Example 1 and compute 95% confidence intervals (known as fiducial limits in this situation) for the cutpoints optimal on the Youden and equality (SESPDIFF) criteria. Note in the Example 1 results that each of these optimal cutpoints is unique among the set of observations. The cutpoint (predicted event probability) values of the two optimal cutpoints are specified in the INVERSECL(PROB=) option.
proc probit data=Data1;
   model Disease/N = Age / d=logistic inversecl(prob=0.24244 0.71637);
run;
After verifying that the fitted model from PROC PROBIT matches the original model from PROC LOGISTIC, the following table provides 95% confidence intervals for the Age values that produce the event probabilities (cutpoints) specified in the INVERSECL(PROB=) option. Note that since there are only six data points to produce the empirical ROC curve and since only these points can be selected as optimal points, these might not be very good estimates. Better estimates would be produced if a large number of Age values were observed. Alternatively, if Age can be assumed to be normally distributed (or be transformed to be so), then better optimal cutpoint estimates might be obtained by using that assumption to produce a smooth and monotonic ROC curve.
___________
†The following references make use of distributional assumptions to produce the ROC curve or estimate an optimal cutpoint and confidence interval.
Gönen, M. (2007), Analyzing Receiver Operating Characteristic Curves with SAS. Cary, NC: SAS Institute Inc.
Lai, C., Tian, L., and Schisterman, E.F. (2012), "Exact confidence interval estimation for the Youden index and its corresponding optimal cut-point," Computational Statistics & Data Analysis, 56(5), 1103-1114.
Greiner et al. (2000) present and analyze a set of antibody data from 25 infected (INF=1) and 75 uninfected (INF=0) cattle. A diagnostic test (EC) was used as a predictor of infection. An ROC analysis is done to assess the diagnostic test and find an optimal cutpoint for determining when the infection is present. The following statements create the data set of results from the cattle.
data elisac;
   input ec inf @@;
   datalines;
0.000 0 0.159 0 0.020 0 0.254 1
0.067 0 0.001 0 0.229 0 0.364 1
0.000 0 0.164 0 0.031 0 0.490 1
0.075 0 0.002 0 0.233 0 0.509 1
0.000 0 0.169 0 0.036 0 0.650 1
0.081 0 0.002 0 0.233 0 0.702 1
0.000 0 0.180 0 0.039 0 0.716 1
0.087 0 0.003 0 0.248 0 0.743 1
0.000 0 0.183 0 0.043 0 0.752 1
0.095 0 0.003 0 0.294 0 0.879 1
0.000 0 0.184 0 0.043 0 0.899 1
0.112 0 0.005 0 0.318 0 0.927 1
0.000 0 0.192 0 0.047 0 0.937 1
0.119 0 0.009 0 0.341 0 1.057 1
0.001 0 0.194 0 0.048 0 1.064 1
0.119 0 0.009 0 0.401 0 1.081 1
0.001 0 0.210 0 0.048 0 1.116 1
0.129 0 0.010 0 0.431 0 1.263 1
0.001 0 0.216 0 0.051 0 1.346 1
0.140 0 0.011 0 0.482 0 1.402 1
0.001 0 0.216 0 0.055 0 1.665 1
0.140 0 0.016 0 0.696 0 1.698 1
0.001 0 0.222 0 0.056 0 1.799 1
0.144 0 0.018 0 0.058 0 1.801 1
0.001 0 0.222 0 0.067 0 1.934 1
;
These statements fit the logistic model and save the data sets of predicted values and ROC information for the ROCPLOT macro.
proc logistic data=elisac;
   model inf(event="1")=ec / outroc=ecroc;
   output out=ecpred p=ecp;
run;
The following ROCPLOT macro call provides a plot of the ROC curve. The OPTCRIT= option finds the cutpoint that minimizes the difference between sensitivity and specificity. The ID= option displays the EC, sensitivity, and specificity values on the labeled cutpoints and places a symbol (=) on the cutpoint that is optimal under this criterion. The OPTSYMBOLSTYLE= option specifies that this symbol be 18 pixels in size, red, and use a bold font.
%rocplot(inroc = ecroc, inpred = ecpred, p = ecp,
         id = ec _sens_ _spec_,
         optcrit = sespdiff,
         optsymbolstyle = size=18 color=red weight=bold)
The "Optimal Cutpoints" table shows that the optimal cutpoint occurs at EC=0.364, with sensitivity and specificity values of 0.96 and 0.947 for a difference of 0.013.
The following ROCPLOT macro call finds the optimal cutpoints on all of the optimality criteria. The prevalence of infection in the sample data is 0.25, but results at a prevalence of 0.05 are also desired. The PEVENT= option requests both prevalences. The COSTRATIO= option specifies two ratios (0.5 and 1) of false positive to false negative costs. The ID= option labels cutpoints with levels of the predictor, EC, and includes symbols indicating the criteria the cutpoints optimize. The OPTSYMBOLSTYLE= option specifies a size of zero to prevent criteria symbols from appearing directly on the ROC curve. Plots of all criteria against EC are requested by the OPTBYX= and X= options. OPTBYX=PANELEACH produces separate criterion plots with a panel of plots at each cost ratio and/or prevalence for cost, MCT, and efficiency. The MULTOPTPLOT= option requests a panel summarizing the three prevalence-dependent criteria (cost, MCT, and efficiency). The MULTOPTLIST= option requests tables of the optima at each cost ratio and/or prevalence for these criteria.
%rocplot(inroc = ecroc, inpred = ecpred, p = ecp,
         id = ec _optall_,
         optsymbolstyle = size=0,
         optcrit = all, costratio = 0.5 1, pevent=.25 .05,
         optbyx = paneleach, x = ec,
         multoptplot = panelall, multoptlist = yes)
Three cutpoints are tied for the maximum Youden index (0.907) at EC values of 0.254, 0.364, and 0.49. The distance to the "perfect" point in the upper-left corner of the plot is minimum (0.067) at EC=0.364. The "MCT-Based Optimal Cutpoints" and "Cost-Based Optimal Cutpoints" tables show that the cost and MCT criteria are optimized at EC=0.49 when prevalence=0.25 and at EC=0.702 when prevalence=0.05, irrespective of cost ratio. The "Efficiency-Based Optimal Cutpoints" table shows that the efficiency is maximized at the same EC values as cost and MCT. The "Optimal Cutpoints" table shows that across both prevalences and both cost ratios, EC=0.702 optimizes MCT and efficiency while EC=0.49 optimizes cost and the correct classification rate.
Using the cancer remission example from the LOGISTIC documentation, the following statements fit a logistic model with CELL as the predictor.
data Remission;
   input remiss cell smear infil li blast temp;
   datalines;
1 .8 .83 .66 1.9 1.1 .996
1 .9 .36 .32 1.4 .74 .992
0 .8 .88 .7 .8 .176 .982
0 1 .87 .87 .7 1.053 .986
1 .9 .75 .68 1.3 .519 .98
0 1 .65 .65 .6 .519 .982
1 .95 .97 .92 1 1.23 .992
0 .95 .87 .83 1.9 1.354 1.02
0 1 .45 .45 .8 .322 .999
0 .95 .36 .34 .5 0 1.038
0 .85 .39 .33 .7 .279 .988
0 .7 .76 .53 1.2 .146 .982
0 .8 .46 .37 .4 .38 1.006
0 .2 .39 .08 .8 .114 .99
0 1 .9 .9 1.1 1.037 .99
1 1 .84 .84 1.9 2.064 1.02
0 .65 .42 .27 .5 .114 1.014
0 1 .75 .75 1 1.322 1.004
0 .5 .44 .22 .6 .114 .99
1 1 .63 .63 1.1 1.072 .986
0 1 .33 .33 .4 .176 1.01
0 .9 .93 .84 .6 1.591 1.02
1 1 .58 .58 1 .531 1.002
0 .95 .32 .3 1.6 .886 .988
1 1 .6 .6 1.7 .964 .99
1 1 .69 .69 .9 .398 .986
0 1 .73 .73 .7 .398 .986
;

proc logistic data=Remission;
   model remiss = cell / outroc=roc;
   output out=outp p=p;
run;
The ROC plot produced by the following ROCPLOT macro call labels the cutpoints with the median of the predictor (CELL) values, the mean of SMEAR, and the maximum value of BLAST. Each label also includes the cutpoint and positive predictive value. Since CELL is the only predictor in the model, there is one predicted probability and therefore one cutpoint for each value of CELL. But there can be one or more observations at any given value of CELL as can be seen in these results from PROC MEANS.
proc means data=outp mean max;
   class cell;
   var smear blast;
run;
In the following call of the ROCPLOT macro, the IDSTAT= option is used to specify the statistics to be computed for each variable from the input data set (CELL, SMEAR, and BLAST). The statistics are computed over the set of observations corresponding to each cutpoint. The THINY=0 option allows all cutpoints above the diagonal to be labeled if they also meet the default THINSENS=0.05 requirement. Adding this option labels the CELL=0.9 cutpoint whose height above the diagonal does not exceed the default THINY=0.5 requirement.
title "ROC plot for remiss = cell";
title2 " ";
%rocplot(inroc = roc, inpred = outp, p = p,
         id = cell smear blast _cutpt_ _ppred_ ,
         idstat = median mean max . . ,
         thiny = 0)
The labeled plot shows that sensitivity increases as CELL increases. However, the positive predictive value decreases as CELL increases. The labels also show the mean SMEAR and maximum BLAST values at each cutpoint matching the results seen in the PROC MEANS results above.
The following data and model appear in the example titled "Logistic Modeling with Categorical Predictors" in the PROC LOGISTIC documentation. These statements create the input data and fit a logistic model to the probability of no pain.
data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
Placebo Female 68 1 No   TrtA Female 69 12 No   Placebo Male 66 26 Yes
TrtB Male 67 23 No   TrtA Female 71 12 No   TrtB Male 77 1 Yes
TrtA Male 71 17 Yes   Placebo Female 65 29 No   TrtB Female 66 12 No
TrtB Male 75 21 Yes   TrtA Female 64 17 No   Placebo Female 70 13 Yes
Placebo Male 70 1 Yes   Placebo Female 68 27 Yes   TrtA Female 64 30 No
TrtB Male 70 22 No   TrtB Female 78 1 No   TrtA Male 67 10 No
TrtB Male 75 30 Yes   TrtB Male 80 21 Yes   TrtA Male 70 12 No
Placebo Female 67 30 No   TrtB Male 70 1 No   TrtB Female 77 16 No
Placebo Male 78 12 Yes   TrtB Female 76 9 Yes   Placebo Male 66 4 Yes
TrtA Female 69 18 Yes   TrtA Male 78 15 Yes   Placebo Female 64 1 Yes
Placebo Female 72 27 No   TrtA Female 72 25 No   TrtB Female 65 7 No
TrtB Male 59 29 No   Placebo Male 67 17 Yes   TrtA Male 69 1 No
Placebo Female 67 1 Yes   TrtB Female 69 42 No   TrtA Female 74 1 No
Placebo Female 79 20 Yes   TrtB Male 74 16 No   TrtB Female 65 14 No
TrtB Female 67 28 No   TrtA Male 76 25 Yes   TrtB Female 72 50 No
TrtB Female 69 24 No   TrtA Female 63 27 No   Placebo Male 60 26 Yes
TrtA Male 62 42 No   TrtA Female 67 11 No   Placebo Male 74 4 No
TrtA Male 75 6 Yes   TrtB Male 66 19 No   Placebo Male 68 11 Yes
TrtA Male 70 28 No   TrtA Male 65 15 No   Placebo Male 83 1 Yes
Placebo Female 72 11 Yes   Placebo Male 77 29 Yes   TrtA Female 69 3 No
;

proc logistic data=Neuralgia;
   class Treatment Sex;
   model Pain(event="No") = Treatment Sex Treatment*Sex Age Duration / outroc=NeurOR;
   output out=NeurProbs p=pNoPain;
run;
This call of the ROCPLOT macro produces an ROC plot of the fitted model using the default appearance. The THINSENS=0 option allows cutpoints to be labeled even if there is no vertical separation (change in sensitivity) between them. Notice that the default CHARLEN=5 option truncates the character values in the labels to the first five characters. Similarly, the default FORMAT=BEST5. option presents numeric values in the labels with a width of five.
%rocplot(inpred = NeurProbs, inroc = NeurOR, p = pNoPain,
         id = Treatment Sex Age Duration,
         thinsens = 0)
The following ROCPLOT call alters the plot appearance by limiting the length of the values of the character variables (Treatment and Sex) to 4, separating values in the label with a comma, making the label text bold, blue, and 10 pixels in size, using green, filled stars as cutpoint marker symbols, and using a red, dotted line that is 2 pixels thick. The THINSENS=0 option is removed to reduce crowding of labels.
%rocplot(inpred = NeurProbs, inroc = NeurOR, p = pNoPain, id = Treatment Sex Age Duration, charlen = 4, split = %str(,), labelstyle = color=blue size=10 weight=bold, marker = starfilled, markerstyle = color=green, linestyle = color=red thickness=2 pattern=3)
In this example, data on plants were gathered from four blocks. Data set TRAIN contains the data from the first three blocks and is used as the training data set. Data set VALID contains data from the fourth block and is used as the validation data set.
data train valid;
   input block entry lat lng n r @@;
   if block < 4 then output train;
   else output valid;
   datalines;
1 14 1 1 8 2    1 16 1 2 9 1    1 7 1 3 13 9    1 6 1 4 9 9
1 13 2 1 9 2    1 15 2 2 14 7   1 8 2 3 8 6     1 5 2 4 11 8
1 11 3 1 12 7   1 12 3 2 11 8   1 2 3 3 10 8    1 3 3 4 12 5
1 10 4 1 9 7    1 9 4 2 15 8    1 4 4 3 19 6    1 1 4 4 8 7
2 15 5 1 15 6   2 3 5 2 11 9    2 10 5 3 12 5   2 2 5 4 9 9
2 11 6 1 20 10  2 7 6 2 10 8    2 14 6 3 12 4   2 6 6 4 10 7
2 5 7 1 8 8     2 13 7 2 6 0    2 12 7 3 9 2    2 16 7 4 9 0
2 9 8 1 14 9    2 1 8 2 13 12   2 8 8 3 12 3    2 4 8 4 14 7
3 7 1 5 7 7     3 13 1 6 7 0    3 8 1 7 13 3    3 14 1 8 9 0
3 4 2 5 15 11   3 10 2 6 9 7    3 3 2 7 15 11   3 9 2 8 13 5
3 6 3 5 16 9    3 1 3 6 8 8     3 15 3 7 7 0    3 12 3 8 12 8
3 11 4 5 8 1    3 16 4 6 15 1   3 5 4 7 12 7    3 2 4 8 16 12
4 9 5 5 15 8    4 4 5 6 10 6    4 12 5 7 13 5   4 1 5 8 15 9
4 15 6 5 17 6   4 6 6 6 8 2     4 14 6 7 12 5   4 7 6 8 15 8
4 13 7 5 13 2   4 8 7 6 13 9    4 3 7 7 9 9     4 10 7 8 6 6
4 2 8 5 12 8    4 11 8 6 9 7    4 5 8 7 11 10   4 16 8 8 15 7
;
These statements fit the logistic model with three predictors. The OUTROC= option in both the MODEL and SCORE statements saves the ROC data for the TRAIN and VALID data sets (in data sets TROC and VROC, respectively) and causes PROC LOGISTIC to display the ROC curve for each.
proc logistic data=train;
   model r/n = entry lat lng / outroc=troc;
   score data=valid out=valpred outroc=vroc;
run;
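Before calling the macro on the scored data, it can help to confirm what PROC LOGISTIC wrote to the output data sets. The following is a minimal sketch that assumes only the VALPRED and VROC data sets created by the SCORE statement above. It lists the variable names in the scored data set (the macro call needs the name of the predicted probability variable) and prints the first few ROC points.

```sas
/* List the variable names in the scored data set. With events/trials   */
/* syntax, the predicted event probability is expected to be P_Event.   */
proc contents data=valpred short;
run;

/* Show the first few ROC points saved for the validation data. */
proc print data=vroc(obs=5);
run;
```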
The following ROCPLOT macro call plots the ROC curve for the validation data set and labels the cutpoints with the predictor values at each cutpoint as well as the correct classification rate (_CORRECT_). The THINY= and THINSENS= options allow all cutpoints above the diagonal to be labeled provided that they are separated vertically by at least 0.01 in sensitivity. The ALTAXISLABEL=YES option is specified to relabel the axes as True Positive Rate and False Positive Rate. The cutpoint label text is made larger and colored dark blue by the LABELSTYLE= option.
The cutpoint that is closest to the "perfect" point at the upper-left corner of the ROC plot is found by specifying the OPTCRIT=DIST option. This point is identified on the ROC plot by the symbol, D. This symbol is colored red and made larger and bolder by the OPTSYMBOLSTYLE= option. By including _OPTCORR_ in the ID= option, the cutpoint with maximum correct classification rate is also identified in the ROC plot by adding the symbol, C, in that cutpoint's label.
%rocplot(inpred = valpred, inroc = vroc, p = p_event,
         id = entry lat lng _correct_ _optcorr_,
         thiny = 0, thinsens = 0.01,
         altaxislabel = yes,
         labelstyle = size=10 color=darkblue,
         optcrit = dist,
         optsymbolstyle = size=13 color=red weight=bold)
The results show that the cutpoint at ENTRY=10, LAT=7, LNG=8 maximizes the correct classification rate (0.658) and also minimizes the distance to the "perfect" point (0.531). This cutpoint has predicted probability 0.219 and is identified by the red "D" symbol on the ROC curve and by "C" in its label.
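The two criteria can be approximated directly from the OUTROC= data set as a cross-check. The following is a minimal sketch, not the macro's own code. It assumes the standard OUTROC= variables (_PROB_, _POS_, _NEG_, _FALPOS_, _FALNEG_, _SENSIT_, and _1MSPEC_) in the VROC data set. The data set name CUTCHECK and the variables dist and correct are hypothetical names introduced for the illustration, and the macro might define its statistics slightly differently.

```sas
/* Sketch only: CUTCHECK, dist, and correct are hypothetical names. */
data cutcheck;
   set vroc;
   /* Euclidean distance from the point (1-specificity, sensitivity) */
   /* to the "perfect" point (0, 1) at the upper-left corner.        */
   dist = sqrt((1 - _sensit_)**2 + _1mspec_**2);
   /* Proportion classified correctly at this cutpoint. */
   correct = (_pos_ + _neg_) / (_pos_ + _neg_ + _falpos_ + _falneg_);
run;

/* Order the cutpoints so that the smallest distance comes first. */
proc sort data=cutcheck;
   by dist;
run;

proc print data=cutcheck(obs=5);
   var _prob_ _sensit_ _1mspec_ dist correct;
run;
```

Sorting by descending correct instead of by dist lists the cutpoints with the highest correct classification rate first.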
The data for the following example appear in the example titled "Generalized Additive Model with Binary Data" in the PROC GAM documentation. These statements fit the same logistic generalized additive model as shown in the PROC GAM example. The OUTPUT statement is added to save the predicted probabilities from the model. By default, PROC GAM names the variable containing the predicted probabilities P_KYPHOSIS.
proc gam data=kyphosis;
   model Kyphosis(event="1") = spline(Age, df=3)
                               spline(StartVert, df=3)
                               spline(NumVert, df=3) / dist=binomial;
   output out=gamout predicted;
run;
When the model is fit by a procedure other than PROC LOGISTIC, it is necessary to use the predicted probabilities from the fitted model in the PRED= option in the ROC statement of PROC LOGISTIC. No predictors need to be specified in the MODEL statement. The PROC LOGISTIC statements that follow use the OUTROC= option to save ROC information from the model specified in the MODEL statement (in this case, an intercept-only model) and from the generalized additive model. A WHERE clause is needed to retain only the ROC information from the generalized additive model. The value used in the WHERE clause ("gam") should be the same as the label in the ROC statement.
proc logistic data=gamout;
   model Kyphosis(event="1") = / outroc=groc(where=(_source_="gam"));
   roc "gam" pred=P_Kyphosis;
run;
This call to the ROCPLOT macro plots the ROC curve and labels the points using the predicted probabilities (the cutpoint values) and Age values.
%rocplot(inpred=gamout, inroc=groc, p=P_Kyphosis, id=_cutpt_ age)
Right-click on the link below and select Save to save the ROCPLOT macro definition to a file. It is recommended that you name the file rocplot.sas.
Type: | Sample |
Topic: | SAS Reference ==> Procedures ==> LOGISTIC; Analytics ==> Categorical Data Analysis; Analytics ==> Regression; SAS Reference ==> Procedures ==> ADAPTIVEREG; SAS Reference ==> Procedures ==> GAM; SAS Reference ==> Procedures ==> GEE; SAS Reference ==> Procedures ==> GENMOD; SAS Reference ==> Procedures ==> GLIMMIX; SAS Reference ==> Procedures ==> HPGENSELECT; SAS Reference ==> Procedures ==> PROBIT; SAS Reference ==> Procedures ==> SURVEYLOGISTIC |
Date Modified: | 2022-12-04 08:12:24 |
Date Created: | 2005-01-13 15:03:46 |
Product Family | Product | Host | SAS Release (Starting) | SAS Release (Ending) |
SAS System | SAS/STAT | z/OS | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft® Windows® for x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | 64-bit Enabled HP-UX | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Enterprise 32 bit | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Enterprise x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Home Premium 32 bit | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Home Premium x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Professional 32 bit | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Professional x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Ultimate 32 bit | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows 7 Ultimate x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows Vista | 9.3 TS1M0 | |
SAS System | SAS/STAT | Windows Vista for x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | 64-bit Enabled AIX | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows XP Professional | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2008 for x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2008 R2 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2008 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2003 for x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2003 Standard Edition | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2003 Enterprise Edition | 9.3 TS1M0 | |
SAS System | SAS/STAT | Microsoft Windows Server 2003 Datacenter Edition | 9.3 TS1M0 | |
SAS System | SAS/STAT | 64-bit Enabled Solaris | 9.3 TS1M0 | |
SAS System | SAS/STAT | HP-UX IPF | 9.3 TS1M0 | |
SAS System | SAS/STAT | Linux | 9.3 TS1M0 | |
SAS System | SAS/STAT | Linux for x64 | 9.3 TS1M0 | |
SAS System | SAS/STAT | Solaris for x64 | 9.3 TS1M0 | |