- EXAMPLE 1: Plot PR curve, compute AUC, and optimal points for logistic model
- The following uses the data from the example titled "Logistic Regression" in the GENMOD documentation. PROC LOGISTIC fits the logistic model, plots the ROC curve, and saves the ROC plot data in the data set OR. The PRcurve macro then plots the precision-recall (PR) curve and computes the area under the PR curve.
proc logistic data=drug;
class drug;
model r/n = drug x / outroc=or;
run;
%prcurve()
The resulting ROC plot and area under the ROC curve (AUC = 0.84) show evidence of moderately good model fit. Note that an ROC curve lying close to the diagonal line with AUC near 0.5 indicates a poorly fitting model that essentially randomly classifies observations as positive or negative. The ROC curve of a perfectly fitting model that classifies all observations correctly is a curve from (0,0) to (0,1) to (1,1) with AUC = 1.
Note that the overall proportion of positives in the data is 0.42, indicating that the positive and negative responses are reasonably balanced and making the PR curve less necessary for assessing the model. As with the ROC curve and its AUC, the PR curve and its AUC = 0.75 suggest a moderately good-fitting model. Note that a PR curve lying close to the horizontal line at the positive proportion, with AUC equal to that proportion, indicates a poor, randomly classifying model. The PR curve of a model that fits and classifies perfectly is a curve from (0,1) to (1,1) to (1,0) with AUC = 1.
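To see why a randomly classifying model traces the horizontal line at the positive proportion, consider scores that carry no information about the labels: the fraction of predicted positives that are actual positives is then just the prevalence. The following small simulation illustrates this (a Python sketch for intuition only; the PRcurve macro itself is a SAS macro, and the 0.42 prevalence is taken from the example above):

```python
import random

def random_classifier_precision(prevalence, threshold=0.5, n=100_000, seed=1):
    """Precision of a 'classifier' whose scores are independent of the labels.

    Because predicted positives are selected without regard to the true label,
    the fraction of them that are actual positives is about the prevalence.
    """
    rng = random.Random(seed)
    tp = fp = 0
    for _ in range(n):
        y = rng.random() < prevalence        # true label, drawn at prevalence
        score = rng.random()                 # uninformative score
        if score >= threshold:               # predicted positive
            if y:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp)

# With 42% positives, precision stays near 0.42 at any threshold,
# so the random-model PR curve hugs the horizontal line at 0.42.
```

Changing the threshold changes how many observations are called positive, but not the fraction of those calls that are correct.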
The following macro call computes and identifies optimal points on the PR curve and displays their corresponding thresholds on the predicted probabilities.
%prcurve(options=optimal)
Points at the maximum F(1) score and Matthews correlation coefficient (MCC) are identified with colors on the PR curve. The achieved values on these optimality criteria and the predicted probability thresholds at which they occur are given in the legend below the plot. Note that the F score using the default parameter value, β = 1, is the harmonic mean of precision and recall.
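Both optimality criteria can be written directly in terms of the four confusion-matrix cells. The following is a minimal sketch of those definitions (Python for illustration; the macro computes these quantities in SAS):

```python
import math

def f_score(tp, fp, fn, beta=1.0):
    """F(beta) score; with the default beta = 1 this reduces to the
    harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

For example, with tp=8, fp=2, fn=4, tn=6 (hypothetical counts), precision is 0.8 and recall is 2/3, so F(1) is their harmonic mean, about 0.727, and the MCC is about 0.408.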
In order to display the predictor values corresponding to the optimal PR points, the data set containing predicted probabilities under the fitted model must be provided in inpred=. The variable containing the predicted probabilities is specified in pred=. Also, specify some or all of the predictor variables to display in optvars=. The following statements refit the model, save the necessary data sets, and repeat the previous plot:
proc logistic data=drug;
class drug;
model r/n = drug x / outroc=or;
output out=preds pred=p;
run;
%prcurve(data=or, inpred=preds, pred=p, options=optimal, optvars=drug x)
In addition to repeating the plot above, the following table of optimal points is displayed.
- EXAMPLE 2: PR curve for data from previous model or classifier
- If a model or classifier has already been applied, yielding a data set of actual responses and predicted event probabilities, you can use PROC LOGISTIC to create the OUTROC= data set needed by the PRcurve macro. Specify the variable containing the predicted event probabilities in the PRED= option in the ROC statement. In the MODEL statement, specify the value representing the event of interest in the EVENT= option following the response variable, and also specify the NOFIT and OUTROC= options. No predictors need to be specified.
The following simple example illustrates how this is done:
data x;
input p y;
datalines;
0.9 1
0.8 1
0.7 0
0.6 1
0.55 1
0.54 1
0.53 0
0.52 0
0.51 1
0.505 0
;
proc logistic data=x;
model y(event='1')= / nofit outroc=xor;
roc pred=p;
run;
%prcurve()
Below are the ROC and PR curves for these data.
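For a data set this small, the PR points can be traced by hand: sort the observations by predicted probability and sweep the threshold downward. The following sketch performs that computation on the ten (p, y) pairs from the DATA step above (Python for illustration only; the macro builds these points from the OUTROC= data):

```python
def pr_points(scores, labels):
    """(precision, recall) at each distinct threshold, classifying
    score >= threshold as a predicted event.  Assumes no tied scores."""
    total_pos = sum(labels)
    points, tp, fp = [], 0, 0
    for score, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / total_pos))
    return points

# The ten (p, y) pairs from the DATA step above:
p = [0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505]
y = [1,   1,   0,   1,   1,    1,    0,    0,    1,    0]
```

At the highest threshold (0.9), precision is 1 with recall 1/6; at the lowest, all ten observations are called events, so precision is 6/10 and recall is 1.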
The same can be done for a model fit by another SAS procedure that uses any modeling method appropriate for a binary response and provides a data set of the predicted event probabilities and actual responses. The ROC curve and AUC can be obtained as illustrated in this note. The following produces the PR curve for the Generalized Additive Model (GAM) example shown in that note. Notice that the OUTROC= option is added so that the data set can be used in the PRcurve macro. The OUT= data set from the GAM procedure is specified in inpred= to obtain optimal points on the PR curve and the corresponding predictor values.
proc gam data=Kyphosis;
model Kyphosis(event="1") = spline(Age ,df=3)
spline(StartVert,df=3)
spline(NumVert ,df=3) / dist=binomial;
output out=gamout predicted;
run;
proc logistic data=gamout;
model Kyphosis(event="1") = / nofit outroc=gamor;
roc "GAM model" pred=P_Kyphosis;
run;
%prcurve(data=gamor, inpred=gamout, pred=P_Kyphosis,
optvars=age startvert numvert, options=optimal nomarkers)
Below are the ROC curve produced by PROC LOGISTIC and the PR curve from the PRcurve macro. Notice that the proportion of events (positive proportion) is 0.2169. Optimal points on the PR curve, identified by values of the three predictors in the generalized additive model, are provided following the plots.
The point on the above PR curve that is optimal under the F score criterion assumes equal importance of precision (PPV) and recall (sensitivity). That is the implication of the default β = 1 parameter for the F(β) score. With β = 1, the F score computed at each threshold is the harmonic mean of precision and recall. If, however, recall is considered more important, then a larger value of β can be used. Conversely, if precision is considered more important, a smaller value of β can be used. The following finds the optimal points, under the F score criterion, for β = 0.5 and 2. For both cases, the predictor values associated with the optimal points could be identified by also specifying inpred=, pred=, optvars=, and options=optimal. Note that there is no corresponding parameter for the MCC criterion, so the optimal point under that criterion does not change.
%prcurve(data=gamor, options=optimal nomarkers, beta=0.5)
%prcurve(data=gamor, options=optimal nomarkers, beta=2)
Notice that the optimal point with maximum value for F(0.5) has moved to a point with considerably larger precision. However, the point optimal under F(2) did not move from the point found under the default F(1) criterion: note the identical threshold value, 0.5699, associated with the F(1) and F(2) maxima. That threshold has a fairly large recall value under both criteria.
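The effect of β can be checked numerically. With two hypothetical operating points, one precision-heavy and one recall-heavy, F(0.5) and F(2) rank them in opposite orders (a Python sketch; the operating-point values are invented for illustration and are not taken from the macro output):

```python
def f_beta(precision, recall, beta):
    """F(beta): beta > 1 weights recall more heavily; beta < 1 weights precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

high_precision = (0.9, 0.3)   # hypothetical precision-heavy operating point
high_recall    = (0.4, 0.9)   # hypothetical recall-heavy operating point

# F(0.5) prefers the precision-heavy point; F(2) prefers the recall-heavy one.
```

This is why shrinking β can move the selected threshold toward higher precision, while growing β can move it toward higher recall (or, as in this example, leave it unchanged if the F(1) optimum already has high recall).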
- EXAMPLE 3: Compare ROC and PR curves with balanced and unbalanced data
- This example illustrates the invariance of the ROC curve to the level of imbalance among events and nonevents in the data. This invariance makes the ROC curve and AUC misleading in cases where the imbalance is strong, such as when the event is rare in the population.
The following statements generate balanced and unbalanced data sets with a predictor, X, that moderately separates the events and nonevents. The balanced data set contains 2,000 observations with equal numbers of events and nonevents. An unbalanced data set of 1,050 observations is also created in which only about 5% (50 of 1,050) of the observations are events.
data bal;
do i=1 to 1000; y=1; x=rannor(2315)+0.5; output; end;
do i=1 to 1000; y=0; x=rannor(2315); output; end;
run;
data unbal;
do i=1 to 50; y=1; x=rannor(2315)+0.5; output; end;
do i=1 to 1000; y=0; x=rannor(2315); output; end;
run;
These statements fit the model to the balanced data and produce the ROC and PR curves. The nomarkers option is used since there are a large number of points in the OUTROC= data sets. The tr option is used to move the positive proportion and area information to the top right corner of the plot.
proc logistic data=bal;
model y(event="1")=x / outroc=or;
run;
%prcurve(options=nomarkers tr)
Note that for these balanced data, both curves and the areas under them give similar results, suggesting moderate model performance.
Balanced Data, Moderate Separation
These statements fit the model to the unbalanced data:
proc logistic data=unbal;
model y(event="1")=x / outroc=or;
run;
%prcurve(options=nomarkers tr)
For the unbalanced data, note that the ROC curve and AUC are little changed. However, the PR curve differs dramatically from the PR curve for the balanced data, and the AUC has dropped from 0.63 to 0.09. This is because the precision (positive predictive value, PPV) at each threshold on the predicted probabilities is very low: the probability that a predicted event is an actual event is small in the unbalanced data. This information is not captured by the ROC curve.
Unbalanced Data, Moderate Separation
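The drop in precision under imbalance follows directly from Bayes' rule: holding sensitivity and specificity fixed, PPV falls as the prevalence of events falls. The following sketch shows the effect (Python; the sensitivity and specificity values are hypothetical, chosen only to illustrate the behavior):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value (precision) via Bayes' rule."""
    true_pos_rate  = sensitivity * prevalence          # P(predicted +, actual +)
    false_pos_rate = (1 - specificity) * (1 - prevalence)  # P(predicted +, actual -)
    return true_pos_rate / (true_pos_rate + false_pos_rate)

# At sensitivity = specificity = 0.8, PPV is 0.8 when half the population
# are events, but falls below 0.18 when only 5% are events.
```

Since the ROC curve is built entirely from sensitivity and specificity, it cannot register this prevalence-driven collapse in precision; the PR curve does.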
Next, balanced and unbalanced data sets are generated with the predictor providing good separation between events and nonevents. As before, the balanced data set contains 2,000 observations with 50% events, and the unbalanced data set contains 1,050 observations with only about 5% events.
data bal;
do i=1 to 1000; y=1; x=rannor(2315)+2; output; end;
do i=1 to 1000; y=0; x=rannor(2315); output; end;
run;
data unbal;
do i=1 to 50; y=1; x=rannor(2315)+2; output; end;
do i=1 to 1000; y=0; x=rannor(2315); output; end;
run;
These statements produce the ROC and PR curves and areas:
proc logistic data=bal;
model y(event="1")=x / outroc=or;
run;
%prcurve(options=nomarkers)
proc logistic data=unbal;
model y(event="1")=x / outroc=or;
run;
%prcurve(options=nomarkers tr)
Again, the ROC and PR curves and their AUCs give similar results when the data are balanced. But even in this case, with relatively well-separated event and nonevent populations, the two curves give very different assessments of model performance in the unbalanced data. The ROC curve and AUC are essentially unchanged, missing the much lower proportion of true positives among the predicted positives in the unbalanced data. The PR curve reveals those lower proportions, particularly at the moderate and higher sensitivities that occur at lower thresholds on the predicted probabilities. The reduced AUC summarizes this decrease in performance.
Balanced Data, Good Separation
Unbalanced Data, Good Separation
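The areas reported on these plots can be approximated by trapezoidal integration of the plotted points. A generic sketch of that computation follows (Python; the macro's exact integration method may differ):

```python
def trapezoid_auc(points):
    """Approximate area under a curve given as (x, y) points,
    using the trapezoid rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# The diagonal ROC of a random model integrates to 0.5, and a flat PR
# curve at the positive proportion integrates to that proportion.
```

This is consistent with the baselines discussed above: AUC near 0.5 for a random ROC curve, and PR AUC near the positive proportion for a random PR curve.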
- EXAMPLE 4: Cross-validated precision-recall curve
- The following uses the cancer remission data in the example titled "Stepwise Logistic Regression and Predicted Values" in the PROC LOGISTIC documentation. A cross-validated PR curve can be produced from the CTABLE option in PROC LOGISTIC. That option uses an approximate leave-one-out cross-validation method to produce the classification tables over a range of thresholds on the predicted probabilities. Cross validation reduces the bias that results from using the same data to both fit the model and assess its performance. By saving the table produced by the CTABLE option and specifying it as the data= data set in the PRcurve macro, a cross-validated PR curve is produced.
The first PROC LOGISTIC step below fits a logistic model, produces the ROC curve, and saves both the ROC and CTABLE results in data sets for use in the PRcurve macro. The second PROC LOGISTIC step applies the cross-validation method to the ROC curve and produces a cross-validated ROC curve. Creating ROC curves using both validation and cross validation is further discussed in this note. The PRcurve macro is then applied to the ROC and CTABLE data to produce ordinary and cross-validated PR curves with optimal points identified.
proc logistic data=remission;
model remiss(event='1') = smear blast / ctable outroc=remor;
ods output classification=remct;
run;
proc logistic data=remission rocoptions(crossvalidate) plots(only)=roc;
Crossvalidation: model remiss(event='1') = smear blast;
run;
%prcurve(data=remor, options=optimal)
%prcurve(data=remct, options=optimal)
Notice that the cross-validated ROC curve on the right has noticeably smaller area than the ordinary ROC curve on the left. That reflects the optimistic bias that results from using the training data to assess the model.
Similar to the ROC curves, the cross-validated PR curve has lower area as a result of the bias-reducing cross-validation method.
- EXAMPLE 5: BY-group processing
- While the PRcurve macro does not support BY processing directly, the RunBY macro can be used to run the macro on BY groups in the data. The following uses the data in the example titled "Binomial Counts in Randomized Blocks." See the description of the RunBY macro for details about its use and links to several examples. The PROC LOGISTIC step below uses the BY statement to fit separate models to the levels of the BLOCK variable in the data and saves the ROC data. The OUTROC= data set similarly contains the BLOCK variable identifying the ROC data for each block.
proc logistic data=HessianFly;
by block;
model y/n = entry / outroc=blockor;
run;
The RunBY macro is then used to run the PRcurve macro on each block in turn. This is done by placing the code to be run on each BY group in a macro, which you create, named CODE. The statements below create the CODE macro containing a DATA step whose subsetting WHERE statement uses the special macro variables, _BYx and _LVLx, which the RunBY macro sets as it processes each BY group. The PRcurve macro is then called on the subset data. The BYlabel macro variable labels the displayed results for each BY group. Since the PRcurve macro writes its own titles, a FOOTNOTE statement rather than a TITLE statement provides the label.
%macro code();
data byor; set blockor; where &_BY1=&_LVL1; run;
footnote "Above plot for &BYlabel";
%prcurve(data=byor)
%mend;
%RunBY(data=blockor, by=block)