Sample 68077: Precision-recall curve for imbalanced, rare event data

Contents: Purpose / History / Requirements / Usage / Details / Missing Values / See Also / References

 

Note: Beginning in SAS® Viya® platform 2022.09, PROC LOGISTIC can plot the precision-recall curve, compute the area beneath it, and find optimal points using two criteria.

PURPOSE:
The PRcurve macro plots the precision-recall curve, computes the area beneath it, and can find optimal points on the curve that correspond to maximized values of the F score and Matthews correlation coefficient criteria.
HISTORY:
The version of the PRcurve macro that you are using is displayed when you specify version (or any string) as the first argument. Here is an example:
    %prcurve(version, <other options>)

The PRcurve macro always attempts to check for a later version of itself. If it cannot do so (for example, when no internet connection is available), the macro issues the following message:

   NOTE: Unable to check for newer version

The computations that are performed by the macro are not affected by the appearance of this message.

Version  Update Notes
1.1      Print Note if SAS/ETS® is not available for computing/displaying area under PR curve.
1.0      Initial coding
REQUIREMENTS:
Base SAS®. Data required by the macro must be created by the LOGISTIC procedure in SAS/STAT®. SAS/ETS® software is required when options=area is specified.
USAGE:
Follow the instructions on the Downloads tab of this sample to save the PRcurve macro definition. Replace the text within quotation marks in the following statement with the location of the PRcurve macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the PRcurve macro and make it available for use:
   %inc "<location of your file containing the PRcurve macro>";

Following this statement, you can call the PRcurve macro. See the Results tab for examples.

The following parameters are available in the PRcurve macro:

data=data-set-name
Specifies the name of a data set created by either the OUTROC= option or the CTABLE option in PROC LOGISTIC. If the available data consists only of actual responses and predicted event probabilities from a model or classifier, then use PROC LOGISTIC to create the OUTROC= data set by using the PRED= option in the ROC statement to read the predicted event probabilities and the NOFIT and OUTROC= options in the MODEL statement. If the data= parameter is omitted, the data set that was last created is used. Running the macro does not change the last-created data set, which means that successive runs of the macro omitting data= can operate on the same input data set.
inpred=data-set-name
Specifies the name of a data set created by the OUT= and PRED= options in the OUTPUT statement of PROC LOGISTIC. It is ignored unless options=optimal and pred= are also specified.
pred=variable-name
Specifies the name of the variable created by the PRED= option in the data set created by the OUT= option in the OUTPUT statement of PROC LOGISTIC. It is ignored unless options=optimal and inpred= are also specified.
optvars=variable-list
Specifies the names of one or more variables in the data set created by the OUT= option in the OUTPUT statement of PROC LOGISTIC that will be used to identify optimal points on the PR curve. Separate multiple variable names with spaces. Variable lists, such as x1-x9, are not supported. It is ignored unless options=optimal, inpred=, and pred= are all also specified.
npoints=integer
Specifies a positive integer value representing the number of interpolated points between 0 and 1 that the macro will generate along the PR curve. The default value is npoints=100.
beta=value
Specifies a positive value representing the parameter of the F score optimality criterion. The value controls the relative importance of precision and recall in the F score. Typical values are 0.5, 1, and 2. It is ignored if options=optimal is not also specified. The default value is beta=1.
sensdelta=value
Specifies a small positive value that represents the minimum difference in sensitivity values between an interpolated point and a point from the data= data set. The default value is sensdelta=1e-10.
sensinc=value
Specifies a small positive value that slightly increases successive sensitivity values in the data= data set so that the area under the curve can be computed. The default value is sensinc=1e-14.
options=list-of-options
Specifies desired options separated by spaces. The default value is options=pprob area markers br ppvzero nooptimal. Valid options are as follows:
pprob | nopprob
Requests drawing a horizontal reference line at the observed, overall event proportion and displaying its value on the plot. This line represents the PR curve of a random, "no skill" model. Specifying nopprob omits the line and value.
area | noarea
Requests computation of the area below the PR curve and display of its value on the plot. Area computation requires SAS/ETS. Specify noarea to suppress both the computation and the display of the area.
tl | tr | bl | br
Specifies in which corner of the plot to display the area and event proportion, if requested.
markers | nomarkers
Requests the display of markers at each of the threshold points provided in the data= data set. When there are a large number of threshold points, it might be desirable to specify nomarkers.
optimal | nooptimal
Requests computation of the F score and Matthews correlation coefficient (MCC) at each threshold point in the data= data set. If inpred=, pred=, and optvars= are also specified, the macro displays a table of the optimal values and the corresponding values of the variables that are specified in optvars=. Specify nooptimal if computation of optimal points is not desired.
ppvzero | noppvzero
Requests that the PR curve include a point at (0,0) when a threshold point in the data= data set contains zero true positives and nonzero false positives. Specifying noppvzero ignores such points, which results in the PR curve being constant (flat) from recall=0 up to the first threshold point in which the number of true positives exceeds zero, at that point's precision (PPV). The area, if requested, is affected by this choice.
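The effect of the pprob and ppvzero options can be illustrated with a minimal Python sketch. This is not the macro's SAS code. It assumes that each threshold row carries the four classification counts that an OUTROC= data set provides (_POS_, _FALPOS_, _FALNEG_, _NEG_). The function names and the sample counts are hypothetical.

```python
# Illustrative sketch only; not the PRcurve macro's implementation.
# Each row is (tp, fp, fn, tn) at one threshold, corresponding to the
# OUTROC= variables _POS_, _FALPOS_, _FALNEG_, _NEG_.

def no_skill_precision(row):
    """Overall event proportion: the height of the pprob reference line."""
    tp, fp, fn, tn = row
    return (tp + fn) / (tp + fp + fn + tn)


def pr_points(rows, ppvzero=True):
    """Return (recall, precision) points sorted by recall."""
    points = []
    for tp, fp, fn, tn in rows:
        if tp + fp == 0:
            continue  # no predicted events: precision is undefined
        if tp == 0:
            # Zero true positives with nonzero false positives.
            if ppvzero:
                points.append((0.0, 0.0))  # ppvzero keeps the (0,0) point
            continue  # noppvzero ignores the row
        points.append((tp / (tp + fn), tp / (tp + fp)))
    points.sort()
    if not ppvzero and points and points[0][0] > 0:
        # Flat segment from recall=0 at the first available precision.
        points.insert(0, (0.0, points[0][1]))
    return points


# Hypothetical data: 10 events and 90 nonevents.
rows = [(0, 2, 10, 88), (4, 4, 6, 86), (8, 24, 2, 66), (10, 90, 0, 0)]

print(no_skill_precision(rows[0]))
print(pr_points(rows, ppvzero=True))
print(pr_points(rows, ppvzero=False))
```

With these counts, the two settings differ only in the first point of the curve, which is why the area under the curve depends on the choice.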
DETAILS:
The Receiver Operating Characteristic (ROC) curve and the area beneath it (AUC-ROC) are perhaps the most commonly used graphic and statistic for assessing the performance of binary-response models or classifiers. However, as described by Saito and Rehmsmeier (2015) and others, the ROC curve and AUC-ROC can be misleading when the proportions of events and nonevents become very imbalanced, such as when there are very few observed events. They note that the statistics behind the ROC curve, sensitivity and specificity, are invariant to the degree of imbalance. When interest focuses on the model predicting a high proportion of true events among the predicted events, a plot involving that statistic, known as the precision or positive predictive value (PPV), can be more informative. The fact that the precision, unlike specificity, is sensitive to the degree of imbalance allows a plot of precision to more accurately reflect the measure of interest, regardless of the degree of imbalance. This is the advantage of the precision-recall (PR) curve and its area (AUC-PR).
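The dependence of precision on the event proportion follows from Bayes' rule and can be checked numerically. The following Python sketch is illustrative only. The function name and the chosen sensitivity, specificity, and prevalence values are assumptions made for the example.

```python
# Illustrative sketch: precision (PPV) as a function of sensitivity,
# specificity, and the event proportion (prevalence).

def precision_from_rates(sensitivity, specificity, prevalence):
    """PPV = true positive mass / (true positive mass + false positive mass)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hold sensitivity and specificity fixed and make events rarer.
for prevalence in (0.5, 0.1, 0.01):
    ppv = precision_from_rates(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:<5} precision={ppv:.3f}")
```

With sensitivity and specificity both held at 0.9, the point on the ROC curve does not move, but precision falls from 0.90 to 0.50 to about 0.08 as the event proportion drops from 0.5 to 0.1 to 0.01.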

Saito and Rehmsmeier (2015) provide a good comparison of ROC and PR curves and show that the PR curve gives a better assessment of model performance when the proportions of events and nonevents in the data are imbalanced.

Davis and Goadrich (2006) discuss how the ROC space and the PR space are related and, in particular, how interpolation must be done among the observed points in PR space. They note that, while linear interpolation can be used between points in ROC space, it is not appropriate in PR space because precision does not change linearly with recall. The PRcurve macro applies their interpolation method when plotting the PR curve and computing the area under it.
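The interpolation that Davis and Goadrich describe can be sketched in Python as follows. This is an illustration of their formula, not the macro's code. It steps one true positive at a time, whereas the macro generates the number of points given by npoints=. The function names are hypothetical, and the trapezoidal area is an assumption used only to compare the two interpolation schemes.

```python
# Illustrative sketch of Davis-Goadrich interpolation between two PR points.
# Point A has (tp_a, fp_a) and point B has (tp_b, fp_b), with tp_b > tp_a.

def interpolate_pr(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Return (recall, precision) points from A to B, endpoints included."""
    # False positives gained per additional true positive between A and B.
    slope = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def trapezoid_area(points):
    """Area under consecutive (recall, precision) points by the trapezoid rule."""
    return sum((r1 - r0) * (p0 + p1) / 2
               for (r0, p0), (r1, p1) in zip(points, points[1:]))


# Hypothetical counts: 20 events; A = (TP=5, FP=5), B = (TP=10, FP=30).
curve = interpolate_pr(5, 5, 10, 30, total_pos=20)
straight_line = [curve[0], curve[-1]]  # naive linear interpolation in PR space

for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.3f}")
print(trapezoid_area(curve) < trapezoid_area(straight_line))
```

In this example the interpolated curve bows below the straight line that joins A and B, so linear interpolation would overstate the area under the PR curve.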

The PRcurve macro plots precision against sensitivity (recall) and computes the area under the curve (AUC-PR). It can also find the points on the PR curve that maximize the F score and the Matthews correlation coefficient (MCC). When predicted event probabilities (from inpred= and pred=) and one or more predictor variables in the model (from optvars=) are specified, the macro displays a table of the optimal points based on these optimality criteria and the predictor values corresponding to those points.

The F score combines the precision and recall measures, while MCC uses all four cells of the classification table. A parameter, β, can be specified for the F score to control the relative importance of precision and recall. At the default, β=1, the F score is the harmonic mean of the two and they are treated as equally important. When β > 1 (β=2 is commonly used), more importance is given to recall. When β < 1 (β=0.5 is commonly used), precision is more important. The F score ranges between 0 and 1 and equals 1 when precision and recall both equal 1. MCC is considered by some to be a better statistic. It is the correlation between the actual and predicted binary classifications and is equivalent to the phi coefficient. It ranges from -1 to 1 and equals 1 when all observations are correctly classified.
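The two criteria can be written out from their standard definitions. The following Python sketch is illustrative and is not the macro's code. The function names, the threshold values, and the classification counts are hypothetical.

```python
# Illustrative sketch: F score and MCC from classification counts, and the
# threshold that maximizes each criterion.
import math


def f_score(precision, recall, beta=1.0):
    """F score; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (phi coefficient of the 2x2 table)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def score_rows(rows, beta=1.0):
    """Attach both criteria to each (threshold, tp, fp, fn, tn) row."""
    scored = []
    for threshold, tp, fp, fn, tn in rows:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        scored.append({"threshold": threshold,
                       "f": f_score(precision, recall, beta),
                       "mcc": mcc(tp, fp, fn, tn)})
    return scored


# Hypothetical thresholds on data with 10 events and 90 nonevents.
rows = [(0.7, 4, 4, 6, 86), (0.4, 8, 24, 2, 66), (0.1, 10, 90, 0, 0)]

best_f1 = max(score_rows(rows, beta=1.0), key=lambda r: r["f"])
best_f2 = max(score_rows(rows, beta=2.0), key=lambda r: r["f"])
best_mcc = max(score_rows(rows), key=lambda r: r["mcc"])
print(best_f1["threshold"], best_f2["threshold"], best_mcc["threshold"])
```

In this example, raising β from 1 to 2 moves the F-optimal point from the 0.7 threshold to the 0.4 threshold, where recall is higher and precision is lower.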

BY-group processing

While the PRcurve macro does not directly support BY-group processing, this capability can be provided by the RunBY macro, which can run the PRcurve macro repeatedly for each of the BY groups in your data. See the RunBY macro documentation (SAS Note 66249) for details about its use. Also see the example titled "BY-group processing" on the Results tab.

Output data sets

The macro creates up to two output data sets:

_PR
Contains the points on the PR curve including both those from the data= data set and the interpolated points that are generated by the macro.
_PROpt
Produced if options=optimal, inpred=, pred=, and optvars= are all specified. Contains the optimal points on the PR curve. For each, the associated values of the optimality criteria and of the variables specified in optvars= are given. When the data= data set is a CTABLE data set, the value of the threshold in the inpred= data set that most closely matches the threshold of the optimal point from the data= data set is also given.
MISSING VALUES:
Missing values should not appear in the data= data set. Missing values are tolerated in the inpred= data set and pred= variable and do not affect the results.
SEE ALSO:
The ROC curve, the area under the ROC curve, and comparisons of ROC curves and their areas are available in PROC LOGISTIC using the ROC and ROCCONTRAST statements. See the example in the LOGISTIC documentation. Comparison of independent ROC curves, such as curves fit to independent samples, is described in SAS Note 45339. Computation of optimal values on the ROC curve, using various optimality criteria, can be done using the ROCPLOT macro (SAS Note 25018).
REFERENCES:
Davis, J., and Goadrich, M.H. 2006. "The Relationship Between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning.

Fawcett, T. 2004. ROC Graphs: Notes and Practical Considerations for Researchers. Norwell, MA: Kluwer Academic Publishers.

Saito, T., and Rehmsmeier, M. 2015. "The Precision-Recall Plot Is More Informative Than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE 10(3): e0118432.

Williams, C.K.I. 2021. "The Effect of Class Imbalance on Precision-Recall Curves." Neural Computation 33(4): 853–857.




These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.