Sample 68077: Precision-recall curve for imbalanced, rare event data

Contents:

Purpose / History / Requirements / Usage / Details / Missing Values / See Also / References

Note: Beginning in SAS® Viya® platform 2022.09, PROC LOGISTIC can plot the precision-recall curve, compute the area beneath it, and find optimal points using two criteria.

PURPOSE:

The PRcurve macro plots the precision-recall curve, computes the area beneath it, and can find optimal points on the curve that correspond to maximized values of the F score and Matthews correlation coefficient criteria.

HISTORY:

The version of the PRcurve macro that you are using is displayed when you specify version (or any string) as the first argument. Here is an example:

    %prcurve(version, <other options>)

The PRcurve macro always attempts to check for a later version of itself. If it cannot do so (for example, when no internet connection is available), the macro issues the following message:

   NOTE: Unable to check for newer version

The computations that are performed by the macro are not affected by the appearance of this message.

Version	Update Notes
1.1	Prints a note if SAS/ETS® is not available for computing and displaying the area under the PR curve.
1.0	Initial coding

REQUIREMENTS:

Base SAS®. Data required by the macro must be created by the LOGISTIC procedure in SAS/STAT®. SAS/ETS® software is required when options=area (the default) is in effect.

USAGE:

Follow the instructions on the Downloads tab of this sample to save the PRcurve macro definition. Replace the text within quotation marks in the following statement with the location of the PRcurve macro definition file on your system. In your SAS program or in the SAS editor window, specify this statement to define the PRcurve macro and make it available for use:

   %inc "<location of your file containing the PRcurve macro>";

Following this statement, you can call the PRcurve macro. See the Results tab for examples.

The following parameters are available in the PRcurve macro:

data=data-set-name

Specifies the name of a data set created by either the OUTROC= option or the CTABLE option in PROC LOGISTIC. If the available data consists only of actual responses and predicted event probabilities from a model or classifier, then use PROC LOGISTIC to create the OUTROC= data set by using the PRED= option in the ROC statement to read the predicted event probabilities and the NOFIT and OUTROC= options in the MODEL statement. If the data= parameter is omitted, the data set that was last created is used. Running the macro does not change the last-created data set, which means that successive runs of the macro omitting data= can operate on the same input data set.

inpred=data-set-name

Specifies the name of a data set created by the OUT= and PRED= options in the OUTPUT statement of PROC LOGISTIC. This parameter is ignored unless options=optimal and pred= are also specified.

pred=variable-name

Specifies the name of the variable created by the PRED= option in the data set created by the OUT= option in the OUTPUT statement of PROC LOGISTIC. This parameter is ignored unless options=optimal and inpred= are also specified.

optvars=variable-list

Specifies the names of one or more variables in the data set created by the OUT= option in the OUTPUT statement of PROC LOGISTIC that will be used to identify optimal points on the PR curve. Separate multiple variable names with spaces. Variable lists, such as x1-x9, are not supported. This parameter is ignored unless options=optimal, inpred=, and pred= are also specified.

npoints=integer

Specifies a positive integer value representing the number of interpolated points between 0 and 1 that the macro will generate along the PR curve. The default value is npoints=100.

beta=value

Specifies a positive value representing the parameter of the F score optimality criterion. The value controls the relative importance of precision and recall in the F score; typical values are 0.5, 1, and 2. This parameter is ignored unless options=optimal is also specified. The default value is beta=1.

sensdelta=value

Specifies a small positive value that represents the minimum difference in sensitivity values between an interpolated point and a point from the data= data set. The default value is sensdelta=1e-10.

sensinc=value

Specifies a small positive value that slightly increases successive sensitivity values in the data= data set so that the area under the curve can be computed. The default value is sensinc=1e-14.

options=list-of-options

Specifies desired options separated by spaces. The default value is options=pprob area markers br ppvzero nooptimal. Valid options are as follows:

pprob | nopprob: Requests drawing a horizontal reference line at the observed, overall event proportion and displaying its value on the plot. This line represents the PR curve of a random, "no skill" model. Specifying nopprob omits the line and value.
area | noarea: Requests computation of the area below the PR curve and display of its value on the plot. Area computation requires SAS/ETS. Specify noarea to not compute or display the area.
tl | tr | bl | br: Specifies the corner of the plot (top left, top right, bottom left, or bottom right) in which to display the area and event proportion, if requested.
markers | nomarkers: Requests the display of markers at each of the threshold points provided in the data= data set. When there are a large number of threshold points, it might be desirable to specify nomarkers.
optimal | nooptimal: Requests computation of the F score and Matthews correlation coefficient (MCC) at each threshold point in the data= data set, and if optvars= is specified, it displays a table of the optimal values and corresponding values of the variables that are specified in optvars=. Specify nooptimal if computation of optimal points is not desired.
ppvzero | noppvzero: Requests that the PR curve include a point at (0,0) when a threshold point in the data= data set contains zero true positives and nonzero false positives. Specifying noppvzero ignores such points, which results in the PR curve being constant (flat) from recall=0 to the recall of the first threshold point in which the number of true positives exceeds zero, at the precision (PPV) of that point. The area, if requested, is affected by this choice.
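
To make the ppvzero choice concrete, the following is a minimal sketch in Python (not the macro's SAS code) of how precision and recall follow from the classification counts at each threshold. The function name, the tuple layout, and the sample counts are hypothetical; the only behavior taken from the description above is that a threshold with zero true positives and nonzero false positives either contributes the point (0,0) or is ignored.

```python
def pr_points(counts, ppvzero=True):
    """Return (recall, precision) pairs from per-threshold counts.

    counts: list of (tp, fp, fn) tuples, one per threshold (hypothetical
            layout; every tuple is assumed to have tp + fn > 0).
    ppvzero: if True, a threshold with tp == 0 and fp > 0 yields the
             point (0, 0); if False, such thresholds are skipped.
    Thresholds with no predicted events (tp + fp == 0) are always
    skipped, because precision is undefined there.
    """
    points = []
    for tp, fp, fn in counts:
        if tp + fp == 0:
            continue  # precision undefined: nothing was predicted as an event
        if tp == 0 and not ppvzero:
            continue  # noppvzero: ignore zero-precision points
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        points.append((recall, precision))
    return points


# Hypothetical counts for 10 events, ordered from strictest to loosest threshold.
counts = [(0, 2, 10), (3, 2, 7), (8, 12, 2), (10, 90, 0)]
with_zero = pr_points(counts, ppvzero=True)
without_zero = pr_points(counts, ppvzero=False)
print(with_zero[0])     # first point when ppvzero is in effect
print(without_zero[0])  # first point when noppvzero is in effect
```

With these counts the curve starts at (0, 0) under ppvzero, while under noppvzero the first retained point is (0.3, 0.6), so a curve drawn flat back to recall=0 at precision 0.6 encloses more area.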

DETAILS:

The Receiver Operating Characteristic (ROC) curve and the area beneath it (AUC-ROC) are perhaps the most commonly used graphic and statistic for assessing the performance of binary-response models or classifiers. However, as described by Saito and Rehmsmeier (2015) and others, the ROC curve and AUC-ROC can be misleading when the proportions of events and nonevents become very imbalanced, such as when there are very few observed events. They note that the statistics behind the ROC curve, sensitivity and specificity, are invariant to the degree of imbalance. When interest focuses on the model predicting a high proportion of true events among the predicted events, a plot involving that statistic, known as the precision or positive predictive value (PPV), can be more informative. The fact that the precision, unlike specificity, is sensitive to the degree of imbalance allows a plot of precision to more accurately reflect the measure of interest, regardless of the degree of imbalance. This is the advantage of the precision-recall (PR) curve and its area (AUC-PR).
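
The dependence of precision on imbalance can be checked with a little arithmetic. The sketch below, in Python and with a hypothetical function name, holds sensitivity and specificity fixed at 0.9 and derives the implied precision from Bayes' rule for several event proportions. The specific rates are illustrative values, not results from any data set.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision (PPV) implied by sensitivity, specificity, and event proportion."""
    true_pos = sensitivity * prevalence                     # P(predicted event, event)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)    # P(predicted event, nonevent)
    return true_pos / (true_pos + false_pos)


# Same classifier quality in ROC terms, increasingly rare events.
for prevalence in (0.5, 0.1, 0.01):
    ppv = precision_from_rates(0.9, 0.9, prevalence)
    print(f"event proportion {prevalence:.2f}: precision {ppv:.3f}")
```

The ROC point (sensitivity 0.9, 1-specificity 0.1) is identical in all three cases, yet precision falls from 0.90 to 0.50 to about 0.08 as events become rarer.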

Saito and Rehmsmeier (2015) provide a good comparison of ROC and PR curves and show that the PR curve gives a better assessment of model performance when the proportions of events and nonevents in the data are imbalanced.

Davis and Goadrich (2006) discuss how the ROC space and the PR space are related and particularly how interpolation must be done among the observed points in PR space. They note that, while linear interpolation can be used between points in ROC space, linear interpolation is not appropriate in PR space. The PRcurve macro applies their interpolation method when plotting the PR curve and computing the area under it.
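
As an illustration of why linear interpolation fails in PR space, here is a minimal Python sketch of the count-based interpolation that Davis and Goadrich describe: between two threshold points, false positives are assumed to grow linearly with true positives, and precision is recomputed from the interpolated counts. This is not the macro's code. The function name and the counts are hypothetical, and the sketch steps one true positive at a time rather than producing npoints= evenly spaced points.

```python
def interpolate_pr(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Interpolate PR points between thresholds A and B (requires tp_b > tp_a).

    For each extra true positive x = 1 .. tp_b - tp_a, false positives are
    assumed to grow linearly at the rate (fp_b - fp_a) / (tp_b - tp_a).
    Returns a list of (recall, precision) pairs ending at point B.
    """
    if tp_b <= tp_a:
        raise ValueError("tp_b must exceed tp_a")
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for x in range(1, tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + fp_per_tp * x
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Hypothetical data with 20 events: A has TP=5, FP=5; B has TP=10, FP=30.
for recall, precision in interpolate_pr(5, 5, 10, 30, total_pos=20):
    print(f"recall {recall:.2f}  precision {precision:.3f}")
```

Point A is (recall 0.25, precision 0.50) and point B is (0.50, 0.25). A straight line between them would give precision 0.35 at recall 0.40, whereas the count-based interpolation gives 8/28, about 0.286. Linear interpolation therefore overstates precision, and with it the area.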

The PRcurve macro plots precision against sensitivity (recall) and computes the area under the curve (AUC-PR). It can also find the points on the PR curve that maximize the F score and the Matthews correlation coefficient (MCC). When predicted event probabilities (from inpred= and pred=) and one or more predictor variables in the model (from optvars=) are specified, the macro displays a table of the optimal points based on these optimality criteria and the predictor values corresponding to those points.
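
For intuition about the AUC-PR value, the following Python sketch integrates a set of (recall, precision) points with the trapezoidal rule. The trapezoidal rule is an assumption made for illustration; how the macro itself computes the area with SAS/ETS is not described here. The rule is reasonable only when the points are dense enough that the curve is nearly linear between neighbors. The function name and the sample points are hypothetical.

```python
def trapezoid_area(points):
    """Area under the piecewise-linear curve through (recall, precision) points."""
    pts = sorted(points)  # order by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


perfect = [(0.0, 1.0), (1.0, 1.0)]       # precision 1 at every recall
no_skill = [(0.0, 0.05), (1.0, 0.05)]    # flat line at a 5% event proportion
print(trapezoid_area(perfect))
print(trapezoid_area(no_skill))
```

The flat no-skill curve has an area equal to the event proportion, which is why the reference line drawn by the pprob option is a natural baseline for judging AUC-PR.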

The F score combines the precision and recall measures. A parameter, β, can be specified for the F score to control the relative importance of precision and recall. At the default, β=1, the F score is the harmonic mean of the two and they are treated as equally important. When β > 1 (β=2 is commonly used), more importance is given to recall. When β < 1 (β=0.5 is commonly used), precision is more important. The F score ranges between 0 and 1 and equals 1 when precision and recall both equal 1. MCC is considered by some to be a better statistic. Unlike the F score, it uses all four cells of the classification table, including the true negatives, and it is equivalent to the phi coefficient. It ranges from -1 to 1 and equals 1 when all observations are correctly classified.
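
The two criteria can be written directly in terms of the classification counts. The Python sketch below uses the standard definitions of the F score and MCC and then picks the threshold that maximizes each. It is not the macro's code; the function names, the row layout, and the counts (10 events, 990 nonevents) are hypothetical.

```python
import math


def f_score(tp, fp, fn, beta=1.0):
    """F score from counts: (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP); 0.0 if undefined."""
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    return (1.0 + b2) * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (phi coefficient); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical rows: (threshold, tp, fp, fn, tn).
rows = [
    (0.90, 2, 1, 8, 989),
    (0.50, 6, 6, 4, 984),
    (0.20, 9, 20, 1, 970),
    (0.05, 10, 300, 0, 690),
]

best_f1 = max(rows, key=lambda r: f_score(r[1], r[2], r[3], beta=1.0))
best_f2 = max(rows, key=lambda r: f_score(r[1], r[2], r[3], beta=2.0))
best_mcc = max(rows, key=lambda r: mcc(r[1], r[2], r[3], r[4]))

print("best threshold by F1: ", best_f1[0])
print("best threshold by F2: ", best_f2[0])
print("best threshold by MCC:", best_mcc[0])
```

In this example the F score with β=1 and MCC both select the threshold 0.50, while β=2 moves the optimum to 0.20, where recall is higher at the cost of more false positives.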

BY-group processing

While the PRcurve macro does not directly support BY-group processing, this capability can be provided by the RunBY macro, which can run the PRcurve macro repeatedly for each of the BY groups in your data. See the RunBY macro documentation (SAS Note 66249) for details about its use. Also see the example titled "BY-group processing" on the Results tab.

Output data sets

The macro creates up to two data sets:

_PR: Contains the points on the PR curve including both those from the data= data set and the interpolated points that are generated by the macro.
_PROpt: Produced if options=optimal, inpred=, pred=, and optvars= are all specified. Contains the optimal points on the PR curve. For each, the associated values of the optimality criteria and of the variables specified in optvars= are given. When the data= data set is a CTABLE data set, the value of the threshold in the inpred= data set that most closely matches the threshold of the optimal point from the data= data set is also given.
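
The threshold matching described for CTABLE input amounts to a nearest-value lookup. The following Python sketch shows one way to do it, skipping missing predicted probabilities. The function name and the sample values are hypothetical, and the macro's own matching logic might differ in details such as tie handling.

```python
import math


def closest_prediction(predictions, target):
    """Return the non-missing predicted probability nearest to target."""
    valid = [p for p in predictions if p is not None and not math.isnan(p)]
    if not valid:
        raise ValueError("no non-missing predictions")
    return min(valid, key=lambda p: abs(p - target))


# Hypothetical predicted probabilities, including two missing values.
preds = [0.03, 0.18, None, 0.47, 0.52, float("nan"), 0.91]
print(closest_prediction(preds, target=0.5))
```

Here the optimal threshold 0.5 does not occur among the predicted probabilities, so the nearest available value, 0.52, is reported instead.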

MISSING VALUES:

Missing values should not appear in the data= data set. Missing values are tolerated in the inpred= data set and pred= variable and do not affect the results.

SEE ALSO:

The ROC curve, the area under the ROC curve, and comparisons of ROC curves and their areas are available in PROC LOGISTIC using the ROC and ROCCONTRAST statements. See the example in the LOGISTIC documentation. Comparison of independent ROC curves, such as curves fit to independent samples, is described in SAS Note 45339. Computation of optimal values on the ROC curve, using various optimality criteria, can be done using the ROCPLOT macro (SAS Note 25018).

REFERENCES:

Davis, J., and Goadrich, M.H. 2006. "The Relationship between Precision-Recall and ROC Curves." Proceedings of the 23rd International Conference on Machine Learning.

Fawcett, T. 2004. ROC Graphs: Notes and Practical Considerations for Researchers. Norwell, MA: Kluwer Academic Publishers.

Saito, T., and Rehmsmeier, M. 2015. "The Precision-Recall Plot Is More Informative Than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLoS ONE 10(3): e0118432.

Williams, C.K.I. 2021. "The Effect of Class Imbalance on Precision-Recall Curves." Neural Computation 33(4): 853–857.

These sample files and code examples are provided by SAS Institute Inc. "as is" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose. Recipients acknowledge and agree that SAS Institute shall not be liable for any damages whatsoever arising out of their use of this material. In addition, SAS Institute will provide no support for the materials contained herein.

Type:	Sample
Topic:	Analytics ==> Categorical Data Analysis; Analytics ==> Regression; Analytics ==> Statistical Graphics

Date Modified:	2023-09-07 18:30:36
Date Created:	2021-06-23 14:51:48

Product Family	Product	Host	SAS Release (Starting)	SAS Release (Ending)
SAS System	N/A	Aster Data nCluster on Linux x64
		DB2 Universal Database on AIX
		DB2 Universal Database on Linux x64
		Netezza TwinFin 32-bit SMP Hosts
		Netezza TwinFin 32bit blade
		Netezza TwinFin 64-bit S-Blades
		Netezza TwinFin 64-bit SMP Hosts
		Teradata on Linux
		Cloud Foundry
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for AArch64
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX
		z/OS
		z/OS 64-bit
		IBM AS/400
		OpenVMS VAX
		N/A
		Android Operating System
		Apple Mobile Operating System
		Chrome Web Browser
		Macintosh
		Macintosh on x64
		Microsoft Windows 10
		Microsoft Windows 7
		Microsoft Windows 8 Enterprise 32-bit
		Microsoft Windows 8 Enterprise x64
		Microsoft Windows 8 Pro 32-bit
		Microsoft Windows 8 Pro x64
		Microsoft Windows 8 x64
		Microsoft Windows Server 2008 R2
		Microsoft Windows Server 2012 R2 Datacenter
		Microsoft Windows Server 2012 R2 Std
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		SAS Cloud
		Microsoft Windows 8.1 Enterprise 32-bit
		Microsoft Windows 8.1 Enterprise x64
		Microsoft Windows 8.1 Pro 32-bit
		Microsoft Windows 8.1 Pro x64
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows Server 2003 for x64
		Microsoft Windows Server 2008
		Microsoft Windows Server 2008 for x64
		Microsoft Windows Server 2012 Datacenter
		Microsoft Windows Server 2012 Std
		Microsoft Windows Server 2016
		Microsoft Windows Server 2019
		Microsoft Windows XP Professional
		Windows 7 Enterprise 32 bit
		Windows 7 Enterprise x64
		Windows 7 Home Premium 32 bit
		Windows 7 Home Premium x64
		Windows 7 Professional 32 bit
		Windows 7 Professional x64
		Windows 7 Ultimate 32 bit
		Windows 7 Ultimate x64
		Windows Millennium Edition (Me)
		Windows Vista
		Windows Vista for x64
