"Rank Correlation of Observed Responses and Predicted Probabilities" in the Details section of the PROC LOGISTIC documentation describes the binning of predicted probabilities and how pairs of observations are determined to be concordant, discordant, or tied. From these, the association statistics — Somers' D (Gini coefficient), gamma, tau-a, and c (the concordance index and area under the ROC curve) — can be computed using the formulas shown in the documentation.
Note that binning the predicted probabilities is more efficient and reduces execution time for large data sets, but it produces a rougher approximation to these statistics. Binning can be turned off by specifying the BINWIDTH=0 option in the MODEL statement, or by specifying any of the other options and statements that the PROC LOGISTIC documentation lists as disabling binning.
A more accurate approximation of the association statistics, such as the area under the ROC curve (c statistic), is obtained by using any of the above to turn off binning.
Note that, beginning in SAS 9.4 TS1M3, no binning is done if the response is binary and there are fewer than 5,000,000 observations in the input data set.
The following example uses the described method to optionally bin the predicted probabilities and compute the association statistics. The code shown applies only to binary response models.
These statements produce an example data set for which the association statistics will be computed.
data a;
   input Seq Score Outcome;
   datalines;
1 1525 0
2 1641 0
3 1706 0
4 1722 0
5 1738 1
6 1758 1
7 1770 0
8 1775 0
9 1798 0
10 1839 1
11 1842 0
12 1848 0
13 1848 0
14 1856 0
15 1864 0
16 1864 0
17 1879 0
18 1895 0
19 1909 0
20 1917 0
21 1944 0
22 1975 0
23 2002 0
;
These statements fit a binary logistic model to the OUTCOME variable. The EVENT="1" response variable option ensures that the probability of OUTCOME=1 is modeled. Since the BINWIDTH= option is not specified, the default bin width of 1/500 = 0.002 is used in computing the association statistics. The predicted probabilities computed by the PREDPROBS=INDIVIDUAL option are not binned and are saved in data set OUT.
proc logistic data=a;
   model outcome(event="1") = score;
   output out=out predprobs=individual;
run;
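Before the probabilities are paired, it can help to look at what the OUTPUT statement saved. The following is a minimal sketch that lists data set OUT. With PREDPROBS=INDIVIDUAL, the predicted probabilities are stored in variables named IP_ followed by the response level, so IP_1 is the predicted probability of OUTCOME=1. This is the variable that the CONCDISC macro later refers to as ip_&event.

```sas
/* List the scored observations; IP_1 is the predicted probability of OUTCOME=1 */
proc print data=out noobs;
   var Seq Score Outcome IP_1 IP_0;
run;
```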
The resulting "Association of Predicted Probabilities and Observed Responses" table from the model fit is shown below.
[Output table: Association of Predicted Probabilities and Observed Responses, default bin width]
By specifying the BINWIDTH=0 option (or any of the other options or statements mentioned above), binning is turned off.
proc logistic data=a;
   model outcome(event="1") = score / binwidth=0;
run;
Following is the table of statistics when the predicted probabilities are not binned.
[Output table: Association of Predicted Probabilities and Observed Responses, no binning]
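To compare the statistics from PROC LOGISTIC with recomputed values programmatically rather than by eye, the table can be captured in a data set. The sketch below assumes the ODS table name Association for this table. The data set name ASSOC_NOBIN is only an illustrative choice.

```sas
/* Save the association table from the unbinned fit to a data set */
ods output Association=assoc_nobin;   /* ASSOC_NOBIN is a hypothetical name */
proc logistic data=a;
   model outcome(event="1") = score / binwidth=0;
run;

proc print data=assoc_nobin noobs;
run;
```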
The following statements define the macro CONCDISC which applies the binning method described in the LOGISTIC documentation. It creates the data set _PAIRS which contains an observation for each possible pair of event and nonevent observations and indicates whether each pair is concordant, discordant, or tied. The macro requires the data set of predicted probabilities, the name of the response variable, and the values of the event and nonevent levels of the response. If the BINWIDTH= option was not specified in the PROC LOGISTIC step, then it can be omitted when calling the macro. The macro will then use the same default bin width. Otherwise, specify the same value in the BINWIDTH= macro option as was specified in the PROC LOGISTIC step.
%macro concdisc(data=, event=, nonevent=, response=, binwidth=0.002);
   %global n;

   /* bin the predicted probabilities so that all values in a bin share one value */
   data _r&data;
      set &data;
      /* each macro branch emits an assignment without its semicolon; */
      /* the lone semicolon below terminates the generated statement  */
      %if &binwidth ne 0 %then pbin=floor((1/&binwidth)*ip_&event);
      %else pbin=ip_&event;
      ;
   run;

   proc sql;
      reset noprint;

      /* number of observations in original data set into macro variable N */
      select count(*) into :N from _r&data;

      /* create data set of nonevents */
      create table _nonevents as
         select pbin as pbin0 from _r&data where &response=&nonevent;

      /* create data set of events */
      create table _events as
         select pbin as pbin1 from _r&data where &response=&event;

      /* create data set of all event-nonevent pairs and determine concordance */
      create table _pairs as
         select *, (pbin1>pbin0) as conc,
                   (pbin1=pbin0) as tie,
                   (pbin0>pbin1) as disc
         from _nonevents, _events;
   quit;   /* end PROC SQL explicitly instead of relying on the next step */
%mend;
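The effect of the binning expression in the macro can be seen on a few made-up probabilities. The following sketch applies the same FLOOR computation with the default bin width of 0.002. The three values of P are illustrative and do not come from the fitted model.

```sas
/* Apply the macro's binning expression to three illustrative probabilities */
data _null_;
   binwidth = 0.002;
   do p = 0.1241, 0.1259, 0.1261;
      pbin = floor((1/binwidth)*p);   /* same expression as in CONCDISC */
      put p= pbin=;
   end;
run;
```

The first two values fall in the same bin (PBIN=62), so an event and a nonevent with these probabilities would be counted as a tied pair. The third value falls in the next bin (PBIN=63). With BINWIDTH=0 the raw probabilities are compared, and the first two values would no longer tie.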
This statement calls the CONCDISC macro using the default bin width of 0.002.
%concdisc(data=out, event=1, nonevent=0, response=outcome)
The following statements compute the proportions and counts of concordant, discordant, and tied pairs.
proc summary data=_pairs;
   var conc disc tie;
   output out=cd mean=propconc propdisc proptie sum=nc nd n=t;
run;
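As a sanity check, the total number of pairs T should equal the number of events multiplied by the number of nonevents. In data set A, three observations have OUTCOME=1 and twenty have OUTCOME=0, so 60 pairs are expected. A minimal sketch of that check, using only the example data set:

```sas
/* Count events and nonevents in A and the number of event-nonevent pairs */
proc sql;
   select sum(outcome=1) as Events,
          sum(outcome=0) as Nonevents,
          calculated Events * calculated Nonevents as Pairs
   from a;
quit;
```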
Finally, these statements use the formulas shown in the documentation to compute and display the association statistics.
data rankcorrs;
   set cd;
   Concordant=100*propconc;
   Discordant=100*propdisc;
   Tied=100*proptie;
   Pairs=t;
   SomersD=(nc-nd)/t;
   Gamma=(nc-nd)/(nc+nd);
   Tau_a=(nc-nd)/(0.5*&N*(&N-1));
   c=(nc + 0.5*(t-nc-nd))/t;
run;

proc print noobs;
   var Concordant Discordant Tied Pairs SomersD Gamma Tau_a c;
   format Concordant Discordant Tied Pairs 5.1 SomersD Gamma Tau_a c 6.3;
run;
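For reference, the assignments in the RANKCORRS step can be written as the following formulas. The notation is chosen here to mirror the variable names in the code and is not taken from the documentation: n_c and n_d are the numbers of concordant and discordant pairs, t is the total number of event-nonevent pairs, and N is the number of observations.

```latex
\begin{aligned}
\text{Somers' } D &= \frac{n_c - n_d}{t} \\
\gamma &= \frac{n_c - n_d}{n_c + n_d} \\
\tau_a &= \frac{n_c - n_d}{0.5\,N(N-1)} \\
c &= \frac{n_c + 0.5\,(t - n_c - n_d)}{t}
\end{aligned}
```

It follows from these expressions that D = 2c - 1, so Somers' D and c always carry the same information.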
Note that the recomputed association statistics match those produced by PROC LOGISTIC when the default binning was used.
[Output table: recomputed association statistics, default bin width of 0.002]
The association statistics resulting from not binning the predicted probabilities can be obtained by using the above code with the BINWIDTH=0 option in the CONCDISC macro.
%concdisc(data=out, event=1, nonevent=0, response=outcome, binwidth=0)
[Output table: recomputed association statistics, no binning]
Type: Usage Note
Topic: Analytics ==> Categorical Data Analysis; Analytics ==> Regression; SAS Reference ==> Procedures ==> LOGISTIC
Date Modified: 2016-02-10 13:56:51
Date Created: 2012-02-22 16:33:34