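Values of the predicted probability columns (variables with prefix P_) in the SCORE OUT= data set of PROC LOGISTIC will be missing for an observation when the fitted model equation cannot be evaluated for it, for example when a predictor in the model has a missing value in that observation, or when a CLASS variable has a level that did not appear in the training data set.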
To illustrate, consider a study of the analgesic effects of treatments on elderly patients with neuralgia. Two test treatments and a placebo are compared. The response variable is whether the patient reported pain or not. Researchers recorded age and gender of the patients and the duration of complaint before the treatment began. The data, consisting of 60 patients, are contained in the data set Neuralgia. The variable Pain is the response variable. A specification of Pain=Yes indicates there was pain, and Pain=No indicates no pain. The variable Treatment is a categorical variable with three levels: A and B represent the two test treatments, and P represents the placebo treatment. The variable Age is the age of the patients, in years, when treatment began.
/* Training data set */
Data Neuralgia;
input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
A F 71 12 No B F 72 50 No B F 76 9 Yes
A M 71 17 Yes A F 63 27 No A F 69 18 Yes
B F 66 12 No A M 62 42 No P F 64 1 Yes
A F 64 17 No P M 74 4 No A F 72 25 No
P M 70 1 Yes B M 66 19 No B M 59 29 No
A F 64 30 No A M 70 28 No A M 69 1 No
B F 78 1 No P M 83 1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No A F 69 12 No B F 65 14 No
B M 70 1 No B M 67 23 No A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes B F 69 24 No
P M 66 4 Yes P F 65 29 No P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No P F 70 13 Yes A M 75 6 Yes
B F 65 7 No P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No A M 65 15 No
P F 67 1 Yes A M 67 10 No P F 72 11 Yes
A F 74 1 No B M 80 21 Yes A F 69 3 No
;
The following validation data set is used in the SCORE statement to obtain predicted probabilities for the specified combinations of Treatment and Age. Notice that the first three observations use Treatments A, B, and P, all of which appeared in the training data set, Neuralgia, and have nonmissing values of Age. However, observation 4 contains a missing value (.) for Age in Treatment A. In observation 5, a nonmissing value of Age is specified, but the specified treatment, Z, is not one that appeared in the training data set.
/* Validation data set */
Data Validate;
input Treatment $ Age;
datalines;
A 65
B 72
P 80
A .
Z 68
;
proc print data=Validate;
run;
The following statements train the model using the training data set, Neuralgia, and then score the Validate data set. The EVENT="No" option specifies that the probability of Pain=No is to be modeled.
proc logistic data=Neuralgia;
class Treatment;
model Pain (event="No") = Treatment Age;
score data=Validate out=Preds;
run;
Notice that predictions are given for the first three observations, but not for the fourth because of the missing value of Age, and not for the fifth because the Treatment value didn't appear in the training data set.
proc print data=Preds;
run;
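To list only the observations that could not be scored, the Preds data set can be filtered on a missing predicted probability. The following is a minimal sketch, assuming the predicted-probability columns are named P_No and P_Yes (the P_ prefix followed by the response level); check the column names in your own output before relying on it.

```sas
/* List scored observations whose predicted probability is missing.   */
/* Assumption: the columns are named P_No and P_Yes (P_ plus level).  */
proc print data=Preds;
   where missing(P_No);
   var Treatment Age P_No P_Yes;
run;
```

With the Validate data set above, this should show the fourth and fifth observations.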
To see why this occurs, it helps to know what the fitted model is. Following is the table of parameter estimates from the trained model:
| Parameter | Level | Estimate |
| Intercept | | 18.5356 |
| Treatment | A | 0.7033 |
| Treatment | B | 1.2759 |
| Age | | -0.2581 |
From this table, the model can be written as follows:
Logit(p) = 18.5356 + 0.7033*TA + 1.2759*TB - 0.2581*Age
where Logit(p) is the log odds of Pain=No. TA and TB are design variables representing the CLASS predictor, Treatment, and are coded as shown in the "Class Level Information" table below. The first Design Variable column is TA, the second column is TB.
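| Class | Value | Design Variable 1 (TA) | Design Variable 2 (TB) |
| Treatment | A | 1 | 0 |
| Treatment | B | 0 | 1 |
| Treatment | P | -1 | -1 |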
Using the model, the first observation in the Validate data set can be scored as follows. From the "Class Level Information" table, Treatment=A is represented in the model by TA=1 and TB=0.
Logit(p) = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*65 = 2.4624
The probability of Pain=No can be obtained from the logit by the following transformation:
Pr(Pain=No) = 1 / (1+exp(-logit))
For the first observation, the predicted probability of Pain=No is 1 / (1+exp(-2.4624)) = 0.9215 and therefore the predicted probability of Pain=Yes is 1-0.9215 = 0.0785. (The slight difference from the SAS results is due to using rounded values here. The results from PROC LOGISTIC are more precise.)
For observation 2:
Logit(p) = 18.5356 + 0.7033*0 + 1.2759*1 - 0.2581*72 = 1.2283
Pr(Pain=No) = 0.7735 and Pr(Pain=Yes) = 0.2265.
For observation 3:
Logit(p) = 18.5356 + 0.7033*(-1) + 1.2759*(-1) - 0.2581*80 = -4.0916
Pr(Pain=No) = 0.0164 and Pr(Pain=Yes) = 0.9836.
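The hand calculations for the first three observations can be reproduced in a DATA step. The following is a sketch only, using the rounded estimates quoted in this note; the data set name HandScore and the variables TA, TB, Logit, P_No_hand, and P_Yes_hand are names chosen here for illustration and are not produced by PROC LOGISTIC.

```sas
/* Hand-score the first three Validate observations with rounded estimates */
data HandScore;
   input Treatment $ Age;
   /* Design variables as given in the "Class Level Information" table */
   select (Treatment);
      when ('A') do; TA =  1; TB =  0; end;
      when ('B') do; TA =  0; TB =  1; end;
      when ('P') do; TA = -1; TB = -1; end;
      otherwise  do; TA =  .; TB =  .; end;   /* level not in training data */
   end;
   Logit      = 18.5356 + 0.7033*TA + 1.2759*TB - 0.2581*Age;
   P_No_hand  = 1 / (1 + exp(-Logit));   /* Pr(Pain=No)  */
   P_Yes_hand = 1 - P_No_hand;           /* Pr(Pain=Yes) */
   datalines;
A 65
B 72
P 80
;

proc print data=HandScore;
   format Logit P_No_hand P_Yes_hand 8.4;
run;
```

The printed logits should match the values 2.4624, 1.2283, and -4.0916 computed above, with probabilities that agree with the PROC LOGISTIC results to within the rounding of the estimates.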
For the fourth observation:
Logit(p) = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*.
Because the value of Age is missing, the model equation is incomplete and the logit and predicted probabilities cannot be computed. Note that simply ignoring the Age term in the model and computing the logit as 18.5356 + 0.7033*1 + 1.2759*0 is not valid, because this is equivalent to setting Age=0, which is almost certainly not intended.
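The same behavior can be seen in ordinary DATA step arithmetic, where an arithmetic operator applied to a missing value returns a missing result, while the SUM function skips missing arguments. The short sketch below contrasts the two; the variable names Logit_op and Logit_sum are illustrative only.

```sas
/* Arithmetic operators propagate a missing value; SUM() skips it */
data _null_;
   Age = .;
   /* Result is missing because Age is missing */
   Logit_op  = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*Age;
   /* SUM() drops the missing Age term, which silently treats Age as 0 */
   Logit_sum = sum(18.5356, 0.7033*1, 1.2759*0, -0.2581*Age);
   put Logit_op= Logit_sum=;
run;
```

The SUM version returns a number, but that number is the logit for a patient of age 0, which is why dropping the term is not a valid way to score the observation.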
For the fifth observation:
Logit(p) = 18.5356 + 0.7033*. + 1.2759*. - 0.2581*68
Because Treatment Z does not appear in the training data set, there are no corresponding values of the design variables, TA and TB, so again the model equation is incomplete and the logit and predicted probabilities cannot be computed. Simply ignoring the two treatment terms and computing the logit as 18.5356 - 0.2581*68 is not valid, because this is equivalent to setting TA=TB=0, which represents no known Treatment. The only valid treatments are coded as shown in the "Class Level Information" table.
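Both problems can be detected before scoring by comparing the data set to be scored against the training data. The following PROC SQL step is one possible check, written for the Treatment and Age predictors used in this example; it is a sketch and would need to be extended for other predictors.

```sas
/* List Validate observations that the fitted model cannot score:  */
/* Age is missing, or Treatment is a level not seen in Neuralgia.  */
proc sql;
   select *
   from Validate
   where Age is missing
      or Treatment not in (select distinct Treatment from Neuralgia);
quit;
```

For the Validate data set in this note, the query should return the observation with the missing Age and the observation with Treatment Z.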
| Type: | Usage Note |
| Topic: | SAS Reference ==> Procedures ==> LOGISTIC Analytics ==> Categorical Data Analysis |
| Date Modified: | 2008-06-02 14:25:18 |
| Date Created: | 2008-06-02 12:17:42 |



