Usage Note 32304: Missing predicted values when scoring new data using a fitted model

SAS® modeling procedures provide several ways to score new observations using a fitted model, as described in this note. When scoring new data, the predicted value for an observation will be missing (see NOTE 1) if any of the following conditions occurs:

  • The value of any predictor is missing. Predictor variables are typically specified in a MODEL statement but, depending on the procedure, might be specified in other statements such as VAR, RANDOM, ZEROMODEL, DISPMODEL, or others. All predictor variables must be nonmissing in order to compute a predicted value. Otherwise, predicted values and various other computed statistics in the output data set will be missing. This condition is illustrated in the example below. The ADAPTIVEREG and PLS procedures are exceptions. PROC ADAPTIVEREG can provide predicted values when predictors are missing. The MISSING= option can be used in PROC PLS to impute missing predictors so that predicted values can be computed. In procedures that provide effect selection methods (forward, stepwise, and others), a missing value in a candidate variable will not cause the predicted value to be missing if that variable is not included in the final model. In many situations, the imputation methods available in the MI procedure can be used to replace missing predictor values.
  • The value is missing for any variable in a statement (such as CLASS, STRATA, REPEATED, or other) that further defines the model. As with model predictors, variables in such statements must also be nonmissing. In some cases, the procedure cannot fit the model when missing values exist. Generally, observations with missing CLASS variable values are ignored by modeling procedures when fitting the model, and many procedures (see NOTE 2) also do not compute predicted values for such observations. Therefore, it is best practice not to specify variables in the CLASS statement unless they are also specified in another model-defining statement such as the MODEL, ZEROMODEL, REPEATED, or other statement.
  • The value of any CLASS predictor in the data set being scored does not appear in the set of observations that was used to fit (train) the model. The model only has parameters for the values of a CLASS predictor that appear in the data set that trains the model. A new CLASS predictor value has no model parameter corresponding to it, so a predicted value cannot be computed. This condition is illustrated in the example below.
  • The value of the OFFSET= variable is missing, if an OFFSET= option is available in the procedure and is specified. An offset variable is just another predictor in the model and therefore must be nonmissing in order to compute a predicted value.
  • The predicted value is invalid. This issue can occur when the model should produce values in a restricted range. For example, predicted values from a logistic or probit model should be estimates of binomial means and therefore be between 0 and 1. In some cases, usually resulting from an inappropriate model specification or pathological data, the predicted value might fall outside the valid range and be reported as a missing value. Check the log for messages indicating problems.
  • For nonparametric modeling procedures GAM and LOESS, predicted values cannot be computed for any data point that is not within the range of predictor values found in the data used to fit (train) the model. This is an issue when using the SCORE statement. For observations in the SCORE DATA= data set that are outside this range, predicted values are set to missing. See the Extrapolation section of this note for spline models in other procedures created using the EFFECT statement.
  • For survival analysis procedures LIFEREG and PHREG, the survival or cumulative distribution function estimate is missing if the observed response is missing.
  • In PROC DISCRIM, a data set specified in the TESTDATA= option can be scored using the TESTOUT= option. The predicted classification variable in the TESTOUT= data set, _INTO_, will be missing for an observation if the THRESHOLD= option is specified and the largest posterior probability for the observation is less than the THRESHOLD= value.

Example

To illustrate, consider a study of the analgesic effects of treatments on elderly patients with neuralgia. Two test treatments and a placebo are compared. The presence or absence of pain is recorded. The probability of pain is to be modeled using logistic regression.

Researchers recorded age and gender of the patients and the duration of complaint before the treatment began. The training data consisting of 60 patients are contained in the data set Neuralgia. The binary variable Pain is the response variable. A specification of Pain=Yes indicates that pain was present, and Pain=No indicates no pain. The variable Treatment is a categorical variable with three levels: A and B represent the two test treatments, and P represents the placebo treatment. The variable Age is the age of the patients, in years, when treatment began.

      /* Training data set */
      Data Neuralgia;
         input Treatment $ Sex $ Age Duration Pain $ @@;
         datalines;
      P  F  68   1  No   B  M  74  16  No  P  F  67  30  No
      P  M  66  26  Yes  B  F  67  28  No  B  F  77  16  No
      A  F  71  12  No   B  F  72  50  No  B  F  76   9  Yes
      A  M  71  17  Yes  A  F  63  27  No  A  F  69  18  Yes
      B  F  66  12  No   A  M  62  42  No  P  F  64   1  Yes
      A  F  64  17  No   P  M  74   4  No  A  F  72  25  No
      P  M  70   1  Yes  B  M  66  19  No  B  M  59  29  No
      A  F  64  30  No   A  M  70  28  No  A  M  69   1  No
      B  F  78   1  No   P  M  83   1  Yes B  F  69  42  No
      B  M  75  30  Yes  P  M  77  29  Yes P  F  79  20  Yes
      A  M  70  12  No   A  F  69  12  No  B  F  65  14  No
      B  M  70   1  No   B  M  67  23  No  A  M  76  25  Yes
      P  M  78  12  Yes  B  M  77   1  Yes B  F  69  24  No
      P  M  66   4  Yes  P  F  65  29  No  P  M  60  26  Yes
      A  M  78  15  Yes  B  M  75  21  Yes A  F  67  11  No
      P  F  72  27  No   P  F  70  13  Yes A  M  75   6  Yes
      B  F  65   7  No   P  F  68  27  Yes P  M  68  11  Yes
      P  M  67  17  Yes  B  M  70  22  No  A  M  65  15  No
      P  F  67   1  Yes  A  M  67  10  No  P  F  72  11  Yes
      A  F  74   1  No   B  M  80  21  Yes A  F  69   3  No
      ;

The following validation data set will be used in the SCORE statement in PROC LOGISTIC to obtain predicted probabilities for the specified combinations of Treatment and Age. The first three observations have nonmissing values of Age and use Treatments A, B, and P, all of which appear in the training data set. The fourth observation, however, contains a missing value (.) for Age in Treatment A. The fifth observation has a nonmissing value of Age, but its treatment, Z, does not appear in the training data set.

      /* Validation data set */
      Data Validate;
         input Treatment $ Age;
         datalines;
      A 65
      B 72
      P 80
      A .
      Z 68
      ;

The following statements train the logistic model using the training data set and then score the validation data set. The EVENT="No" option specifies that the probability of Pain=No is to be modeled.

      proc logistic data=Neuralgia;
         class Treatment;
         model Pain (event="No") = Treatment Age;
         score data=Validate out=Preds;
         run;

Notice that predictions are given for the first three observations in the validation data set, but not for the fourth because of the missing value of Age, and not for the fifth because the Treatment value does not appear in the training data set.

      proc print data=Preds;
         id Treatment Age;
         run;
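
Before turning to the fitted model, it can help to check the scoring data directly. The following is a minimal sketch, not part of the original example, that assumes the Neuralgia and Validate data sets created above. It flags each validation observation that has a missing predictor or a Treatment level that does not occur in the training data. The data set name ScoreCheck and the variable name Reason are hypothetical names introduced only for illustration.

```sas
/* Sketch: list each observation in Validate with the reason (if any)   */
/* that it cannot be scored. ScoreCheck and Reason are illustrative     */
/* names, not part of the original example.                             */
proc sql;
   create table ScoreCheck as
   select v.Treatment,
          v.Age,
          case
             when missing(v.Age)       then 'Missing predictor: Age'
             when missing(v.Treatment) then 'Missing predictor: Treatment'
             /* No match in the training levels: t.Treatment is missing */
             when missing(t.Treatment) then 'CLASS level not in training data'
             else 'OK'
          end as Reason
   from Validate as v
        left join
        (select distinct Treatment from Neuralgia) as t
        on v.Treatment = t.Treatment;
quit;

proc print data=ScoreCheck;
   id Treatment Age;
run;
```

Under these assumptions, the observation with the missing Age and the observation with Treatment=Z should be the only ones not marked OK, which matches the two missing predictions in the Preds data set. Note that the join does not guarantee that the rows keep their original order.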

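To see why this occurs, it helps to know what the fitted model is. The table of parameter estimates from the trained model gives an intercept of 18.5356, Treatment effects of 0.7033 for A and 1.2759 for B, and an Age coefficient of -0.2581.

From these estimates, the model can be written as follows: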

Logit(p) = 18.5356 + 0.7033*TA + 1.2759*TB - 0.2581*Age ,

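where Logit(p) is the log odds of Pain=No, that is, log(Pr(Pain=No)/Pr(Pain=Yes)). TA and TB are design variables representing the CLASS predictor, Treatment, and are coded as shown in the "Class Level Information" table produced by PROC LOGISTIC. The first Design Variable column is TA, and the second column is TB. Under this coding, Treatment A is represented by TA=1 and TB=0, Treatment B by TA=0 and TB=1, and Treatment P by TA=-1 and TB=-1.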

Using the model, the first observation in the Validate data set can be scored as follows. From the "Class Level Information" table, Treatment=A is represented in the model by TA=1 and TB=0.

Logit(p) = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*65 = 2.4624

The probability of Pain=No can be obtained from the logit by the following transformation:

Pr(Pain=No) = 1 / (1+exp(-logit))

For the first observation, the predicted probability of Pain=No is 1 / (1+exp(-2.4624)) = 0.9215 and therefore the predicted probability of Pain=Yes is 1-0.9215 = 0.0785. (The slight difference from the SAS results is due to using rounded values here. The results from PROC LOGISTIC are more precise.)

For observation 2:

Logit(p) = 18.5356 + 0.7033*0 + 1.2759*1 - 0.2581*72 = 1.2283,
Pr(Pain=No) = 0.7735 and Pr(Pain=Yes) = 0.2265 .

For observation 3:

Logit(p) = 18.5356 + 0.7033*(-1) + 1.2759*(-1) - 0.2581*80 = -4.0916,
Pr(Pain=No) = 0.0164 and Pr(Pain=Yes) = 0.9836.

For the fourth observation:

Logit(p) = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*.

Because the value of Age is missing, the model equation is incomplete and the logit and predicted probabilities cannot be computed. Simply ignoring the Age term and computing the logit as 18.5356 + 0.7033*1 + 1.2759*0 is not valid, because doing so is equivalent to setting Age=0, which is almost certainly not intended.

For the fifth observation:

Logit(p) = 18.5356 + 0.7033*. + 1.2759*. - 0.2581*68

Because Treatment Z does not appear in the training data set, there are no corresponding values of the design variables, TA and TB, so again the model equation is incomplete and the logit and predicted probabilities cannot be computed. Simply ignoring the two treatment terms and computing the logit as 18.5356 - 0.2581*68 is not valid, because doing so is equivalent to setting TA=TB=0, which represents no known Treatment. The only valid treatments are coded as shown in the "Class Level Information" table.
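
The hand calculations above can be reproduced in a DATA step. The following is a minimal sketch that assumes the Validate data set created earlier and uses the rounded parameter estimates from the model equation. It is not how PROC LOGISTIC scores data internally. The data set name HandScore and the variable names TA, TB, Logit, P_No, and P_Yes are illustrative names chosen here.

```sas
/* Sketch: score Validate by hand using the rounded estimates.          */
/* HandScore, Logit, P_No, and P_Yes are illustrative names.            */
data HandScore;
   set Validate;

   /* Design variables for Treatment, coded as in the worked examples.  */
   /* A level not seen in training has no coding, so TA and TB are      */
   /* left missing.                                                     */
   select (Treatment);
      when ('A') do; TA =  1; TB =  0; end;
      when ('B') do; TA =  0; TB =  1; end;
      when ('P') do; TA = -1; TB = -1; end;
      otherwise  do; TA =  .; TB =  .; end;
   end;

   /* A missing value in any term propagates to the result */
   Logit = 18.5356 + 0.7033*TA + 1.2759*TB - 0.2581*Age;
   P_No  = 1 / (1 + exp(-Logit));   /* Pr(Pain=No)  */
   P_Yes = 1 - P_No;                /* Pr(Pain=Yes) */
run;

proc print data=HandScore;
   id Treatment Age;
   var TA TB Logit P_No P_Yes;
   format Logit P_No P_Yes 8.4;
run;
```

For the first three observations, the results should agree with the hand-computed values above (and with the PROC LOGISTIC results, apart from rounding). For the fourth and fifth observations, Logit, P_No, and P_Yes are missing because a missing value in any term of the expression makes the entire result missing. The SAS log typically reports that missing values were generated by operations on missing values.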

__________

NOTE 1: Some previous problems caused predicted values to incorrectly be set to missing.

  • In PROC LOGISTIC prior to SAS 9.4, if the FORMAT statement is used and appears before the SCORE statement, then missing predicted values appear in the SCORE OUT= data set. Move the FORMAT statement to follow the SCORE statement to avoid this problem.
  • The problem described in this note affects PROC GENMOD prior to SAS 9.4 TS1M2.
  • In PROC QUANTSELECT prior to SAS 9.4 TS1M3, a missing value in a predictor that is not included in the final model causes the predicted value to be missing.

NOTE 2: GLM, GENMOD, PROBIT, PHREG, LIFEREG, QUANTREG, QUANTSELECT, ROBUSTREG, SURVEYREG, SURVEYPHREG, HPLOGISTIC, HPMIXED, COUNTREG, QLIM, and possibly others.


