Usage Note 32304: Missing predicted values when scoring new data using a fitted model

SAS® modeling procedures provide several ways for scoring new observations using a fitted model as described in this note. When scoring new data, the predicted value for an observation will be missing (see Note 1) if any of the following conditions occurs:

The value of any predictor is missing. Predictor variables are typically specified in a MODEL statement, but depending on the procedure, may be specified in other statements such as VAR, RANDOM, ZEROMODEL, DISPMODEL, or others. All predictor variables must be nonmissing in order to compute a predicted value. Otherwise predicted values and various other computed statistics in the output data set will be missing. This condition is illustrated in the example below. The ADAPTIVEREG and PLS procedures are exceptions. PROC ADAPTIVEREG can provide predicted values when predictors are missing. The MISSING= option can be used in PROC PLS to impute missing predictors so that predicted values can be computed. In procedures that provide effect selection methods (forward, stepwise, and others), a missing value in a candidate variable will not cause the predicted value to be missing if that variable is not included in the final model. Imputation methods are available in the MI procedure that can be used in many situations to replace missing predictor values.
The value is missing for any variable in a statement (such as CLASS, STRATA, REPEATED, or other) that further defines the model. As with model predictors, variables in such statements must also be nonmissing. In some cases, the procedure cannot fit the model when missing values exist. Generally, observations with missing CLASS variable values are ignored by modeling procedures when fitting the model, and many procedures (see Note 2) also do not compute predicted values for such observations. Therefore, it is best practice not to specify variables in the CLASS statement unless they are also specified in another model-defining statement such as the MODEL, ZEROMODEL, REPEATED, or other statement.
The value of any CLASS predictor in the data set being scored does not appear in the set of observations that was used to fit (train) the model. The model only has parameters for the values of a CLASS predictor that appear in the data set that trains the model. A new CLASS predictor value has no model parameter corresponding to it, so a predicted value cannot be computed. This condition is illustrated in the example below.
The value of the OFFSET= variable is missing, if an OFFSET= option is available in the procedure and is specified. An offset variable is just another predictor in the model and therefore must be nonmissing in order to compute a predicted value.
The predicted value is invalid. This issue can occur when the model should produce values in a restricted range. For example, predicted values from a logistic or probit model should be estimates of binomial means and therefore be between 0 and 1. In some cases, usually resulting from an inappropriate model specification or pathological data, the predicted value might fall outside the valid range and be reported as a missing value. Check the log for messages indicating problems.
For nonparametric modeling procedures GAM and LOESS, predicted values cannot be computed for any data point that is not within the range of predictor values found in the data used to fit (train) the model. This is an issue when using the SCORE statement. For observations in the SCORE DATA= data set that are outside this range, predicted values are set to missing. See the Extrapolation section of this note for spline models in other procedures created using the EFFECT statement.
For survival analysis procedures LIFEREG and PHREG, the survival or cumulative distribution function estimate is missing if the observed response is missing.
In PROC DISCRIM, a data set specified in the TESTDATA= option can be scored using the TESTOUT= option. The predicted classification variable in the TESTOUT= data set, _INTO_, will be missing for an observation if the THRESHOLD= option is specified and the largest posterior probability for the observation is less than the THRESHOLD= value.

Example

To illustrate, consider a study of the analgesic effects of treatments on elderly patients with neuralgia. Two test treatments and a placebo are compared. The presence or absence of pain is recorded. The probability of pain is to be modeled using logistic regression.

Researchers recorded age and gender of the patients and the duration of complaint before the treatment began. The training data consisting of 60 patients are contained in the data set Neuralgia. The binary variable Pain is the response variable. A specification of Pain=Yes indicates that pain was present, and Pain=No indicates no pain. The variable Treatment is a categorical variable with three levels: A and B represent the two test treatments, and P represents the placebo treatment. The variable Age is the age of the patients, in years, when treatment began.

      /* Training data set */
      Data Neuralgia;
         input Treatment $ Sex $ Age Duration Pain $ @@;
         datalines;
      P  F  68   1  No   B  M  74  16  No  P  F  67  30  No
      P  M  66  26  Yes  B  F  67  28  No  B  F  77  16  No
      A  F  71  12  No   B  F  72  50  No  B  F  76   9  Yes
      A  M  71  17  Yes  A  F  63  27  No  A  F  69  18  Yes
      B  F  66  12  No   A  M  62  42  No  P  F  64   1  Yes
      A  F  64  17  No   P  M  74   4  No  A  F  72  25  No
      P  M  70   1  Yes  B  M  66  19  No  B  M  59  29  No
      A  F  64  30  No   A  M  70  28  No  A  M  69   1  No
      B  F  78   1  No   P  M  83   1  Yes B  F  69  42  No
      B  M  75  30  Yes  P  M  77  29  Yes P  F  79  20  Yes
      A  M  70  12  No   A  F  69  12  No  B  F  65  14  No
      B  M  70   1  No   B  M  67  23  No  A  M  76  25  Yes
      P  M  78  12  Yes  B  M  77   1  Yes B  F  69  24  No
      P  M  66   4  Yes  P  F  65  29  No  P  M  60  26  Yes
      A  M  78  15  Yes  B  M  75  21  Yes A  F  67  11  No
      P  F  72  27  No   P  F  70  13  Yes A  M  75   6  Yes
      B  F  65   7  No   P  F  68  27  Yes P  M  68  11  Yes
      P  M  67  17  Yes  B  M  70  22  No  A  M  65  15  No
      P  F  67   1  Yes  A  M  67  10  No  P  F  72  11  Yes
      A  F  74   1  No   B  M  80  21  Yes A  F  69   3  No
      ;

The following validation data set will be used in the SCORE statement in PROC LOGISTIC to obtain predicted probabilities for the specified combinations of Treatment and Age. Notice that the first three observations use Treatments A, B, and P, all of which appear in the training data set, and have nonmissing values of Age. However, the fourth observation contains a missing value (.) for Age in Treatment A. In the fifth observation, a nonmissing value of Age is specified, but the specified treatment, Z, is not one that appears in the training data set.

      /* Validation data set */
      Data Validate;
         input Treatment $ Age;
         datalines;
      A 65
      B 72
      P 80
      A .
      Z 68
      ;
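
Before scoring, it can help to check which observations are likely to receive missing predictions. The following is a minimal sketch, not part of the original note, that assumes the Neuralgia and Validate data sets above. The table name ScoreCheck and the flag names MissingAge and UnseenLevel are hypothetical.

```sas
/* Flag Validate observations that are likely to be left unscored.   */
/* ScoreCheck, MissingAge, and UnseenLevel are illustrative names.   */
proc sql;
   create table ScoreCheck as
   select v.Treatment,
          v.Age,
          /* 1 if the continuous predictor is missing */
          missing(v.Age) as MissingAge,
          /* 1 if no matching Treatment level exists in the training data */
          missing(t.Treatment) as UnseenLevel
   from Validate as v
        left join
        (select distinct Treatment from Neuralgia) as t
        on v.Treatment = t.Treatment;
quit;

proc print data=ScoreCheck;
run;
```

Under these assumptions, the observation with the missing Age would be flagged by MissingAge and the observation with Treatment=Z by UnseenLevel. PROC SQL does not guarantee the row order of the result.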

The following statements train the logistic model using the training data set and then score the validation data set. The EVENT="No" option specifies that the probability of Pain=No is to be modeled.

      proc logistic data=Neuralgia;
         class Treatment;                          /* effect coding is the default  */
         model Pain (event="No") = Treatment Age;  /* models Pr(Pain=No)            */
         score data=Validate out=Preds;            /* scored observations -> Preds  */
         run;

Notice that predictions are given for the first three observations in the validation data set, but not for the fourth because of the missing value of Age, and not for the fifth because the Treatment value does not appear in the training data set.

      proc print data=Preds;
         id Treatment Age;
         run;
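
To list only the observations that were not scored, the output data set can be filtered on the predicted probability. This is a sketch that assumes P_No and P_Yes are the names PROC LOGISTIC assigns to the predicted probabilities in the SCORE OUT= data set for this response; check the contents of Preds if the names differ.

```sas
/* Show only the scored observations whose predicted probability is missing. */
/* P_No and P_Yes are assumed names for the predicted probabilities.         */
proc print data=Preds;
   where missing(P_No);
   var Treatment Age P_No P_Yes;
run;
```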

To see why this occurs, it helps to know what the fitted model is. From the table of parameter estimates for the trained model, the model can be written as follows:

Logit(p) = 18.5356 + 0.7033*T_A + 1.2759*T_B - 0.2581*Age,

where Logit(p) is the log odds of Pain=No, that is, log(Pr(Pain=No)/Pr(Pain=Yes)). T_A and T_B are design variables representing the CLASS predictor, Treatment, and are coded as shown in the "Class Level Information" table that PROC LOGISTIC displays: Treatment=A is coded T_A=1, T_B=0; Treatment=B is coded T_A=0, T_B=1; and Treatment=P is coded T_A=-1, T_B=-1. The first Design Variable column is T_A, and the second column is T_B.

Using the model, the first observation in the Validate data set can be scored as follows. From the "Class Level Information" table, Treatment=A is represented in the model by T_A=1 and T_B=0.

Logit(p) = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*65 = 2.4624

The probability of Pain=No can be obtained from the logit by the following transformation:

Pr(Pain=No) = 1 / (1+exp(-logit))

For the first observation, the predicted probability of Pain=No is 1 / (1+exp(-2.4624)) = 0.9215 and therefore the predicted probability of Pain=Yes is 1-0.9215 = 0.0785. (The slight difference from the SAS results is due to using rounded values here. The results from PROC LOGISTIC are more precise.)

For observation 2:

Logit(p) = 18.5356 + 0.7033*0 + 1.2759*1 - 0.2581*72 = 1.2283,
Pr(Pain=No) = 0.7735 and Pr(Pain=Yes) = 0.2265.

For observation 3:

Logit(p) = 18.5356 + 0.7033*(-1) + 1.2759*(-1) - 0.2581*80 = -4.0916,
Pr(Pain=No) = 0.0164 and Pr(Pain=Yes) = 0.9836.
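
The three hand computations above can be reproduced in a DATA step. This is a sketch that uses the rounded estimates quoted in this note and enters the design-variable codes by hand. The data set name HandScore and the variable names logit, p_no, and p_yes are hypothetical.

```sas
/* Reproduce the hand-scored logits and probabilities for observations 1-3. */
/* Coefficients are the rounded estimates quoted in the text.               */
data HandScore;
   input Treatment $ Age T_A T_B;
   logit = 18.5356 + 0.7033*T_A + 1.2759*T_B - 0.2581*Age;
   p_no  = 1 / (1 + exp(-logit));   /* Pr(Pain=No)  */
   p_yes = 1 - p_no;                /* Pr(Pain=Yes) */
   format logit p_no p_yes 8.4;
   datalines;
A 65  1  0
B 72  0  1
P 80 -1 -1
;

proc print data=HandScore noobs;
run;
```

The printed values should agree with the hand-computed logits and probabilities shown above, and differ slightly from the PROC LOGISTIC results because of the rounded coefficients.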

For the fourth observation:

Logit(p) = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*.

Because the value of Age is missing, the model equation is incomplete, and the logit and predicted probabilities cannot be computed. Note that simply ignoring the Age term in the model and computing the logit as 18.5356 + 0.7033*1 + 1.2759*0 is not valid, because this is equivalent to setting Age=0, which is almost certainly not intended.
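
The same behavior can be seen in ordinary DATA step arithmetic. The sketch below is illustrative only and uses hypothetical names (MissingDemo, logit_wrong). It contrasts the arithmetic operators, which propagate a missing value, with the SUM function, which skips missing arguments and so silently drops the Age term.

```sas
/* Illustrate how a missing Age affects the fourth observation's logit. */
data MissingDemo;
   Age = .;
   /* Arithmetic on a missing operand yields a missing result */
   logit = 18.5356 + 0.7033*1 + 1.2759*0 - 0.2581*Age;
   p_no  = 1 / (1 + exp(-logit));
   /* SUM() ignores missing arguments: equivalent to setting Age=0 */
   logit_wrong = sum(18.5356, 0.7033*1, 1.2759*0, -0.2581*Age);
   put logit= p_no= logit_wrong=;
run;
```

Here logit and p_no are missing, and the log notes that missing values were generated, whereas logit_wrong is 19.2389, the invalid value that results from ignoring the Age term.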

For the fifth observation:

Logit(p) = 18.5356 + 0.7033*. + 1.2759*. - 0.2581*68

Because Treatment Z does not appear in the training data set, there are no corresponding values of the design variables, T_A and T_B, so again the model equation is incomplete and the logit and predicted probabilities cannot be computed. Simply ignoring the two treatment terms and computing the logit as 18.5356 - 0.2581*68 is not valid because this is equivalent to setting T_A=T_B=0 and this represents no known Treatment. The only valid treatments are coded as shown in the "Class Level Information" table.

__________

NOTE 1: Some previous problems caused predicted values to incorrectly be set to missing.

In PROC LOGISTIC prior to SAS 9.4, if the FORMAT statement is used and appears before the SCORE statement, then missing predicted values appear in the SCORE OUT= data set. Move the FORMAT statement to follow the SCORE statement to avoid this problem.
The problem described in this note affects PROC GENMOD prior to SAS 9.4 TS1M2.
In PROC QUANTSELECT prior to SAS 9.4 TS1M3, a missing value in a predictor that is not included in the final model causes the predicted value to be missing.

NOTE 2: GLM, GENMOD, PROBIT, PHREG, LIFEREG, QUANTREG, QUANTSELECT, ROBUSTREG, SURVEYREG, SURVEYPHREG, HPLOGISTIC, HPMIXED, COUNTREG, QLIM, and possibly others.

Operating System and Release Information

Product Family	Product	System	SAS Release Reported	SAS Release Fixed*
SAS System	SAS/STAT	z/OS
		OpenVMS VAX
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		OS/2
		Microsoft Windows 95/98
		Microsoft Windows 2000 Advanced Server
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows XP Professional
		Windows Millennium Edition (Me)
		Windows Vista
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX
SAS System	SAS/ETS	Microsoft Windows 2000 Advanced Server
		Microsoft Windows 95/98
		Microsoft Windows 8.1 Pro 32-bit
		Microsoft Windows 8.1 Pro
		Microsoft Windows 8.1 Enterprise x64
		Microsoft Windows 8 Pro x64
		Microsoft Windows 8.1 Enterprise 32-bit
		Microsoft Windows 8 Pro 32-bit
		Microsoft Windows 8 Enterprise 32-bit
		Microsoft Windows 8 Enterprise x64
		OS/2
		Microsoft Windows XP 64-bit Edition
		Microsoft® Windows® for x64
		Microsoft Windows Server 2003 Enterprise 64-bit Edition
		Microsoft Windows Server 2003 Datacenter 64-bit Edition
		OpenVMS VAX
		Microsoft® Windows® for 64-Bit Itanium-based Systems
		Z64
		z/OS
		Microsoft Windows 2000 Datacenter Server
		Microsoft Windows 2000 Server
		Microsoft Windows 2000 Professional
		Microsoft Windows NT Workstation
		Microsoft Windows Server 2003 Datacenter Edition
		Microsoft Windows Server 2003 Enterprise Edition
		Microsoft Windows Server 2003 Standard Edition
		Microsoft Windows Server 2003 for x64
		Microsoft Windows Server 2008
		Microsoft Windows Server 2008 R2
		Microsoft Windows Server 2008 for x64
		Microsoft Windows Server 2012 Datacenter
		Microsoft Windows Server 2012 R2 Datacenter
		Microsoft Windows Server 2012 R2 Std
		Microsoft Windows Server 2012 Std
		Microsoft Windows XP Professional
		Windows 7 Enterprise 32 bit
		Windows 7 Enterprise x64
		Windows 7 Home Premium 32 bit
		Windows 7 Home Premium x64
		Windows 7 Professional 32 bit
		Windows 7 Professional x64
		Windows 7 Ultimate 32 bit
		Windows 7 Ultimate x64
		Windows Millennium Edition (Me)
		Windows Vista
		Windows Vista for x64
		64-bit Enabled AIX
		64-bit Enabled HP-UX
		64-bit Enabled Solaris
		ABI+ for Intel Architecture
		AIX
		HP-UX
		HP-UX IPF
		IRIX
		Linux
		Linux for x64
		Linux on Itanium
		OpenVMS Alpha
		OpenVMS on HP Integrity
		Solaris
		Solaris for x64
		Tru64 UNIX

* For software releases that are not yet generally available, the Fixed Release is the software release in which the problem is planned to be fixed.

Type:	Usage Note
Priority:
Topic:	Analytics ==> Categorical Data Analysis
	SAS Reference ==> Procedures ==> ADAPTIVEREG, ANOVA, COUNTREG, DISCRIM, FMM, GAM, GEE, GENMOD, GLM, GLMSELECT, HPCOUNTREG, HPGENSELECT, HPLMIXED, HPLOGISTIC, HPMIXED, HPQLIM, HPQUANTSELECT, HPREG, ICPHREG, LIFEREG, LOESS, LOGISTIC, NLIN, ORTHOREG, PHREG, PLS, PROBIT, QLIM, QUANTREG, QUANTSELECT, REG, ROBUSTREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG

Date Modified:	2016-03-11 14:29:47
Date Created:	2008-06-02 12:17:42
