In an analysis performed by a modeling procedure, observations can be excluded for various reasons. When an observation is excluded, it might or might not be possible to compute its predicted value (where prediction is applicable). The numbers of observations read and used, and therefore the number excluded, are typically reported in a table displayed by the analysis procedure.
The following discusses reasons for observations to be excluded and for predictions to be missing. It also shows how to get an analysis of missing values by looking simultaneously at all variables involved in the model to help understand why and which observations are excluded and possibly have missing predictions. It is important to understand that investigation of individual variables in the model is not sufficient since different observations can be excluded for different reasons.
Also shown below is how you can produce data sets of the observations that were used in, or excluded from, fitting the model, as well as data sets of the observations for which predictions could and could not be produced.
SAS® modeling procedures provide several ways to score observations using a fitted model, as described in SAS Note 33307. When scoring new data or the data used to train the model, an observation is excluded (not used in fitting the model) and its predicted value is missing (see Note 1) if any of the following conditions occurs.
Example
Consider the neuralgia data in the example titled "Logistic Modeling with Categorical Predictors" in the PROC LOGISTIC documentation and the following model. The Treatment predictor has three possible levels that appear in the data, A, B, and P.
proc logistic data=Neuralgia;
   class Treatment;
   model Pain (event="No") = Treatment Age;
   store Neurmod;
run;
Suppose that you want to use the fitted model to score the following data, which can be done using the SCORE statement in PROC PLM:
data New;
   input Treatment $ Age;
   datalines;
A 65
B 72
P 80
A .
Z 68
;

proc plm restore=Neurmod;
   score data=New out=New_scored;
run;
Since none of the above causes apply to the first three observations in New, a predicted value can be computed for each of them. However, because the fourth observation has a missing predictor value, no predicted value is computed. While it seems as though prediction is possible for the last observation, note that the Treatment value, Z, is not one of the levels for Treatment in the Neuralgia data set that trained the model. As a result, there is no parameter estimate for this nonexistent treatment and therefore no predicted value is possible.
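To see which scored observations received no prediction, the scored data set can be filtered on the prediction variable. The following is a minimal sketch that assumes the SCORE statement used its default output variable name, Predicted; if a different name was requested in the SCORE statement, substitute it.

```sas
/* List scored observations with no predicted value.          */
/* Assumes the default SCORE output variable name, Predicted. */
proc print data=New_scored;
   where missing(Predicted);
run;
```

For the New data set above, this should list the observation with the missing Age and the observation with the unseen Treatment level Z.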
Observations will also be excluded from the analysis under any of the following conditions. However, if none of the above conditions applies to an observation, then the predicted value can be computed.
Produce a table summarizing all missing values
An analysis of missing values can provide a table showing the numbers of observations with various patterns of missing values across the variables involved in the model. This can be used to show groups of observations excluded from the model due to missing values. But note that the table does not indicate observations that might be excluded due to invalid values.
First, use the desired procedure to fit the model and add the necessary option or statement to output a data set that adds a variable containing predicted values to your input data set. Assume this variable is called PRED. Then run PROC MI using your variables as indicated.
proc mi data=<your-data-set> nimpute=0 displaypattern=nomeans;
   var PRED <response variable(s)> <all other variables involved in the model>;
   class <all character variables in VAR statement>;
   fcs logistic;
   ods select MissPattern;
run;
In the table that PROC MI produces, X means that a value is present; O or . means that it is missing. Only the observations in Group 1, which have no missing values in any variable, can potentially be used in the modeling procedure, though some might still be excluded due to nonmissing but invalid values as indicated in the second list above.
Example of missing value analysis
The following statements generate 100 observations with missing values randomly appearing across all of the variables.
data nomiss;
   call streaminit(453);
   drop i;
   do i=1 to 100;
      y=rand('table',.25,.25,.25,.25);
      x1=rand('uniform');
      x2=rand('binomial',.5,1);
      off=rand('uniform')*5;
      w=rand('uniform')*10;
      f=floor(rand('uniform')*10);
      output;
   end;
run;

data miss;
   set nomiss;
   array v (*) y x1 x2 off w f;
   do i=1 to dim(v);
      if ranuni(342)<.3 then v(i)=.;
   end;
   drop i;
run;
PROC GENMOD is used to fit a model to the data and to output a data set that includes the predicted values. PROC MI is then used as described above to obtain an analysis of missing values.
proc genmod data=miss;
   class x2;
   model y=x1 x2 / dist=poisson offset=off;
   weight w;
   freq f;
   output out=out pred=pred;
run;

proc mi data=out nimpute=0 displaypattern=nomeans;
   var pred y x1 x2 off w f;
   ods select MissPattern;
run;
The first table, produced by GENMOD, shows that only 11 of the 100 observations were used in fitting the model. The next table, from MI, shows that 12 observations (Group 1) had nonmissing values on all variables involved in the model and had predicted values produced. Observations in the remaining groups were all excluded from fitting the model because of missing values, but predictions could still be obtained for those in Groups 2-8.
[Output: The MI Procedure, Missing Data Patterns table]
As noted above, this analysis by MI covers only missing values. It does not address the additional causes for exclusion when fitting the model. The following statements create and display a data set containing all of the observations in Group 1, which have no missing values.
data group1;
   set out;
   if cmiss(of y x1 x2 off w f)=0;
run;

proc print noobs;
   title "Group 1";
run;
Note that there is one observation in which the FREQ statement variable, F, is zero. As noted in the list of additional causes for exclusion above, zero or negative values of a FREQ statement variable are invalid and exclude the observation. This is why GENMOD reported 11 observations used and not 12.
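Rather than scanning the printed listing by eye, the invalid values can be isolated directly. This is a sketch under the assumption, taken from the discussion above, that the only additional causes for exclusion relevant to this model are a FREQ value less than 1 or a nonpositive WEIGHT value.

```sas
/* Show only the Group 1 observations that are still excluded */
/* because of an invalid FREQ (f) or WEIGHT (w) value.        */
proc print data=group1 noobs;
   where f < 1 or w <= 0;
   title "Group 1 observations with invalid FREQ or WEIGHT values";
run;
title;   /* clear the title for later output */
```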
By also checking for the additional causes for exclusion, it is possible to create data sets of used and unused observations. For this model, the only applicable additional causes for exclusion are invalid WEIGHT or FREQ values. These statements produce the Used data set with 11 observations and the NotUsed data set with 89 observations, consistent with the counts shown in the table from GENMOD above.
data Used;
   set out;
   /* all model variables nonmissing; valid weights, freqs */
   if cmiss(of y x1 x2 off w f)=0 and w>0 and f>=1;
run;

data NotUsed;
   set out;
   /* some model variables missing or invalid weight, freq */
   if cmiss(of pred y x1 x2 off w f) or w<=0 or f<1;
run;
Data sets of observations that could and could not be predicted can be produced by simply using the variable of predicted values saved from the modeling procedure. Data set Pred contains 34 observations, the sum of the sizes of Groups 1-8, and data set NotPred contains 66 observations.
data Pred;
   set out;
   where pred ne .;
run;

data NotPred;
   set out;
   where pred=.;
run;
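As a cross-check, the used and predicted status of every observation can be tabulated in one step instead of creating four separate data sets. The following is a sketch only: the data set name Flags and the indicator variables is_used and has_pred are hypothetical names introduced here, and the conditions simply repeat those used in the steps above.

```sas
/* Flag each observation; Flags, is_used, and has_pred are */
/* illustrative names that are not part of the original.   */
data Flags;
   set out;
   is_used  = (cmiss(of y x1 x2 off w f)=0 and w>0 and f>=1);
   has_pred = not missing(pred);
run;

/* Cross-tabulate used status against predicted status */
proc freq data=Flags;
   tables is_used*has_pred / norow nocol nopercent;
run;
```

If the counts reported above hold, the row for is_used=1 should total 11 and the column for has_pred=1 should total 34.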
If needed, a data set of any one of the groups from the PROC MI analysis can be produced for examination using a DATA step similar to the one above. For example, the Group 9 observations can be obtained using this step.
data group9;
   set out;
   if cmiss(of y x1 x2 w f)=0   /* none of these is missing and */
      and cmiss(of off)         /* any of these is missing */
   ;
run;
__________
NOTE 1: Some previous problems caused predicted values to incorrectly be set to missing.
NOTE 2: GLM, GENMOD, PROBIT, PHREG, LIFEREG, QUANTREG, QUANTSELECT, ROBUSTREG, SURVEYREG, SURVEYPHREG, HPLOGISTIC, HPMIXED, COUNTREG, QLIM, and possibly others.
Product Family | Product | System | SAS Release | |
Reported | Fixed* | |||
SAS System | SAS/STAT | z/OS | ||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
OS/2 | ||||
Microsoft Windows 95/98 | ||||
Microsoft Windows 2000 Advanced Server | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows XP Professional | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX | ||||
SAS System | SAS/ETS | Microsoft Windows 2000 Advanced Server | ||
Microsoft Windows 95/98 | ||||
Microsoft Windows 8.1 Pro 32-bit | ||||
Microsoft Windows 8.1 Pro | ||||
Microsoft Windows 8.1 Enterprise x64 | ||||
Microsoft Windows 8 Pro x64 | ||||
Microsoft Windows 8.1 Enterprise 32-bit | ||||
Microsoft Windows 8 Pro 32-bit | ||||
Microsoft Windows 8 Enterprise 32-bit | ||||
Microsoft Windows 8 Enterprise x64 | ||||
OS/2 | ||||
Microsoft Windows XP 64-bit Edition | ||||
Microsoft® Windows® for x64 | ||||
Microsoft Windows Server 2003 Enterprise 64-bit Edition | ||||
Microsoft Windows Server 2003 Datacenter 64-bit Edition | ||||
OpenVMS VAX | ||||
Microsoft® Windows® for 64-Bit Itanium-based Systems | ||||
Z64 | ||||
z/OS | ||||
Microsoft Windows 2000 Datacenter Server | ||||
Microsoft Windows 2000 Server | ||||
Microsoft Windows 2000 Professional | ||||
Microsoft Windows NT Workstation | ||||
Microsoft Windows Server 2003 Datacenter Edition | ||||
Microsoft Windows Server 2003 Enterprise Edition | ||||
Microsoft Windows Server 2003 Standard Edition | ||||
Microsoft Windows Server 2003 for x64 | ||||
Microsoft Windows Server 2008 | ||||
Microsoft Windows Server 2008 R2 | ||||
Microsoft Windows Server 2008 for x64 | ||||
Microsoft Windows Server 2012 Datacenter | ||||
Microsoft Windows Server 2012 R2 Datacenter | ||||
Microsoft Windows Server 2012 R2 Std | ||||
Microsoft Windows Server 2012 Std | ||||
Microsoft Windows XP Professional | ||||
Windows 7 Enterprise 32 bit | ||||
Windows 7 Enterprise x64 | ||||
Windows 7 Home Premium 32 bit | ||||
Windows 7 Home Premium x64 | ||||
Windows 7 Professional 32 bit | ||||
Windows 7 Professional x64 | ||||
Windows 7 Ultimate 32 bit | ||||
Windows 7 Ultimate x64 | ||||
Windows Millennium Edition (Me) | ||||
Windows Vista | ||||
Windows Vista for x64 | ||||
64-bit Enabled AIX | ||||
64-bit Enabled HP-UX | ||||
64-bit Enabled Solaris | ||||
ABI+ for Intel Architecture | ||||
AIX | ||||
HP-UX | ||||
HP-UX IPF | ||||
IRIX | ||||
Linux | ||||
Linux for x64 | ||||
Linux on Itanium | ||||
OpenVMS Alpha | ||||
OpenVMS on HP Integrity | ||||
Solaris | ||||
Solaris for x64 | ||||
Tru64 UNIX |
Type: | Usage Note |
Priority: | |
Topic: | Analytics ==> analytics |
Date Modified: | 2025-06-03 15:56:21 |
Date Created: | 2008-06-02 12:17:42 |