For decision trees,
missing values are not problematic. Surrogate splitting rules enable
you to use the values of other input variables to perform a split
for observations with missing values. In SAS Enterprise Miner, however, models
such as regressions and neural networks ignore altogether observations
that contain missing values, which reduces the size of the training
data set. Less training data can substantially weaken the predictive
power of these models. To overcome this obstacle of missing data,
you can impute missing values before you fit the models.
Tip
It is a particularly good
idea to impute missing values before fitting a model that ignores
observations with missing values if you plan to compare those models
with a decision tree. Model comparison is most appropriate between
models that are fit with the same set of observations.
To use the Impute node to impute missing
values:
-
Select the
Modify tab
on the Toolbar.
-
Select the
Impute node
icon. Drag the node into the Diagram Workspace.
-
Connect the
Control
Point node to the
Impute node.
-
Select the
Impute node.
In the Properties Panel, scroll down to view the
Train properties:
-
For class variables, click on the
value of
Default Input Method and select
Tree
Surrogate from the drop-down menu that appears.
-
For interval variables, click on
the value of
Default Input Method and select
Median from
the drop-down menu that appears.
The default input method
specifies which default statistic to use to impute missing values.
In this example, the values of missing interval variables are replaced
by the median of the nonmissing values. This statistic is less sensitive
to extreme values than the mean or midrange and is therefore useful
for imputation of missing values from skewed distributions. The values
of missing class variables, in this example, are imputed using predicted
values from a decision tree. For each class variable, SAS Enterprise
Miner builds a decision tree (in this case, potentially using surrogate splitting
rules) with that variable as the target and the other input variables
as predictors.
-
In the Diagram Workspace,
right-click the
Impute node, and select
Run from
the resulting menu. Click
Yes in the
Confirmation window
that opens.
-
In the window that appears
when processing completes, click
OK.
Note: In the data that is exported
from the Impute node, a new variable is created for each variable
for which missing values are imputed. The original variable is not
overwritten. Instead, the new variable has the same name as the original
variable but is prefaced with IMP_. The original version of each variable
also exists in the exported data and has the role
Rejected
.
In this example, SES and URBANICITY have values that are replaced
and then imputed. Therefore, in addition to the original version,
each of these variables has a version in the exported data that is
prefaced by IMP_REP_.