For decision trees, missing values are not problematic. Surrogate
splitting rules enable you to use the values of other input variables
to perform a split for observations with missing values. In SAS
Enterprise Miner, however, models such as regressions and neural
networks ignore any observation that contains a missing value, which
reduces the size of the training data set. Less training data can
substantially weaken the predictive power of these models. To overcome
this obstacle, you can impute missing values before you fit the models.
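To make the cost of dropping incomplete observations concrete, here is a minimal Python sketch. It is not SAS Enterprise Miner code, and the variable names and rows are hypothetical. It counts how many observations remain when every observation that has at least one missing value is ignored.

```python
# Hypothetical toy data; None marks a missing value.
rows = [
    {"income": 52000, "age": 34, "region": "N"},
    {"income": None, "age": 41, "region": "S"},
    {"income": 61000, "age": None, "region": "S"},
    {"income": 48000, "age": 29, "region": None},
    {"income": 75000, "age": 50, "region": "N"},
]


def complete_cases(records):
    """Keep only the records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]


kept = complete_cases(rows)
print(f"{len(kept)} of {len(rows)} observations remain")
```

In this toy data, each incomplete row is missing only one value, yet the whole row is lost to any model that ignores incomplete observations.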
Tip: It is a particularly good idea to impute missing values before
fitting models that ignore observations with missing values if you
plan to compare those models with a decision tree. Model comparison is
most appropriate between models that are fit with the same set of
observations.
To use the Impute node to impute missing values, complete the
following steps:
- Select the Modify tab on the Toolbar.
- Select the Impute node icon. Drag the node into the Diagram Workspace.
- Connect the Replacement node to the Impute node.
- Select the Impute node. In the Properties Panel, scroll down to view
  the Train properties:
  - For class variables, click the value of Default Input Method and
    select Tree Surrogate from the drop-down menu that appears.
  - For interval variables, click the value of Default Input Method and
    select Median from the drop-down menu that appears.
  The default input method specifies which statistic is used by default
  to impute missing values. In this example, missing values of interval
  variables are replaced by the median of the nonmissing values. This
  statistic is less sensitive to extreme values than the mean or
  midrange and is therefore useful for imputing missing values from
  skewed distributions. Missing values of class variables, in this
  example, are imputed using predicted values from a decision tree. For
  each class variable, SAS Enterprise Miner builds a decision tree (in
  this case, potentially using surrogate splitting rules) with that
  variable as the target and the other input variables as predictors.
- In the Diagram Workspace, right-click the Impute node, and select Run
  from the resulting menu. Click Yes in the confirmation window that
  opens.
- In the window that appears when processing completes, click OK.
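The median imputation selected for interval variables in the steps above can be illustrated outside SAS Enterprise Miner. The following Python sketch uses a hypothetical, right-skewed list of values. It only demonstrates why the median is less sensitive to an extreme value than the mean or midrange. It does not reproduce the Impute node itself.

```python
from statistics import mean, median

# Hypothetical right-skewed interval variable; None marks a missing value.
incomes = [30, 32, 35, None, 38, 40, None, 500]

# Compute candidate fill statistics from the nonmissing values only.
observed = [x for x in incomes if x is not None]
fill_median = median(observed)
fill_mean = mean(observed)
fill_midrange = (min(observed) + max(observed)) / 2

# Replace each missing value with the median of the nonmissing values.
imputed = [fill_median if x is None else x for x in incomes]

print("median:", fill_median)
print("mean:", fill_mean)
print("midrange:", fill_midrange)
print("imputed:", imputed)
```

Here the single extreme value (500) pulls the mean and the midrange far above the bulk of the data, whereas the median stays close to the typical values.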
Note: In the data that is exported from the Impute node, a new variable
is created for each variable for which missing values are imputed. The
original variable is not overwritten. Instead, the new variable has the
same name as the original variable but is prefixed with IMP_. The
original version of each variable also exists in the exported data and
has the role Rejected. In this example, SES and URBANICITY have values
replaced and then imputed. Therefore, in addition to the original
version, each of these variables has a version in the exported data
that is prefixed with IMP_REP_.
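The naming convention described in this note can be mimicked with a short Python sketch. The function `impute_with_prefix`, the fill values, the AGE variable, and the Input role assigned to the new variables are all assumptions made for illustration. Only the IMP_ prefix, the REP_ prefix on SES, and the Rejected role of the original variables come from the text above.

```python
def impute_with_prefix(table, roles, fill_values, prefix="IMP_"):
    """Add an imputed copy of each listed column.

    The original column is kept unchanged and its role is set to Rejected.
    Giving the new column the role Input is an assumption of this sketch.
    """
    out = {name: list(values) for name, values in table.items()}
    new_roles = dict(roles)
    for name, fill in fill_values.items():
        out[prefix + name] = [fill if v is None else v for v in table[name]]
        new_roles[prefix + name] = "Input"
        new_roles[name] = "Rejected"
    return out, new_roles


# Hypothetical data: REP_SES stands for SES after the Replacement node.
table = {"REP_SES": [1, None, 3], "AGE": [40, 35, None]}
roles = {"REP_SES": "Input", "AGE": "Input"}

exported, exported_roles = impute_with_prefix(
    table, roles, {"REP_SES": 2, "AGE": 37.5}
)
print(sorted(exported))
print(exported_roles)
```

The exported table contains both REP_SES and IMP_REP_SES, which matches the pattern in which a replaced and then imputed variable carries both prefixes.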