For decision trees, missing values are not problematic. Surrogate splitting
rules enable you to use the values of other input variables to perform a split
for observations with missing values. In SAS Enterprise Miner, however, models
such as regressions and neural networks ignore any observation that contains
a missing value, which reduces the size of the training data set. Less training
data can substantially weaken the predictive power of these models. To overcome
this problem, you can impute missing values before you fit the models.
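To see how quickly complete-case filtering can shrink a training set, consider the following minimal Python sketch. It is not SAS Enterprise Miner code. The variable names and values are made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical toy records (not from the example data); None marks a missing value.
rows = [
    {"income": 52000, "age": 34, "ses": "2"},
    {"income": None, "age": 41, "ses": "1"},
    {"income": 48000, "age": None, "ses": "3"},
    {"income": 61000, "age": 29, "ses": None},
    {"income": 57000, "age": 45, "ses": "2"},
]


def complete_cases(records):
    """Keep only records in which no field is missing."""
    return [r for r in records if all(v is not None for v in r.values())]


usable = complete_cases(rows)
print(f"{len(usable)} of {len(rows)} observations remain")
```

Each of the three dropped records is missing only one value, yet the whole observation is lost to a model that requires complete cases.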
Tip
It is a particularly good idea to impute missing values before fitting
models that ignore observations with missing values if you plan to compare
those models with a decision tree. Model comparison is most appropriate
between models that are fit with the same set of observations.
To use the Impute node to impute missing
values:
-
Select the Modify tab on the Toolbar.
-
Select the Impute node icon. Drag the node into the Diagram Workspace.
-
Connect the Control Point node to the Impute node.
-
Select the Impute node. In the Properties Panel, scroll down to view the
Train properties:
-
For class variables, click the value of Default Input Method and select
Tree Surrogate from the drop-down menu that appears.
-
For interval variables, click the value of Default Input Method and select
Median from the drop-down menu that appears.
The default input method specifies which default statistic to use to impute
missing values. In this example, missing values of interval variables are
replaced by the median of the nonmissing values. This statistic is less
sensitive to extreme values than the mean or midrange and is therefore useful
for imputing missing values in skewed distributions. Missing values of class
variables, in this example, are imputed using predicted values from a decision
tree. For each class variable, SAS Enterprise Miner builds a decision tree
(in this case, potentially using surrogate splitting rules) with that variable
as the target and the other input variables as predictors.
-
In the Diagram Workspace, right-click the Impute node, and select Run from
the resulting menu. Click Yes in the Confirmation window that opens.
-
In the window that appears when processing completes, click OK.
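The two imputation methods selected in the steps above can be illustrated outside SAS Enterprise Miner with a short Python sketch. The data values are hypothetical. The class-variable part is a deliberately simplified stand-in for tree-based imputation: it makes a single split on one predictor and fills in the most common class within each group. It does not reproduce the trees or surrogate rules that the Impute node builds.

```python
from collections import Counter, defaultdict
from statistics import mean, median

# Hypothetical skewed interval input; None marks a missing value.
income = [31000, 35000, 38000, None, 42000, 900000]

observed = [v for v in income if v is not None]
fill = median(observed)        # 38000: barely affected by the 900000 outlier
mean_fill = mean(observed)     # 209200: pulled far upward by the outlier
income_imputed = [fill if v is None else v for v in income]

# Hypothetical (predictor, class) pairs; the class value may be missing.
records = [
    ("urban", "1"), ("urban", "1"), ("urban", "2"),
    ("rural", "3"), ("rural", "3"), ("rural", None), ("urban", None),
]


def stump_impute(pairs):
    """One-split stand-in for a tree: fill a missing class with the modal
    class among rows that share the same predictor value.

    Assumes at least one class value is observed overall."""
    by_group = defaultdict(Counter)
    overall = Counter()
    for predictor, label in pairs:
        if label is not None:
            by_group[predictor][label] += 1
            overall[label] += 1
    fallback = overall.most_common(1)[0][0]  # used when a group has no labels
    filled = []
    for predictor, label in pairs:
        if label is None:
            counts = by_group.get(predictor)
            label = counts.most_common(1)[0][0] if counts else fallback
        filled.append((predictor, label))
    return filled


records_imputed = stump_impute(records)
```

In this toy data, the median fill stays near the bulk of the observed values while the mean does not, which is the reason for choosing Median for skewed interval inputs.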
Note: In the data that is exported from the Impute node, a new variable is
created for each variable for which missing values are imputed. The original
variable is not overwritten. Instead, the new variable has the same name as
the original variable, prefixed with IMP_. The original version of each
variable also exists in the exported data and has the role Rejected.
In this example, SES and URBANICITY have values that are replaced and then
imputed. Therefore, in addition to the original version, each of these
variables has a version in the exported data that is prefixed with IMP_REP_.
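The naming convention described in the note can be mimicked in a few lines of Python. This is only a sketch of the idea, not how the Impute node is implemented. The variable INCOME, its values, the `roles` dictionary, and the assumption that the new variable inherits the original's role are all hypothetical.

```python
from statistics import median


def add_imputed_column(table, name, roles):
    """Add IMP_<name> with median-filled values, keep the original column,
    and mark the original as Rejected."""
    observed = [v for v in table[name] if v is not None]
    fill = median(observed)
    new_name = "IMP_" + name
    table[new_name] = [fill if v is None else v for v in table[name]]
    roles[new_name] = roles.get(name, "Input")  # assumption: role is inherited
    roles[name] = "Rejected"
    return new_name


table = {"INCOME": [40, None, 60, 50]}  # hypothetical variable and values
roles = {"INCOME": "Input"}
new_name = add_imputed_column(table, "INCOME", roles)
```

After the call, `table` holds both INCOME (unchanged, still containing the missing value) and IMP_INCOME, so downstream models can use the imputed version while the original remains available for inspection.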