Variable Importance :: SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures

Variable Importance

Variable importance is calculated from how the variables are used in the finished tree. Three metrics are used: count, SSE, and relative importance. The count-based variable importance is the number of times in the entire tree that a given variable is used in a split. The SSE-based and relative importance metrics are calculated from the training set. If a validation set exists, they are calculated again from the validation set, and those values are reported as “VSSE” and “VIMPORT,” respectively.
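The count-based metric can be illustrated with a minimal Python sketch. The list `split_variables` and the function `count_importance` are hypothetical names introduced here for illustration; they are not part of the procedure, and the variable names in the list are made-up sample data.

```python
from collections import Counter

# Hypothetical input: one entry per split node in the finished tree,
# naming the variable that the node splits on (sample data only).
split_variables = ["age", "income", "age", "region", "age"]


def count_importance(split_vars):
    """Return the number of splits in the tree that use each variable."""
    return dict(Counter(split_vars))


counts = count_importance(split_variables)
print(counts)  # {'age': 3, 'income': 1, 'region': 1}
```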

The SSE-based variable importance is computed from the nodes in which the variable is used in a split. For each such node, the change in SSE that results from the split is found. The change is

$\begin{equation*} \Delta _\eta = \mathrm{SSE}_\eta - \sum _\lambda \mathrm{SSE}_\lambda ^\eta \end{equation*}$

where $\eta$ denotes the node and $\lambda$ indexes the child nodes that the split creates. $\mathrm{SSE}_\eta$ is the SSE if the node is treated as a leaf, and $\sum _\lambda \mathrm{SSE}_\lambda ^\eta$ is the SSE of the node after it has been split. If the change in SSE is negative (which is possible when you use the validation set), then the change is set to 0.
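As a rough sketch of the per-node change, the following Python function takes the SSE of a node treated as a leaf and the SSE values of its children, and applies the floor at 0. The function name `sse_change` and the numeric inputs are assumptions made for this example; how the SSE values themselves are obtained is not shown.

```python
def sse_change(parent_sse, child_sses):
    """Change in SSE for one split node, floored at 0.

    parent_sse -- SSE of the node if it is treated as a leaf
    child_sses -- SSE of each child node created by the split
    """
    delta = parent_sse - sum(child_sses)
    # A negative change (possible on the validation set) is set to 0.
    return max(delta, 0.0)


print(sse_change(10.0, [3.0, 4.0]))  # 3.0
print(sse_change(5.0, [3.0, 4.0]))   # 0.0 (negative change is floored)
```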

The SSE-based importance is then

$\begin{equation*} \mathrm{SSE} \mathrm{IMPORT}_{\text {variable}} = \sqrt {\sum _\eta ^{\text {variable nodes}} \Delta _\eta } \end{equation*}$
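The aggregation over nodes might be sketched in Python as follows. The function `sse_importance` and the `(variable, change)` pair representation are hypothetical choices for this illustration, and the numbers are sample data rather than output from a real tree.

```python
import math
from collections import defaultdict


def sse_importance(node_changes):
    """Square root of the summed per-node SSE changes for each variable.

    node_changes -- iterable of (variable, change) pairs, one per split node
    """
    totals = defaultdict(float)
    for variable, change in node_changes:
        # Negative changes are set to 0 before they are summed.
        totals[variable] += max(change, 0.0)
    return {variable: math.sqrt(total) for variable, total in totals.items()}


# Sample data: "age" is used in two splits, the others in one split each.
changes = [("age", 9.0), ("income", 4.0), ("age", 7.0), ("region", -2.0)]
importance = sse_importance(changes)
print(importance)  # {'age': 4.0, 'income': 2.0, 'region': 0.0}
```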

The relative importance metric is based on the SSE-based importance of each variable. The maximum SSE-based importance over all variables is found, and each variable is then assigned a relative importance, which is

$\begin{equation*} \mathrm{IMPORT}_{\text {variable}} = \frac{\mathrm{SSE} \mathrm{IMPORT}_{\text {variable}}}{\mathrm{SSE} \mathrm{IMPORT}_{\text {max}}} \end{equation*}$
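The normalization can be sketched in Python as shown below. The function name `relative_importance` and the input values are assumptions for this example. Returning 0 for every variable when the maximum is 0 is also an assumption made here to avoid division by zero; the source does not describe that case.

```python
def relative_importance(sse_import):
    """Divide each SSE-based importance by the largest one.

    sse_import -- dict that maps variable name to SSE-based importance
    """
    top = max(sse_import.values(), default=0.0)
    if top == 0.0:
        # Assumption: no variable reduced the SSE, so report 0 throughout.
        return {variable: 0.0 for variable in sse_import}
    return {variable: value / top for variable, value in sse_import.items()}


# Sample SSE-based importances (illustrative values only).
relative = relative_importance({"age": 4.0, "income": 2.0, "region": 0.0})
print(relative)  # {'age': 1.0, 'income': 0.5, 'region': 0.0}
```

Under this definition, the variable that has the largest SSE-based importance always receives a relative importance of 1.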

The HPSPLIT Procedure
