The HPSPLIT Procedure

Outputs

Performance Information

The Performance Information table is created by default. It displays information about the execution mode. For single-machine mode, the table displays the number of threads used. For distributed mode, the table displays the grid mode (symmetric or asymmetric), the number of compute nodes, and the number of threads per node.
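
For example, in single-machine mode the number of threads can typically be controlled with the PERFORMANCE statement. The following is a minimal sketch only; the sashelp.HMEQ data set, the model variables, and the thread count are placeholder choices, not part of this section:

proc hpsplit data=sashelp.hmeq;
   /* Request four threads and timing details in single-machine mode */
   performance nthreads=4 details;
   class Bad Reason Job;
   model Bad = Reason Job Loan Value Delinq;
run;

The Performance Information table then reports single-machine execution with the requested number of threads.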

Output Data Sets

SCORE Data Set

Table 13.3 shows the variables that are contained in an example data set that the SCORE statement produces.

Table 13.3: Example SCORE Statement Data Set Variables

Variable    Target Type    Description
VAR         Either         Target variable
_LEAF_      Either         Leaf number to which this observation is assigned
_NODE_      Either         Node number to which this observation is assigned
P_VARLEV    Nominal        Proportion of training set at this leaf that has target VAR = LEV
V_VARLEV    Nominal        Proportion of validation set at this leaf that has target VAR = LEV
P_VAR       Interval       Average value of target VAR in the training set
V_VAR       Interval       Average value of target VAR in the validation set
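
The following sketch illustrates the SCORE statement; the sashelp.HMEQ data set, the model, and the output data set name Scored are placeholders. For a nominal target such as BAD with levels 0 and 1, the scored data set contains _NODE_, _LEAF_, and the proportion variables P_BAD0 and P_BAD1 described in Table 13.3:

proc hpsplit data=sashelp.hmeq;
   class Bad Reason Job;
   model Bad = Reason Job Loan Value Delinq;
   /* Write one scored observation per input observation */
   score out=scored;
run;

proc print data=scored(obs=5);
run;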


IMPORTANCE= Data Set

The variable importance data set contains the importance of the input variables in creating the pruned decision tree. PROC HPSPLIT outputs two simple count-based importance metrics and two variable importance metrics that are based on the sum of squares error. In addition, it outputs the number of observations that are used in the training and validation sets, the number of observations that have a missing value, and the number of observations that have a missing target. Table 13.4 shows the variables contained in the data set when you specify the IMPORTANCE= option in the OUTPUT statement. In addition to the variables listed in Table 13.4, the data set contains one variable for each input variable, and that variable holds the corresponding importance value.

Table 13.4: Variable Importance Data Set Variables

Variable       Description
_CRITERION_    Criterion used to generate the tree
_ITYPE_        Importance type (Count, NSURROGATES, SSE, VSSE, IMPORT, or VIMPORT)
_OBSMISS_      Number of observations that have a missing value
_OBSTMISS_     Number of observations that have a missing target
_OBSUSED_      Number of observations that were used to build the tree (training set)
_OBSVALID_     Number of observations in the validation set
_TREENUM_      Tree number (always 1)
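
A sketch that requests this data set through the IMPORTANCE= option in the OUTPUT statement follows; the data set, the model, and the output data set name Imp are placeholders:

proc hpsplit data=sashelp.hmeq;
   class Bad Reason Job;
   model Bad = Reason Job Loan Value Delinq;
   /* Each row reports one importance type (_ITYPE_); each input
      variable gets a column that holds its importance value */
   output importance=imp;
run;

proc print data=imp;
run;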


NODESTATS= Data Set

The data set specified in the NODESTATS= option in the OUTPUT statement can be used to visualize the tree. Table 13.5 shows the variables in this data set.

Table 13.5: NODESTATS= Data Set Variables

Variable          Target Type    Description
ALLTEXT           Either         Text that describes the split
CRITERION         Either         Which of the three criteria was used
DECISION          Either         Values of the parent variable’s split to get to this node
DEPTH             Either         Depth of the node
ID                Either         Node number
LEAF              Either         Leaf number
LINKWIDTH         Either         Fraction of all training observations going to this node
N                 Either         Number of training observations at this node
NVALID            Either         Number of validation observations at this node
PARENT            Either         Parent’s node number
PREDICTEDVALUE    Either         Value of target predicted at this node
SPLITVAR          Either         Variable used in the split
TREENUM           Either         Tree number (always 1)
P_VARLEV          Nominal        Proportion of training set at this leaf that has target VAR = LEV
V_VARLEV          Nominal        Proportion of validation set at this leaf that has target VAR = LEV
P_VAR             Interval       Average value of target VAR in the training set
V_VAR             Interval       Average value of target VAR in the validation set
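
For example, the following sketch (the data set, model, and output data set name Nstats are placeholders) writes the node statistics and prints the upper levels of the tree, using the ID, PARENT, and DEPTH variables from Table 13.5 to trace the parent/child links:

proc hpsplit data=sashelp.hmeq;
   class Bad Reason Job;
   model Bad = Reason Job Loan Value Delinq;
   output nodestats=nstats;
run;

/* Nodes near the root: one row per node, linked by ID and PARENT */
proc print data=nstats;
   where Depth <= 2;
   var Id Parent Depth SplitVar Decision N PredictedValue;
run;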


GROWTHSUBTREE= and PRUNESUBTREE= Data Sets

During tree growth and pruning, the number of leaves at each growth or pruning iteration and other metrics are output to data sets that are specified in the GROWTHSUBTREE= and PRUNESUBTREE= options, respectively.

The growth and pruning data sets are identical, except that:

  • The growth data set reflects statistics of the tree during growth. The pruning data set reflects statistics of the tree during pruning.

  • The statistics of the growth data set are always computed from the training subset. The statistics of the pruning data set are computed from the validation subset if one is available. Otherwise, the statistics of the pruning data set are computed from the training subset.

Table 13.6: GROWTHSUBTREE= and PRUNESUBTREE= Data Set Variables

Variable      Target Type    Description
ITERATION     Either         Iteration number
NLEAVES       Either         Number of leaves
TREENUM       Either         Tree number (always 1)
_ASE_         Either         Training set: average square error
_ENTROPY_     Nominal        Training set: entropy
_GINI_        Nominal        Training set: Gini
_MISC_        Nominal        Training set: misclassification rate
_SSE_         Either         Training set: sum of squares error
_VASE_        Either         Validation set: average square error
_VENTROPY_    Nominal        Validation set: entropy
_VGINI_       Nominal        Validation set: Gini
_VMISC_       Nominal        Validation set: misclassification rate
_VSSE_        Either         Validation set: sum of squares error
_ASSESS_      Either         Subtree assessment value (ratio of slopes or change in errors)
FINALTREE     Either         Chosen subtree if 1
CHOSENID      Either         Intermediate number of node selected for pruning
NODECHANGE    Either         Change in node selection metric by pruning node
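
As a sketch (the data set, model, validation fraction, and output data set names Grow and Prune are placeholders), the pruning data set can be plotted to see how the error changes as leaves are removed. The PARTITION statement reserves a validation subset so that the pruning statistics are computed from validation data:

proc hpsplit data=sashelp.hmeq;
   class Bad Reason Job;
   model Bad = Reason Job Loan Value Delinq;
   partition fraction(validate=0.3);
   output growthsubtree=grow prunesubtree=prune;
run;

/* Validation misclassification rate versus subtree size during pruning */
proc sgplot data=prune;
   series x=NLeaves y=_VMISC_ / markers;
   xaxis label="Number of Leaves";
   yaxis label="Validation Misclassification Rate";
run;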