The HPSPLIT Procedure

Outputs

Performance Information

The Performance Information table is created by default. It displays information about the execution mode. For single-machine mode, the table displays the number of threads used. For distributed mode, the table displays the grid mode (symmetric or asymmetric), the number of compute nodes, and the number of threads per node.
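The execution mode that this table reports can be influenced with the PERFORMANCE statement. The following is a minimal sketch; the data set Work.Hmeq and the model variables are assumptions, not part of this documentation:

```sas
/* Request four threads in single-machine mode. The Performance
   Information table then reports the execution mode and the
   number of threads used. Data set and variables are assumed. */
proc hpsplit data=work.hmeq;
   class bad job reason;
   model bad = job reason loan value;
   performance nthreads=4;
run;
```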

Output Data Sets

SCORE Data Set

Table 9.3 shows the variables that are contained in an example data set that the SCORE statement produces. In this data set, the variable BAD is the target and has values 0 and 1.

Table 9.3: Example SCORE Statement Data Set Variables

Variable   Description
--------   -------------------------------------------------------------
BAD        Target variable
_LEAF_     Leaf number to which this observation is assigned
_NODE_     Node number to which this observation is assigned
P_BAD0     Proportion of the training set at this leaf that has BAD = 0
P_BAD1     Proportion of the training set at this leaf that has BAD = 1
V_BAD0     Proportion of the validation set at this leaf that has BAD = 0
V_BAD1     Proportion of the validation set at this leaf that has BAD = 1
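A data set such as the one described in Table 9.3 can be produced with the SCORE statement. The sketch below assumes a data set Work.Hmeq with target BAD; the input variables are illustrative:

```sas
/* Score the input data and save per-observation leaf assignments
   and posterior proportions (the Table 9.3 variables) to
   work.scored. Data set and input variables are assumed. */
proc hpsplit data=work.hmeq;
   class bad job reason;
   model bad = job reason loan value;
   score out=work.scored;
run;
```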


IMPORTANCE= Data Set

The variable importance data set contains the importance of the input variables in creating the pruned decision tree. The output includes a simple count-based importance metric and two importance metrics that are based on the sum of squares error. In addition, the data set records the number of observations used in the training and validation sets, the number of observations that have a missing value, and the number of observations that have a missing target. Table 9.4 shows the variables contained in the data set that the OUTPUT statement produces when you specify the IMPORTANCE= option. In addition to the variables listed below, the data set contains one variable for each input variable, which holds that input variable's importance.

Table 9.4: Variable Importance Data Set Variables

Variable    Description
---------   ------------------------------------------------------------
TREENUM     Tree number (always 1)
CRITERION   Criterion used to generate the tree
ITYPE       Importance type (Count, SSE, VSSE, IMPORT, or VIMPORT)
OBSMISS     Number of observations that have a missing value
OBSTMISS    Number of observations that have a missing target
OBSUSED     Number of observations used to build the tree (training set)
OBSVALID    Number of observations in the validation set
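The following sketch shows how the importance data set might be requested through the OUTPUT statement; the data set names and model variables are assumptions:

```sas
/* Write the variable importance metrics (Table 9.4 variables,
   plus one importance column per input variable) to work.imp.
   Data set and variables are assumed. */
proc hpsplit data=work.hmeq;
   class bad job reason;
   model bad = job reason loan value;
   output importance=work.imp;
run;
```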


NODESTATS= Data Set

The data set specified in the NODESTATS= option in the OUTPUT statement can be used to visualize the tree. Table 9.5 shows the variables in this data set.

Table 9.5: NODESTATS= Data Set Variables

Variable         Description
--------------   ------------------------------------------------------------
ALLTEXT          Text that describes the split
CRITERION        Which of the three splitting criteria was used
DECISION         Values of the parent variable's split that lead to this node
DEPTH            Depth of the node
ID               Node number
LINKWIDTH        Fraction of all training observations going to this node
N                Number of training observations at this node
NVALID           Number of validation observations at this node
PARENT           Parent's node number
PREDICTEDVALUE   Value of the target predicted at this node
P_BAD0           Proportion of training observations that have BAD = 0
P_BAD1           Proportion of training observations that have BAD = 1
SPLITVAR         Variable used in the split
TREENUM          Tree number (always 1)
V_BAD0           Proportion of validation observations that have BAD = 0
V_BAD1           Proportion of validation observations that have BAD = 1
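The node statistics can be requested and inspected as sketched below; the data set names and model variables are assumptions:

```sas
/* Write per-node statistics (Table 9.5 variables) to work.ns,
   which can then be used to visualize the tree.
   Data set and variables are assumed. */
proc hpsplit data=work.hmeq;
   class bad job reason;
   model bad = job reason loan value;
   output nodestats=work.ns;
run;

/* Inspect the node-level statistics */
proc print data=work.ns;
run;
```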


GROWTHSUBTREE= and PRUNESUBTREE= Data Sets

During tree growth and pruning, the number of leaves at each growth or pruning iteration is output, along with other optional metrics.

The GROWTHSUBTREE= and PRUNESUBTREE= data sets are identical, except that:

  • The growth data set reflects statistics of the tree during growth. The pruning data set reflects statistics of the tree during pruning.

  • The statistics of the growth data set are always from the training subset. The statistics of the pruning data set are from the validation subset if one is available. Otherwise, the statistics of the pruning data set are from the training subset.

Table 9.6: GROWTHSUBTREE= and PRUNESUBTREE= Data Set Variables

Variable     Description
----------   ----------------------------------------
ITERATION    Iteration number
NLEAVES      Number of leaves
TREENUM      Tree number (always 1)
_ASE_        Training set: average square error
_ENTROPY_    Training set: entropy
_GINI_       Training set: Gini
_MISC_       Training set: misclassification rate
_SSE_        Training set: sum of squares error
_VASE_       Validation set: average square error
_VENTROPY_   Validation set: entropy
_VGINI_      Validation set: Gini
_VMISC_      Validation set: misclassification rate
_VSSE_       Validation set: sum of squares error
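Because the pruning statistics come from the validation subset when one is available, a validation partition must be defined for the _V* variables to be populated from held-out data. A minimal sketch, with assumed data set names, model variables, and validation fraction:

```sas
/* Hold out 30% of observations for validation, then write the
   per-iteration growth and pruning statistics (Table 9.6
   variables) to work.grow and work.prune.
   Data set, variables, and fraction are assumed. */
proc hpsplit data=work.hmeq;
   class bad job reason;
   model bad = job reason loan value;
   partition fraction(validate=0.3);
   output growthsubtree=work.grow prunesubtree=work.prune;
run;
```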