The HPSPLIT Procedure

Outputs

Subsections:

Performance Information
Output Data Sets

Performance Information

The "Performance Information" table is created by default. It displays information about the execution mode. For single-machine mode, the table displays the number of threads used. For distributed mode, the table displays the grid mode (symmetric or asymmetric), the number of compute nodes, and the number of threads per node.

Output Data Sets

SCORE Data Set

Table 15.3 shows the variables that are contained in an example data set that the SCORE statement produces.

Table 15.3: Example SCORE Statement Data Set Variables

Variable	Target Type	Description
`VAR`	Either	Target variable
`_LEAF_`	Either	Leaf number to which this observation is assigned
`_NODE_`	Either	Node number to which this observation is assigned
`P_VARLEV`	Nominal	Proportion of training set at this leaf that has target `VAR` = LEV
`V_VARLEV`	Nominal	Proportion of validation set at this leaf that has target `VAR` = LEV
`P_VAR`	Interval	Average value of target `VAR` in the training set
`V_VAR`	Interval	Average value of target `VAR` in the validation set

IMPORTANCE= Data Set

The variable importance data set contains the importance of the input variables in creating the pruned decision tree. PROC HPSPLIT outputs two simple count-based importance metrics and two variable importance metrics that are based on the sum of squares error. In addition, it outputs the number of observations that are used in the training and validation sets, the number of observations that have a missing value, and the number of observations that have a missing target. Table 15.4 shows the variables contained in the data set when you specify the IMPORTANCE= option in the OUTPUT statement. In addition to the variables listed below, a variable that contains the importance for each input variable is included.

Table 15.4: Variable Importance Data Set Variables

Variable	Description
`_CRITERION_`	Criterion used to generate the tree
`ITYPE`	Importance type (Count, NSURROGATES, SSE,
	VSSE, IMPORT, or VIMPORT)
`_OBSMISS_`	Number of observations that have a missing value
`_OBSTMISS_`	Number of observations that have a missing target
`_OBSUSED_`	Number of observations that were used to build the tree (training set)
`_OBSVALID_`	Number of observations in the validation set
`_TREENUM_`	Tree number (always 1)

NODESTATS= Data Set

The data set specified in the NODESTATS= option in the OUTPUT statement can be used to visualize the tree. Table 15.5 shows the variables in this data set.

Table 15.5: NODESTATS= Data Set Variables

Variable	Target Type	Description
`ALLTEXT`	Either	Text that describes the split
`CRITERION`	Either	Which of the three criteria was used
`DECISION`	Either	Values of the parent variable’s split to get to this node
`DEPTH`	Either	Depth of the node
`ID`	Either	Node number
`LEAF`	Either	Leaf number
`LINKWIDTH`	Either	Fraction of all training observations going to this node
`N`	Either	Number of training observations at this node
`NVALID`	Either	Number of validation observations at this node
`PARENT`	Either	Parent’s node number
`PREDICTEDVALUE`	Either	Value of target predicted at this node
`SPLITVAR`	Either	Variable used in the split
`TREENUM`	Either	Tree number (always 1)
`P_VARLEV`	Nominal	Proportion of training set at this leaf that has target `VAR` = LEV
`V_VARLEV`	Nominal	Proportion of validation set at this leaf that has target `VAR` = LEV
`P_VAR`	Interval	Average value of target `VAR` in the training set
`V_VAR`	Interval	Average value of target `VAR` in the validation set

GROWTHSUBTREE= and PRUNESUBTREE= Data Sets

During tree growth and pruning, the number of leaves at each growth or pruning iteration and other metrics are output to data sets that are specified in the GROWTHSUBTREE= and PRUNESUBTREE= options, respectively.

The growth and pruning data sets are identical, except that:

The growth data set reflects statistics of the tree during growth. The pruning data set reflects statistics of the tree during pruning.
The statistics of the growth data set are always computed from the training subset. The statistics of the pruning data set are computed from the validation subset if one is available. Otherwise, the statistics of the pruning data set are computed from the training subset.

Table 15.6: GROWTHSUBTREE= and PRUNESUBTREE= Data Set Variables

Variable	Target Type	Description
`ITERATION`	Either	Iteration number
`NLEAVES`	Either	Number of leaves
`TREENUM`	Either	Tree number (always 1)
`_ASE_`	Either	Training set: average square error
`_ENTROPY_`	Nominal	Training set: entropy
`_GINI_`	Nominal	Training set: Gini
`_MISC_`	Nominal	Training set: misclassification rate
`_SSE_`	Either	Training set: sum of squares error
`_VASE_`	Either	Validation set: average square error
`_VENTROPY_`	Nominal	Validation set: entropy
`_VGINI_`	Nominal	Validation set: Gini
`_VMISC_`	Nominal	Validation set: misclassification rate
`_VSSE_`	Either	Validation set: sum of squares error
`_ASSESS_`	Either	Subtree assessment value
		Ratio of slopes or change in errors
`FINALTREE`	Either	Chosen subtree if 1
`CHOSENID`	Either	Intermediate number of node selected for pruning
`NODECHANGE`	Either	Change in node selection metric by pruning node