The HPSPLIT Procedure

PROC HPSPLIT Statement

PROC HPSPLIT <options> ;

The PROC HPSPLIT statement invokes the procedure. Table 9.2 summarizes the options in the PROC HPSPLIT statement.

Table 9.2: PROC HPSPLIT Statement Options

Option           Description

Basic Options
DATA=            Specifies the input data set
EVENT=           Specifies the formatted value of the target event
INTERVALBINS=    Specifies the number of bins for interval variables

Splitting Options
LEAFSIZE=        Specifies the minimum number of observations per leaf
MAXBRANCH=       Specifies the maximum number of children per node
MAXDEPTH=        Specifies the maximum tree depth
MINCATSIZE=      Specifies the minimum number of observations that a level must have to be considered for splitting

FastCHAID Options
ALPHA=           Specifies the maximum p-value for a split to be considered
BONFERRONI       Enables the Bonferroni adjustment to after-split p-values
MINDIST=         Specifies the minimum Kolmogorov-Smirnov distance

DATA=SAS-data-set

names the input SAS data set to be used by PROC HPSPLIT. The default is the most recently created data set.

If the procedure executes in distributed mode, the input data are distributed to memory on the appliance nodes and analyzed in parallel, unless the data are already distributed in the appliance database. In that case the procedure reads the data alongside the distributed database. For more information about the execution modes, see the section Processing Modes. For more information about the alongside-the-database model, see the section Alongside-the-Database Execution. Both sections are in Chapter 2: Shared Concepts and Topics.

EVENT=value

specifies a formatted value of the target variable to use for sorting nominal input levels when you use the FastCHAID criterion or when PROC HPSPLIT uses the fastest splitting category. Ties are broken in internal order. For example, if you are looking for defects in a data set where a target value of 'D' indicates a defect, specify EVENT='D'.

For details about splitting categories and the FastCHAID criterion, see the section Input Variable Splitting and Selection.

The default is the first level of the target as specified by the ORDER= option in the TARGET statement.
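
The defect example above might be written as in the following sketch. The data set Work.Parts, the target Status, and the input variables Supplier, Machine, Temp, and Pressure are hypothetical names chosen only for illustration. The LEVEL= values in the INPUT statements and the FASTCHAID keyword in the CRITERION statement are assumed to follow the usual PROC HPSPLIT syntax.

```sas
/* Hypothetical data set Work.Parts: Status = 'D' marks a defective part. */
/* All variable names here are illustrative assumptions.                  */
proc hpsplit data=work.parts event='D';
   target Status;
   input Supplier Machine / level=nom;   /* nominal inputs  */
   input Temp Pressure    / level=int;   /* interval inputs */
   criterion fastchaid;   /* EVENT= sorts nominal input levels under FastCHAID */
run;
```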

INTERVALBINS=number

specifies the number of bins for interval variables. For more information about interval variable binning, see the section Details: HPSPLIT Procedure.

The default is INTERVALBINS=100.

LEAFSIZE=number

specifies the minimum number of observations that each child of a split must contain in the training data set in order for the split to be considered.

The default is LEAFSIZE=1.

MAXBRANCH=number

specifies the maximum number of children per node in the tree. PROC HPSPLIT tries to create this number of children unless it is impossible (for example, if a split variable does not have enough levels).

The default is the number of target levels.

MAXDEPTH=number

specifies the maximum depth of the tree to be grown.

The default depends on the value of the MAXBRANCH= option. If MAXBRANCH=2, the default is MAXDEPTH=10. Otherwise, the MAXDEPTH= option is set using the following equation:

\begin{equation*}  \mathrm{MAXDEPTH} = \left\lceil \frac{10}{\log _2\left(\mathrm{MAXBRANCH}\right)} \right\rceil \end{equation*}
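
As a worked illustration of this equation, the following values result for MAXBRANCH=4 and MAXBRANCH=3 (the value of $\log_2 3$ is rounded):

\begin{equation*}  \mathrm{MAXBRANCH}=4: \quad \left\lceil \frac{10}{\log _2 4} \right\rceil = \left\lceil \frac{10}{2} \right\rceil = 5 \end{equation*}

\begin{equation*}  \mathrm{MAXBRANCH}=3: \quad \left\lceil \frac{10}{\log _2 3} \right\rceil \approx \left\lceil \frac{10}{1.585} \right\rceil = \left\lceil 6.31 \right\rceil = 7 \end{equation*}

Substituting MAXBRANCH=2 gives $\lceil 10/1 \rceil = 10$, which agrees with the stated default of MAXDEPTH=10 for binary trees.
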

MINCATSIZE=number

specifies the minimum number of observations that a nominal variable level must have in order to be considered in the split. Levels that have fewer observations than number receive the missing value assignment for that split.

The default is MINCATSIZE=0. That is, it is disabled by default.
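
The splitting options can be combined in a single PROC HPSPLIT statement, as in the following sketch. It assumes that the home equity sample data set Sampsio.Hmeq is available at your site. The option values are arbitrary and are not recommendations, and the ENTROPY criterion is assumed to be an available choice in the CRITERION statement.

```sas
/* Assumes Sampsio.Hmeq is available; option values are illustrative only. */
proc hpsplit data=sampsio.hmeq
             leafsize=5       /* at least 5 training observations per leaf        */
             maxbranch=2      /* binary splits                                    */
             maxdepth=8       /* shallower than the default of 10 for MAXBRANCH=2 */
             mincatsize=10;   /* nominal levels need 10 observations to be used   */
   target BAD;
   input REASON JOB         / level=nom;
   input LOAN MORTDUE VALUE / level=int;
   criterion entropy;
run;
```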

ALPHA=number

specifies the maximum p-value for a split to be considered if you specify FastCHAID in the CRITERION statement. Otherwise, this option is ignored.

The default is ALPHA=0.3.

BONFERRONI

enables the Bonferroni adjustment to the p-value of each variable’s split. The adjustment is applied after the split has been determined by Kolmogorov-Smirnov distance. This option applies only if you specify FastCHAID in the CRITERION statement. Otherwise, this option is ignored.

By default, there is no Bonferroni adjustment.

MINDIST=number

specifies the minimum Kolmogorov-Smirnov distance for a split to be considered when you specify FastCHAID in the CRITERION statement. Otherwise, this option is ignored.

The default is MINDIST=0.01.
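
Because ALPHA=, BONFERRONI, and MINDIST= take effect only under the FastCHAID criterion, they are typically specified together with that criterion. The following sketch assumes that the sample data set Sampsio.Hmeq is available and that FASTCHAID is the keyword accepted by the CRITERION statement. The option values are arbitrary.

```sas
/* Assumes Sampsio.Hmeq is available; option values are illustrative only. */
proc hpsplit data=sampsio.hmeq
             alpha=0.1        /* stricter than the default of 0.3       */
             bonferroni       /* adjust after-split p-values            */
             mindist=0.05;    /* require a larger K-S distance per split */
   target BAD;
   input REASON JOB         / level=nom;
   input LOAN MORTDUE VALUE / level=int;
   criterion fastchaid;       /* without this, the three options are ignored */
run;
```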