IMSTAT Procedure (Analytics)

DECISIONTREE Statement

The DECISIONTREE statement provides an implementation of a C4.5 decision tree method for classification. You specify a single column as the target variable when you generate the decision tree. You can also score against the generated tree.

Examples:

Training and Validating a Decision Tree

Storing and Scoring a Decision Tree

Syntax

DECISIONTREE Statement Options

Details

ODS Table Names

Syntax

DECISIONTREE target-variable </ options>;

DECISIONTREE Statement Options

ASSESS

specifies that predicted probabilities are added to the temporary result table for the event levels. You can use these predicted probabilities in an ASSESS statement.

CODE <(code-generation-options)>

requests that the server produce SAS scoring code based on the actions that it performed during the analysis. The server generates DATA step code. By default, the code is replayed as an ODS table by the procedure as part of the output of the statement. More frequently, you might want to write the scoring code to an external file by specifying options.

The scoring code computes the predicted value of the response variable and prefixes the name with "DT_". For example, if the response variable is Y, the generated code stores the predicted value as DT_Y. The name of the variable is truncated to fit within the SAS name length requirements.

COMMENT

specifies to add comments to the code in addition to the header block. The header block is added by default.

FILENAME='path'

specifies the name of the external file to which the scoring code is written. This suboption applies only to the scoring code itself.

Alias

FILE=

FORMATWIDTH=k

specifies the width to use in formatting derived numbers such as parameter estimates in the scoring code. The server applies the BEST format, and the default format for code generation is BEST20.

Alias	FMTW=
Range	4 to 32

LABELID=id

specifies a group identifier for group processing. The identifier is an integer and is used to create array names and statement labels in the generated code.

LINESIZE=n

specifies the line size for the generated code.

Alias	LS=
Default	72
Range	64 to 256

NOTRIM

specifies to format the variables using the full format width with padding. By default, leading and trailing blanks are removed from the formatted values.

REPLACE

specifies to overwrite the external file if a file with the specified name already exists. The option has no effect unless you specify the FILENAME= option.
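
As a minimal sketch of how the code-generation suboptions might be combined, the following statement asks the server to write commented scoring code to an external file and to overwrite an existing copy. The table lasr1.table1, the variables x, a, and b, and the path /tmp/dt_score.sas are hypothetical placeholders. The suboptions are placed in parentheses after CODE, as the syntax above indicates; separating them with blanks is an assumption.

```sas
proc imstat data=lasr1.table1;
   /* hypothetical table, variables, and file path */
   decisiontree x / input=(a b)
                    code(filename='/tmp/dt_score.sas' replace comment linesize=80);
quit;
```

Because the target variable in this sketch is x, the generated DATA step code would store the predicted value as DT_X.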

CFLEV=number

specifies a value between 0 and 1 that controls the aggressiveness of tree pruning according to the C4.5 algorithm. Smaller numbers indicate more aggressive pruning. See the PRUNE option.

DETAIL

requests detailed information about the classification results when scoring a table against a previously calculated tree.

FORMATS=("format-specification",...)

specifies the formats for the input variables. If you do not specify the FORMATS= option, the default format is applied for that variable.

Enclose each format specification in quotation marks and separate each format specification with a comma.

Example

proc imstat data=lasr1.table1;
   decisiontree x / input=(a b) formats=("8.3", "$10");
quit;

GAIN

specifies that the splitting criterion is changed to information gain. This criterion typically generates trees with more nodes than the information gain ratio criterion.
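
For reference, the textbook C4.5 definitions of the two criteria can be sketched as follows. The notation is introduced here for illustration and is not taken from the server implementation: S is the set of observations in a node, S_v is the subset sent to branch v by a split on variable A, and p_c is the proportion of observations in S with target class c.

```latex
\begin{aligned}
H(S) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gain}(S, A) &= H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{aligned}
```

The gain ratio divides by the split information, so it penalizes splits into many small branches. The plain gain criterion has no such penalty.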

GREEDY

specifies how to perform splitting under specific circumstances.

Assume that a variable has q levels. When binary splitting is performed and q is less than 15, or when the value of the MAXBRANCH= option is greater than 2 and q is less than 12, all possible binary splits are enumerated and the split with the largest gain or gain ratio is chosen for the variable.

When q is less than 1024 and splitting is not just binary, local greedy searches are applied to determine the optimum local split. Specifically, when the variable is numeric, q levels (similar to q bins) are sorted by value.

When the variable is nominal, the q levels are ordered by random weights. The best binary splitting is applied until the desired number of branches is reached. Only a local optimum can be found with this technique.

For values of q ≥ 1024, the default k-means clustering algorithm is applied to determine the splits.
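
A minimal sketch of a request that combines the GREEDY option with splits that are not just binary is shown below. The table and variable names are hypothetical, and the value 4 for MAXBRANCH= is an arbitrary illustration. Which search is actually applied to each variable depends on its number of levels q, as described above.

```sas
proc imstat data=lasr1.table1;
   /* hypothetical table and variables; allow up to four branches */
   decisiontree x / input=(a b) maxbranch=4 greedy;
quit;
```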

IMPUTE

specifies how to treat observations with nonmissing values for the target variable during scoring. When this option is specified, the observed values are used as the predicted values. That is, the observed value is assumed to be known without error. Only the observations with missing values for the target variable are then scored against the decision tree based on their values for the input variables.

The IMPUTE option is useful if you want to replace missing values of a target variable with classified values based on the decision tree.

INPUT=variable-name

INPUT=(variable-name1 < variable-name2, ...>)

specifies the variables to use for building the tree. You can add the target variable to the input list if you want to assign a format to the target variable by using the FORMATS= option. Any numeric variable that is not specified in the NOMINAL= option is binned according to the NBINS= specification.

LEAFSIZE=m

specifies the minimum number of observations on each node. When the number of observations on a tree node is less than m, the node is changed to a leaf during the building of the decision tree.

Interaction

Specifying the LEAFSIZE option affects the pruning of the tree.

MAXBRANCH=n

specifies the maximum number of children (branches) to allow for each level of the tree.

Default

MAXLEVEL=n

specifies the maximum number of levels in the tree.

Default

MULTVAR

specifies to allow a variable to appear multiple times when traversing the tree from top to bottom.

NBINS=k

specifies the number of bins to use in the calculation of the decision tree. The number of bins affects the accuracy of the tree. The accuracy increases as values of k increase. However, computing time and memory consumption also increase as values of k increase.

Default

NBINSTARGET=k

specifies the number of bins to use for a numeric target variable. The number of bins affects the accuracy of the tree. The accuracy increases as values of k increase. However, computing time and memory consumption also increase as values of k increase. When k is greater than zero, the numeric target variable is binned into equally sized bins first and then the bins are used to perform the classification.

Default

NOMINAL=variable-name

NOMINAL=(variable-list)

specifies the numeric variables to use as nominal variables. Binning is not applied to the specified variables. The target variable is always treated as a nominal variable and does not need to be listed.
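
The interaction of the INPUT=, NBINS=, and NOMINAL= options can be sketched as follows, assuming a hypothetical table in which a and b are both numeric. The variable a is binned because it is not listed in the NOMINAL= option, and b is treated as nominal and is not binned. The value 20 is an arbitrary illustration.

```sas
proc imstat data=lasr1.table1;
   /* a: numeric, binned into 20 bins; b: numeric, treated as nominal */
   decisiontree x / input=(a b) nominal=b nbins=20;
quit;
```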

NOMISSOBS

specifies to ignore observations that have missing values in the analysis variables when building a decision tree. When scoring a data set, any observations with missing values in the analysis variables for the decision tree are ignored when this option is specified.

When this option is not specified, the DECISIONTREE statement builds a tree by applying the following policy for missing values:

for an interval variable, the smallest machine value is assigned
for a nominal variable, missing values are represented by a separate level

NOPREPARSE

prevents the procedure from preparsing and pregenerating code for temporary expressions, scoring programs, and other user-written SAS statements.

When this option is specified, the user-written statements are sent to the server "as is" and then the server attempts to generate code from them. If the server detects problems with the code, the error messages might not be as detailed as the messages that are generated by the SAS client. If you are debugging your user-written program, then you might want to preparse and pregenerate code in the procedure. However, if your SAS statements compile and run as you want them to, then you can specify this option to avoid the work of parsing and generating code on the SAS client.

When you specify this option in the PROC IMSTAT statement, the option applies to all statements that can generate code. You can also exclude specific statements from preparsing by using the NOPREPARSE option in statements that allow temporary columns or the SCORE statement.

Alias

NOPREP

NOPRUNEOBS

specifies not to prune any observations when building a decision tree.

NOSCORE

suppresses the generation of the scoring temporary table when the TEMPTABLE option is specified. In this case, the server generates only one temporary table and the table contains the decision tree.

PRUNE

requests to prune the tree according to the C4.5 algorithm. Pruning can increase the error of misclassification. You can control the aggressiveness of pruning with the CFLEV= option. Smaller values for the CFLEV= option result in more aggressive pruning.
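
As a sketch, a pruned tree with fairly aggressive pruning might be requested as follows. The table and variable names are hypothetical, and the value 0.1 is chosen only for illustration; it is not a documented default.

```sas
proc imstat data=lasr1.table1;
   /* smaller CFLEV= values prune more aggressively */
   decisiontree x / input=(a b) prune cflev=0.1;
quit;
```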

PRUNEGROW

specifies to enable C4.5 pruning when building a classification decision tree. The tree could have a large misclassification rate, but the building process is performed quickly.

REG

specifies to build a regression tree. Minimal cost-complexity pruning is applied to prune the tree.

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SCOREDATA=table-name

specifies the in-memory table that contains the scoring data. The table must exist in memory on the server. The DECISIONTREE statement in the IMSTAT procedure does not transfer a local data set to the server.

SETSIZE

requests that the server estimate the size of the result set. The procedure does not create a result table if the SETSIZE option is specified. Instead, the procedure reports the number of rows that are returned by the request and the expected memory consumption for the result set (in KB). If you specify the SETSIZE option, the SAS log includes the number of observations and the estimated result set size. See the following log sample:

NOTE: The LASR Analytic Server action request for the STATEMENT
      statement would return 17 rows and approximately
      3.641 kBytes of data.

The typical use of the SETSIZE option is to get an estimate of the size of the result set in situations where you are unsure whether the SAS session can handle a large result set. Be aware that in order to determine the size of the result set, the server has to perform the work as if you were receiving the actual result set. Requesting the estimated size of the result set does consume resources on the server. The estimated number of KB is very close to the actual memory consumption of the result set. It might not be immediately obvious how this size relates to the displayed table, since many tables contain hidden columns. In addition, some elements of the result set might not be converted to tabular output by the procedure.
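
A minimal sketch of a size-estimate request follows, using hypothetical table and variable names. No result table is produced. The procedure writes a note like the log sample above instead.

```sas
proc imstat data=lasr1.table1;
   /* report the row count and estimated size instead of returning results */
   decisiontree x / input=(a b) setsize;
quit;
```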

STAT

specifies to generate two additional tables that contain statistical information about the variables that are used in the decision tree. One table contains the variable importance information, which is determined by the total Gini reduction. The second table contains the variable splitting information for each node in the decision tree.

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias

TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias

TN=
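
The TEMPNAMES= and TEMPEXPRESS= options are used together. The following sketch defines one temporary variable and uses it as an input to the tree. The table lasr1.table1, the variables x, a, and b, and the temporary variable name diff are hypothetical, and a and b are assumed to be numeric.

```sas
proc imstat data=lasr1.table1;
   /* diff is a hypothetical temporary variable that is computed on the server */
   decisiontree x / input=(a b diff)
                    tempnames=diff
                    tempexpress="diff = a - b;";
quit;
```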

TEMPTABLE

specifies to store the results in a temporary table. The type of information that is stored depends on whether you are building a decision tree or scoring a table with a decision tree.

When you are building a decision tree, the generated decision tree is stored in the server and the input table is automatically scored using this tree. The scoring details are saved in a temporary table. The _TEMPTREE_ macro variable stores the name of the temporary table for the tree. The _TEMPSCORE_ macro variable stores the name of the temporary table that has the scoring results of traversing the decision tree. You can suppress the generation of the scoring temporary table (_TEMPSCORE_) during the tree building phase by specifying the NOSCORE option.

When you are scoring a table using a decision tree, the TEMPTABLE option requests to store the scoring details in a temporary table in the server. The IMSTAT procedure displays the name of the table and stores it in the _TEMPSCORE_ macro variable. Be aware that the DETAIL option can generate a very large amount of scoring results when the in-memory table that is specified in the SCOREDATA= option is large. Observations from the scored data set can be transferred to the temporary table using the VARS= option.
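
The tree-building case can be sketched as follows, with hypothetical table and variable names. The sketch assumes that the RUN statement executes the pending DECISIONTREE statement, so that both macro variables are set before the %PUT statements resolve them.

```sas
proc imstat data=lasr1.table1;
   /* build the tree and store the tree and the scoring details on the server */
   decisiontree x / input=(a b) temptable;
run;
   /* names of the temporary tables that the server generated */
   %put NOTE: Tree table is &_temptree_;
   %put NOTE: Scoring table is &_tempscore_;
quit;
```

Adding the NOSCORE option to the statement would leave only the tree table, in which case _TEMPSCORE_ is not expected to be set by this request.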

TIMEOUT=s

specifies the maximum number of seconds that the server should run the statement. If the time-out is reached, the server terminates the request and generates an error and error message. By default, there is no time-out.

TREEDATA=libref.member-name

TREETAB=saved-table

TREELASRTAB=table-name

specifies the saved table that contains the generated tree. In order to score a (validation) data set against the generated tree, you need the validation data and a representation of the tree. Specify these options as follows:

The TREEDATA= option is used to specify the name of a SAS data set that stores the generated tree. The data set is local to the SAS client.
The TREETAB= option is used to specify a table on the SAS client that stores the generated tree.
The TREELASRTAB= option is used to specify a valid decision tree that is stored in an in-memory table.

The data set with the observations to score is specified in the SCOREDATA= option.

Alias

SCORETAB=

VARS=variable-name

VARS=(variable-name1 < variable-name2, ...>)

specifies the variables to transfer from the input table to the temporary table in the server that contains the results of scoring a decision tree. This option has no effect unless you specify the TEMPTABLE option and you score a decision tree.
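
Putting the scoring options together, a request to score a table against a tree that is already stored in memory might look like the following sketch. All table and variable names are hypothetical: lasr1.mytree is assumed to hold a valid decision tree for the target x, and lasr1.table2 is assumed to hold the observations to score. Qualifying the table names with a libref is also an assumption.

```sas
proc imstat data=lasr1.table1;
   /* score lasr1.table2 against the stored tree; copy a and b to the results */
   decisiontree x / scoredata=lasr1.table2
                    treelasrtab=lasr1.mytree
                    temptable
                    vars=(a b);
run;
   /* name of the temporary table that holds the scoring results */
   %put NOTE: Scoring results are in &_tempscore_;
quit;
```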

Details

ODS Table Names

The DECISIONTREE statement generates ODS tables that are specified in the following table.

ODS Table Name	Description	Option
DTREE	Classification decision tree	Default
DTreeVarImpInfo	Variable importance in a decision tree	STAT
DTreeVarStatInfo	Variable information for decision tree	STAT
DTREESCORE	Classification decision tree scoring summary	SCOREDATA=
GeneratedCode	Generated SAS code from modeling task	CODE
TempTable	Information about a temporary table	TEMPTABLE

For information about using the ODS table with the SAVE= option, see the Details section of the STORE statement.