The DECISIONTREE statement provides an implementation of a C4.5 decision tree method for classification. You specify a single column as the target variable when you generate the decision tree. You can also score against the generated tree.
Examples: | Training and Validating a Decision Tree |
specifies that predicted probabilities are added to the temporary result table for the event levels. You can use these predicted probabilities in an ASSESS statement.
requests that the server produce SAS scoring code based on the actions that it performed during the analysis. The server generates DATA step code. By default, the code is replayed as an ODS table by the procedure as part of the output of the statement. More frequently, you might want to write the scoring code to an external file by specifying options.
For example, if the target variable is Y, the generated code stores the predicted value as DT_Y. The name of the variable is truncated to fit within the SAS name length requirements.
specifies to add comments to the code in addition to the header block. The header block is added by default.
specifies the name of the external file to which the scoring code is written. This suboption applies only to the scoring code itself.
Alias | FILE= |
specifies the width to use in formatting derived numbers such as parameter estimates in the scoring code. The server applies the BEST format, and the default format for code generation is BEST20.
Alias | FMTW= |
Range | 4 to 32 |
specifies a group identifier for group processing. The identifier is an integer and is used to create array names and statement labels in the generated code.
specifies the line size for the generated code.
Alias | LS= |
Default | 72 |
Range | 64 to 256 |
specifies to format the variables using the full format width with padding. By default, leading and trailing blanks are removed from the formatted values.
specifies to overwrite the external file if a file with the specified name already exists. The option has no effect unless you specify the FILENAME= option.
specifies a value between 0 and 1 that controls the aggressiveness of tree pruning according to the C4.5 algorithm. Smaller values result in more aggressive pruning. See the PRUNE option.
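The following Python sketch illustrates why a smaller confidence level leads to more aggressive pruning. It is not SAS code and does not show how the server computes its estimates. It assumes the textbook normal-approximation form of the pessimistic error estimate that is often used to explain C4.5 pruning. The function name `pessimistic_error` and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist

def pessimistic_error(errors, n, cf):
    """Upper confidence limit on a node's error rate (normal approximation).

    errors: misclassified training observations at the node
    n:      total training observations at the node
    cf:     confidence level between 0 and 1 (assumed analogous to CFLEV=)
    """
    f = errors / n                          # observed error rate
    z = NormalDist().inv_cdf(1.0 - cf)      # smaller cf gives a larger z
    num = f + z * z / (2 * n) + z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

# The same node (2 errors in 10 observations) judged at two confidence levels.
loose = pessimistic_error(2, 10, 0.25)
strict = pessimistic_error(2, 10, 0.05)
```

With the smaller confidence level, the estimated error of the node is inflated further above its observed rate. Under this estimate, a subtree is therefore more likely to look no better than a single leaf and to be pruned.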
requests detailed information about the classification results when scoring a table against a previously calculated tree.
specifies the formats for the input variables. If you do not specify the FORMATS= option, the default format of each variable is applied.
Example | proc imstat data=lasr1.table1;
decisiontree x / input=(a b) formats=("8.3", "$10");
quit;
|
specifies that the splitting criterion is changed to information gain. Typically, this criterion tends to generate trees with more nodes than the information gain ratio criterion does.
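The difference between the two criteria can be illustrated with a short Python sketch. It is not SAS code. It uses the standard definitions of information gain and gain ratio; the helper names and the sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_scores(groups):
    """Return (information gain, gain ratio) for a split.

    groups is a list of non-empty label lists, one per child node.
    """
    parent = [y for g in groups for y in g]
    n = len(parent)
    weights = [len(g) / n for g in groups]
    gain = entropy(parent) - sum(w * entropy(g) for w, g in zip(weights, groups))
    # Split information grows with the number of branches.
    split_info = -sum(w * math.log2(w) for w in weights)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Two splits that both separate the classes perfectly.
labels = ["a"] * 4 + ["b"] * 4
two_way = [labels[:4], labels[4:]]
eight_way = [[y] for y in labels]

gain2, ratio2 = split_scores(two_way)
gain8, ratio8 = split_scores(eight_way)
```

Both splits have the same information gain, so that criterion does not prefer the compact two-way split. The gain ratio divides by the split information, which penalizes the eight-way split and favors splits with fewer branches.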
specifies how to perform splitting under specific circumstances.
specifies how to treat observations with nonmissing values for the target variable during scoring. When this option is specified, the observed values are used as the predicted values. That is, the observed value is assumed to be known without error. Only the observations with missing values for the target variable are then scored against the decision tree based on their values for the input variables.
specifies the variables to use for building the tree. You can add the target variable to the input list if you want to assign a format to the target variable by using the FORMATS= option. Any numeric variable that is not specified in the NOMINAL= option is binned according to the NBINS= specification.
specifies the minimum number of observations on each node. When the number of observations on a tree node is less than the specified value m, the node is changed to a leaf during the building of the decision tree.
Interaction | Specifying the LEAFSIZE option affects the pruning of the tree. |
specifies the maximum number of children (branches) to allow for each level of the tree.
Default | 2 |
specifies the maximum number of levels in the tree.
Default | 6 |
specifies to allow a variable to appear multiple times when traversing the tree from top to bottom.
specifies the number of bins, k, to use in the calculation of the decision tree. The number of bins affects the accuracy of the tree. The accuracy increases as k increases. However, computing time and memory consumption also increase as k increases.
Default | 2 |
specifies the number of bins, k, to use for a numeric target variable. The number of bins affects the accuracy of the tree. The accuracy increases as k increases. However, computing time and memory consumption also increase as k increases. When k is greater than zero, the numeric target variable is first binned into equally sized bins, and the bins are then used to perform the classification.
Default | 0 |
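The effect of the number of bins on a numeric target can be sketched in Python as follows. It is not SAS code. It assumes that "equally sized" means equal width; the server's exact boundary rules are not described here. The function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0 to k-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

target = [1.0, 2.5, 3.0, 4.9, 7.2, 9.9, 10.0]
coarse = equal_width_bins(target, 2)   # two classes to predict
fine = equal_width_bins(target, 5)     # five classes to predict
```

With two bins, the values 1.0 through 4.9 are treated as the same class. With five bins, they are spread over three classes. The classification preserves more of the original detail, but there are more levels to process.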
specifies the numeric variables to use as nominal variables. Binning is not applied to the specified variables. The target variable is always treated as a nominal variable and does not need to be listed.
specifies to ignore observations that have missing values in the analysis variables when building a decision tree. When scoring a data set, any observations with missing values in the analysis variables for the decision tree are ignored when this option is specified.
prevents the procedure from preparsing and pregenerating code for temporary expressions, scoring programs, and other user-written SAS statements.
Alias | NOPREP |
specifies not to prune any observations when building a decision tree.
suppresses the generation of the scoring temporary table when the TEMPTABLE option is specified. In this case, the server generates only one temporary table and the table contains the decision tree.
requests to prune the tree according to the C4.5 algorithm. Pruning can increase the error of misclassification. You can control the aggressiveness of pruning with the CFLEV= option. Smaller values for the CFLEV= option result in more aggressive pruning.
specifies to enable C4.5 pruning when building a classification decision tree. The tree could have a large misclassification rate, but the building process is performed quickly.
specifies to build a regression tree. Minimal cost-complexity pruning is applied to prune the tree.
saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.
specifies the in-memory table that contains the scoring data. The table must exist in-memory on the server. The DECISIONTREE statement in the IMSTAT procedure does not transfer a local data set to the server.
requests that the server estimate the size of the result set. The procedure does not create a result table if the SETSIZE option is specified. Instead, the procedure reports the number of rows that are returned by the request and the expected memory consumption for the result set (in KB). If you specify the SETSIZE option, the SAS log includes the number of observations and the estimated result set size. See the following log sample:
NOTE: The LASR Analytic Server action request for the STATEMENT
statement would return 17 rows and approximately
3.641 kBytes of data.
specifies to generate two additional tables that contain statistical information about the variables that are used in the decision tree. One table contains the variable importance information, which is determined by the total Gini reduction. The second table contains the variable splitting information for each node in the decision tree.
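The idea of ranking variables by total Gini reduction can be sketched in Python as follows. It is not SAS code and does not reproduce the server's tables. It assumes that each split's reduction is weighted by observation counts and summed per splitting variable. The splits listed are hypothetical.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, children):
    """Count-weighted drop in Gini impurity produced by one split."""
    return len(parent) * gini(parent) - sum(len(c) * gini(c) for c in children)

# Hypothetical splits: (variable name, parent labels, list of child label lists)
splits = [
    ("a", ["y"] * 4 + ["n"] * 4, [["y"] * 4 + ["n"], ["n"] * 3]),
    ("b", ["y"] * 4 + ["n"], [["y"] * 4, ["n"]]),
]

# Sum the reductions of all splits that use the same variable.
importance = defaultdict(float)
for var, parent, children in splits:
    importance[var] += gini_reduction(parent, children)
```

In this sketch, variable `a` receives the larger total because its split removes more impurity from more observations.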
specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.
Alias | TE= |
specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.
Alias | TN= |
specifies to store the results in a temporary table. The type of information that is stored depends on whether you are building a decision tree or scoring a table with a decision tree.
specifies the maximum number of seconds that the server should run the statement. If the time-out is reached, the server terminates the request and generates an error and error message. By default, there is no time-out.
specifies the saved table that contains the generated tree. In order to score a (validation) data set against the generated tree, you need the validation data and a representation of the tree. Specify the validation data with the SCOREDATA= option and the saved tree with this option.
Alias | SCORETAB= |
specifies the variables to transfer from the input table to the temporary table in the server that contains the results of scoring a decision tree. This option has no effect unless you specify the TEMPTABLE option and you score a decision tree.
ODS Table Name | Description | Option |
---|---|---|
DTREE | Classification decision tree | Default |
DTreeVarImpInfo | Variable importance in a decision tree | STAT |
DTreeVarStatInfo | Variable information for decision tree | STAT |
DTREESCORE | Classification decision tree scoring summary | SCOREDATA= |
GeneratedCode | Generated SAS code from modeling task | CODE |
TempTable | Information about a temporary table | TEMPTABLE |