IMSTAT Procedure (Analytics)

RANDOMWOODS Statement

The RANDOMWOODS statement builds a random forest of decision trees. Each tree is built from a bootstrap sample of the data, drawn with replacement, and uses only a subset of the variables specified in the INPUT= option.

Syntax

RANDOMWOODS target-variable </ options>;

Required Argument

target-variable

specifies a single column in the in-memory table as the target variable. The variable can be a temporary calculated column.

RANDOMWOODS Statement Options

ADDTREES

requests that the temporary table generated by scoring a random forest be enhanced with the votes of the individual trees. When a random forest is scored, each tree votes on the predicted value, and the predicted value for the forest is obtained by majority vote. This option adds the vote of each tree in the forest. By default, only the overall vote is reported in the temporary table.
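
The majority vote described for ADDTREES can be illustrated with a short Python sketch. This is not the server's implementation. The function name forest_vote and the sample votes are hypothetical, and the tie-breaking rule (first level seen wins) is an assumption because this section does not document one.

```python
from collections import Counter

def forest_vote(tree_votes):
    """Return the level that received the most votes.

    Ties go to the level that was seen first. That rule is an assumption
    made for this sketch, not documented server behavior.
    """
    counts = Counter(tree_votes)
    return counts.most_common(1)[0][0]

# Hypothetical votes from a five-tree forest for one observation.
votes = ["yes", "no", "yes", "yes", "no"]

overall = forest_vote(votes)  # the single value reported by default
per_tree = list(votes)        # the extra per-tree votes that ADDTREES adds

print(overall)
print(per_tree)
```

With ADDTREES, the temporary table would carry the equivalent of per_tree next to overall, which makes it possible to see how strongly the trees agree on each observation.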

ASSESS

specifies that predicted probabilities are added to the temporary result table for the event levels. You can use these predicted probabilities in an ASSESS statement.

BOOTSTRAP=f

specifies the fraction of the data in the bootstrap sample.

Default f = 1 – exp(–1)
Range 0 to 1
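
The default value 1 – exp(–1) is approximately 0.632. As a general fact about bootstrap sampling, this number is also the limiting expected fraction of distinct rows when n rows are drawn with replacement from n rows. Whether that is the reason for the default is not stated in this section. The Python sketch below only checks the arithmetic, and the function name is hypothetical.

```python
import math

# Default bootstrap fraction as given in the option description.
default_f = 1 - math.exp(-1)

def expected_distinct_fraction(n):
    """Expected share of distinct rows when n rows are drawn
    with replacement from a table of n rows."""
    return 1 - (1 - 1 / n) ** n

print(round(default_f, 4))
for n in (10, 1000, 100000):
    # The value approaches default_f as n grows.
    print(n, round(expected_distinct_fraction(n), 4))
```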

CODE <(code-generation-options)>

requests that the server produce SAS scoring code based on the actions that it performed during the analysis. The server generates DATA step code. By default, the code is replayed as an ODS table by the procedure as part of the output of the statement. More frequently, you might want to write the scoring code to an external file by specifying options.

The scoring code computes the predicted value of the response variable on the data scale (the inverse link scale) and prefixes the name with "RF_". For example, if the response variable is Y, the generated code stores the predicted value as RF_Y. The name of the variable is truncated to fit within the SAS name length requirements.

COMMENT

specifies to add comments to the code in addition to the header block. The header block is added by default.

FILENAME='path'

specifies the name of the external file to which the scoring code is written. This suboption applies only to the scoring code itself.

Alias FILE=

FORMATWIDTH=k

specifies the width to use in formatting derived numbers such as parameter estimates in the scoring code. The server applies the BEST format, and the default format for code generation is BEST20.

Alias FMTW=
Range 4 to 32

LABELID=id

specifies a group identifier for group processing. The identifier is an integer and is used to create array names and statement labels in the generated code.

LINESIZE=n

specifies the line size for the generated code.

Alias LS=
Default 72
Range 64 to 256

NOTRIM

specifies to format the variables using the full format width with padding. By default, leading and trailing blanks are removed from the formatted values.

REPLACE

specifies to overwrite the external file if a file with the specified name already exists. The option has no effect unless you specify the FILENAME= option.

EVENT=("event1" <, "event2">...)

specifies the event names of the target variable. This option is combined with the WEIGHT= option to specify the weight for each specific event. Observations with the specified event are reweighted with the value from the WEIGHT= option. This option is useful for rare-event sampling.

FORMATS=("format-specification",...)

specifies the formats for the input variables. If you do not specify the FORMATS= option, the default format is applied for that variable. Enclose each format specification in quotation marks and separate each format specification with a comma.

GAIN

specifies that the splitting criterion is changed to information gain. Typically, this criterion tends to generate trees with more nodes than the information gain ratio criterion does.
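
The difference between the two criteria can be seen in a small Python sketch that uses the textbook definitions of entropy, information gain, and gain ratio. The exact formulas that the server applies are not given in this section, so treat the functions below as an illustration under that assumption. All names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy of the parent minus the size-weighted entropy of the branches."""
    n = len(parent)
    weighted = sum(len(b) / n * entropy(b) for b in branches)
    return entropy(parent) - weighted

def gain_ratio(parent, branches):
    """Information gain divided by the entropy of the branch sizes."""
    n = len(parent)
    split_info = -sum(len(b) / n * math.log2(len(b) / n) for b in branches if b)
    if split_info == 0:
        return 0.0
    return information_gain(parent, branches) / split_info

parent = ["a"] * 4 + ["b"] * 4
two_way = [["a"] * 4, ["b"] * 4]
four_way = [["a", "a"], ["a", "a"], ["b", "b"], ["b", "b"]]

# Both splits separate the classes perfectly, so their gain is equal,
# but the gain ratio penalizes the split that uses more branches.
print(information_gain(parent, two_way), gain_ratio(parent, two_way))
print(information_gain(parent, four_way), gain_ratio(parent, four_way))
```

Because plain information gain does not penalize splits with many branches, it tends to accept them more readily, which is consistent with the larger trees mentioned above.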

GREEDY

specifies how to perform splitting under specific circumstances.

The behavior depends on q, the number of levels of the variable:
  • When binary splitting is performed and q is less than 15, or when MAXBRANCH= is greater than 2 and q is less than 12, all possible binary splits are enumerated and the split with the largest gain or gain ratio is chosen for the variable.
  • When q is less than 1024 and splitting is not just binary, local greedy searches are applied to determine the optimum local split. When the variable is numeric, the q levels (similar to q bins) are sorted by value. When the variable is nominal, the q levels are ordered by random weights. The best binary split is applied repeatedly until the desired number of branches is reached. Only a local optimum can be found with this technique.
  • When q is 1024 or greater, the default k-means clustering algorithm is applied to determine the splits.
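
To show why exhaustive enumeration is limited to small q, the following Python sketch lists every way to divide q levels into two non-empty groups. There are 2^(q–1) – 1 such splits, so the count roughly doubles with each additional level. The function is a hypothetical illustration, not the server's search routine, and it assumes that the levels are distinct.

```python
from itertools import combinations

def binary_splits(levels):
    """Yield every division of the levels into two non-empty groups.

    The first level is pinned to the left group so that mirror-image
    splits are not produced twice. Levels are assumed to be distinct.
    """
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    # size stops before len(rest), so the right group is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(v for v in rest if v not in extra)
            yield left, right

for left, right in binary_splits(["red", "green", "blue"]):
    print(left, right)

# Number of candidate splits for a few values of q.
for q in (3, 11, 14):
    print(q, sum(1 for _ in binary_splits(range(q))))
```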

IMPUTE

specifies how to treat observations with nonmissing values for the target variable during scoring. When this option is specified, the observed values are used as the predicted values. That is, the observed value is assumed to be known without error. Only the observations with missing values for the target variable are then scored against the random forest, based on their values for the input variables.

This option is useful if you want to replace missing values of a target variable with classified values that are based on the random forest.
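
The scoring rule that IMPUTE describes can be sketched in Python as follows. The function score_with_impute, the predict callback, and the sample rows are hypothetical. None is used here to stand for a missing target value.

```python
def score_with_impute(rows, target, predict):
    """Keep observed target values and predict only the missing ones.

    rows    : list of dicts, one per observation
    target  : name of the target variable
    predict : stand-in for scoring one row against the forest
    """
    scored = []
    for row in rows:
        observed = row.get(target)
        if observed is not None:
            scored.append(observed)       # observed value is taken as known
        else:
            scored.append(predict(row))   # only missing targets are scored
    return scored

# Hypothetical stand-in for a forest: classify by a single input.
def toy_forest(row):
    return "high" if row["income"] > 50 else "low"

rows = [
    {"income": 80, "risk": "low"},   # observed target is kept as-is
    {"income": 80, "risk": None},    # missing target is predicted
    {"income": 20, "risk": None},
]
print(score_with_impute(rows, "risk", toy_forest))
```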

INPUT=variable-name

INPUT=(variable-list)

specifies the variables to use for building the tree. You can add the target variable to the input list if you want to assign a format to the target variable by using the FORMATS= option. Any numeric variable that is not specified in the NOMINAL= option is binned according to the NBINS= specification.

In random forest implementations, not all of the input variables participate in the construction of each tree. Each tree is built from a subset of the input variables. You can use the M= option to affect the selection of these input variables.

LEAFSIZE=m

specifies the minimum number of observations on each node. When the number of observations on a tree node falls short of the specified leaf size m, the node is changed into a leaf during the building of the tree.

Interaction Specifying the LEAFSIZE= option affects the pruning of the tree.

M=k

specifies the number of input variables used to build a tree. The k variables are selected at random from the list of input variables for each tree. If not specified, then k defaults to the square root of the number of input variables, rounded up to the nearest integer.
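
The default for k and the per-tree selection can be sketched in Python. The function names and the variable list are hypothetical, and Python's random module is only a stand-in for the server's random number generator.

```python
import math
import random

def default_m(n_inputs):
    """Square root of the number of inputs, rounded up."""
    return math.ceil(math.sqrt(n_inputs))

def pick_inputs(inputs, k=None, rng=None):
    """Select k input variables at random, without repeats, for one tree."""
    rng = rng or random.Random()
    if k is None:
        k = default_m(len(inputs))
    return rng.sample(inputs, k)

inputs = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10"]
print(default_m(len(inputs)))   # 10 inputs -> 4 variables per tree

# Each tree draws its own subset.
rng = random.Random(1)
for tree in range(3):
    print(tree, pick_inputs(inputs, rng=rng))
```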

MAXBRANCH=n

specifies the maximum number of children (branches) allowed for each level of the tree.

Default 2

MAXLEVEL=n

specifies the maximum number of tree levels.

Default 6

NBINS=k

specifies the number of bins used in the calculation of the tree. The number of bins affects the accuracy of the tree. The accuracy increases with k at the expense of computing time and memory consumption.

Default 2

NBINSTARGET=k

specifies the number of bins to use for a numeric target variable. The number of bins affects the accuracy of the tree. The accuracy increases as values of k increase. However, computing time and memory consumption also increase as values of k increase. When k is greater than zero, the numeric target variable is binned into equally sized bins first and then the bins are used to perform the classification.

Default 0
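
The binning of a numeric target can be sketched in Python. The sketch reads "equally sized bins" as bins of equal width between the minimum and the maximum of the target. That reading is an assumption, because the phrase could also mean bins with equal counts. The function name is hypothetical.

```python
def bin_target(values, k):
    """Assign each value to one of k equal-width bins, numbered 0 to k-1.

    Equal-width binning is an assumption made for this sketch.
    """
    if k <= 0:
        raise ValueError("k must be positive; with k = 0 no binning is applied")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

target = [0.0, 2.5, 5.0, 7.5, 10.0]
print(bin_target(target, 4))
```

The resulting bin numbers then act as the class labels for the classification, which is why a larger k preserves more detail of the original target at a higher computing cost.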

NOERROR

specifies that the out-of-bag error is not computed when building a random decision forest. This option is useful to speed up the building process.

NOMINAL=variable-name

NOMINAL=(variable-list)

specifies the numeric variables to use as nominal variables. Binning is not applied to the specified variables. The target variable is always treated as a nominal variable and does not need to be listed.

NOMISSOBS

specifies to ignore observations that have missing values in the analysis variables when building a decision tree. When scoring a data set, any observations with missing values in the analysis variables for the decision tree are ignored when this option is specified.

When this option is not specified, the RANDOMWOODS statement builds a tree by applying the following policy for missing values:
  • For an interval variable, the smallest machine value is assigned.
  • For a nominal variable, missing values are represented by a separate level.
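
The default missing-value policy can be sketched in Python. Two details are assumptions: the "smallest machine value" is represented by the most negative finite double, and the label of the separate nominal level is made up. Missing values are represented by None or NaN.

```python
import math
import sys

# Assumption: most negative finite double stands in for the
# "smallest machine value" mentioned in the policy.
SMALLEST = -sys.float_info.max

def encode_interval(x):
    """Replace a missing interval value with the smallest machine value."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return SMALLEST
    return x

def encode_nominal(x, missing_level="__MISSING__"):
    """Map a missing nominal value to its own level (label is hypothetical)."""
    return missing_level if x is None else x

ages = [34.0, None, 51.0, float("nan")]
colors = ["red", None, "blue"]

print([encode_interval(a) for a in ages])
print([encode_nominal(c) for c in colors])
```

Under this reading, a missing interval value sorts below every observed value, and a missing nominal value can be split on like any other level.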

NOPREPARSE

prevents the procedure from pre-parsing and pre-generating the program code that is referenced in the TEMPEXPRESS= option. If you know the code is correct, you can specify this option to save resources. The code is always parsed by the server, but you might get more detailed error messages when the procedure parses the code rather than the server. When this option is specified, the server assumes that the code is correct. If the code fails to compile, the server indicates that it could not parse the code, but not where the error occurred.

Alias NOPREP

NTREE=n

specifies the number of trees to build for the random forest.

Default 1

REG

specifies to build the random decision forest using regression trees. Minimal cost-complexity pruning is applied to prune the trees.

SAVE=table-name

saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.

SCOREDATA=table-name

specifies the in-memory table that contains the scoring data. The table must exist in-memory on the server. The RANDOMWOODS statement in the IMSTAT procedure does not transfer a local data set to the server.

If you do not specify a table name for this option, the active table is used as the scoring input.

SEED=s

specifies the random number seed for the random number generator in the server. The default value, zero, implies that the random number stream is based on the computer clock. Negative seed values also lead to random number streams that are based on the computer clock. If you want a reproducible random number sequence between runs, specify a value that is greater than zero.

Default 0
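
The seed rule can be mimicked in Python as follows. The function make_rng is hypothetical, and Python's generator is only a stand-in, so the numbers it produces have no relation to the server's random number stream.

```python
import random
import time

def make_rng(seed=0):
    """Return a generator that follows the SEED= rule.

    seed > 0  : reproducible stream
    seed <= 0 : stream based on the computer clock
    """
    if seed > 0:
        return random.Random(seed)
    return random.Random(time.time_ns())

# The same positive seed gives the same sequence on every run.
first = [make_rng(42).random() for _ in range(2)]
print(first[0] == first[1])

# Zero and negative seeds fall back to the clock and are not reproducible by design.
clock_based = make_rng(0)
print(clock_based.random())
```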

TEMPEXPRESS="SAS-expressions"

TEMPEXPRESS=file-reference

specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.

Alias TE=

TEMPNAMES=variable-name

TEMPNAMES=(variable-list)

specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.

Alias TN=

TEMPTABLE

specifies to save the results of the RANDOMWOODS statement in a temporary table.

If you build a random forest, the temporary table contains information about the forest.
If you score a random forest, the temporary table contains the predicted values for the observations in the input table that you scored. The temporary table also contains the variables that you specified to transfer from the input table and other statistics. This option is required when you perform scoring so that you can access the predictions for each observation. The IMSTAT procedure then displays the name of the table and stores it in the _TEMPSCORE_ macro variable, provided that the scoring action was successful. Variables from the table that you scored can be transferred to the temporary table by using the VARS= option.

TIMEOUT=s

specifies the maximum number of seconds that the statement should run in the server. If the computation does not complete before the time-out is reached, execution stops and the server generates an error message. There is no default time-out.

TREEINFO

specifies to display information about individual trees when you build a tree. For example, the table that is shown can display which variables are used in each tree. The option has no effect if you store the tree in a temporary table.

TREELASR=table-name

specifies the in-memory table that contains the information for the random forest if you want to score a table against the random forest.

The data set whose observations are to be scored is specified in the SCOREDATA= option. If you do not specify the SCOREDATA= option, the active table is used as scoring input.

Alias LASRTREE=

VARS=variable-name

VARS=(variable-name1 <, variable-name2, ...>)

specifies the variables to transfer from the input table to the temporary table in the server that contains the results of scoring a decision tree. This option has no effect unless you specify the TEMPTABLE option and you score a decision tree.

WEIGHT=

specifies the weight for each corresponding event in the EVENT= option.

Details

ODS Table Names

The RANDOMWOODS statement generates the following ODS tables.

ODS Table Name     Description                                          Option
ForestInfo         Basic information about a random forest              Default
GeneratedCode      Generated SAS code from modeling task                CODE=
ForestVarImpInfo   Variable importance in a random forest               Default
ForestScoreInfo    Basic information about scoring in a random forest   TREELASR=
TreeInfo           Basic information about trees in a random forest     TREEINFO
TempTable          Information about a temporary table                  TEMPTABLE

The information in the ForestVarImpInfo table is similar to the DTreeVarImpInfo table that is produced with the STAT option of the DECISIONTREE statement. The difference is that the RANDOMWOODS statement uses multiple trees and the server computes the reduction over all of the trees.
For information about using the ODS table with the SAVE= option, see the Details section of the STORE statement.