The RANDOMWOODS statement builds a random forest of decision trees. Each tree is built from a bootstrap sample of the data, drawn with replacement, and uses only a subset of the variables specified in the INPUT= option.
specifies a single column in the in-memory table as the target variable. The variable can be a temporary calculated column.
requests that the temporary table that is generated by scoring a random forest include the votes of the individual trees. When a random forest is scored, each tree votes on the predicted value, and the predicted value for the forest is obtained by majority vote. This option adds the vote of each tree in the forest to the table. By default, only the overall vote is reported in the temporary table.
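The majority vote described above can be illustrated with a minimal Python sketch. This is not the server's implementation: the function name `forest_vote` is hypothetical, and the tie-breaking rule (the level that was voted for first wins) is an assumption, because the tie-breaking behavior is not described here.

```python
from collections import Counter


def forest_vote(tree_votes):
    """Return the forest prediction as the most common vote among the trees.

    Assumption: ties go to the level that appears first in tree_votes.
    """
    counts = Counter(tree_votes)
    # most_common(1) returns a list with one (level, count) pair.
    return counts.most_common(1)[0][0]


# One vote per tree for a single observation.
votes = ["yes", "no", "yes", "yes", "no"]
overall = forest_vote(votes)
```

In this sketch, `votes` corresponds to the per-tree information that the option adds to the temporary table, and `overall` corresponds to the single vote that is reported by default.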
specifies that predicted probabilities are added to the temporary result table for the event levels. You can use these predicted probabilities in an ASSESS statement.
specifies the fraction of the data in the bootstrap sample.
Default | f = 1 – exp(–1) |
Range | 0 to 1 |
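One way to read the default value: when n rows are drawn with replacement from a table of n rows, the expected fraction of distinct rows in the sample approaches 1 – exp(–1), which is about 0.632, as n grows. The following Python sketch checks that limit numerically. The function name is hypothetical, and the sketch does not claim that this is how the server draws its sample.

```python
import math


def expected_unique_fraction(n):
    """Expected fraction of distinct rows when n rows are drawn
    with replacement from a table of n rows."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# The documented default fraction, about 0.632.
default_fraction = 1.0 - math.exp(-1.0)

# For a large table the expected fraction is close to the default.
large_table_fraction = expected_unique_fraction(1_000_000)
```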
requests that the server produce SAS scoring code based on the actions that it performed during the analysis. The server generates DATA step code. By default, the code is replayed as an ODS table by the procedure as part of the output of the statement. More frequently, you might want to write the scoring code to an external file by specifying options.
If the target variable is named Y, the generated code stores the predicted value as RF_Y. The name of the variable is truncated to fit within the SAS name length requirements.
specifies to add comments to the code in addition to the header block. The header block is added by default.
specifies the name of the external file to which the scoring code is written. This suboption applies only to the scoring code itself.
Alias | FILE= |
specifies the width to use in formatting derived numbers such as parameter estimates in the scoring code. The server applies the BEST format, and the default format for code generation is BEST20.
Alias | FMTWIDTH= |
Range | 4 to 32 |
specifies the line size for the generated code.
Alias | LS= |
Default | 72 |
Range | 64 to 256 |
specifies the event names of the target variable. This option is combined with the WEIGHT= option to specify the weight for each specific event. Observations with the specified event are reweighted with the value from the WEIGHT= option. This option is useful for rare-event sampling.
specifies the formats for the input variables. If you do not specify the FORMATS= option, the default format is applied to each variable. Enclose each format specification in quotation marks and separate the format specifications with commas.
specifies that the splitting criterion is changed to information gain. Typically, this criterion tends to generate trees with more nodes than the information gain ratio criterion.
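The difference between the two criteria can be sketched in Python with the textbook definitions of entropy, information gain, and gain ratio. The function names are hypothetical, and the server's exact formulas are not given here, so treat this as an illustration of the general idea only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Information gain divided by the entropy of the split sizes."""
    n = len(parent)
    split_info = -sum(
        (len(ch) / n) * math.log2(len(ch) / n) for ch in children if ch
    )
    return information_gain(parent, children) / split_info if split_info > 0 else 0.0


parent = ["a", "a", "b", "b"]
two_way = [["a", "a"], ["b", "b"]]
four_way = [["a"], ["a"], ["b"], ["b"]]
```

Both splits have the same information gain, but the four-way split has a lower gain ratio because the ratio penalizes splits into many small branches. This is consistent with information gain tending to produce trees with more nodes.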
specifies how to perform splitting under specific circumstances.
specifies how to treat observations with nonmissing values for the target variable during scoring. When this option is specified, the observed values are used as the predicted values. That is, the observed value is assumed to be known without error. Only the observations with missing values for the target variable are then scored against the random forest, based on their values for the input variables.
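The behavior described above amounts to filling in only the missing target values. A minimal Python sketch follows, under the assumptions that a missing target is represented by `None` and that `predict` is a hypothetical stand-in for scoring one row against the forest.

```python
def keep_observed_or_score(observed, rows, predict):
    """Use the observed target where it is present; otherwise score the row."""
    return [
        obs if obs is not None else predict(row)
        for obs, row in zip(observed, rows)
    ]


# Hypothetical scoring rule, used only to make the sketch runnable.
def predict(row):
    return "high" if row["x"] > 5 else "low"


observed = ["low", None, "high", None]
rows = [{"x": 9}, {"x": 7}, {"x": 1}, {"x": 2}]
result = keep_observed_or_score(observed, rows, predict)
```

The first and third rows keep their observed values even though the hypothetical scoring rule would disagree with them, which mirrors the statement that the observed value is assumed to be known without error.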
specifies the variables to use for building the tree. You can add the target variable to the input list if you want to assign a format to the target variable by using the FORMATS= option. Any numeric variable that is not specified in the NOMINAL= option is binned according to the NBINS= specification.
specifies the minimum number of observations on each node. When the number of observations on a tree node falls below the specified leaf size m, the node is changed into a leaf during the building of the tree.
Interaction | Specifying the LEAFSIZE option affects the pruning of the tree. |
specifies the number of input variables used to build a tree. The k variables are selected at random from the list of input variables for each tree. If not specified, then k defaults to the square root of the number of input variables, rounded up to the nearest integer.
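The default for k and the per-tree random selection can be sketched in Python as follows. The function names and the variable names `x1` through `x10` are hypothetical, and the sampling shown here only illustrates the idea; it is not the server's random number stream.

```python
import math
import random


def default_k(n_inputs):
    """Square root of the number of inputs, rounded up to the nearest integer."""
    root = math.isqrt(n_inputs)  # exact integer floor of the square root
    return root if root * root == n_inputs else root + 1


def pick_variables(inputs, rng, k=None):
    """Select k input variables at random, without repeats, for one tree."""
    if k is None:
        k = default_k(len(inputs))
    return rng.sample(inputs, k)


inputs = [f"x{i}" for i in range(1, 11)]  # 10 hypothetical input variables
rng = random.Random(1)
chosen = pick_variables(inputs, rng)
```

With 10 input variables, the square root is about 3.16, so the default is k = 4.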
specifies the maximum number of children (branches) allowed for each level of the tree.
Default | 2 |
specifies the maximum number of tree levels.
Default | 6 |
specifies the number of bins used in the calculation of the tree. The number of bins affects the accuracy of the tree. The accuracy increases as the number of bins k increases, at the expense of computing time and memory consumption.
Default | 2 |
specifies the number of bins to use for a numeric target variable. The number of bins affects the accuracy of the tree. The accuracy increases as values of k increase. However, computing time and memory consumption also increase as values of k increase. When k is greater than zero, the numeric target variable is binned into equally sized bins first and then the bins are used to perform the classification.
Default | 0 |
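The binning of a numeric target can be sketched in Python. This sketch assumes that "equally sized" means bins of equal width over the observed range; the text does not say whether the server uses equal-width or equal-frequency bins, so the choice here is an assumption, as is the function name.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0 to k - 1.

    Assumption: bins have equal width over [min(values), max(values)].
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so it is clamped to bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


target = [0.0, 2.5, 5.0, 7.5, 10.0]
bins = equal_width_bins(target, 4)
```

After binning, the bin numbers play the role of class levels, so the numeric target can be handled as a classification problem.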
specifies that the out-of-bag error is not computed when building a random decision forest. This option is useful to speed up the building process.
specifies the numeric variables to use as nominal variables. Binning is not applied to the specified variables. The target variable is always treated as a nominal variable and does not need to be listed.
specifies to ignore observations that have missing values in the analysis variables when building a decision tree. When scoring a data set, any observations with missing values in the analysis variables for the decision tree are ignored when this option is specified.
prevents pre-parsing and pre-generating the program code that is referenced in the CODE= option. If you know the code is correct, you can specify this option to save resources. The code is always parsed by the server, but you might get more detailed error messages when the procedure parses the code rather than the server. The server assumes that the code is correct. If the code fails to compile, the server indicates that it could not parse the code, but not where the error occurred.
Alias | NOPREP |
specifies the number of trees to build for the random forest.
Default | 1 |
specifies to build the random decision forest using regression trees. Minimal cost-complexity pruning is applied to prune the trees.
saves the result table so that you can use it in other IMSTAT procedure statements like STORE, REPLAY, and FREE. The value for table-name must be unique within the scope of the procedure execution. The name of a table that has been freed with the FREE statement can be used again in subsequent SAVE= options.
specifies the in-memory table that contains the scoring data. The table must exist in-memory on the server. The RANDOMWOODS statement in the IMSTAT procedure does not transfer a local data set to the server.
specifies the random number seed for the random number generator in the server. The default value, zero, implies that the random number stream is based on the computer clock. Negative seed values also lead to random number streams that are based on the computer clock. If you want a reproducible random number sequence between runs, specify a value that is greater than zero.
Default | 0 |
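The seed semantics can be mimicked in a short Python sketch. It uses Python's own generator, not the server's, and the function name is hypothetical; it only illustrates that a positive seed gives a repeatable stream, while zero or a negative seed falls back to the clock.

```python
import random
import time


def make_generator(seed=0):
    """Return a generator that is reproducible only for a positive seed."""
    if seed > 0:
        return random.Random(seed)
    # Zero or negative: derive the stream from the computer clock.
    return random.Random(time.time_ns())


run_1 = [make_generator(42).random() for _ in range(3)]
run_2 = [make_generator(42).random() for _ in range(3)]
```

Here `run_1` and `run_2` are identical because both use the positive seed 42. Repeating the same calls with a seed of 0 would generally produce different values from run to run.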
specifies either a quoted string that contains the SAS expression that defines the temporary variables or a file reference to an external file with the SAS statements.
Alias | TE= |
specifies the list of temporary variables for the request. Each temporary variable must be defined through SAS statements that you supply with the TEMPEXPRESS= option.
Alias | TN= |
specifies to save the results of the RANDOMWOODS statement in a temporary table.
specifies the maximum number of seconds that the statement should run in the server. If the computation does not complete before the time-out is reached, execution stops and the server generates an error message. There is no default time-out.
specifies to display information about individual trees when you build a forest. For example, the table that is shown can display which variables are used in each tree. The option has no effect if you store the forest in a temporary table.
specifies the in-memory table that contains the information for the random forest if you want to score a table against the random forest.
Alias | LASRTREE= |
specifies the variables to transfer from the input table to the temporary table in the server that contains the results of scoring a decision tree. This option has no effect unless you specify the TEMPTABLE option and you score a decision tree.
specifies the weight for each corresponding event in the EVENT= option.
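The pairing of EVENT= and WEIGHT= values can be sketched in Python as a lookup from event level to weight. The function name and the sample levels are hypothetical, and the sketch assumes that observations whose target level is not listed keep a weight of 1, which is not stated here.

```python
def observation_weights(targets, events, weights):
    """Return one weight per observation based on its target level.

    events[i] is paired with weights[i].
    Assumption: levels that are not listed keep a weight of 1.0.
    """
    if len(events) != len(weights):
        raise ValueError("EVENT= and WEIGHT= must have the same number of values")
    lookup = dict(zip(events, weights))
    return [lookup.get(t, 1.0) for t in targets]


targets = ["fraud", "ok", "ok", "fraud", "ok"]
w = observation_weights(targets, events=["fraud"], weights=[10.0])
```

In a rare-event setting such as this one, the larger weight makes the few "fraud" observations count more when the trees are built.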