Example Process Flow Diagram
An empirical tree represents a segmentation of the data, created by applying a series of simple rules. Each rule assigns an observation to a segment based on the value of one input. Rules are applied one after another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree, and each segment is called a node. The original segment contains the entire data set and is called the root node of the tree. A node together with all of its successors forms a branch of the node that created it. The final nodes are called leaves. For each leaf, a decision is made and applied to all observations in the leaf. The type of decision depends on the context. In predictive modeling, as in this example, the decision is simply the predicted value.
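To make this concrete, here is a minimal sketch in Python of how such a tree scores a single observation. The inputs (checking, duration), cutoffs, and predicted values are invented for illustration; they are not the rules fitted in this example.

    # Each internal node applies one simple rule to one input;
    # each leaf returns the decision (here, a predicted value).
    def predict(obs):
        if obs["checking"] < 2:           # root-node rule on one input
            if obs["duration"] >= 24:     # second rule inside that branch
                return "bad"              # leaf: predicted value
            return "good"                 # leaf
        return "good"                     # leaf for the other branch

    print(predict({"checking": 1, "duration": 36}))  # -> bad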
Tree models readily accommodate nonlinear associations between the input variables and the target. They offer easy interpretability, and they handle missing values without using imputation.
Add a Tree node to the Diagram Workspace.
Connect the Transform Variables node to the Tree node.
Open the Tree node. For binary targets, the default splitting criterion is a chi-square test with a significance level of 0.200.
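As a rough illustration of this criterion (not the node's internal implementation), the sketch below applies a chi-square test of independence to the two-by-two table of a candidate binary split; the counts are invented.

    from scipy.stats import chi2_contingency

    # Counts of good and bad cases in the two branches of a candidate split.
    counts = [[220, 80],    # left branch:  good, bad
              [140, 160]]   # right branch: good, bad

    chi2, p_value, dof, _ = chi2_contingency(counts)
    if p_value <= 0.200:    # the node's default significance level
        print(f"split is significant (p = {p_value:.4f})")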
For brevity, use the default Basic tab settings of the Tree node to fit the model.
Select the Advanced tab. Because the Tree node recognizes that an active loss matrix has been defined, it automatically sets the model assessment measure to Average loss. The best tree is the one that minimizes the expected loss for the cases in the validation data set.
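A minimal sketch of what Average loss measures, assuming an illustrative loss matrix (the values below are placeholders, not the matrix defined earlier in this flow):

    # A loss matrix assigns a cost to every (decision, actual outcome) pair;
    # a negative loss is a profit. Values are assumptions for illustration.
    loss = {("accept", "good"): -0.50,
            ("accept", "bad"):   1.00,
            ("reject", "good"):  0.00,
            ("reject", "bad"):   0.00}

    def average_loss(decisions, actuals):
        # Mean loss per case; the best tree minimizes this on validation data.
        return sum(loss[d, a] for d, a in zip(decisions, actuals)) / len(actuals)

    print(average_loss(["accept", "accept", "reject"], ["good", "bad", "bad"]))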
Save the model using the Save tool icon in the Toolbox. The Save Model As window opens. By default, the model is saved as "Untitled." Type a model name and description in the respective entry fields and then click OK.
Train the node by using the Run icon in the Toolbox.
After the node finishes training, click Yes in the Message window to view the Tree node Results Browser. The All tab of the Tree Results Browser is displayed.
The table in the upper-left corner summarizes the overall classification process. Below this is another table that lists the training and validation expected loss values for increasing tree complexity. The plot at the bottom-right corner represents this same information graphically.
The tree that has only two leaves provides the smallest expected loss for the validation data. The average expected loss for the cases in the validation data set is about -12 cents (a 12-cent profit).
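That figure is a case-weighted average over the leaves. A hypothetical check, with invented leaf sizes and per-case expected losses chosen only to show the arithmetic:

    # (validation cases in leaf, expected loss per case in leaf)
    leaves = [(420, -0.35), (180, 0.42)]
    avg = sum(n * el for n, el in leaves) / sum(n for n, _ in leaves)
    print(round(avg, 2))  # -0.12, i.e., a 12-cent average profit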
The tree ring provides a quick overview of the tree complexity, split balance, and discriminatory power. The center of the ring corresponds to the root node of the tree: the farther from the center, the deeper the tree. A split in the ring corresponds to a split in the tree. The arc length of the regions within each ring corresponds to the sample sizes of the nodes. Darker colors represent node purity ("pure" nodes have minimal expected loss values). The Assessment Colors window displays the segment color hues that correspond to the expected loss values in the tree ring.
To view the more traditional tree diagram, click the View menu and select Tree.
The tree diagram contains the following items:
Root node -- the top node in the tree that contains all cases.
Internal nodes -- nonterminal nodes (also includes the root node) that contain the splitting rule.
Leaf nodes -- terminal nodes that contain the final classification for a set of observations.
You can use the scroll bars to display additional nodes. The expected loss values are used to recursively partition the data into homogeneous groups. The method is recursive because each subgroup results from splitting a subgroup from a previous split.
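A runnable toy version of this recursion, using one numeric input and a simple misclassification count in place of the node's actual loss-based criterion (the data are invented):

    # Try every cutoff at a node, keep the one with the fewest
    # misclassified cases, and recurse into the two subgroups.
    def misclassified(group):
        labels = [y for _, y in group]
        return len(labels) - max(labels.count(c) for c in set(labels)) if labels else 0

    def grow(rows, depth=0, max_depth=2):
        labels = [y for _, y in rows]
        cutoffs = sorted({x for x, _ in rows})[1:]      # candidate split points
        if depth == max_depth or len(set(labels)) == 1 or not cutoffs:
            return max(set(labels), key=labels.count)   # leaf: majority class
        best = min(cutoffs,
                   key=lambda c: misclassified([r for r in rows if r[0] < c])
                               + misclassified([r for r in rows if r[0] >= c]))
        return {"cutoff": best,
                "left":  grow([r for r in rows if r[0] < best], depth + 1),
                "right": grow([r for r in rows if r[0] >= best], depth + 1)}

    data = [(1, "bad"), (2, "bad"), (3, "good"), (5, "good"), (6, "bad"), (8, "good")]
    print(grow(data))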
The numeric labels directly above each node indicate the values at which the Tree node found significant splits. The character label in the center of each split is the name of the splitting variable. For the credit data, the only input variable that resulted in a split was CHECKING.
When you use the loss assessment criterion to build the tree, each node displays the following rows of statistics (the sketch after this list computes them for a single node):
The first row lists the percentage of good values in the training and validation data sets.
The second row lists the percentage of bad values in the training and validation data sets.
The third row lists the number of good values in the training and validation data sets.
The fourth row lists the number of bad values in the training and validation data sets.
The fifth row lists the total number of observations in the training and validation data sets.
The sixth row lists the decision alternative that is assigned to the node (accept or reject).
The seventh row lists the expected loss for the accept decision in the training and validation data sets.
The eighth row lists the expected loss for the reject decision in the training and validation data sets.
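As noted above the list, here is a small sketch that computes these statistics for a single node from its counts. The counts and the loss matrix are assumptions for illustration; the real flow uses the matrix defined earlier.

    good, bad = 180, 45                        # rows 3-4: counts in the node
    total = good + bad                         # row 5
    pct_good = 100 * good / total              # row 1
    pct_bad = 100 * bad / total                # row 2

    loss_accept = {"good": -0.50, "bad": 1.00}   # hypothetical loss matrix
    loss_reject = {"good": 0.00, "bad": 0.00}

    # Rows 7-8: expected loss per case under each decision alternative.
    el_accept = (good * loss_accept["good"] + bad * loss_accept["bad"]) / total
    el_reject = (good * loss_reject["good"] + bad * loss_reject["bad"]) / total

    decision = "accept" if el_accept < el_reject else "reject"   # row 6
    print(decision, round(el_accept, 2), round(el_reject, 2))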
Close the Tree Results Browser and then close the Tree node.