Subtree Statistics :: SAS/STAT(R) 13.1 User's Guide: High-Performance Procedures

Subtree Statistics

Subsections:

Decision Tree Per-Observation Methods
Decision Tree Per-Leaf Methods
Regression Tree Per-Observation Methods
Regression Tree Per-Leaf Methods

Statistics that are printed in the subtree tables are similar to the pruning statistics. There are two ways to calculate the subtree statistics: one is based on a scored data set (using the SCORE statement or the SAS DATA step score code that the CODE statement produces), and the other is based on the internal observation counts at each leaf of the tree. The two methods should provide identical results unless the target is missing.

Note: The per-observation and per-leaf methods of calculating the subtree statistics might not agree if the input data set contains observations that have a missing value for the target.

Decision Tree Per-Observation Methods

In scoring, whether you use the SCORE statement or you use the CODE statement with a SAS DATA step, each observation is assigned a posterior probability, $P_\tau$ , where $\tau$ is a target level. These posterior probabilities are then used to calculate the subtree statistics of the final tree.

For a leaf $\lambda$ , the posterior probability is the fraction of observations at that leaf that have the target level $\tau$ . That is, for leaf $\lambda$ ,

$\begin{equation*} P_\tau ^\lambda = \frac{N_\tau ^\lambda }{N_\lambda } \end{equation*}$

When a record is scored, it is assigned to a leaf, and all posterior probabilities for that leaf are assigned along with it. Thus, for observation $\omega$ assigned to leaf $\lambda$ , the posterior probability is

$\begin{equation*} P_\tau ^\omega = P_\tau ^\lambda = \frac{N_\tau ^\lambda }{N_\lambda } \end{equation*}$

The variable $N_0$ continues to indicate the total number of observations in the input data set, and $\omega$ is the observation number ( $\omega$ is used to prevent confusion with 0).

If a validation set is selected, the statistics are calculated separately for the validation set and for the training set. In addition, the per-observation validation posterior probabilities should be used. The validation posterior probabilities, $V_\tau ^\omega$ , are the same as the posterior probabilities from the training set, but they are the fraction of the observations from the validation set that are in each target level,

$\begin{equation*} V_\tau ^\omega = V_\tau ^\lambda = \frac{N_\tau ^\lambda }{N_\lambda } \end{equation*}$

where $N_\tau ^\lambda$ and $N_\lambda$ are now observation counts from the validation set. For calculating the statistics on the validation set, the same equations can be used but substituting V for P where appropriate (for example, $V_\tau ^\lambda$ for $P_\tau ^\lambda$ ).

Decision Tree Observationwise Entropy Statistic

The entropy at each observation is calculated from the posterior probabilities:

$\begin{equation*} \mathrm{Entropy} = -\sum _\omega {\frac{1}{N_0} \sum _\tau {P_\tau ^\omega \log _2\left(P_\tau ^\omega \right)}} \end{equation*}$

Decision Tree Observationwise Gini Statistic

Like the entropy, the Gini statistic is also calculated from the posterior probabilities:

$\begin{equation*} \mathrm{Gini} = \sum _\omega {\frac{1}{N_0} \sum _\tau {P_\tau ^\omega \left(1 - P_\tau ^\omega \right)}} \end{equation*}$

Decision Tree Observationwise Misclassification Rate

The misclassification rate is the average number of incorrectly predicted observations in the input data set. Predictions are always based on the training set. Thus, each scored record’s predicted target level $\tau _ P^\omega$ is compared against the actual level $\tau _\pi ^\omega$ :

$\begin{equation*} \mathrm{MISC} = \sum _\omega {\frac{1 - \delta _{\tau _ P^\omega \tau _\pi ^\omega }}{N_0}} \end{equation*}$

$\delta _{\tau _ P^\omega \tau _\pi ^\omega }$ is the Kronecker delta:

$\begin{equation*} \delta _{\tau _ P^\omega \tau _\pi ^\omega } = \left\{ \begin{array}{l l} 1 & \text {if $\tau _ p^\omega = \tau _\pi ^\omega $} \\ 0 & \text {otherwise} \end{array} \right. \end{equation*}$

Phrased slightly differently, the misclassification rate is the fraction of incorrectly predicted observations:

$\begin{equation*} \mathrm{MISC} = \frac{1}{N_0} \sum _\omega {\left\{ \begin{array}{l l} 0 & \text {if $\tau _ p^\omega = \tau _\pi ^\omega $} \\ 1 & \text {otherwise} \end{array} \right.} \end{equation*}$

Decision Tree Observationwise Sum of Squares Error

For the sum of squares error (SSE), $N_\tau$ predictions are made for every observation: that the correct posterior is 1 and that the incorrect posteriors is 0. Thus the SSE is as follows, where $\tau _\pi ^\omega$ is again the actual target level for observation $\omega$ :

$\begin{equation*} \mathrm{SSE} = \sum _\omega \left[ \sum _{\tau \ne \tau _\pi ^\omega } \left( P_\tau ^\omega \right)^2 + \left(1 - P_{\tau _\pi ^\omega }^\omega \right)^2 \right] \end{equation*}$

Decision Tree Observationwise Average Square Error

The average square error (ASE) is simply the SSE divided by the number of predictions (there are $N_\tau$ predictions per observation):

$\begin{equation*} \mathrm{ASE} = \frac{1}{N_\tau N_0} \sum _\omega \left[ \sum _{\tau \ne \tau _\pi ^\omega } \left( P_\tau ^\omega \right)^2 + \left(1 - P_{\tau _\pi ^\omega } \right)^2 \right] \end{equation*}$

Decision Tree Per-Leaf Methods

The subtree statistics that are calculated by PROC HPSPLIT are calculated per leaf. That is, instead of scanning through the entire data set, PROC HPSPLIT examines the proportions of observations at the leaves. Barring missing target values, which are not handled by the tree, the per-leaf and per-observation methods for calculating the subtree statistics are the same.

As with the per-observation method, observation counts N ( $N_\tau ^\lambda$ , $N_\lambda$ , and $N_0$ ) can come from either the training set or the validation set. The growth subtree table always produces statistics from the training set. The pruning subtree table produces both sets of data if they are both present.

Unless otherwise marked, counts N can come from either set.

Decision Tree Leafwise Entropy Statistic

Because there are $N_\lambda$ observations on the leaf $\lambda$ , entropy takes the following form:

$\begin{equation*} \mathrm{Entropy} = -\sum _\lambda {\frac{N_\lambda }{N_0} \sum _\tau {P_\tau ^\lambda \log _2\left(P_\tau ^\lambda \right)}} \end{equation*}$

Rephrased in terms of N, this becomes

$\begin{equation*} \mathrm{Entropy} = -\sum _\lambda {\frac{N_\lambda }{N_0} \sum _\tau {\frac{N_\tau ^\lambda }{N_\lambda } \log _2\left(\frac{N_\tau ^\lambda }{N_\lambda }\right)}} \end{equation*}$

Decision Tree Leafwise Gini Statistic

The Gini statistic is similar to entropy in its leafwise form:

$\begin{equation*} \mathrm{Gini} = \sum _\lambda {\frac{N_\lambda }{N_0} \sum _\tau {P_\tau ^\lambda \left(1 - P_\tau ^\lambda \right)}} \end{equation*}$

Rephrased in terms of N, this becomes

$\begin{equation*} \mathrm{Gini} = \sum _\lambda {\frac{N_\lambda }{N_0} \sum _\tau {\frac{N_\tau ^\lambda }{N_\lambda }\left(1 - \frac{N_\tau ^\lambda }{N_\lambda }\right)}} \end{equation*}$

Decision Tree Leafwise Misclassification Rate

Misclassification comes from the number of incorrectly predicted observations. Thus, it is necessary to count the proportion of observations at each leaf in each target level. Similar to the misprediction rate of an entire data set, the misprediction rate of a single leaf, is

$\begin{equation*} \mathrm{MISC}_\lambda = \frac{1}{N_\lambda } \sum _\omega {\left\{ \begin{array}{l l} 0 & \text {if $\tau _ p^\omega = \tau _\pi ^\omega $} \\ 1 & \text {otherwise} \end{array} \right.} \end{equation*}$

where the summation is over the observations that arrive at a leaf $\lambda$ .

All observations at a leaf are assigned the same prediction because they are all assigned the same leaf. Therefore, the summation reduces to simply the number of observations at leaf $\lambda$ that have a target level other than the predicted target level for that leaf, $\tau _\pi$ . Thus,

$\begin{align*} \mathrm{MISC}_\lambda & = \frac{N_\lambda - N_{\tau _\pi }^\lambda }{N_\lambda } \\ & = \sum _{\tau \ne \tau _\pi } \frac{N_\tau ^\lambda }{N_\lambda } \\ & = \sum _{\tau \ne \tau _\pi } P_\tau ^\lambda \end{align*}$

where $P_\tau ^\lambda$ is $V_\tau ^\lambda$ if the validation set is being examined.

Thus, for the entire data set, the misclassification rate is

$\begin{align*} \mathrm{MISC} & = \sum { \frac{N_\lambda }{N_0} \sum _{\tau \ne \tau _\pi } P_\tau ^\lambda } \\ & = \sum { \frac{N_\lambda }{N_0} \sum _{\tau \ne \tau _\pi } \frac{N_\tau ^\lambda }{N_\lambda }} \end{align*}$

where again $P_\tau ^\lambda$ is $V_\tau ^\lambda$ for the validation set.

Decision Tree Leafwise Sum of Squares Error

The sum of squares error (SSE) is treated similarly to the misclassification rate. Each observation is assigned per-target posterior probabilities $P_\tau ^\lambda$ from the training data set. These are the predictions for the purpose of the SSE.

The observations at leaf $\lambda$ are then grouped by the observations’ target levels. Because each observation in the group has the same actual target level, $\Phi$ , and because all observations on the same node are assigned the same posterior probabilities, $P_\tau ^\lambda$ , the per-observation SSE equation is identical:

$\begin{align*} \mathrm{SSE}_\Phi ^\lambda & = \sum _{\omega \in \Phi } \left[ \sum _{\tau \ne \tau _\pi ^\Phi } \left( P_{\tau }^\lambda \right)^2 + \left(1 - P_{\tau _\pi ^\Phi }^\lambda \right)^2 \right] \\ & = N_\Phi ^\lambda \left[ \sum _{\tau \ne \tau _\pi ^\Phi } \left( P_{\tau }^\lambda \right)^2 + \left(1 - P_{\tau _\pi ^\Phi }^\lambda \right)^2 \right] \end{align*}$

Here, the posterior probabilities $P_\tau ^\lambda$ are from the training set, and the counts $N_\tau ^\lambda$ are from whichever data set is being examined.

Thus, the SSE equation for the leaf can be rephrased in terms of a further summation over the target levels $\Phi$ :

$\begin{equation*} \mathrm{SSE}_\lambda = \sum _\Phi N_\Phi ^\lambda \left[ \sum _{\tau \ne \tau _\pi ^\Phi } \left( P_{\tau }^\lambda \right)^2 + \left(1 - P_{\tau _\pi ^\Phi }^\lambda \right)^2 \right] \end{equation*}$

So the SSE for the entire tree is then

$\begin{equation*} \mathrm{SSE} = \sum _\lambda \sum _\Phi N_\Phi ^\lambda \left[ \sum _{\tau \ne \tau _\pi ^\Phi } \left( P_{\tau }^\lambda \right)^2 + \left(1 - P_{\tau _\pi ^\Phi }^\lambda \right)^2 \right] \end{equation*}$

Substituting the counts from the training set back in and using $\nu$ to denote training set counts, this becomes

$\begin{align*} \mathrm{SSE} & = \sum _\lambda \sum _\Phi N_\Phi ^\lambda \left[ \sum _{\tau \ne \tau _\pi ^\Phi } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 + \left(1 - \frac{\nu _{\tau _\pi ^\Phi }^\lambda }{\nu _\lambda } \right)^2 \right] \\ & = \sum _\lambda \sum _\Phi N_\Phi ^\lambda \left[ \sum _{\tau \ne \tau _\pi ^\Phi } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 + 1 -2 \frac{\nu _{\tau _\pi ^\Phi }^\lambda }{\nu _\lambda } + \left( \frac{\nu _{\tau _\pi ^\Phi }^\lambda }{\nu _\lambda } \right)^2 \right] \\ & = \sum _\lambda \sum _\Phi N_\Phi ^\lambda \left[ \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 + 1 -2 \frac{\nu _{\tau _\pi ^\Phi }^\lambda }{\nu _\lambda } + \right] \\ & = \sum _\lambda N_\lambda \sum _\Phi \frac{N_\Phi ^\lambda }{N_\lambda } \left[ \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 + 1 -2 \frac{\nu _{\tau _\pi ^\Phi }^\lambda }{\nu _\lambda } \right] \\ & = \sum _\lambda N_\lambda \left[ 1 + \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 -2 \sum _\Phi \frac{N_\Phi ^\lambda }{N_\lambda } \frac{\nu _{\tau _\pi ^\Phi }^\lambda }{\nu _\lambda } \right] \end{align*}$

In the rightmost inner summation, $\tau _\pi ^\Phi$ is simply $\Phi$ , the target level being summed over. This gives the final equivalent forms

$\begin{align*} \mathrm{SSE} & = \sum _\lambda N_\lambda \left[ 1 + \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 -2 \sum _\Phi \frac{N_\Phi ^\lambda }{N_\lambda } \frac{\nu _{\Phi }^\lambda }{\nu _\lambda } \right] \\ \mathrm{SSE} & = \sum _\lambda N_\lambda \left[ 1 + \sum _{\tau } \left( P_\tau ^\lambda \right)^2 -2 \sum _\Phi V_\Phi ^\lambda P_\Phi ^\lambda \right] \end{align*}$

where $\nu$ and P are again counts and fraction, respectively, from the training set, and N and V are counts and fraction, respectively, from the validation set. (For example, $N_\Phi ^\lambda$ is the number of observations on leaf $\lambda$ that have target $\Phi$ .)

If there is no validation set, the training set is used instead, and the equations simplify to the following (because $\Phi$ is merely an index over target levels and can be renamed $\tau$ ):

$\begin{align*} \mathrm{SSE} & = \sum _\lambda N_\lambda \left[ 1 - \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2\right] \\ \mathrm{SSE} & = \sum _\lambda N_\lambda \left[ 1 - \sum _{\tau } \left( P_\tau ^\lambda \right)^2\right] \end{align*}$

Decision Tree Leafwise Average Square Error

Because the average square error (ASE) is simply the SSE divided by the number of predictions (there are $N_\tau$ , the number of target levels, predictions per observation), this becomes

$\begin{align*} \mathrm{ASE} & = \sum _\lambda \frac{N_\lambda }{N_\tau N_0} \left[ 1 + \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2 -2 \sum _\Phi \frac{N_\Phi ^\lambda }{N_\lambda } \frac{\nu _{\Phi }^\lambda }{\nu _\lambda } \right] \\ \mathrm{ASE} & = \sum _\lambda \frac{N_\lambda }{N_\tau N_0} \left[ 1 + \sum _{\tau } \left( P_\tau ^\lambda \right)^2 -2 \sum _\Phi V_\Phi ^\lambda P_\Phi ^\lambda \right] \end{align*}$

Or, if only the training set is used:

$\begin{align*} \mathrm{SSE} & = \sum _\lambda \frac{N_\lambda }{N_\tau N_0} \left[ 1 - \sum _{\tau } \left( \frac{\nu _{\tau }^\lambda }{\nu _\lambda } \right)^2\right] \\ \mathrm{SSE} & = \sum _\lambda \frac{N_\lambda }{N_\tau N_0} \left[ 1 - \sum _{\tau } \left( P_\tau ^\lambda \right)^2\right] \end{align*}$

Regression Tree Per-Observation Methods

Unlike a decision tree, regression trees model a continuous target. The predictions produced by a decision tree are based on the average of the partitioned space (the observations at the leaves).

As with decision trees, predictions come from the training set. The predicted value at leaf $\lambda$ is

$\begin{equation*} \hat y_\lambda ^ T = \frac{1}{N_\lambda }\sum _{i \in \lambda } y_ i^ T \end{equation*}$

where the summation is over the observations i within leaf $\lambda$ , and $y_ i^ T$ is the value of the target variable at observation i within the training set.

Regression Tree Observationwise Sum of Squares Error

The observationwise sum of squares error (SSE) is the sum of squares of the difference between the observation’s value and the value that is predicted for that observation, $\hat y_\omega ^ T$ , which is equal to $\hat y_\lambda$ where $\lambda$ is the leaf to which that observation has been assigned, as described in the equation in the preceding section. The SSE is then simply

$\begin{align*} \mathrm{SSE} & = \sum _\omega \left( y_\omega - \hat y_\omega ^ T \right)^2 \end{align*}$

Regression Tree Observationwise Average Square Error

The observationwise average square error (ASE) is simply the average of the observationwise square differences and is identical to the per-observation ASE:

$\begin{align*} \mathrm{ASE} & = \frac{1}{N_0} \sum _\omega \left( y_\omega - \hat y_\omega ^ T \right)^2 \end{align*}$

Regression Tree Per-Leaf Methods

Regression Tree Leafwise Sum of Squares Error

The leafwise sum of squares error (SSE) is related to the variance within that leaf. The training set SSE is identical to the sum of the variances within the leaves:

$\begin{align*} \mathrm{SSE} & = \sum _{\lambda }^\Lambda \sum _{i \in \lambda } \left( y_ i - \hat y_\lambda ^ T \right)^2 \end{align*}$

Regression Tree Leafwise Average Square Error

The average square error (ASE) is simply the sum of weighted per-leaf average SSE’s. That is, the ASE is identical to the per-observation ASE:

$\begin{align*} \mathrm{ASE} & = \sum _{\lambda }^\Lambda \frac{N_\lambda }{N_0} \frac{1}{N_\lambda } \sum _{i \in \lambda } \left( y_ i - \hat y_\lambda ^ T \right)^2 \\ \mathrm{ASE} & = \frac{1}{N_0} \sum _{\lambda }^\Lambda \sum _{i \in \lambda } \left( y_ i - \hat y_\lambda ^ T \right)^2 \end{align*}$

The HPSPLIT Procedure