Statistics that are printed in the subtree tables are similar to the pruning statistics. There are two ways to calculate the subtree statistics: one is based on a scored data set (using the SCORE statement or the SAS DATA step score code that the CODE statement produces), and the other is based on the internal observation counts at each leaf of the tree. The two methods provide identical results unless the target value is missing for some observations.
Note: The per-observation and per-leaf methods of calculating the subtree statistics might not agree if the input data set contains observations that have a missing value for the target.
In scoring, whether you use the SCORE statement or you use the CODE statement with a SAS DATA step, each observation is assigned a posterior probability, $P_\tau^\omega$, where $\tau$ is a target level. These posterior probabilities are then used to calculate the subtree statistics of the final tree.
            
For a leaf $\lambda$, the posterior probability is the fraction of observations at that leaf that have the target level $\tau$. That is, for that leaf,

$$P_\tau^\lambda = \frac{N_\tau^\lambda}{N^\lambda}$$

where $N_\tau^\lambda$ is the number of observations at leaf $\lambda$ that have target level $\tau$ and $N^\lambda$ is the total number of observations at leaf $\lambda$.
When a record is scored, it is assigned to a leaf, and all posterior probabilities for that leaf are assigned along with it. Thus, for observation $\omega$ assigned to leaf $\lambda$, the posterior probability is

$$P_\tau^\omega = P_\tau^\lambda$$
The variable $N$ continues to indicate the total number of observations in the input data set, and $\omega$ is the observation number ($\omega$ is used instead of $o$ to prevent confusion with 0).
            
If a validation set is selected, the per-observation statistics are calculated separately for the training set and the validation set. In addition, the per-observation validation posterior probabilities should be used. The validation posterior probabilities, $V_\tau^\lambda$, are defined in the same way as the posterior probabilities from the training set, but they are the fraction of observations from the validation set that are in each target level,

$$V_\tau^\lambda = \frac{N_\tau^\lambda}{N^\lambda}$$

where $N_\tau^\lambda$ and $N^\lambda$ are now observation counts from the validation set. For calculating the statistics on the validation set, the same equations can be used by substituting $V$ for $P$ where appropriate (for example, $V_\tau^\omega$ for $P_\tau^\omega$).
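As a concrete illustration of these definitions (a minimal Python sketch, not code from PROC HPSPLIT), the following fragment computes training and validation posteriors for a single leaf from invented observation counts; every record that is scored into this leaf would receive the training posteriors `P`:

```python
# Hypothetical per-leaf counts of observations by target level.
# The numbers are illustrative only; they do not come from a real tree.
train_counts = {"A": 30, "B": 10, "C": 10}   # N_tau^lambda from the training set
valid_counts = {"A": 12, "B": 6,  "C": 2}    # N_tau^lambda from the validation set

def posteriors(counts):
    """Return the fraction of observations in each target level: N_tau / N."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

P = posteriors(train_counts)   # training posteriors P_tau^lambda
V = posteriors(valid_counts)   # validation posteriors V_tau^lambda

print(P)   # {'A': 0.6, 'B': 0.2, 'C': 0.2}
print(V)   # {'A': 0.6, 'B': 0.3, 'C': 0.1}
```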
The entropy at each observation is calculated from the posterior probabilities and averaged over the input data set:

$$\mathrm{Entropy} = \frac{1}{N}\sum_{\omega=1}^{N}\left(-\sum_{\tau}P_\tau^\omega\log_2 P_\tau^\omega\right)$$

Like the entropy, the Gini statistic is also calculated from the posterior probabilities:

$$\mathrm{Gini} = \frac{1}{N}\sum_{\omega=1}^{N}\left(1-\sum_{\tau}\left(P_\tau^\omega\right)^2\right)$$
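The following Python sketch, again purely illustrative, evaluates these two averages over a hypothetical scored data set in which each record carries the posteriors of its leaf:

```python
import math

# Hypothetical scored observations: each record holds the posterior
# probabilities P_tau^omega that its leaf assigns to the target levels.
scored = [
    {"A": 0.6, "B": 0.2, "C": 0.2},
    {"A": 0.6, "B": 0.2, "C": 0.2},
    {"A": 0.1, "B": 0.8, "C": 0.1},
]

N = len(scored)

# Entropy: average over observations of -sum_tau P log2 P.
entropy = sum(-sum(p * math.log2(p) for p in post.values() if p > 0)
              for post in scored) / N

# Gini: average over observations of 1 - sum_tau P^2.
gini = sum(1.0 - sum(p * p for p in post.values()) for post in scored) / N

print(entropy, gini)
```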
The misclassification rate is the average number of incorrectly predicted observations in the input data set. Predictions are always based on the training set. Thus, each scored record's predicted target level $\hat{\tau}^\omega$ is compared against the actual level $\tau_\pi^\omega$:

$$\mathrm{Misc} = 1 - \frac{1}{N}\sum_{\omega=1}^{N}\delta_{\hat{\tau}^\omega \tau_\pi^\omega}$$

where $\delta$ is the Kronecker delta:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$$
Or, phrased slightly differently, the misclassification rate is the fraction of incorrectly predicted observations:

$$\mathrm{Misc} = \frac{1}{N}\sum_{\omega=1}^{N}\left(1 - \delta_{\hat{\tau}^\omega \tau_\pi^\omega}\right)$$
For the sum of squares error (SSE), one prediction is made for every target level of every observation: the correct posterior is predicted to be 1, and the incorrect posteriors are predicted to be 0. Thus, the SSE is as follows, with $\tau_\pi^\omega$ once again being the actual target level for observation $\omega$:

$$\mathrm{SSE} = \sum_{\omega=1}^{N}\left[\sum_{\tau\ne\tau_\pi^\omega}\left(P_\tau^\omega\right)^2 + \left(1 - P_{\tau_\pi^\omega}^\omega\right)^2\right]$$
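The sketch below evaluates the misclassification rate and the SSE for a hypothetical scored data set. For this illustration only, the predicted level of a record is taken to be the level that has the largest posterior; that is an assumption of the example, not a statement about the procedure's scoring code:

```python
# Hypothetical scored records: training posteriors plus the actual target level.
scored = [
    {"post": {"A": 0.6, "B": 0.2, "C": 0.2}, "actual": "A"},
    {"post": {"A": 0.6, "B": 0.2, "C": 0.2}, "actual": "B"},
    {"post": {"A": 0.1, "B": 0.8, "C": 0.1}, "actual": "B"},
]

N = len(scored)

# Misclassification rate: fraction of observations whose actual level differs
# from the predicted level (assumed here to be the level with the largest posterior).
misc = sum(max(r["post"], key=r["post"].get) != r["actual"] for r in scored) / N

# SSE: the posterior of the actual level is compared with 1, and every other
# posterior is compared with 0.
sse = sum(
    sum((p - (1.0 if level == r["actual"] else 0.0)) ** 2
        for level, p in r["post"].items())
    for r in scored
)

print(misc, sse)
```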
The subtree statistics that are calculated by PROC HPSPLIT are calculated per leaf. That is, instead of scanning through the entire data set, the proportions of observations are examined at the leaves. Barring missing target values, which are not handled by the tree, the per-leaf and per-observation methods for calculating the subtree statistics are the same.
As with the per-observation method, the observation counts ($N$, $N^\lambda$, and $N_\tau^\lambda$) can come from either the training set or the validation set. The growth subtree table always reports statistics from the training set. The pruning subtree table reports statistics for both data sets if both are present.

Unless otherwise marked, the counts $N$ can come from either set.
Because there are $N^\lambda$ observations on the leaf $\lambda$, the entropy takes the following leafwise form:

$$\mathrm{Entropy} = \frac{1}{N}\sum_{\lambda}N^\lambda\left(-\sum_{\tau}P_\tau^\lambda\log_2 P_\tau^\lambda\right)$$

Rephrased in terms of the counts $N$, this becomes

$$\mathrm{Entropy} = -\frac{1}{N}\sum_{\lambda}\sum_{\tau}N_\tau^\lambda\log_2\left(\frac{N_\tau^\lambda}{N^\lambda}\right)$$

The Gini statistic is similar to the entropy in its leafwise form:

$$\mathrm{Gini} = \frac{1}{N}\sum_{\lambda}N^\lambda\left(1-\sum_{\tau}\left(P_\tau^\lambda\right)^2\right)$$

Rephrased in terms of the counts $N$, this becomes

$$\mathrm{Gini} = \frac{1}{N}\sum_{\lambda}N^\lambda\left(1-\sum_{\tau}\left(\frac{N_\tau^\lambda}{N^\lambda}\right)^2\right)$$
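A minimal Python sketch of the leafwise calculation is shown below; the per-leaf counts are invented for the illustration:

```python
import math

# Hypothetical per-leaf counts N_tau^lambda by target level.
leaf_counts = [
    {"A": 30, "B": 10, "C": 10},   # leaf 1
    {"A": 5,  "B": 40, "C": 5},    # leaf 2
]

N = sum(sum(c.values()) for c in leaf_counts)   # total observation count

entropy = 0.0
gini = 0.0
for counts in leaf_counts:
    n_leaf = sum(counts.values())                      # N^lambda
    p = {t: c / n_leaf for t, c in counts.items()}     # P_tau^lambda
    entropy += n_leaf * -sum(v * math.log2(v) for v in p.values() if v > 0)
    gini += n_leaf * (1.0 - sum(v * v for v in p.values()))

entropy /= N
gini /= N
print(entropy, gini)
```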
Misclassification comes from the number of incorrectly predicted observations. Thus, it is necessary to count the proportion of observations at each leaf in each target level. The misclassification rate of a single leaf, $\mathrm{Misc}_\lambda$, similar to the misclassification rate of an entire data set, is

$$\mathrm{Misc}_\lambda = \frac{1}{N^\lambda}\sum_{\omega\in\lambda}\left(1 - \delta_{\hat{\tau}^\omega \tau_\pi^\omega}\right)$$

where the summation is over the observations that arrive at leaf $\lambda$.
All observations at a leaf are assigned the same prediction because they are all assigned to the same leaf. Therefore, the summation reduces to simply the number of observations at leaf $\lambda$ that have a target level other than the predicted target level for that leaf, $\hat{\tau}^\lambda$. Thus,

$$\mathrm{Misc}_\lambda = \frac{N^\lambda - N_{\hat{\tau}^\lambda}^\lambda}{N^\lambda} = 1 - \frac{N_{\hat{\tau}^\lambda}^\lambda}{N^\lambda} = 1 - P_{\hat{\tau}^\lambda}^\lambda$$

where $P_{\hat{\tau}^\lambda}^\lambda$ is $V_{\hat{\tau}^\lambda}^\lambda$ if the validation set is being examined.
               
Thus, for the entire data set, the misclassification rate is

$$\mathrm{Misc} = \frac{1}{N}\sum_{\lambda}N^\lambda\,\mathrm{Misc}_\lambda = \frac{1}{N}\sum_{\lambda}N^\lambda\left(1 - P_{\hat{\tau}^\lambda}^\lambda\right)$$

where again $P_{\hat{\tau}^\lambda}^\lambda$ is $V_{\hat{\tau}^\lambda}^\lambda$ for the validation set.
               
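The following sketch applies the leafwise misclassification formula to invented counts and training posteriors. As in the earlier example, the predicted level of each leaf is assumed, for illustration, to be the level that has the largest training posterior:

```python
# Hypothetical per-leaf counts (from whichever data set is being examined)
# and training posteriors P_tau^lambda for the same two leaves.
leaf_counts = [
    {"A": 30, "B": 10, "C": 10},
    {"A": 5,  "B": 40, "C": 5},
]
train_posteriors = [
    {"A": 0.6, "B": 0.2, "C": 0.2},
    {"A": 0.1, "B": 0.8, "C": 0.1},
]

N = sum(sum(c.values()) for c in leaf_counts)

misc = 0.0
for counts, post in zip(leaf_counts, train_posteriors):
    n_leaf = sum(counts.values())                 # N^lambda
    predicted = max(post, key=post.get)           # assumed predicted level of the leaf
    frac_correct = counts[predicted] / n_leaf     # P (or V) at the predicted level
    misc += n_leaf * (1.0 - frac_correct)

misc /= N
print(misc)
```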
The sum of squares error (SSE) is treated similarly to the misclassification rate. Each observation is assigned per-target posterior probabilities $P_\tau^\lambda$ from the training data set. These are the predictions for the purpose of the SSE.

The observations at leaf $\lambda$ are then grouped by the observations' actual target levels; the group whose target level is $\Phi$ contains $N_\Phi^\lambda$ observations. Because each observation in a group has the same actual target level, $\tau_\pi^\Phi$, and because all observations on the same node are assigned the same posterior probabilities, $P_\tau^\lambda$, the per-observation SSE is identical for every observation in the group:
$$\mathrm{SSE}_\omega = \sum_{\tau\ne\tau_\pi^\Phi}\left(P_\tau^\lambda\right)^2 + \left(1 - P_{\tau_\pi^\Phi}^\lambda\right)^2 \qquad \text{for each observation } \omega \text{ in group } \Phi$$
Here, the posterior probabilities $P_\tau^\lambda$ are from the training set, and the counts themselves ($N_\Phi^\lambda$ and $N^\lambda$) are from whichever data set is being examined.

Thus, the SSE equation for the leaf can be rephrased in terms of a further summation over the target levels $\Phi$:
$$\mathrm{SSE}_\lambda = \sum_{\Phi}N_\Phi^\lambda\left[\sum_{\tau\ne\tau_\pi^\Phi}\left(P_\tau^\lambda\right)^2 + \left(1 - P_{\tau_\pi^\Phi}^\lambda\right)^2\right]$$
So the SSE for the entire tree is then
$$\mathrm{SSE} = \sum_{\lambda}\sum_{\Phi}N_\Phi^\lambda\left[\sum_{\tau\ne\tau_\pi^\Phi}\left(P_\tau^\lambda\right)^2 + \left(1 - P_{\tau_\pi^\Phi}^\lambda\right)^2\right]$$
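To illustrate why the grouped form agrees with the per-observation form, the following sketch (with invented numbers) computes the SSE of a single leaf both ways and checks that the results match:

```python
# One hypothetical leaf: training posteriors and counts of observations,
# grouped by actual target level, from the data set being examined.
posteriors = {"A": 0.6, "B": 0.2, "C": 0.2}      # P_tau^lambda
group_counts = {"A": 12, "B": 6, "C": 2}         # N_Phi^lambda

# Grouped (per-leaf) form: sum over target-level groups Phi.
sse_leaf = sum(
    n * sum((p - (1.0 if tau == phi else 0.0)) ** 2
            for tau, p in posteriors.items())
    for phi, n in group_counts.items()
)

# Per-observation form: expand each group into individual observations.
observations = [phi for phi, n in group_counts.items() for _ in range(n)]
sse_obs = sum(
    sum((p - (1.0 if tau == actual else 0.0)) ** 2
        for tau, p in posteriors.items())
    for actual in observations
)

assert abs(sse_leaf - sse_obs) < 1e-9
print(sse_leaf)
```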
Substituting the counts from the training set back in and using $\hat{N}$ to denote training-set counts, this becomes

$$\begin{aligned}
\mathrm{SSE} &= \sum_{\lambda}\sum_{\Phi}N_\Phi^\lambda\left[\sum_{\tau\ne\tau_\pi^\Phi}\left(\frac{\hat{N}_\tau^\lambda}{\hat{N}^\lambda}\right)^2 + \left(1 - \frac{\hat{N}_{\tau_\pi^\Phi}^\lambda}{\hat{N}^\lambda}\right)^2\right] \\
&= \sum_{\lambda}\sum_{\Phi}N_\Phi^\lambda\left[\sum_{\tau\ne\tau_\pi^\Phi}\left(\frac{\hat{N}_\tau^\lambda}{\hat{N}^\lambda}\right)^2 + 1 - 2\,\frac{\hat{N}_{\tau_\pi^\Phi}^\lambda}{\hat{N}^\lambda} + \left(\frac{\hat{N}_{\tau_\pi^\Phi}^\lambda}{\hat{N}^\lambda}\right)^2\right] \\
&= \sum_{\lambda}\sum_{\Phi}N_\Phi^\lambda\left[1 + \sum_{\tau}\left(\frac{\hat{N}_\tau^\lambda}{\hat{N}^\lambda}\right)^2 - 2\,\frac{\hat{N}_{\tau_\pi^\Phi}^\lambda}{\hat{N}^\lambda}\right] \\
&= \sum_{\lambda}\left[N^\lambda + N^\lambda\sum_{\tau}\left(\frac{\hat{N}_\tau^\lambda}{\hat{N}^\lambda}\right)^2 - 2\sum_{\Phi}N_\Phi^\lambda\,\frac{\hat{N}_{\tau_\pi^\Phi}^\lambda}{\hat{N}^\lambda}\right] \\
&= \sum_{\lambda}\left[N^\lambda + N^\lambda\sum_{\tau}\left(P_\tau^\lambda\right)^2 - 2\sum_{\Phi}N_\Phi^\lambda\,P_{\tau_\pi^\Phi}^\lambda\right]
\end{aligned}$$

Now, in that rightmost inner summation, $\tau_\pi^\Phi$ is simply $\Phi$, the target level being summed over. This gives the final equivalent forms
               
$$\mathrm{SSE} = \sum_{\lambda}\left[N^\lambda + N^\lambda\sum_{\tau}\left(P_\tau^\lambda\right)^2 - 2\sum_{\Phi}N_\Phi^\lambda\,P_\Phi^\lambda\right]$$

$$\mathrm{SSE} = \sum_{\lambda}N^\lambda\left[1 + \sum_{\tau}\left(P_\tau^\lambda\right)^2 - 2\sum_{\Phi}V_\Phi^\lambda\,P_\Phi^\lambda\right]$$

where $\hat{N}$ and $P$ are again counts and fractions, respectively, from the training set, and $N$ and $V$ are counts and fractions, respectively, from the validation set. (For example, $N_\tau^\lambda$ is the number of observations on leaf $\lambda$ that have target level $\tau$.)
               
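The sketch below evaluates the second of these forms for a single leaf, using invented training posteriors and validation counts; it illustrates the formula, not the procedure's internal code:

```python
# Hypothetical training posteriors and validation counts for one leaf.
P = {"A": 0.6, "B": 0.2, "C": 0.2}        # training fractions P_tau^lambda
valid_counts = {"A": 12, "B": 6, "C": 2}  # validation counts N_tau^lambda

n_leaf = sum(valid_counts.values())                     # N^lambda
V = {t: c / n_leaf for t, c in valid_counts.items()}    # validation fractions V_tau^lambda

# Leaf's contribution to the SSE in the final form:
# N^lambda * (1 + sum_tau P^2 - 2 * sum_Phi V_Phi * P_Phi)
sse_leaf = n_leaf * (1.0
                     + sum(p * p for p in P.values())
                     - 2.0 * sum(V[t] * P[t] for t in P))
print(sse_leaf)   # matches the grouped per-leaf SSE for the same counts
```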
If there is no validation set, the training set is used instead, and the equations simplify to the following (because $\Phi$ is merely an index over target levels and can be renamed $\tau$):

$$\mathrm{SSE} = \sum_{\lambda}\hat{N}^\lambda\left[1 + \sum_{\tau}\left(P_\tau^\lambda\right)^2 - 2\sum_{\tau}\left(P_\tau^\lambda\right)^2\right]$$

$$\mathrm{SSE} = \sum_{\lambda}\hat{N}^\lambda\left[1 - \sum_{\tau}\left(P_\tau^\lambda\right)^2\right]$$
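As a final illustrative check (again with invented counts), the simplified expression equals the per-observation SSE computed directly on the training observations of a leaf:

```python
# Hypothetical training counts for one leaf.
train_counts = {"A": 30, "B": 10, "C": 10}

n_leaf = sum(train_counts.values())                     # N^lambda
P = {t: c / n_leaf for t, c in train_counts.items()}    # P_tau^lambda

# Simplified form: N^lambda * (1 - sum_tau P^2), that is,
# N^lambda times the leaf's Gini index.
sse_simplified = n_leaf * (1.0 - sum(p * p for p in P.values()))

# Direct per-observation SSE over the training observations of this leaf.
sse_direct = sum(
    n * sum((p - (1.0 if tau == actual else 0.0)) ** 2 for tau, p in P.items())
    for actual, n in train_counts.items()
)

assert abs(sse_simplified - sse_direct) < 1e-9
print(sse_simplified)
```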