Statistics that are printed in the subtree tables are similar to the pruning statistics. There are two ways to calculate the subtree statistics: one is based on a scored data set (using the SCORE statement or the SAS DATA step score code that the CODE statement produces), and the other is based on the internal observation counts at each leaf of the tree. The two methods should provide identical results unless the target variable has missing values.
Note: The per-observation and per-leaf methods of calculating the subtree statistics might not agree if the input data set contains observations that have a missing value for the target.
In scoring, whether you use the SCORE statement or you use the CODE statement with a SAS DATA step, each observation is assigned a posterior probability, $\hat{p}_\omega(\tau)$, where $\tau$ is a target level. These posterior probabilities are then used to calculate the subtree statistics of the final tree.
For a leaf $\lambda$, the posterior probability is the fraction of observations at that leaf that have the target level $\tau$. That is, for that leaf,

$$\hat{p}_\lambda(\tau) = \frac{N_{\tau\lambda}}{N_\lambda}$$

where $N_{\tau\lambda}$ is the number of observations at leaf $\lambda$ that have target level $\tau$, and $N_\lambda$ is the total number of observations at leaf $\lambda$.
When a record is scored, it is assigned to a leaf, and all posterior probabilities for that leaf are assigned along with it. Thus, for observation $\omega$ assigned to leaf $\lambda$, the posterior probability is

$$\hat{p}_\omega(\tau) = \hat{p}_\lambda(\tau) = \frac{N_{\tau\lambda}}{N_\lambda}$$
The variable $N$ continues to indicate the total number of observations in the input data set, and $\omega$ is the observation number ($\omega$ is used, rather than $o$, to prevent confusion with 0).
If a validation set is selected, the per-observation statistics are calculated separately for each data set. In addition, the per-observation validation posterior probabilities should be used. The validation posterior probabilities, $\hat{q}_\omega(\tau)$, are assigned in the same way as the posterior probabilities from the training set, but they are the fraction of observations from the validation set that are in each target level,

$$\hat{q}_\omega(\tau) = \frac{V_{\tau\lambda}}{V_\lambda}$$
where $V_{\tau\lambda}$ and $V_\lambda$ are now observation counts from the validation set. For calculating the statistics on the validation set, the same equations can be used by substituting the validation quantities where appropriate (for example, $\hat{q}_\omega(\tau)$ for $\hat{p}_\omega(\tau)$).
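To make these fractions concrete, the following Python sketch computes training and validation posteriors for a hypothetical two-leaf tree. This is purely illustrative (the counts and the helper `posteriors` are invented here); it is not how PROC HPSPLIT is implemented.

```python
# Hypothetical leaf-level counts for a two-leaf tree with target levels
# "yes" and "no". N[leaf][tau] = training counts; V[leaf][tau] = validation.
N = {0: {"yes": 30, "no": 10}, 1: {"yes": 6, "no": 54}}
V = {0: {"yes": 12, "no": 8}, 1: {"yes": 4, "no": 16}}

def posteriors(counts):
    """Fraction of observations at each leaf that have each target level."""
    return {leaf: {tau: n / sum(by_tau.values()) for tau, n in by_tau.items()}
            for leaf, by_tau in counts.items()}

p_hat = posteriors(N)  # training posteriors p_hat[leaf][tau]
q_hat = posteriors(V)  # validation posteriors q_hat[leaf][tau]
print(p_hat[0], q_hat[0])  # {'yes': 0.75, 'no': 0.25} {'yes': 0.6, 'no': 0.4}
```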
The entropy at each observation is calculated from the posterior probabilities and then averaged over all $N$ observations:

$$\text{Entropy} = -\frac{1}{N}\sum_{\omega}\sum_{\tau}\hat{p}_\omega(\tau)\,\log_2\!\bigl(\hat{p}_\omega(\tau)\bigr)$$
Like the entropy, the Gini statistic is also calculated from the posterior probabilities and averaged over all observations:

$$\text{Gini} = \frac{1}{N}\sum_{\omega}\left(1 - \sum_{\tau}\hat{p}_\omega(\tau)^2\right)$$
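The sketch below evaluates both averages on a small hypothetical scored data set in which every observation carries its leaf's training posteriors (continuing the invented counts from the earlier sketch; illustrative only, not PROC HPSPLIT's implementation):

```python
import math

# Each observation is a (leaf, actual target) pair; all observations on a
# leaf share that leaf's training posteriors.
p_hat = {0: {"yes": 0.75, "no": 0.25}, 1: {"yes": 0.1, "no": 0.9}}
obs = ([(0, "yes")] * 30 + [(0, "no")] * 10 +
       [(1, "yes")] * 6 + [(1, "no")] * 54)
N = len(obs)  # 100 observations in total

entropy = -sum(p * math.log2(p)
               for leaf, _ in obs
               for p in p_hat[leaf].values() if p > 0) / N
gini = sum(1 - sum(p * p for p in p_hat[leaf].values())
           for leaf, _ in obs) / N
print(round(entropy, 5), round(gini, 3))  # 0.60591 0.258
```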
The misclassification rate is the per-observation average of incorrect predictions in the input data set. Predictions are always based on the training set. Thus, each scored record's predicted target level $\hat{\tau}_\omega$ is compared against the actual level $\tau_\omega$:

$$\text{Misc} = \frac{1}{N}\sum_{\omega}\left(1 - \delta_{\tau_\omega\hat{\tau}_\omega}\right)$$
Here, $\delta_{ij}$ is the Kronecker delta:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
Or, phrased slightly differently, the misclassification rate is the fraction of incorrectly predicted observations:

$$\text{Misc} = \frac{1}{N}\sum_{\omega\,:\,\hat{\tau}_\omega \neq \tau_\omega} 1$$
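A minimal sketch of this count, using the same hypothetical data as before: the predicted level of a leaf is taken to be the level with the highest training posterior, and the Kronecker delta reduces to an equality test.

```python
# Hypothetical data; illustrative only.
p_hat = {0: {"yes": 0.75, "no": 0.25}, 1: {"yes": 0.1, "no": 0.9}}
obs = ([(0, "yes")] * 30 + [(0, "no")] * 10 +
       [(1, "yes")] * 6 + [(1, "no")] * 54)

# Predicted target level per leaf: the level with the largest posterior.
predicted = {leaf: max(p, key=p.get) for leaf, p in p_hat.items()}
misc = sum(1 for leaf, actual in obs if predicted[leaf] != actual) / len(obs)
print(misc)  # 0.16  ->  (10 + 6) incorrect out of 100
```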
For the sum of squares error (SSE), predictions are made for every observation: the correct posterior is predicted to be 1, and the incorrect posteriors are predicted to be 0. Thus the SSE is as follows, with $\tau_\omega$ once again being the actual target level for observation $\omega$:

$$\text{SSE} = \sum_{\omega}\left[\bigl(1 - \hat{p}_\omega(\tau_\omega)\bigr)^2 + \sum_{\tau \neq \tau_\omega}\hat{p}_\omega(\tau)^2\right]$$
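The corresponding sketch, again on the invented data set (illustrative only): each observation's "truth" vector is 1 for its actual level and 0 elsewhere, compared against the leaf posteriors.

```python
# Hypothetical data; illustrative only.
p_hat = {0: {"yes": 0.75, "no": 0.25}, 1: {"yes": 0.1, "no": 0.9}}
obs = ([(0, "yes")] * 30 + [(0, "no")] * 10 +
       [(1, "yes")] * 6 + [(1, "no")] * 54)

# (1 - posterior of the actual level)^2 plus the squared incorrect posteriors.
sse = sum((1 - p_hat[leaf][actual]) ** 2
          + sum(p ** 2 for tau, p in p_hat[leaf].items() if tau != actual)
          for leaf, actual in obs)
print(round(sse, 1))  # 25.8
```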
The subtree statistics that are calculated by PROC HPSPLIT are calculated per leaf. That is, instead of scanning through the entire data set, the proportions of observations are examined at the leaves. Barring missing target values, which are not handled by the tree, the per-leaf and per-observation methods of calculating the subtree statistics produce the same results.
As with the per-observation method, the observation counts ($N$, $N_\lambda$, and $N_{\tau\lambda}$) can come from either the training set or the validation set. The growth subtree table always reports statistics from the training set. The pruning subtree table reports both sets of statistics if both data sets are present.
Unless otherwise marked, the counts $N$ can come from either set.
Because there are $N_\lambda$ observations on the leaf $\lambda$, all sharing the same posterior probabilities, entropy takes the following form:

$$\text{Entropy} = -\frac{1}{N}\sum_{\lambda} N_\lambda \sum_{\tau}\hat{p}_\lambda(\tau)\,\log_2\!\bigl(\hat{p}_\lambda(\tau)\bigr)$$
Rephrased in terms of $N$, this becomes

$$\text{Entropy} = -\frac{1}{N}\sum_{\lambda}\sum_{\tau} N_{\tau\lambda}\,\log_2\!\left(\frac{N_{\tau\lambda}}{N_\lambda}\right)$$
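The per-leaf form needs only the counts $N_{\tau\lambda}$, with no scan of the data set. A sketch with the same hypothetical counts, whose result matches the per-observation average computed earlier:

```python
import math

# Hypothetical leaf-level counts; illustrative only.
N = {0: {"yes": 30, "no": 10}, 1: {"yes": 6, "no": 54}}
N_total = sum(sum(by_tau.values()) for by_tau in N.values())

# -(1/N) * sum over leaves and levels of N_tau_lambda * log2(N_tau_lambda / N_lambda)
entropy = -sum(n * math.log2(n / sum(by_tau.values()))
               for by_tau in N.values()
               for n in by_tau.values() if n > 0) / N_total
print(round(entropy, 5))  # 0.60591, same as the per-observation average
```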
The Gini statistic is similar to entropy in its leafwise form:

$$\text{Gini} = \frac{1}{N}\sum_{\lambda} N_\lambda\left(1 - \sum_{\tau}\hat{p}_\lambda(\tau)^2\right)$$
Rephrased in terms of $N$, this becomes

$$\text{Gini} = 1 - \frac{1}{N}\sum_{\lambda}\frac{1}{N_\lambda}\sum_{\tau} N_{\tau\lambda}^2$$
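A sketch of the rephrased Gini form with the same hypothetical counts (illustrative only):

```python
# Hypothetical leaf-level counts; illustrative only.
N = {0: {"yes": 30, "no": 10}, 1: {"yes": 6, "no": 54}}
N_total = sum(sum(by_tau.values()) for by_tau in N.values())

# Gini = 1 - (1/N) * sum over leaves of (1/N_lambda) * sum over tau of N_tau_lambda^2
gini = 1 - sum(sum(n * n for n in by_tau.values()) / sum(by_tau.values())
               for by_tau in N.values()) / N_total
print(round(gini, 3))  # 0.258, matching the per-observation average
```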
Misclassification comes from the number of incorrectly predicted observations. Thus, it is necessary to count the proportion of observations at each leaf in each target level. The misprediction rate of a single leaf, similar to the misprediction rate of an entire data set, is

$$m_\lambda = \frac{1}{N_\lambda}\sum_{\omega \in \lambda}\left(1 - \delta_{\tau_\omega\hat{\tau}_\lambda}\right)$$

where the summation is over the observations $\omega$ that arrive at leaf $\lambda$, and $\hat{\tau}_\lambda$ is the predicted target level for that leaf.
All observations at a leaf are assigned the same prediction because they are all assigned the same leaf. Therefore, the summation reduces to simply the number of observations at leaf $\lambda$ that have a target level other than the predicted target level for that leaf, $\hat{\tau}_\lambda$. Thus,

$$m_\lambda = \frac{1}{N_\lambda}\sum_{\tau \neq \hat{\tau}_\lambda} N_{\tau\lambda} = \frac{N_\lambda - N_{\hat{\tau}_\lambda\lambda}}{N_\lambda} = 1 - \frac{N_{\hat{\tau}_\lambda\lambda}}{N_\lambda}$$

where $N$ is replaced by $V$ if the validation set is being examined.
Thus, for the entire data set, the misclassification rate is

$$\text{Misc} = \frac{1}{N}\sum_{\lambda} N_\lambda\, m_\lambda = \frac{1}{N}\sum_{\lambda}\left(N_\lambda - N_{\hat{\tau}_\lambda\lambda}\right) = 1 - \frac{1}{N}\sum_{\lambda} N_{\hat{\tau}_\lambda\lambda}$$

where again $N$ is replaced by $V$ for the validation set.
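The sketch below evaluates this counts-only form on the hypothetical training and validation counts used throughout (illustrative only; the helper `misclassification` is invented here). The leaf prediction always comes from the training counts; only the counts being tallied change.

```python
# Hypothetical leaf-level counts; illustrative only.
N = {0: {"yes": 30, "no": 10}, 1: {"yes": 6, "no": 54}}   # training
V = {0: {"yes": 12, "no": 8}, 1: {"yes": 4, "no": 16}}    # validation

# Predicted level per leaf, always based on the training counts.
tau_hat = {leaf: max(by_tau, key=by_tau.get) for leaf, by_tau in N.items()}

def misclassification(counts):
    """Misc = 1 - (1/N) * sum over leaves of the correctly predicted count."""
    total = sum(sum(by_tau.values()) for by_tau in counts.values())
    correct = sum(counts[leaf][tau_hat[leaf]] for leaf in counts)
    return 1 - correct / total

print(round(misclassification(N), 2))  # 0.16 on the training set
print(round(misclassification(V), 2))  # 0.3 on the validation set
```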
The sum of squares error (SSE) is treated similarly to the misclassification rate. Each observation is assigned per-target posterior probabilities from the training data set. These are the predictions for the purpose of the SSE.
The observations at leaf $\lambda$ are then grouped by the observations' target levels. Because each observation in a group has the same actual target level, $\sigma$, and because all observations on the same node are assigned the same posterior probabilities, $\hat{p}_\lambda(\tau)$, the per-observation SSE equation is identical for every observation in a group:

$$\text{SSE}_\lambda = \sum_{\omega \in \lambda}\left[\bigl(1 - \hat{p}_\lambda(\tau_\omega)\bigr)^2 + \sum_{\tau \neq \tau_\omega}\hat{p}_\lambda(\tau)^2\right]$$
Here, the posterior probabilities are from the training set, and the counts themselves are from whichever data set is being examined.
Thus, the SSE equation for the leaf can be rephrased in terms of a further summation over the target levels $\sigma$:

$$\text{SSE}_\lambda = \sum_{\sigma} N_{\sigma\lambda}\left[\bigl(1 - \hat{p}_\lambda(\sigma)\bigr)^2 + \sum_{\tau \neq \sigma}\hat{p}_\lambda(\tau)^2\right]$$
So the SSE for the entire tree is then

$$\text{SSE} = \sum_{\lambda}\sum_{\sigma} N_{\sigma\lambda}\left[\bigl(1 - \hat{p}_\lambda(\sigma)\bigr)^2 + \sum_{\tau \neq \sigma}\hat{p}_\lambda(\tau)^2\right]$$
Substituting the counts from the training set back in and using $N^T$ to denote training set counts, this becomes

$$\begin{aligned}
\text{SSE} &= \sum_{\lambda}\sum_{\sigma} N_{\sigma\lambda}\left[\left(1 - \frac{N^T_{\sigma\lambda}}{N^T_\lambda}\right)^2 + \sum_{\tau \neq \sigma}\left(\frac{N^T_{\tau\lambda}}{N^T_\lambda}\right)^2\right] \\
&= \sum_{\lambda}\sum_{\sigma} N_{\sigma\lambda}\left[1 - 2\,\frac{N^T_{\sigma\lambda}}{N^T_\lambda} + \sum_{\tau}\left(\frac{N^T_{\tau\lambda}}{N^T_\lambda}\right)^2\right]
\end{aligned}$$
Now, in that rightmost inner summation, the index is simply $\tau$, the target level being summed over; because that term does not depend on $\sigma$, the outer summation over $\sigma$ multiplies it by $\sum_\sigma N_{\sigma\lambda} = N_\lambda$. This gives the final equivalent forms
$$\begin{aligned}
\text{SSE} &= \sum_{\lambda}\left[N_\lambda - 2\sum_{\sigma} N_{\sigma\lambda}\,\frac{N^T_{\sigma\lambda}}{N^T_\lambda} + N_\lambda\sum_{\tau}\left(\frac{N^T_{\tau\lambda}}{N^T_\lambda}\right)^2\right] \\
&= \sum_{\lambda} N_\lambda\left[1 - 2\sum_{\sigma} V_{\sigma\lambda}\,P_{\sigma\lambda} + \sum_{\tau} P_{\tau\lambda}^2\right]
\end{aligned}$$
where $N^T$ and $P$ are again counts and fractions, respectively, from the training set, and $N$ and $V$ are counts and fractions, respectively, from the validation set. (For example, $N_{\tau\lambda}$ is the number of observations on leaf $\lambda$ that have target $\tau$, and $P_{\tau\lambda} = N^T_{\tau\lambda}/N^T_\lambda$.)
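The following sketch evaluates the second form above on the hypothetical counts used throughout (illustrative only): training fractions $P_{\tau\lambda}$ combined with validation counts and fractions. Summing the per-observation SSE over the validation observations gives the same value.

```python
# Hypothetical counts; illustrative only.
NT = {0: {"yes": 30, "no": 10}, 1: {"yes": 6, "no": 54}}  # training counts N^T
NV = {0: {"yes": 12, "no": 8}, 1: {"yes": 4, "no": 16}}   # validation counts

sse = 0.0
for leaf in NV:
    n_leaf = sum(NV[leaf].values())                            # N_lambda
    nt_leaf = sum(NT[leaf].values())                           # N^T_lambda
    P = {tau: c / nt_leaf for tau, c in NT[leaf].items()}      # P_tau_lambda
    Vfrac = {tau: c / n_leaf for tau, c in NV[leaf].items()}   # V_tau_lambda
    cross = sum(Vfrac[tau] * P[tau] for tau in P)              # sum_sigma V*P
    sq = sum(p * p for p in P.values())                        # sum_tau P^2
    sse += n_leaf * (1 - 2 * cross + sq)
print(round(sse, 1))  # 17.3
```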
If there is no validation set, the training set is used instead, so $V_{\sigma\lambda} = P_{\sigma\lambda}$, and the equations simplify to the following (because $\sigma$ is merely an index over target levels and can be renamed $\tau$):
$$\text{SSE} = \sum_{\lambda} N_\lambda\left[1 - 2\sum_{\tau} P_{\tau\lambda}^2 + \sum_{\tau} P_{\tau\lambda}^2\right] = \sum_{\lambda} N_\lambda\left[1 - \sum_{\tau} P_{\tau\lambda}^2\right]$$
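A final sketch of the simplified form, on the same hypothetical training counts (illustrative only). Comparing with the leafwise Gini formula shows that this training-only SSE is $N$ times the Gini statistic, which the numbers below confirm.

```python
# Hypothetical training counts; illustrative only.
N = {0: {"yes": 30, "no": 10}, 1: {"yes": 6, "no": 54}}

# SSE = sum over leaves of N_lambda * (1 - sum over tau of P_tau_lambda^2)
sse = 0.0
for by_tau in N.values():
    n_leaf = sum(by_tau.values())
    sse += n_leaf * (1 - sum((c / n_leaf) ** 2 for c in by_tau.values()))
print(round(sse, 1))  # 25.8 = 100 * 0.258, i.e., N times the Gini statistic
```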