After you create a tree model, you can apply it to the training data for model diagnosis or to new data in order to make predictions. The process of applying a model to a data set is called scoring. You can use scoring to improve or deploy your model. There are two approaches to using PROC HPSPLIT to score a data set.
With the first approach, you can use the OUTPUT statement to score the training data. Usually, the purpose of scoring a training data set is to diagnose the model. The training data set is the data set that you specify by using the DATA= option. When scoring the training data, PROC HPSPLIT creates an output data set that contains one observation for each observation in the training data. You can specify the output data set by using the OUT= option in the OUTPUT statement.
In the following example, the input data set is scored after the tree model has been created:
```sas
proc hpsplit data=Sampsio.Hmeq;
   class Bad Delinq Derog Job Ninq Reason;
   model Bad = Delinq Derog Job Ninq Reason;
   output out=scored;
run;
```
With the second approach to scoring, the HPSPLIT procedure generates SAS DATA step code that you can use to score new data. Usually, the purpose of scoring new data is to make predictions. PROC HPSPLIT generates SAS DATA step code when you specify the CODE statement. For more information, see the section "Creating Score Code and Scoring New Data" in Example 16.4: Creating a Binary Classification Tree with Validation Data.
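As a minimal sketch of this second approach (the file name `hpsplit_score.sas` and the input data set `Work.NewApplicants` are hypothetical), the CODE statement writes the generated DATA step code to a file, which a later DATA step can include to score new data:

```sas
/* Fit the tree and write DATA step scoring code to a file    */
/* (the file name and the new data set below are hypothetical) */
proc hpsplit data=Sampsio.Hmeq;
   class Bad Delinq Derog Job Ninq Reason;
   model Bad = Delinq Derog Job Ninq Reason;
   code file='hpsplit_score.sas';
run;

/* Score new observations by including the generated code */
data ScoredNew;
   set Work.NewApplicants;
   %include 'hpsplit_score.sas';
run;
```

The generated file contains ordinary DATA step statements, so it can be applied wherever a DATA step runs, without refitting the model.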
With either approach, scoring a data set creates a data set that contains the new variables shown in Table 16.3.
Table 16.3: Scoring Variables
| Variable | Description |
|---|---|
| _Leaf_ | Leaf number to which the observation is assigned |
| _Node_ | Node number to which the observation is assigned |
In addition, for classification trees, the scored data set contains new variables with the prefix "P_" and can contain new variables with the prefix "V_". There is one "P_" variable for each level of the response variable. For all observations in the same leaf, these variables represent the proportion of the training observations in that leaf that have that particular response level. Similarly, if validation data are used, the variables with the prefix "V_" represent the proportion of the validation observations in a leaf that have the corresponding response level. For example, if the categorical response variable is named Color and has two levels, Blue and Green, then the scored data set contains the variables P_ColorBlue and P_ColorGreen, which provide the proportions of training observations in a leaf that have the response levels Blue and Green. If you use validation data, the HPSPLIT procedure creates two more new variables, V_ColorBlue and V_ColorGreen, which provide the proportions of validation observations in a leaf that have the response levels Blue and Green.
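A quick way to examine these per-leaf proportions is to print a few scored observations (a sketch: the data set Scored and the Color response model here are hypothetical, and the "V_" variables are present only if validation data were used):

```sas
/* Inspect the per-leaf response proportions in a scored data set */
/* (the data set Scored and the Color response are hypothetical)  */
proc print data=Scored(obs=5);
   var _Node_ _Leaf_ P_ColorBlue P_ColorGreen;
run;
```

Because the proportions are computed per leaf, every observation that falls in the same leaf shows identical P_ values.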
For regression trees, the scored data set contains a new variable with the prefix "P_" and can contain another new variable with the prefix "V_". For all observations in the same leaf, the variable with the prefix "P_" represents the average value of the response variable for the training observations in that leaf. If you use validation data, the variable with the prefix "V_" represents the average value of the response variable for the validation observations in the same leaf. For example, if the continuous response variable is named logSalary, then the scored data set contains a new variable, P_logSalary, which provides the average value of logSalary in the training data for observations in the same leaf. If you use validation data, the HPSPLIT procedure creates another new variable, V_logSalary, which provides the average value of logSalary for validation observations in the same leaf.
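As a hypothetical check of this behavior: because every observation in a leaf receives the same predicted average, summarizing P_logSalary by leaf should show no within-leaf variation (the scored data set ScoredSalary is assumed):

```sas
/* Verify that the predicted average is constant within each leaf */
/* (the scored data set ScoredSalary is hypothetical)             */
proc means data=ScoredSalary min max;
   class _Leaf_;
   var P_logSalary;   /* min should equal max within each leaf */
run;
```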