The HPSPLIT Procedure

Scoring

After you create a tree model, you can apply it to the training data to diagnose the model or to new data to make predictions. The process of applying a model to a data set is called scoring. You can use scoring to improve or deploy your model. There are two approaches to using PROC HPSPLIT to score a data set.

With the first approach, you can use the OUTPUT statement to score the training data. Usually, the purpose of scoring a training data set is to diagnose the model. The training data set is the data set that you specify by using the DATA= option. When scoring the training data, PROC HPSPLIT creates an output data set that contains one observation for each observation in the training data. You can specify the output data set by using the OUT= option in the OUTPUT statement.

In the following example, the input data set is scored after the tree model has been created:

proc hpsplit data=Sampsio.Hmeq;
   class Bad Delinq Derog Job Ninq Reason;
   model Bad = Delinq Derog Job Ninq Reason;
   output out=scored;
run;

With the second approach to scoring, the HPSPLIT procedure generates SAS DATA step code that you can use to score new data. Usually, the purpose of scoring new data is to make predictions. PROC HPSPLIT generates SAS DATA step code when you specify the CODE statement. For more information, see the section "Creating Score Code and Scoring New Data" in Example 16.4: Creating a Binary Classification Tree with Validation Data.
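
For example, the following sketch shows how the CODE statement can be combined with a DATA step to score new data. The file name hmeq_score.sas and the data set NewApplicants are hypothetical placeholders; the generated file can be applied to any data set that contains the model's predictor variables.

proc hpsplit data=Sampsio.Hmeq;
   class Bad Delinq Derog Job Ninq Reason;
   model Bad = Delinq Derog Job Ninq Reason;
   code file='hmeq_score.sas';    /* write DATA step scoring code to this file */
run;

/* Score a new data set by including the generated code in a DATA step */
data ScoredNew;
   set NewApplicants;             /* hypothetical new data with the same predictors */
   %include 'hmeq_score.sas';
run;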

With either approach, scoring produces a data set that contains the new variables shown in Table 16.3.

Table 16.3: Scoring Variables

Variable    Description
_Leaf_      Leaf number to which the observation is assigned
_Node_      Node number to which the observation is assigned

In addition, for classification trees, the scored data set contains new variables that have the prefix "P_" and can contain new variables that have the prefix "V_". There is one "P_" variable for each level of the response variable. For all observations in the same leaf, these variables give the proportion of the training observations in that leaf that have that particular response level. Similarly, if validation data are used, the variables that have the prefix "V_" give the proportion of the validation observations in a leaf that have the corresponding response level. For example, if the categorical response variable is named Color and has two levels, Blue and Green, then the scored data set contains the variables P_ColorBlue and P_ColorGreen, which provide the proportions of training observations in each leaf that have the response levels Blue and Green, respectively. If you use validation data, the HPSPLIT procedure creates two more variables, V_ColorBlue and V_ColorGreen, which provide the corresponding proportions of validation observations in each leaf.
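
The following sketch scores the Hmeq training data with a validation partition and prints a few of the resulting scoring variables. It assumes that the response variable Bad has the levels 0 and 1, so that the new variables are named P_Bad0, P_Bad1, V_Bad0, and V_Bad1 according to the naming convention described above.

proc hpsplit data=Sampsio.Hmeq;
   class Bad Delinq Derog Job Ninq Reason;
   model Bad = Delinq Derog Job Ninq Reason;
   partition fraction(validate=0.3);   /* reserve part of the data for validation */
   output out=scored;
run;

proc print data=scored(obs=5);
   var _Leaf_ _Node_ P_Bad0 P_Bad1 V_Bad0 V_Bad1;
run;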

For regression trees, the scored data set contains a new variable that has the prefix "P_" and can contain another new variable that has the prefix "V_". For all observations in the same leaf, the "P_" variable gives the average value of the response variable for the training observations in that leaf. If you use validation data, the "V_" variable gives the average value of the response variable for the validation observations in that leaf. For example, if the continuous response variable is named logSalary, then the scored data set contains a new variable, P_logSalary, which provides the average value of logSalary for the training observations in the same leaf. If you use validation data, the HPSPLIT procedure also creates the variable V_logSalary, which provides the average value of logSalary for the validation observations in the same leaf.
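
The following sketch illustrates this for a regression tree. It assumes that the Sashelp.Baseball data set is available and contains the continuous response logSalary together with the listed predictors; the scored data set then contains the variable P_logSalary.

proc hpsplit data=Sashelp.Baseball;
   class League Division;
   model logSalary = nHits nRuns nRBI nBB YrMajor CrHits League Division;
   output out=ScoredBaseball;
run;

proc print data=ScoredBaseball(obs=5);
   var _Leaf_ _Node_ P_logSalary;
run;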