The HPSPLIT Procedure

Example 9.2 Assessing Variable Importance

During the manufacture of a semiconductor device, the levels of temperature, atomic composition, and other parameters are vital to ensuring that the final device is usable. This example creates a decision tree model for the performance of finished devices.

The following statements create a data set named MBE_DATA, which contains measurements for 20 devices:

data mbe_data;
   label gtemp  = 'Growth Temperature of Substrate';
   label atemp  = 'Anneal Temperature';
   label rot    = 'Rotation Speed';
   label dopant = 'Dopant Atom';
   label usable = 'Experiment Could be Performed';

   /* gtemp, atemp, and rot use list input; dopant and usable use column input */
   input gtemp atemp rot dopant $ 35-37 usable $ 43-51;
   datalines;
384.614    633.172    1.01933      C       Unusable
363.874    512.942    0.72057      C       Unusable
397.395    671.179    0.90419      C       Unusable
389.962    653.940    1.01417      C       Unusable
387.763    612.545    1.00417      C       Unusable
394.206    617.021    1.07188      Si      Usable  
387.135    616.035    0.94740      Si      Usable  
428.783    745.345    0.99087      Si      Unusable
399.365    600.932    1.23307      Si      Unusable
455.502    648.821    1.01703      Si      Unusable
387.362    697.589    1.01623      Ge      Usable  
408.872    640.406    0.94543      Ge      Usable  
407.734    628.196    1.05137      Ge      Usable  
417.343    612.328    1.03960      Ge      Usable  
482.539    669.392    0.84249      Ge      Unusable
367.116    564.246    0.99642      Sn      Unusable
398.594    733.839    1.08744      Sn      Unusable
378.032    619.561    1.06137      Sn      Usable  
357.544    606.871    0.85205      Sn      Unusable
384.578    635.858    1.12215      Sn      Unusable
;
run;

The variables GTEMP and ATEMP are temperatures, ROT is a rotation speed, and DOPANT is the atom that is used during device growth. The variable USABLE indicates whether the device is usable.

The following statements create the tree model:

proc hpsplit data=mbe_data maxdepth=1;
  target usable;
  input gtemp atemp rot dopant;
  output importance=import;
  prune none;
run;

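Only one INPUT statement is needed because all of the numeric variables (GTEMP, ATEMP, and ROT) are interval inputs, which is the default level for numeric variables. DOPANT is a character variable, so PROC HPSPLIT treats it as a nominal input.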

The MAXDEPTH=1 option specifies that the tree is to stop splitting when it reaches a depth of one, so the tree contains a single split. In other words, PROC HPSPLIT tries to split the data by each input variable and then chooses the best variable on which to split the data. The split that is chosen divides the data into groups that have higher and lower incidences of the target variable (USABLE). The PRUNE NONE statement suppresses pruning, which is unnecessary because there is only one split.
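The output later in this example reports Entropy as the splitting criterion. As a rough illustration of what that criterion measures, the following DATA step is a minimal sketch that computes the base-2 entropy of USABLE before any split. The counts (7 usable, 13 unusable) are tallied by hand from MBE_DATA, and the variable names are chosen only for this illustration. The formula is the standard entropy definition, and it is an assumption that PROC HPSPLIT computes its criterion in exactly this form.

```sas
/* Sketch: entropy of the target at the root node (before any split).  */
/* Counts are tallied by hand from MBE_DATA: 7 Usable and 13 Unusable. */
data _null_;
   n_usable   = 7;
   n_unusable = 13;
   n  = n_usable + n_unusable;
   p1 = n_usable / n;      /* proportion of usable devices   */
   p2 = n_unusable / n;    /* proportion of unusable devices */
   entropy = -(p1*log2(p1) + p2*log2(p2));
   put 'Root-node entropy: ' entropy 6.4;
run;
```

A split is attractive under this criterion when the child nodes have lower entropy than the root, that is, when each child is dominated by one level of USABLE.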

The OUTPUT statement saves information about variable importance in a data set named IMPORT. The following statements list the relevant observation in IMPORT:

proc print data=import(where=(itype='Import'));
run;

The result of these statements is provided in Output 9.2.1.

Output 9.2.1: Variable Importance of the One-Split Decision Tree

Obs  TREENUM  CRITERION  OBSMISS  OBSUSED  OBSVALID  OBSTMISS  ITYPE   gtemp  atemp  rot  dopant
  3        1  Entropy          0       20         0         0  Import      0      0    0       1
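If only the importance values are of interest, the listing can be narrowed to the input columns. The following statements are a small sketch that assumes the IMPORT data set has the layout shown in the output above, with one column per input variable:

```sas
/* Sketch: print only the importance columns from the IMPORT data set. */
proc print data=import(where=(itype='Import')) noobs;
   var gtemp atemp rot dopant;
run;
```

The NOOBS option suppresses the observation number, and the VAR statement limits the listing to the four inputs.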


The dopant atom is the most important consideration in determining the usability of the sample. DOPANT is the only input that is used in the one-split decision tree, so it has an importance of 1. The other input variables are not used at all, so each has an importance of 0.
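One way to see why DOPANT separates the devices so well is to cross-tabulate it against USABLE. The following statements are a sketch that is not part of the original example. They use PROC FREQ on the MBE_DATA data set that is created earlier:

```sas
/* Sketch: count usable and unusable devices for each dopant atom. */
proc freq data=mbe_data;
   tables dopant*usable / nocol nopercent;
run;
```

In the data listed above, all five devices doped with C are unusable, whereas four of the five devices doped with Ge are usable. The counts are 2 usable and 3 unusable for Si, and 1 usable and 4 unusable for Sn.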