The HPSPLIT Procedure

Example 15.2 Assessing Variable Importance

During the manufacture of a semiconductor device, the levels of temperature, atomic composition, and other parameters are vital to ensuring that the final device is usable. This example creates a decision tree model for the performance of finished devices.

The following statements create a data set named MBE_DATA, which contains measurements for 20 devices:

data mbe_data;
   label gtemp  = 'Growth Temperature of Substrate';
   label atemp  = 'Anneal Temperature';
   label rot    = 'Rotation Speed';
   label dopant = 'Dopant Atom';
   label usable = 'Experiment Could be Performed';
   
   input gtemp atemp rot dopant $ 38-40 usable $ 46-54;
   datalines;
   384.614    633.172    1.01933      C       Unusable
   363.874    512.942    0.72057      C       Unusable
   397.395    671.179    0.90419      C       Unusable
   389.962    653.940    1.01417      C       Unusable
   387.763    612.545    1.00417      C       Unusable
   394.206    617.021    1.07188      Si      Usable
   387.135    616.035    0.94740      Si      Usable
   428.783    745.345    0.99087      Si      Unusable
   399.365    600.932    1.23307      Si      Unusable
   455.502    648.821    1.01703      Si      Unusable
   387.362    697.589    1.01623      Ge      Usable
   408.872    640.406    0.94543      Ge      Usable
   407.734    628.196    1.05137      Ge      Usable
   417.343    612.328    1.03960      Ge      Usable
   482.539    669.392    0.84249      Ge      Unusable
   367.116    564.246    0.99642      Sn      Unusable
   398.594    733.839    1.08744      Sn      Unusable
   378.032    619.561    1.06137      Sn      Usable
   357.544    606.871    0.85205      Sn      Unusable
   384.578    635.858    1.12215      Sn      Unusable
   ;

The variables GTEMP and ATEMP are temperatures, ROT is a rotation speed, and DOPANT is the atom that is used during device growth. The variable USABLE indicates whether the device is usable.
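
Before fitting a model, it can help to tabulate the target against the one categorical input. The following statements are a minimal sketch of such a check and are not part of the original example; the choice of PROC FREQ and of the NOCOL and NOPERCENT options is an assumption made here for illustration:

```sas
/* Cross-tabulate dopant atom against usability (illustrative check only) */
proc freq data=mbe_data;
   tables dopant*usable / nocol nopercent;
run;
```

For the 20 devices listed above, the table would show that all five devices grown with C are unusable, whereas four of the five devices grown with Ge are usable.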

The following statements create the tree model:

proc hpsplit data=mbe_data maxdepth=1;
   target usable;
   input gtemp atemp rot dopant;
   output importance=import;
   prune none;
run;

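There is only one INPUT statement because no input needs a nondefault level: the numeric variables GTEMP, ATEMP, and ROT are treated as interval inputs, and the character variable DOPANT is treated as a nominal input.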
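
For comparison, the same model can be requested with the levels spelled out. This is a sketch that assumes the LEVEL= option of the INPUT statement (LEVEL=INT for interval, LEVEL=NOM for nominal); it is intended to be equivalent to the preceding call, not a required form:

```sas
/* Same one-split tree, with the input levels stated explicitly */
proc hpsplit data=mbe_data maxdepth=1;
   target usable;
   input gtemp atemp rot / level=int;   /* numeric inputs as interval */
   input dopant / level=nom;            /* character input as nominal */
   output importance=import;
   prune none;
run;
```

Separate INPUT statements become necessary only when a numeric variable should be treated as nominal.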

The MAXDEPTH=1 option specifies that the tree is to stop splitting when it reaches a depth of one. In other words, PROC HPSPLIT evaluates a candidate split on each input variable and then chooses the single best variable on which to split the data. The split that is chosen divides the data into one group that has a higher incidence and one group that has a lower incidence of each level of the target variable (USABLE). The PRUNE NONE statement suppresses pruning, which is unnecessary because the tree contains only one split.
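
To make "best" concrete, the following DATA step is a rough sketch of the entropy of the unsplit data, using the class counts in the data lines above (7 usable and 13 unusable devices). The variable names are hypothetical, and the base-2 logarithm is an assumption; the exact computation that PROC HPSPLIT performs internally is not described in this example:

```sas
data _null_;
   /* Class counts read from the data lines above */
   n_usable   = 7;
   n_unusable = 13;
   p = n_usable / (n_usable + n_unusable);

   /* Shannon entropy, in bits, of the unsplit (root) node */
   root_entropy = -(p*log2(p) + (1-p)*log2(1-p));
   put root_entropy= 8.4;
run;
```

The root entropy is roughly 0.93 bits. Under an entropy criterion, a candidate split is better the more it lowers the size-weighted average of this quantity over the resulting child nodes.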

The OUTPUT statement saves information about variable importance in a data set named IMPORT. The following statements list the relevant observation in IMPORT:

proc print data=import(where=(itype='Import'));
run;

The result of these statements is provided in Output 15.2.1.

Output 15.2.1: Variable Importance of the One-Split Decision Tree

Obs  _TREENUM_  _CRITERION_  _OBSMISS_  _OBSUSED_  _OBSVALID_  _OBSTMISS_  ITYPE   gtemp  atemp  rot  dopant
  3          1  Entropy              0         20           0           0  Import      0      0    0       1



The dopant atom is the most important consideration in determining the usability of a device: DOPANT is the only input that is used in the one-split decision tree, so it has an importance of 1, whereas the other input variables are not used at all and have an importance of 0.
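
When there are many inputs, the wide importance row can be hard to scan. The following statements are a sketch of one way to reshape it into one row per input; the output data set name IMPORT_LONG and the column names VARIABLE and IMPORTANCE are hypothetical names chosen for illustration:

```sas
/* Turn the single 'Import' row into one row per input variable */
proc transpose data=import(where=(itype='Import'))
               out=import_long(rename=(col1=importance))
               name=variable;
   var gtemp atemp rot dopant;
run;

/* List the inputs from most to least important */
proc sort data=import_long;
   by descending importance;
run;

proc print data=import_long noobs;
   var variable importance;
run;
```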