The HPREG Procedure

Example 14.2 Backward Selection in Single-Machine and Distributed Modes

This example shows how you can run PROC HPREG in single-machine and distributed modes. See the section Processing Modes in Chapter 3: Shared Concepts and Topics, for details about the execution modes of SAS High-Performance Statistics procedures. The focus of this example is on how to switch the execution mode of PROC HPREG rather than on any statistical features of the procedure. The following DATA step generates the data for this example. The response y depends on the first 20 of the 1,000 regressors.

  
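 data ex2Data;
    array x{1000};

    do i=1 to 10000;                  /* 10,000 observations */
       y=1;                           /* intercept */
       sign=1;

       do j=1 to 1000;
          x{j} = ranuni(1);           /* uniform(0,1) regressor */
          if j<=20 then do;           /* only x1-x20 affect y */
             y = y + sign*j*x{j};     /* coefficient j with alternating sign */
             sign=-sign;
          end;
       end;
       y = y + 5*rannor(1);           /* normal noise, standard deviation 5 */
       output;
    end;
 run;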
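
Written out, the DATA step above appears to simulate the following linear model. This is a reading of the code rather than a statement from the procedure documentation; the symbol \varepsilon_i is notation introduced here for the RANNOR draw, and i indexes the observations.

```latex
y_i = 1 + \sum_{j=1}^{20} (-1)^{j+1}\, j\, x_{ij} + 5\,\varepsilon_i,
\qquad x_{ij} \sim \mathrm{Uniform}(0,1),
\quad \varepsilon_i \sim N(0,1),
\quad i = 1,\dots,10000
```

Under this reading, the true coefficient of x_j is +j for odd j and -j for even j when j <= 20, and zero for x21 through x1000. That is the pattern to compare against the estimates in Output 14.2.2.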

The following statements use PROC HPREG to select a model by using BACKWARD selection:


 proc hpreg data=ex2Data;
     model y = x: ;
     selection method = backward;
     performance details;
 run;

Output 14.2.1 shows the "Performance Information" table. The table shows that the HPREG procedure executes in single-machine mode and uses four threads because the client machine has four CPUs. You can force a certain number of threads on any machine involved in the computations by specifying the NTHREADS= option in the PERFORMANCE statement.

Output 14.2.1: Performance Information

The HPREG Procedure

Performance Information
Execution Mode Single-Machine
Number of Threads 4
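
As a minimal sketch of the NTHREADS= option mentioned above, the following statements repeat the same run but restrict it to two threads. The value 2 is arbitrary and is chosen only for illustration.

```sas
/* Same model as above; NTHREADS=2 is an illustrative value, not a recommendation */
proc hpreg data=ex2Data;
   model y = x: ;
   selection method = backward;
   performance details nthreads = 2;
run;
```

With this setting, you would expect the "Performance Information" table to report two threads instead of four.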



Output 14.2.2 shows the parameter estimates for the selected model. You can see that the default BACKWARD selection with selection and stopping based on the SBC criterion retains all 20 of the true effects but also keeps two extraneous effects.

Output 14.2.2: Parameter Estimates for the Selected Model

Parameter Estimates
Parameter DF Estimate Standard Error t Value Pr > |t|
Intercept 1 1.506615 0.419811 3.59 0.0003
x1 1 1.054402 0.176930 5.96 <.0001
x2 1 -1.996080 0.176967 -11.28 <.0001
x3 1 3.293331 0.177032 18.60 <.0001
x4 1 -3.741273 0.176349 -21.22 <.0001
x5 1 4.908310 0.176047 27.88 <.0001
x6 1 -5.772356 0.176642 -32.68 <.0001
x7 1 7.398822 0.175792 42.09 <.0001
x8 1 -7.958471 0.176281 -45.15 <.0001
x9 1 8.899407 0.177624 50.10 <.0001
x10 1 -9.687667 0.176431 -54.91 <.0001
x11 1 11.083373 0.175195 63.26 <.0001
x12 1 -12.046504 0.176324 -68.32 <.0001
x13 1 13.009052 0.176967 73.51 <.0001
x14 1 -14.456393 0.175968 -82.15 <.0001
x15 1 14.928731 0.174868 85.37 <.0001
x16 1 -15.762907 0.177651 -88.73 <.0001
x17 1 16.842889 0.177037 95.14 <.0001
x18 1 -18.468844 0.176502 -104.64 <.0001
x19 1 18.810193 0.176616 106.50 <.0001
x20 1 -20.212291 0.176325 -114.63 <.0001
x87 1 -0.542384 0.176293 -3.08 0.0021
x362 1 -0.560999 0.176594 -3.18 0.0015
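
If you prefer to state the selection and stopping criteria explicitly rather than rely on the defaults, a sketch along the following lines should be equivalent. It assumes that the SELECT= and STOP= suboptions of METHOD=BACKWARD accept SBC, which matches the default behavior described above; check the SELECTION statement syntax for your release before relying on it.

```sas
/* Assumption: SELECT=SBC and STOP=SBC restate the defaults described in the text */
proc hpreg data=ex2Data;
   model y = x: ;
   selection method = backward(select=sbc stop=sbc);
   performance details;
run;
```

Writing the criteria out makes the program self-documenting and guards against a change in defaults between releases.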



Output 14.2.3 shows timing information for the PROC HPREG run. This table is produced when you specify the DETAILS option in the PERFORMANCE statement. You can see that, in this case, the majority of time is spent forming the crossproducts matrix for the model that contains all the regressors.

Output 14.2.3: Timing

Procedure Task Timing
Task Seconds Percent
Reading and Levelizing Data 0.20 6.72%
Loading Design Matrix 0.03 1.06%
Computing Moments 0.02 0.82%
Computing Cross Products Matrix 2.14 72.79%
Performing Model Selection 0.55 18.62%



You can switch to running PROC HPREG in distributed mode by specifying valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables by using the SET= system option in OPTIONS statements. See the section Processing Modes in Chapter 3: Shared Concepts and Topics, for details about setting these options or environment variables.
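
As a sketch of the environment-variable alternative, the following statements set GRIDHOST and GRIDINSTALLLOC with the SET= system option and then request 10 nodes without the HOST= and INSTALL= options. The host name and install path are hypothetical placeholders; substitute the values for your own grid.

```sas
/* Hypothetical placeholder values; replace with your site's grid host and install location */
option set=GRIDHOST="my-grid-host.example.com";
option set=GRIDINSTALLLOC="/opt/my-grid-install";

proc hpreg data=ex2Data;
   model y = x: ;
   selection method = backward;
   performance details nodes = 10;   /* host and install location come from the environment */
run;
```

Setting the environment variables once per session keeps site-specific values out of each PERFORMANCE statement.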

The following statements provide an example that supplies the HOST= and INSTALL= options through macro variables. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.

 proc hpreg data=ex2Data;
     model y = x: ;
     selection method = backward;
     performance details nodes = 10
                 host="&GRIDHOST" install="&GRIDINSTALLLOC";
 run;
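
One way to satisfy the macro-variable requirement is to define both variables with %LET before you submit the PROC HPREG step. The host name and install path shown here are hypothetical placeholders, not values taken from this example.

```sas
/* Hypothetical placeholder values; substitute your site's grid host and install location */
%let GRIDHOST       = my-grid-host.example.com;
%let GRIDINSTALLLOC = /opt/my-grid-install;
```

Because the references in the PERFORMANCE statement are enclosed in double quotation marks, the macro processor resolves &GRIDHOST and &GRIDINSTALLLOC to these values when the step is compiled.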

The execution mode in the "Performance Information" table shown in Output 14.2.4 indicates that the calculations were performed in a distributed environment that uses 10 nodes, each of which uses 32 threads.

Output 14.2.4: Performance Information in Distributed Mode

Performance Information
Host Node << your grid host >>
Install Location << your grid install location >>
Execution Mode Distributed
Number of Compute Nodes 10
Number of Threads per Node 32



Another indication of distributed execution is the following message, which all high-performance statistical procedures issue in the SAS log:

NOTE: The HPREG procedure is executing in the distributed
      computing environment with 10 worker nodes.

Output 14.2.5 shows timing information for this distributed run of the HPREG procedure. In contrast to the single-machine mode (where forming the crossproducts matrix dominated the time spent), the majority of time in distributed mode is spent distributing the data and performing the model selection.

Output 14.2.5: Timing

Procedure Task Timing
Task Seconds Percent
Distributing Data 0.73 46.40%
Reading and Levelizing Data 0.02 1.07%
Loading Design Matrix 0.00 0.29%
Computing Moments 0.00 0.13%
Computing Cross Products Matrix 0.18 11.18%
Performing Model Selection 0.32 20.37%
Waiting on Client 0.32 20.55%