This example shows how you can run PROC HPREG in single-machine and distributed modes. See the section Processing Modes in Chapter 3: Shared Concepts and Topics, for details about the execution modes of SAS High-Performance Statistics procedures. The focus of this example is on how to switch the execution mode of PROC HPREG rather than on any statistical features of the procedure. The following DATA step generates the data for this example. The response y depends on 20 of the 1,000 regressors.
data ex2Data;
   array x{1000};
   do i=1 to 10000;
      y=1;
      sign=1;
      do j=1 to 1000;
         x{j} = ranuni(1);
         if j<=20 then do;
            y = y + sign*j*x{j};
            sign=-sign;
         end;
      end;
      y = y + 5*rannor(1);
      output;
   end;
run;
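Written as a formula, the DATA step above generates each response as follows. This is only a restatement of the code, where each $x_j$ is a uniform(0,1) draw from RANUNI and $\varepsilon$ is a standard normal draw from RANNOR:

```latex
y = 1 + \sum_{j=1}^{20} (-1)^{j+1}\, j\, x_j + 5\,\varepsilon
```

The true intercept is therefore 1, the true coefficient of x_j is +j for odd j and -j for even j (for j from 1 to 20), and x21 through x1000 have no effect on y.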
The following statements use PROC HPREG to select a model by using BACKWARD selection:
proc hpreg data=ex2Data;
   model y = x: ;
   selection method = backward;
   performance details;
run;
Output 14.2.1 shows the "Performance Information" table. This shows that the HPREG procedure executes in single-machine mode using four threads because the client machine has four CPUs. You can force a certain number of threads on any machine involved in the computations with the NTHREADS option in the PERFORMANCE statement.
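As a minimal sketch of the NTHREADS option just mentioned, the following variant of the preceding step requests a fixed thread count. The value 2 is an arbitrary choice for illustration, not a recommendation.

```sas
proc hpreg data=ex2Data;
   model y = x: ;
   selection method = backward;
   /* NTHREADS=2 is an illustrative value; pick one suited to your machine */
   performance nthreads=2 details;
run;
```

With this setting, the "Performance Information" table would be expected to report two threads instead of the number of CPUs on the client machine.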
Output 14.2.2 shows the parameter estimates for the selected model. You can see that the default BACKWARD selection with selection and stopping based on the SBC criterion retains all 20 of the true effects but also keeps two extraneous effects.
Output 14.2.2: Parameter Estimates for the Selected Model
Parameter | DF | Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | 1.506615 | 0.419811 | 3.59 | 0.0003 |
x1 | 1 | 1.054402 | 0.176930 | 5.96 | <.0001 |
x2 | 1 | -1.996080 | 0.176967 | -11.28 | <.0001 |
x3 | 1 | 3.293331 | 0.177032 | 18.60 | <.0001 |
x4 | 1 | -3.741273 | 0.176349 | -21.22 | <.0001 |
x5 | 1 | 4.908310 | 0.176047 | 27.88 | <.0001 |
x6 | 1 | -5.772356 | 0.176642 | -32.68 | <.0001 |
x7 | 1 | 7.398822 | 0.175792 | 42.09 | <.0001 |
x8 | 1 | -7.958471 | 0.176281 | -45.15 | <.0001 |
x9 | 1 | 8.899407 | 0.177624 | 50.10 | <.0001 |
x10 | 1 | -9.687667 | 0.176431 | -54.91 | <.0001 |
x11 | 1 | 11.083373 | 0.175195 | 63.26 | <.0001 |
x12 | 1 | -12.046504 | 0.176324 | -68.32 | <.0001 |
x13 | 1 | 13.009052 | 0.176967 | 73.51 | <.0001 |
x14 | 1 | -14.456393 | 0.175968 | -82.15 | <.0001 |
x15 | 1 | 14.928731 | 0.174868 | 85.37 | <.0001 |
x16 | 1 | -15.762907 | 0.177651 | -88.73 | <.0001 |
x17 | 1 | 16.842889 | 0.177037 | 95.14 | <.0001 |
x18 | 1 | -18.468844 | 0.176502 | -104.64 | <.0001 |
x19 | 1 | 18.810193 | 0.176616 | 106.50 | <.0001 |
x20 | 1 | -20.212291 | 0.176325 | -114.63 | <.0001 |
x87 | 1 | -0.542384 | 0.176293 | -3.08 | 0.0021 |
x362 | 1 | -0.560999 | 0.176594 | -3.18 | 0.0015 |
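The default criteria described above can also be stated explicitly. The following sketch assumes the SELECT= and STOP= suboptions of METHOD=BACKWARD as described for the SELECTION statement; check that section for the exact suboptions available in your release.

```sas
proc hpreg data=ex2Data;
   model y = x: ;
   /* Request SBC for both selection and stopping, matching the defaults noted above */
   selection method = backward(select=sbc stop=sbc);
run;
```

Spelling out the criteria makes the selection rule visible in the code and gives a single place to change it if you want to compare against another criterion.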
Output 14.2.3 shows timing information for the PROC HPREG run. This table is produced when you specify the DETAILS option in the PERFORMANCE statement. You can see that, in this case, the majority of time is spent forming the crossproducts matrix for the model that contains all the regressors.
You can switch to running PROC HPREG in distributed mode by specifying valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables by using OPTIONS SET commands. See the section Processing Modes in Chapter 3: Shared Concepts and Topics, for details about setting these options or environment variables.
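As a sketch of the environment-variable alternative, the following OPTIONS SET commands assign both variables. The host name and installation path shown are placeholders, not real values; substitute the ones for your own grid.

```sas
/* Placeholder values: replace with your grid host name and installation path */
options set=GRIDHOST="your-grid-host";
options set=GRIDINSTALLLOC="/path/to/grid/install";
```

With these variables set, the HOST= and INSTALL= options can be left out of the PERFORMANCE statement.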
The following statements provide an example. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.
proc hpreg data=ex2Data;
   model y = x: ;
   selection method = backward;
   performance details nodes = 10
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
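The macro variables referenced in the PERFORMANCE statement could be defined with %LET statements placed before the PROC HPREG step, for example as follows. The values are hypothetical placeholders that you would replace with the host name and installation location of your own grid.

```sas
/* Hypothetical values for illustration only; run these before the PROC HPREG step */
%let GRIDHOST=your-grid-host;
%let GRIDINSTALLLOC=/path/to/grid/install;
```

Because the references appear inside double quotation marks, the macro processor resolves them before the PERFORMANCE statement is executed.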
The execution mode in the "Performance Information" table shown in Output 14.2.4 indicates that the calculations were performed in a distributed environment that uses 10 nodes, each of which uses eight threads.
Another indication of distributed execution is the following message, which all high-performance statistical procedures write to the SAS log:
NOTE: The HPREG procedure is executing in the distributed computing environment with 10 worker nodes.
Output 14.2.5 shows timing information for this distributed run of the HPREG procedure. In contrast to the single-machine mode (where forming the crossproducts matrix dominated the time spent), the majority of time in distributed mode is spent distributing the data and performing the model selection.