The HPPLS Procedure

Example 12.2 Fitting a PLS Model in Single-Machine and Distributed Modes

This example shows how to run PROC HPPLS in single-machine and distributed modes and how to switch between them. For more information about the execution modes of SAS high-performance analytical procedures, see the section Processing Modes. The following DATA step generates a data set of 100,000 observations that contains 10 response variables (y1 through y10), 100 continuous predictors (x1 through x100), and a five-level classification variable z:

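data ex2Data;
   drop i j k sign n n1 n2 n3 n4;

   /* Number of observations and cutoffs for the levels of z */
   n  = 100000;
   n1 = n*0.1;
   n2 = n*0.25;
   n3 = n*0.45;
   n4 = n*0.7;

   array y{10};    /* responses y1-y10 */
   array x{100};   /* predictors x1-x100 */

   do i=1 to n;
      /* Start every response at 1 */
      do j=1 to dim(y);
         y{j} = 1;
      end;
      sign = 1;

      /* Add each uniform predictor to every response with weight j,
         alternating the sign from one response to the next */
      do j=1 to dim(x);
         x{j} = ranuni(1);
         do k=1 to dim(y);
            y{k} = y{k} + sign*j*x{j};
            sign = -sign;
         end;
      end;

      /* Add normal noise with standard deviation 7 to every response */
      do j=1 to dim(y);
         y{j} = y{j} + 7*rannor(1);
      end;

      /* Assign the level of z according to the observation number */
      if      i <= n1 then z='verytiny';
      else if i <= n2 then z='small';
      else if i <= n3 then z='medium';
      else if i <= n4 then z='large';
      else                 z='huge';

      output;
   end;
run;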
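
Before fitting the model, it can be worth confirming that the classification variable was generated as intended. The following statements are a minimal sketch and are not part of the original example. They tabulate the levels of z. From the cutoffs n1 through n4 in the DATA step, the levels verytiny, small, medium, large, and huge should account for about 10%, 15%, 20%, 25%, and 30% of the observations, respectively.

proc freq data=ex2Data;
   /* One-way frequency table of z; NOCUM suppresses cumulative columns */
   tables z / nocum;
run;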

The following statements use PROC HPPLS to fit a PLS model by using the SIMPLS method and test set validation:


proc hppls data=ex2Data method=simpls cvtest(stat=press seed=12345);
   class z;
   model y: = x: z:;
   partition fraction(test=0.4 seed=67890);
   performance details;
run;

In this example, each observation has a 40% probability of being assigned the testing role. All remaining observations are assigned the training role.
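
Because roles are assigned by probability, the size of the test set is itself random and is only approximately 40% of the data. The following DATA step is a rough sketch of that idea, assuming independent assignment for each observation. It does not reproduce the random-number stream or the algorithm that PROC HPPLS uses internally, and the seed that is passed to CALL STREAMINIT is arbitrary.

data _null_;
   call streaminit(67890);   /* arbitrary seed; not the PROC HPPLS stream */
   nTest = 0;
   do i = 1 to 100000;
      /* Each observation independently lands in the test set with probability 0.4 */
      if rand('uniform') < 0.4 then nTest + 1;
   end;
   pctTest = 100 * nTest / 100000;
   put nTest= pctTest=;      /* writes the simulated count and percentage to the log */
run;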

Output 12.2.1 shows the "Performance Information" table. This table shows that the HPPLS procedure executes in single-machine mode on four threads (the client machine has four CPUs). You can set the number of threads that are used on each machine by specifying the NTHREADS= option in the PERFORMANCE statement.

Output 12.2.1: Performance Information in Single-Machine Mode

The HPPLS Procedure

Performance Information
Execution Mode       Single-Machine
Number of Threads    4
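
As a sketch of how the NTHREADS= option might be used, the following statements repeat the single-machine run but request two threads. The value 2 is an arbitrary choice for illustration and is not taken from the original example. With this setting, the "Performance Information" table would be expected to report two threads instead of four.

proc hppls data=ex2Data method=simpls cvtest(stat=press seed=12345);
   class z;
   model y: = x: z:;
   partition fraction(test=0.4 seed=67890);
   /* NTHREADS=2 is an illustrative value, not a recommendation */
   performance details nthreads=2;
run;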



Output 12.2.2 shows timing information for the PROC HPPLS run. This table is produced when you specify the DETAILS option in the PERFORMANCE statement. You can see that, in this case, the majority of time is spent fitting a PLS model.

Output 12.2.2: Timing in Single-Machine Mode

Procedure Task Timing
Task                           Seconds    Percent
Reading and Levelizing Data       0.71      0.64%
Fitting Model                   110.53     99.36%



To switch to running PROC HPPLS in distributed mode, specify valid values for the NODES=, INSTALL=, and HOST= options in the PERFORMANCE statement. An alternative to specifying the INSTALL= and HOST= options in the PERFORMANCE statement is to use the OPTIONS SET commands to set appropriate values for the GRIDHOST and GRIDINSTALLLOC environment variables. For information about setting these options or environment variables, see the section Processing Modes.

Note: Distributed mode requires SAS High-Performance Statistics.
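
The environment-variable alternative might look like the following sketch. The quoted values are placeholders that you must replace with the host name and installation path of your own grid. When the environment variables are set this way, the PERFORMANCE statement is assumed to need only the NODES= option in order to request distributed execution.

/* Placeholder values; substitute your own grid host and install location */
options set=GRIDHOST="<< your grid host >>";
options set=GRIDINSTALLLOC="<< your grid install location >>";

proc hppls data=ex2Data method=simpls cvtest(stat=press seed=12345);
   class z;
   model y: = x: z:;
   partition fraction(test=0.4 seed=67890);
   performance details nodes=4;
run;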

The following statements provide an example. To run these statements successfully, you need to set the macro variables GRIDHOST and GRIDINSTALLLOC to resolve to appropriate values, or you can replace the references to macro variables with appropriate values.

proc hppls data=ex2Data method=simpls cvtest(stat=press seed=12345);
   class z;
   model y: = x: z:;
   partition fraction(test=0.4 seed=67890);
   performance details nodes = 4
               host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
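
One way to define the two macro variables that the preceding statements reference is with %LET statements, which must be submitted before the PROC HPPLS step. This is a minimal sketch. The values shown are placeholders, not real host names or paths.

/* Placeholder values; substitute your own grid host and install location */
%let GRIDHOST=<< your grid host >>;
%let GRIDINSTALLLOC=<< your grid install location >>;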

The execution mode in the "Performance Information" table shown in Output 12.2.3 indicates that the calculations were performed in a distributed environment that uses four nodes, each of which uses 32 threads.

Output 12.2.3: Performance Information in Distributed Mode

Performance Information
Host Node                     << your grid host >>
Install Location              << your grid install location >>
Execution Mode                Distributed
Number of Compute Nodes       4
Number of Threads per Node    32



Another indication of distributed execution is the following message, which is issued by all high-performance analytical procedures (with the corresponding procedure name) in the SAS log:

NOTE: The HPPLS procedure is executing in the distributed
      computing environment with 4 worker nodes.

Output 12.2.4 shows timing information for this distributed run of the HPPLS procedure. As in single-machine mode, the majority of time in the distributed run is spent fitting the model.

Output 12.2.4: Timing in Distributed Mode

Procedure Task Timing
Task                           Seconds    Percent
Distributing Data                 1.54     16.40%
Reading and Levelizing Data       0.37      3.91%
Fitting Model                     7.29     77.73%
Waiting on Client                 0.18      1.96%
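
To put the two timing tables side by side, the following DATA step is a small sketch that computes ratios from the values reported in Output 12.2.2 and Output 12.2.4. The variable names are illustrative. For these particular runs, the model-fitting task is roughly 15 times faster in distributed mode and the overall run is roughly 12 times faster, with data distribution and waiting on the client accounting for the difference. These figures come from single runs on different hardware (four threads on one machine versus four nodes with 32 threads each), so they are indicative only.

data _null_;
   /* Timings in seconds, copied from Output 12.2.2 and Output 12.2.4 */
   singleFit   = 110.53;
   singleTotal = 0.71 + 110.53;
   distFit     = 7.29;
   distTotal   = 1.54 + 0.37 + 7.29 + 0.18;

   /* Ratios of single-machine time to distributed time */
   fitSpeedup   = singleFit / distFit;
   totalSpeedup = singleTotal / distTotal;

   /* Share of the distributed run spent distributing data and waiting on the client */
   overheadPct  = 100 * (1.54 + 0.18) / distTotal;

   put fitSpeedup= 6.2 totalSpeedup= 6.2 overheadPct= 6.2;
run;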