Shared Concepts and Topics


Determining Single-Machine Mode or Distributed Mode

High-performance analytical procedures use the following rules to determine whether they run in single-machine mode or distributed mode:

  • If a grid host is not specified, the analysis is carried out in single-machine mode on the client machine that runs the SAS session.

  • If a grid host is specified, the behavior depends on where the data reside. If the data are local to the client (that is, not stored in the distributed database or HDFS on the appliance), you need to use the NODES= option in the PERFORMANCE statement to specify the number of nodes on the appliance or cluster that you want to engage in the analysis. If the procedure executes alongside the database or alongside HDFS (that is, the data are already stored in distributed form on the appliance), you do not need to specify the NODES= option. A minimal sketch of the client-data case follows this list.
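
For the client-data case, a PERFORMANCE statement might look like the following sketch. The host name, node count, data set, and variable names here are placeholders rather than part of the original example; a complete worked example appears later in this section.

proc hplogistic data=work.myClientData;   /* data set resides on the client */
   model y = x1 x2;
   /* hypothetical grid host; engage 8 appliance nodes for client-side data */
   performance host="grid001.example.com" nodes=8;
run;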

The following example shows single-machine and client-data distributed configurations for a data set of 100,000 observations that are simulated from a logistic regression model. This DATA step generates the data:

 
data simData;
   /* temporary arrays that define the eight combinations of the CLASS variables */
   array _a{8} _temporary_ (0,0,0,1,0,1,1,1);
   array _b{8} _temporary_ (0,0,1,0,1,0,1,1);
   array _c{8} _temporary_ (0,1,0,0,1,1,0,1);
   do obsno=1 to 100000;
      /* draw one of the eight class combinations from a tabled probability distribution */
      x  = rantbl(1,0.28,0.18,0.14,0.14,0.03,0.09,0.08,0.06);
      a  = _a{x};
      b  = _b{x};
      c  = _c{x};
      /* continuous regressors */
      x1 = int(ranuni(1)*400);
      x2 = 52 + ranuni(1)*38;
      x3 = ranuni(1)*12;
      /* linear predictor and simulated binary response */
      lp = 6. - 0.015*(1-a) + 0.7*(1-b) + 0.6*(1-c) + 0.02*x1 - 0.05*x2 - 0.1*x3;
      y  = ranbin(1,1,(1/(1+exp(lp))));
      output;
   end;
   drop x lp;
run;

The following statements run PROC HPLOGISTIC to fit a logistic regression model:


proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;

Figure 3.1 shows the results from the analysis.

Figure 3.1: Results from Logistic Regression in Single-Machine Mode

The HPLOGISTIC Procedure

Performance Information
Execution Mode       Single-Machine
Number of Threads    4

Data Access Information
Data           Engine   Role    Path
WORK.SIMDATA   V9       Input   On Client

Model Information
Data Source              WORK.SIMDATA
Response Variable        y
Class Parameterization   GLM
Distribution             Binary
Link Function            Logit
Optimization Technique   Newton-Raphson with Ridging

Parameter Estimates
Parameter     Estimate   Standard Error   DF      t Value   Pr > |t|
Intercept       5.7011         0.2539     Infty     22.45     <.0001
a 0           -0.01020        0.06627     Infty     -0.15     0.8777
a 1                  0              .         .         .          .
b 0             0.7124        0.06558     Infty     10.86     <.0001
b 1                  0              .         .         .          .
c 0             0.8036        0.06456     Infty     12.45     <.0001
c 1                  0              .         .         .          .
x1             0.01975       0.000614     Infty     32.15     <.0001
x2            -0.04728       0.003098     Infty    -15.26     <.0001
x3             -0.1017       0.009470     Infty    -10.74     <.0001



The entries in the "Performance Information" table show that the HPLOGISTIC procedure runs in single-machine mode and uses four threads, which are chosen according to the number of CPUs on the client machine. You can force a certain number of threads on any machine that is involved in the computations by specifying the NTHREADS option in the PERFORMANCE statement. Another indication of execution on the client is the following message, which is issued in the SAS log by all high-performance analytical procedures:


NOTE: The HPLOGISTIC procedure is executing in single-machine mode.
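
As an illustration of the NTHREADS option mentioned above (the thread count here is an arbitrary choice, not part of the original example), the following sketch restricts a single-machine run to two threads:

proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
   /* force two threads on the client instead of the CPU-count default */
   performance nthreads=2;
run;

The "Performance Information" table for such a run should then report the requested number of threads.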

The following statements use 10 nodes (in distributed mode) to analyze the data on the appliance; results appear in Figure 3.2:


proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance host="hpa.sas.com" nodes=10;
run;

Figure 3.2: Results from Logistic Regression in Distributed Mode

The HPLOGISTIC Procedure

Performance Information
Host Node                    hpa.sas.com
Execution Mode               Distributed
Number of Compute Nodes      10
Number of Threads per Node   24

Data Access Information
Data           Engine   Role    Path
WORK.SIMDATA   V9       Input   From Client

Model Information
Data Source              WORK.SIMDATA
Response Variable        y
Class Parameterization   GLM
Distribution             Binary
Link Function            Logit
Optimization Technique   Newton-Raphson with Ridging

Parameter Estimates
Parameter     Estimate   Standard Error   DF      t Value   Pr > |t|
Intercept       5.7011         0.2539     Infty     22.45     <.0001
a 0           -0.01020        0.06627     Infty     -0.15     0.8777
a 1                  0              .         .         .          .
b 0             0.7124        0.06558     Infty     10.86     <.0001
b 1                  0              .         .         .          .
c 0             0.8036        0.06456     Infty     12.45     <.0001
c 1                  0              .         .         .          .
x1             0.01975       0.000614     Infty     32.15     <.0001
x2            -0.04728       0.003098     Infty    -15.26     <.0001
x3             -0.1017       0.009470     Infty    -10.74     <.0001



The specification of a host causes the "Performance Information" table to display the name of the host node of the appliance. The table also indicates that the calculations were performed in a distributed environment on the appliance, using 24 threads on each of the 10 nodes, for a total of 240 threads.

Another indication of distributed execution on the appliance is the following message, which is issued in the SAS log by all high-performance analytical procedures:


NOTE: The HPLOGISTIC procedure is executing in the distributed
      computing environment with 10 worker nodes.

You can override the presence of a grid host and force the computations into single-machine mode by specifying the NODES=0 option in the PERFORMANCE statement:

proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance host="hpa.sas.com" nodes=0;
run;

Figure 3.3 shows the "Performance Information" table. The numeric results are not reproduced here, but they agree with the previous analyses, which are shown in Figure 3.1 and Figure 3.2.

Figure 3.3: Single-Machine Mode Despite Host Specification

The HPLOGISTIC Procedure

Performance Information
Execution Mode       Single-Machine
Number of Threads    4

Data Access Information
Data           Engine   Role    Path
WORK.SIMDATA   V9       Input   On Client



The "Performance Information" table indicates that the HPLOGISTIC procedure executes in single-machine mode on the client. This information is also reported in the following message, which is issued in the SAS log:


NOTE: The HPLOGISTIC procedure is executing in single-machine mode.

In the analysis shown previously in Figure 3.2, the data set Work.simData is local to the client, and the HPLOGISTIC procedure distributed the data to 10 nodes on the appliance. The High-Performance Analytics infrastructure does not keep these data on the appliance. When the procedure terminates, the in-memory representation of the input data on the appliance is freed.

When the input data set is large, the time that is spent sending client-side data to the appliance might dominate the execution time. In practice, transfer speeds are usually lower than the theoretical limits of the network connection or disk I/O rates. At a transfer rate of 40 megabytes per second, sending a 10-gigabyte data set to the appliance requires more than four minutes. If analytic execution time is in the range of seconds, the "performance" of the process is dominated by data movement.
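
The four-minute figure is simple arithmetic; the following DATA step (a back-of-the-envelope sketch, not part of the original example) reproduces it for the stated size and transfer rate:

data _null_;
   size_gb  = 10;                          /* size of the data set in gigabytes      */
   rate_mbs = 40;                          /* assumed transfer rate in MB per second */
   seconds  = (size_gb * 1024) / rate_mbs; /* 10,240 MB / 40 MB/s = 256 seconds      */
   minutes  = seconds / 60;                /* about 4.3 minutes                      */
   put 'Estimated transfer time: ' seconds= 'seconds, ' minutes= 'minutes';
run;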

The alongside-the-database execution model, unique to high-performance analytical procedures, enables you to read and write data in distributed form from the database that is installed on the appliance.
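
As a rough sketch of what alongside-the-database execution can look like (assuming SAS/ACCESS Interface to Teradata; the libref, server name, credentials, and database name are placeholders, and the details depend on your database and installation), you point the procedure at a table that already resides in the appliance database instead of at a client-side data set:

/* hypothetical connection to the database on the appliance */
libname applnc teradata server="hpa.sas.com" user=myuser password=XXXXXX database=mydb;

proc hplogistic data=applnc.simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;

Because the data already reside in distributed form on the appliance, the NODES= option is not needed in this case.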