High-performance analytical procedures use the following rules to determine whether they run in single-machine mode or distributed mode:
If a grid host is not specified, the analysis is carried out in single-machine mode on the client machine that runs the SAS session.
If a grid host is specified, the behavior depends on whether the execution is alongside the database or alongside HDFS. If the data are local to the client (that is, not stored in the distributed database or HDFS on the appliance), you need to use the NODES= option in the PERFORMANCE statement to specify the number of nodes on the appliance or cluster that you want to engage in the analysis. If the procedure executes alongside the database or alongside HDFS, you do not need to specify the NODES= option.
The following example shows single-machine and client-data distributed configurations for a data set of 100,000 observations that are simulated from a logistic regression model. The following DATA step generates the data:
data simData;
   array _a{8} _temporary_ (0,0,0,1,0,1,1,1);
   array _b{8} _temporary_ (0,0,1,0,1,0,1,1);
   array _c{8} _temporary_ (0,1,0,0,1,1,0,1);
   do obsno=1 to 100000;
      x = rantbl(1,0.28,0.18,0.14,0.14,0.03,0.09,0.08,0.06);
      a = _a{x};
      b = _b{x};
      c = _c{x};
      x1 = int(ranuni(1)*400);
      x2 = 52 + ranuni(1)*38;
      x3 = ranuni(1)*12;
      lp = 6. -0.015*(1-a) + 0.7*(1-b) + 0.6*(1-c) + 0.02*x1 -0.05*x2 - 0.1*x3;
      y = ranbin(1,1,(1/(1+exp(lp))));
      output;
   end;
   drop x lp;
run;
The following statements run PROC HPLOGISTIC to fit a logistic regression model:
proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;
Figure 2.1 shows the results from the analysis.
Figure 2.1: Results from Logistic Regression in Single-Machine Mode
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
Model Information | |
---|---|
Data Source | WORK.SIMDATA |
Response Variable | y |
Class Parameterization | GLM |
Distribution | Binary |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error | DF | t Value | Pr > |t| |
Intercept | 5.7011 | 0.2539 | Infty | 22.45 | <.0001 |
a 0 | -0.01020 | 0.06627 | Infty | -0.15 | 0.8777 |
a 1 | 0 | . | . | . | . |
b 0 | 0.7124 | 0.06558 | Infty | 10.86 | <.0001 |
b 1 | 0 | . | . | . | . |
c 0 | 0.8036 | 0.06456 | Infty | 12.45 | <.0001 |
c 1 | 0 | . | . | . | . |
x1 | 0.01975 | 0.000614 | Infty | 32.15 | <.0001 |
x2 | -0.04728 | 0.003098 | Infty | -15.26 | <.0001 |
x3 | -0.1017 | 0.009470 | Infty | -10.74 | <.0001 |
The entries in the “Performance Information” table show that the HPLOGISTIC procedure runs in single-machine mode and uses four threads; the thread count is chosen according to the number of CPUs on the client machine. You can force a certain number of threads on any machine that is involved in the computations by specifying the NTHREADS= option in the PERFORMANCE statement. Another indication of execution on the client is the following message, which is issued in the SAS log by all high-performance analytical procedures:
NOTE: The HPLOGISTIC procedure is executing on the client.
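As a minimal sketch of the NTHREADS= option described above, the following statements repeat the single-machine analysis but request two threads. The value 2 is an arbitrary choice for illustration, not a recommendation:

```sas
proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
   /* nthreads=2 is an illustrative value; it replaces the CPU-based default */
   performance nthreads=2;
run;
```

With this setting, the “Number of Threads” entry in the “Performance Information” table would be expected to show 2 instead of 4, and the parameter estimates should agree with those in Figure 2.1.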
The following statements use 10 nodes (in distributed mode) to analyze the data on the appliance; results appear in Figure 2.2:
proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance host="hpa.sas.com" nodes=10;
run;
Figure 2.2: Results from Logistic Regression in Distributed Mode
Performance Information | |
---|---|
Host Node | hpa.sas.com |
Execution Mode | Distributed |
Grid Mode | Symmetric |
Number of Compute Nodes | 10 |
Number of Threads per Node | 24 |
Model Information | |
---|---|
Data Source | WORK.SIMDATA |
Response Variable | y |
Class Parameterization | GLM |
Distribution | Binary |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error | DF | t Value | Pr > |t| |
Intercept | 5.7011 | 0.2539 | Infty | 22.45 | <.0001 |
a 0 | -0.01020 | 0.06627 | Infty | -0.15 | 0.8777 |
a 1 | 0 | . | . | . | . |
b 0 | 0.7124 | 0.06558 | Infty | 10.86 | <.0001 |
b 1 | 0 | . | . | . | . |
c 0 | 0.8036 | 0.06456 | Infty | 12.45 | <.0001 |
c 1 | 0 | . | . | . | . |
x1 | 0.01975 | 0.000614 | Infty | 32.15 | <.0001 |
x2 | -0.04728 | 0.003098 | Infty | -15.26 | <.0001 |
x3 | -0.1017 | 0.009470 | Infty | -10.74 | <.0001 |
The specification of a host causes the “Performance Information” table to display the name of the host node of the appliance. The “Performance Information” table also indicates that the calculations were performed in a distributed environment on the appliance. Twenty-four threads on each of 10 nodes were used to perform the calculations—for a total of 240 threads.
Another indication of distributed execution on the appliance is the following message, which is issued in the SAS log by all high-performance analytical procedures:
NOTE: The HPLOGISTIC procedure is executing in the distributed computing environment with 10 worker nodes.
You can override the presence of a grid host and force the computations into single-machine mode by specifying the NODES=0 option in the PERFORMANCE statement:
proc hplogistic data=simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance host="hpa.sas.com" nodes=0;
run;
Figure 2.3 shows the “Performance Information” table. The numeric results are not reproduced here, but they agree with the previous analyses, which are shown in Figure 2.1 and Figure 2.2.
Figure 2.3: Single-Machine Mode Despite Host Specification
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
The “Performance Information” table indicates that the HPLOGISTIC procedure executes in single-machine mode on the client. This information is also reported in the following message, which is issued in the SAS log:
NOTE: The HPLOGISTIC procedure is executing on the client.
In the analysis shown previously in Figure 2.2, the data set Work.simData is local to the client, and the HPLOGISTIC procedure distributed the data to 10 nodes on the appliance. The High-Performance Analytics infrastructure does not keep these data on the appliance. When the procedure terminates, the in-memory representation of the input data on the appliance is freed.
When the input data set is large, the time that is spent sending client-side data to the appliance might dominate the execution time. In practice, transfer speeds are usually lower than the theoretical limits of the network connection or disk I/O rates. At a transfer rate of 40 megabytes per second, sending a 10-gigabyte data set to the appliance requires more than four minutes. If analytic execution time is in the range of seconds, the “performance” of the process is dominated by data movement.
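The transfer-time estimate above can be checked with a short DATA step. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes decimal units (1 gigabyte = 1,000 megabytes), which the text does not specify:

```sas
data _null_;
   size_gb       = 10;                       /* data set size quoted in the text     */
   rate_mb_per_s = 40;                       /* transfer rate quoted in the text     */
   size_mb       = size_gb * 1000;           /* assumption: 1 GB = 1000 MB (decimal) */
   seconds       = size_mb / rate_mb_per_s;  /* time spent moving the data           */
   minutes       = seconds / 60;
   put seconds= minutes=;                    /* writes both values to the SAS log    */
run;
```

Under these assumptions the transfer takes 250 seconds, a little over four minutes. With binary units (1 gigabyte = 1,024 megabytes) it takes 256 seconds, so the conclusion is the same either way.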
The alongside-the-database execution model, unique to high-performance analytical procedures, enables you to read data in distributed form from, and write data in distributed form to, the database that is installed on the appliance.