Alongside-the-Database Execution

High-performance analytical procedures interface with the distributed database management system (DBMS) on the appliance in a unique way. If the input data are stored in the DBMS and the grid host is the appliance that houses the data, high-performance analytical procedures create a distributed computing environment in which an analytic process is co-located with the nodes of the DBMS. Data then pass from the DBMS to the analytic process on each node. Instead of moving across the network and possibly back to the client machine, the data pass locally between the processes on each node of the appliance.

Because the analytic processes on the appliance are separate from the database processes, the technique is referred to as alongside-the-database execution in contrast to in-database execution, where the analytic code executes in the database process.

In general, when you have a large amount of input data, you can achieve the best performance from high-performance analytical procedures if execution is alongside the database.

Before you can run alongside the database, you must distribute the data to the appliance. The following statements use the HPDS2 procedure to distribute the data set Work.simData into the mydb database on the hpa.sas.com appliance. In this example, the appliance houses a Greenplum database.

 
option set=GRIDHOST="hpa.sas.com"; 
libname applianc greenplm  
        server  ="hpa.sas.com"  
        user    =XXXXXX 
        password=YYYYY 
        database=mydb; 
 
proc datasets lib=applianc nolist; delete simData; 
proc hpds2 data=simData 
           out =applianc.simData(distributed_by='distributed randomly'); 
  performance commit=10000 nodes=all; 
  data DS2GTF.out; 
     method run(); 
        set DS2GTF.in; 
     end; 
  enddata; 
run; 

If the output table applianc.simData exists, the DATASETS procedure removes the table from the Greenplum database because a DBMS does not usually support replacement operations on tables.

Note that the libref for the output table points to the appliance. The data set option informs the HPDS2 procedure to distribute the records randomly among the data segments of the appliance. The statements that follow the PERFORMANCE statement are the DS2 program that copies the input data to the output data without further transformations.

Because you loaded the data into a database on the appliance, you can use the following HPLOGISTIC statements to perform the analysis on the appliance in the alongside-the-database mode. These statements are almost identical to the first PROC HPLOGISTIC example in the previous section, which executed in single-machine mode.

 
proc hplogistic data=applianc.simData; 
   class a b c; 
   model y = a b c x1 x2 x3; 
run; 

The subtle differences are as follows:

  • The grid host environment variable that you specified in an OPTION SET= command is still in effect.

  • The DATA= option in the high-performance analytical procedure uses a libref that identifies the data source as being housed on the appliance. This libref was specified in a prior LIBNAME statement.

Figure 2.4 shows the results from this analysis. The Performance Information table shows that the execution was in distributed mode. In this case the execution was alongside the Greenplum database. The numeric results agree with the previous analyses, which are shown in Figure 2.1 and Figure 2.2.

Figure 2.4: Alongside-the-Database Execution on Greenplum

The HPLOGISTIC Procedure

Performance Information
Host Node hpa.sas.com
Execution Mode Distributed
Grid Mode Symmetric
Number of Compute Nodes 8
Number of Threads per Node 24

Model Information
Data Source SIMDATA
Response Variable y
Class Parameterization GLM
Distribution Binary
Link Function Logit
Optimization Technique Newton-Raphson with Ridging

Parameter Estimates
Parameter Estimate Standard
Error
DF t Value Pr > |t|
Intercept 5.7011 0.2539 Infty 22.45 <.0001
a 0 -0.01020 0.06627 Infty -0.15 0.8777
a 1 0 . . . .
b 0 0.7124 0.06558 Infty 10.86 <.0001
b 1 0 . . . .
c 0 0.8036 0.06456 Infty 12.45 <.0001
c 1 0 . . . .
x1 0.01975 0.000614 Infty 32.15 <.0001
x2 -0.04728 0.003098 Infty -15.26 <.0001
x3 -0.1017 0.009470 Infty -10.74 <.0001


When high-performance analytical procedures execute symmetrically alongside the database, any nonzero specification of the NODES= option in the PERFORMANCE statement is ignored. If the data are read alongside the database, the number of compute nodes is determined by the layout of the database and cannot be modified. In this example, the appliance contains 16 nodes. (See the Performance Information table.)

However, when high-performance analytical procedures execute asymmetrically alongside the database, the number of compute nodes that you specify in the PERFORMANCE statement can differ from the number of nodes across which the data are partitioned. For an example, see the section Running High-Performance Analytical Procedures in Asymmetric Mode.