Shared Concepts and Topics


Alongside-the-Database Execution

High-performance analytical procedures interface with the distributed database management system (DBMS) on the appliance in a unique way. If the input data are stored in the DBMS and the grid host is the appliance that houses the data, high-performance analytical procedures create a distributed computing environment in which an analytic process is collocated with the nodes of the DBMS. Data then pass from the DBMS to the analytic process on each node. Instead of moving across the network and possibly back to the client machine, the data pass locally between the processes on each node of the appliance.

Because the analytic processes on the appliance are separate from the database processes, the technique is referred to as alongside-the-database execution in contrast to in-database execution, where the analytic code executes in the database process.

In general, when you have a large amount of input data, you can achieve the best performance from high-performance analytical procedures if execution is alongside the database.

Before you can run alongside the database, you must distribute the data to the appliance. The following statements use the HPDS2 procedure to distribute the data set Work.simData into the mydb database on the green.sas.com appliance. In this example, the appliance houses a Greenplum database.


option set=GRIDHOST="green.sas.com";
libname applianc greenplm
        server  ="green.sas.com"
        user    =XXXXXX
        password=YYYYY
        database=mydb;

option set=GRIDHOST="compute_appliance.sas.com";

proc datasets lib=applianc nolist; delete simData; quit;
proc hpds2 data=simData
           out =applianc.simData(distributed_by='distributed randomly');
  performance commit=10000 nodes=all;
  data DS2GTF.out;
     method run();
        set DS2GTF.in;
     end;
  enddata;
run;

If the output table applianc.simData exists, the DATASETS procedure removes the table from the Greenplum database because a DBMS does not usually support replacement operations on tables.
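If the table does not exist yet, the DELETE statement stops the step with an error message. A common variant (a sketch here, not part of the original example) adds the NOWARN option to the PROC DATASETS statement so that the step succeeds whether or not the table is present:

option set=GRIDHOST="compute_appliance.sas.com";

/* NOWARN suppresses the error that DELETE would otherwise
   raise when applianc.simData does not exist */
proc datasets lib=applianc nolist nowarn;
   delete simData;
quit;

This makes the deletion step safe to rerun in a fresh session.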

Note that the libref for the output table points to the appliance. The data set option directs the HPDS2 procedure to distribute the records randomly among the data segments of the appliance. The statements that follow the PERFORMANCE statement form the DS2 program, which copies the input data to the output data without further transformations.
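The DS2 program is not limited to a straight copy; you can also transform records as they are distributed. The following sketch (the output table name simData2 and the derived variable x1sq are hypothetical, introduced only for illustration) adds a squared term while copying:

proc hpds2 data=simData
           out =applianc.simData2(distributed_by='distributed randomly');
   performance commit=10000 nodes=all;
   data DS2GTF.out;
      dcl double x1sq;           /* hypothetical derived column */
      method run();
         set DS2GTF.in;
         x1sq = x1 * x1;         /* transform while copying */
      end;
   enddata;
run;

The structure is the same as in the copy-only example; only the declaration and the assignment inside the RUN method are new.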

Because you loaded the data into a database on the appliance, you can use the following HPLOGISTIC statements to perform the analysis on the appliance in the alongside-the-database mode. These statements are almost identical to the first PROC HPLOGISTIC example in a previous section, which executed in single-machine mode.


proc hplogistic data=applianc.simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;

The subtle differences are as follows:

  • The grid host environment variable that you specified in an OPTION SET= command is still in effect.

  • The DATA= option in the high-performance analytical procedure uses a libref that identifies the data source as being housed on the appliance. This libref was specified in a prior LIBNAME statement.
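Put together, and assuming the data have already been distributed as shown earlier, the only statements needed to run the analysis alongside the database are the grid host option, the appliance libref, and the PROC step (host names and credential placeholders are reused from the earlier example):

option set=GRIDHOST="compute_appliance.sas.com";

libname applianc greenplm
        server  ="green.sas.com"
        user    =XXXXXX
        password=YYYYY
        database=mydb;

proc hplogistic data=applianc.simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;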

Figure 3.5 shows the results from this analysis. The "Performance Information" table shows that the execution was in distributed mode, and the "Data Access Information" table shows that the data were accessed asymmetrically in parallel from the Greenplum database. The numeric results agree with the previous analyses, which are shown in Figure 3.1 and Figure 3.2.

Figure 3.5: Alongside-the-Database Execution on Greenplum

The HPLOGISTIC Procedure

Performance Information
Host Node                     compute_appliance.sas.com
Execution Mode                Distributed
Number of Compute Nodes       141
Number of Threads per Node    32

Data Access Information
Data               Engine     Role    Path
APPLIANC.SIMDATA   GREENPLM   Input   Parallel, Asymmetric

Model Information
Data Source               APPLIANC.SIMDATA
Response Variable         y
Class Parameterization    GLM
Distribution              Binary
Link Function             Logit
Optimization Technique    Newton-Raphson with Ridging

Parameter Estimates
Parameter     Estimate    Standard Error    DF       t Value    Pr > |t|
Intercept       5.7011          0.2539      Infty      22.45      <.0001
a 0           -0.01020         0.06627      Infty      -0.15      0.8777
a 1                  0               .          .          .           .
b 0             0.7124         0.06558      Infty      10.86      <.0001
b 1                  0               .          .          .           .
c 0             0.8036         0.06456      Infty      12.45      <.0001
c 1                  0               .          .          .           .
x1             0.01975        0.000614      Infty      32.15      <.0001
x2            -0.04728        0.003098      Infty     -15.26      <.0001
x3             -0.1017        0.009470      Infty     -10.74      <.0001