High-performance analytical procedures interface with the distributed database management system (DBMS) on the appliance in a unique way. If the input data are stored in the DBMS and the grid host is the appliance that houses the data, high-performance analytical procedures create a distributed computing environment in which an analytic process is co-located with the nodes of the DBMS. Data then pass from the DBMS to the analytic process on each node. Instead of moving across the network and possibly back to the client machine, the data pass locally between the processes on each node of the appliance.
Because the analytic processes on the appliance are separate from the database processes, the technique is referred to as alongside-the-database execution in contrast to in-database execution, where the analytic code executes in the database process.
In general, when you have a large amount of input data, you can achieve the best performance from high-performance analytical procedures if execution is alongside the database.
Before you can run alongside the database, you must distribute the data to the appliance. The following statements use the HPDS2 procedure to distribute the data set Work.simData into the mydb database on the hpa.sas.com appliance. In this example, the appliance houses a Greenplum database.
```
option set=GRIDHOST="hpa.sas.com";

libname applianc greenplm
        server  ="hpa.sas.com"
        user    =XXXXXX
        password=YYYYY
        database=mydb;

proc datasets lib=applianc nolist;
   delete simData;

proc hpds2 data=simData
           out=applianc.simData(distributed_by='distributed randomly');
   performance commit=10000 nodes=all;
   data DS2GTF.out;
      method run();
         set DS2GTF.in;
      end;
   enddata;
run;
```
If the output table applianc.simData exists, the DATASETS procedure removes the table from the Greenplum database, because a DBMS does not usually support replacement operations on tables.
Note that the libref for the output table points to the appliance. The data set option informs the HPDS2 procedure to distribute the records randomly among the data segments of the appliance. The statements that follow the PERFORMANCE statement are the DS2 program that copies the input data to the output data without further transformations.
Because you loaded the data into a database on the appliance, you can use the following HPLOGISTIC statements to perform the analysis on the appliance in the alongside-the-database mode. These statements are almost identical to the first PROC HPLOGISTIC example in the previous section, which executed in single-machine mode.
```
proc hplogistic data=applianc.simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;
```
The subtle differences are as follows:
- The grid host environment variable that you specified in an OPTION SET= command is still in effect.
- The DATA= option in the high-performance analytical procedure uses a libref that identifies the data source as being housed on the appliance. This libref was specified in a prior LIBNAME statement.
Figure 2.4 shows the results from this analysis. The “Performance Information” table shows that the execution was in distributed mode. In this case the execution was alongside the Greenplum database. The numeric results agree with the previous analyses, which are shown in Figure 2.1 and Figure 2.2.
Figure 2.4: Alongside-the-Database Execution on Greenplum
| Performance Information | |
|---|---|
| Host Node | hpa.sas.com |
| Execution Mode | Distributed |
| Grid Mode | Symmetric |
| Number of Compute Nodes | 8 |
| Number of Threads per Node | 24 |
| Model Information | |
|---|---|
| Data Source | SIMDATA |
| Response Variable | y |
| Class Parameterization | GLM |
| Distribution | Binary |
| Link Function | Logit |
| Optimization Technique | Newton-Raphson with Ridging |
| Parameter | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 5.7011 | 0.2539 | Infty | 22.45 | <.0001 |
| a 0 | -0.01020 | 0.06627 | Infty | -0.15 | 0.8777 |
| a 1 | 0 | . | . | . | . |
| b 0 | 0.7124 | 0.06558 | Infty | 10.86 | <.0001 |
| b 1 | 0 | . | . | . | . |
| c 0 | 0.8036 | 0.06456 | Infty | 12.45 | <.0001 |
| c 1 | 0 | . | . | . | . |
| x1 | 0.01975 | 0.000614 | Infty | 32.15 | <.0001 |
| x2 | -0.04728 | 0.003098 | Infty | -15.26 | <.0001 |
| x3 | -0.1017 | 0.009470 | Infty | -10.74 | <.0001 |
When high-performance analytical procedures execute symmetrically alongside the database, any nonzero specification of the NODES= option in the PERFORMANCE statement is ignored. If the data are read alongside the database, the number of compute nodes is determined by the layout of the database and cannot be modified. In this example, the appliance contains eight nodes. (See the "Performance Information" table.)
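As an illustrative sketch (these statements do not appear in the examples above), specifying a NODES= value in symmetric alongside-the-database mode has no effect; the compute-node count still follows the database layout:

```
proc hplogistic data=applianc.simData;
   /* NODES=4 is ignored in symmetric alongside-the-database  */
   /* mode; the node count follows the Greenplum data layout. */
   performance nodes=4;
   class a b c;
   model y = a b c x1 x2 x3;
run;
```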
However, when high-performance analytical procedures execute asymmetrically alongside the database, the number of compute nodes that you specify in the PERFORMANCE statement can differ from the number of nodes across which the data are partitioned. For an example, see the section Running High-Performance Analytical Procedures in Asymmetric Mode.