Shared Concepts and Topics


Alongside-HDFS Execution by Using the SASHDAT Engine

If the grid host is a cluster that houses data that have been distributed by using the SASHDAT engine, then high-performance analytical procedures can analyze those data in the alongside-HDFS mode. The procedures use the distributed computing environment in which an analytic process is collocated with the nodes of the cluster. Data then pass from HDFS to the analytic process on each node of the cluster.

Before you can run a procedure alongside HDFS, you must distribute the data to the cluster. The following statements use the SASHDAT engine to distribute to HDFS the simData data set that was used in the previous two sections:


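/* Identify the grid host for the SAS session */
option set=GRIDHOST="hpa.sas.com";

/* Assign a SASHDAT libref that points to the /hps directory in HDFS */
libname hdatLib sashdat
        path="/hps";

/* Distribute simData to HDFS, overwriting any existing copy */
data hdatLib.simData (replace = yes);
    set simData;
run;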

In this example, the GRIDHOST is a cluster where the SAS Data in HDFS Engine is installed. If a data set that is named simData already exists in the hps directory in HDFS, it is overwritten because the REPLACE=YES data set option is specified. For more information about using this LIBNAME statement, see the section "LIBNAME Statement for the SAS Data in HDFS Engine" in the SAS LASR Analytic Server: Reference Guide.

The following HPLOGISTIC procedure statements perform the analysis in alongside-HDFS mode. These statements are almost identical to the PROC HPLOGISTIC example in the previous two sections, which executed in single-machine mode and alongside-the-database distributed mode, respectively.
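
The procedure statements themselves do not appear in this text. A minimal sketch is given here, reconstructed from the tables in Figure 3.12 and Figure 3.13: the input data set is hdatLib.simData, the response is y, the classification variables are a, b, and c, and the continuous effects are x1, x2, and x3. The absence of a PERFORMANCE statement is an assumption; the sketch relies on the GRIDHOST option that was set earlier.

```sas
/* Sketch only: CLASS and MODEL effects are inferred from the        */
/* "Parameter Estimates" table; no PERFORMANCE statement is assumed. */
proc hplogistic data=hdatLib.simData;
   class a b c;                     /* classification variables      */
   model y = a b c x1 x2 x3;        /* binary response, logit link   */
run;
```

Under these assumptions, the only difference from a single-machine run would be the DATA= option, which references the data through the SASHDAT libref instead of a local library.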

Figure 3.12 shows the "Performance Information" and "Data Access Information" tables. You see that the procedure ran in distributed mode and that the input data were read in parallel symmetric mode. The numeric results shown in Figure 3.13 agree with the previous analyses shown in Figure 3.1, Figure 3.2, and Figure 3.5.

Figure 3.12: Alongside-HDFS Execution Performance Information

The HPLOGISTIC Procedure

Performance Information
Host Node                     hpa.sas.com
Execution Mode                Distributed
Number of Compute Nodes       13
Number of Threads per Node    24

Data Access Information
Data               Engine     Role     Path
HDATLIB.SIMDATA    SASHDAT    Input    Parallel, Symmetric



Figure 3.13: Alongside-HDFS Execution Model Information

Model Information
Data Source               HDATLIB.SIMDATA
Response Variable         y
Class Parameterization    GLM
Distribution              Binary
Link Function             Logit
Optimization Technique    Newton-Raphson with Ridging

Parameter Estimates
Parameter    Estimate    Standard Error    DF       t Value    Pr > |t|
Intercept    5.7011      0.2539            Infty    22.45      <.0001
a 0          -0.01020    0.06627           Infty    -0.15      0.8777
a 1          0           .                 .        .          .
b 0          0.7124      0.06558           Infty    10.86      <.0001
b 1          0           .                 .        .          .
c 0          0.8036      0.06456           Infty    12.45      <.0001
c 1          0           .                 .        .          .
x1           0.01975     0.000614          Infty    32.15      <.0001
x2           -0.04728    0.003098          Infty    -15.26     <.0001
x3           -0.1017     0.009470          Infty    -10.74     <.0001