If the grid host is a cluster that houses data that have been distributed by using the SASHDAT engine, then high-performance analytical procedures can analyze those data in the alongside-HDFS mode. The procedures use the distributed computing environment in which an analytic process is co-located with the nodes of the cluster. Data then pass from HDFS to the analytic process on each node of the cluster.
Before you can run a procedure alongside HDFS, you must distribute the data to the cluster. The following statements use the SASHDAT engine to distribute to HDFS the simData data set that was used in the previous two sections:
option set=GRIDHOST="hpa.sas.com";
libname hdatLib sashdat path="/hps";
data hdatLib.simData (replace = yes);
   set simData;
run;
In this example, the GRIDHOST is a cluster where the SAS Data in HDFS Engine is installed. If a data set that is named simData already exists in the hps directory in HDFS, it is overwritten because the REPLACE=YES data set option is specified. For more information about using this LIBNAME statement, see the section “LIBNAME Statement for the SAS Data in HDFS Engine” in the SAS LASR Analytic Server: Administration Guide.
The following HPLOGISTIC procedure statements perform the analysis in alongside-HDFS mode. These statements are almost identical to the PROC HPLOGISTIC example in the previous two sections, which executed in single-machine mode and alongside-the-database distributed mode, respectively.
proc hplogistic data=hdatLib.simData;
   class a b c;
   model y = a b c x1 x2 x3;
run;
Figure 2.11 shows the “Performance Information” table. You see that the procedure ran in distributed mode. The numeric results shown in Figure 2.12 agree with the previous analyses shown in Figure 2.1, Figure 2.2, and Figure 2.4.
Figure 2.11: Alongside-HDFS Execution Performance Information
Performance Information | |
---|---|
Host Node | hpa.sas.com |
Execution Mode | Distributed |
Grid Mode | Symmetric |
Number of Compute Nodes | 206 |
Number of Threads per Node | 8 |
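
The “Performance Information” table reports where and how the procedure ran, but not how long each stage took. As a minimal sketch, assuming that the PERFORMANCE statement and its DETAILS option work in PROC HPLOGISTIC as they are generally documented for the high-performance analytical procedures, the same analysis could also request a timing breakdown. It also assumes that the GRIDHOST setting and the hdatLib libref from the earlier statements are still in effect.

```sas
/* Sketch only: same model as above, with a request for timing details.   */
/* Assumes GRIDHOST and the hdatLib libref are already defined as shown   */
/* earlier in this section.                                               */
proc hplogistic data=hdatLib.simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance details;   /* assumption: DETAILS adds a timing table to the output */
run;
```

No NODES= option is specified here. The original example runs in distributed mode without one, which suggests that the execution mode follows from the location of the input data.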
Figure 2.12: Alongside-HDFS Execution Model Information
Model Information | |
---|---|
Data Source | HDATLIB.SIMDATA |
Response Variable | y |
Class Parameterization | GLM |
Distribution | Binary |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
Intercept | 5.7011 | 0.2539 | Infty | 22.45 | <.0001 |
a 0 | -0.01020 | 0.06627 | Infty | -0.15 | 0.8777 |
a 1 | 0 | . | . | . | . |
b 0 | 0.7124 | 0.06558 | Infty | 10.86 | <.0001 |
b 1 | 0 | . | . | . | . |
c 0 | 0.8036 | 0.06456 | Infty | 12.45 | <.0001 |
c 1 | 0 | . | . | . | . |
x1 | 0.01975 | 0.000614 | Infty | 32.15 | <.0001 |
x2 | -0.04728 | 0.003098 | Infty | -15.26 | <.0001 |
x3 | -0.1017 | 0.009470 | Infty | -10.74 | <.0001 |
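
As a quick consistency check on the table above, and assuming that the t Value column is the usual Wald ratio of an estimate to its standard error, the intercept row can be reproduced by hand:

```latex
t = \frac{\hat{\beta}}{\operatorname{SE}(\hat{\beta})}
\qquad\Longrightarrow\qquad
t_{\text{Intercept}} = \frac{5.7011}{0.2539} \approx 22.45
```

The same ratio for b 0 gives 0.7124 / 0.06558 ≈ 10.86, which matches the table. Because the displayed estimates and standard errors are rounded, a recomputed ratio can differ slightly from the printed value. For x1, for example, 0.01975 / 0.000614 ≈ 32.17, whereas the table shows 32.15. A DF value of Infty indicates that the p-values are based on the limiting (standard normal) distribution of the statistic.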