Shared Concepts and Topics


Alongside-HDFS Execution by Using the Hadoop Engine

The following LIBNAME statement sets up a libref that you can use to access data that are stored in HDFS and have metadata in Hive:


libname hdoopLib hadoop
        server       = "hpa.sas.com"
        user         = XXXXX
        password     = YYYYY
        database     = myDB
        config       = "demo.xml" ;

For more information about the LIBNAME options that are available for the Hadoop engine, see the LIBNAME topic in the Hadoop section of SAS/ACCESS for Relational Databases: Reference. The configuration file that you specify in the CONFIG= option contains the information that is needed to access the Hive server. It also contains information that enables the same file to be used to access data in HDFS without using the Hive server, and it can specify the replication factors and block sizes that are used when the engine writes data to HDFS.

The following DATA step uses the Hadoop engine to distribute to HDFS the simData data set that was used in the previous sections. The engine creates metadata for the data set in Hive.

data hdoopLib.simData;
    set simData;
run;
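
After the DATA step finishes, it can be useful to confirm that the table is visible through the libref before you run an analysis. The following statements are a minimal sketch of such a check, not a required step. They assume that the hdoopLib libref from the preceding LIBNAME statement is still assigned, and the column name nRows is chosen only for illustration.

```sas
/* Sketch: list the columns that are visible for the table through the libref */
proc contents data=hdoopLib.simData;
run;

/* Sketch: count the rows that can be read through the libref.  */
/* nRows is an illustrative column name.                        */
proc sql;
    select count(*) as nRows from hdoopLib.simData;
quit;
```

If the load succeeded, the row count should match the number of observations in the local simData data set.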

After you have loaded data, or if you are accessing preexisting data in HDFS that have metadata in Hive, you can access these data alongside HDFS by using high-performance analytical procedures. The following HPLOGISTIC procedure statements perform the analysis in alongside-HDFS mode. These statements are similar to the PROC HPLOGISTIC example in the previous sections.


proc hplogistic data=hdoopLib.simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance host = "compute_appliance.sas.com";
run;

Figure 3.12 shows the "Performance Information" and "Data Access Information" tables. You see that the procedure ran in distributed mode and that the input data were read in parallel asymmetric mode. The numeric results shown in Figure 3.13 agree with the previous analyses.

Figure 3.12: Alongside-HDFS Execution by Using the Hadoop Engine

The HPLOGISTIC Procedure

Performance Information
Host Node                     compute_appliance.sas.com
Execution Mode                Distributed
Number of Compute Nodes       141
Number of Threads per Node    32

Data Access Information
Data               Engine    Role     Path
GRIDLIB.SIMDATA    HADOOP    Input    Parallel, Asymmetric



Figure 3.13: Alongside-HDFS Execution by Using the Hadoop Engine

Model Information
Data Source               GRIDLIB.SIMDATA
Response Variable         y
Class Parameterization    GLM
Distribution              Binary
Link Function             Logit
Optimization Technique    Newton-Raphson with Ridging

Parameter Estimates
Parameter    Estimate    Standard Error    DF       t Value    Pr > |t|
Intercept    5.7011      0.2539            Infty    22.45      <.0001
a 0          -0.01020    0.06627           Infty    -0.15      0.8777
a 1          0           .                 .        .          .
b 0          0.7124      0.06558           Infty    10.86      <.0001
b 1          0           .                 .        .          .
c 0          0.8036      0.06456           Infty    12.45      <.0001
c 1          0           .                 .        .          .
x1           0.01975     0.000614          Infty    32.15      <.0001
x2           -0.04728    0.003098          Infty    -15.26     <.0001
x3           -0.1017     0.009470          Infty    -10.74     <.0001
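
The PERFORMANCE statement in the preceding analysis specifies only the host. As a hedged variation, the following sketch requests asymmetric mode explicitly by using the GRIDMODE= option and adds the DETAILS option to request a timing breakdown. It assumes the same libref and appliance host as before; whether the explicit GRIDMODE= setting changes anything depends on how your environment is configured.

```sas
/* Sketch: same model as before, with explicit performance options */
proc hplogistic data=hdoopLib.simData;
   class a b c;
   model y = a b c x1 x2 x3;
   performance host     = "compute_appliance.sas.com"
               gridmode = asym    /* request asymmetric mode explicitly */
               details;           /* request a timing breakdown table   */
run;
```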



The Hadoop engine also enables you to access tables in HDFS that are stored in various formats and that are not registered in Hive. You can use the HDMD procedure to generate metadata for tables that are stored in the following file formats:

  • delimited text

  • fixed-record length binary

  • sequence files

  • XML text

To read any other kind of file in Hadoop, you can write a custom file reader plug-in in Java for use with PROC HDMD. For more information about LIBNAME options available for the Hadoop engine, see the LIBNAME topic in the Hadoop section of SAS/ACCESS for Relational Databases: Reference.

The following example shows how you can use PROC HDMD to register metadata for CSV data independently from Hive and then analyze these data by using high-performance analytical procedures. The CSV data in the table csvExample.csv are stored in HDFS in the directory /user/demo/data. Each record in this table consists of the following fields, in the order shown and separated by commas:

  1. a string of at most six characters

  2. a numeric field with values of 0 or 1

  3. a numeric field with real numbers

Suppose you want to fit a logistic regression model to these data, where the second field represents a target variable named Success, the third field represents a regressor named Dose, and the first field represents a classification variable named Group.
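
To make the record layout concrete, the following DATA step is a minimal sketch that writes a few records in this three-field format to a local file. The values are synthetic and purely illustrative, as are the seed, the record count, and the local file location. The sketch does not copy the file to HDFS; that transfer is a separate step that depends on your installation.

```sas
/* Sketch: write synthetic records in the layout Group,Success,Dose */
data _null_;
    length Group $6;                        /* string of at most six characters */
    file "csvExample.csv" dsd dlm=',';      /* comma-separated output, local file */
    call streaminit(12345);                 /* illustrative seed */
    do i = 1 to 10;                         /* illustrative record count */
        Group   = cats("group", 1 + mod(i, 3));   /* group1, group2, or group3 */
        Success = rand("bernoulli", 0.5);         /* numeric field with values 0 or 1 */
        Dose    = rand("uniform");                /* numeric field with real values */
        put Group Success Dose;                   /* field order matches the list above */
    end;
run;
```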

The first step is to use PROC HDMD to create metadata that are needed to interpret the table, as in the following statements:


libname hdoopLib hadoop
                 server       = "hpa.sas.com"
                 user         = XXXXX
                 password     = YYYYY
                 HDFS_PERMDIR = "/user/demo/data"
                 HDFS_METADIR = "/user/demo/meta"
                 config       = "demo.xml"
                 DBCREATE_TABLE_EXTERNAL=YES;

proc hdmd name=hdoopLib.csvExample data_file='csvExample.csv'
          format=delimited encoding=utf8 sep = ',';

     column Group    char(6);
     column Success  double;
     column Dose     double;
run;
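
After the PROC HDMD step runs, a quick look at the table through the libref can confirm that the column definitions match the file. The following statements are a sketch of such a check. They assume that the Hadoop engine in your installation can read the newly registered table in an ordinary procedure step, and the five-row limit is arbitrary.

```sas
/* Sketch: print the first few records as interpreted by the metadata */
proc print data=hdoopLib.csvExample(obs=5);
run;

/* Sketch: check that Group has the expected levels and that Success is 0 or 1 */
proc freq data=hdoopLib.csvExample;
    tables Group Success;
run;
```

Misaligned or missing values at this stage usually point to a column definition that does not match the field order or type in the file.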

The metadata that are created by PROC HDMD for this table are stored in the directory /user/demo/meta that you specified in the HDFS_METADIR= option in the preceding LIBNAME statement. After you create the metadata, you can execute high-performance analytical procedures with these data by using the hdoopLib libref. For example, the following statements fit a logistic regression model to the CSV data that are stored in the csvExample.csv table.

proc hplogistic data=hdoopLib.csvExample;
    class Group;
    model Success = Dose;
    performance host     = "compute_appliance.sas.com"
                gridmode = asym;
run;

Figure 3.14 shows the results of this analysis. You see that the procedure ran in distributed mode and that the input data were read in parallel asymmetric mode. The metadata that you created by using the HDMD procedure have been used successfully in executing this analysis.

Figure 3.14: Alongside-HDFS Execution with CSV Data

The HPLOGISTIC Procedure

Performance Information
Host Node                     compute_appliance.sas.com
Execution Mode                Distributed
Number of Compute Nodes       141
Number of Threads per Node    32

Data Access Information
Data                  Engine    Role     Path
GRIDLIB.CSVEXAMPLE    HADOOP    Input    Parallel, Asymmetric

Model Information
Data Source               GRIDLIB.CSVEXAMPLE
Response Variable         Success
Class Parameterization    GLM
Distribution              Binary
Link Function             Logit
Optimization Technique    Newton-Raphson with Ridging

Class Level Information
Class    Levels    Values
Group    3         group1 group2 group3

Number of Observations Read    1000
Number of Observations Used    1000

Parameter Estimates
Parameter    Estimate    Standard Error    DF       t Value    Pr > |t|
Intercept    0.1243      0.1295            Infty    0.96       0.3371
Dose         -0.2674     0.2216            Infty    -1.21      0.2277
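
The preceding statements name Group in the CLASS statement but do not include it in the MODEL statement, which is why the "Parameter Estimates" table contains only the intercept and Dose. If you also want to estimate the effect of Group, a variation along the following lines might be used. This is a sketch under the same libref and host assumptions as above, and its results are not shown here.

```sas
/* Sketch: include the classification variable Group as a model effect */
proc hplogistic data=hdoopLib.csvExample;
    class Group;
    model Success = Group Dose;
    performance host     = "compute_appliance.sas.com"
                gridmode = asym;
run;
```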