The following LIBNAME statement sets up a libref that you can use to access data that are stored in HDFS and have metadata in Hive:
libname hdoopLib hadoop server = "hpa.sas.com" user = XXXXX password = YYYYY database = myDB config = "demo.xml" ;
For more information about LIBNAME options available for the Hadoop engine, see the LIBNAME topic in the Hadoop section of
SAS/ACCESS for Relational Databases: Reference. The configuration file that you specify in the CONFIG= option contains information that is needed to access the Hive server.
It also contains information that enables this configuration file to be used to access data in HDFS without using the Hive
server. This information can also be used to specify replication factors and block sizes that are used when the engine writes
data to HDFS. The following XML shows the contents of the file demo.xml
that is used in this example:
<configuration> <property> <name>fs.default.name</name> <value>hdfs://hpa.sas.com:8020</value> </property> <property> <name>mapred.job.tracker</name> <value>hpa.sas.com:8021</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.block.size</name> <value>33554432</value> </property> </configuration>
The following DATA step uses the Hadoop engine to distribute to HDFS the simData
data set that was used in the previous sections. The engine creates metadata for the data set in Hive.
data hdoopLib.simData; set simData; run;
After you have loaded data or if you are accessing preexisting data in HDFS that have metadata in Hive, you can access this data alongside HDFS by using high-performance analytics procedures. The following HPLOGISTIC procedure statements perform the analysis in alongside-HDFS mode. These statements are similar to the PROC HPLOGISTIC example in the previous sections. However, whenever you use the Hadoop engine, you must execute the analysis in asymmetric mode to cause the execution to occur alongside HDFS.
proc hplogistic data=hdoopLib.simData; class a b c; model y = a b c x1 x2 x3; performance host = "compute_appliance.sas.com" gridmode = asym; run;
Figure 3.13 shows the “Performance Information” table. You see that the procedure ran asymmetrically in distributed mode. The numeric results shown in Figure 3.14 agree with the previous analyses.
Figure 3.13: Alongside-HDFS Execution by Using the Hadoop Engine
Performance Information | |
---|---|
Host Node | compute_appliance.sas.com |
Execution Mode | Distributed |
Grid Mode | Asymmetric |
Number of Compute Nodes | 15 |
Number of Threads per Node | 24 |
Figure 3.14: Alongside-HDFS Execution by Using the Hadoop Engine
Model Information | |
---|---|
Data Source | HDOOPLIB.SIMDATA |
Response Variable | y |
Class Parameterization | GLM |
Distribution | Binary |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error |
DF | t Value | Pr > |t| |
Intercept | 5.7011 | 0.2539 | Infty | 22.45 | <.0001 |
a 0 | -0.01020 | 0.06627 | Infty | -0.15 | 0.8777 |
a 1 | 0 | . | . | . | . |
b 0 | 0.7124 | 0.06558 | Infty | 10.86 | <.0001 |
b 1 | 0 | . | . | . | . |
c 0 | 0.8036 | 0.06456 | Infty | 12.45 | <.0001 |
c 1 | 0 | . | . | . | . |
x1 | 0.01975 | 0.000614 | Infty | 32.15 | <.0001 |
x2 | -0.04728 | 0.003098 | Infty | -15.26 | <.0001 |
x3 | -0.1017 | 0.009470 | Infty | -10.74 | <.0001 |
The Hadoop engine also enables you to access tables in HDFS that are stored in various formats and that are not registered in Hive. You can use the HDMD procedure to generate metadata for tables that are stored in the following file formats:
delimited text
fixed-record length binary
sequence files
XML text
To read any other kind of file in Hadoop, you can write a custom file reader plug-in in Java for use with PROC HDMD. For more information about LIBNAME options available for the Hadoop engine, see the LIBNAME topic in the Hadoop section of SAS/ACCESS for Relational Databases: Reference.
The following example shows how you can use PROC HDMD to register metadata for CSV data independently from Hive and then analyze
these data by using high-performance analytics procedures. The CSV data in the table csvExample.csv
is stored in HDFS in the directory /user/demo/data
. Each record in this table consists of the following fields, in the order shown and separated by commas.
a string of at most six characters
a numeric field with values of 0 or 1
a numeric field with real numbers
Suppose you want to fit a logistic regression model to these data, where the second field represents a target variable named
Success
, the third field represents a regressor named Dose
, and the first field represents a classification variable named Group
.
The first step is to use PROC HDMD to create metadata that are needed to interpret the table, as in the following statements:
libname hdoopLib hadoop server = "hpa.sas.com" user = XXXXX password = YYYYY HDFS_PERMDIR = "/user/demo/data" HDFS_METADIR = "/user/demo/meta" config = "demo.xml" DBCREATE_TABLE_EXTERNAL=YES; proc hdmd name=hdoopLib.csvExample data_file='csvExample.csv' format=delimited encoding=utf8 sep = ','; column Group char(6); column Success double; column Dose double; run;
The metadata that are created by PROC HDMD for this table are stored in the directory /user/demo/meta
that you specified in the HDFS_METADIR = option in the preceding LIBNAME statement. After you create the metadata, you can
execute high-performance analytics procedures with these data by using the hdoopLib libref. For example, the following statements
fit a logistic regression model to the CSV data that are stored in csvExample.csv
table.
proc hplogistic data=hdoopLib.csvExample; class Group; model Success = Dose; performance host = "compute_appliance.sas.com" gridmode = asym; run;
Figure 3.15 shows the results of this analysis. You see that the procedure ran asymmetrically in distributed mode. The metadata that you created by using the HDMD procedure have been used successfully in executing this analysis.
Figure 3.15: Alongside-HDFS Execution with CSV Data
Performance Information | |
---|---|
Host Node | compute_appliance.sas.com |
Execution Mode | Distributed |
Grid Mode | Asymmetric |
Number of Compute Nodes | 15 |
Number of Threads per Node | 24 |
Model Information | |
---|---|
Data Source | GRIDLIB.CSVEXAMPLE |
Response Variable | Success |
Class Parameterization | GLM |
Distribution | Binary |
Link Function | Logit |
Optimization Technique | Newton-Raphson with Ridging |
Class Level Information | ||
---|---|---|
Class | Levels | Values |
Group | 3 | group1 group2 group3 |
Number of Observations Read | 1000 |
---|---|
Number of Observations Used | 1000 |
Parameter Estimates | |||||
---|---|---|---|---|---|
Parameter | Estimate | Standard Error |
DF | t Value | Pr > |t| |
Intercept | 0.1243 | 0.1295 | Infty | 0.96 | 0.3371 |
Dose | -0.2674 | 0.2216 | Infty | -1.21 | 0.2277 |