Bulk Loading for Hadoop

Loading

SAS/ACCESS Interface to Hadoop does not distinguish between a bulk load and a standard load: both use the same underlying load process. Although the BULKLOAD=YES syntax is accepted, it does not change how data is loaded.
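For example, the option can be specified as a data set option without changing the result. This minimal sketch reuses the hdp_air and sasflt librefs that are defined in the Examples section later in this document; the target table name is illustrative.
/* BULKLOAD=YES is accepted here but behaves exactly like a plain load. */
proc sql;
   create table hdp_air.flights98_bulk (bulkload=yes) as
      select * from sasflt.flt98;
quit;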
Here is how bulk loading works in the Hadoop interface.
  1. SAS issues a CREATE TABLE command to the Hive server. The command contains all table metadata (column definitions), plus SAS-specific table properties that refine the Hive metadata to capture maximum string lengths and date/time formats.
  2. SAS initiates an HDFS streaming connection to upload table data to the HDFS /tmp directory. The resulting file is a UTF-8 text file that uses CTRL-A (‘\001’) as a field delimiter and a newline (‘\n’) character as a record separator. These are the defaults that SAS uses for Hive.
  3. SAS issues a LOAD DATA command to the Hive server to move the data file from the HDFS /tmp directory to the appropriate Hive warehouse location. The data file is now considered to be an internal table.
Hive considers a table to be a collection of files in a directory that bears the table name. The CREATE TABLE command creates this directory, either directly in the Hive warehouse or in a subdirectory if a nondefault schema is used. The LOAD DATA command then moves the data file into that directory.
When PROC APPEND is used, the CREATE TABLE command is skipped because the base table already exists. The data to be appended is streamed to the HDFS /tmp directory, and the LOAD DATA command adds it to the table's Hive warehouse directory as an additional file with a unique name. As a result, multiple files exist for the table. When the table is read, Hive concatenates all of its files to render the table as a whole.
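The Hive statements behind steps 1 and 3 are roughly equivalent to the following sketch, written as explicit SQL pass-through so that they can be issued from SAS. The connection options, column definitions, and staged file name are illustrative assumptions and are simpler than what SAS actually generates; step 2, the HDFS streaming upload, has no SQL equivalent.
proc sql;
   connect to hadoop (server='hdpcluster' user=louis password=louispwd);

   /* Step 1 (approximation): define the table and its delimited text layout. */
   execute (
      create table flights98 (flight string, dest string, boarded int)
      row format delimited
         fields terminated by '\001'
         lines terminated by '\n'
      stored as textfile
   ) by hadoop;

   /* Step 3 (approximation): move the staged file from the HDFS /tmp      */
   /* directory into the Hive warehouse. SAS generates a unique file name; */
   /* the name here is a placeholder.                                      */
   execute (
      load data inpath '/tmp/sasdata_placeholder.dlv' into table flights98
   ) by hadoop;

   disconnect from hadoop;
quit;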
Note: These functions are not currently supported.
  • loading HBase tables
  • loading tables that use nontext format and custom serializers and deserializers (serdes)

External Tables

External tables are stored in HDFS but outside of the directories that are set aside for Hive. Reading an external table that was created manually is transparent to SAS because Hive handles the physical location of the file. However, Hadoop bulk loading always creates internal tables within the Hive warehouse directory structure. Follow these manual steps to create an external table (a SAS sketch follows the steps).
  1. Issue the appropriate CREATE TABLE command, using either the Hive command interface or explicit SQL in SAS.
  2. Create the table file in HDFS. One way to do this is by using the Hadoop Access Method.
  3. Issue the LOAD DATA command with the LOCAL keyword, which indicates that the table file is local to HDFS or external to Hive.
For details about the Hadoop Access Method, see the Base SAS Procedures Guide.
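Here is a minimal sketch of one way to carry out these steps from SAS. It uses a CREATE EXTERNAL TABLE statement with a LOCATION clause, which points Hive directly at an HDFS directory so that a separate LOAD DATA step is not needed in this variant. The HDFS directory, local file, column definitions, configuration file, and credentials are assumptions, and PROC HADOOP option names can vary by release.
/* Step 1: define the external table over an HDFS directory, using */
/* explicit SQL pass-through.                                       */
proc sql;
   connect to hadoop (server='hdpcluster' user=louis password=louispwd);
   execute (
      create external table ext_flights (flight string, dest string)
      row format delimited fields terminated by '\001'
      stored as textfile
      location '/user/louis/ext_flights'
   ) by hadoop;
   disconnect from hadoop;
quit;

/* Step 2: place the table file in that HDFS directory with PROC HADOOP */
/* (the Hadoop access method). Hive reads all files in the directory.   */
proc hadoop options='/path/to/hadoop-cfg.xml' username='louis' password='louispwd';
   hdfs copyfromlocal='/local/data/ext_flights.txt'
        out='/user/louis/ext_flights/ext_flights.txt';
run;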

File Formats

SAS can read Hive tables that are formatted in any way that is compatible with Hive. However, Hive tables that SAS creates through bulk loading always use these characteristics:
  • They are stored as text files. Numeric data is expressed as strings of formatted ASCII characters.
  • The field delimiter is a CTRL-A (\001), which is the Hive default.
  • The record delimiter is a newline (\n) character, which is also the Hive default.
  • Null (empty or missing) string values are represented as \N (a backslash followed by a capital N), not to be confused with \n, which is a single newline character.
  • The Hive default SerDe (serializer/deserializer) for text files is used (LazySimpleSerDe).
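Because these files are plain delimited text, they can also be read directly when needed. The following sketch reads one file of a bulk-loaded table through the FILENAME Hadoop access method; the file name, configuration file, credentials, and column layout are assumptions for illustration.
/* Read one warehouse file of a text-format Hive table directly.        */
/* The file name is a placeholder; list the table directory to find the */
/* real file names.                                                      */
filename hdfile hadoop '/user/hive/warehouse/mytable/part_placeholder'
   cfg='/path/to/hadoop-cfg.xml' user='louis' pass='louispwd';

data work.mytable_raw;
   infile hdfile dlm='01'x missover;   /* CTRL-A field delimiter */
   length flight $ 3 dest $ 5;
   input flight $ dest $;
   /* Hive represents missing string values as the literal text \N. */
   if flight = '\N' then call missing(flight);
   if dest = '\N' then call missing(dest);
run;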

File Locations

Internal Hive tables are stored in the Hive warehouse. By default, this is in the HDFS file system under /user/hive/warehouse. Creating a table creates a directory with the table name, and all files in that directory are considered to be part of the table. HDFS also splits and replicates the data among the Hadoop member nodes, but that is transparent to operations and commands that are described in this document. To manually check these files, log on to the Hadoop namenode and use the hadoop fs command to navigate the HDFS file system. Here are some examples.
hadoop fs -lsr /user/hive/warehouse
   Lists all Hive files
hadoop fs -ls /user/hive/warehouse/myschema.db
   Lists all Hive table directories in myschema
hadoop fs -ls /user/hive/warehouse/mytable
   Lists all Hive table files that belong to the mytable table in the default schema
hadoop fs -cat /user/hive/warehouse/mytable/<filename>
   Displays the actual contents of the file
hadoop fs -ls /user/hive/warehouse
   Lists all Hive table directories in the default schema
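If it is more convenient to stay in the SAS session than to log on to the namenode, a table file can also be copied out of the warehouse and inspected locally. The sketch below uses the HDFS statement of PROC HADOOP; the configuration file, credentials, and file names are placeholders, and option names can vary by release.
proc hadoop options='/path/to/hadoop-cfg.xml' username='louis' password='louispwd' verbose;
   /* Copy one file of the mytable table out of the Hive warehouse. */
   hdfs copytolocal='/user/hive/warehouse/mytable/part_placeholder'
        out='/local/tmp/mytable_part.txt';
run;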

Examples

This example creates and loads the FLIGHTS98 Hadoop table from the SASFLT.FLT98 SAS data set.
libname sasflt 'SAS-library';
libname hdp_air hadoop user=louis pwd=louispwd server='hdpcluster' schema=statsdiv;

proc sql;
   create table hdp_air.flights98 as
      select * from sasflt.flt98;
quit;
This example also creates and loads the ALLFLIGHTS Hadoop table from the SASFLT.ALLFLIGHTS SAS data set.
data hdp_air.allflights;
   set sasflt.allflights;
run;
In this example, the SASFLT.FLT98 SAS data set is appended to the existing ALLFLIGHTS Hadoop table.
proc append base=hdp_air.allflights
            data=sasflt.flt98;
run;