SAS/ACCESS Interface to Hadoop does not differentiate between a bulk
load and a standard load process. Although the BULKLOAD=YES syntax is
supported, it does not change the underlying load process.
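For example, the following sketch specifies BULKLOAD=YES as a data
set option. The libref, server, and table names are hypothetical, and
the resulting load is identical to one performed without the option.

/* Hypothetical HiveServer connection; adjust the server, port,     */
/* schema, and credentials for your site.                           */
libname hdp hadoop server="hive.example.com" port=10000
        schema=default user="myuser";

/* BULKLOAD=YES is accepted, but the load path is the same as a     */
/* standard load: CREATE TABLE, HDFS streaming upload, LOAD DATA.   */
data hdp.sales_fact (bulkload=yes);
   set work.sales_fact;
run;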
Here is how bulk loading
works in the Hadoop interface.
- SAS issues a CREATE TABLE command to the Hive server. The command
contains all table metadata (column definitions) and SAS-specific
table properties that refine the Hive metadata to handle maximum
string lengths and date/time formats (see the sketch after this
list).
- SAS initiates an HDFS streaming connection to upload the table
data to the HDFS /tmp directory. The resulting file is a UTF-8 text
file that uses CTRL-A ('\001') as the field delimiter and a newline
character ('\n') as the record separator. These are the defaults that
SAS uses for Hive.
- SAS issues a LOAD DATA command to the Hive server to move the data
file from the HDFS /tmp directory to the appropriate Hive warehouse
location. The data file is now considered to be an internal table.
Hive considers a table to be a collection of files in a directory
that bears the table name. The CREATE TABLE command creates this
directory either directly in the Hive warehouse or in a subdirectory,
if a nondefault schema is used. The LOAD DATA command moves the data
file into that directory.
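To make the sequence concrete, the following sketch submits roughly
equivalent HiveQL through explicit SQL pass-through. SAS generates
its statements internally, so the table, columns, SASFMT-style table
property, and /tmp file name shown here are illustrative assumptions;
the streaming upload step has no SQL equivalent.

/* Rough equivalent of the CREATE TABLE and LOAD DATA steps, issued  */
/* as explicit pass-through for illustration only.                   */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000
                      schema=default);

   /* CREATE TABLE step: column definitions plus a SAS-specific      */
   /* table property (property name and value are illustrative)      */
   execute (
      create table sales_fact (region string, saledate date,
                               amount double)
      row format delimited
         fields terminated by '\001'
         lines terminated by '\n'
      stored as textfile
      tblproperties ('SASFMT:region'='CHAR(10)')
   ) by hadoop;

   /* LOAD DATA step: move the uploaded file (name is hypothetical)  */
   /* from HDFS /tmp into the table's warehouse directory            */
   execute (
      load data inpath '/tmp/sasdata_sales_fact.dlv'
      into table sales_fact
   ) by hadoop;

   disconnect from hadoop;
quit;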
When PROC APPEND is used, the CREATE TABLE command is skipped because
the base table already exists. The data to be appended is streamed to
the HDFS /tmp directory, and the LOAD DATA command moves it into the
Hive warehouse directory as an additional file. After that, multiple
files with unique names exist for the table. When the table is read,
Hive concatenates all of the files that belong to it to render the
table as a whole.
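For example, assuming the hypothetical names from the earlier
sketches, the following step streams only the appended rows to the
HDFS /tmp directory and issues a LOAD DATA command; no CREATE TABLE
command is issued.

/* Adds a new file to the table's warehouse directory; existing     */
/* files stay in place, and Hive reads them together as one table.  */
proc append base=hdp.sales_fact data=work.sales_q4;
run;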
Note: These functions are not currently
supported.
-
- loading tables that use a nontext format and custom serializers
and deserializers (SerDes)