SAS In-Database Code Accelerator for Hadoop

Overview

When you use the SAS In-Database Code Accelerator for Hadoop, the data and thread programs run in the MapReduce framework, either MapReduce 1 or YARN/MapReduce 2.
Note: The SAS In-Database Code Accelerator for Hadoop supports only Cloudera 5.2 and Hortonworks 2.1 or later.
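For example, a minimal DS2 program of the following form runs both its thread and data programs inside Hadoop when in-database processing is requested with the DS2ACCEL=YES option. This is a sketch only: the LIBNAME connection options are site-specific, and the hdplib library, sales table, and price and quantity columns are illustrative assumptions.

   libname hdplib hadoop server="hive.example.com" user=sasuser;

   proc ds2 ds2accel=yes;
      thread work.compute / overwrite=yes;
         dcl double total;
         method run();
            set hdplib.sales;           /* rows are read in parallel by the map tasks */
            total = price * quantity;   /* per-row computation runs inside Hadoop */
         end;
      endthread;
      data hdplib.sales_totals {overwrite=yes};
         dcl thread work.compute t;
         method run();
            set from t;                 /* collects the thread program output */
         end;
      enddata;
      run;
   quit;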

Supported File Types

The SAS In-Database Code Accelerator for Hadoop supports these file types:
  • Hive: Avro*
  • Hive: delimited
  • Hive: ORC*
  • Hive: Parquet*
  • Hive: RCFile*
  • Hive: sequence
  • HDMD: binary
  • HDMD: delimited
  • HDMD: sequence
  • HDMD: XML
*Support for these file types begins with the February 2015 release of the SAS In-Database Code Accelerator for Hadoop.
Tip: Partitioned Avro or Parquet data is not supported as input to the SAS In-Database Code Accelerator for Hadoop.
Tip: The availability of these file types depends on the version of Hive that you use.
The following file types are not supported:
  • SPD file format
  • SASHDAT
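Unlike Hive tables, HDMD file types are described by metadata files that you generate with PROC HDMD. The following is a minimal sketch for a delimited HDMD file; the library name, HDFS paths, file name, and columns are hypothetical, and the exact PROC HDMD options that you need depend on your data.

   libname gridlib hadoop server="hive.example.com" user=sasuser
           hdfs_metadir='/user/sasuser/meta'
           hdfs_datadir='/user/sasuser/data';

   proc hdmd name=gridlib.sales_hdmd format=delimited sep=comma
             data_file='sales.csv';
      column id     int;
      column region varchar(20);
      column amount double;
   run;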

Automatic File Compression with SAS Hadoop

By default, the SAS In-Database Code Accelerator for Hadoop automatically compresses certain output files.
The following default compression applies to Hive files unless you have explicitly configured another compression algorithm:
  • Delimited files are not automatically compressed.
  • ORC files are compressed using ZLIB.
  • Avro, Parquet, and Sequence files are automatically compressed using Snappy.
HDMD files are never automatically compressed.
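If a default codec is not what you want, one way to request different compression for a Hive output table is the DBCREATE_TABLE_OPTS= table option, which is described later in this section. The following sketch uses the standard Hive orc.compress table property; the hdplib library and table names are assumptions, and whether an override is honored depends on your Hive configuration.

   data hdplib.result (dbcreate_table_opts='stored as orc tblproperties ("orc.compress"="SNAPPY")');
      set hdplib.source;   /* hypothetical source table */
   run;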

Using HCatalog within the SAS Environment

HCatalog is a table management layer that presents a relational view of data in the HDFS to applications within the Hadoop ecosystem. With HCatalog, data structures that are registered in the Hive metastore, including SAS data, can be accessed through standard MapReduce code and Pig. HCatalog is part of Apache Hive.
In the February 2015 release, the SAS In-Database Code Accelerator for Hadoop uses HCatalog to process complex, non-delimited files.
Summary of HCatalog File I/O

  File Type   Input                                     Output
  ---------   ---------------------------------------   ---------------------------------------
  delimited   HDFS direct-read;                         HDFS direct-read;
              HCat if partitioned, skewed, or escaped   HCat if partitioned, skewed, or escaped
  RCFile      HCat                                      HCat
  ORC         HCat                                      HCat
  Parquet     HCat(1)                                   CREATE TABLE AS SELECT(2)
  sequence    HDMD;                                     HCat
              HCat if partitioned or skewed
  Avro        HCat(1)                                   CREATE TABLE AS SELECT(3)

  (1) Partitioned input is not supported.
  (2) Output cannot be written directly to Parquet files because of this issue:
      https://issues.apache.org/jira/browse/HIVE-8838.
  (3) Output cannot be written directly to Avro files because of this issue:
      https://issues.apache.org/jira/browse/HIVE-8687.
Consider these requirements when using HCatalog:
  • Data that you want to access with HCatalog must first be registered in the Hive metastore.
  • The recommended Hive version for the SAS In-Database Code Accelerator for Hadoop is 0.13.0.
  • Avro is not a native file type. Additional JAR files are required and must be defined in the SAS_HADOOP_JAR_PATH environment variable.
  • Support for HCatalog varies by vendor. For more information, see the documentation for your Hadoop vendor.

Additional Prerequisites When Accessing Files That Are Processed Using HCatalog

If you plan to access complex, non-delimited file types such as Avro or Parquet, there are additional prerequisites:
  • To access Avro file types, add the avro-1.7.4.jar file to the SAS_HADOOP_JAR_PATH environment variable.
  • To access Parquet file types, add the parquet-hadoop-bundle.jar file to the SAS_HADOOP_JAR_PATH environment variable. An example of setting the variable follows this list.
  • If you have performed a default CLASSPATH installation, the HCatalog JAR files that are required to access the Hive metastore within a MapReduce job are automatically included in the Hadoop CLASSPATH. Otherwise, you must manually include the HCatalog JAR files in either the MapReduce library or the Hadoop CLASSPATH.
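For example, you can set the environment variable from a SAS session with the SET= system option. The directory shown is hypothetical; it must contain your Hadoop JAR files, including avro-1.7.4.jar and parquet-hadoop-bundle.jar if you access those file types.

   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";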

BY-Group Processing with Hadoop

When there is no BY statement in the thread program, the number of reducers is set to 0 and the program runs as a map-only task. When the thread program contains a BY statement and the PROC DS2 statement specifies the BYPARTITION=YES option, a full MapReduce job runs: the map tasks partition the data, and the reduce tasks run the DS2 thread program.
Note: The SAS In-Database Code Accelerator for Hadoop might not produce sorted BY groups when re-partitioning is involved.
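The following sketch shows a thread program whose BY statement triggers the map-and-reduce execution path; the hdplib library and the sales, region, and amount names are illustrative assumptions.

   proc ds2 ds2accel=yes bypartition=yes;
      thread work.bygroups / overwrite=yes;
         dcl double total;
         method run();
            set hdplib.sales;
            by region;                    /* the BY statement adds a reduce phase */
            if first.region then total = 0;
            total = total + amount;
            if last.region then output;   /* one summary row per BY group */
         end;
      endthread;
      data hdplib.region_totals {overwrite=yes};
         dcl thread work.bygroups t;
         method run();
            set from t;
         end;
      enddata;
      run;
   quit;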

Using the DBCREATE_TABLE_OPTS Table Option

The DBCREATE_TABLE_OPTS table option provides a free-form string in the DATA statement. For the SAS In-Database Code Accelerator for Hadoop, you can use the DBCREATE_TABLE_OPTS table option to specify the output SerDe, the output delimiter of the Hive table, the output escape character (ESCAPED BY), and any other CREATE TABLE syntax that Hive allows.
For more information, see DBCREATE_TABLE_OPTS= Table Option in SAS DS2 Language Reference.
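For example, the following sketch writes a pipe-delimited Hive text table. The string in DBCREATE_TABLE_OPTS= is passed through to the Hive CREATE TABLE statement; the hdplib library and table names are assumptions.

   proc ds2 ds2accel=yes;
      data hdplib.sales_out {dbcreate_table_opts='row format delimited fields terminated by "|" stored as textfile'};
         method run();
            set hdplib.sales;   /* hypothetical input table */
         end;
      enddata;
      run;
   quit;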