SAS In-Database Code Accelerator for Hadoop

Overview

When you use the SAS In-Database Code Accelerator for Hadoop, the data and thread programs run in the MapReduce framework, either MapReduce 1 or MapReduce 2 (YARN).
Note: The SAS In-Database Code Accelerator for Hadoop is available by licensing SAS Data Loader. SAS In-Database Code Accelerator functionality is available for use with SAS Data Loader directives. If you install the SAS In-Database Technologies for Hadoop software, you can also submit DS2 code directly from a SAS session without using SAS Data Loader for Hadoop directives. For more information about installing the SAS In-Database Technologies for Hadoop software, see SAS In-Database Products: Administrator’s Guide.
Note: The SAS In-Database Code Accelerator for Hadoop supports only specific versions of the Hadoop distributions. For more information, see SAS 9.4 Supported Hadoop Distributions.

Supported File Types

The SAS In-Database Code Accelerator for Hadoop supports these file types:
  • Hive: Avro
  • Hive: delimited
  • Hive: ORC
  • Hive: Parquet
  • Hive: RCFile
  • Hive: sequence
  • Hive: SPD Engine*
  • HDMD: binary
  • HDMD: delimited
  • HDMD: sequence
  • HDMD: XML
  • HDFS: SPD Engine
* In the fourth maintenance release of SAS 9.4, Hive tables that use the SPD Engine SerDe are supported as input to the SAS In-Database Code Accelerator for Hadoop. For more information, see SAS SPD Engine: Storing Data in the Hadoop Distributed File System.
Note: Only SPD Engine data sets whose architectures match the architecture of the Hadoop cluster (that is, 64-bit Solaris or Linux) run inside the database. Otherwise, the data and thread program run on the client machine.
Tip: Partitioned Avro or Parquet data is supported as input to the SAS In-Database Code Accelerator for Hadoop when running with Hive version 0.14 or later.
Tip: The availability of these file types depends on the version of Hive that you use.
SASHDAT file types are not supported.
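As a rough sketch of how these storage types are typically reached from a SAS session: Hive tables through the SAS/ACCESS Interface to Hadoop (HADOOP engine), HDMD files through the same engine with HDFS_METADIR=, and SPD Engine data sets through the SPDE engine. The host name, user ID, and HDFS paths below are hypothetical placeholders for your site's values.

   /* Hive tables (Avro, delimited, ORC, Parquet, RCFile, sequence) */
   libname hivelib hadoop server="hadoop-host.example.com" user=sasdemo;

   /* HDMD files: HDFS_METADIR= directs the engine to SASHDMD metadata
      in HDFS instead of the Hive metastore */
   libname hdmdlib hadoop server="hadoop-host.example.com" user=sasdemo
           hdfs_metadir="/user/sasdemo/meta";

   /* SPD Engine data sets stored in HDFS */
   libname spdelib spde "/user/sasdemo/spde" hdfshost=default;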

Automatic File Compression with SAS Hadoop

By default, the SAS In-Database Code Accelerator for Hadoop automatically compresses certain output files.
The following default file compressions apply to Hive files unless the user has explicitly configured another compression algorithm:
  • Delimited files are not automatically compressed.
  • ORC files are compressed using ZLIB.
  • Avro, Parquet, and Sequence files are automatically compressed using Snappy.
HDMD files are never automatically compressed.
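If you want a different compression algorithm, one way to configure it explicitly is to pass Hive table properties through the DBCREATE_TABLE_OPTS table option (described later in this section). The following is a sketch only: the library and table names are hypothetical, and the parquet.compression table property shown here is an assumption that depends on your Hive version; check your Hive documentation for the property that your release honors.

   proc ds2 ds2accel=yes;
      /* Request GZIP instead of the default Snappy for a Parquet output table */
      data hdplib.sales_out (overwrite=yes dbcreate_table_opts=
           'stored as parquet tblproperties ("parquet.compression"="GZIP")');
         method run();
            set hdplib.sales;
         end;
      enddata;
   run;
   quit;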

Using HCatalog within the SAS Environment

HCatalog is a table management layer that presents a relational view of data in the HDFS to applications within the Hadoop ecosystem. With HCatalog, data structures that are registered in the Hive metastore, including SAS data, can be accessed through standard MapReduce code and Pig. HCatalog is part of Apache Hive.
The SAS In-Database Code Accelerator for Hadoop uses HCatalog to process complex, non-delimited files.
Summary of HCatalog File I/O

  File Type   Input                                         Output
  delimited   HDFS direct-read;                             HDFS direct-read;
              HCatalog if partitioned, skewed, or escaped   HCatalog if partitioned, skewed, or escaped
  RCFile      HCatalog                                      HCatalog
  ORC         HCatalog                                      HCatalog
  Parquet     HCatalog (1)                                  CREATE TABLE AS SELECT (2)
  sequence    HDMD;                                         HCatalog
              HCatalog if partitioned or skewed
  Avro        HCatalog (1)                                  CREATE TABLE AS SELECT (3)

  (1) In the fourth maintenance release of SAS 9.4, partitioned Avro or Parquet data is supported as input to the SAS In-Database Code Accelerator for Hadoop.
  (2) Output cannot be written directly to Parquet files because of this Hive issue: https://issues.apache.org/jira/browse/HIVE-8838.
  (3) Output cannot be written directly to Avro files because of this Hive issue: https://issues.apache.org/jira/browse/HIVE-8687.
Consider these requirements when using HCatalog:
  • Data that you want to access with HCatalog must first be registered in the Hive metastore.
  • The recommended Hive version for the SAS In-Database Code Accelerator for Hadoop is 0.13 or later.
  • Avro is not a native file type. Additional JAR files are required and must be defined in the SAS_HADOOP_JAR_PATH environment variable.
  • Support for HCatalog varies by vendor. For more information, see the documentation for your Hadoop vendor.

Additional Prerequisites When Accessing Files That Are Processed Using HCatalog

If you plan to access complex, non-delimited file types, such as Avro or Parquet, through HCatalog, there are additional prerequisites:
  • To access Avro file types, the avro-1.7.4.jar file must be added to the SAS_HADOOP_JAR_PATH environment variable. To access Parquet file types, the parquet-hadoop-bundle.jar file must be added to the SAS_HADOOP_JAR_PATH environment variable. In addition, you need to add the following HCatalog JAR files to the SAS_HADOOP_JAR_PATH environment variable:
    webhcat-java-client*.jar
    hbase-storage-handler*.jar
    hcatalog-server-extensions*.jar
    hcatalog-core*.jar
    hcatalog-pig-adapter*.jar
  • On the Hadoop cluster, the SAS Embedded Process for Hadoop install script automatically adds the HCatalog JAR files to its configuration file. The HCatalog JAR files are added to the SAS Embedded Process MapReduce job class path during job submission.
    For information about installing and configuring the SAS Embedded Process for Hadoop, see SAS In-Database Products: Administrator’s Guide.
  • You must include the HCatalog JAR files in your SAS_HADOOP_JAR_PATH environment variable, as shown in the sketch after this list.
    For more information about the SAS_HADOOP_JAR_PATH environment variable, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
  • The hive-site.xml file must be in your SAS_HADOOP_CONFIG_PATH.
    For more information about the SAS_HADOOP_CONFIG_PATH environment variable, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
  • If you are using IBM BigInsights v3.0 or Pivotal HD 2.x, do not use quotation marks around table names. The version of Hive that is used by IBM BigInsights v3.0 and Pivotal HD 2.x does not support quoted identifiers.
  • To access Hive SPD Engine tables from the SAS In-Database Code Accelerator, the path to the SerDe JAR files must be included in the SAS_HADOOP_JAR_PATH.
    For more information about the SAS_HADOOP_JAR_PATH environment variable, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
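A minimal sketch of the two environment variable settings follows; the installation paths are hypothetical, and both variables can also be set as operating-system environment variables before SAS starts.

   /* Directory that contains the Hadoop, HCatalog, Avro, and Parquet JAR files */
   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";

   /* Directory that contains hive-site.xml and the other cluster configuration files */
   options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";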

BY-Group Processing with Hadoop

When there is no BY statement in the thread program, the number of reducers is set to 0, and the program runs as a map-only task. When the thread program contains a BY statement and the PROC DS2 statement specifies the BYPARTITION=YES option, a full MapReduce job runs: the map tasks partition the data by the BY variables, and the reduce tasks run the DS2 thread program.
Note: The SAS In-Database Code Accelerator for Hadoop might not produce sorted BY groups when re-partitioning is involved.
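The following DS2 program is a sketch of the second case: the thread program contains a BY statement, so the map tasks partition the rows by the BY variable and the reduce tasks run the thread program. The library, table, and variable names are hypothetical.

   proc ds2 ds2accel=yes bypartition=yes;
      thread work.by_thread / overwrite=yes;
         dcl double total;
         method run();
            set hdplib.sales;            /* hypothetical input table */
            by region;                   /* BY statement forces a reduce phase */
            if first.region then total = 0;
            total = total + amount;
            if last.region then output;  /* one summary row per BY group */
         end;
      endthread;

      data hdplib.sales_by_region (overwrite=yes);
         dcl thread work.by_thread t;
         method run();
            set from t;
         end;
      enddata;
   run;
   quit;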

Using the MapReduce Job Logs to View DS2 Error Messages

When a job fails, the SAS In-Database Code Accelerator provides the HTTP location of the job. When the MSGLEVEL=I system option is set, a link to the HTTP location of the MapReduce logs is also produced. Here is an example.
ERROR: Job job_1424277669708_2919 has failed. Please, see job log for
       details. Job tracking URL :
       http://name.unx.company.com:8088/proxy/application_1424277669708_2919/
The HTTP link points to a site that contains the job summary. On the job summary page, you can see the number of failed and successful tasks. If you click the failed tasks, you see a list of task attempts, each with its own log. On the log page, you can see the DS2 error messages.
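MSGLEVEL=I is a standard SAS system option; setting it before you submit the DS2 program causes the tracking URL shown above to be written to the SAS log when a job fails.

   options msglevel=i;   /* write informational notes, including the MapReduce job tracking URL, to the SAS log */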

Using the DBCREATE_TABLE_OPTS Table Option

The DBCREATE_TABLE_OPTS table option provides a free-form string in the DATA statement. For the SAS In-Database Code Accelerator for Hadoop, you can use the DBCREATE_TABLE_OPTS table option to specify the output SerDe, the output delimiter of the Hive table, the output escape character, and any other CREATE TABLE syntax that Hive allows.
For more information, see DBCREATE_TABLE_OPTS= Table Option in SAS DS2 Language Reference.
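As a sketch of how the option might be used (the library, table, delimiter, and escape character below are assumptions, not requirements), the following DATA statement creates a delimited Hive output table with a tab field delimiter and a backslash escape character:

   proc ds2 ds2accel=yes;
      data hdplib.orders_out (overwrite=yes dbcreate_table_opts=
           'row format delimited fields terminated by "\t" escaped by "\\" stored as textfile');
         method run();
            set hdplib.orders;
         end;
      enddata;
   run;
   quit;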