Additional Configuration Needed to Use HCatalog File Formats

Overview of HCatalog File Types

HCatalog is a table management layer that presents a relational view of data in HDFS to applications within the Hadoop ecosystem. With HCatalog, data structures that are registered in the Hive metastore, including SAS data, can be accessed through standard MapReduce code and Pig. HCatalog is part of Apache Hive.
The SAS Embedded Process for Hadoop uses HCatalog to process the following complex, non-delimited file formats: Avro, ORC, Parquet, and RCFile.
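For example, a Hive table stored in one of these formats is declared with a STORED AS clause. Here is a minimal sketch, using a hypothetical table name and columns:

   # Register an ORC-format table in the Hive metastore (hypothetical example).
   hive -e "CREATE TABLE sales_orc (id INT, amount DOUBLE) STORED AS ORC;"

Once such a table is registered in the Hive metastore, the SAS Embedded Process can process it through HCatalog.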

Prerequisites for HCatalog Support

If you plan to access complex, non-delimited file types such as Avro or Parquet, you must meet these additional prerequisites:
  • Hive and HCatalog must be installed on all nodes of the cluster.
  • HCatalog support depends on the version of Hive that is running on your Hadoop distribution. See the following table, and the version check sketched after it, for more information.
    Note: For MapR distributions, Hive 0.13.0 build 1501 or later must be installed for access to any HCatalog file type.
File Type      Required Hive Version
Avro           0.14
ORC            0.11
Parquet        0.13
RCFile         0.6
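
To determine which Hive version a cluster is running, you can ask the Hive client directly. Here is a minimal sketch (the --version option is available in recent Hive releases; on older releases, check the version embedded in the Hive library JAR file names, such as hive-exec-0.13.0.jar):

   # Prints the installed Hive version.
   hive --version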

SAS Client Configuration

Note: If you used the SAS Deployment Manager to install the SAS Embedded Process, these configuration tasks are not necessary; they were completed by the SAS Deployment Manager.
The following additional configuration tasks must be performed:
  • The hive-site.xml configuration file must be in the directory defined by the SAS_HADOOP_CONFIG_PATH environment variable.
  • The following Hive or HCatalog JAR files must be in the directory defined by the SAS_HADOOP_JAR_PATH environment variable.
    hive-hcatalog-core-*.jar
    hive-webhcat-java-client-*.jar
    jdo-api*.jar
  • If you are using MapR, the following Hive or HCatalog JAR files must also be in the directory defined by the SAS_HADOOP_JAR_PATH environment variable.
    hive-hcatalog-hbase-storage-handler-0.13.0-mapr-1408.jar
    hive-hcatalog-server-extensions-0.13.0-mapr-1408.jar
    hive-hcatalog-pig-adapter-0.13.0-mapr-1408.jar
    datanucleus-api-jdo-3.2.6.jar
    datanucleus-core-3.2.10.jar
    datanucleus-rdbms-3.2.9.jar
  • To access Avro file types, the avro-1.7.4.jar file must be added to the directory defined by the SAS_HADOOP_JAR_PATH environment variable.
  • To access Parquet file types with Cloudera 5.1, the parquet-hadoop-bundle.jar file must be added to the directory defined in the SAS_HADOOP_JAR_PATH environment variable.
  • If your distribution is running Hive 0.12, the jersey-client-1.9.jar file must be added to the directory defined by the SAS_HADOOP_JAR_PATH environment variable.
For more information about the SAS_HADOOP_JAR_PATH and SAS_HADOOP_CONFIG_PATH environment variables, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
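For example, on a UNIX SAS client, both variables can be set in the shell that starts SAS. Here is a minimal sketch with hypothetical directory paths; substitute the locations on your client machine:

   # Hypothetical locations on the SAS client.
   export SAS_HADOOP_CONFIG_PATH=/opt/sas/hadoop/conf   # contains hive-site.xml
   export SAS_HADOOP_JAR_PATH=/opt/sas/hadoop/jars      # contains the Hive and HCatalog JAR files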

SAS Server-Side Configuration

If your distribution is running MapReduce 2 and YARN, the SAS Embedded Process installation automatically sets the HCatalog CLASSPATH in the ep-config.xml file. Otherwise, you must manually include the HCatalog JAR files in either the MapReduce 2 library or the Hadoop CLASSPATH. For Hadoop distributions that run MapReduce 1, you must also manually add the HCatalog CLASSPATH to the MapReduce CLASSPATH (a sketch of this approach follows the examples below).
Here is an example for a Cloudera distribution.
<property>
   <name>mapreduce.application.classpath</name>
   <value>/EPInstallDir/SASEPHome/jars/sas.hadoop.ep.apache205.jar,
      /EPInstallDir/SASEPHome/jars/sas.hadoop.ep.apache205.nls.jar,
      /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/bin/../lib/hive/lib/*,
      /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive-hcatalog/libexec/../share/hcatalog/*,
      /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive-hcatalog/libexec/../share/hcatalog/storage-handlers/hbase/lib/*,
      $HADOOP_MAPRED_HOME/*,
      $HADOOP_MAPRED_HOME/lib/*,
      $MR2_CLASSPATH</value>
</property>
Here is an example for a Hortonworks distribution.
<property>
   <name>mapreduce.application.classpath</name>
   <value>/EPInstallDir/SASEPHome/jars/sas.hadoop.ep.apache205.jar,
      /EPInstallDir/SASEPHome/jars/sas.hadoop.ep.apache205.nls.jar,
      /usr/lib/hive-hcatalog/libexec/../share/hcatalog/*,
      /usr/lib/hive-hcatalog/libexec/../share/hcatalog/storage-handlers/hbase/lib/*,
      /usr/lib/hive/lib/*,
      $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
      $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
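If you must add the HCatalog JAR files to the Hadoop CLASSPATH instead (for example, on a distribution that runs MapReduce 1), one way is to append them in hadoop-env.sh on the cluster nodes. Here is a minimal sketch, assuming default /usr/lib installation paths for Hive and HCatalog; adjust the paths to your installation:

   # Hypothetical paths -- adjust to your Hive and HCatalog installation.
   export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/lib/hive/lib/*"
   export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/lib/hive-hcatalog/share/hcatalog/*"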

Additional Configuration for Cloudera 5.4 or IBM BigInsights 4.0

If you are using Cloudera 5.4 or IBM BigInsights 4.0 with HCatalog sources, you must set the HADOOP_HOME environment variable in the Windows environment. An example of the value for this variable is HADOOP_HOME=c:\hadoop.
The directory must contain a subdirectory named bin, and the bin subdirectory must contain the winutils.exe file for your distribution. Contact your distribution vendor for a copy of the winutils.exe file.
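For example, with HADOOP_HOME set to c:\hadoop, the SAS client expects to find the utility at:

   c:\hadoop\bin\winutils.exe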