Reading and Writing HCatalog File Types

Overview of HCatalog File Types

HCatalog is a table management layer that presents a relational view of data in HDFS to applications within the Hadoop ecosystem. With HCatalog, data structures that are registered in the Hive metastore, including SAS data, can be accessed through standard MapReduce code and Pig. HCatalog is part of Apache Hive.
In the third maintenance release of SAS 9.4, the SAS In-Database Scoring Accelerator for Hadoop uses HCatalog to read the native Hive file types Avro, ORC, Parquet, and RCFile.
By default, an output file is delimited. You can use the %INDHD_RUN_MODEL macro’s OUTRECORDFORMAT argument to write a binary file.
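For example, a call such as the following writes binary rather than delimited output. This is a minimal sketch: only the OUTRECORDFORMAT argument comes from this section. The remaining arguments and the SASBINARY value are illustrative assumptions; see the %INDHD_RUN_MODEL macro reference for the exact syntax and valid values.

   /* Minimal sketch: only OUTRECORDFORMAT= is taken from this section.    */
   /* The other arguments are illustrative placeholders, and SASBINARY is  */
   /* assumed to be the value that requests binary output.                 */
   %indhd_run_model(
      inputtable=sales_detail,
      scorepgm=score.sas,
      outdatadir=/user/sasdemo/scores,
      outrecordformat=sasbinary
   );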
Consider these requirements when using HCatalog:
  • Data that you want to access with HCatalog must first be registered in the Hive metastore (see the sketch after this list).
  • The recommended Hive version for the SAS In-Database Scoring Accelerator for Hadoop is 0.13.0.
  • Support for HCatalog varies by vendor. For more information, see the documentation for your Hadoop vendor.
  • Additional configuration steps are needed when processing HCatalog files. See “Additional Configuration Steps When Accessing Files That Are Processed Using HCatalog.”
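For example, a table can be registered in the Hive metastore from a SAS session by passing HiveQL through SAS/ACCESS Interface to Hadoop, as in the following minimal sketch. The server name, port, user, and table definition are illustrative assumptions; any Hive client can issue the same CREATE TABLE statement.

   /* Hedged sketch: registers a Parquet table in the Hive metastore by    */
   /* passing HiveQL through SAS/ACCESS Interface to Hadoop. The server,   */
   /* port, user, and table definition are illustrative assumptions.       */
   proc sql;
      connect to hadoop (server="hivesrv01" port=10000 user="sasdemo");
      execute (
         create table sales_scores (cust_id int, score double)
         stored as parquet
      ) by hadoop;
      disconnect from hadoop;
   quit;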

Additional Configuration Steps When Accessing Files That Are Processed Using HCatalog

If you plan to access complex, non-delimited file types such as Avro or Parquet through HCatalog, there are additional prerequisites:
  • To access Avro file types, add the avro-1.7.4.jar file to the SAS_HADOOP_JAR_PATH environment variable. To access Parquet file types, add the parquet-hadoop-bundle.jar file. In addition, add the following HCatalog JAR files to the SAS_HADOOP_JAR_PATH environment variable:
    webhcat-java-client*.jar
    hbase-storage-handler*.jar
    hcatalog-server-extensions*.jar
    hcatalog-core*.jar
    hcatalog-pig-adapter*.jar
  • On the Hadoop cluster, the SAS Embedded Process for Hadoop install script automatically adds the HCatalog JAR files to its configuration file. The HCatalog JAR files are added to the SAS Embedded Process MapReduce job class path during job submission.
    For information about installing and configuring the SAS Embedded Process for Hadoop, see SAS In-Database Products: Administrator’s Guide.
  • You must include the HCatalog JAR files in your SAS_HADOOP_JAR_PATH environment variable, as shown in the sketch after this list.
    For more information about the SAS_HADOOP_JAR_PATH environment variable, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
  • The hive-site.xml file must be in the directory that is specified by your SAS_HADOOP_CONFIG_PATH environment variable.
    For more information about the SAS_HADOOP_CONFIG_PATH environment variable, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
  • If you are using IBM BigInsights v3.0 or Pivotal HD 2.x, do not use quotation marks around table names. The version of Hive that is used by IBM BigInsights v3.0 and Pivotal HD 2.x does not support quoted identifiers.
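The following sketch shows one way to set the environment variables that are described in this list from within a SAS session. The directory paths are illustrative; the HCatalog, Avro, and Parquet JAR files must already be present in the SAS_HADOOP_JAR_PATH directory, and hive-site.xml must be present in the SAS_HADOOP_CONFIG_PATH directory.

   /* Hedged sketch: assumes the variables were not already set before     */
   /* SAS started; the directory paths are illustrative.                   */
   options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
   options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";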