SAS Federation Server Driver for Apache Hive

About the SAS Federation Server Driver for Apache Hive

The SAS Federation Server Driver for Apache Hive (Driver for Hive) enables SAS Federation Server to query and manage large data sets that reside in distributed storage. As with other drivers, to realize the full benefits of SAS Federation Server, such as security and data federation, use the FedSQL dialect with the driver. Using FedSQL, the Driver for Hive supports Read operations and Insert through bulk operations, but it does not support Write operations such as update, delete, and index creation. For additional information about FedSQL, see the SAS FedSQL Language Reference.
The driver also supports the native Hive dialect (HiveQL). Functionality depends on your Hive version. For example, if you are using a version of Hive that supports the ORC file format, you should be able to Insert, Delete, and Update.
You can use Kerberos with the Driver for Hive by specifying the authentication mode option in a connection string. For more information, see AUTH_MODE in the connection options below.
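Here is a minimal sketch of a connection string that requests Kerberos authentication; the server name and principal are hypothetical placeholders:
DRIVER=HIVE;SERVER='hivehost.example.com';PORT=10000;AUTH_MODE=Kerberos;HIVE_PRINCIPAL=hive/_HOST@EXAMPLE.COM;SCHEMA=default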

Prerequisites

Before configuring SAS Federation Server drivers, you must set environment variables that point to the client libraries required for your data source.
Hadoop JAR files must be installed, and the SAS_HADOOP_JAR_PATH environment variable must be defined, before you use the Driver for Hive. The variable points to the location of the Hadoop JAR files and is defined using the SetEnv option set in the dfs_serv.xml configuration file. Here is an example:
<OptionSet name="SetEnv">
    <Option name="SAS_HADOOP_JAR_PATH">\SAS\Config\Lev1\FederationServer\lib\Hadoop</Option>
</OptionSet>
If the JAR file location changes, you must update the SAS Federation Server configuration file with the new location.
Note: The SAS Deployment Wizard installs the necessary Hadoop JAR files and sets the SAS_HADOOP_JAR_PATH environment variable when the Driver for Hive is included with a SAS Federation Server plan.

Data Service Connection Options for Hive

Overview

To access data that is hosted on SAS Federation Server, a client must submit a DSN that defines how to connect to the data. DSNs are associated with a data service, which provides the foundation for the connection, such as user access control. See Working with Data Services for additional information.

Connection Options

You can specify one or more connection options when defining a DSN or data service. The Driver for Hive supports the following connection options.
DRIVER
DRIVER=HIVE
Required. Identifies the HIVE data source to which you want to connect.
SERVER
SERVER='server-name'
Specifies the host name of the Hive server. If the server name contains spaces or nonalphanumeric characters, enclose it in quotation marks.
PORT
PORT=port_number
Specifies the port number that is used to connect to the specified Hive Server. The default is 10000.
SUBPROTOCOL
SUBPROTOCOL=Hive|Hive2
Specifies whether you are connecting to a Hive service or a HiveServer2 (Hive2) service. The default is Hive2.
CATALOG
CATALOG=catalog-name
Specifies an arbitrary identifier for an SQL catalog, which groups logically related schemas.
SCHEMA
SCHEMA=hive-schema-name
Specifies a Hive schema name, for use when connecting to a schema other than 'default'.
Alias: DATABASE, DB
AUTH_MODE
AUTH_MODE=default | Kerberos
Specifies the authentication mode for the connection. The options are default and Kerberos. If you use Kerberos, you must also specify the Hive principal host name with the HIVE_PRINCIPAL connection option.
HIVE_PRINCIPAL
HIVE_PRINCIPAL=service-principal-hostname
Specifies the Hive principal string in an environment that uses Kerberos (for example, HIVE_PRINCIPAL=hive/_HOST@UNX.SAS.COM). Required with AUTH_MODE=Kerberos.
USER_PRINCIPAL
USER_PRINCIPAL=user-principal
Specifies that the HDFS path and JDBC path use JAAS to perform a doAs for the given Kerberos user principal.
Alias: AUTH_MODE=Kerberos;UID=
UID
UID=user-name
Specifies the user name with the necessary permissions to perform Read and Write operations. UID and PWD are not needed when connecting to a server that uses Kerberos authentication.
Alias: USER
PWD
PWD='user-password'
Specifies a password that correlates with the user ID (UID) value. If the password contains spaces or nonalphanumeric characters, enclose it in quotation marks. UID and PWD are not needed when connecting to a server that uses Kerberos authentication.
Alias: PASSWORD
PROPERTIES (JDBC session configuration properties)
Use the PROPERTIES option to specify one or more JDBC connection properties to override the default JDBC connection properties. In a JDBC URL, custom properties are separated from the default properties by the question mark (?) character. Multiple properties are separated by the semicolon (;) character. Here is an example:
PROPERTIES="hive.default.fileformat=ORC;hive.exec.compress.intermediate=true"
Site-wide Hive properties are specified in the hive-site.xml file in the Hive configuration directory.
You can use the PROPERTIES option to set the default file format for managed and unmanaged tables, respectively. The following examples specify optimized row columnar (ORC) as the default file format when creating a table:
PROPERTIES=hive.default.fileformat.managed=orc
PROPERTIES=hive.default.fileformat=orc
Note: The ORC file format is available beginning with Hive version 0.11.
The following example sets the partition mode to nonstrict, which allows dynamic inserts against a partitioned table (for example, when a static partition has not been explicitly defined in the SQL statement).
PROPERTIES=hive.exec.dynamic.partition.mode=nonstrict
Note: These Hive-defined properties can be changed or removed by Hadoop vendors at any time.
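As an illustrative sketch, with nonstrict mode set, a dynamic insert against a partitioned table might look like the following HiveQL, where the partition column value comes from the selected data rather than a static specification (the table and column names are hypothetical):
INSERT INTO TABLE sales PARTITION (sale_year)
SELECT item, amount, sale_year FROM staging_sales;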
HD_CONFIG
HD_CONFIG=path-to-hadoop-configuration-file
Specifies the name and path for the Hadoop cluster configuration file. This file contains entries for Hadoop system information, including file system properties such as fs.defaultFS. The configuration file can be a copy of the Hadoop core-site.xml file. However, if your Hadoop cluster is running with HDFS failover enabled, you must create a file that combines the contents of the Hadoop core-site.xml and hdfs-site.xml files.
Note: The name and path values for the Hadoop cluster configuration file are normally set in the SAS_HADOOP_CONFIG_PATH option in the dfs_serv.xml configuration file. This is the recommended configuration method. Use HD_CONFIG only if you want to override the settings in the server configuration file. See "SAS Federation Server Configuration Reference" for details.
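For reference, here is a sketch of the recommended server-level setting, assuming it follows the same SetEnv pattern shown in the Prerequisites section; the path is a hypothetical example:
<OptionSet name="SetEnv">
    <Option name="SAS_HADOOP_CONFIG_PATH">\SAS\Config\Lev1\FederationServer\conf\Hadoop</Option>
</OptionSet>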
HDFS_TEMPDIR
HDFS_TEMPDIR='path'
Specifies the path to the HDFS directory that is used for read and write of temporary data. The default is HDFS_TEMPDIR='/tmp'.
DBMAX_TEXT
DBMAX_TEXT=integer
Specifies the maximum length for a string data type. The maximum length is 2 gigabytes. The default is 32767.
LOGIN_TIMEOUT
LOGIN_TIMEOUT=number_of_seconds
Specifies a login time-out, in seconds, for nonresponsive connections. A value of 0 indicates that there is no time-out and the connection waits indefinitely. The default value is 30 seconds.
DRIVER_TRACE
DRIVER_TRACE='API | SQL | ALL';
Requests tracing information, which logs transaction records to an external file that can be used for debugging purposes. The SAS Federation Server driver writes a record of each command that is sent to the database to the trace log based on the specified tracing level, which determines the type of tracing information. The tracing levels are as follows:
  • ALL Activates all trace levels.
  • API Specifies that API method calls be sent to the trace log. This option is most useful if you are having a problem and need to send a trace log to Technical Support for troubleshooting.
  • DRIVER Specifies that driver-specific information be sent to the trace log.
  • SQL Specifies that SQL statements that are sent to the database management system (DBMS) be sent to the trace log. Tracing information is DBMS specific, but most SAS Federation Server drivers log SQL statements such as SELECT and COMMIT.
Default: Tracing is not activated.
Note: If you activate tracing, you must also specify the location of the trace log with DRIVER_TRACEFILE=. Note that DRIVER_TRACEFILE= is resolved against the TRACEFILEPATH set in ALTER SERVER. TRACEFILEPATH is relative to the server's content root location.
(Optional) You can control trace log formatting with DRIVER_TRACEOPTIONS=.
Interaction: You can specify one trace level, or you can concatenate more than one by including the | (OR) symbol. For example, driver_trace='api|sql' generates tracing information for API calls and SQL statements.
DRIVER_TRACEFILE
DRIVER_TRACEFILE='filename'
Specifies the name of the text file for the trace log. Include the file name and extension in single or double quotation marks. For example: driver_tracefile='\mytrace.log'
Default: A relative filename is placed relative to TRACEFILEPATH.
Requirement: DRIVER_TRACEFILE is required when activating tracing using DRIVER_TRACE.
Interaction: (Optional) You can control trace log formatting with DRIVER_TRACEOPTIONS=.
DRIVER_TRACEOPTIONS
DRIVER_TRACEOPTIONS=APPEND | THREADSTAMP | TIMESTAMP;
Specifies options that control formatting and other properties for the trace log:
  • APPEND Adds trace information to the end of an existing trace log. The contents of the file are not overwritten.
  • THREADSTAMP Prepends each line of the trace log with a thread identification.
  • TIMESTAMP Prepends each line of the trace log with a time stamp.
Default: The trace log is overwritten with no thread identification or time stamp.
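To tie these options together, here is a minimal sketch of a connection string that enables SQL tracing; the server name, credentials, and trace file name are hypothetical placeholders:
DRIVER=HIVE;SERVER='hivehost.example.com';PORT=10000;SUBPROTOCOL=Hive2;SCHEMA=default;UID=dbuser;PWD='dbpassword';DRIVER_TRACE='SQL';DRIVER_TRACEFILE='hive_trace.log';DRIVER_TRACEOPTIONS=TIMESTAMP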

ODBC Apache Hive Wire Protocol Driver Usage Notes

Configuring ODBC Options

SAS Federation Server provides a number of wire protocol ODBC drivers that communicate directly with a database server, without having to communicate through a client library. When you configure the ODBC drivers on Windows or UNIX, you have the opportunity to set certain options. SAS products run best when these options are selected. Some, but not all, are selected by default.
Windows
The options are located on the Advanced or Performance tabs in the ODBC Administrator.
UNIX
The options are available when configuring data sources using the dfdbconf tool. Values can also be set by editing the odbc.ini file in which the data sources are defined.
Note: A DSN that uses a wire protocol driver with the catalog option selected returns only the schemas that have associated tables or views. To list all existing schemas, create a DSN without the catalog option selected.
When configuring an ODBC DSN using the Apache Hive Wire Protocol driver, select the following options on the Advanced tab:
  • Remove Column Qualifiers
    Note: This option might be appended with (Microsoft Access Compatibility) in the ODBC Administrator.

Setting the Maximum Character String Size for Hive

Hive has a single data type for storing text, STRING, which is a variable-length character string with a maximum size of 2 GB. As a result, processing data can create very large character fields. Because Hive's STRING type is comparable to VARCHAR in other data sources, you can set the ODBC attribute Max Varchar Size to specify the maximum character string size. Set the Max Varchar Size value using Advanced Options in the Windows ODBC Administrator, or on UNIX by editing the odbc.ini file in the specified path or $HOME directory.
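For UNIX, here is a minimal sketch of an odbc.ini entry that sets the attribute; the DSN name, driver path, host, and port are hypothetical placeholders, and key names can vary by driver version:
[HiveSample]
Driver=<path-to-Apache-Hive-Wire-Protocol-driver>
HostName=hivehost.example.com
PortNumber=10000
MaxVarcharSize=300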
You can also specify this option in a connection string using the CONOPTS container. Here is an example:
DRIVER=ODBC;DB=hive;UID=dbitest;PWD=dbigrp1;SCHEMA=default;CONOPTS=(MaxVarcharSize=300);CATALOG=FOO;
Last updated: March 6, 2018