Enable Support for Impala and Spark :: SAS(R) Data Loader 2.4 for Hadoop: User's Guide

Introduction

If your Hadoop cluster uses the Cloudera Impala SQL environment or the Apache Spark runtime target or both, then your SAS Data Loader directives can benefit from enhanced performance. Cloudera Impala SQL and Apache Spark enhance performance using distributed processes and an increased level of in-memory processing.

The following directives support Cloudera Impala SQL:

Query or Join Data
Sort and De-Duplicate Data
Run a Hadoop SQL Program

The following directives support Apache Spark:

Cleanse Data
Transform Data
Cluster-Survive Data (requires Spark)

Note: The only directive that requires Impala or Spark is Cluster-Survive Data, which requires Spark.

Support for Impala and Spark is seen primarily in the code that is generated by the supporting directives. The directives change only in their support of user-written expressions.

User-written expressions can be used to filter source rows from the target or to calculate values for new target columns. When you enable Impala and Spark, you change the functions that can appear in your user-written expressions. In Impala, Impala SQL functions are supported rather than HiveQL functions. In Spark, DataFlux EEL functions are supported rather than SAS DS2 functions. (EEL stands for the Expression Engine Language, and DS2 stands for DATA Step 2.) All supported functions are documented in the Advanced Editor and in syntax reference documents.

When Impala and Spark are enabled, you retain the ability to write and execute new and existing directives in Hive. Continued Hive support is provided because Impala and Spark run in coordination with Hive. Any existing directives that use Hive will continue to run as they have in the past.

For information about the supported versions of Impala and Spark, see SAS 9.4 Supported Hadoop Distributions, at https://support.sas.com/resources/thirdpartysupport/v94/hadoop/hadoop-distributions.html.

Prerequisites

Meet the following prerequisites before you enable Impala or Spark:

Impala and Spark must be fully operational on your Hadoop cluster.
The Spark features in SAS Data Loader require the installation of the SAS Data Management Accelerator for Spark on your Hadoop cluster. Hadoop administrators can refer to the “SAS Data Management Accelerator for Spark” chapter in the SAS 9.4 In-Database Products: Administrator’s Guide.
The Spark features in SAS Data Loader require HCatalog to be enabled on the Hadoop cluster in order to read data from Hive. Hadoop administrators can refer to the topic “Additional Configuration Needed to Use HCatalog File Format” in the SAS 9.4 In-Database Products: Administrator’s Guide.
For Impala, Cloudera recommends that you install the Cloudera Impala JDBC Driver on your client host. See In an Environment that Does Not Use Kerberos, Install and Use a Cloudera Impala JDBC Driver.

Enable Impala as the Default SQL Environment

Follow these steps to enable the Impala SQL environment for new instances of the directives that support Impala.

Confirm with your administrator that Impala has been installed and configured on the nodes of your Hadoop cluster.
Click More and select Configuration.
In the Hadoop Configuration panel of the Configuration window, confirm that the Host field under Impala server contains a server name. If this field is blank, enter the fully qualified network name of the Impala host. Contact your Hadoop Administrator as needed.
In the Port field under Impala server, confirm that a value is specified by default. If this field is blank, obtain the port number for the Impala server from your Hadoop Administrator and enter that number in the field.
In the SQL environment field, select Impala.
To confirm your entries, click Test Connection.
Click OK to activate Impala as the default SQL environment.

To restore Hive as the default SQL environment, select Hive in the preceding Step 4.

Enable Spark as the Default Runtime Target

Follow these steps to enable Spark as the default runtime target. The default runtime target is applied to all new instances of the directives that support Spark.

Note: The default Hive runtime target does not apply to the directive Cluster-Survive Data, which requires Spark.

Confirm with your administrator that Apache Spark has been installed and configured on the nodes of your Hadoop cluster.
Click More and select Configuration.
Click Preferred runtime target and select Spark.
Click OK.

To restore MapReduce as the default runtime target, click MapReduce in the preceding Step 3.

Override the Impala or Spark Default

In an individual directive, you can override the default setting for the SQL environment or the runtime target. Use the Settings menu at the top of the directive to specify an override, or to return to the default setting that is specified in the Configuration window.

About Saved Directives and Impala or Spark

Saved directives that were created prior to SAS Data Loader 2.4 for Hadoop continue to run in Hive after you enable Impala and Spark. To run these directives in Impala or Spark, you need to create new directives.

Saved directives that were created in SAS Data Loader 2.4 for Hadoop or later for the HiveQL environment, can be upgraded to use Impala or Spark. To upgrade, follow these steps:

Open the saved directive.
Click Settings and select Impala or Spark.
Replace any user-written expressions in the Filter and Manage Columns tasks. Replace the existing Hive functions with Impala SQL or the existing SAS DS2 functions with DataFlux EEL functions, as provided in the Advanced Editor.
Save and close the directive.

The next time the saved directive is run, new code will be generated for the selected environment.

Usage Notes for Spark

Introduction

As you create and run the directives that support Spark, keep the following subjects in mind.

String Truncation in Spark-Enabled Directives

In directives where Spark is not the preferred runtime target, character columns are truncated based on the value of the field Maximum length for SAS columns. This field is available in the General Preferences panel of the Configuration window. The default value is 1024 characters. Source columns with string data types such as VAR and VARCHAR are truncated in SAS when their length exceeds the specified limit. The truncation occurs when SAS reads source columns into memory.

In Spark-enabled directives, the truncation of string columns differs between source columns that return a length, and source columns that do not return a length. Hive releases prior to 0.14.0 do not return a length for VAR and VARCHAR columns.

When Spark is enabled, and when columns do return a string length, strings are truncated according to the value of the configuration option EXPRESS_MAX_STRING_LENGTH. The value of the Maximum length for SAS columns field is ignored.

When Spark is enabled, and when string columns do not return a length, strings are truncated differently. The maximum string length is determined by the lesser value of the configuration option EXPRESS_MAX_STRING_LENGTH or the field Maximum length for SAS columns.

The default value of the EXPRESS_MAX_STRING_LENGTH configuration option is 5 MB. To specify a different value, ask your Hadoop administrator to update the app.cfg file on each node that runs the SAS Data Management Accelerator for Spark. In those files, add or update the label/value pair for EXPRESS_MAX_STRING_LENGTH.

Note: The value of EXPRESS_MAX_STRING_LENGTH also specifies the maximum amount of memory that is allocated for the underlying expression. For this reason, Hadoop administrators should be judicious when changing the default value.

VAR and VARCHAR columns that do not return a length are converted to the STRING type in the target so that they can receive a default length. To retain the original column types, use the Manage Columns task in the directive. In Manage Columns, the type of the target column needs to be VAR or VARCHAR and a length specification is required.

Spark Date Error

When Spark is the runtime environment, and if you run a Hive release earlier than 1.2, then dates starting with January 1, 1970 and older might be incorrect. To learn more, see https://issues.apache.org/jira/browse/HIVE-10178.

Hive Views Cannot Be Source Tables

When Spark is the preferred runtime environment, Hive views cannot be used as source tables.

Parquet Cannot Be Specified as a Format for Target Tables

When Spark is the preferred runtime environment, the Parquet table format cannot be selected for target tables. Parquet source tables are supported.

Spark Bin Directory Required in the Hadoop PATH

Spark support requires the addition of the Spark bin directory to the PATH environment variable in each Hadoop node. If your Spark-enabled directives fail early, contact your Hadoop administrator to research this issue.

Most Hadoop distributions include the Spark bin directory in /usr/bin, which resolves the issue. Your configuration might differ.

In the MapR distribution of Hadoop, the Spark bin directory is not included in the PATH variable by default. To resolve the issue in MapR, your Hadoop administrator can add a line to yarn-env.sh on each node manager node. The following example illustrates a typical addition to yarn-env.sh:

/* In MapR 5.0, using Spark 1.3.1 */
export PATH=$PATH:/opt/mapr/spark/spark-1.3.1/bin

Usage Notes for Impala

Introduction

To create and run the directives that support Impala, install a JDBC driver and avoid metadata synchronizaton errors between Impala and Hive.

To Use Impala in a Kerberos Environment, Use the Hive JDBC Driver

If your site uses Cloudera Impala and Kerberos authentication, then use the Hive JDBC driver that is currently in use on your Hadoop cluster. The Hive JDBC driver is normally installed during the deployment of the vApp, as described in the SAS Data Loader for Hadoop: vApp Deployment Guide.

Do not install any of the available Cloudera Impala JDBC drivers in vApp-path\SASWorkspace\JDBCdrivers.

In an Environment that Does Not Use Kerberos, Install and Use a Cloudera Impala JDBC Driver

To use Impala in environments that do not use Kerberos, Cloudera recommends that you install a Cloudera Impala JDBC driver.

By default, if any of the following drivers are found in vApp-path\SASWorkspace\JDBCdrivers, then SAS Data Loader uses the latest:

com.cloudera.impala.jdbc3.Driver
com.cloudera.impala.jdbc4.Driver
com.cloudera.impala.jdbc41.Driver

To override the default behavior and specify a Cloudera Impala JDBC driver, follow these steps:

In a text editor, open or create the file data-loader-site.xml in the folder vApp-path\SASWorkspace\conf.

Enter the following text:

<?xml version="1.0" encoding="UTF-8"?>

<configuration>

    <property>

        <name>cloudera.impala.driver.class</name>

        <value>com.cloudera.impala.jdbc4.Driver</value>

    </property>

</configuration>

As needed, replace the contents of the value tag above with the name of your JDBC driver class.
Save the XML file and restart the vApp to enable the use of your required JDBC driver.

Information about the Cloudera Impala JDBC drivers is provided at http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_jdbc.html. The download page for the JDBC drivers is http://www.cloudera.com/content/cloudera/en/downloads/connectors/impala/jdbc/impala-jdbc-v2-5-24.html.

Avoid Metadata Errors between Impala and Hive

To avoid metadata errors, avoid using a table in Hive and then using that same table soon thereafter in Impala (or in Impala and then in Hive). One place that could generate synchronization errors is a serial chain directive. One directive can use a table as a target and the next directive can use the same table as a source.