Configuring Components on the Cluster

Overview

After deploying the in-database deployment package, you must configure several components and settings on the Hadoop cluster in order for SAS Data Loader for Hadoop to operate correctly. These components and settings are explained in the following topics:

SQOOP and OOZIE

Your Hadoop cluster must be configured to use OOZIE scripts.
Note: Ensure that Oozie 4.0 or later is installed. You must add the following as entries in the list for the oozie.service.SchemaService.wf.ext.schemas property:
  • sqoop-action-0.4.xsd
  • hive-action-0.3.xsd
  • oozie-workflow-0.4.xsd
  • shell-action-0.3.xsd (for Spark submission)
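If you need to confirm that these entries are present, you can inspect the property directly. The following is a minimal check that assumes oozie-site.xml resides in /etc/oozie/conf; the location varies by distribution, and clusters managed by Cloudera Manager or Ambari typically maintain this property through the management console instead:
  # Assumed location of oozie-site.xml; adjust for your distribution
  grep -A 8 "oozie.service.SchemaService.wf.ext.schemas" /etc/oozie/conf/oozie-site.xml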

JDBC Drivers

SAS Data Loader for Hadoop leverages the SQOOP and OOZIE components installed with the Hadoop cluster to move data to and from a DBMS. The SAS Data Loader for Hadoop vApp client also accesses databases directly using JDBC for the purpose of selecting either source or target schemas and tables to move.
On the Hadoop cluster, you must install the JDBC driver or drivers required by the DBMSs that users need to access.
SAS Data Loader for Hadoop supports the Teradata and Oracle DBMSs directly. You can support additional databases by selecting Other in the Type option on the SAS Data Loader for Hadoop Database Configuration dialog box. For more information about the dialog box, see the SAS Data Loader for Hadoop: User’s Guide.
For Teradata and Oracle, SAS recommends that you download the following JDBC files from the vendor site:
  • Oracle: ojdbc6.jar
  • Teradata: tdgssconfig.jar and terajdbc4.jar
Note: You must also download the Teradata connector JAR file that is matched to your cluster distribution, if available.
The JDBC and connector JAR files must be located in the OOZIE shared libs directory in HDFS, not in /var/lib/sqoop. The correct path is available from the oozie.service.WorkflowAppService.system.libpath property.
The default directories in the Hadoop file system are as follows:
  • Hortonworks, Pivotal HD, IBM BigInsights Hadoop clusters: /user/oozie/share/lib/lib_version/sqoop
  • Cloudera Hadoop clusters: /user/oozie/share/lib/sharelibversion/sqoop
  • MapR Hadoop clusters: /oozie/share/lib/sqoop
You must have, at a minimum, -rw-r--r-- permissions on the JDBC drivers.
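As an illustration only, the following commands place an Oracle driver in the sqoop sharelib directory on a Cloudera cluster and set the minimum permissions. The local download path and the sharelib version directory (lib_20160101000000) are assumptions; substitute the actual values for your cluster:
  # List the sharelib to find the current version directory (Cloudera example path)
  hdfs dfs -ls /user/oozie/share/lib
  # Copy the driver JAR from an assumed local download location into the sqoop sharelib
  hdfs dfs -put /tmp/ojdbc6.jar /user/oozie/share/lib/lib_20160101000000/sqoop/
  # Ensure at least -rw-r--r-- permissions on the driver
  hdfs dfs -chmod 644 /user/oozie/share/lib/lib_20160101000000/sqoop/ojdbc6.jar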
After the JDBC drivers have been installed and configured along with SQOOP and OOZIE, you must refresh the Oozie sharelib, as follows:
oozie admin -oozie oozie_url -sharelibupdate
SAS Data Loader for Hadoop users must also have the same version of the JDBC drivers on their client machines in the SASWorkspace\JDBCDrivers directory. Provide a copy of the JDBC drivers to SAS Data Loader for Hadoop users.

Spark Bin Directory Required in the Hadoop PATH

SAS Data Loader for Hadoop supports the Apache Spark cluster computing framework. Spark support requires the addition of the Spark bin directory to the PATH environment variable on each Hadoop node.
Most Hadoop distributions include the Spark bin directory in /usr/bin. In some distributions, such as MapR, the Spark bin directory is not included in the PATH variable by default. In that case, you must add a line to yarn-env.sh on each NodeManager node. The following example illustrates a typical addition to yarn-env.sh:
  # In MapR 5.0, using Spark 1.3.1
  export PATH=$PATH:/opt/mapr/spark/spark-1.3.1/bin
You can use the command echo $PATH to verify that the path has been added.
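As a quick check on a node where the PATH change is in effect, the following commands confirm that the Spark bin directory is visible. The directory shown is the MapR example from above:
  # Confirm that the Spark bin directory appears in the PATH (MapR example path)
  echo $PATH | tr ':' '\n' | grep spark
  # Or check that spark-submit resolves to the expected location
  which spark-submit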

User IDs

Kerberos

If your installation uses Kerberos authentication, see Configuring Kerberos.

UNIX User Accounts and Home Directories

You must create one or more user IDs and enable certain permissions for the SAS Data Loader for Hadoop vApp user.
Note: MapR users must create a special user ID file. For more information, see For MapR Users.
To configure user IDs, follow these steps (a sample command sequence appears after the steps):
  1. Choose one of the following options for user IDs:
    • Create one user ID that any vApp user can use for login.
      Note: Do not use the super user, which is typically hdfs.
    • Create an individual user ID for each vApp user.
    • Map the user ID to a user principal for clusters using Kerberos.
  2. Create UNIX user IDs on all nodes of the cluster and assign them to a group.
  3. Create a user home directory and Hadoop staging directory in HDFS. The user home directory is /user/myuser. The Hadoop staging directory is controlled by the setting yarn.app.mapreduce.am.staging-dir in mapred-site.xml and defaults to /user/myuser.
  4. Change the permissions and owner of /user/myuser to match the UNIX user.
    Note: The user ID must have at least the following permissions:
    • Read, Write, and Delete permission for files in the HDFS directory (used for Oozie jobs)
    • Read, Write, and Delete permission for tables in Hive
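A minimal command sequence for steps 2 through 4 might look like the following. The user name myuser and group sasusers are example values, and the commands assume that the HDFS super user is hdfs:
  # Step 2: create the group and the UNIX user on each node of the cluster (example names)
  groupadd sasusers
  useradd -g sasusers myuser
  # Step 3: create the user home directory in HDFS (run as the HDFS super user)
  sudo -u hdfs hdfs dfs -mkdir -p /user/myuser
  # Step 4: change the owner and permissions of the directory to match the UNIX user
  sudo -u hdfs hdfs dfs -chown myuser:sasusers /user/myuser
  sudo -u hdfs hdfs dfs -chmod 755 /user/myuser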

Configuration Values

You must provide the vApp user with values for fields in the SAS Data Loader for Hadoop Configuration dialog box. For more information about the SAS Data Loader for Hadoop Configuration dialog box, see the SAS Data Loader for Hadoop: vApp Deployment Guide. The fields are as follows:
Host
specifies the full host name of the machine on the cluster that runs the HiveServer2 service.
Port
specifies the number of the HiveServer2 server port on your Hadoop cluster. For most distributions, the default is 10000.
User ID
specifies the Hadoop user account that you created on your Hadoop cluster, either for each individual vApp user or for all vApp users to share.
Password
Note: If your enterprise uses LDAP, you must supply the vApp user with the LDAP password. Otherwise, this field must be blank.
Oozie URL
specifies the Oozie base URL. The URL is the property oozie.base.url in the file oozie-site.xml. The URL is similar to the following example: http://host_name:port_number/oozie/.
Although the Oozie web UI at this URL does not have to be enabled for SAS Data Loader for Hadoop to function, it is useful for monitoring and debugging Oozie jobs. If you want vApp users to monitor jobs in this way, confirm that the Oozie web UI is enabled before providing the URL to them. Consult your cluster documentation for more information.
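If you need to look up these values, they can usually be read from the cluster configuration files. The following sketch assumes the default locations /etc/hive/conf and /etc/oozie/conf, which vary by distribution:
  # HiveServer2 port (Port field); assumed default hive-site.xml location
  grep -A 1 "hive.server2.thrift.port" /etc/hive/conf/hive-site.xml
  # Oozie base URL (Oozie URL field); assumed default oozie-site.xml location
  grep -A 1 "oozie.base.url" /etc/oozie/conf/oozie-site.xml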