Configuring Components on the Cluster

Overview

You must configure several components and settings on the Hadoop cluster in order for SAS Data Loader for Hadoop to operate correctly. These are explained in the following four topics:

SQOOP and OOZIE

Your Hadoop cluster must be configured to use SQOOP commands and OOZIE scripts.
Note: You must add sqoop-action-0.4.xsd as an entry in the list for the oozie.service.SchemaService.wf.ext.schemas property.
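For reference, the resulting entry in oozie-site.xml (or in your cluster manager's Oozie configuration) might look similar to the following sketch. The other schema file names shown are only examples of values that your cluster might already list; the sqoop-action-0.4.xsd addition is the part required by this step.

    <property>
      <name>oozie.service.SchemaService.wf.ext.schemas</name>
      <!-- Append sqoop-action-0.4.xsd to the existing comma-separated list -->
      <value>shell-action-0.1.xsd,hive-action-0.2.xsd,sqoop-action-0.4.xsd</value>
    </property>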

JDBC Driver

SAS Data Loader for Hadoop uses the SQOOP and OOZIE components installed with the Hadoop cluster to move data to and from a DBMS. The SAS Data Loader for Hadoop vApp client also accesses the databases directly through JDBC in order to select the source or target schemas and tables to move.
You must install on the Hadoop cluster the JDBC driver that is required by each DBMS that users need to access. Follow the JDBC driver vendor's installation instructions.
SAS Data Loader for Hadoop supports the Teradata and Oracle DBMSs directly. You can support additional databases by selecting Other in the Type option on the SAS Data Loader for Hadoop Database Configuration dialog box. See the SAS Data Loader for Hadoop: User's Guide for more information about the dialog box.
For Teradata and Oracle, SAS recommends that you download the following JDBC files from the vendor site:
JDBC Files

Database     Required Files
Oracle       ojdbc6.jar
Teradata     tdgssconfig.jar and terajdbc4.jar
Note: You must also download the Teradata connector JAR file that is matched to your cluster distribution.
The JDBC driver and Teradata connector JAR files are installed under the OOZIE shared lib directory in the Hadoop file system, as follows:
  • Hortonworks Hadoop clusters: /user/oozie/share/lib/sqoop
  • Cloudera Hadoop clusters: /user/oozie/share/lib/sharelib<version>/sqoop
The JDBC and connector JAR files must be located in the OOZIE shared lib directory in HDFS, not in /var/lib/sqoop. The correct path is available from the oozie.service.WorkflowAppService.system.libpath property.
You must have, at a minimum, -rw-r--r-- permissions on the JDBC drivers.
After the JDBC drivers have been installed and configured along with SQOOP and OOZIE, you must restart OOZIE.
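The following is a minimal sketch of placing a driver, assuming a Hortonworks cluster and the Oracle ojdbc6.jar file listed above; adjust the path and file names for your distribution and drivers.

    # Copy the JDBC driver (and any connector JAR files) into the OOZIE shared lib in HDFS
    hadoop fs -put ojdbc6.jar /user/oozie/share/lib/sqoop/
    # Grant at least -rw-r--r-- (644) permissions on the driver files
    hadoop fs -chmod 644 /user/oozie/share/lib/sqoop/ojdbc6.jar
    # Then restart OOZIE, for example through your cluster manager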
SAS Data Loader for Hadoop users must also have the same version of the JDBC drivers on their client machines in the SASWorkspace\JDBCDrivers directory. Provide a copy of the JDBC drivers to SAS Data Loader for Hadoop users.

Hadoop Configuration Files

You must make the following configuration files from the Hadoop cluster available to be copied to the client machine of the SAS Data Loader for Hadoop vApp user:
core-site.xml
hdfs-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml
Note: For a MapReduce 2 and YARN cluster, both the mapred-site.xml and yarn-site.xml files are required.
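How you gather these files depends on your cluster. As one sketch, assuming the files reside under /etc/hadoop/conf and /etc/hive/conf on a cluster node named cluster_node (the paths and host name are placeholders for your environment):

    # Collect the cluster configuration files so that they can be copied to the vApp client machine
    mkdir -p sitexmls
    scp cluster_node:/etc/hadoop/conf/{core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml} sitexmls/
    scp cluster_node:/etc/hive/conf/hive-site.xml sitexmls/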
For the Copy to Hadoop and Copy from Hadoop directives to work correctly, you must set the following entries to False:
  • the dfs.permissions.enabled entry in the hdfs-site.xml file on the Hadoop cluster
  • the hive.resultset.use.unique.column.names entry in the hive-site.xml file on the target Hadoop cluster
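For example, the corresponding entries look similar to the following; only these two property values change, and the rest of each file stays as your cluster defines it.

In hdfs-site.xml on the Hadoop cluster:

    <property>
      <name>dfs.permissions.enabled</name>
      <value>false</value>
    </property>

In hive-site.xml on the target Hadoop cluster:

    <property>
      <name>hive.resultset.use.unique.column.names</name>
      <value>false</value>
    </property>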

User ID, Permissions, and Configuration Values

Your Hadoop cluster can use Kerberos authentication or another type of user authentication. If you are using Kerberos authentication, see Security.
For clusters that do not use Kerberos, you must create one or more user IDs and enable certain permissions for the SAS Data Loader for Hadoop vApp user.
Use the following procedure to establish user IDs:
  1. Choose one of the following options for user IDs:
    • create one Hadoop Configuration User ID for all vApp users
      Note: Do not use the Hadoop super user, which is typically hdfs.
    • create one Hadoop Configuration User ID for every vApp user
  2. Create UNIX users in /etc/passwd and /etc/group (see the sketch after this procedure).
  3. Create a MapReduce staging directory in HDFS, as defined in the MapReduce configuration. The default is /users/myuser.
  4. Change the permissions and owner of the HDFS directory /users/myuser to match the UNIX user.
    Note: The user ID must have at least the following permissions:
    • Read, Write, and Delete permission for files in the HDFS directory (used for Oozie jobs)
    • Read, Write, and Delete permission for tables in Hive
  5. If you are using Hive, create the Hive database and set appropriate permissions on the folder.
  6. The user= value in the SAS LIBNAME statement must match the HDFS user that you created on the Hadoop cluster, as in the following example:
    libname testlib hadoop 
    server = "hadoopdb" 
    user = user2 
    database = mydatabase
    ;
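Steps 2 through 5 are typically carried out on the cluster with standard Linux, HDFS, and Hive commands. The following is a minimal sketch, assuming a single user named myuser, the default /users/myuser staging directory, a Hive database named mydatabase, and the default Hive warehouse path; the host name, permission mode, and all of these names are placeholders to adjust for your site.

    # Step 2: create the UNIX user (and group) on the cluster; myuser is a placeholder name
    useradd myuser
    # Steps 3 and 4: create the MapReduce staging directory in HDFS and match its owner to the UNIX user
    hadoop fs -mkdir -p /users/myuser
    hadoop fs -chown myuser:myuser /users/myuser
    hadoop fs -chmod 755 /users/myuser
    # Step 5: if you are using Hive, create the Hive database (run as a user allowed to create databases)
    beeline -u "jdbc:hive2://hive_host:10000/default" -e "CREATE DATABASE mydatabase"
    # Step 5 (continued): set ownership on the database folder; the default warehouse path is assumed here
    hadoop fs -chown -R myuser:myuser /user/hive/warehouse/mydatabase.db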
    
You must provide the vApp user with values for the fields in the SAS Data Loader for Hadoop Configuration dialog box. See the SAS Data Loader for Hadoop: vApp Deployment Guide for more information about the dialog box. The fields are as follows:
User ID
the Hadoop Configuration User ID account that you created on your Hadoop cluster, either for each vApp user or for all vApp users.
Host
the fully qualified host name of the machine on the cluster that runs the Hive server.
Port
the number of the Hive server port on the host for your cluster.
  • For Cloudera, the HiveServer2 port default is 10000.
  • For Hortonworks, the Hive server port default is 10000.
Oozie URL
the URL to the Oozie Web Console, which is an interface to the Oozie server. The URL is similar to the following example: http://host_name:port_number/oozie/.
Confirm that the Oozie Web UI is enabled before providing the URL to the vApp user. If it is not, enable it through your cluster management tool.
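One quick check is the Oozie command-line client, if it is available; the host name and port below are placeholders, with 11000 being the usual Oozie port.

    # Prints "System mode: NORMAL" when the Oozie server at the given URL is reachable
    oozie admin -oozie http://host_name:11000/oozie -status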