Configuring Components on the Cluster

Overview

After deploying the in-database deployment package, you must configure several components and settings on the Hadoop cluster in order for SAS Data Loader for Hadoop to operate correctly. These are explained in the following topics:

SQOOP and OOZIE

Your Hadoop cluster must be configured to use OOZIE scripts.
Note:
  • Ensure that Oozie 4.1 is installed.
  • You must add sqoop-action-0.4.xsd as an entry in the list for the oozie.service.SchemaService.wf.ext.schemas property.
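For example, on a cluster where the Oozie configuration files reside in /etc/oozie/conf (an assumed location; adjust the path for your distribution), you can confirm that the schema entry is present with a command such as the following:
# The property value is a comma-separated list and must include sqoop-action-0.4.xsd.
grep -A 3 "oozie.service.SchemaService.wf.ext.schemas" /etc/oozie/conf/oozie-site.xml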

JDBC Drivers

SAS Data Loader for Hadoop leverages the SQOOP and OOZIE components installed with the Hadoop cluster to move data to and from a DBMS. The SAS Data Loader for Hadoop vApp client also accesses databases directly through JDBC in order to select the source or target schemas and tables to move.
On the Hadoop cluster, you must install the JDBC driver or drivers required by the DBMSs that users need to access. Follow the installation instructions of the JDBC driver vendor.
SAS Data Loader for Hadoop supports the Teradata and Oracle DBMSs directly. You can support additional databases by selecting Other in the Type option on the SAS Data Loader for Hadoop Database Configuration dialog box. For more information about the dialog box, see the SAS Data Loader for Hadoop: User’s Guide.
For Teradata and Oracle, SAS recommends that you download the following JDBC files from the vendor site:
JDBC Files
  • Oracle: ojdbc6.jar
  • Teradata: tdgssconfig.jar and terajdbc4.jar
Note: You must also download the Teradata connector JAR file that is matched to your cluster distribution (except in the case of MapR, which does not use a connector JAR file).
The JDBC and connector JAR files must be located in the OOZIE shared libs directory in HDFS, not in /var/lib/sqoop. The correct path is available from the oozie.service.WorkflowAppService.system.libpath property.
The default directories in the Hadoop file system are as follows:
  • Hortonworks Hadoop clusters: /user/oozie/share/lib/sqoop
  • Cloudera Hadoop clusters: /user/oozie/share/lib/sharelib<version>/sqoop
  • MapR Hadoop clusters: /oozie/share/lib/sqoop
The JDBC driver files must have, at a minimum, -rw-r--r-- (644) permissions.
After the JDBC drivers have been installed and configured along with SQOOP and OOZIE, you must refresh the Oozie sharelib, as follows, where oozie_url is the Oozie base URL:
oozie admin -oozie oozie_url -sharelibupdate
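The following is a minimal sketch of placing the Oracle JDBC driver in the Hortonworks default sharelib directory and refreshing the sharelib. The directory, Oozie URL, and host name are examples only; substitute the sharelib path and Oozie base URL for your cluster:
hadoop fs -put ojdbc6.jar /user/oozie/share/lib/sqoop/           # copy the driver into the Oozie sharelib in HDFS
hadoop fs -chmod 644 /user/oozie/share/lib/sqoop/ojdbc6.jar      # grant at least -rw-r--r-- permissions
oozie admin -oozie http://oozienode.example.com:11000/oozie -sharelibupdate   # refresh the sharelib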
SAS Data Loader for Hadoop users must also have the same version of the JDBC drivers on their client machines in the SASWorkspace\JDBCDrivers directory. Provide a copy of the JDBC drivers to SAS Data Loader for Hadoop users.

Configuration Files for SAS Data Loader for Hadoop

About the Configuration Files

The SAS Deployment Manager creates the following folders on your Hadoop cluster:
  • installation_path/conf
  • installation_path/lib
The conf folder contains the required XML and JSON files for the vApp client. The lib folder contains the required JAR files. You must make these folders available on all active instances of the vApp client by copying them to the shared folder (SASWorkspace\hadoop) on the client. All files in each folder are required for the vApp to connect to Hadoop successfully.
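For example, assuming that installation_path is /opt/sas/dataloader (an assumed path) and that the cluster node and user name shown below apply to your site, the folders can be copied with commands such as the following, run from the directory that contains SASWorkspace:
scp -r hadoopadmin@clusternode.example.com:/opt/sas/dataloader/conf SASWorkspace/hadoop/   # copy the XML and JSON files
scp -r hadoopadmin@clusternode.example.com:/opt/sas/dataloader/lib SASWorkspace/hadoop/    # copy the JAR files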

Cloudera and Hortonworks

If you are using a Cloudera distribution without Cloudera Manager or a Hortonworks distribution without Ambari, then you must manually create an additional JSON file named inventory.json. Both the filename and all values within the file are case-sensitive.
Note: You must add this file to the conf directory that is to be provided to the vApp user.
The following syntax must be followed precisely:
{
"hadoop_distro": "value",           /* either cloudera or hortonworks */
"hadoop_distro_rpm": "value",       /* either cloudera or hortonworks */
"hadoop_distro_version": "value",   /* either DH5 or HDP-2.2 or HDP-2.1 */
"hadoop_distro_yum": "value",       /* either cloudera or hortonworks */
"hive_hosts": [                                                                  
    "value"                         /* the HiveServer2 name, including domain name */
]
}
The following is an example of file content for Cloudera:
{
"hadoop_distro": "cloudera",
"hadoop_distro_rpm": "cloudera",
"hadoop_distro_version": "CDH5",
"hadoop_distro_yum": "cloudera",
"hive_hosts": [
  "dmmlax39.unx.sas.com"
] 
}
The following is an example of file content for Hortonworks:
{
"hadoop_distro": "hortonworks",
"hadoop_distro_rpm": "hortonworks",
"hadoop_distro_version": "HDP-2.2",
"hadoop_distro_yum": "hortonworks",
"hive_hosts": [
  "dmmlax04.unx.sas.com"
]
}

MapR

If you are using a MapR distribution, then you must manually create an additional JSON file named mapr-user.json (case-sensitive). For information about how to create this file, see MapR.

User IDs

Cloudera and Hortonworks

Your Cloudera and Hortonworks deployments can use Kerberos or another type of user authentication. If you are using Kerberos authentication, see Configuring Security for more information.
For clusters that do not use Kerberos, you must create one or more user IDs and enable certain permissions for the SAS Data Loader for Hadoop vApp user.
To configure user IDs, follow these steps:
  1. Choose one of the following options for user IDs:
    • Create one User ID for all vApp users.
      Note: Do not use the super user, which is typically hdfs.
    • Create one User ID for each vApp user.
  2. Create UNIX user IDs on all nodes of the cluster and assign them to a group (see the sketch that follows this list).
  3. Create the MapReduce staging directory in HDFS that is defined in the MapReduce configuration. The default is /users/myuser.
  4. Change the permissions and owner of the HDFS directory /users/myuser to match the UNIX user.
    Note: The user ID must have at least the following permissions:
    • Read, Write, and Delete permission for files in the HDFS directory (used for Oozie jobs)
    • Read, Write, and Delete permission for tables in HiveServer2
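The following is a minimal sketch of steps 2 through 4 for a single user named myuser in a group named hadoopusers (names, paths, and permission modes are examples only). Run the commands as a user with root and HDFS superuser privileges:
groupadd hadoopusers                                 # step 2: create the group on every node
useradd -g hadoopusers myuser                        # step 2: create the user on every node
hadoop fs -mkdir -p /users/myuser                    # step 3: create the staging directory in HDFS
hadoop fs -chown myuser:hadoopusers /users/myuser    # step 4: assign ownership to the UNIX user
hadoop fs -chmod 755 /users/myuser                   # step 4: grant read and write access for the owner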

MapR

MapR deployments do not support Kerberos.
For MapR deployments only, you must manually create a file named mapr-user.json (case-sensitive) that specifies the user information that the SAS Data Loader for Hadoop vApp requires in order to interact with the Hadoop cluster. You must supply a user name, user ID, and group ID in this file. The user name must be a valid user on the MapR cluster.
Note: You must add this file to the conf directory that is to be provided to the vApp user. For more information, see Configuration Files for SAS Data Loader for Hadoop.
To configure user IDs, follow these steps:
  1. Create one User ID for each vApp user.
  2. Create UNIX user IDs on all nodes of the cluster and assign them to a group.
  3. Create the mapr-user.json file containing the user ID information. You can obtain this information by logging on to a cluster node and running the id command (see the sketch that follows this list). You might create a file similar to the following:
    {
    "user_name"       : "myuser",
    "user_id"         : "2133",
    "user_group_id"   : "2133",
    "take_ownership"  : "true"
    }
  4. Manually add the mapr-user.json file to the conf directory that is to be provided to the vApp user.
    Note: To log on to the MapR Hadoop cluster with a different valid user ID, you must edit the information in the mapr-user.json file and in the User ID field of the SAS Data Loader for Hadoop Configuration dialog box. See User ID.
  5. Create a MapReduce staging MapR-FS directory defined in the MapReduce configuration. The default is /users/myuser.
  6. Change the permissions and owner of the MapR-FS directory /users/myuser to match the UNIX user.
    Note: The user ID must have at least the following permissions:
    • Read, Write, and Delete permission for files in the MapR-FS directory (used for Oozie jobs)
    • Read, Write, and Delete permission for tables in HiveServer2
  7. SAS Data Loader for Hadoop uses HiveServer2 as its source of tabular data. Ensure that the UNIX user has appropriate permissions on HDFS for the locations of the HiveServer2 tables on which the user is permitted to operate.
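The following is a minimal sketch of gathering the ID values for step 3 and completing steps 5 and 6, assuming the example user myuser shown above (names, paths, and permission modes are examples only):
id myuser                                        # step 3: look up the user ID and group ID on a cluster node
# example output: uid=2133(myuser) gid=2133(myuser) groups=2133(myuser)
hadoop fs -mkdir -p /users/myuser                # step 5: create the staging directory in MapR-FS
hadoop fs -chown myuser:myuser /users/myuser     # step 6: assign ownership to the UNIX user
hadoop fs -chmod 755 /users/myuser               # step 6: grant read and write access for the owner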

Configuration Values

You must provide the vApp user with values for fields in the SAS Data Loader for Hadoop Configuration dialog box. For more information about the SAS Data Loader for Hadoop Configuration dialog box, see the SAS Data Loader for Hadoop: vApp Deployment Guide. The fields are as follows:
Host
specifies the full host name of the machine on the cluster running the HiveServer2 server.
Port
specifies the number of the HiveServer2 server port on your Hadoop cluster.
  • For Cloudera, Hortonworks, and MapR, the HiveServer2 server port default is 10000.
User ID
specifies the Hadoop user account that you created on your Hadoop cluster, either one account for each vApp user or a single account for all vApp users.
Note:
  • For Cloudera and Hortonworks user IDs, see Cloudera and Hortonworks.
  • For MapR user IDs, the user ID information is supplied through the mapr-user.json file. For more information, see MapR.
Password
if your enterprise uses LDAP, you must supply the vApp user with the LDAP password. Otherwise, this field is typically blank.
Oozie URL
specifies the Oozie base URL. The URL is the property oozie.base.url in the file oozie-site.xml. The URL is similar to the following example: http://host_name:port_number/oozie/.
Confirm that the Oozie Web UI is enabled before providing the URL to the vApp user. If it is not, use the Oozie Web Console to enable it.
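Before distributing these values, you can verify them from a cluster node. The following is a hedged sketch that reuses the example HiveServer2 host from the Cloudera inventory.json sample, the default port, and the example user ID myuser; the oozie-site.xml location varies by distribution, so /etc/oozie/conf is an assumption:
# Verify that HiveServer2 answers on the host and port (beeline ships with Hive)
beeline -u "jdbc:hive2://dmmlax39.unx.sas.com:10000/default" -n myuser -e "show databases;"
# Look up the Oozie base URL property (configuration path is an example)
grep -A 2 "oozie.base.url" /etc/oozie/conf/oozie-site.xml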