In-Database Deployment Package for Hadoop

Prerequisites

The following are required before you install and configure the in-database deployment package for Hadoop:
  • You have working knowledge of the Hadoop vendor distribution that you are using.
    You also need working knowledge of the Hadoop Distributed File System (HDFS), MapReduce 2, YARN, Hive, and HiveServer2 services. For more information, see the Apache website or the vendor’s website.
  • The HDFS, MapReduce, YARN, and Hive services must be running on the Hadoop cluster.
  • You have root or sudo access, and your user account has write permission to the root of HDFS.
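    One way to verify the HDFS write permission is to create and then delete a test file at the root; the file name here is only an example:
    [root@raincloud1 ~]# hdfs dfs -touchz /sasep-write-test
    [root@raincloud1 ~]# hdfs dfs -rm /sasep-write-test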
  • You know the location of the MapReduce home.
  • You know the host name of the Hive server and the NameNode.
  • You understand and can verify your Hadoop user authentication.
  • You understand and can verify your security setup.
    If you are using Kerberos, you need the ability to get a Kerberos ticket.
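    For example, you can obtain and inspect a ticket as follows; the principal name is only an example:
    $ kinit sasuser@DOMAIN.NET
    $ klist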
  • You have permission to restart the Hadoop MapReduce service.
  • To avoid SSH key mismatches during installation, add the following two options to the SSH config file in the user's home .ssh folder (for example, /root/.ssh/). Here, nodes is a space-separated list of node names.
    Host nodes
       StrictHostKeyChecking no
       UserKnownHostsFile /dev/null
    For more details about the SSH config file, see the SSH documentation.
  • All machines in the cluster are set up to communicate with passwordless SSH. Verify that the nodes can access the node that you chose as the master node by using SSH.
    Traditionally, public key authentication in Secure Shell (SSH) is used to meet the passwordless access requirement. SSH keys can be generated as in the following example.
    [root@raincloud1 .ssh]# ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    09:f3:d7:15:57:8a:dd:9c:df:e5:e8:1d:e7:ab:67:86 root@raincloud1
    
    Add the id_rsa.pub public key from each node to the master node's authorized keys file, /root/.ssh/authorized_keys, as shown in the sketch below.
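    A minimal sketch of distributing the keys, assuming ssh-copy-id is available, raincloud1 is the master node, and raincloud2 stands in for each of the other nodes:
    # On each node, append that node's public key to the master node's authorized_keys file:
    [root@raincloud2 .ssh]# ssh-copy-id -i /root/.ssh/id_rsa.pub root@raincloud1
    # Then confirm that the master node can be reached without a password prompt:
    [root@raincloud2 .ssh]# ssh raincloud1 hostname
    raincloud1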
    
For Secure Mode Hadoop, GSSAPI with Kerberos is used as the passwordless SSH mechanism. GSSAPI with Kerberos not only meets the passwordless SSH requirement but also supplies Hadoop with the credentials that users need to perform operations in HDFS with the SAS LASR Analytic Server and SASHDAT files. Certain options must be set in the SSH daemon and SSH client configuration files; the following steps assume a default configuration of sshd.
To configure passwordless SSH to use Kerberos, follow these steps:
  1. In the sshd_config file, set:
    GSSAPIAuthentication yes
  2. In the ssh_config file, set:
    Host *.domain.net
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
    
    where domain.net is the domain name used by the machines in the cluster.
Tip
Although you can specify Host *, this is not recommended because it allows GSSAPI authentication with any host name.
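
With a valid Kerberos ticket (see the kinit example in the prerequisites), you can confirm the configuration by opening an SSH session to a cluster node; no password prompt should appear. The host name is only an example:
    $ ssh node1.domain.net hostname
    node1.domain.net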

Overview of the In-Database Deployment Package for Hadoop

This section describes how to install and configure the in-database deployment package for Hadoop (SAS Embedded Process).
The in-database deployment package for Hadoop must be installed and configured before you can transform data in Hadoop and extract transformed data out of Hadoop for analysis.
The in-database deployment package for Hadoop includes the SAS Embedded Process and two SAS Hadoop MapReduce JAR files. The SAS Embedded Process is a SAS server process that runs within Hadoop to read and write data. The SAS Embedded Process contains macros, run-time libraries, and other software that is installed on your Hadoop system.
The SAS Embedded Process must be installed on all nodes that are capable of executing MapReduce 2 and YARN tasks, that is, nodes where a YARN NodeManager is running. Usually, every DataNode host runs a NodeManager. By default, the SAS Embedded Process install script (sasep-servers.sh) discovers the cluster topology and installs the SAS Embedded Process on all DataNode hosts, including the host from which you run the script (the Hadoop master NameNode), even if that host does not run a DataNode. To limit the nodes on which the SAS Embedded Process is installed, run the sasep-servers.sh script with the -host <hosts> option, as shown below. The SAS Hadoop MapReduce JAR files must be installed on all nodes of the Hadoop cluster.
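A hypothetical invocation that limits the installation to three worker nodes follows; the host names and the list format are examples only, so check the script's help output for the exact syntax in your release:
    [root@raincloud1 ~]# ./sasep-servers.sh -host "raincloud2 raincloud3 raincloud4"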