In-Database Deployment Package for Hadoop

Prerequisites

Before you install and configure the in-database deployment package for Hadoop, ensure that the following prerequisites are met:
  • SAS Foundation and the SAS/ACCESS Interface to Hadoop are installed.
  • You have working knowledge of the Hadoop vendor distribution that you are using (for example, Cloudera or Hortonworks).
    You also need working knowledge of the Hadoop Distributed File System (HDFS), MapReduce 1, MapReduce 2, YARN, Hive, and HiveServer2 services. For more information, see the Apache website or the vendor’s website.
  • The SAS Scoring Accelerator for Hadoop requires a specific configuration file. For more information, see How to Merge Configuration File Properties.
  • The HDFS, MapReduce, YARN, and Hive services must be running on the Hadoop cluster (a sample check appears after this list).
  • The SAS Scoring Accelerator for Hadoop requires a specific version of the Hadoop distribution. For more information, see the SAS Foundation system requirements documentation for your operating environment.
  • You have root or sudo access, and your user account has Write permission to the root of HDFS.
  • You know the location of the MapReduce home.
  • You know the host name of the Hive server and the NameNode.
  • You understand and can verify your Hadoop user authentication.
  • You understand and can verify your security setup.
    If you are using Kerberos, you need the ability to get a Kerberos ticket (a sample kinit session appears after this list).
  • You have permission to restart the Hadoop MapReduce service.
  • To avoid SSH key mismatches during installation, add the following two options to the SSH config file in the user's home .ssh folder (for example, /root/.ssh/). In the example, nodes is a space-separated list of node host names.
    host nodes
       StrictHostKeyChecking no
       UserKnownHostsFile /dev/null
    For more details about the SSH config file, see the SSH documentation.
  • All machines in the cluster are set up to communicate with passwordless SSH. Verify that the nodes can access the node that you chose to be the master node by using SSH.
    Traditionally, public key authentication in Secure Shell (SSH) is used to meet the passwordless access requirement. SSH keys can be generated as in the following example.
    [root@raincloud1 .ssh]# ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    09:f3:d7:15:57:8a:dd:9c:df:e5:e8:1d:e7:ab:67:86 root@raincloud1
    
    Add the id_rsa.pub public key from each node to the authorized key file on the master node, /root/.ssh/authorized_keys (a sketch of one way to do this appears after this list).
    
    For Secure Mode Hadoop, GSSAPI with Kerberos is used as the passwordless SSH mechanism. GSSAPI with Kerberos not only meets the passwordless SSH requirements, but also supplies Hadoop with the credentials required for users to perform operations in HDFS with SAS LASR Analytic Server and SASHDAT files. Certain options must be set in the SSH daemon and SSH client configuration files; the steps that follow assume a default configuration of sshd.
    To configure passwordless SSH to use Kerberos, follow these steps:
    1. In the sshd_config file, set:
      GSSAPIAuthentication yes
    2. In the ssh_config file, set:
      Host *.domain.net
      GSSAPIAuthentication yes
      GSSAPIDelegateCredentials yes
       where domain.net is the domain name used by the machines in the cluster.
       Tip: Although you can specify Host *, this is not recommended because it allows GSSAPI authentication with any host name.
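One quick way to confirm that the Hadoop daemons are running, as referenced in the service prerequisite above, is the jps command on a cluster node. This is a minimal sketch; the exact set of processes depends on your distribution and on which services run on that node:

     [root@raincloud1 ~]# jps
     2881 NameNode
     3065 DataNode
     3172 ResourceManager
     3410 NodeManager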
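The following sketch shows one common way to perform the key distribution step referenced above and to verify passwordless access to the master node. The host names (raincloud1 as the master node, raincloud2 as another node) follow the earlier example and are placeholders for your own cluster:

     [root@raincloud2 .ssh]# ssh-copy-id -i /root/.ssh/id_rsa.pub root@raincloud1
     [root@raincloud2 .ssh]# ssh raincloud1 hostname
     raincloud1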
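For Kerberos deployments, the following sketch shows how to obtain a ticket and verify GSSAPI-based passwordless SSH after the configuration above is in place. The principal hdfsuser and the host and realm names are placeholders for your own environment:

     $ kinit hdfsuser@DOMAIN.NET
     Password for hdfsuser@DOMAIN.NET:
     $ ssh node1.domain.net hostname
     node1.domain.net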

Overview of the In-Database Deployment Package for Hadoop

This section describes how to install and configure the in-database deployment package for Hadoop (SAS Embedded Process).
The in-database deployment package for Hadoop must be installed and configured before you can perform the following tasks:
  • Run a scoring model in Hadoop Distributed File System (HDFS) using the SAS Scoring Accelerator for Hadoop.
  • Run DATA step scoring programs in Hadoop.
  • Run DS2 threaded programs in Hadoop using the SAS In-Database Code Accelerator for Hadoop.
  • Read data from and write data to HDFS in parallel for SAS High-Performance Analytics.
    Note: For deployments that use SAS High-Performance Deployment of Hadoop as the co-located data provider and access SASHDAT tables exclusively, SAS/ACCESS and the SAS Embedded Process are not needed.
    Note: If you are installing the SAS High-Performance Analytics environment, you must perform additional steps after you install the SAS Embedded Process. For more information, see SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.
  • Transform data in Hadoop and extract transformed data out of Hadoop for analysis in SAS with the SAS Data Loader for Hadoop. For more information, see SAS Data Loader for Hadoop: User’s Guide.
  • Perform data quality operations in Hadoop using the SAS Data Loader for Hadoop. For more information, see SAS Data Loader for Hadoop: User’s Guide.
The in-database deployment package for Hadoop includes the SAS Embedded Process and two SAS Hadoop MapReduce JAR files. The SAS Embedded Process is a SAS server process that runs within Hadoop to read and write data. The SAS Embedded Process contains macros, run-time libraries, and other software that is installed on your Hadoop system.
The SAS Embedded Process must be installed on all nodes that are capable of executing MapReduce 1 or MapReduce 2 tasks. For MapReduce 1, these are the nodes where a TaskTracker is running. For MapReduce 2, these are the nodes where a NodeManager is running. Usually, every DataNode has a YARN NodeManager or a MapReduce 1 TaskTracker running. By default, the SAS Embedded Process install script (sasep-servers.sh) discovers the cluster topology and installs the SAS Embedded Process on all DataNode nodes, including the host node from which you run the script (the Hadoop master NameNode), even if that node is not a DataNode. If you want to limit the nodes on which the SAS Embedded Process is installed, run the sasep-servers.sh script with the -host <hosts> option, as shown in the example that follows. The SAS Hadoop MapReduce JAR files must be installed on all nodes of the Hadoop cluster.
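For example, the following hypothetical invocation limits the installation to three nodes. The host names are placeholders, and the exact option syntax can vary by release, so check the script's help output before you run it:

     ./sasep-servers.sh -host "node1.domain.net node2.domain.net node3.domain.net"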