Hadoop Installation and Configuration

Hadoop Installation and Configuration Steps

Before you begin the Hadoop installation and configuration, review Prerequisites.
  1. If you are upgrading from or reinstalling a previous release, follow the instructions in Upgrading from or Reinstalling a Previous Version before installing the in-database deployment package.
  2. Move the SAS Embedded Process and SAS Hadoop MapReduce JAR file install scripts to the Hadoop master node (the NameNode).
    Note: The location where you transfer the install scripts becomes the SAS Embedded Process home and is referred to as SASEPHome throughout this chapter.
    Note: Both the SAS Embedded Process install script and the SAS Hadoop MapReduce JAR file install script must be transferred to the SASEPHome directory.
  3. Install the SAS Embedded Process and the SAS Hadoop MapReduce JAR files.
  4. If you want to use the SAS Scoring Accelerator for Hadoop, merge the properties of several configuration files into one configuration file.
  5. Copy the Hadoop core and common Hadoop JAR files to the client machine.
Note: If you are installing the SAS High-Performance Analytics environment, you must perform additional steps after you install the SAS Embedded Process. For more information, see SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.

Upgrading from or Reinstalling a Previous Version

To upgrade from or reinstall a previous version, follow these steps.
  1. If you are upgrading from SAS 9.3, follow these steps. If you are upgrading from SAS 9.4, start with Step 2.
    1. Stop the Hadoop SAS Embedded Process.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-stop.all.sh
      SASEPHome is the SAS Embedded Process home directory on the master node.
    2. Delete the Hadoop SAS Embedded Process from all nodes.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-delete.all.sh
    3. Verify that the sas.hadoop.ep.distribution-name.jar files have been deleted.
      The JAR files are located at HadoopHome/lib.
      For Cloudera, the JAR files are typically located here:
      /opt/cloudera/parcels/CDH/lib/hadoop/lib
      For Hortonworks, the JAR files are typically located here:
      /usr/lib/hadoop/lib
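      For example, assuming the Cloudera path shown above, this check should return no matches after a successful deletion:
      ls /opt/cloudera/parcels/CDH/lib/hadoop/lib/sas.hadoop.ep.*.jar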
    4. Continue with Step 3.
  2. If you are upgrading from SAS 9.4, follow these steps.
    1. Stop the Hadoop SAS Embedded Process.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.*/bin/sasep-servers.sh
         -stop -hostfile host-list-filename | -host "host-list"
      SASEPHome is the SAS Embedded Process home directory on the master node.
      For more information, see SASEP-SERVERS.SH Script.
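      For example, this invocation stops the SAS Embedded Process on every node listed in a host file (the host file path is an assumption for illustration):
      ./sasep-servers.sh -stop -hostfile /etc/hadoop/conf/slaves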
    2. Remove the SAS Embedded Process from all nodes.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.*/bin/sasep-servers.sh
         -remove -hostfile host-list-filename | -host "host-list"
         -mrhome dir
      Note: This step ensures that all old SAS Hadoop MapReduce JAR files are removed.
      For more information, see SASEP-SERVERS.SH Script.
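      For example, this invocation removes the SAS Embedded Process from the nodes in a host file (the host file path and the MapReduce home are assumptions for illustration):
      ./sasep-servers.sh -remove -hostfile /etc/hadoop/conf/slaves
         -mrhome /usr/lib/hadoop-mapreduce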
    3. Verify that the sas.hadoop.ep.apache*.jar files have been deleted.
      The JAR files are located at HadoopHome/lib.
      For Cloudera, the JAR files are typically located here:
      /opt/cloudera/parcels/CDH/lib/hadoop/lib
      For Hortonworks, the JAR files are typically located here:
      /usr/lib/hadoop/lib
    4. Manually remove the SAS Embedded Process directories and files on the node from which you ran the script.
      The sasep-servers.sh -remove script removes the files everywhere except on the node from which you ran the script. The script displays instructions that are similar to the following example.
      localhost WARN: Apparently, you are trying to uninstall SAS Embedded Process
         for Hadoop from the local node. The binary files located at
         local_node/SAS/SASTKInDatabaseServerForHadoop and
         local_node/SAS/SASACCESStoHadoopMapReduceJARFiles will not be removed.
      localhost WARN: The init script will be removed from /etc/init.d and the
         SAS Map Reduce JAR files will be removed from /usr/lib/hadoop-mapreduce/lib.
      localhost WARN: The binary files located at local_node/SAS
         should be removed manually.
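      For example, these commands remove the remaining binary files on the local node, assuming SASEPHome is /sasep (verify the path before deleting anything):
      rm -rf /sasep/SAS/SASTKInDatabaseServerForHadoop
      rm -rf /sasep/SAS/SASACCESStoHadoopMapReduceJARFiles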
      
  3. Continue the installation process.

Moving the SAS Embedded Process and SAS Hadoop MapReduce JAR File Install Scripts

Creating the SAS Embedded Process Directory

Before you can install the SAS Embedded Process and the SAS Hadoop MapReduce JAR files, you must move the SAS Embedded Process and SAS Hadoop MapReduce JAR file install scripts to a directory on the Hadoop master node (the NameNode).
Create a new directory that is not part of an existing directory structure, such as /sasep.
This path will be created on each node in the Hadoop cluster during the SAS Embedded Process installation. Do not use existing system directories such as /opt or /usr. This new directory becomes the SAS Embedded Process home and is referred to as SASEPHome throughout this chapter.

Moving the SAS Embedded Process Install Script

The SAS Embedded Process install script is contained in a self-extracting archive file named tkindbsrv-9.42-n_lax.sh. n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1. The self-extracting archive file is located in the SAS-installation-directory/SASTKInDatabaseServer/9.4/HadooponLinuxx64/ directory.
Using a method of your choice, transfer the SAS Embedded Process install script to your Hadoop master node.
This example uses secure copy, and SASEPHome is the location where you want to install the SAS Embedded Process.
scp tkindbsrv-9.42-n_lax.sh username@hadoop:/SASEPHome
Note: Both the SAS Embedded Process install script and the SAS Hadoop MapReduce JAR file install script must be transferred to the SASEPHome directory.

Moving the SAS Hadoop MapReduce JAR File Install Script

The SAS Hadoop MapReduce JAR file install script is contained in a self-extracting archive file named hadoopmrjars-9.42-n_lax.sh. n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1. The self-extracting archive file is located in the SAS-installation-directory/SASACCESStoHadoopMapReduceJARFiles/9.42 directory.
Using a method of your choice, transfer the SAS Hadoop MapReduce JAR file install script to your Hadoop master node.
This example uses secure copy, and SASEPHome is the location where you want to install the SAS Hadoop MapReduce JAR files.
scp hadoopmrjars-9.42-n_lax.sh username@hadoop:/SASEPHome
Note: Both the SAS Embedded Process install script and the SAS Hadoop MapReduce JAR file install script must be transferred to the SASEPHome directory.

Installing the SAS Embedded Process and SAS Hadoop MapReduce JAR Files

To install the SAS Embedded Process, follow these steps.
Note: Permissions are needed to install the SAS Embedded Process and SAS Hadoop MapReduce JAR files. For more information, see Hadoop Permissions.
  1. Log on to the server using SSH as root with sudo access.
    ssh username@serverhostname
    sudo su - root
  2. Change to the directory on your Hadoop master node where you want the SAS Embedded Process installed.
    cd /SASEPHome
    SASEPHome is the same location to which you copied the self-extracting archive file. For more information, see Moving the SAS Embedded Process Install Script.
    Note: Before continuing with the next step, ensure that each self-extracting archive file has Execute permission.
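    For example, you can grant Execute permission to both install scripts at once:
    chmod +x tkindbsrv-9.42-*_lax.sh hadoopmrjars-9.42-*_lax.sh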
  3. Use the following script to unpack the tkindbsrv-9.42-n_lax.sh file.
    ./tkindbsrv-9.42-n_lax.sh
    
    n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1.
    Note: If you unpack the file in the wrong directory, you can move the unpacked directories afterward.
    After this script runs and the files are unpacked, the script creates the following directory structure, where SASEPHome is the directory to which you copied the self-extracting archive file.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/misc
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/sasexe
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/utilities
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/build
    The content of the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin directory should look similar to this.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sas.ep4hadoop.template
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sasep-servers.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sasep-common.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sasep-server-start.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sasep-server-status.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sasep-server-stop.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/InstallTKIndbsrv.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/MANIFEST.MF
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/qkbpush.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/sas.tools.qkb.hadoop.jar
  4. Use this command to unpack the SAS Hadoop MapReduce JAR files.
    ./hadoopmrjars-9.42-n_lax.sh
    After the script runs, it creates the following directory and unpacks these files into it.
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/ep-config.xml
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/
       sas.hadoop.ep.apache023.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/
        sas.hadoop.ep.apache023.nls.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/
       sas.hadoop.ep.apache121.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/
       sas.hadoop.ep.apache121.nls.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/
       sas.hadoop.ep.apache205.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.42/lib/
       sas.hadoop.ep.apache205.nls.jar
  5. Use the sasep-servers.sh -add script to deploy the SAS Embedded Process installation across all nodes. The SAS Embedded Process is installed as a Linux service.
    Tip
    There are many options available when installing the SAS Embedded Process. We recommend that you review the script syntax before running it. For more information, see SASEP-SERVERS.SH Script.
    Note: If you are running on a cluster with Kerberos, complete both steps 1 and 2. If you are not running with Kerberos, complete only step 2.
    1. If you are running on a cluster with Kerberos, you must kinit the HDFS user.
      sudo su - root
      su - hdfs | hdfs-userid
      kinit -kt keytab-file-location user-id
      exit
      Here is an example:
      sudo su - root
      su - hdfs
      kinit -kt hdfs.keytab hdfs
      exit
      Note: The default HDFS user is hdfs. You can specify a different user ID with the -hdfsuser argument when you run the sasep-servers.sh -add script.
      Note: If you are running on a cluster with Kerberos, a keytab is required for the -hdfsuser running the sasep-servers.sh -add script.
      Note: You can run klist while you are running as the -hdfsuser user to check the status of your Kerberos ticket on the server. Here is an example:
      klist
      Ticket cache: FILE:/tmp/krb5cc_493
      Default principal: hdfs@HOST.COMPANY.COM
      
      Valid starting    Expires           Service principal
      06/20/14 09:51:26 06/27/14 09:51:26 krbtgt/HOST.COMPANY.COM@HOST.COMPANY.COM
           renew until 06/22/14 09:51:26
    2. Run the sasep-servers.sh script. Review all of the information in this step before running the script.
      cd SASEPHOME/SAS/SASTKInDatabaseServerForHadoop/9.42/bin
      ./sasep-servers.sh -add
    During the install process, the script asks whether you want to start the SAS Embedded Process. If you choose Y or y, the SAS Embedded Process is started on all nodes after the install is complete. If you choose N or n, you can start the SAS Embedded Process later by running ./sasep-servers.sh -start.
    Note: When you run the sasep-servers.sh -add script, a user and group named sasep is created. You can specify a different user and group name with the -epuser and -epgroup arguments when you run the sasep-servers.sh -add script.
    Note: The sasep-servers.sh script can be run from any location. You can also add its location to the PATH environment variable.
    Tip
    Although you can install the SAS Embedded Process in multiple locations, the best practice is to install only one instance. Only one version of the SASEP JAR files will be installed in your HadoopHome/lib directory.
    Note: The SAS Embedded Process must be installed on all nodes that are capable of executing either MapReduce 1 or MapReduce 2 tasks: for MapReduce 1, the nodes where a TaskTracker is running; for MapReduce 2, the nodes where a NodeManager is running. Usually, every DataNode has a YARN NodeManager or a MapReduce 1 TaskTracker running. By default, the SAS Embedded Process install script (sasep-servers.sh) discovers the cluster topology and installs the SAS Embedded Process on all DataNode nodes, including the host node from which you run the script (the Hadoop master NameNode), even if a DataNode is not present on that host. If you want to limit the nodes on which the SAS Embedded Process is installed, run the sasep-servers.sh script with the -host <hosts> option, as shown in the following example.
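    For example, this invocation limits the installation to three specific nodes (the host names are hypothetical):
    ./sasep-servers.sh -add -host "node1.company.com node2.company.com node3.company.com"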
    Note: If you install the SAS Embedded Process on a large cluster, the SSHD daemon might reach the maximum number of concurrent connections. The ssh_exchange_identification: Connection closed by remote host SSHD error might occur. To work around the problem, edit the /etc/ssh/sshd_config file, change the MaxStartups option to the number that accommodates your cluster, and save the file. Then, reload the SSHD daemon by running the /etc/init.d/sshd reload command.
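    For example, you might raise the limit in /etc/ssh/sshd_config (the value 100 is an assumption; size it to your cluster):
    MaxStartups 100
    Then reload the SSHD daemon:
    /etc/init.d/sshd reload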
  6. Verify that the SAS Embedded Process is installed and running. Change directories and then run the sasep-servers.sh script with the -status option.
    cd SASEPHOME/SAS/SASTKInDatabaseServerForHadoop/9.42/bin
    ./sasep-servers.sh -status
    This command returns the status of the SAS Embedded Process running on each node of the Hadoop cluster. Verify that the SAS Embedded Process home directory is correct on all the nodes.
    Note: The sasep-servers.sh -status script will not run successfully if the SAS Embedded Process is not installed.
  7. Verify that the sas.hadoop.ep.apache*.jar files are now in place on all nodes.
    The JAR files are located at HadoopHome/lib.
    For Cloudera, the JAR files are typically located here:
    /opt/cloudera/parcels/CDH/lib/hadoop/lib
    For Hortonworks, the JAR files are typically located here:
    /usr/lib/hadoop/lib
  8. Restart the Hadoop YARN or MapReduce service.
    This enables the cluster to reload the SAS Hadoop JAR files (sas.hadoop.ep.*.jar).
    Note: It is preferable to restart the service by using Cloudera Manager or Hortonworks Ambari.
  9. Verify that an init.d service with a sas.ep4hadoop file was created in the following directory.
    /etc/init.d
    View the sas.ep4hadoop file and verify that the SAS Embedded Process home directory is correct.
    The init.d service is configured to start at run levels 3 and 5.
    Note: The SAS Embedded Process needs to run on all nodes in the Hadoop cluster.
  10. Verify that the configuration file, ep-config.xml, was written to the HDFS file system.
    hadoop fs -ls /sas/ep/config
    Note: If you are running on a cluster with Kerberos, you need a Kerberos ticket. If not, you can use the WebHDFS browser.
    Note: The /sas/ep/config directory is created automatically when you run the install script.
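    For example, on a cluster without Kerberos, you can also verify the file through the WebHDFS REST API (the NameNode host name is an assumption; 50070 is a common default HTTP port):
    curl -i "http://namenode.company.com:50070/webhdfs/v1/sas/ep/config?op=LISTSTATUS"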

Merging Configuration File Properties

Requirements for Configuration File Properties

The SAS Scoring Accelerator for Hadoop requires that some specific configuration files be merged into a single configuration file. This configuration file is used by the %INDHD_RUN_MODEL macro. For more information about the %INDHD_RUN_MODEL macro, see the SAS In-Database Products: User's Guide.
Note: In addition to the SAS Scoring Accelerator for Hadoop, a merged configuration file is also required for using PROC HADOOP and the FILENAME Statement Hadoop Access Method with Base SAS.
Note: A merged configuration file is not required for the following products:
  • SAS/ACCESS Interface to Hadoop
  • Scalable Performance Data Engine (SPD Engine)
  • High-Performance Analytics
  • SAS Code Accelerator for Hadoop
In the August 2014 release, a new environment variable, SAS_HADOOP_CONFIG_PATH, was created to replace the use of a merged configuration file for the preceding list of products. The configuration files are copied to a location on the client machine, and the SAS_HADOOP_CONFIG_PATH variable is set to that location.
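For example, after copying the configuration files to a directory on the client machine, you might set the variable in the shell before starting SAS (the directory path is an assumption):
export SAS_HADOOP_CONFIG_PATH=/opt/sas/hadoopconfig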
Note: The configuration file properties that must be merged depend on which version of Hadoop you are using. The following topics are for Cloudera CDH4.x and 5.x and Hortonworks 1.3.2 and 2.x. If you are using a different version of Cloudera or Hortonworks or another vendor’s Hadoop distribution, you must use a comparable version of these configuration files. Otherwise, the installation of the SAS Embedded Process will fail.

How to Merge Configuration File Properties

To merge configuration file properties, follow these steps:
  1. Retrieve the Hadoop configuration files from this location on your cluster.
    /etc/hadoop/conf
  2. Using a method of your choice, concatenate the required configuration files into one configuration file. (A sketch of one way to do this follows these steps.)
    1. If you are using MapReduce 1, merge the properties from the Hadoop core (core-site.xml), Hadoop HDFS (hdfs-site.xml), and MapReduce (mapred-site.xml) configuration files into a single configuration file.
    2. If you are using MapReduce 2 or YARN, merge the properties from the Hadoop core (core-site.xml), Hadoop HDFS (hdfs-site.xml), MapReduce (mapred-site.xml), and YARN (yarn-site.xml) configuration files into a single configuration file.
    Note: The merged configuration file must have one beginning <configuration> tag and one ending </configuration> tag. Only properties should exist between the <configuration>…</configuration> tags. Here is a Cloudera example:
    <?xml version="1.0" encoding="UTF-8"?>
    
    <configuration>
      <property>
        <name>hive.metastore.local</name>
        <value>false</value>
      </property>
    
    <!-- lines omitted for sake of brevity -->
    
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value></value>
      </property>
    </configuration>
    
    Note: For information about setting a configuration property to avoid out of memory exceptions, see Configuring the Number of Proactive Reads in the SAS Embedded Process.
  3. Save the configuration file to a location of your choosing.
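As referenced in Step 2, here is a minimal sketch of one way to perform the merge with standard shell tools. It assumes the configuration files are in /etc/hadoop/conf, that each file's <configuration> tags appear on their own lines, and that no property is defined in more than one file (the sketch does not de-duplicate properties). Omit yarn-site.xml if you are using MapReduce 1.
{
  echo '<?xml version="1.0" encoding="UTF-8"?>'
  echo '<configuration>'
  for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
    # Keep only the property elements: drop the XML declaration
    # and the configuration tags from each file.
    sed -e '/<?xml/d' -e '/<configuration/d' -e '/<\/configuration>/d' "/etc/hadoop/conf/$f"
  done
  echo '</configuration>'
} > merged-config.xml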

Configuring the Number of Proactive Reads in the SAS Embedded Process

The SAS Embedded Process for Hadoop implements a mechanism that proactively reads and caches records from the input files while the DS2 program is processing the current block of records. By default, the number of proactive reads is set to 100. On files with very large records, the default value of 100 might cause memory exhaustion in the Java Virtual Machine in which the SAS Embedded Process MapReduce task is running.
Tip
Java Virtual Machine out of memory exceptions might be avoided if the number of proactive reads is reduced.
To avoid out of memory exceptions, we recommend setting the following property in the merged configuration file:
<property>
   <name>sas.ep.superreader.proactive.reader.capacity</name>
   <value>10</value>
</property>

Copying Hadoop JAR Files to the Client Machine

For SAS components that interface with Hadoop, a specific set of common and core Hadoop JAR files must be in one location on the client machine. Examples of those components are the SAS Scoring Accelerator and SAS High-Performance Analytics.
When you run the sasep-servers.sh -add script to install the SAS Embedded Process, the script detects the Hadoop distribution and creates a HADOOP_JARS.zip file in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.42/bin/ directory. This file contains the common and core Hadoop JAR files that are required for the SAS Embedded Process. For more information, see Installing the SAS Embedded Process and SAS Hadoop MapReduce JAR Files.
To get the Hadoop JAR files on your client machine, follow these steps:
  1. Move the HADOOP_JARS.zip file to a directory on your client machine and unzip the file.
    unzip HADOOP_JARS.zip
  2. Set the SAS_HADOOP_JAR_PATH environment variable to point to the directory that contains the core and common Hadoop JAR files.
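For example (the directory path is an assumption):
unzip HADOOP_JARS.zip -d /opt/sas/hadoopjars
export SAS_HADOOP_JAR_PATH=/opt/sas/hadoopjars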
Note: You can run the sasep-servers.sh -getjars script at any time to create a new ZIP file and refresh the JAR file list.
Note: The MapReduce 1 and MapReduce 2 JAR files cannot be on the same Java classpath.
Note: The JAR files in the SAS_HADOOP_JAR_PATH directory must match the Hadoop server to which SAS is connected. If multiple Hadoop servers run different Hadoop versions, create and populate a separate directory of version-specific Hadoop JAR files for each version, and set SAS_HADOOP_JAR_PATH dynamically based on the target Hadoop server for each SAS job or SAS session. One way to do this is to create a wrapper script for each Hadoop version that sets SAS_HADOOP_JAR_PATH appropriately and then invokes SAS, so that SAS picks up the JAR files that match the target Hadoop server. Upgrading your Hadoop server might leave multiple Hadoop versions active at once; the same multi-version approach applies. An example wrapper script follows.
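The following hypothetical wrapper sets SAS_HADOOP_JAR_PATH for one specific Hadoop version and then starts SAS:
#!/bin/sh
# Wrapper for jobs that connect to the CDH 5 cluster.
# Both paths below are assumptions; adjust for your site.
export SAS_HADOOP_JAR_PATH=/opt/sas/hadoopjars/cdh5
exec /opt/sas/SASFoundation/9.4/sas "$@"
A similar script pointing at a different JAR directory would serve each additional Hadoop version.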