In-Database Deployment Package for Hadoop

Prerequisites

The following prerequisites are required before you install and configure the in-database deployment package for Hadoop:
  • SAS Foundation and the SAS/ACCESS Interface to Hadoop are installed.
  • You have working knowledge of the Apache Hadoop software framework and the Hadoop vendor distribution that you are using (for example, Cloudera or Hortonworks). You also need working knowledge of HDFS, MapReduce 1, MapReduce 2, YARN, Hive, and HiveServer2 services. For more information, see the Apache website or the vendor’s website.
  • In-database processing for Hadoop requires a specific version of the Hadoop environment and specific JAR files. For more information, see Co-Locating Hadoop JAR Files and Merging Configuration File Properties.
  • The following services must be running on the Hadoop cluster:
    • HDFS
    • MapReduce 1, or MapReduce 2 and YARN
    • Hive or HiveServer2
  • Cloudera CDH 4.2.2, Cloudera CDH 4.3.1 or later, or Hortonworks 1.3 or later is installed.
  • You have root or sudo access. Your user name has Write permission to the root of HDFS.
  • You know the location of the MapReduce home.
  • You know the host names of all nodes in the Hadoop cluster, or you have a file that lists them.
  • You have permission to restart the Hadoop MapReduce service.
  • To avoid SSH key mismatches during installation, add the following two options to the SSH config file in the user's home .ssh folder (for example, /root/.ssh/config). In the example, nodes is a list of node names separated by spaces.
    host nodes
       StrictHostKeyChecking no
       UserKnownHostsFile /dev/null
    For more details about the SSH config file, see the SSH documentation.
  • All machines in the cluster are set up to communicate with passwordless SSH. Verify that the nodes can access the node that you chose to be the master node by using SSH.
    SSH keys can be generated as shown in the following example.
    [root@raincloud1 .ssh]# ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    09:f3:d7:15:57:8a:dd:9c:df:e5:e8:1d:e7:ab:67:86 root@raincloud1
    
    Add the id_rsa.pub public key from each node to the master node's authorized keys file, /root/.ssh/authorized_keys, as shown in the sketch below.
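    A minimal sketch of distributing and verifying the keys follows; the host name masternode is illustrative.
    # On each node, append that node's public key to the master node's
    # authorized_keys file (ssh-copy-id prompts for the password once):
    ssh-copy-id -i /root/.ssh/id_rsa.pub root@masternode
    # Verify passwordless access from this node to the master node:
    ssh root@masternode hostname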
    

Overview of the In-Database Deployment Package for Hadoop

This section describes how to install and configure the in-database deployment package for Hadoop (SAS Embedded Process).
The in-database deployment package for Hadoop must be installed and configured before you can perform the following tasks:
  • Run a scoring model in Hadoop Distributed File System (HDFS).
  • Run DATA step scoring programs in Hadoop.
  • Read and write data to HDFS in parallel for SAS High-Performance Analytics.
    Note: For deployments that use SAS High-Performance Deployment of Hadoop for the co-located data provider, and access SASHDAT tables exclusively, SAS/ACCESS and SAS Embedded Process are not needed.
    Note: If you are installing the SAS High-Performance Analytics environment, you must perform additional steps after you install the SAS Embedded Process. For more information, see SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.
The in-database deployment package for Hadoop includes the SAS Embedded Process and two Hadoop MapReduce JAR files. The SAS Embedded Process is a SAS server process that runs within Hadoop to read and write data. The SAS Embedded Process contains macros, run-time libraries, and other software that is installed on your Hadoop system.
The SAS Embedded Process must be installed on all nodes capable of executing either MapReduce 1 or MapReduce 2 and YARN tasks. The two Hadoop MapReduce JAR files must be installed on all nodes of a Hadoop cluster.

Hadoop Installation and Configuration Steps

Before you begin the Hadoop installation and configuration, review the Prerequisites.
  1. If you are upgrading from or reinstalling a previous release, follow the instructions in Upgrading from or Reinstalling a Previous Version before installing the in-database deployment package.
  2. Move the SAS Embedded Process and Hadoop MapReduce JAR file install scripts to the Hadoop master node. For more information, see Moving the SAS Embedded Process and Hadoop MapReduce JAR File Install Scripts.
    Note: Both the SAS Embedded Process install script and the Hadoop MapReduce JAR file install script must be transferred to the same directory.
    Note: The location where you transfer the install scripts becomes the SAS Embedded Process home.
  3. Install the SAS Embedded Process and the Hadoop MapReduce JAR files. For more information, see Installing the SAS Embedded Process and Hadoop MapReduce JAR Files.
Note: If you are installing the SAS High-Performance Analytics environment, you must perform additional steps after you install the SAS Embedded Process. For more information, see SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide.

Upgrading from or Reinstalling a Previous Version

To upgrade from or reinstall a previous version, follow these steps.
  1. If you are upgrading from SAS 9.3, follow these steps. If you are upgrading from SAS 9.4, start with Step 2.
    1. Stop the Hadoop SAS Embedded Process.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-stop.all.sh
      SASEPHome is the directory on the master node where you installed the SAS Embedded Process.
    2. Delete the Hadoop SAS Embedded Process from all nodes.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-delete.all.sh
    3. Continue with Step 3.
  2. If you are upgrading from SAS 9.4, follow these steps.
    1. Stop the Hadoop SAS Embedded Process.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41/bin/sasep-servers.sh
         -stop -hostfile host-list-filename | -host <">host-list<">
      SASEPHome is the directory on the master node where you installed the SAS Embedded Process.
      For more information, see SASEP-SERVERS.SH Script.
    2. Remove the SAS Embedded Process from all nodes.
      SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41/bin/sasep-servers.sh
         -remove -hostfile host-list-filename | -host <">host-list<">
         -mrhome dir
      Note: This step ensures that all old Hadoop MapReduce JAR files are removed.
      For more information, see SASEP-SERVERS.SH Script.
  3. Reinstall the SAS Embedded Process and the Hadoop MapReduce JAR files by running the sasep-servers.sh file. For more information, see Installing the SAS Embedded Process and Hadoop MapReduce JAR Files.
    Note: Starting in the December 2013 release, the SAS Embedded Process and Hadoop MapReduce JAR files are installed alongside the previous version of these files. Use the latest version of these files when performing the upgrade or reinstallation.

Moving the SAS Embedded Process and Hadoop MapReduce JAR File Install Scripts

Moving the SAS Embedded Process Install Script

The SAS Embedded Process install script is contained in a self-extracting archive file named tkindbsrv-9.41_M1-n_lax.sh. n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1. The self-extracting archive file is located in the SAS-installation-directory/SASTKInDatabaseServer/9.4/HadooponLinuxx64/ directory.
Using a method of your choice, transfer the SAS Embedded Process install script to your Hadoop master node.
This example uses secure copy, and SASEPHome is the location where you want to install the SAS Embedded Process.
scp tkindbsrv-9.41_M1-n_lax.sh username@hadoop:/SASEPHome
Note: The location where you transfer the install script becomes the SAS Embedded Process home.
Note: Both the SAS Embedded Process install script and the Hadoop MapReduce JAR file install script must be transferred to the same directory.

Moving the Hadoop MapReduce JAR File Install Script

The Hadoop MapReduce JAR file install script is contained in a self-extracting archive file named hadoopmrjars-9.41_M1-n_lax.sh. n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1. The self-extracting archive file is located in the SAS-installation-directory/SASACCESStoHadoopMapReduceJARFiles/9.41 directory.
Using a method of your choice, transfer the Hadoop MapReduce JAR file install script to your Hadoop master node.
This example uses secure copy, and SASEPHome is the location where you want to install the SAS Embedded Process.
scp hadoopmrjars-9.41_M1-n_lax.sh username@hadoop:/SASEPHome
Note: Both the SAS Embedded Process install script and the Hadoop MapReduce JAR file install script must be transferred to the same directory.

Installing the SAS Embedded Process and Hadoop MapReduce JAR Files

To install the SAS Embedded Process, follow these steps.
Note: Permissions are needed to install the SAS Embedded Process and Hadoop MapReduce JAR files. For more information, see Hadoop Permissions.
  1. On your Hadoop master node, change to the directory where you want the SAS Embedded Process installed.
    cd /SASEPHome
    SASEPHome is the same location to which you copied the self-extracting archive file. For more information, see Moving the SAS Embedded Process and Hadoop MapReduce JAR File Install Scripts.
  2. Use the following command to unpack the tkindbsrv-9.41_M1-n_lax.sh file.
    ./tkindbsrv-9.41_M1-n_lax.sh
    
    n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1.
    Note: If you unpack the file in the wrong directory, you can move the unpacked directory afterward.
    After this script is run and the files are unpacked, the script creates the following directory structure, where SASEPHome is the directory from Step 1.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/misc
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/sasexe
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/utilities
    
    The content of the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin directory should look similar to this.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/sasep-servers.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/sasep-common.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/sasep-server-start.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/sasep-server-status.sh
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/sasep-server-stop.sh
    
  3. Use this command to unpack the Hadoop MapReduce JAR files.
    ./hadoopmrjars-9.41_M1-1_lax.sh
    After the script runs, it creates the following directory and unpacks these files into it.
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/ep-config.xml
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/
       sas.hadoop.ep.apache023.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/
        sas.hadoop.ep.apache023.nls.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/
       sas.hadoop.ep.apache121.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/
       sas.hadoop.ep.apache121.nls.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/
       sas.hadoop.ep.apache205.jar
    SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.41_M1/lib/
       sas.hadoop.ep.apache205.nls.jar
  4. Use the sasep-servers.sh script to deploy the SAS Embedded Process installation across all nodes. This script also detects the Hadoop version and distributes the correct Hadoop MapReduce JAR files. The SAS Embedded Process is installed as a Linux service.
    cd SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin
    ./sasep-servers.sh -add 
       -hostfile host-list-filename | -host "host-list"
       -epscript path-to-ep-install-script 
       -mrscript path-to-mr-install-script
       -mrhome path-to-mr-home
    For more information, see SASEP-SERVERS.SH Script.
    Note: If you install the SAS Embedded Process on a large cluster, the SSHD daemon might reach its maximum number of concurrent connections, and the SSHD error ssh_exchange_identification: Connection closed by remote host might occur. Follow these steps to work around the problem:
    1. Edit the /etc/ssh/sshd_config file and change the MaxStartups option to a number that accommodates your cluster (see the example below).
    2. Save the file and reload the SSHD daemon by running the /etc/init.d/sshd reload command.
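    For example, this illustrative sshd_config entry allows up to 1000 concurrent unauthenticated connections; size the value to your cluster:
    MaxStartups 1000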
    Note: The sasep-servers.sh script can be run from any location. You can also add its location to the PATH environment variable.
    Note: Although you can install the SAS Embedded Process in multiple locations, the best practice is to install only one instance.
    Note: The SAS Embedded Process runs on all the nodes that are capable of running a MapReduce task. In some instances, the node that you chose to be the master node can also serve as a MapReduce task node.
  5. Copy the sas.hadoop.ep.apache023.jar and sas.hadoop.ep.apache023.nls.jar files to the SAS_HADOOP_JAR_PATH directory on the client side.
  6. If this is the first install of the SAS Embedded Process, a restart of the Hadoop MapReduce service is required.
  7. Verify that the SAS Embedded Process is installed and running. Change directories and then run the sasep-servers.sh script with the -status option.
    cd SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin
    ./sasep-servers.sh -status 
       -hostfile host-list-filename | -host "host-list"
    This command returns the status of the SAS Embedded Process running on each node of the Hadoop cluster.
  8. Verify that an init.d service with a sas.ep4hadoop file was created in the following directory.
    /etc/init.d/sas.ep4hadoop
    View the sas.ep4hadoop file and verify that the SAS Embedded Process home directory is correct.
    The init.d service is configured to start at run levels 3 and 5.
    Note: The SAS Embedded Process Linux service runs only on the nodes that you chose during installation.
  9. Verify that the SAS Embedded Process home directory is correct on all the nodes.
  10. Verify that the Hadoop MapReduce JAR files were installed in the following directories.
    map-reduce-home/lib/sas.hadoop.ep.apacheversion.jar
    map-reduce-home/lib/sas.hadoop.ep.apacheversion.nls.jar
  11. Verify that configuration files were written to the HDFS file system.
    hadoop fs -ls /sas/ep/config
    Note: The /sas/ep/config directory is created automatically when you run the install script.

Co-Locating Hadoop JAR Files and Merging Configuration File Properties

Requirements for Hadoop JAR Files and Configuration File Properties

For SAS components that interface with Hadoop, a specific set of JAR files must be in one location on the client machine, and properties from some specific configuration files must be merged into a single configuration file.
Note: Which JAR files need to be co-located and which configuration file properties need to be merged depend on which version of Hadoop you are using. The JAR files in the following topics are for Cloudera CDH4.2.2 and CDH4.3.1, and Hortonworks 1.3.2. If you are using a later version of Cloudera or Hortonworks or another vendor’s Hadoop distribution, you must use a comparable version of these JAR files. Otherwise, the installation of the SAS Embedded Process will fail.

Co-Locating JAR Files and Merging Configuration File Properties for Cloudera 4.2.2

If you are using Cloudera CDH 4.2.2, follow these steps to co-locate the JAR files and merge configuration file properties.
  1. Copy the following common and core JAR files into a directory of your choosing on your client machine. For example, you might want to name this directory comcore.
    activation-1.1.jar
    asm-3.2.jar
    avro-1.7.3.jar
    commons-beanutils-1.7.0.jar
    commons-beanutils-core-1.8.0.jar
    commons-cli-1.2.jar
    commons-codec-1.4.jar
    commons-collections-3.2.1.jar
    commons-configuration-1.6.jar
    commons-digester-1.8.jar
    commons-el-1.0.jar
    commons-httpclient-3.1.jar
    commons-io-2.1.jar
    commons-lang-2.5.jar
    commons-logging-1.1.1.jar
    commons-math-2.1.jar
    commons-net-3.1.jar
    guava-11.0.2.jar
    hadoop-annotations-2.0.0-cdh4.2.2.jar
    hadoop-auth-2.0.0-cdh4.2.2.jar
    hadoop-common-2.0.0-cdh4.2.2.jar
    hadoop-hdfs-2.0.0-cdh4.2.2.jar
    hadoop-yarn-api-2.0.0-cdh4.2.2.jar
    hadoop-yarn-client-2.0.0-cdh4.2.2.jar
    hadoop-yarn-common-2.0.0-cdh4.2.2.jar
    hadoop-yarn-server-common-2.0.0-cdh4.2.2.jar
    hive-beeline-0.10.0-cdh4.2.2.jar
    hive-builtins-0.10.0-cdh4.2.2.jar
    hive-cli-0.10.0-cdh4.2.2.jar
    hive-common-0.10.0-cdh4.2.2.jar
    hive-contrib-0.10.0-cdh4.2.2.jar
    hive-exec-0.10.0-cdh4.2.2.jar
    hive-hbase-handler-0.10.0-cdh4.2.2.jar
    hive-hwi-0.10.0-cdh4.2.2.jar
    hive-jdbc-0.10.0-cdh4.2.2.jar
    hive-metastore-0.10.0-cdh4.2.2.jar
    hive-pdk-0.10.0-cdh4.2.2.jar
    hive-serde-0.10.0-cdh4.2.2.jar
    hive-service-0.10.0-cdh4.2.2.jar
    hive-shims-0.10.0-cdh4.2.2.jar
    hue-plugins-2.2.0-cdh4.2.2.jar
    jackson-core-asl-1.8.8.jar
    jackson-jaxrs-1.8.8.jar
    jackson-mapper-asl-1.8.8.jar
    jackson-xc-1.8.8.jar
    jasper-compiler-5.5.23.jar
    jasper-runtime-5.5.23.jar
    jaxb-api-2.2.2.jar
    jaxb-impl-2.2.3-1.jar
    jersey-core-1.8.jar
    jersey-json-1.8.jar
    jersey-server-1.8.jar
    jets3t-0.6.1.jar
    jettison-1.1.jar
    jetty-6.1.26.cloudera.2.jar
    jetty-util-6.1.26.cloudera.2.jar
    jline-0.9.94.jar
    jsch-0.1.42.jar
    jsp-api-2.1.jar
    jsr305-1.3.9.jar
    junit-4.8.2.jar
    kfs-0.3.jar
    libfb303-0.9.0.jar
    log4j-1.2.17.jar
    mockito-all-1.8.5.jar
    paranamer-2.3.jar
    pig-0.10.0-cdh4.2.2-withouthadoop.jar
    protobuf-java-2.4.0a.jar
    servlet-api-2.5.jar
    slf4j-api-1.6.1.jar
    snappy-java-1.0.4.1.jar
    stax-api-1.0.1.jar
    xmlenc-0.52.jar
  2. Copy the following MapReduce 1 JAR files into a subfolder under the common and core JAR files directory that you created in Step 1. For example, you might want to name this subfolder MR1.
    hadoop-tools-2.0.0-mr1-cdh4.2.2.jar
    hadoop-core-2.0.0-mr1-cdh4.2.2.jar
  3. Copy the following MapReduce 2 JAR files into a subfolder under the common and core JAR files directory that you created in Step 1. For example, you might want to name this subfolder MR2.
    hadoop-mapreduce-client-common-2.0.0-cdh4.2.2.jar
    hadoop-mapreduce-client-shuffle-2.0.0-cdh4.2.2.jar
    hadoop-mapreduce-client-core-2.0.0-cdh4.2.2.jar
    hadoop-mapreduce-client-app-2.0.0-cdh4.2.2.jar
  4. Set the SAS_HADOOP_JAR_PATH variable to point to the main folder that contains the common and core JAR files and the subfolder that contains either the MapReduce 1 or the MapReduce 2 JAR files.
    Note: The MapReduce 1 and MapReduce 2 JAR files cannot be on the same Java classpath.
    Note: The JAR files in the SAS_HADOOP_JAR_PATH directory must match the Hadoop server to which SAS is connected. If multiple Hadoop servers run different Hadoop versions, create and populate a separate directory of version-specific Hadoop JAR files for each version, and dynamically set SAS_HADOOP_JAR_PATH based on the target Hadoop server to which each SAS job or SAS session connects. One way to do this is to create a wrapper script for each Hadoop version that sets SAS_HADOOP_JAR_PATH to the matching JAR files and then invokes SAS (see the sketch after these steps). The same approach applies during a Hadoop server upgrade, which might leave multiple Hadoop versions active.
  5. Retrieve the following Hadoop configuration files from your cluster and create one configuration file.
    1. If you are using MapReduce 1, merge the properties from the Hadoop core, Hadoop HDFS, and MapReduce configuration files into one single configuration file.
    2. If you are using MapReduce 2, merge the properties from the Hadoop core, Hadoop HDFS, MapReduce 2, and YARN configuration files into one single configuration file.
      Note: The merged configuration file must have one beginning <configuration> tag and one ending </configuration> tag. Only properties should exist between the <configuration>…</configuration> tags. Here is an example:
      <?xml version="1.0" encoding="UTF-8"?>
      
      <!--Autogenerated by Cloudera Manager-->
      <configuration>
        <property>
          <name>hive.metastore.local</name>
          <value>false</value>
        </property>
      
      <!-- lines omitted for sake of brevity -->
      
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value></value>
        </property>
      </configuration>
      
    3. Save the configuration file to a location of your choosing.
      The Hadoop configuration file can be used when assigning a SAS/ACCESS LIBNAME, running PROC HADOOP, or publishing and running the scoring macros.
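As referenced in Step 4, here is a minimal sketch of a version-specific wrapper script. It assumes colon-separated directories in SAS_HADOOP_JAR_PATH, an illustrative JAR directory /opt/sasjars/comcore, and an illustrative SAS executable path; adjust all three to your site.
#!/bin/bash
# Wrapper for SAS sessions that connect to the CDH 4.2.2 server.
# Point SAS_HADOOP_JAR_PATH at the common and core JAR directory plus
# the MapReduce 2 subfolder (use the MR1 subfolder for MapReduce 1).
export SAS_HADOOP_JAR_PATH=/opt/sasjars/comcore:/opt/sasjars/comcore/MR2
# Start SAS with any arguments passed to this wrapper.
exec /usr/local/SASHome/SASFoundation/9.4/sas "$@"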

Co-Locating JAR Files and Merging Configuration File Properties for Cloudera 4.3.1

If you are using Cloudera CDH 4.3.1, follow these steps to co-locate the JAR files and merge configuration file properties.
  1. Copy the following common and core JAR files into a directory of your choosing on your client machine. For example, you might want to name this directory comcore.
    activation-1.1.jar
    asm-3.2.jar
    avro-1.7.4.jar
    commons-beanutils-1.7.0.jar
    commons-beanutils-core-1.8.0.jar
    commons-cli-1.2.jar
    commons-codec-1.4.jar
    commons-collections-3.2.1.jar
    commons-compress-1.4.1.jar
    commons-configuration-1.6.jar
    commons-digester-1.8.jar
    commons-el-1.0.jar
    commons-httpclient-3.1.jar
    commons-io-2.1.jar
    commons-lang-2.5.jar
    commons-logging-1.1.1.jar
    commons-math-2.1.jar
    commons-net-3.1.jar
    guava-11.0.2.jar
    hadoop-annotations-2.0.0-cdh4.3.1.jar
    hadoop-auth-2.0.0-cdh4.3.1.jar
    hadoop-common-2.0.0-cdh4.3.1.jar
    hadoop-hdfs-2.0.0-cdh4.3.1.jar
    hadoop-yarn-api-2.0.0-cdh4.3.1.jar
    hadoop-yarn-client-2.0.0-cdh4.3.1.jar
    hadoop-yarn-common-2.0.0-cdh4.3.1.jar
    hadoop-yarn-server-common-2.0.0-cdh4.3.1.jar
    hive-beeline-0.10.0-cdh4.3.1.jar
    hive-builtins-0.10.0-cdh4.3.1.jar
    hive-cli-0.10.0-cdh4.3.1.jar
    hive-common-0.10.0-cdh4.3.1.jar
    hive-contrib-0.10.0-cdh4.3.1.jar
    hive-exec-0.10.0-cdh4.3.1.jar
    hive-hbase-handler-0.10.0-cdh4.3.1.jar
    hive-hwi-0.10.0-cdh4.3.1.jar
    hive-jdbc-0.10.0-cdh4.3.1.jar
    hive-metastore-0.10.0-cdh4.3.1.jar
    hive-pdk-0.10.0-cdh4.3.1.jar
    hive-serde-0.10.0-cdh4.3.1.jar
    hive-service-0.10.0-cdh4.3.1.jar
    hive-shims-0.10.0-cdh4.3.1.jar
    hue-plugins-2.3.0-cdh4.3.1.jar
    jackson-core-asl-1.8.8.jar
    jackson-jaxrs-1.8.8.jar
    jackson-mapper-asl-1.8.8.jar
    jackson-xc-1.8.8.jar
    jasper-compiler-5.5.23.jar
    jasper-runtime-5.5.23.jar
    jaxb-api-2.2.2.jar
    jaxb-impl-2.2.3-1.jar
    jersey-core-1.8.jar
    jersey-json-1.8.jar
    jersey-server-1.8.jar
    jets3t-0.6.1.jar
    jettison-1.1.jar
    jetty-6.1.26.cloudera.2.jar
    jetty-util-6.1.26.cloudera.2.jar
    jline-0.9.94.jar
    jsch-0.1.42.jar
    jsp-api-2.1.jar
    jsr305-1.3.9.jar
    junit-4.8.2.jar
    kfs-0.3.jar
    libfb303-0.9.0.jar
    log4j-1.2.17.jar
    mockito-all-1.8.5.jar
    paranamer-2.3.jar
    pig-0.11.0-cdh4.3.1-withouthadoop.jar
    protobuf-java-2.4.0a.jar
    servlet-api-2.5.jar
    slf4j-api-1.6.1.jar
    snappy-java-1.0.4.1.jar
    stax-api-1.0.1.jar
    xmlenc-0.52.jar
    xz-1.0.jar
  2. Copy the following MapReduce 1 JAR files into a subfolder under the common and core JAR files directory that you created in Step 1. For example, you might want to name this subfolder MR1.
    hadoop-core-2.0.0-mr1-cdh4.3.1.jar
    hadoop-tools-2.0.0-mr1-cdh4.3.1.jar
  3. Copy the following MapReduce 2 JAR files into a subfolder under the common and core JAR files directory that you created in Step 1. For example, you might want to name this subfolder MR2.
    hadoop-mapreduce-client-common-2.0.0-cdh4.3.1.jar
    hadoop-mapreduce-client-core-2.0.0-cdh4.3.1.jar
    hadoop-mapreduce-client-shuffle-2.0.0-cdh4.3.1.jar
    hadoop-mapreduce-client-jobclient-2.0.0-cdh4.3.1.jar
    hadoop-mapreduce-client-app-2.0.0-cdh4.3.1.jar
  4. Set the SAS_HADOOP_JAR_PATH variable to point to the main folder that contains the common and core JAR files and the subfolder that contains either the MapReduce 1 or the MapReduce 2 JAR files.
    Note: The MapReduce 1 and MapReduce 2 JAR files cannot be on the same Java classpath.
    Note: The JAR files in the SAS_HADOOP_JAR_PATH directory must match the Hadoop server to which SAS is connected. If multiple Hadoop servers run different Hadoop versions, create and populate a separate directory of version-specific Hadoop JAR files for each version, and dynamically set SAS_HADOOP_JAR_PATH based on the target Hadoop server to which each SAS job or SAS session connects. One way to do this is to create a wrapper script for each Hadoop version that sets SAS_HADOOP_JAR_PATH to the matching JAR files and then invokes SAS, as sketched in the Cloudera 4.2.2 topic. The same approach applies during a Hadoop server upgrade, which might leave multiple Hadoop versions active.
  5. Retrieve the following Hadoop configuration files from your cluster and create one configuration file.
    1. If you are using MapReduce 1, merge the properties from the Hadoop core, Hadoop HDFS, and MapReduce configuration files into one single configuration file.
    2. If you are using MapReduce 2, merge the properties from the Hadoop core, Hadoop HDFS, MapReduce 2, and YARN configuration files into one single configuration file.
      Note: The merged configuration file must have one beginning <configuration> tag and one ending </configuration> tag. Only the properties should exist between the <configuration>…</configuration> tags. Here is an example:
      <?xml version="1.0" encoding="UTF-8"?>
      
      <!--Autogenerated by Cloudera Manager-->
      <configuration>
        <property>
          <name>hive.metastore.local</name>
          <value>false</value>
        </property>
      
       <!-- lines omitted for sake of brevity -->
      
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value></value>
        </property>
      </configuration>
      
    3. Save the configuration file to a location of your choosing.
      The Hadoop configuration file can be used when assigning a SAS/ACCESS LIBNAME, running PROC HADOOP, or publishing and running the scoring macros.

Co-Locating JAR Files and Merging Configuration File Properties for Hortonworks 1.3.2

If you are using Hortonworks Hadoop and MapReduce 1, follow these steps to co-locate the JAR files and merge the configuration file properties.
  1. Copy the following JAR files to a directory of your choosing on your client machine. For example, you might want to name this directory comcore.
    ambari-log4j-1.2.5.17.jar
    asm-3.2.jar
    aspectjrt-1.6.11.jar
    aspectjtools-1.6.11.jar
    avro-1.7.1.jar
    commons-beanutils-1.7.0.jar
    commons-beanutils-core-1.8.0.jar
    commons-cli-1.2.jar
    commons-codec-1.4.jar
    commons-collections-3.2.1.jar
    commons-configuration-1.6.jar
    commons-daemon-1.0.1.jar
    commons-digester-1.8.jar
    commons-el-1.0.jar
    commons-httpclient-3.0.1.jar
    commons-io-2.1.jar
    commons-lang-2.4.jar
    commons-logging-1.1.1.jar
    commons-logging-api-1.0.4.jar
    commons-math-2.1.jar
    commons-net-3.1.jar
    core-3.1.1.jar
    guava-11.0.2.jar
    hadoop-capacity-scheduler-1.2.0.1.3.2.0-111.jar
    hadoop-client-1.2.0.1.3.2.0-111.jar
    hadoop-core-1.2.0.1.3.2.0-111.jar
    hadoop-fairscheduler-1.2.0.1.3.2.0-111.jar
    hadoop-lzo-0.5.0.jar
    hadoop-minicluster-1.2.0.1.3.2.0-111.jar
    hadoop-tools-1.2.0.1.3.2.0-111.jar
    hive-beeline-0.11.0.1.3.2.0-111.jar
    hive-cli-0.11.0.1.3.2.0-111.jar
    hive-common-0.11.0.1.3.2.0-111.jar
    hive-contrib-0.11.0.1.3.2.0-111.jar
    hive-exec-0.11.0.1.3.2.0-111.jar
    hive-hbase-handler-0.11.0.1.3.2.0-111.jar
    hive-hwi-0.11.0.1.3.2.0-111.jar
    hive-jdbc-0.11.0.1.3.2.0-111.jar
    hive-metastore-0.11.0.1.3.2.0-111.jar
    hive-serde-0.11.0.1.3.2.0-111.jar
    hive-service-0.11.0.1.3.2.0-111.jar
    hive-shims-0.11.0.1.3.2.0-111.jar
    hsqldb-1.8.0.10.jar
    jackson-core-asl-1.8.8.jar
    jackson-mapper-asl-1.8.8.jar
    jasper-compiler-5.5.12.jar
    jasper-runtime-5.5.12.jar
    jdeb-0.8.jar
    jersey-core-1.8.jar
    jersey-json-1.8.jar
    jersey-server-1.8.jar
    jets3t-0.6.1.jar
    jetty-6.1.26.jar
    jetty-util-6.1.26.jar
    jsch-0.1.42.jar
    junit-4.5.jar
    kfs-0.2.2.jar
    libfb303-0.9.0.jar
    log4j-1.2.15.jar
    mockito-all-1.8.5.jar
    netty-3.6.2.Final.jar
    oro-2.0.8.jar
    pig-0.11.1.1.3.2.0-111-core.jar
    postgresql-9.1-901-1.jdbc4.jar
    servlet-api-2.5-20081211.jar
    slf4j-api-1.4.3.jar
    slf4j-log4j12-1.4.3.jar
    xmlenc-0.52.jar
  2. Set the SAS_HADOOP_JAR_PATH variable to point to the main folder that contains the common and core JAR files.
    Note: The JAR files in the SAS_HADOOP_JAR_PATH directory must match the Hadoop server to which SAS is connected. If multiple Hadoop servers run different Hadoop versions, create and populate a separate directory of version-specific Hadoop JAR files for each version, and dynamically set SAS_HADOOP_JAR_PATH based on the target Hadoop server to which each SAS job or SAS session connects. One way to do this is to create a wrapper script for each Hadoop version that sets SAS_HADOOP_JAR_PATH to the matching JAR files and then invokes SAS, as sketched in the Cloudera 4.2.2 topic. The same approach applies during a Hadoop server upgrade, which might leave multiple Hadoop versions active.
  3. Retrieve the Hadoop core, Hadoop HDFS, and MapReduce 1 configuration files from your cluster and create one configuration file.
    Note: The merged configuration file must have one beginning <configuration> tag and one ending </configuration> tag. Only the properties should exist between the <configuration>…</configuration> tags. Here is an example.
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>abcdef.sas.com:8021</value>
      </property>
    
     <!-- lines omitted for sake of brevity -->
    
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://abcdef.sas.com:8020</value>
      </property>
    </configuration>
    
  4. Save the configuration file to a location of your choosing.
    The Hadoop configuration file can be used when assigning a SAS/ACCESS LIBNAME, running PROC HADOOP, or publishing and running the scoring macros.

SASEP-SERVERS.SH Script

Overview of the SASEP-SERVERS.SH Script

The sasep-servers.sh script enables you to perform the following actions.
  • Install or uninstall the SAS Embedded Process and Hadoop MapReduce JAR files on a single node or a group of nodes.
  • Start or stop the SAS Embedded Process on a single node or on a group of nodes.
  • Determine the status of the SAS Embedded Process on a single node or on a group of nodes.
  • Write the installation output to a log file.
  • Pass options to the SAS Embedded Process.
Note: The sasep-servers.sh script can be run from any folder on any node in the cluster. You can also add its location to the PATH environment variable.

SASEP-SERVERS.SH Syntax

sasep-servers.sh
-add | -remove | -start | -stop | -status
-hdfsuser user-id
<-hostfile host-list-filename | -host <">host-list<">>
-mrhome path-to-mr-home
<-epscript path-to-ep-install-script>
<-mrscript path-to-mr-jar-file-script>
<-options "option-list">
<-log filename>
<-version apache-version-number>
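For example, this illustrative invocation installs the SAS Embedded Process on the hosts listed in a host file and writes the output to a log file (both paths are illustrative):
./sasep-servers.sh -add -hostfile /etc/hadoop/conf/slaves
   -mrhome /usr/lib/hadoop-0.20-mapreduce -log /tmp/sasep-add.log
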
Arguments

-add

installs the SAS Embedded Process on the hosts that you specify with either the -hostfile or -host option.

Requirement The -hostfile and -host options are mutually exclusive; you must specify one of them.

-remove

removes the SAS Embedded Process on the hosts that you specify with either the -hostfile or -host option.

Requirement The -hostfile and -host options are mutually exclusive.

-start

starts the SAS Embedded Process on the hosts that you specify with either the -hostfile or -host option.

Requirement The -hostfile and -host options are mutually exclusive.

-stop

stops the SAS Embedded Process on the hosts that you specify with either the -hostfile or -host option.

Requirement The -hostfile and -host options are mutually exclusive.

-status

provides the status of the SAS Embedded Process on the hosts that you specify with either the -hostfile or -host option.

Requirement The -hostfile and -host options are mutually exclusive.

-hdfsuser user-id

specifies the user ID that has Write access to the HDFS root directory.

Note The user ID is used to copy the SAS Embedded Process configuration files to HDFS.

-hostfile host-list-filename

specifies the full path of a file that contains the list of hosts on which the SAS Embedded Process is installed, removed, started, stopped, or checked for status.

Tip You can also assign a host list filename to a UNIX variable, sas_ephosts_file.
export sas_ephosts_file=/etc/hadoop/conf/slaves
Example
-hostfile /etc/hadoop/conf/slaves

-host <">host-list<">

specifies the target host or list of hosts on which the SAS Embedded Process is installed, removed, started, stopped, or checked for status.

Requirement If you specify more than one host, the hosts must be enclosed in double quotation marks and separated by spaces.
Tip You can also assign a list of hosts to a UNIX variable, sas_ephosts.
export sas_ephosts="server1 server2 server3"
Example
-host "server1 server2 server3"
-host bluesvr

-mrhome path-to-mr-home

specifies the path to the MapReduce home.

-epscript path-to-ep-install-script

copies and unpacks the SAS Embedded Process install script file to the host.

Restriction Use this option only with the -add option.
Requirement You must specify either the full or relative path of the SAS Embedded Process install script (the tkindbsrv-9.41_M1-n_lax.sh file).
Example
-epscript /home/hadoop/image/current/tkindbsrv-9.41_M1-1_lax.sh

-mrscript path-to-mr-jar-file-script

copies and unpacks the Hadoop MapReduce JAR files install script on the hosts.

Restriction Use this option only with the -add option.
Requirement You must specify either the full or relative path of the Hadoop MapReduce JAR file install script (the hadoopmrjars-9.41_M1-n_lax.sh file).
Example
-mrscript /home/hadoop/image/current/hadoopmrjars-9.41_M1-1_lax.sh

-options "option-list"

specifies options that are passed directly to the SAS Embedded Process. The following options can be used.

-trace trace-level

specifies what type of trace information is created.

0 no trace log
1 fatal error
2 error with information or data value
3 warning
4 note
5 information as an SQL statement
6 critical and command trace
7 detail trace, lock
8 enter and exit of procedures
9 tedious trace for data types and values
10 trace all information
Default 2
Note The trace log messages are stored in the MapReduce job log.

-port port-number

specifies the TCP port number where the SAS Embedded Process accepts connections.

Default 9261
Requirement The options in the list must be separated by spaces, and the list must be enclosed in double quotation marks.
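For example, the following passes a trace level and the default port to the SAS Embedded Process:
-options "-trace 2 -port 9261"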

-log filename

writes the installation output to the specified filename.

-version apache-version-number

specifies the Hadoop version of the JAR file that you want to install on the cluster. The apache-version-number can be one of the following values.

0.23

installs the Hadoop MapReduce JAR files that are built from Apache Hadoop 0.23 (sas.hadoop.ep.apache023.jar and sas.hadoop.ep.apache023.nls.jar).

1.2

installs the Hadoop MapReduce JAR files that are built from Apache Hadoop 1.2.1 (sas.hadoop.ep.apache121.jar and sas.hadoop.ep.apache121.nls.jar).

2.0

installs the Hadoop MapReduce JAR files that are built from Apache Hadoop 0.23 (sas.hadoop.ep.apache023.jar and sas.hadoop.ep.apache023.nls.jar).

2.1

installs the Hadoop MapReduce JAR files that are built from Apache Hadoop 2.0.5 (sas.hadoop.ep.apache205.jar and sas.hadoop.ep.apache205.nls.jar).

Default If you do not specify the -version option, the sasep-servers.sh script detects the version of Hadoop that is in use and installs the JAR files associated with that version. For more information, see Installing the SAS Embedded Process and Hadoop MapReduce JAR Files.
Interaction The -version option overrides the version that is automatically detected by the sasep-servers.sh script.
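For example, this illustrative command forces installation of the Apache Hadoop 2.0.5-based JAR files regardless of the detected version (the host file path is illustrative):
./sasep-servers.sh -add -hostfile /etc/hadoop/conf/slaves -version 2.1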

Starting the SAS Embedded Process

There are three ways to manually start the SAS Embedded Process.
  • Run the sasep-servers.sh script with the -start option on the master node.
    This starts the SAS Embedded Process on all nodes. For more information about running the sasep-servers.sh script, see SASEP-SERVERS.SH Syntax.
  • Run sasep-server-start.sh on a node.
    This starts the SAS Embedded Process on the local node only. The sasep-server-start.sh script is located in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/ directory. For more information, see Installing the SAS Embedded Process and Hadoop MapReduce JAR Files.
  • Run the UNIX service command on a node.
    This starts the SAS Embedded Process on the local node only. The service command calls the init script that is located in the /etc/init.d directory. A symbolic link to the init script is created in the /etc/rc3.d and /etc/rc5.d directories, where 3 and 5 are the run levels at which the script is executed. A usage sketch follows this list.
    Because the SAS Embedded Process init script is registered as a service, the SAS Embedded Process is started automatically when the node is rebooted.
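    As a minimal sketch, assuming the sas.ep4hadoop service name registered by the installer under /etc/init.d, the service command can be used as follows (status and stop work the same way):
    service sas.ep4hadoop start
    service sas.ep4hadoop status
    service sas.ep4hadoop stop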

Stopping the SAS Embedded Process

The SAS Embedded Process continues to run until it is manually stopped. The ability to control the SAS Embedded Process on individual nodes could be useful when performing maintenance on an individual node.
There are three ways to stop the SAS Embedded Process.
  • Run the sasep-servers.sh script with the -stop option from the master node.
    This stops the SAS Embedded Process on all nodes. For more information about running the sasep-servers.sh script, see SASEP-SERVERS.SH Syntax.
  • Run sasep-server-stop.sh on a node.
    This stops the SAS Embedded Process on the local node only. The sasep-server-stop.sh script is located in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/ directory. For more information, see Installing the SAS Embedded Process and Hadoop MapReduce JAR Files.
  • Run the UNIX service command on a node.
    This stops the SAS Embedded Process on the local node only.

Determining the Status of the SAS Embedded Process

You can display the status of the SAS Embedded Process on one node or all nodes. There are three ways to display the status of the SAS Embedded Process.
  • Run the sasep-servers.sh script with the -status option from the master node.
    This displays the status of the SAS Embedded Process on all nodes. For more information about running the sasep-servers.sh script, see SASEP-SERVERS.SH Syntax.
  • Run sasep-server-status.sh from a node.
    This displays the status of the SAS Embedded Process on the local node only. The sasep-server-status.sh script is located in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.41_M1/bin/ directory. For more information, see Installing the SAS Embedded Process and Hadoop MapReduce JAR Files.
  • Run the UNIX service command on a node.
    This displays the status of the SAS Embedded Process on the local node only.

Hadoop Permissions

The SAS Embedded Process installation must be run as root. The install script requires you to pass, as a parameter, a Hadoop HDFS user who has Write permission to the root of HDFS.
Note: These are standard UNIX ownership permissions.
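For example, a minimal sketch, assuming hdfs is the HDFS superuser on your distribution and an illustrative host file path:
sudo ./sasep-servers.sh -add -hostfile /etc/hadoop/conf/slaves -hdfsuser hdfs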

Documentation for Using In-Database Processing in Hadoop

For information about using in-database processing in Hadoop, see the following publications:
  • SAS In-Database Products: User's Guide
  • High-performance procedures in various SAS publications
  • SAS Data Integration Studio: User’s Guide
  • SAS/ACCESS Interface to Hadoop and PROC HDMD in SAS/ACCESS for Relational Databases: Reference
  • SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide
  • SAS Intelligence Platform: Data Administration Guide
  • PROC HADOOP in Base SAS Procedures Guide
  • FILENAME Statement, Hadoop Access Method in SAS Statements: Reference