In-Database Deployment Package for Hadoop

Prerequisites

SAS Foundation and the SAS/ACCESS Interface to Hadoop must be installed before you install and configure the in-database deployment package for Hadoop.
The SAS Embedded Process requires Cloudera CDH4. Three settings in Cloudera CDH4 must be changed to optimize the performance of the SAS Embedded Process. Enter these lines in the Hadoop configuration file to change the settings. The values that you enter for the last two lines depend on your Hadoop system.
mapred.job.reuse.jvm.num.tasks = -1
mapred.tasktracker.map.tasks.maximum = value
mapred.map.tasks = value
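On CDH4 these MapReduce settings typically live in mapred-site.xml. The fragment below is an illustrative sketch only; the values 8 and 100 are placeholders, and you must substitute values appropriate for your cluster.

```xml
<!-- mapred-site.xml: illustrative fragment. The values for the last two
     properties are placeholders; choose values for your Hadoop system. -->
<configuration>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>   <!-- placeholder: tune for your nodes -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>100</value> <!-- placeholder: tune for your jobs -->
  </property>
</configuration>
```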

Overview of the In-Database Deployment Package for Hadoop

This section describes how to install and configure the in-database deployment package for Hadoop (SAS Embedded Process 9.35).
The in-database deployment package for Hadoop must be installed and configured before you can read and write data to a Hadoop Distributed File System (HDFS) in parallel for High-Performance Analytics (HPA).
The in-database deployment package for Hadoop includes two parts: the SAS Embedded Process and the Hadoop JAR installer. The SAS Embedded Process is a SAS server process that runs within Hadoop to read and write data. It consists of macros, run-time libraries, and other software that is installed on your Hadoop system. Both parts must be installed on all nodes of the Hadoop cluster.

Hadoop Installation and Configuration Steps

  1. If you are upgrading from or reinstalling a previous release, follow the instructions in Upgrading from or Reinstalling a Previous Version before installing the in-database deployment package.
  2. Move the SAS Embedded Process and Hadoop JAR installers to the Hadoop master node. For more information, see Moving the SAS Embedded Process and Hadoop JAR Installers.
  3. Install the SAS Embedded Process and the Hadoop JAR files. For more information, see Installing the SAS Embedded Process and Hadoop JAR Files.

Upgrading from or Reinstalling a Previous Version

To upgrade from or reinstall a previous version, follow these steps.
  1. Stop the Hadoop SAS Embedded Process.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-stop-all.sh
    SASEPHome is the location where you installed the SAS Embedded Process. For more information, see Installing the SAS Embedded Process.
    For more information about stopping the SAS Embedded Process, see Controlling the Hadoop SAS Embedded Process.
  2. Delete the Hadoop SAS Embedded Process from all nodes.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-delete-all.sh
    For more information about deleting the SAS Embedded Process, see Controlling the Hadoop SAS Embedded Process.
  3. Reinstall the SAS Embedded Process, the Hadoop JAR files, or both by rerunning the install scripts, sasep-copy-all.sh and hdepjarinstall.sh, respectively. For more information, see Installing the SAS Embedded Process and Hadoop JAR Files.
Note: It is recommended that you make a backup copy of the Hadoop JAR files and the configuration file before upgrading or reinstalling.
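The upgrade sequence above can be sketched as a short script. This is a non-authoritative sketch: /opt/sasep stands in for your SASEPHome, and the existence guards make each step a no-op where the scripts are absent.

```shell
#!/bin/sh
# Sketch of the upgrade/reinstall sequence. SASEPHOME is an assumed example
# path; substitute the location where you installed the SAS Embedded Process.
SASEPHOME=${SASEPHOME:-/opt/sasep}
BIN="$SASEPHOME/SAS/SASTKInDatabaseServerForHadoop/9.35/bin"

# Recommended: back up the Hadoop JAR files and the configuration file first.

# 1. Stop the SAS Embedded Process on all nodes.
if [ -x "$BIN/sasep-stop-all.sh" ]; then "$BIN/sasep-stop-all.sh"; fi

# 2. Delete the SAS Embedded Process from all nodes.
if [ -x "$BIN/sasep-delete-all.sh" ]; then "$BIN/sasep-delete-all.sh"; fi

# 3. Reinstall by rerunning the install scripts (see Installing the
#    SAS Embedded Process and Hadoop JAR Files).
if [ -x "$BIN/sasep-copy-all.sh" ]; then "$BIN/sasep-copy-all.sh"; fi
```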

Moving the SAS Embedded Process and Hadoop JAR Installers

Moving the SAS Embedded Process Installer

The SAS Embedded Process installer is contained in a self-extracting archive file named tkindbsrv-9.35-n_lax.sh. n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1. The self-extracting archive file is located in the SAS-install-directory/SASTKInDatabaseServer/9.35/HadooponLinuxx64/ directory.
Using a method of your choice, transfer the SAS Embedded Process Installer to your Hadoop master node. This example uses Secure Copy, and SASEPHome is the location where you want to install the SAS Embedded Process.
scp tkindbsrv-9.35-1_lax.sh username@hadoop:/SASEPHome

Moving the Hadoop JAR Installer

The Hadoop JAR installer is contained in a self-extracting archive file named hadoopmrjars-9.3-n_lax.sh. n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1. The self-extracting archive file is located in the SAS-install-directory/SASACCESStoHadoopMapReduceJARFiles/ directory.
Using a method of your choice, transfer the Hadoop JAR installer to your Hadoop master node. This example uses Secure Copy, and SASEPHome is the location where you want to install the SAS Embedded Process.
scp hadoopmrjars-9.3-1_lax.sh username@hadoop:/SASEPHome

Installing the SAS Embedded Process and Hadoop JAR Files

Installing the SAS Embedded Process

To install the SAS Embedded Process, follow these steps.
Note: Permissions are needed to install the SAS Embedded Process. For more information, see Hadoop Permissions.
  1. Move to the master node where you want the SAS Embedded Process installed.
    cd /SASEPHome
    SASEPHome is the same location to which you copied the self-extracting archive file. For more information, see Moving the SAS Embedded Process Installer.
    Note: It is recommended that the person who installed the Hadoop system also install the SAS Embedded Process and Hadoop JAR files.
  2. Use the following command to unpack the tkindbsrv-9.35-n_lax.sh file.
    SASEPHome/tkindbsrv-9.35-n_lax.sh
    
    n is a number that indicates the latest version of the file. If this is the initial installation, n has a value of 1. Each time you reinstall or upgrade, n is incremented by 1.
    Note: If you unpack the file in the wrong directory, you can move the unpacked files afterward.
    After this script is run and the files are unpacked, the following directory structure is created, where SASEPHome is the directory on the master node from Step 1.
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/misc
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/sasexe
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/utilities
    
    In addition, the content of the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin directory should look similar to this.
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/InstallTKIndbSrv.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-copy-all.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-delete-all.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-start-all.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-start.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-status-all.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-status.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-stop-all.sh
    /SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-stop.sh
    
    Most of these files enable you to control the SAS Embedded Process. For more information, see Controlling the Hadoop SAS Embedded Process.
  3. (Optional) Create the same SASEPHome directory on all slave nodes where the SAS Embedded Process will be installed. This step is not required, but it is considered a best practice.
  4. Use the following script to deploy the SAS Embedded Process installation across all the slave nodes where SASEPHome is the root directory from Step 1.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-copy-all.sh
    Note: The sasep-copy-all.sh script cannot be moved and must be executed from its location in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/ directory. The location that the script is executed from determines the installation directory.
    The script asks for a host list of the slave nodes. On Cloudera, the slave host list is in the following file, where HadoopHome is the Hadoop installation directory.
    HadoopHome/hadoop/etc/hadoop/slaves
    Note: Although you can install the SAS Embedded Process in multiple locations, the best practice is to have only one instance installed.
  5. Use the following command to start the SAS Embedded Process on all nodes where SASEPHome is the root directory from Step 1.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-start-all.sh
    Note: The sasep-start-all.sh script cannot be moved. It must be executed from its location in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/ directory. The location that the script is executed from determines the installation directory.
    The script asks for a host list of the slave nodes. On Cloudera, the slave host list is in the following file.
    HadoopHome/hadoop/etc/hadoop/slaves
  6. (Optional) Execute the following script to check the status of the SAS Embedded Process on all nodes.
    SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/sasep-status-all.sh
    Note: The sasep-status-all.sh script cannot be moved and must be executed from its location in the SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/ directory. The location that the script is executed from is used to determine the installation directory.
    The script asks for a host list of the slave nodes. On Cloudera, the slave host list is in the following file.
    HadoopHome/hadoop/etc/hadoop/slaves
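For orientation, the sketch below shows how an *-all.sh script might consume a slave host list of this form (one hostname per line). This is illustrative only: node1 and node2 are made-up hostnames, and the sketch writes its own temporary host list rather than touching your cluster files.

```shell
#!/bin/sh
# Illustrative only: consuming a slave host list, one hostname per line,
# in the same format as HadoopHome/hadoop/etc/hadoop/slaves.
SLAVES_FILE=$(mktemp)
printf 'node1\nnode2\n' > "$SLAVES_FILE"   # made-up sample host list

checked=""
while IFS= read -r host; do
  if [ -n "$host" ]; then
    # A real script would invoke the per-node script remotely, e.g. via ssh.
    echo "would check SAS Embedded Process status on $host"
    checked="$checked$host "
  fi
done < "$SLAVES_FILE"
rm -f "$SLAVES_FILE"
```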

Installing the Hadoop JAR Files

To install the Hadoop JAR files, follow these steps.
Note: Permissions are needed to install the Hadoop JAR files. For more information, see Hadoop Permissions.
  1. Move to the master node where you want the Hadoop JAR files installed.
    cd /SASEPHome
    SASEPHome is the same location to which you copied the self-extracting archive file. For more information, see Moving the Hadoop JAR Installer.
    Note: It is recommended that the person who installed the Hadoop system also install the SAS Embedded Process and Hadoop JAR files.
  2. Use this command to unpack the JAR installer.
    ./hadoopmrjars-9.3-n_lax.sh
    An installer script, hdepjarinstall.sh, is created in the /SASEPHome/SAS/SASACCESStoHadoopMapReduceJARFiles/9.3/bin/ directory.
  3. Use this command to execute the hdepjarinstall.sh script.
    ./hdepjarinstall.sh
    The script prompts you for the location of the master and slave host lists. On Cloudera, these lists are in the following files, where HadoopHome is the Hadoop installation directory.
    HadoopHome/hadoop/etc/hadoop/master
    HadoopHome/hadoop/etc/hadoop/slaves
    The JAR files are installed to the following directories.
    /HadoopHome/hadoop/share/hadoop/common/lib/sas.hadoop.ep.cloudera.nls.jar
    /HadoopHome/hadoop/share/hadoop/common/lib/sas.hadoop.ep.cloudera.jar
    
    In addition, an ep-config.xml configuration file is written to the HDFS file system. You can run this command to locate the configuration file.
    hadoop fs -ls /sas/ep/config
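A post-install check can confirm both artifacts. The sketch below is a hedged example: /opt/hadoop stands in for HadoopHome, and the guards make it safe to run on a machine without the hadoop CLI.

```shell
#!/bin/sh
# Post-install check: list the installed SAS JARs and the HDFS config file.
# HADOOP_DIR is an assumed example path for HadoopHome.
HADOOP_DIR=${HADOOP_DIR:-/opt/hadoop}
if [ -d "$HADOOP_DIR/share/hadoop/common/lib" ]; then
  ls "$HADOOP_DIR"/share/hadoop/common/lib/sas.hadoop.ep.cloudera*.jar \
    2>/dev/null || echo "SAS JAR files not found under $HADOOP_DIR"
fi
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls /sas/ep/config || echo "ep-config.xml not found in HDFS"
else
  echo "hadoop CLI not found; run this check on a cluster node"
fi
```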

Controlling the Hadoop SAS Embedded Process

The SAS Embedded Process starts when you run a script. It continues to run until it is manually stopped. You can start and stop the SAS Embedded Process on all nodes or one node at a time. The ability to control the SAS Embedded Process on individual nodes could be useful when performing maintenance on an individual node.
When the SAS Embedded Process is installed, the scripts that control the SAS Embedded Process are installed in the following directory.
SASEPHome/SAS/SASTKInDatabaseServerForHadoop/9.35/bin/
SASEPHome is the location where you installed the SAS Embedded Process. For more information, see Installing the SAS Embedded Process.
The following scripts enable you to control the SAS Embedded Process.
Note: These scripts must be run from the directory in which they are installed. You cannot use a relative path.
Script                   Action performed
sasep-start-all.sh       Starts the SAS Embedded Process on all slave nodes.
sasep-start.sh           Starts the SAS Embedded Process only on the node where the script is run.
sasep-stop-all.sh        Shuts down the SAS Embedded Process on all slave nodes.
sasep-stop.sh            Shuts down the SAS Embedded Process only on the node where the script is run.
sasep-status-all.sh      Provides the status of the SAS Embedded Process on all slave nodes.
sasep-status.sh          Provides the status of the SAS Embedded Process only for the node where the script is run.
sasep-delete-all.sh      Removes the SAS Embedded Process from all nodes. This removes all of the files that were installed in the SASEPHome/SAS/ directory.
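The per-node scripts are what make single-node maintenance possible. The sketch below shows one plausible maintenance sequence on an individual node; /opt/sasep is an assumed SASEPHome, and the existence guard keeps the sketch a no-op on machines without the installation.

```shell
#!/bin/sh
# Sketch: taking one node out for maintenance with the per-node scripts.
# /opt/sasep is an assumed example SASEPHome.
BIN=${SASEPHOME:-/opt/sasep}/SAS/SASTKInDatabaseServerForHadoop/9.35/bin

if [ -x "$BIN/sasep-stop.sh" ]; then
  "$BIN/sasep-stop.sh"      # stop the SAS Embedded Process on this node only
  # ... perform node maintenance here ...
  "$BIN/sasep-start.sh"     # bring this node back
  "$BIN/sasep-status.sh"    # confirm it is running again
fi
```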

Hadoop Permissions

To install the SAS Embedded Process and Hadoop JAR files, you need WRITE access to the Hadoop and MapReduce install locations. You also need WRITE access to the root folder of HDFS.
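A quick pre-install check of the local WRITE requirement can be sketched as follows. The path is illustrative; substitute your actual Hadoop and MapReduce install locations. HDFS root access can only be verified from a cluster node.

```shell
#!/bin/sh
# Pre-install sanity check for WRITE access. HADOOP_DIR is an assumed
# example path; substitute your Hadoop and MapReduce install locations.
HADOOP_DIR=${HADOOP_DIR:-/opt/hadoop}
if [ -w "$HADOOP_DIR" ]; then
  hadoop_writable=yes
else
  hadoop_writable=no
fi
echo "write access to $HADOOP_DIR: $hadoop_writable"
# WRITE access to the HDFS root must be checked on a cluster node, e.g.:
#   hadoop fs -ls /
```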

Documentation for Hadoop

For information about using Hadoop, see the following publications:
  • SAS/ACCESS Interface to Hadoop in SAS/ACCESS for Relational Databases: Reference
  • SAS High-Performance Analytics Infrastructure: Installation and Configuration Guide
  • SAS Intelligence Platform: Data Administration Guide
  • PROC HADOOP in Base SAS Procedures Guide
  • FILENAME Statement, Hadoop Access Method in SAS Statements Reference
  • SAS Data Integration Studio: User’s Guide
  • High-performance procedures in various SAS publications