Obtaining Hadoop Distribution JAR and Configuration Files from Hadoop Cluster

Overview

To use SPD Server to access files on a Hadoop server, a set of Hadoop JAR and configuration files must be available on the SPD Server machine. To make them available, you must obtain these files from the Hadoop cluster, copy them to the SPD Server machine, and specify the Hadoop parameter file options.
There are two methods to obtain the JAR and configuration files:
  • If you license SAS/ACCESS Interface to Hadoop, use the SAS Deployment Manager.
  • Use the Hadoop tracer script, hadooptracer.py, a Python script provided by SAS.
Note: Gathering the JAR and configuration files is a one-time process (unless you are updating your cluster or changing Hadoop vendors). If you have already gathered the Hadoop JAR and configuration files for another SAS component using the SAS Deployment Manager or the Hadoop tracer script, you do not need to do it again.

Using SAS Deployment Manager to Obtain the Hadoop JAR and Configuration Files

If you license SAS/ACCESS Interface to Hadoop, you should use SAS Deployment Manager to obtain and make required Hadoop JAR and configuration files available to the SPD Server machine. For more information about using SAS Deployment Manager for SAS/ACCESS Interface to Hadoop, see SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
If you do not license SAS/ACCESS Interface to Hadoop, follow the instructions in Using the Hadoop Tracer Script to Obtain the Hadoop JAR and Configuration Files.

Using the Hadoop Tracer Script to Obtain the Hadoop JAR and Configuration Files

Prerequisites for Using the Hadoop Tracer Script

These are prerequisites to ensure that the Hadoop tracer script runs successfully:
  • Ensure that Python and strace are installed. Contact your system administrator if these packages are not installed on the system.
  • Ensure that the user running the script has authorization to issue HDFS and Hive commands.
  • If Hadoop is secured with Kerberos, obtain a Kerberos ticket for the user before running the script.
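The prerequisites above can be verified with a short shell script before running the tracer. This is an illustrative sketch, not part of the SAS-provided tooling; the command names checked (python, strace, hdfs, hive, klist) are the usual ones but may differ on your cluster.

```shell
#!/bin/sh
# Illustrative pre-flight check (not part of the SAS-provided tooling).
# Reports whether each command needed by hadooptracer.py is on the PATH.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}
check python    # required to run hadooptracer.py
check strace    # required by the tracer
check hdfs      # the user must be able to issue HDFS commands
check hive      # the user must be able to issue Hive commands
check klist     # Kerberos client; relevant only on Kerberized clusters
```

On a Kerberized cluster, also confirm that a valid ticket exists (for example, with klist) or obtain one with kinit before running the script.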

Obtaining and Running the Hadoop Tracer Script

To obtain and run the Hadoop tracer script, follow these steps:
  1. On the Hadoop server, create a temporary directory to hold a ZIP file that you will download. An example is /opt/sas/hadoopfiles/temp.
  2. Download the hadooptracer.zip file from the following FTP site to the directory that you created in Step 1: ftp://ftp.sas.com/techsup/download/blind/access/hadooptracer.zip.
  3. Using a method of your choice (for example, PSFTP, SFTP, SCP, or FTP), transfer the ZIP file to the Hive node on your Hadoop cluster.
  4. Unzip the file.
    The hadooptracer.py file is included in this ZIP file.
  5. Change the permissions on the file to make it executable:
    chmod 755 hadooptracer.py
  6. Run the tracer script.
    python ./hadooptracer.py --filterby latest
    Note: The --filterby latest option ensures that if duplicate JAR or configuration files exist, the latest version is selected.
    This script performs the following tasks:
    • Pulls the necessary Hadoop JAR and configuration files and places them in the /tmp/jars directory and the /tmp/sitexmls directory, respectively.
    • Creates the hadooptrace.json file in the /tmp directory. If you need a custom filename or path for the JSON output file, use this command instead:
      python ./hadooptracer.py -f /your-path/your-filename.json
    • Creates a log file, /tmp/hadooptracer.log.
      If a problem occurs and you need to contact SAS Technical Support, please include this log file with your tech support track.
      Note: Some error messages in the console output for hadooptracer.py are normal and do not necessarily indicate a problem with the JAR file and configuration file collection process. However, if the files are not collected as expected or if you experience problems connecting to Hadoop with the collected files, contact SAS Technical Support for assistance and include the /tmp/hadooptracer.log file.
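    After the tracer finishes, a quick count of what landed in the output directories can confirm that collection worked. A minimal sketch, assuming the default /tmp/jars and /tmp/sitexmls output paths described above:

```shell
#!/bin/sh
# Count the collected files; both numbers should be nonzero on success.
jars=$(ls /tmp/jars/*.jar 2>/dev/null | wc -l)
xmls=$(ls /tmp/sitexmls/*.xml 2>/dev/null | wc -l)
echo "Collected $jars JAR file(s) and $xmls configuration file(s)"
# If either count is 0, review /tmp/hadooptracer.log before proceeding.
```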
  7. On the SPD Server machine, create two directories to hold the JAR and configuration files. An example would be /opt/sas/hadoopfiles/jars and /opt/sas/hadoopfiles/configs.
  8. Using a method of your choice (for example, PSFTP, SFTP, SCP, or FTP), copy the files in the /tmp/jars and /tmp/sitexmls directories on the Hadoop server to the directories that you created in Step 7 on the SPD Server machine, such as /opt/sas/hadoopfiles/jars and /opt/sas/hadoopfiles/configs, respectively.
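    The copy in Step 8 can also be scripted. Below is a minimal sketch using scp; the user name and host are placeholders (assumptions, not from this document), and the destination directories are the Step 7 examples.

```shell
#!/bin/sh
# Placeholder connection details -- substitute your own SPD Server
# login and host name (these values are assumptions, not from the doc).
SPD_USER="spduser"
SPD_HOST="spdserver.example.com"

# Run from the Hadoop server; remove the leading '#' to execute.
# scp /tmp/jars/*.jar     "$SPD_USER@$SPD_HOST:/opt/sas/hadoopfiles/jars/"
# scp /tmp/sitexmls/*.xml "$SPD_USER@$SPD_HOST:/opt/sas/hadoopfiles/configs/"
echo "Copy target: $SPD_USER@$SPD_HOST"
```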
  9. If you are using Hortonworks or MapR, additional configuration is needed. For more information, see the following topics: