Prerequisites for Hadoop

Overview

Before you can use the Hadoop transformations included with SAS Data Integration Studio, you must perform the following tasks:

Tasks

Set Up the Hadoop Server

Set up a SAS server environment that supports Hadoop.
After you install and configure an environment for the second maintenance release for SAS 9.3 or later, perform the following steps.
  1. Install the Hadoop JAR files required by SAS on your SAS machine. Note that the SAS Hadoop implementation does not include the Hadoop JAR files, which are open-source files. You must copy the required Hadoop JAR files from your Hadoop server to the SAS machine where the workspace server resides.
    Create a directory on your SAS machine that is accessible to all SAS users.
  2. Copy Hadoop JAR files with names similar to the following files into that directory:
    • hadoop-core.jar
    • pig.jar
    • hive-exec.jar
    • hive-jdbc.jar
    • hive-metastore.jar
    • hive-service.jar
    • libfb303.jar
    Note that the filenames might vary depending on the Hadoop environment and version that you are accessing. You might need assistance from your Hadoop administrator to locate the JAR files and copy them to the SAS machine; see the example copy commands after these steps. Except for libfb303.jar, these JAR files include version numbers. For example, on the Hadoop server, the pig JAR file might be pig-0.8.0, pig-0.9.1, or something similar.
  3. SAS must be able to find the copied JAR files. Therefore, create an operating environment variable named SAS_HADOOP_JAR_PATH and set it to the directory path of the JAR files.
    For example, on Windows, if the JAR files are copied to directory C:\third_party\Hadoop\jars, then the following DOS command sets the environment variable appropriately:
    set SAS_HADOOP_JAR_PATH=C:\third_party\Hadoop\jars
    The following command sets the environment variable similarly on UNIX:
    export SAS_HADOOP_JAR_PATH=/users/third_party/Hadoop/jars
    Set SAS_HADOOP_JAR_PATH permanently for all SAS users who access Hadoop, as shown in the configuration file example after these steps.
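The JAR files in step 2 can be gathered in many ways; the following UNIX commands are a minimal sketch of one of them. The host name hadoopserver and the source paths under /usr/lib are assumptions that vary by Hadoop distribution, so confirm the actual locations with your Hadoop administrator:
  # Host name and source paths are examples only; actual locations vary by distribution.
  scp hadoopserver:/usr/lib/hadoop/hadoop-core-*.jar /users/third_party/Hadoop/jars/
  scp hadoopserver:/usr/lib/pig/pig-*.jar /users/third_party/Hadoop/jars/
  scp hadoopserver:/usr/lib/hive/lib/hive-*.jar /users/third_party/Hadoop/jars/
  scp hadoopserver:/usr/lib/hive/lib/libfb303.jar /users/third_party/Hadoop/jars/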
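One way to make the setting in step 3 permanent is to add the SET system option to the SAS configuration file (for example, sasv9.cfg) so that the variable is defined for every SAS session. The line below uses the Windows directory from the earlier example:
  /* Directory that contains the copied Hadoop JAR files */
  -set SAS_HADOOP_JAR_PATH "C:\third_party\Hadoop\jars"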

Define the Hadoop Server

Perform the following steps to define the Hadoop server in SAS Management Console:
  1. Log on as an administrator.
  2. Invoke SAS Management Console and connect to the metadata server that has been set up for Hadoop.
  3. Create a Hadoop users group: right-click User Manager and define a new group that is used to create and run the Hadoop jobs. (Note that PROC HADOOP code contains the user ID and password. Therefore, a generic user ID helps other users run the jobs.) Perform the following tasks:
    • Create a new group with an appropriate name.
    • Select Accounts. Select New and set the user ID to an appropriate value. Then, provide an appropriate password. Finally, create an authentication domain for Hadoop users, with a distinctive name such as HadoopUsers.
    • Add all the account users to this group so that they can create and run the Hadoop jobs.
  4. Define a new Hadoop server by right-clicking on New Server and selecting Resource Templates, then Servers, then Cloud Servers, then Hadoop Server. Perform the following steps:
    • Provide a name for the Hadoop server and click Next.
    • Supply an associated machine name from the list of SAS Hadoop servers and click Next.
    • Specify the Hive server information and click Next.
    • Accept default values until you reach the Authentication Domain. Specify the HadoopUsers domain and click Finish.

Validate Hadoop

Make sure that PROC HADOOP runs in your SAS Data Integration Studio implementation. To perform this validation, access the Code Editor window from the Tools menu and submit the following code:
proc hadoop; run;
Check the log to ensure that the following note is displayed: Procedure Hadoop used. If the Hadoop environment was not properly prepared, the following note is displayed instead: PROC Hadoop not available.
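After this simple check succeeds, you can run a test that actually connects to the Hadoop server. The following code is a minimal sketch: the configuration file path, user ID, password, and HDFS directory are placeholders for values at your site:
  /* The configuration file path, user ID, password, and HDFS path are placeholders. */
  filename cfg 'C:\third_party\Hadoop\hadoop-config.xml';
  proc hadoop options=cfg username='hdpusr' password='XXXXXX' verbose;
     hdfs mkdir='/user/hdpusr/test_directory';
  run;
If the procedure succeeds, the new directory appears in HDFS and the log again shows that the HADOOP procedure was used.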