Preparing for a Remote Parallel Connection

Overview of Preparing for a Remote Parallel Connection

Before you can configure the SAS High-Performance Analytics environment to use the SAS Embedded Process for a parallel connection with your data store, you must locate specific JAR files and gather certain information about your data provider.
From the following list, choose the topic for your remote data provider:
  • Prepare for Hadoop
  • Prepare for a Greenplum Data Computing Appliance
  • Prepare for a HANA Cluster
  • Prepare for an Oracle Exadata Appliance
  • Prepare for a Teradata Managed Server Cabinet

Prepare for Hadoop

Overview of Preparing for Hadoop

Note: In the third maintenance release of SAS 9.4, SAS Embedded Process supports the Cloudera, Hortonworks, IBM BigInsights, MapR, and Pivotal HD distributions of Hadoop. For more specific version information, see SAS 9.4 Support for Hadoop.
Preparing for a remote parallel connection to Hadoop consists of the following steps:
  1. downloading and running a tracer script, hadooptracer.py, supplied by SAS.
    hadooptracer.py traces the system calls of various Hadoop client tools and uses this data to identify the client JAR files that the SAS High-Performance Analytics environment requires for connectivity between the Hadoop client machine and the Hadoop server environment.
    The script also copies these Hadoop client JAR files to a location that you specify on the Hadoop Hive node. You then manually copy this directory to the machine where you will deploy the SAS High-Performance Analytics environment root node. See Run the Hadoop Tracer Script.
  2. recording information about your Hadoop deployment that you will need when configuring the analytics environment for a remote data store.
  3. determining the version of the JRE used by Hadoop and installing that same JRE on the analytics cluster. (A quick way to check the version is sketched after this list.)
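For example, to check the version of the JRE that Hadoop uses, you might run a command such as the following on a Hadoop node (this assumes that JAVA_HOME points to the JRE used by Hadoop; ask your Hadoop administrator if you are not sure):
  $JAVA_HOME/bin/java -version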

Prerequisites for the Hadoop Tracer Script

To run the Hadoop tracer script, you must meet these prerequisites (a quick way to verify them from a shell is sketched after this list):
  • Python 2.6 (or later) and strace must be installed.
  • Your Hadoop Hive node machine should have a temporary directory named tmp under the root directory (/tmp).
    By default, the script uses a single directory (/tmp) as both its temporary directory and its output directory. However, you can change these locations with script options.
  • The user running the script must have the following:
    • a Linux account with SSH access (password or private key) to the Hive node or NameNode.
    • authorization to issue HDFS and Hive commands.
      Before the script is executed, a simple validation test is run to see whether HDFS (hadoop) and Hive (hive) commands can be issued. If the validation test fails, the script does not execute.
    • a ticket (Kerberos only).
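A quick way to verify these prerequisites from a shell on the Hive node might look like the following (the commands are standard; adjust them for your environment):
  python --version              # should report 2.6 or later
  which strace                  # strace must be on the PATH
  hadoop fs -ls /               # verifies that HDFS commands can be issued
  hive -e 'show databases;'     # verifies that Hive commands can be issued
  klist                         # Kerberos only: verifies that a valid ticket exists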

Run the Hadoop Tracer Script

To download and run the Hadoop tracer script, follow these steps:
  1. Make sure that your system meets the prerequisites.
  2. On a machine from which you can also access your Hadoop Hive node, create a temporary directory to download the script.
    Here is an example:
    mkdir hadoopfiles_temp
  3. Download the hadooptracer.zip file from the following FTP site to the directory that you created: ftp://ftp.sas.com/techsup/download/blind/access/hadooptracer.zip.
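    Here is an example, using curl (this assumes that curl is installed and that the machine has FTP access):
    cd hadoopfiles_temp
    curl -O ftp://ftp.sas.com/techsup/download/blind/access/hadooptracer.zip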
  4. If there is not a /tmp directory on your Hadoop Hive node, create it.
    By default, the script uses a single directory (/tmp) as both its temporary directory and its output directory. However, you can change these locations with script options. See Step 8.
  5. Using a transfer method, such as PSFTP, SFTP, SCP, or FTP, transfer the ZIP file to the Hive node on your Hadoop cluster.
    Here is an example:
    scp /opt/sas/hadoopfiles_temp/hadooptracer.zip root@hive_node.example.com:/tmp
  6. On the Hadoop Hive node, change to the /tmp directory and unzip hadooptracer.zip.
    Here is an example:
    cd /tmp
    unzip hadooptracer.zip
  7. Enter the following command to grant Execute permissions on the script file:
    chmod 755 ./hadooptracer.py
  8. Run the script with these options:
    python hadooptracer.py --filterby=latest --postprocess
    Note: hadooptracer.py ignores --postprocess on Cloudera Hadoop clusters.
    Tip: For more information about script options, run the script with the -h option.
    The script does the following:
    1. traces the system calls of various Hadoop client tools and uses this data to identify the appropriate client JAR files.
    2. copies the client JAR files to /tmp/jars.
    3. writes its log to /tmp/hadooptracer.log.
    Note: Some error messages in the console output for hadooptracer.py are normal and do not necessarily indicate a problem with the JAR and configuration file collection process. However, if the files are not collected as expected or if you experience problems connecting to Hadoop with the collected files, contact SAS Technical Support and include the hadooptracer.log file.
  9. When the script finishes executing, delete the following files from /tmp/jars:
    • derby*.jar
    • spark-examples*.jar
    • ranger-plugins-audit*.jar
    • avatica*.jar
    • hadoop-0.20.2-dev-core*.jar
  10. Copy the JAR files that hadooptracer.py writes to /tmp/jars to a directory on the SAS High-Performance Analytics environment root node machine, as in the example that follows. The analytics environment configuration script prompts you for this directory later, so be sure to note it in Record the Location of the Hadoop Client JAR Files.
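    Here is an example of steps 9 and 10 (the root node host name and target directory are placeholders; substitute your own values):
    rm -f /tmp/jars/derby*.jar /tmp/jars/spark-examples*.jar /tmp/jars/ranger-plugins-audit*.jar /tmp/jars/avatica*.jar /tmp/jars/hadoop-0.20.2-dev-core*.jar
    ssh root@root_node.example.com mkdir -p /opt/hadoop_jars
    scp /tmp/jars/*.jar root@root_node.example.com:/opt/hadoop_jars/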

Hadoop Checklists

Before you can configure the SAS High-Performance Analytics environment to use the SAS Embedded Process for a parallel connection with your Hadoop data store, there are certain requirements that must be met.
Note: In the third maintenance release of SAS 9.4, the SAS Embedded Process supports the IBM BigInsights, Cloudera, Hortonworks, MapR, and Pivotal HD distributions of Hadoop. For more detailed information, see the SAS Foundation system requirements documentation for your operating environment.
  1. Record the path to the Hadoop client JAR files required by the analytics environment in the table that follows:
    Record the Location of the Hadoop Client JAR Files
      Example: /opt/hadoop_jars (Hadoop client JAR files)
      Actual path of the required Hadoop JAR files on your analytics environment root node: _______________
    Note: The location of the common and core JAR files listed in Record the Location of the Hadoop Client JAR Files should be the same location that you copied the client JAR files to in Step 10.
  2. Record the location (JAVA_HOME) of the 64-bit Java Runtime Environment (JRE) used by the analytics environment in the table that follows:
    Record the Location of the JRE Used by the SAS High-Performance Analytics Environment
      Example: /opt/java/jre1.7.0_07
      Actual path of the JRE on your analytics environment root node: _______________
    Note: This is the location of the JRE that the SAS High-Performance Analytics environment uses, not the location of the JRE that Hadoop uses.

Install the JRE on the Analytics Cluster

The SAS High-Performance Analytics environment requires a 64-bit Java Runtime Environment (JRE) when the environment is configured for a remote parallel connection.
Note: The JRE used by the analytics environment must match the version of the JRE used by your Hadoop cluster.
The analytics environment configuration script prompts you for the location of this JRE (JAVA_HOME), so be sure to note it in Record the Location of the JRE Used by the SAS High-Performance Analytics Environment.
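For example, you might confirm that the two versions match by comparing the output of java -version on a Hadoop node and on the analytics environment root node (the second path is the example from the table above; substitute your own):
  $JAVA_HOME/bin/java -version
  /opt/java/jre1.7.0_07/bin/java -version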

Prepare for a Greenplum Data Computing Appliance

Before you can configure the SAS High-Performance Analytics environment to use the SAS Embedded Process for a parallel connection with your Greenplum Data Computing Appliance, there are certain requirements that must be met.
  1. Install the Greenplum client on the Greenplum Master Server (blade 0) in your analytics cluster.
    For more information, refer to your Greenplum documentation.
  2. Record the path to the Greenplum client in the table that follows:
    Record the Location of the Greenplum Client
      Example: /usr/local/greenplum-db
      Actual path of the Greenplum client on your system: _______________
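As a quick sanity check of the recorded path, you might verify that the client directory contains the Greenplum environment script (greenplum_path.sh is the usual name in Greenplum client installations; adjust if yours differs):
  ls /usr/local/greenplum-db/greenplum_path.sh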

Prepare for a HANA Cluster

Before you can configure the SAS High-Performance Analytics environment to use the SAS Embedded Process for a parallel connection with your HANA cluster, there are certain requirements that must be met.
  1. Install the HANA client on blade 0 in your analytics cluster.
    For more information, refer to your HANA documentation.
  2. Record the path to the HANA client in the table that follows:
    Record the Location of the HANA Client
      Example: /usr/local/lib/hdbclient
      Actual path of the HANA client on your system: _______________
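A similar check applies to the HANA client path (the path below is the example above; the exact contents of the client directory vary by HANA client version):
  ls /usr/local/lib/hdbclient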

Prepare for an Oracle Exadata Appliance

Before you can configure the SAS High-Performance Analytics environment to use the SAS Embedded Process for a parallel connection with your Oracle Exadata appliance, there are certain requirements that must be met.
  1. Install the Oracle client on blade 0 in your analytics cluster.
    For more information, refer to your Oracle documentation.
  2. Record the path to the Oracle client in the table that follows. (This should be the absolute path to the directory that contains libclntsh.so.)
    Record the Location of the Oracle Client
      Example: /usr/local/ora11gr2/product/11.2.0/client_1/lib
      Actual path of the Oracle client on your system: _______________
  3. Record the value of the Oracle TNS_ADMIN environment variable in the table that follows. (Typically, this is the directory that contains the tnsnames.ora file.)
    Record the Value of the Oracle TNS_ADMIN Environment Variable
      Example: /my_server/oracle
      Oracle TNS_ADMIN environment variable value on your system: _______________
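As a quick check of both values, you might verify that the client library and the tnsnames.ora file are where you expect (the paths are the examples above; substitute your own):
  ls /usr/local/ora11gr2/product/11.2.0/client_1/lib/libclntsh.so*
  ls /my_server/oracle/tnsnames.ora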

Prepare for a Teradata Managed Server Cabinet

Before you can configure the SAS High-Performance Analytics environment to use the SAS Embedded Process for a parallel connection with your Teradata Managed Server Cabinet, there are certain requirements that must be met.
  1. Install the Teradata client on blade 0 in your analytics cluster.
    For more information, refer to your Teradata documentation.
  2. Record the path to the Teradata client in the table that follows. (This should be the absolute path to the directory that contains the odbc_64 subdirectory.)
    Record the Location of the Teradata Client
      Example: /opt/teradata/client/13.10
      Actual location of the Teradata client on your system: _______________
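As a quick check of the recorded path, you might verify that the odbc_64 subdirectory is present (the path is the example above; substitute your own):
  ls /opt/teradata/client/13.10/odbc_64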
Last updated: June 19, 2017