Configuring for Access to a Data Store with a SAS Embedded Process

Overview of Configuring for Access to a Data Store with a SAS Embedded Process

Configuring the SAS High-Performance Analytics environment with a SAS Embedded Process consists of the following steps:
  1. Prepare for the data provider that the analytics environment will query.
    For more information, see Preparing for a Remote Parallel Connection.
    Note: Other third-party data providers besides Hadoop are supported. For more information, see the SAS In-Database Products: Administrator’s Guide.
  2. Review the considerations for configuring the analytics environment for use with a remote data store.
    For more information, see How the Configuration Script Works.
  3. Configure the analytics environment for a remote data store.

How the Configuration Script Works

You configure the SAS High-Performance Analytics environment with a SAS Embedded Process using a shell script. The script enables you to configure the environment for the various third-party data stores supported by the SAS Embedded Process.
The analytics environment is designed on the principle of install once, configure many. For example, suppose that your site has three remote data stores from three different third-party vendors whose data you want to analyze. You run the analytics environment configuration script one time and provide the information for each data store vendor as you are prompted for it. (When prompted for a data store vendor that you do not have, specify no to skip that set of prompts.)
When you have different versions of the same vendor’s data store, specifying the vendor’s latest client data libraries usually works. However, this choice can be problematic for different versions of Hadoop, where a later set of JAR files is not typically backward compatible with earlier versions, or for sites that use Hadoop implementations from more than one vendor. (The configuration script does not distinguish between different Hadoop vendors.) In these situations, you must run the analytics environment configuration script once for each different Hadoop version or vendor. Because the configuration script creates a TKGrid_REP directory underneath the current directory, run the script each additional time from a different directory so that each configuration gets its own TKGrid_REP directory.
To illustrate how you might manage configuring the analytics environment for two different Hadoop vendors, consider this example: suppose your site uses Cloudera Hadoop 4 and Hortonworks Data Platform 2. Before running the analytics environment script to configure for Cloudera 4, you would create a directory similar to:
cdh4
When configuring the analytics environment for Cloudera, you would run the script from the cdh4 directory. When complete, the script creates a TKGrid_REP child directory:
cdh4/TKGrid_REP
For Hortonworks, you would create a directory similar to:
hdp2
When configuring the analytics environment for Hortonworks, you would run the script from the hdp2 directory. When complete, the script creates a TKGrid_REP child directory:
hdp2/TKGrid_REP
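
The per-vendor layout described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: configure_per_vendor is a hypothetical function, not part of the SAS software, and the path to the configuration script is a placeholder that you supply (pass it as an absolute path, because the helper changes directories before running it). The script remains interactive, so you are still prompted once per directory.

```shell
# Hypothetical helper (not part of the SAS software): run the
# configuration script once per Hadoop version or vendor, each time
# from its own working directory.
#   $1      absolute path to the TKGrid_REP configuration script
#   $2...   one directory name per Hadoop version or vendor
configure_per_vendor() {
    script=$1
    shift
    for dir in "$@"; do
        mkdir -p "$dir" || return 1
        # The script creates TKGrid_REP under the current directory,
        # so change into $dir inside a subshell before running it.
        ( cd "$dir" && sh "$script" ) || return 1
    done
}
```

For the example above, calling the helper with the script path followed by cdh4 hdp2 would leave cdh4/TKGrid_REP and hdp2/TKGrid_REP side by side under the current directory.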

Configure for Access to a Data Store with a SAS Embedded Process

To configure the SAS High-Performance Analytics environment for a remote data store, follow these steps:
  1. Make sure that you have reviewed all of the information contained in the section Preparing for a Remote Parallel Connection.
  2. Make sure that you understand how the analytics environment configuration script works, as described in How the Configuration Script Works.
  3. Locate the software that is needed for the analytics environment. It is available from within the SAS Software Depot that was created by the site depot administrator: depot-installation-location/standalone_installs/SAS_High-Performance_Node_Installation/3_6/Linux_for_x64.
  4. Copy the TKGrid_REP file that is appropriate for your operating system to the /tmp directory of the root node of the analytic cluster.
  5. Log on to the machine that is the root node of the cluster with a user account that has the necessary permissions.
  6. Change directories to the desired installation location, such as /opt.
  7. Run the shell script in this directory.
    The shell script creates the TKGrid_REP subdirectory and places all files under that directory.
  8. Respond to the prompts from the configuration program:
    Configuration Parameters for the TKGrid_REP Shell Script
    Each prompt that the script displays (Parameter) is followed by its Description.
    TKGrid Remote EP Addon Configuration Utility.
    Running on 'machine-name'
    Using stdin for options.
    Do you want to configure remote access to Teradata? (yes/NO)
    If you are using a Teradata Managed Cabinet for your data provider, specify yes and press Enter. Otherwise, specify no and press Enter.
    Enter path of Teradata client install. i.e.: /opt/teradata/client/13.10
    If you specified yes in the previous step, specify the path where the Teradata client was installed and press Enter. (This path was recorded earlier in Record the Location of the Teradata Client.)
    Do you want to configure remote access to Greenplum? (yes/NO)
    If you are using a Greenplum Data Computing Appliance for your data provider, specify yes and press Enter. Otherwise, specify no and press Enter.
    Enter path of Greenplum client install. i.e.: /usr/local/greenplum-db
    If you specified yes in the previous step, specify the path where the Greenplum client was installed and press Enter. (This path was recorded earlier in Record the Location of the Greenplum Client.)
    Do you want to configure remote access to Hadoop? (yes/NO)
    If you are using a Hadoop machine cluster for your data provider, specify yes and press Enter. Otherwise, specify no and press Enter.
    Enter path of 64 bit JRE i.e.: /usr/java/jdk1.7.0_09/jre
    If you specified yes in the previous step, specify the path where the JRE that the analytics environment uses resides and press Enter. (This path was recorded earlier in Record the Location of the JRE Used by the SAS High-Performance Analytics Environment.)
    Enter path of the directory (or directories separated by :) containing the Hadoop client JAR files.
    Specify the path where the client Hadoop JAR files required by SAS reside and press Enter. (This path was recorded earlier in Record the Location of the Hadoop Client JAR Files.)
    Enter any JRE Options you need added for the Java invocation, or just Enter if none.
    If you need to add any JRE options, do so here (for example, -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Djava.library.path=/opt/mapr/lib).
    Do you want to configure remote access to Oracle? (yes/NO)
    If you are using an Oracle Exadata appliance for your data provider, specify yes and press Enter. Otherwise, specify no and press Enter.
    Enter path of Oracle client libraries. i.e.: /usr/local/ora11gr2/product/11.2.0/client_1/lib
    Enter the path where the Oracle client libraries reside and press Enter. (This path was recorded earlier in Record the Location of the Oracle Client.)
    Enter path of TNS_ADMIN, or just enter if not needed.
    Enter the value of the Oracle TNS_ADMIN environment variable and press Enter. (This value was recorded earlier in Record the Value of the Oracle TNS_ADMIN Environment Variable.)
    Do you want to configure remote access to SAP HANA? (yes/NO)
    If you are using a HANA cluster for your data provider, specify yes and press Enter. Otherwise, specify no and press Enter.
    Enter path of HANA client install. i.e.: /usr/local/lib/hdbclient
    Enter the path where the HANA client libraries reside and press Enter. (This path was recorded earlier in Record the Location of the HANA Client.)
    Shared install or replicate to each node? (Y=SHARED/n=replicated)
    If you are installing to a local drive on each node, then select no and press Enter to indicate that this is a replicated installation. If you are installing to a drive that is shared across all the nodes (for example, NFS), then specify yes and press Enter.
    Enter path to TKGrid install
    Specify the absolute path to where the analytics environment is installed and press Enter. This should be the directory in which the analytics environment install program was run with TKGrid appended to it (for example, /opt/TKGrid).
    For more information, see Step 6.
    Enter additional paths to include in LD_LIBRARY_PATH, separated by colons (:)
    If you have any external library paths that you want to be accessible to the analytics environment, specify the paths here and press Enter. Separate paths with a colon (:). If you have no paths to specify, press Enter.
  9. If you selected a replicated installation at the Shared install or replicate to each node? prompt, you are now prompted to choose the technique for distributing the contents to the appliance nodes:
    The install can now copy this directory to all the machines
    listed in 'pathname' using scp, skipping the first entry. Perform copy?  (YES/no)
    Press Enter if you want the installation program to perform the replication. Enter no if you are distributing the contents of the installation directory by some other technique.
  10. You have finished deploying the analytics environment for a remote data source. If you have not done so already, install the appropriate SAS Embedded Process on the remote data appliance or machine cluster for your respective data provider.
  11. To validate your analytics environment, proceed to Validating the Analytics Environment Deployment.
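
Two of the prompts in the procedure above (the Hadoop client JAR directories and the additional LD_LIBRARY_PATH entries) accept a list of directories separated by colons. Before you run the configuration script, it can help to confirm that every entry in such a list exists. The following is a minimal sketch, assuming a POSIX shell; check_path_list is a hypothetical helper, not part of the SAS software, and any paths that you pass to it are your own.

```shell
# Hypothetical pre-flight helper (not part of the SAS software):
# report every entry in a colon-separated list that is not a directory.
# Exit status is 0 when all entries exist, 1 otherwise.
# The body runs in a subshell, so the IFS and globbing changes
# do not leak into the calling shell.
check_path_list() (
    status=0
    IFS=:      # split the argument on colons
    set -f     # disable globbing so entries are taken literally
    for entry in $1; do
        if [ ! -d "$entry" ]; then
            echo "missing directory: $entry" >&2
            status=1
        fi
    done
    exit "$status"
)
```

You would call the helper with the same value that you intend to type at the prompt, quoted as a single argument, and correct any directory that it reports as missing before you start the script.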
Last updated: June 19, 2017