Manual Installation

Copying the Scripts

The SAS Data Quality Accelerator and SAS QKB scripts are contained in a self-extracting archive file named sepdqacchadp-2.70000-1.sh. This file is contained in a ZIP file that is located in a directory in your SAS Software Depot. The archive file must be extracted from the ZIP file and copied to the EPInstallDir that was created during the installation of the SAS In-Database Deployment Package, as described in Creating the SAS Embedded Process Directory.
To extract the archive file and copy it to the EPInstallDir on your Hadoop master node, follow these steps:
  1. Navigate to the YourSASDepot/standalone_installs directory.
    This directory was created when your SAS Software Depot was created by the SAS Download Manager.
  2. Locate the en_sasexe.zip file. This file is in the following directory: YourSASDepot/standalone_installs/SAS_Data_Quality_Accelerator_Embedded_Process_Package_for_Hadoop/2_7/Hadoop_on_Linux_x64.
    The sepdqacchadp-2.70000-1.sh file is included in this ZIP file.
  3. Unzip the ZIP file on the client.
    unzip en_sasexe.zip
    The ZIP file contains one file: sepdqacchadp-2.70000-1.sh.
  4. Copy the sepdqacchadp-2.70000-1.sh file to the EPInstallDir directory on the Hadoop master node (NameNode). The following example uses secure copy:
    scp sepdqacchadp-2.70000-1.sh username@hdpclus1:/EPInstallDir
  5. Log on to the Hadoop NameNode as root. Then, execute the following command from the EPInstallDir directory:
    ./sepdqacchadp-2.70000-1.sh
This command creates the following files in the EPInstallDir/sasexe/SASEPHome/bin directory on the Hadoop NameNode:
  • dq_install.sh
  • dq_uninstall.sh
  • dq_env.sh
  • qkb_push.sh
The dq_install.sh script enables you to deploy SAS Data Quality Accelerator files to the cluster nodes. See Installing SAS Data Quality Accelerator.
The dq_uninstall.sh script enables you to remove SAS Data Quality Accelerator files from the cluster nodes. See Removing SAS Data Quality Accelerator.
The dq_env.sh script is a utility script that is used by the other scripts.
The qkb_push.sh script enables you to deploy or remove the QKB to or from the cluster nodes. Before you can use qkb_push.sh, you must copy a QKB to the Hadoop master node. See Installing the QKB.
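The copy steps above can be collected into a small script. The following is a minimal sketch, not something delivered by SAS: the depot path /sasdepot and the variable and function names are assumptions, and username@hdpclus1 and /EPInstallDir are the placeholder values from step 4. By default it only prints each command. Set DRY_RUN=0 to run the commands.

```shell
#!/bin/sh
# Sketch only: stages the self-extracting archive on the NameNode.
# SAS_DEPOT, EP_INSTALL_DIR, and NAMENODE are placeholder values (assumptions).
SAS_DEPOT="${SAS_DEPOT:-/sasdepot}"
EP_INSTALL_DIR="${EP_INSTALL_DIR:-/EPInstallDir}"
NAMENODE="${NAMENODE:-username@hdpclus1}"
DRY_RUN="${DRY_RUN:-1}"

PKG_DIR="$SAS_DEPOT/standalone_installs/SAS_Data_Quality_Accelerator_Embedded_Process_Package_for_Hadoop/2_7/Hadoop_on_Linux_x64"
ARCHIVE="sepdqacchadp-2.70000-1.sh"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stage_archive() {
    run cd "$PKG_DIR" || return 1                          # steps 1 and 2
    run unzip en_sasexe.zip || return 1                    # step 3
    run scp "$ARCHIVE" "$NAMENODE:$EP_INSTALL_DIR" || return 1   # step 4
    # Step 5 is left as a manual action because it requires a root logon.
    echo "Next: log on to the NameNode as root and run ./$ARCHIVE from $EP_INSTALL_DIR"
}

stage_archive
```

Step 5 is deliberately not automated here, because how root access is obtained differs between sites.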

SAS Data Quality Accelerator

Installing SAS Data Quality Accelerator

To deploy SAS Data Quality Accelerator binaries to the cluster, run dq_install.sh. You must run dq_install.sh as the root user.
Run dq_install.sh as follows:
cd EPInstallDir/sasexe/SASEPHome/bin
./dq_install.sh
By default, the dq_install.sh script automatically discovers all nodes of the cluster and deploys the SAS Data Quality Accelerator files to those nodes. Use the -h or -f arguments to deploy the files to a specific node or group of nodes.
By default, dq_install.sh does not list the names of the host nodes to which it deploys the files. To create such a list, include the -v argument in the command.
For information about supported arguments, see DQ_INSTALL.SH and DQ_UNINSTALL.SH Syntax.
The dq_install.sh script creates the following files on each node on which it is executed:
EPInstallDir/bin/dq_install.sh
EPInstallDir/bin/dq_uninstall.sh
EPInstallDir/bin/qkb_push.sh
EPInstallDir/bin/dq_env.sh
EPInstallDir/jars/sas.tools.qkb.hadoop.jar
EPInstallDir/sasexe/tkeblufn.so
EPInstallDir/sasexe/t0w7zt.so
EPInstallDir/sasexe/t0w7zh.so
EPInstallDir/sasexe/t0w7ko.so
EPInstallDir/sasexe/t0w7ja.so
EPInstallDir/sasexe/t0w7fr.so
EPInstallDir/sasexe/t0w7en.so
EPInstallDir/sasexe/d2dqtokens.so
EPInstallDir/sasexe/d2dqlocales.so
EPInstallDir/sasexe/d2dqdefns.so
EPInstallDir/sasexe/d2dq.so
Verify that these files have been copied to the nodes.
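The verification can be scripted on each node. The following is a minimal sketch under stated assumptions: check_dq_files is a hypothetical helper, /EPInstallDir stands in for your actual installation directory, and only the JAR and shared-library entries from the list above are checked. The scripts in the bin directory can be added to the list in the same way.

```shell
#!/bin/sh
# Sketch only: reports which expected SAS Data Quality Accelerator files are
# missing under a given installation directory on the local node.
DQ_FILES="jars/sas.tools.qkb.hadoop.jar
sasexe/tkeblufn.so
sasexe/t0w7zt.so sasexe/t0w7zh.so sasexe/t0w7ko.so
sasexe/t0w7ja.so sasexe/t0w7fr.so sasexe/t0w7en.so
sasexe/d2dqtokens.so sasexe/d2dqlocales.so sasexe/d2dqdefns.so sasexe/d2dq.so"

# Prints one line per missing file; returns 0 only if nothing is missing.
check_dq_files() {
    base="$1"
    missing=0
    for f in $DQ_FILES; do
        if [ ! -f "$base/$f" ]; then
            echo "missing: $base/$f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# /EPInstallDir is a placeholder (assumption); substitute your own path.
EP_INSTALL_DIR="${EP_INSTALL_DIR:-/EPInstallDir}"
if check_dq_files "$EP_INSTALL_DIR"; then
    echo "all expected files present under $EP_INSTALL_DIR"
else
    echo "deployment incomplete under $EP_INSTALL_DIR"
fi
```

The check runs against the local file system only. Running it on every node depends on how your cluster is accessed, so that part is left out.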

Removing SAS Data Quality Accelerator

To remove SAS Data Quality Accelerator binaries from the cluster, run dq_uninstall.sh. You must run dq_uninstall.sh as the root user.
Note:
  • If you are removing the QKB, you must do so before removing the binaries. Removing the binaries removes the qkb_push.sh file that is used to remove the QKB. Running dq_uninstall.sh does not remove the QKB from the cluster. Instructions for removing the QKB are found in Removing the QKB.
  • This step is not necessary for Cloudera and Hortonworks distributions where SAS In-Database Technologies for Hadoop were installed through the SAS Deployment Manager.
Run dq_uninstall.sh as follows:
cd EPInstallDir/sasexe/SASEPHome/bin
./dq_uninstall.sh
By default, the dq_uninstall.sh script automatically discovers all nodes of the cluster and removes the SAS Data Quality Accelerator files from those nodes. Use the -h or -f arguments to remove the files from a specific node or group of nodes.
By default, dq_uninstall.sh does not list the names of the host nodes from which it removes the files. To create such a list, include the -v argument in the command.
For information about supported arguments, see DQ_INSTALL.SH and DQ_UNINSTALL.SH Syntax.
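The ordering constraint in the first note above (remove the QKB before the binaries, because removing the binaries also removes qkb_push.sh) can be made explicit in a wrapper. The following is a minimal sketch: remove_all, run, EP_BIN, and DRY_RUN are hypothetical names, and the default is to print the commands rather than run them.

```shell
#!/bin/sh
# Sketch only: full removal in the order that the note requires.
# EP_BIN is a placeholder for the directory that holds the scripts.
EP_BIN="${EP_BIN:-EPInstallDir/sasexe/SASEPHome/bin}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remove_all() {
    run cd "$EP_BIN" || return 1
    # 1. Remove the QKB first, while qkb_push.sh still exists.
    run ./qkb_push.sh -r || return 1
    # 2. Only then remove the binaries, which also removes qkb_push.sh.
    run ./dq_uninstall.sh || return 1
}

remove_all
```

Each step stops the sequence on failure, so the binaries are not removed if the QKB removal did not succeed.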

DQ_INSTALL.SH and DQ_UNINSTALL.SH Syntax

dq_install.sh
<-?>
<-l logfile>
<-f hostfile>
<-h hostname>
<-v >
dq_uninstall.sh
<-?>
<-l logfile>
<-f hostfile>
<-h hostname>
<-v >
Arguments

-?

prints usage information.

-l logfile

directs status information to the specified log file instead of to standard output.

-f hostfile

specifies the full path of a file that contains the list of hosts where SAS Data Quality Accelerator is installed or removed.

Default The script discovers the cluster topology and uses the retrieved list of data nodes.
Note The -f and -h arguments are mutually exclusive.
Example
-f /etc/hadoop/conf/slaves

-h hostname <hostname>

specifies the target host or host list where SAS Data Quality Accelerator is installed or removed.

Default The script discovers the cluster topology and uses the retrieved list of data nodes.
Requirement If you specify more than one host, the host names must be separated by spaces.
Note The -f and -h arguments are mutually exclusive.
Tip Use the -h argument when new nodes are added to the cluster.
Example
-h server1 server2 server3
-h bluesvr

-v

specifies verbose output, which lists the names of the nodes on which the script ran.
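The rule that -f and -h are mutually exclusive can be checked before either script is invoked. The following is a minimal sketch: build_dq_cmd is a hypothetical helper, not a SAS-provided tool, and the log file path is an example value. It only prints the resulting command line.

```shell
#!/bin/sh
# Sketch only: builds a dq_install.sh or dq_uninstall.sh command line and
# rejects the combination of -f and -h, which the syntax does not allow.
build_dq_cmd() {
    script="$1"      # dq_install.sh or dq_uninstall.sh
    hostfile="$2"    # value for -f, or empty
    hosts="$3"       # space-separated host names for -h, or empty
    logfile="$4"     # value for -l, or empty

    if [ -n "$hostfile" ] && [ -n "$hosts" ]; then
        echo "error: -f and -h are mutually exclusive" >&2
        return 2
    fi

    cmd="./$script -v"   # -v lists the nodes on which the script ran
    if [ -n "$logfile" ]; then cmd="$cmd -l $logfile"; fi
    if [ -n "$hostfile" ]; then cmd="$cmd -f $hostfile"; fi
    if [ -n "$hosts" ]; then cmd="$cmd -h $hosts"; fi
    echo "$cmd"
}

# Deploy to three named hosts.
build_dq_cmd dq_install.sh "" "server1 server2 server3" ""
# Remove from the hosts in a host file and log to a file (example path).
build_dq_cmd dq_uninstall.sh /etc/hadoop/conf/slaves "" /tmp/dq_uninstall.log
```

Leaving both the host file and the host list empty yields a command that relies on the default behavior, in which the script discovers the cluster topology itself.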

SAS Quality Knowledge Base (QKB)

Obtaining a QKB

You can obtain a QKB in one of the following ways:
  • Run the SAS Deployment Wizard, which is part of your SAS order. In the Select Products to Install dialog box, select the check box for SAS Quality Knowledge Base. This installs the SAS QKB for Contact Information.
    Note:
    • This option applies only to the SAS QKB for Contact Information. For step-by-step guidance on installing a QKB using the SAS Deployment Wizard, see SAS Quality Knowledge Base for Contact Information: Installation and Configuration Guide.
    • If you did not select the check box for SAS Quality Knowledge Base when you initially ran the SAS Deployment Wizard, you can run it again. In the Select Products to Install dialog box, deselect all check boxes, and then select the check box for SAS Quality Knowledge Base.
  • Download a QKB from the SAS downloads site. You can download either the SAS QKB for Contact Information or the SAS QKB for Product Data.
    Note: You must have a SAS profile for this option.
    1. Select the appropriate QKB.
    2. When prompted, log on to your SAS profile or create a new profile.
    3. Complete downloading and installing the QKB.
  • Copy a QKB that you already use with other SAS software in your enterprise.
    For more information, see Copying the QKB to the Hadoop NameNode.
After your initial deployment, periodically update the QKB in your Hadoop cluster to ensure that you are using the latest QKB updates provided by SAS.

Updating and Customizing the QKB

SAS provides regular updates to the QKB. Update your QKB each time that a new one is released. For a listing of the latest enhancements to the QKB, see the What’s New document on the SAS Quality Knowledge Base product documentation page at support.sas.com. To find this page, either search on the name SAS Quality Knowledge Base or locate the name in the product index and click the Documentation tab. Check the What’s New document for each QKB to determine which definitions have been added, modified, or deprecated, and to learn about new locales that might be supported. Contact your SAS software representative to order updated QKBs and locales. After obtaining the new QKB, copy it to the Hadoop NameNode (see Copying the QKB to the Hadoop NameNode) and deploy it using the same steps that you would use to deploy a standard QKB.
The definitions delivered in the QKB are sufficient for performing most data quality operations. However, if you have DataFlux Data Management Studio, you can use the Customize feature to modify your QKB to meet specific needs. See your SAS representative for information on licensing DataFlux Data Management Studio.
If you want to customize your QKB, do so on a local workstation, and then copy the customized QKB to the Hadoop NameNode for deployment. When updates to the QKB are required, merge your customizations into an updated QKB locally, and copy the updated, customized QKB to the Hadoop NameNode (see Copying the QKB to the Hadoop NameNode) for deployment. This enables you to deploy a customized QKB to the Hadoop cluster using the same steps that you would use to deploy a standard QKB. Copying your customized QKB from a local workstation also means that you have a backup of the QKB on your local workstation. See the online Help provided with your SAS Quality Knowledge Base for information about how to merge any customizations that you have made into an updated QKB.

Kerberos Security Requirements

A Kerberos ticket (TGT) is required to deploy the QKB in a Kerberos environment.
To create the ticket, follow these steps:
  1. Log on as root.
  2. Change to the HDFS user.
  3. Run kinit.
  4. Exit to root.
The following is an example of commands used to obtain the ticket.
su - root 
su - hdfs
kinit -kt hdfs.keytab hdfs
exit
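Before deploying the QKB, it can be useful to confirm that the ticket was actually granted. The following is a minimal sketch that assumes the MIT Kerberos klist utility, whose -s option produces no output and reports the result through its exit status. has_valid_ticket and KLIST_CMD are hypothetical names. Run the check as the same user that ran kinit (hdfs in the example above).

```shell
#!/bin/sh
# Sketch only: confirms that the current user holds a valid Kerberos ticket.
# KLIST_CMD is a hypothetical override, useful when klist is not on the PATH.
has_valid_ticket() {
    # "klist -s" is silent and exits 0 only if a valid ticket is cached.
    "${KLIST_CMD:-klist}" -s 2>/dev/null
}

if has_valid_ticket; then
    echo "Kerberos ticket found"
else
    echo "no valid Kerberos ticket; run kinit first"
fi
```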

Copying the QKB to the Hadoop NameNode

After you have obtained a QKB (see SAS Quality Knowledge Base (QKB)), you must copy it to the Hadoop NameNode. Copy the QKB to a temporary staging area, such as /tmp/qkbstage.
You can copy the QKB to the Hadoop NameNode by using a file transfer command like FTP or SCP, or by mounting the file system where the QKB is located on the Hadoop NameNode. You must copy the complete QKB directory structure.
SAS installation tools typically create a QKB in the following locations, where qkb_product is the QKB product name and qkb_version is the QKB version number:
  • Windows 7: C:\ProgramData\SAS\QKB\qkb_product\qkb_version.
    For example:
    C:\ProgramData\SAS\QKB\CI\26
    Note: ProgramData is a hidden location.
  • UNIX and Linux: /opt/sas/qkb/qkb_product/qkb_version.
    For example:
    /opt/sas/qkb/ci/26
The following example shows how you might copy a QKB that exists on a Linux system to the Hadoop NameNode. The example uses secure copy with the -r argument to recursively copy the specified directory and its subdirectories.
  • Assume that hmaster456 is the host name of the Hadoop NameNode.
  • The target location on the NameNode is /tmp/qkbstage.
To copy the QKB from the client desktop, issue the command:
scp -r /opt/sas/qkb/ci/26 hmaster456:/tmp/qkbstage

Overview of the QKB_PUSH.SH Script

The qkb_push.sh script enables you to perform the following actions.
  • Install or remove SAS QKB files on a single node or a group of nodes.
  • Generate a SAS QKB index file and write the file to an HDFS location.
  • Write the installation or removal output to a log file.
The qkb_push.sh file is created in the EPInstallDir/sasexe/SASEPHome/bin directory. You must execute qkb_push.sh from this directory.
Note: If you used SAS Deployment Manager or the zip file method of deploying SAS In-Database Technologies for Hadoop, the qkb_push.sh file is located in the EPInstallDir/SASEPHome/bin directory.
You can also use qkb_push.sh to deploy updated versions of the QKB. For more information, see Updating and Customizing the QKB.
You can suppress index creation by using the -x argument, or perform only index creation by using the -i argument. If users have a problem viewing QKB definitions from within SAS Data Loader, you might want to re-create the index file.
Note: Only one QKB and one index file are supported in the Hadoop framework at a time. For example, you cannot have a QKB for Contact Information and a QKB for Product Data in the Hadoop framework at the same time. Subsequent QKB and index pushes replace prior ones, unless you are pushing a QKB that is an earlier version than the one installed or has a different name. In these cases, you must remove the old QKB from the cluster before deploying the new one.
The QKB source directory is copied to the fixed location /opt/qkb/default on each node. The QKB index file is created in the /sas/qkb directory in HDFS. If a QKB or QKB index file already exists in the target location, the new QKB or QKB index file overwrites it.
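The replacement rule in the note above (a push replaces the installed QKB unless the new QKB has an earlier version or a different name) can be expressed as a small decision function. The following is a minimal sketch under stated assumptions: must_remove_first is a hypothetical helper, versions are assumed to be plain integers such as 26, and the caller is assumed to know the name and version of the installed QKB already. The name pd is an illustrative value, not a documented product name.

```shell
#!/bin/sh
# Sketch only: decides whether the installed QKB must be removed before a push.
# Returns 0 if removal is required first, 1 if the push can replace in place.
must_remove_first() {
    installed_name="$1"
    installed_version="$2"
    new_name="$3"
    new_version="$4"

    # A QKB with a different name does not replace the installed one.
    if [ "$installed_name" != "$new_name" ]; then
        return 0
    fi
    # An earlier version does not replace a later one.
    if [ "$new_version" -lt "$installed_version" ]; then
        return 0
    fi
    return 1
}

# Installed: ci version 26. Candidate: ci version 25 (an earlier version).
if must_remove_first ci 26 ci 25; then
    echo "remove the installed QKB first (qkb_push.sh -r)"
else
    echo "push directly"
fi
```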

Installing the QKB

Installing the QKB on the Hadoop cluster nodes performs the following two tasks:
  • copies the specified QKB directory to a fixed location (/opt/qkb/default) on each of the Hadoop nodes.
    Note: Each Hadoop node requires approximately 8 GB of disk space for the QKB.
  • generates an index file from the contents of the QKB and pushes this index file to HDFS. This index file, named default.idx, is created in the /sas/qkb directory in HDFS. The default.idx file provides a list of QKB definition and token names to SAS Data Loader.
    Note: Creating the index file requires special permissions in a Kerberos security environment. These permissions must be configured before deploying the QKB. See Kerberos Security Requirements.
To deploy the QKB to the cluster, run qkb_push.sh. You must run qkb_push.sh as the root user.
Run qkb_push.sh as follows:
cd EPInstallDir/sasexe/SASEPHome/bin
./qkb_push.sh qkb_path
where qkb_path is the name of the directory on the NameNode to which you copied the QKB. For example, you might use the following:
./qkb_push.sh /tmp/qkbstage/version
The qkb_push.sh script automatically discovers all nodes of the cluster by default and deploys the QKB to those nodes. Use the -h or -f arguments to specify deploying the files to a specific node or group of nodes.
By default, qkb_push.sh does not list the names of the host nodes to which it deploys the files. To create such a list, include the -v argument in the command. If a name other than the default was configured for the HDFS or MAPR user name, include the -s argument in the command.
For information about supported arguments, see QKB_PUSH.SH Syntax.
The qkb_push.sh script creates the following directories and files on each node on which it is executed:
/opt/qkb/default/chopinfo
/opt/qkb/default/dfx.meta
/opt/qkb/default/grammar
/opt/qkb/default/inst.meta
/opt/qkb/default/locale
/opt/qkb/default/phonetx
/opt/qkb/default/regexlib
/opt/qkb/default/scheme
/opt/qkb/default/upgrade.40
/opt/qkb/default/vocab
Verify that these directories and files have been copied to the nodes.
Check that the default.idx file was created in HDFS or MAPR by issuing the command:
hadoop fs -ls /sas/qkb
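Both checks can be scripted. The following is a minimal sketch: check_qkb_dir and check_qkb_index are hypothetical helpers, the directory check runs on the local node only, and the index check assumes that the hadoop command is on the PATH of the user who runs it.

```shell
#!/bin/sh
# Sketch only: checks one node for the QKB entries listed above and, where
# the hadoop command is available, checks HDFS for the index file.
QKB_ENTRIES="chopinfo dfx.meta grammar inst.meta locale phonetx regexlib scheme upgrade.40 vocab"

# Prints one line per missing entry; returns 0 only if all entries exist.
check_qkb_dir() {
    root="${1:-/opt/qkb/default}"
    status=0
    for entry in $QKB_ENTRIES; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry"
            status=1
        fi
    done
    return "$status"
}

check_qkb_index() {
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop command not found; skipping index check" >&2
        return 2
    fi
    # "hadoop fs -test -e" exits 0 if the path exists.
    hadoop fs -test -e /sas/qkb/default.idx
}

if check_qkb_dir /opt/qkb/default; then
    echo "QKB directory looks complete"
else
    echo "QKB directory is incomplete"
fi
if check_qkb_index; then
    echo "default.idx found in /sas/qkb"
else
    echo "default.idx not confirmed"
fi
```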

Removing the QKB

The QKB can be removed from the Hadoop cluster by running qkb_push.sh with the -r argument. You must have root access to run qkb_push.sh.
Note: If you are removing the entire in-database deployment, you must remove the QKB first.
Run qkb_push.sh as follows:
cd EPInstallDir/sasexe/SASEPHome/bin
./qkb_push.sh -r
When run with the -r argument, qkb_push.sh automatically discovers all nodes of the cluster by default and removes the QKB files from those nodes. Use the -h or -f arguments to remove the files from a specific node or group of nodes.
Note: The QKB index file is not removed from HDFS when the -h or -f argument is specified with -r.
By default, qkb_push.sh does not list the names of the host nodes from which it removes the files. To create such a list, include the -v argument in the command.
For information about supported arguments, see QKB_PUSH.SH Syntax.

QKB_PUSH.SH Syntax

qkb_push.sh <arguments> qkb_path
<-?>
<-l logfile>
<-f hostfile>
<-h hostname>
<-v >
<-s user-id>
<-i >
<-x >
<-r >
Arguments

-?

prints usage information.

-l logfile

directs status information to the specified log file instead of to standard output.

-f hostfile

specifies the full path of a file that contains the list of hosts where the QKB is installed or removed.

Default The qkb_push.sh script discovers the cluster topology and uses the retrieved list of data nodes.
Interaction Use the -f argument in conjunction with the -r argument to remove the QKB from specific nodes.
Note The -f and -h arguments are mutually exclusive.
See -r
Example
-f /etc/hadoop/conf/slaves

-h hostname <hostname>

specifies the target host or host list where the QKB is installed or removed.

Default The qkb_push.sh script discovers the cluster topology and uses the retrieved list of data nodes.
Requirement If you specify more than one host, the host names must be separated by spaces.
Interaction Use the -h argument in conjunction with the -r argument to remove the QKB from specific nodes.
Note The -f and -h arguments are mutually exclusive.
Tip Use the -h argument when new nodes are added to the cluster.
See -r
Example
-h server1 server2 server3
-h bluesvr

-v

specifies verbose output, which lists the names of the nodes on which the script ran.

-s user-id

specifies the user ID that has Write access to the HDFS root directory when the default user name is not used.

Default hdfs for all Hadoop distributions except MapR
mapr for MapR

-i

creates and pushes the QKB index only.

-x

suppresses QKB index creation.

-r

removes the QKB from the Hadoop nodes and removes the QKB index file from HDFS.

Default The -r argument discovers the cluster topology and uses the retrieved list of data nodes.
Interaction You can specify the hosts from which you want to remove the QKB by using the -f or -h arguments. The -f and -h arguments are mutually exclusive.
Note The QKB index file is not removed from HDFS when the -h or -f argument is specified in conjunction with -r.
See -f hostfile
-h hostname hostname
Example
-r -h server1 server2 server3
-r -f /etc/hadoop/conf/slaves
-r -l logfile
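The arguments combine into a few common command lines. The following is a minimal sketch: qkb_push_cmd is a hypothetical helper that only prints the command, /tmp/qkbstage/26 is an example staging path, and the sketch assumes that qkb_path is still supplied with -i and -x, as the syntax line suggests.

```shell
#!/bin/sh
# Sketch only: prints a qkb_push.sh command line for a few common tasks.
# Assumption: qkb_path is supplied for every mode except removal.
qkb_push_cmd() {
    mode="$1"
    qkb_path="$2"
    case "$mode" in
        install)    echo "./qkb_push.sh -v $qkb_path" ;;   # QKB and index
        index-only) echo "./qkb_push.sh -i $qkb_path" ;;   # index only
        no-index)   echo "./qkb_push.sh -x $qkb_path" ;;   # QKB without index
        remove)     echo "./qkb_push.sh -r -v" ;;          # remove QKB and index
        *)
            echo "unknown mode: $mode" >&2
            return 2
            ;;
    esac
}

qkb_push_cmd install /tmp/qkbstage/26
qkb_push_cmd index-only /tmp/qkbstage/26
qkb_push_cmd remove
```

The index-only form corresponds to the case described earlier, in which the index file is re-created because users cannot view QKB definitions in SAS Data Loader.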