SAS Quality Knowledge Base (QKB)

Deploying a QKB

Your software order entitles you to a QKB, either a SAS QKB for Contact Information or a SAS QKB for Product Data. To deploy a QKB, follow these steps:
  1. Obtain a QKB.
  2. Copy the QKB to the Hadoop master node (NameNode).
  3. Use qkb_push.sh to deploy the QKB in the cluster and create an index file.

Obtaining a QKB

You can obtain a QKB in one of the following ways:
  • Run the SAS Deployment Wizard. In the Select Products to Install dialog box, select the check box for SAS Quality Knowledge Base for your order. This installs the SAS QKB for Contact Information.
    Note: This option applies only to the SAS QKB for Contact Information. For step-by-step guidance on installing a QKB using the SAS Deployment Wizard, see the SAS Quality Knowledge Base for Contact Information: Installation and Configuration Guide on the SAS Documentation site.
  • Download a QKB from the SAS Downloads site. You can select the SAS QKB for Product Data or SAS QKB for Contact Information.
    Select a QKB, and then follow the installation instructions in the Readme file for your operating environment. To open the Readme file, you must have a SAS profile. When prompted, you can log on or create a new profile.
  • Copy a QKB that you already use with other SAS software in your enterprise.
    See Copying the QKB to the Hadoop NameNode for more information.
After your initial deployment, periodically update the QKB in your Hadoop cluster to make sure that you are using the latest QKB updates provided by SAS. For more information, see Updating and Customizing the SAS Quality Knowledge Base.

Copying the QKB to the Hadoop NameNode

After you have obtained a QKB, you must copy it to the Hadoop NameNode. We recommend that you copy the QKB to a temporary staging area, such as /tmp/qkbstage.
You can copy the QKB to the Hadoop NameNode by using a file transfer command such as FTP or SCP, or by mounting the file system where the QKB is located on the Hadoop NameNode. You must copy the complete QKB directory structure.
SAS installation tools typically create a QKB in these locations:
Windows 7: C:\ProgramData\SAS\QKB
Note: ProgramData is a hidden location.
UNIX and Linux: /opt/sas/qkb/share
The following example shows how you might copy a QKB that exists on a Linux system to the Hadoop NameNode. The example uses the secure copy command (scp) with the -r flag to recursively copy the specified directory.
  • Assume that desktop123 is the host name of the desktop system where the QKB is installed.
  • Assume that hmaster456 is the host name of the Hadoop NameNode.
  • The target location on the NameNode is /tmp/qkbstage.
To copy the QKB from the client desktop, issue the command:
scp -r /opt/sas/qkb/share hmaster456:/tmp/qkbstage
To pull the QKB while logged on to the Hadoop NameNode, issue the command:
scp -r desktop123:/opt/sas/qkb/share /tmp/qkbstage
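The staging step can be wrapped in a small script so that the command is reviewed before it runs. The following is a minimal sketch, not part of the SAS tooling. It reuses the host name and paths from the example above; the stage_cmd function and the SRC_HOST, QKB_SRC, STAGE_DIR, and DRY_RUN variables are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch only: pull a QKB onto the NameNode staging area.
# Defaults mirror the example in the text; override them as needed.
SRC_HOST="${SRC_HOST:-desktop123}"
QKB_SRC="${QKB_SRC:-/opt/sas/qkb/share}"
STAGE_DIR="${STAGE_DIR:-/tmp/qkbstage}"

# Print the scp command that would be run on the NameNode.
stage_cmd() {
    printf 'scp -r %s:%s %s\n' "$SRC_HOST" "$QKB_SRC" "$STAGE_DIR"
}

# DRY_RUN=1 (the default here) only prints the command.
# Set DRY_RUN=0 to perform the copy.
if [ "${DRY_RUN:-1}" = "1" ]; then
    stage_cmd
else
    scp -r "$SRC_HOST:$QKB_SRC" "$STAGE_DIR"
fi
```

Running the script with no settings prints the pull command from the example; running it with DRY_RUN=0 performs the copy.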

Using qkb_push.sh

SAS Data Quality Accelerator for Hadoop provides the qkb_push.sh executable file to enable you to deploy the QKB on the Hadoop cluster nodes.
Note: Each Hadoop node needs approximately 8 GB of disk space for the QKB.
The qkb_push.sh file performs two tasks:
  • copies the specified QKB directory to a fixed location (/opt/qkb/default) on each of the Hadoop nodes. The qkb_push.sh file automatically discovers all nodes in the cluster and deploys the QKB on them by default.
  • generates an index file from the contents of the QKB and pushes this index file to HDFS. This index file, named default.idx, is created in the /sas/qkb directory in HDFS. The default.idx file provides a list of QKB definition and token names to SAS Data Loader.
Creating the index file requires special permissions in a Kerberos security environment. If you have a Kerberos environment, see Kerberos Security Requirements before executing qkb_push.sh.
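Because each node needs roughly 8 GB for the QKB, it can be useful to check free space on a node before deploying. The following is a minimal sketch. The function names are hypothetical, and checking the file system that holds /opt is an assumption based on the /opt/qkb/default target location.

```shell
#!/bin/sh
# Sketch only: check that a node has room for the QKB.
# About 8 GB, expressed in 1 KB blocks (figure taken from the note above).
REQUIRED_KB=$((8 * 1024 * 1024))

# Print the free space, in KB, on the file system that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed if the file system that holds $1 has at least REQUIRED_KB free.
has_room_for_qkb() {
    [ "$(free_kb "$1")" -ge "$REQUIRED_KB" ]
}

# Example use on a node (assumes the QKB target is under /opt):
#   has_room_for_qkb /opt || echo "Less than 8 GB free on /opt"
```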

Kerberos Security Requirements

In a Kerberos environment, a Kerberos ticket (TGT) is necessary to run qkb_push.sh.
To create the ticket, follow these steps:
  1. Log on as root.
  2. Change to the HDFS user.
  3. Run kinit.
  4. Exit back to root.
  5. Run qkb_push.sh.
The following are examples of commands that you might use to obtain the ticket.
su - root 
su - hdfs
kinit -kt hdfs.keytab hdfs
exit
Note: You must supply the root password for the first command.
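Before exiting back to root, you might confirm that the ticket was actually granted. The following is a minimal sketch that relies on the -s option of the MIT Kerberos klist command, which prints nothing and reports the state of the credentials cache through its exit status. The ticket_status function and the KLIST variable are hypothetical names.

```shell
#!/bin/sh
# Sketch only: report whether the current user holds a usable ticket.
# KLIST can be overridden, for example to point at a specific klist binary.
KLIST="${KLIST:-klist}"

ticket_status() {
    # klist -s exits with status 0 when the credentials cache
    # can be read and has not expired.
    if "$KLIST" -s 2>/dev/null; then
        echo "valid"
    else
        echo "missing-or-expired"
    fi
}

# Example use, as the HDFS user right after kinit:
#   ticket_status
```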

Executing qkb_push.sh

The qkb_push.sh file must be run as the root user. It becomes the HDFS user or MAPR user, as appropriate, in order to detect the nodes in the cluster.
Execute qkb_push.sh as follows:
cd EPInstallDir/SASEPHome/bin
./qkb_push.sh qkb_path
For qkb_path, specify the name of the directory on the NameNode to which you copied your QKB. For example:
./qkb_push.sh /tmp/qkbstage
If a name other than the default was configured for the HDFS or MAPR user name, include the -s flag in the command as follows:
./qkb_push.sh -s HDFS-user /tmp/qkbstage
By default, the executable file does not list the names of the host nodes on which it installs the QKB. To create a list, include the -v flag in the command. Here is an example:
./qkb_push.sh -v /tmp/qkbstage
For more information about qkb_push.sh flags, see qkb_push.sh: Reference.
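The variations above can be assembled by a small helper that prints the command line for review. The following is a minimal sketch. The push_cmd function is a hypothetical name, and combining -s and -v in one invocation is an assumption based on the general syntax (options followed by qkb_path).

```shell
#!/bin/sh
# Sketch only: build a qkb_push.sh command line.
#   $1 = staged QKB path (required)
#   $2 = HDFS or MapR user name, if a non-default name was configured
#   $3 = the word "verbose" to add the -v flag
push_cmd() {
    cmd="./qkb_push.sh"
    if [ -n "${2:-}" ]; then
        cmd="$cmd -s $2"
    fi
    if [ "${3:-}" = "verbose" ]; then
        cmd="$cmd -v"
    fi
    echo "$cmd $1"
}

# Example use, as root, from EPInstallDir/SASEPHome/bin:
#   push_cmd /tmp/qkbstage "" verbose
```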

Verifying the QKB Deployment

The qkb_push.sh file creates the following files and directories in the /opt/qkb/default directory on each node. Check this directory on one or more nodes to make sure that these items are present.
  • chopinfo
  • dfx.meta
  • grammar
  • inst.meta
  • locale
  • phonetx
  • regexlib
  • scheme
  • upgrade.40
  • vocab
Check that the default.idx file was created in HDFS by issuing the command:
hadoop fs -ls /sas/qkb
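Both checks can be scripted. The following is a minimal sketch. The function names are hypothetical; the list of entries is the one given above, and the HDFS check uses the standard hadoop fs -test -e command.

```shell
#!/bin/sh
# Sketch only: verify a QKB deployment on one node.
QKB_ENTRIES="chopinfo dfx.meta grammar inst.meta locale phonetx regexlib scheme upgrade.40 vocab"

# Print the name of each expected entry that is missing from directory $1.
# The directory defaults to /opt/qkb/default. No output means all are present.
missing_qkb_entries() {
    dir="${1:-/opt/qkb/default}"
    for name in $QKB_ENTRIES; do
        [ -e "$dir/$name" ] || echo "$name"
    done
}

# Succeed if the index file exists in HDFS (requires the hadoop client).
index_exists() {
    hadoop fs -test -e /sas/qkb/default.idx
}

# Example use on a node:
#   missing_qkb_entries
#   index_exists || echo "default.idx not found in /sas/qkb"
```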

Troubleshooting the QKB Deployment

The QKB deployment can fail for the following reasons:
  • You did not obtain a Kerberos ticket before attempting to run qkb_push.sh in a Kerberos environment. Obtain a ticket and try again.
  • You executed qkb_push.sh from a directory other than EPInstallDir/SASEPHome/bin. The script must be run from EPInstallDir/SASEPHome/bin.
  • You had insufficient space in the /tmp directory for qkb_push.sh to run. Clear space and try again.
  • You specified an invalid name for the HDFS or MAPR user in qkb_push.sh. Check the name and try again.
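Some of these conditions can be checked before running the script. The following is a minimal sketch. All function names are hypothetical, and the check for the user name only confirms that the account is known on the local host. The amount of space that qkb_push.sh needs in /tmp is not stated here, so the sketch only reports the free space.

```shell
#!/bin/sh
# Sketch only: pre-flight checks for common deployment failures.

# Succeed if the current directory contains an executable qkb_push.sh.
in_push_dir() {
    [ -x ./qkb_push.sh ]
}

# Succeed if the account named by $1 is known on this host.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Print a message for each failed check, then the free space in /tmp.
#   $1 = HDFS or MapR user name to verify
preflight() {
    in_push_dir || echo "Run this from EPInstallDir/SASEPHome/bin"
    user_exists "$1" || echo "Unknown user: $1"
    echo "Free KB in /tmp: $(df -Pk /tmp | awk 'NR==2 {print $4}')"
}

# Example use:
#   preflight hdfs
```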

qkb_push.sh: Reference

Overview

The qkb_push.sh file is created in the EPInstallDir/SASEPHome/bin directory by the SAS Data Quality Accelerator install script (sepdqacchadp). You must execute qkb_push.sh from this directory.
By default, qkb_push.sh automatically discovers all nodes in the cluster and deploys the specified QKB on them. The script also generates an index file from the contents of the QKB and pushes this index file to HDFS.
Flags are provided to enable you to deploy the QKB to specific nodes or a group of nodes. If you are expanding your Hadoop cluster by adding new nodes after the initial deployment, you might want to use one of these flags to deploy the QKB to those nodes and avoid redeploying to the entire cluster. Flags are also available to enable you to suppress index creation or to perform only index creation. If users have a problem viewing QKB definitions from within SAS Data Loader, you might want to re-create the index file.
You can also use qkb_push.sh to deploy updated versions of the QKB. For more information, see Updating and Customizing the SAS Quality Knowledge Base.
Note: Only one QKB and one index file are supported in the Hadoop framework at a time. For example, you cannot have a QKB for Contact Information and a QKB for Product Data in the Hadoop framework at the same time. Subsequent QKB and index pushes replace prior ones, unless you are pushing a QKB that is of an earlier version than the one installed or has a different name. In these cases, you must remove the old QKB from the cluster before deploying the new one. For more information, see Removing the QKB from the Hadoop Cluster.
Run qkb_push.sh as the root user. It becomes the HDFS user or MAPR user, as appropriate, in order to detect the nodes in the cluster. A flag is available to specify the HDFS user name if a name other than the default was configured.
To simplify maintenance, the source QKB directory is copied to a fixed location (/opt/qkb/default) on each node. The QKB index file is created in the /sas/qkb directory in HDFS. If a QKB or QKB index file already exists in the target location, the new QKB or QKB index file overwrites it.

Syntax

./qkb_push.sh <options> qkb_path

Required Argument

qkb_path

specifies the path to the source directory for the QKB.

Authentication Options

-s HDFS-user | MAPR-user

specifies the user name to associate with HDFS, when the default user name is not used. The default user name is hdfs in all Hadoop distributions except MapR. In MapR, the default name is mapr.

QKB Index Options

-i

creates and pushes the QKB index only.

-x

suppresses QKB index creation.

Subsetting Options

-h hostname

specifies the host name or IP address of the computer or computers on which to perform the deployment.

-f hostfile

specifies the name of a file that contains a list of the host names or IP addresses on which to perform the deployment.

General Options

-?

prints usage information.

-l logfile

directs status information to the specified log file, instead of to standard output.

-r

removes the QKB from the Hadoop nodes and the QKB index file from HDFS.

-v

specifies verbose output, which lists the names of the nodes on which the qkb_push.sh file ran.

Examples

The following are sample commands:
  • To deploy to one or more nodes that are specified on a command line, execute:
    ./qkb_push.sh -h hostname1 [-h hostname2] qkb_path
  • To deploy using a file that contains a list of node names, execute:
    ./qkb_push.sh -f hostfile qkb_path
  • To deploy to one or more nodes suppressing QKB index creation, execute:
    ./qkb_push.sh -x -h hostname1 [-h hostname2] qkb_path
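When you add nodes to the cluster, a host file can be generated rather than typed by hand. The following is a minimal sketch. The write_hostfile function and the file and node names are hypothetical, and the layout of one host per line is an assumption, because the file format is not described here.

```shell
#!/bin/sh
# Sketch only: write a host file for use with the -f flag.
#   $1 = name of the host file to create
#   remaining arguments = host names or IP addresses, one written per line
write_hostfile() {
    file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Example use (node names are placeholders). Combining -x with -f is an
# assumption; the examples above show -x only with -h.
#   write_hostfile newnodes.txt node07 node08
#   ./qkb_push.sh -x -f newnodes.txt /tmp/qkbstage
```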