Creating a Hadoop Container Job

Problem

You want to run multiple Hadoop processes without creating an overly complicated SAS Data Integration Studio job.

Solution

You can create a SAS Data Integration Studio job that contains the Hadoop Container transformation. This transformation enables you to run a series of Hadoop steps such as transfers to and from Hadoop, Map Reduce processing, and Pig Latin processing without adding the dedicated transformations for those tasks to the job.
For example, you can create a sample job that performs the following tasks that are run through the Hadoop Container transformation:
  • Transfer data from a text source file to a Hadoop output file (Transfer To Hadoop step)
  • Transfer data from a Hadoop source file to a text output file (Transfer From Hadoop step)
  • Process data using a Map Reduce step
  • Process data using a Pig step
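Behind the scenes, SAS Data Integration Studio submits Hadoop steps to the Hadoop server through SAS code. The following sketch shows roughly comparable hand-written PROC HADOOP code that chains the same four step types in a single procedure run. It is an illustration only: the configuration file, credentials, JAR file, class names, and paths are assumptions, not values generated by the transformation.

/* Illustrative sketch only: the config fileref, credentials, JAR file, */
/* class names, and all paths are assumptions.                          */
filename hdcfg '/opt/sas/hadoop/hadoop_config.xml';  /* Hadoop cluster configuration */
filename pigcode '/local/scripts/sample.pig';        /* Pig Latin statements */

proc hadoop cfg=hdcfg username='sasdemo' password='XXXXXXXX' verbose;
   /* Transfer To Hadoop: copy a local text file into HDFS */
   hdfs copyfromlocal='/local/data/source.txt' out='/user/test/source.txt';
   /* Map Reduce: run mapper and reducer classes from a JAR file */
   mapreduce input='/user/test/source.txt' output='/user/test/mrout'
      jar='/local/jars/wordcount.jar'
      map='org.apache.hadoop.examples.WordCount$TokenizerMapper'
      reduce='org.apache.hadoop.examples.WordCount$IntSumReducer';
   /* Pig: submit the Pig Latin statements referenced by the fileref */
   pig code=pigcode;
   /* Transfer From Hadoop: copy an HDFS file back to the local file system */
   hdfs copytolocal='/user/test/mrout/part-r-00000' out='/local/data/target.txt';
run;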

Tasks

Create a Hadoop Container Job

You can create a Hadoop Container job similar to the sample job, which contains four Hadoop steps that correspond to four rows of tables and transformations.
The Diagram tab for the job is shown in the following display:
Hadoop Container Flow
The two rows at the top of the tab are used for the Transfer To and Transfer From steps. Note that the first row contains a text source table and a Hadoop target table, whereas the second row contains a Hadoop source table and a text target table. The two rows at the bottom of the tab are used for the Map Reduce and Pig steps. Note that each row begins with a text source and a Transfer To Hadoop transformation that creates the Hadoop source table for the Hadoop Container transformation. Both rows feed steps that send output to Hadoop target tables.

Add and Review Hadoop Steps

The steps processed in the Hadoop Container transformation are listed in a table on the Hadoop Steps tab. You can add, edit, reorder, and delete steps by clicking the buttons in the toolbar at the top of the tab.
The available step types are shown in the following display:
Hadoop Step Types
You can click a row in the table to review its name, description, input, and output in the Details panel at the bottom of the tab. If a step has multiple inputs or outputs, you can use the drop-down arrow to select the object that you need.
The following display shows the Details panel for the Transfer To step in the sample job:
Transfer To Details Panel
The Details panel for the Pig step is shown in the following display:
Pig Details Panel
You can select a step and click the Properties button to configure, review, and edit its properties.

Configure Transfer Steps

The Transfer To and Transfer From steps are configured in a window with separate panes for Transfer Options and Statement Options.
The following display shows the transfer options set for the Transfer To step in the sample job:
Transfer Options Window
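For orientation, the transfer options in this window correspond roughly to the options of the HDFS statement in PROC HADOOP. The following sketch is a hand-written approximation under assumed local and HDFS paths; it is not the code that the transformation generates.

/* Sketch of comparable transfer statements; all paths are assumptions */
proc hadoop cfg=hdcfg username='sasdemo' password='XXXXXXXX';
   /* Transfer To Hadoop step: local text file to HDFS */
   hdfs copyfromlocal='/local/data/NodeContainerTransferToSource.txt'
      out='/user/test/nodecontainer/NodeContainerTransferToTarget.txt';
   /* Transfer From Hadoop step: HDFS file back to a local text file */
   hdfs copytolocal='/user/test/nodecontainer/NodeContainerTransferToTarget.txt'
      out='/local/data/NodeContainerTransferFromTarget.txt';
run;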

Configure a Map Reduce Step

The properties for the Map Reduce step in the sample job are shown in the following display:
Map Reduce Properties
The window contains a field that specifies the Map Reduce JAR file. Note that the SAS Hadoop implementation does not include the open-source Hadoop JAR files. You must copy the required Hadoop JAR files from your Hadoop server to the SAS machine where the workspace server resides. Install only the Hadoop JAR files that SAS requires.
The window also includes required arguments for the mapper and optional arguments for the reducer and combiner.
Finally, the Map Reduce step in the sample job includes the following advanced options (accessed by clicking Advanced Options):
  • Output format class name: org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
  • Output key class name: org.apache.hadoop.io.Text
  • Output value class name: org.apache.hadoop.io.IntWritable
The Map Reduce Options section of the Advanced Options window also contains several other options that are not set in the sample job.
Note that the All files in this location check box must be selected on the File Location tab in the properties window for the Map Reduce target table. This setting enables you to see the data in the table after the job has completed successfully.
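These settings map naturally onto the MAPREDUCE statement of PROC HADOOP. The following hand-written sketch shows how the advanced options above could be expressed; the three output class names come from the sample job, but the JAR file, mapper and reducer class names, and data paths are assumptions for illustration.

/* Sketch of a comparable MAPREDUCE statement; the JAR file, mapper and */
/* reducer classes, and paths are assumptions                           */
proc hadoop cfg=hdcfg username='sasdemo' password='XXXXXXXX';
   mapreduce input='/user/test/nodecontainer/NodeContainerMapReduceSource.txt'
      output='/user/test/nodecontainer/NodeContainerMapReduce'
      jar='/local/jars/wordcount.jar'
      map='org.apache.hadoop.examples.WordCount$TokenizerMapper'
      reduce='org.apache.hadoop.examples.WordCount$IntSumReducer'
      outputformat='org.apache.hadoop.mapreduce.lib.output.TextOutputFormat'
      outputkey='org.apache.hadoop.io.Text'
      outputvalue='org.apache.hadoop.io.IntWritable';
run;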

Configure a Pig Step

The properties for the Pig step in the sample job are shown in the following display:
Pig Properties
The Pig step contains the following statements:
-- Load the source file, keep the first and third fields, and store
-- the result with a comma delimiter.
one = LOAD '/user/test/nodecontainer/NodeContainerPigSource2.txt' USING PigStorage();
generated = FOREACH one GENERATE $0, $2;
STORE generated INTO '/user/test/nodecontainer/NodeContainerPigTarget.txt' USING PigStorage(',');
Note that the All files in this location check box must be selected on the File Location tab in the properties window for the Pig target table. This setting enables you to see the data in the table after the job has completed successfully.
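If you want to test the same Pig Latin statements outside of the job, one possibility is to submit them through the PIG statement of PROC HADOOP, as in the following sketch. The local script path is a hypothetical placeholder.

/* Sketch: submitting the Pig Latin statements above from SAS; */
/* the local script path is a hypothetical placeholder         */
filename pigcode '/local/scripts/nodecontainer.pig';  /* contains the statements above */

proc hadoop cfg=hdcfg username='sasdemo' password='XXXXXXXX';
   pig code=pigcode;
run;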

Configure the Hadoop Options Tab

The following display shows the Hadoop Options tab:
Hadoop Options Tab
The Hadoop Options tab enables you to set options such as server selection, output deletion, pre- and post-process code, and configuration overrides for all of the steps. Note that the pre- and post-process code on this tab is run against the Hadoop server only. This code is not the standard pre- and post-process code that is run on the SAS workspace server. Therefore, SAS code is not appropriate input for these fields.
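The output deletion option matters because a Hadoop Map Reduce job fails if its output directory already exists. As an illustration, the following sketch shows how a previous run's output could be removed with an HDFS statement before the steps execute; the target path is an assumption.

/* Sketch: remove a previous run's output so Hadoop can re-create it; */
/* the target path is an assumption                                   */
proc hadoop cfg=hdcfg username='sasdemo' password='XXXXXXXX';
   hdfs delete='/user/test/nodecontainer/NodeContainerMapReduce';
run;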

Run the Job and Review the Output

Run the job and verify that the job completes without error. Then, review the output. You should see the following results:
  • The Hadoop external file target NodeContainerTransferToTarget has the same 6 observations as its NodeContainerTransferToSource source external file.
  • The external file target NodeContainerTransferFromTarget, which is not a Hadoop target, has the same 6 observations as its NodeContainerTransferToTarget source external file.
  • The Hadoop external file target NodeContainerMapReduce contains 7,816 distinct words, each in a separate observation.
  • The Hadoop external file target NodeContainerPigTarget has the same 6 observations as its NodeContainerPigSource2 source external file.