HADOOP Procedure

Example 2: Submitting a MapReduce Program

Details

This PROC HADOOP example submits a MapReduce program to a Hadoop server. The example uses the Hadoop MapReduce application WordCount, which reads an input text file, breaks each line into words, counts the occurrences of each word, and then writes the word counts to an output text file.
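The processing that WordCount performs can be sketched outside Hadoop. The following Python sketch is illustrative only and is not the Java code in WordCount.jar. The function names tokenizer_map, int_sum_reduce, and word_count are hypothetical stand-ins for the TokenizerMapper and IntSumReducer classes, and the sketch assumes that words are separated by whitespace and that each output line holds a word and its count separated by a tab.

```python
from collections import defaultdict


def tokenizer_map(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def int_sum_reduce(word, counts):
    """Reduce step: add up every count that was emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, group-by-key, and reduce over an iterable of text lines."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in tokenizer_map(line):
            grouped[word].append(count)
    # Process the keys in sorted order, as a single reducer would see them.
    return [int_sum_reduce(word, grouped[word]) for word in sorted(grouped)]


# Hypothetical input lines standing in for the contents of the input file.
sample = ["the quick brown fox", "the lazy dog jumps over the fox"]
for word, total in word_count(sample):
    print(f"{word}\t{total}")
```

In the sketch, the word "the" appears three times across the two sample lines, so its output line carries the count 3.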

Program

filename cfg 'C:\Users\sasabc\Hadoop\sample_config.xml';

proc hadoop options=cfg username="sasabc" password="sasabc" verbose;
   mapreduce input='/user/sasabc/architectdoc.txt'
      output='/user/sasabc/outputtest'
      jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'
      outputkey="org.apache.hadoop.io.Text"
      outputvalue="org.apache.hadoop.io.IntWritable"
      reduce="org.apache.hadoop.examples.WordCount$IntSumReducer"
      combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
      map="org.apache.hadoop.examples.WordCount$TokenizerMapper";
run;

Program Description

Assign a file reference to the Hadoop configuration file. The FILENAME statement assigns the file reference CFG to the physical location of a Hadoop configuration file that is named sample_config.xml, which is shown in Using PROC HADOOP.

filename cfg 'C:\Users\sasabc\Hadoop\sample_config.xml';

Execute the PROC HADOOP statement. The PROC HADOOP statement controls access to the Hadoop server by referencing the Hadoop configuration file with the OPTIONS= option, identifying the user ID and password on the Hadoop server, and specifying the VERBOSE option, which writes additional messages to the SAS log.

proc hadoop options=cfg username="sasabc" password="sasabc" verbose;

Submit a MapReduce program. The MAPREDUCE statement includes several options. INPUT= specifies the HDFS path of the input Hadoop file named architectdoc.txt.

   mapreduce input='/user/sasabc/architectdoc.txt'

Create an HDFS path. OUTPUT= creates the HDFS path for the program output location named outputtest.

      output='/user/sasabc/outputtest'

Specify the JAR file. JAR= specifies the location of the JAR file, WordCount.jar, that contains the MapReduce program.

      jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'

Specify an output key class. OUTPUTKEY= specifies the name of the output key class. The org.apache.hadoop.io.Text class stores text using standard UTF-8 encoding and provides methods to compare text.

      outputkey="org.apache.hadoop.io.Text"
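Because Text stores keys as UTF-8 bytes and compares them at the byte level, the words in the output are ordered by byte value rather than alphabetically without regard to case. The following Python sketch illustrates that ordering. It is not Hadoop code, and the word list is hypothetical.

```python
# Hypothetical output keys; Hadoop would hold each one in a Text object.
words = ["apple", "Zebra", "banana", "Apple"]

# Sort by the UTF-8 byte sequence of each word, as a byte-level comparison would.
byte_ordered = sorted(words, key=lambda w: w.encode("utf-8"))

for word in byte_ordered:
    print(word)
```

Uppercase ASCII letters have lower byte values than lowercase letters, so in this sketch "Apple" and "Zebra" sort ahead of "apple" and "banana".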

Specify an output value class. OUTPUTVALUE= specifies the name of the output value class org.apache.hadoop.io.IntWritable.

      outputvalue="org.apache.hadoop.io.IntWritable"

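Specify a reducer class. REDUCE= specifies the name of the reducer class org.apache.hadoop.examples.WordCount$IntSumReducer. In the class name, the dollar sign ($) separates the outer class WordCount from its nested class IntSumReducer.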

      reduce="org.apache.hadoop.examples.WordCount$IntSumReducer"

Specify a combiner class. COMBINE= specifies the name of the combiner class org.apache.hadoop.examples.WordCount$IntSumReducer, which is the same class that REDUCE= specifies.

      combine="org.apache.hadoop.examples.WordCount$IntSumReducer"

Specify a map class. MAP= specifies the name of the map class org.apache.hadoop.examples.WordCount$TokenizerMapper.

      map="org.apache.hadoop.examples.WordCount$TokenizerMapper";
run;
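The example names the same class for REDUCE= and COMBINE=. That choice works for WordCount because addition gives the same total whether counts are summed all at once or summed in stages. The following Python sketch illustrates the idea under simplifying assumptions. It is not Hadoop code, the helper sum_by_key is a hypothetical stand-in for IntSumReducer, and the two lists of pairs are made-up outputs of two map tasks.

```python
from collections import defaultdict


def sum_by_key(pairs):
    """Group (word, count) pairs by word and add up the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Hypothetical (word, 1) pairs produced by two separate map tasks.
map_task_outputs = [
    [("the", 1), ("fox", 1), ("the", 1)],
    [("the", 1), ("dog", 1), ("fox", 1)],
]

# Without a combiner: every pair goes to the reducer unchanged.
without_combiner = sum_by_key(
    [pair for task in map_task_outputs for pair in task]
)

# With a combiner: each map task's pairs are summed locally first,
# and the reducer then sums the partial totals.
combined = [sum_by_key(task) for task in map_task_outputs]
with_combiner = sum_by_key([pair for task in combined for pair in task])

print(with_combiner == without_combiner)
print(sum(len(task) for task in map_task_outputs), "pairs sent without a combiner")
print(sum(len(task) for task in combined), "pairs sent with a combiner")
```

Both paths produce identical totals, while the combined path passes fewer pairs to the reduce step. A reducer whose operation cannot be applied in stages, such as one that computes an average directly, would generally not be suitable as a combiner.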