HADOOP Procedure

Example 2: Submitting a MapReduce Program

Details

This PROC HADOOP example submits a MapReduce program to a Hadoop server. The example uses the Hadoop MapReduce application WordCount, which reads a text input file, breaks each line into words, counts the occurrences of each word, and then writes the word counts to an output text file.
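
To make the word-count logic concrete, the following is a minimal local sketch in SAS, not the Hadoop job itself. The data set name WORDS, the variable names, and the two sample input lines are assumptions made for illustration. The DATA step plays the role of the mapper (splitting each line into words), and PROC FREQ plays the role of the reducer (totaling the occurrences of each word).

```sas
/* Illustrative sketch only: mimics WordCount locally on two sample lines. */
/* WORDS, LINE, WORD, and the sample text are hypothetical.                */
data words;
   length line $ 200 word $ 50;
   infile datalines truncover;
   input line $char200.;
   /* "Map" step: emit one observation per blank-delimited word. */
   do i = 1 to countw(line, ' ');
      word = scan(line, i, ' ');
      output;
   end;
   keep word;
   datalines;
the quick brown fox
jumps over the lazy dog
;
run;

/* "Reduce" step: count how many times each word occurs. */
proc freq data=words;
   tables word / nocum nopercent;
run;
```

In the actual example, the same two steps run on the Hadoop cluster as the TokenizerMapper and IntSumReducer classes.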

Program

filename cfg 'C:\Users\sasabc\Hadoop\sample_config.xml';
proc hadoop options=cfg username="sasabc" password="sasabc" verbose;
   mapreduce input='/user/sasabc/architectdoc.txt'
      output='/user/sasabc/outputtest'
      jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'
      outputkey="org.apache.hadoop.io.Text"
      outputvalue="org.apache.hadoop.io.IntWritable"
      reduce="org.apache.hadoop.examples.WordCount$IntSumReducer"
      combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
      map="org.apache.hadoop.examples.WordCount$TokenizerMapper";
run;

Program Description

Assign a file reference to the Hadoop configuration file. The FILENAME statement assigns the file reference CFG to the physical location of a Hadoop configuration file that is named sample_config.xml, which is shown in Using PROC HADOOP.
filename cfg 'C:\Users\sasabc\Hadoop\sample_config.xml';
Execute the PROC HADOOP statement. The PROC HADOOP statement controls access to the Hadoop server by referencing the Hadoop configuration file with the OPTIONS= option, identifying the user ID and password on the Hadoop server, and specifying the VERBOSE option, which writes additional messages to the SAS log.
proc hadoop options=cfg username="sasabc" password="sasabc" verbose;
Submit a MapReduce program. The MAPREDUCE statement includes several options. INPUT= specifies the HDFS path of the input file, architectdoc.txt.
   mapreduce input='/user/sasabc/architectdoc.txt'
Create an HDFS path. OUTPUT= specifies the HDFS path, outputtest, that is created to hold the program output.
      output='/user/sasabc/outputtest'
Specify the JAR file. JAR= specifies the location of the JAR file, WordCount.jar, that contains the MapReduce program.
      jar='C:\Users\sasabc\Hadoop\jars\WordCount.jar'
Specify an output key class. OUTPUTKEY= specifies the name of the output key class. The org.apache.hadoop.io.Text class stores text using standard UTF-8 encoding and provides methods to compare text.
      outputkey="org.apache.hadoop.io.Text"
Specify an output value class. OUTPUTVALUE= specifies the name of the output value class, org.apache.hadoop.io.IntWritable.
      outputvalue="org.apache.hadoop.io.IntWritable"
Specify a reducer class. REDUCE= specifies the name of the reducer class, org.apache.hadoop.examples.WordCount$IntSumReducer.
      reduce="org.apache.hadoop.examples.WordCount$IntSumReducer"
Specify a combiner class. COMBINE= specifies the name of the combiner class, org.apache.hadoop.examples.WordCount$IntSumReducer. This example uses the same class for the combiner and the reducer.
      combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
Specify a map class. MAP= specifies the name of the map class, org.apache.hadoop.examples.WordCount$TokenizerMapper.
      map="org.apache.hadoop.examples.WordCount$TokenizerMapper";
run;
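
After the job finishes, the word counts are in the HDFS output path. The following sketch shows one way to retrieve them and to clear the output path, using the HDFS statement of PROC HADOOP. The reducer output file name part-r-00000 and the local file name wordcount_results.txt are assumptions; the actual file names depend on the Hadoop version and the number of reducers. A standard MapReduce job typically fails if its output directory already exists, so the path is deleted here before any resubmission. Check how DELETE= treats a non-empty directory at your site before relying on it.

```sas
/* Sketch only: part-r-00000 and wordcount_results.txt are assumed names. */
filename cfg 'C:\Users\sasabc\Hadoop\sample_config.xml';
proc hadoop options=cfg username="sasabc" password="sasabc" verbose;
   /* Copy the reducer output from HDFS to the local file system. */
   hdfs copytolocal='/user/sasabc/outputtest/part-r-00000'
      out='C:\Users\sasabc\Hadoop\wordcount_results.txt' overwrite;
   /* Remove the output path so that the job can be submitted again. */
   hdfs delete='/user/sasabc/outputtest';
run;
```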