WHERE Processing Optimization with MapReduce

Overview: WHERE Processing Optimization with MapReduce

WHERE processing enables you to conditionally select a subset of rows, so that the software processes only the rows that meet specified conditions. To optimize the performance of WHERE processing, you can request that data subsetting be performed in the Hadoop cluster, which can be an SPD Server dynamic cluster table. Then, when you submit SPD Server code that includes a WHERE expression (which defines the condition that selected rows must satisfy), SPD Server instantiates the WHERE expression as a Java class. SPD Server submits the Java class to the Hadoop cluster as a component in a MapReduce program.
By requesting that data subsetting be performed in the Hadoop cluster, performance might be improved by taking advantage of the filtering and ordering capabilities of the MapReduce framework. As a result, only the subset of the data is returned to the SPD Server client. Performance is often improved with large tables when the WHERE expression qualifies only a relatively small subset.

WHERE Processing Optimization Settings

By default, WHERE processing is performed by SPD Server on the SPD Server host. Data subsetting is not performed in the Hadoop cluster. Here are the WHERE processing optimization settings that you can specify:

Using the HADOOPACCELWH Parameter File Option

To request global WHERE processing optimization so that all data subsetting is performed in the Hadoop cluster:
  • Include the HADOOPACCELWH parameter file option in the spdsserv.parm parameter file.
  • Use the HADOOPACCELJVER= parameter file option in the spdsserv.parm parameter file to specify the Java Runtime Environment version running on the Hadoop cluster. The default Java version is 1.6.
  • Use the HADOOPWORKPATH= parameter file option in the spdsserv.parm parameter file to specify the path to the HDFS directory that stores the temporary results of the MapReduce output. If the parameter is not specified, the files are written to /tmp. The specified path must exist. If you specify a path that does not exist, the MapReduce job fails.
  • Use the SPDSACWH= macro variable or the ACCELWHERE= table option to turn off WHERE processing optimization if needed.
Note: You can include the NOHADOOPACCELWH parameter file option in the spdsserv.parm parameter file to specify that WHERE processing optimization with MapReduce cannot be requested. With the NOHADOOPACCELWH setting, you cannot use the SPDSACWH= macro variable or the ACCELWHERE= table option to request WHERE processing optimization.

WHERE Processing Optimization Requirements

To perform the data subsetting in the Hadoop cluster, the following requirements must be met. If any of these requirements are not met, the subsetting is performed by the SPD Server host, not by a MapReduce program in the Hadoop cluster.
  • The table cannot be encrypted.
  • The table cannot be compressed.
  • The table must be larger than the HDFS block size.
  • The submitted SAS code cannot request BY-group processing.
  • The submitted SAS code cannot include the STARTOBS= or ENDOBS= options.
The submitted WHERE expression cannot include any of the following syntax:
  • a variable as an operand, such as where lastname;
  • variable-to-variable comparison
  • SAS functions, such as SUBSTR, TODAY, UPCASE, and PUT
  • arithmetic operators *, / , + , – , and **
  • IS NULL or IS MISSING and IS NOT NULL or IS NOT MISSING operators
  • concatenation operators, such as || or !!
  • negative prefix operator, such as where z = –(x+y);
  • pattern matching operators LIKE and CONTAINS
  • sounds-like operator SOUNDEX (=*)
  • truncated comparison operator using the colon modifier (:), such as where lastname=: ‘S’;
Note: For additional details about WHERE processing optimization, include the macro variable SPDSWDEB=YES in your code to determine whether the optimization occurred. For more information, see the SPDSWDEB= macro variable in SAS Scalable Performance Data Server: User’s Guide.