DATA Step Processing in Hadoop

To accelerate DATA step processing of data stored in Hadoop, the DATA step has been enhanced to determine when user code is eligible for export to the Hadoop MapReduce facility. If the SAS Embedded Process is installed and activated on a Hadoop cluster, DATA step code can be executed in parallel against input data that resides in HDFS.
Because of the single-source, shared-nothing nature of MapReduce processing and the immutable nature of HDFS files, only a subset of the full DATA step syntax can be passed through for parallel execution. The DATA step can run inside Hadoop for scoring with the following limitations (a compliant example follows the list):
  • Only one input file and one output file are allowed.
  • The input file and the output file must be in Hadoop.
  • Only functions and formats that are supported by the DS2 language compile successfully.
  • Some DATA step statements are not allowed, such as those pertaining to input and output.
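The following sketch shows a DATA step that meets these requirements, assuming the SAS Embedded Process is active on the cluster. The libref HDP, its connection options, and the table and variable names are illustrative; substitute values appropriate for your site.

   /* Hypothetical Hadoop libref; connection options vary by cluster */
   libname hdp hadoop server="mycluster" user=myuser;

   data hdp.scored;               /* single output table in Hadoop */
      set hdp.intab;              /* single input table in Hadoop  */
      score = 0.4*x1 + 0.6*x2;    /* simple scoring assignment     */
   run;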
To enable the DATA step to run inside Hadoop, set the DSACCEL= system option to ANY.
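For example, the option can be set with an OPTIONS statement before the step runs, and PROC OPTIONS can confirm the current setting:

   options dsaccel=any;          /* allow in-database DATA step processing */

   proc options option=dsaccel;  /* verify the setting in the log */
   run;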
If a SAS program does not meet the requirements for running in Hadoop, the code executes in your Base SAS session. In this case, SAS reads and writes large tables over the network.
To determine whether your code is compliant for Hadoop, set the system option MSGLEVEL=I. When MSGLEVEL=I is in effect, SAS writes log messages that identify any non-compliant code.
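For example, the following statement turns on informational messages for the remainder of the session:

   options msglevel=i;   /* write INFO notes, including non-compliance messages */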