Parallel Processing for Data in HDFS

Overview: Parallel Processing for Data in HDFS

Parallel processing divides a large operation into smaller ones that multiple threads execute simultaneously. SPD Server supports parallel processing to improve the performance of reading and writing data that is stored in HDFS.
By default, SPD Server performs parallel processing only if a Read operation includes WHERE processing. If the Read operation does not include WHERE processing, a single thread performs it. To request parallel processing for all Read operations, use the SPDSHPRD= macro variable or the PARALLELREAD= table option. To request parallel processing for all Write operations, use the SPDSHPWR= macro variable or the PARALLELWRITE= table option.
This example uses the SPDSHPRD= macro variable to request parallel processing for all Read operations:
%let spdshprd=yes;
In this example, the PARALLELREAD= table option requests parallel processing for all Read operations for the Class.StudentID table:
libname class sasspds 'mydomain' server=myhost.5400 user="anonymous";

proc freq data=class.StudentID (parallelread=yes); 
   tables age; 
run;
This example uses the SPDSHPWR= macro variable to request parallel processing with eight threads for all Write operations:
%let spdshpwr=8;
In this example, the PARALLELWRITE= table option requests parallel processing for Write operations and specifies eight threads:
libname class sasspds 'mydomain' server=myhost.5400 user="anonymous";

proc append base=class.StudentID (parallelwrite=8) data=class.Ages;
run;
Note: For parallel processing for Read operations, SPD Server determines the number of threads to use based on the MAXWHTHREADS= parameter file option value, which specifies the maximum number of threads that SPD Server uses for processing. For parallel processing for Write operations, you must specify the number of threads to use. The maximum number of threads is determined by the MAXWHTHREADS= parameter file option value.

Parallel Processing Considerations

The following are considerations for requesting parallel processing:
  • In some environments, parallel processing might not improve performance. The gain depends on the available network bandwidth and the number of CPUs on the SAS client machine. Set up a test in your environment to measure performance with and without parallel processing.
  • When parallel Read processing occurs, the order in which the rows are returned might not be in the physical order of the rows in the table. Some applications require that rows be returned in the physical order. For example, the COMPARE procedure expects that rows are read from the table in the same order that they were written. Also, legacy code that uses the DATA step or the OBS= table option might rely on physical order to produce the expected results.
  • When parallel Write processing occurs, each parallel write thread works on its own data partition file. This can produce multiple partitions that are smaller than the maximum partition size for the data file. Choose the number of parallel write threads carefully: too many small partitions result in poor performance for subsequent reads and writes.
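
The test suggested in the first consideration could be sketched as follows. This is a minimal sketch, not a definitive benchmark. It reuses the libref, domain, host, and table names from the examples earlier in this section, and it assumes that PARALLELREAD=NO is an accepted value for turning off the request on a single table. The FULLSTIMER system option is used only to write detailed timing statistics to the SAS log.

```sas
/* Sketch only: connection details repeat the placeholders used earlier in this section. */
libname class sasspds 'mydomain' server=myhost.5400 user="anonymous";

/* FULLSTIMER writes real-time and CPU-time statistics for each step to the SAS log. */
options fullstimer;

/* Run 1: parallel Read processing is not requested for this table. */
/* Assumption: PARALLELREAD=NO is a valid setting.                  */
proc freq data=class.StudentID (parallelread=no);
   tables age;
run;

/* Run 2: the same step with parallel Read processing requested. */
proc freq data=class.StudentID (parallelread=yes);
   tables age;
run;
```

Compare the real time and CPU time that the log reports for the two steps. Repeating each run several times, in alternating order, helps separate the effect of parallel processing from caching and network variation.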

Tuning Parallel Processing Performance

To tune the performance of parallel processing, consider these SPD Server options: