Understanding the Parallel Join Facility

Overview of the Parallel Join Facility

The Parallel Join facility is a feature of the server SQL Planner that decreases the processing time that is required to create a pairwise join between two server tables. The savings in processing time is created when the server performs the pairwise join in parallel.
The SQL Planner first searches for pairs when the server source tables are to be joined. When the Planner finds a pair, it checks the join syntax for that pair to determine whether the syntax meets all of the requirements for the Parallel Join facility. If the join syntax meets the requirements, the pair of tables are joined by the Parallel Join facility.

Criteria for Using the Parallel Join Facility

The criteria for using the server Parallel Join facility can be more complex than simply requiring a pairwise join of two server tables. The Parallel Join facility can handle multiple character columns, numeric columns, or combinations of character and numeric columns that are joined between pairs of tables. Numeric columns do not need to be of the same width to act as a join key, but character columns must be of the same width in order to be a join key. Columns that are involved in a join cannot be derived from a SAS CASE statement, and cannot be created from character manipulation functions such as SUBSTR, YEAR, MONT, DAY, and TRIM.

Parallel Join Methods

Parallel Sort-Merge Method

The parallel sort-merge join method first performs a parallel sort to order the data, and then merges the sorted tables in parallel. During the merge, the facility concurrently joins multiple rows from one table with the corresponding rows in the other table. You can use the parallel sort-merge join method to execute any join that meets the requirements for a parallel join.
The parallel sort-merge method is a good, all-purpose parallel join strategy that requires no intervention from you. The tables for the sort-merge method do not need to be in the same domain. The sort-merge method is not affected by the distribution of the data in the sort key columns.
The sort-merge method begins by completely sorting the smaller of the two tables that are being joined. Simultaneously, it performs partial parallel sorts on the larger table. If both tables are very large and sufficient resources are not available to do the complete sort on the smaller table, the performance of the parallel sort-merge method can degrade. The parallel sort-merge method is also limited when you are performing an outer join, left join, or right join in parallel. Parallel outer joins, left joins, or right joins can use only two concurrent threads. Inner joins are not limited in the parallel sort-merge method and can use more than two concurrent threads during parallel operations.

Parallel Range Join Method

The parallel range join method uses a join index to determine the ranges of rows between the tables that can be joined in parallel. The parallel range join method requires you to create a join index on the columns to be joined in the tables that you want to merge. The join index divides the two tables into a specified number of near-equal parts, or ranges, based on matching values between the join columns. The Parallel Join facility recognizes the ranges of rows that contain matching values between the join columns, and then uses concurrent join threads to join the rows in parallel. The server parallel sort then sorts the rows within a range.
You can use the parallel range join method only on tables that are in the same domain. If either of the two tables are updated after the join index is created, you must rebuild the join index before you can use the parallel range join method. The parallel range join method performs best when the columns of the tables that are being joined are sorted. If the columns are not relatively sorted, then the concurrent join threads can cause processor thrashing. Processor thrashing occurs when unsorted rows in a table require the server to perform increasingly larger table row scans. These larger scans can consume processor resources at a high rate during concurrent join operations. For more information about creating join indexes, see Index Utility in SAS Scalable Performance Data Server: Administrator’s Guide.
How does the server Parallel Join facility choose between the sort-merge method and the range join method? If a join index is available for the tables to be joined, the Parallel Join facility chooses the parallel range join method. If a join index does not exist, or if the join index has not been rebuilt because a table was updated, the Parallel Join facility defaults to using the parallel sort-merge method.

Parallel Joins with Group-By

A powerful feature of the server Parallel Join facility is its integration with the server Parallel Group-By facility. If the result of the parallel join contains a GROUP BY statement, the partial results of the parallel join threads are passed to the Parallel Group-By facility, which performs the group-by operation in parallel. In the following example, the server performs both a parallel join and parallel group-by operation.
libname path1 sasspds .... IP=YES;

proc sql;
create table junk as
 select a.c, b.d, sum(b.e)
 from path1.table1 a,
  path1.table2 b
 where a.i = b.i
 group by a.d, b.d;
quit;
When you use the server Parallel Join facility, you can use the parallel group-by method on multiple tables.
Last updated: February 8, 2017