Parallel Join Facility

Overview of the Parallel Join Facility

The Parallel Join facility is a feature of the SPD Server SQL Planner that decreases the processing time that is required to create a pairwise join between two SPD Server tables. The savings in processing time is created when SPD Server performs the pairwise join in parallel.
The SQL Planner first searches for pairs when SPD Server source tables are to be joined. When the Planner finds a pair, it checks the join syntax for that pair to determine whether the syntax meets all of the requirements for the Parallel Join facility. If the join syntax meets the requirements, the pair of tables are joined by the Parallel Join facility.

Criteria for Using the Parallel Join Facility

The criteria for using the SPD Server Parallel Join facility can be more complex than simply requiring a pairwise join of two SPD Server tables. The Parallel Join facility can handle multiple character columns, numeric columns, or combinations of character and numeric columns that are joined between pairs of tables. Numeric columns do not need to be of the same width to act as a join key, but character columns must be of the same width in order to be a join key. Columns that are involved in a join cannot be derived from a SAS CASE statement, and cannot be created from character manipulation functions such as SUBSTR, YEAR, MONT, DAY, and TRIM.

Parallel Join Methods

Parallel Sort-Merge Method

The parallel sort-merge join method first performs a parallel sort to order the data, and then merges the sorted tables in parallel. During the merge, the facility concurrently joins multiple rows from one table with the corresponding rows in the other table. You can use the parallel sort-merge join method to execute any join that meets the requirements for a parallel join.
The parallel sort-merge method is a good, all-around parallel join strategy that requires no intervention from you. The tables for the sort-merge method do not need to be in the same domain. The sort-merge method is not affected by the distribution of the data in the sort key columns.
The sort-merge method begins by completely sorting the smaller of the two tables that are being joined, while simultaneously performing partial parallel sorts on the larger table. If both tables are very large and sufficient resources are not available to do the complete sort on the smaller table, the performance of the parallel sort-merge method can degrade. The parallel sort-merge method is also limited when you are performing an outer, left, or right join in parallel. Parallel outer, left, or right joins can use only two concurrent threads. Inner joins are not limited in the parallel sort-merge method and can use more than two concurrent threads during parallel operations.

Parallel Range Join Method

The parallel range join method uses a join index to determine the ranges of rows between the tables that can be joined in parallel. The parallel range join method requires you to create a join index on the columns to be joined in the tables that you want to merge. The join index divides the two tables into a specified number of near-equal parts, or ranges, based on matching values between the join columns. The Parallel Join facility recognizes the ranges of rows that contain matching values between the join columns, and then uses concurrent join threads to join the rows in parallel. The SPD Server parallel sort then sorts the rows within a range.
You can use the parallel range join method only on tables that are in the same domain. If either of the two tables are updated after the join index is created, you must rebuild the join index before you can use the parallel range join method. The parallel range join method performs best when the columns of the tables that are being joined are sorted. If the columns are not relatively sorted, then the concurrent join threads can cause processor thrashing. Processor thrashing occurs when unsorted rows in a table require SPD Server to perform increasingly larger table row scans, which can consume processor resources at a high rate during concurrent join operations.
For more information about creating join indexes, see SAS Scalable Performance Data (SPD) Server Index Utility Ixutil in SAS Scalable Performance Data Server: Administrator's Guide.
How does the SPD Server Parallel Join facility choose between the sort-merge method and the range join method? If a join index is available for the tables to be joined, the Parallel Join facility chooses the parallel range join method. If a join index does not exist, or if the join index has not been rebuilt because a table was updated, the Parallel Join facility defaults to using the parallel sort-merge method.

Parallel Joins with Group-By

A powerful feature of the SPD Server Parallel Join facility is its integration with the SPD Server Parallel Group-By facility. If the result of the parallel join contains a GROUP BY statement, the partial results of the parallel join threads are passed to the Parallel Group-By facility, which performs the group-by operation in parallel. In the following example, SPD Server performs both a parallel join and parallel group-by operation.
LIBNAME path1 sasspds .... IP=YES;

PROC SQL;
create table junk as
 select a.c, b.d, sum(b.e)
 from path1.table1 a,
  path1.table2 b
 where a.i = b.i
 group by a.d, b.d;
quit;
When you use the SPD Server Parallel Join facility, you can use the parallel group-by method on multiple tables.

Parallel Join SQL Options

PLLJOIN / NOPLLJOIN

The PLLJOIN / NOPLLJOIN option enables and disables the SPD Server Parallel Join facility.
Usage:
execute(reset noplljoin)
  by sasspds ; /* disables Parallel Join */

CONCURRENCY=

The CONCURRENCY=<n> option sets the concurrency level that the SPD Server Parallel Join facility uses. The integer value n specifies the number of levels. In most cases, you should not change the default SPD Server concurrency setting, which is half of the available number of processors.
Usage:
execute(reset concurrency=4)
  by sasspds ; /* enables 4 concurrency levels */  

PLLJMAGIC

The PLLJMAGIC option specifies how SPD server performs parallel joins.
Usage:
execute(reset plljmagic=<100/200>)
  by sasspds ; 
PLLJMAGIC=100 forces a parallel range join when the range index is available.
PLLJMAGIC=200 forces a parallel merge join.

Parallel Join Example 1

The example is a basic SQL query that creates a pairwise join of two SPD Server tables, table1 and table2.
LIBNAME path1 sasspds .... IP=YES;

PROC SQL;
create table junk as
 select *
  from path1.table1 a,
  path1.table2 b
  where a.i = b.i;
 quit;

Parallel Join Example 2

This example is an SQL query that uses more than two SPD Server tables. The SQL Planner performs a parallel join on table1 and table2, and then use a non-parallel method to join the results of the first join and table3. The second join uses a non-parallel join method because the criteria for a parallel join were not met. A parallel join can be performed only on a pairwise join of two SPD Server tables and the query calls three SPD Server tables.
LIBNAME path1 sasspds .... IP=YES;

PROC SQL;
create table junk as
 select *
  from path1.table1 a,
  path1.table2 b,
  path1.table3 c
 where a.i = b.i and b.i = c.i;
quit;

Parallel Join Example 3

You can use multiple parallel joins in the same SQL query, as long as the SQL Planner can perform the query by using more than one pairwise join. In this parallel join example, a more complex query contains a union of two separate joins. Both joins are pairwise joins of two SPD Server tables. There is a pairwise join between table1 and table2. A pairwise join between table3 and table4 is performed concurrently, using the Parallel Join facility.
PROC SQL;
create table junk as
 select *
  from path1.table1 a,
  path1.table2 b
  where a.i = b.i
  union

 select *
  from path1.table3 c,
path1.table4 d
where c.i = d.i;
quit;