Parallel Join Facility

Overview of the Parallel Join Facility

The Parallel Join facility is a feature of the SPD Server SQL planner that decreases the required processing time when creating a pair-wise join between two SPD Server tables. The processing time savings is created when SPD Server performs the pair-wise join in parallel.

The SQL planner first searches for pairs when SPD Server source tables are to be joined. When a pair is found, the planner checks the join syntax for that pair to determine whether it meets all of the requirements for the Parallel Join facility. If the join syntax meets the requirements, the pair of tables will be joined by the Parallel Join facility.

Criteria to use the Parallel Join Facility

The criteria to use the SPD Server Parallel Join facility can be more complex than simply requiring a pair-wise join of two SPD Server tables. The Parallel Join facility can handle multiple character columns, numeric columns, or combinations of character and numeric columns that are joined between pairs of tables. Numeric columns do not need to be of the same width to act as a join key, but character columns must be of the same width in order to be a join key. Columns that are involved in a join cannot be derived from a SAS CASE statement, and cannot be created from character manipulation functions such as SUBSTR, YEAR, MONT, DAY, and TRIM.

Parallel Join Methods

Parallel Sort-Merge Method

The parallel sort-merge join method first performs a parallel sort to order the data, and then merges the sorted tables in parallel. During the merge, the facility concurrently joins multiple rows from one table with the corresponding rows in the other table. You can use the parallel sort-merge join method to execute any join that meets the requirements for parallel join.

The parallel sort-merge method is a good all-around parallel join strategy that requires no intervention from the user. The tables for the sort-merge method do not need to be in the same domain. The performance for the sort-merge method is not affected by the distribution of the data in the sort key columns.

The sort-merge method begins by completely sorting the smaller of the two tables being joined, while also performing concurrent partial parallel sorts on the larger table. If both tables are very large and sufficient resources are not available to do the complete sort on the smaller table, the performance of the parallel sort-merge method can degrade. The parallel sort-merge method is also limited when performing an outer, left, or right join in parallel. Only two concurrent threads can be used when performing parallel outer, left, or right joins. Inner joins are not limited in the parallel sort-merge method and can use more than two concurrent threads during parallel operations.

Parallel Range Join Method

The parallel range join method uses a join index to determine the ranges of rows between the tables that can be joined in parallel. The parallel range join method requires you to create a join index on the columns to be joined in the tables that you want to merge. The join index divides the two tables into a specified number of near-equal parts, or ranges, based on matching values between the join columns. The Parallel Join facility recognizes the ranges of rows that contain matching values between the join columns, and then uses concurrent join threads to join the rows in parallel. The SPD Server parallel sort then sorts the rows within a range.

The parallel range join method can be performed only on tables that are in the same domain. If either of the two tables are updated after the join index is created, the join index must be rebuilt before the parallel range join method can be used. The parallel range join method performs best when the columns of the tables that are being joined are sorted. If the columns are not relatively sorted, then the concurrent join threads can cause processor thrashing. Processor thrashing occurs when unsorted rows in a table require SPD Server to perform increasingly larger table row scans, which can consume processor resources at a high rate during concurrent join operations.

More detailed information about creating join indexes is available in Chapter 17, "SAS Scalable Performance Data (SPD) Server Index Utility Ixutil," of the SAS Scalable Performance Data (SPD) Server 4.5: Administrator's Guide.

How does the SPD Server Parallel Join facility choose between the sort-merge method and the range join method? If a join index is available for the tables to be joined, the Parallel Join facility chooses the parallel range join method. If a join index does not exist, or if the join index has not been rebuilt since a table was updated, the Parallel Join facility defaults to using the parallel sort-merge method.

Parallel Joins with Group-By

A powerful feature of the SPD Server Parallel Join facility is its integration with the SPD Server Parallel Group-By facility. If the result of the parallel join contains a group-by statement, the partial results of the parallel join threads are passed to the Parallel Group-By facility, which performs the group-by operation in parallel. In the following example, SPD Server performs both a parallel join and parallel group-by operation.

LIBNAME path1 sasspds .... IP=YES;

PROC SQL;
create table junk as
 select a.c, b.d, sum(b.e)
 from path1.table1 a,
  path1.table2 b
 where a.i = b.i
 group by a.d, b.d;
quit;

When you use the SPD Server Parallel Join facility, you are not restricted to using the parallel group-by method only on single tables.

Parallel Join SQL Options

PLLJOIN/NOPLLJOIN

The PLLJOIN/NOPLLJOIN option enables and disables the SPD Server Parallel Join facility.

Usage:

execute(reset noplljoin)
  by sasspds ; /* disables Parallel Join */

CONCURRENCY

The CONCURRENCY=<n> option sets the concurrency level that is used by the SPD Server Parallel Join facility, where the integer n specifies the number of levels. In most cases, changing the default SPD Server concurrency setting (half of the available number of processors) is not recommended.

Usage:

execute(reset concurrency=4)
  by sasspds ; /* enables 4 concurrency levels */

PLLJMAGIC

The PLLJMAGIC option specifies how SPD server performs parallel joins.

Usage:

execute(reset plljmagic=<100/200>)
  by sasspds ;

PLLJMAGIC=100 forces a parallel range join when the range index is available.

PLLJMAGIC=200 forces a parallel merge join.

Parallel Join Example 1

The first parallel join example is a basic SQL query that creates a pair-wise join of two SPD Server tables, table1 and table2.

LIBNAME path1 sasspds .... IP=YES;

PROC SQL;
create table junk as
 select *
  from path1.table1 a,
  path1.table2 b
  where a.i = b.i;
 quit;

Parallel Join Example 2

The next parallel join example is an SQL query that uses more than two SPD Server tables. In this example, the SQL planner performs a parallel join on table1 and table2, and then use a non-parallel method to join the results of the first join and table3. A non-parallel join method is used for the second join, because the criteria for a parallel join was not met. A parallel join can be performed only on a pair-wise join of two SPD Server tables, and the query calls three SPD Server tables.

LIBNAME path1 sasspds .... IP=YES;

PROC SQL;
create table junk as
 select *
  from path1.table1 a,
  path1.table2 b,
  path1.table3 c
 where a.i = b.i and b.i = c.i;
quit;

Parallel Join Example 3

Multiple parallel joins can be used in the same SQL query, as long as the SQL planner can perform the query using more than one pairwise join. In the next parallel join example, a more complex query contains a union of two separate joins. Both joins are pair-wise joins of two SPD Server tables. There is a pair-wise join between table1 and table2, and then a pair-wise join between table3 and table4 is performed concurrently, using the Parallel Join facility.

PROC SQL;
create table junk as
 select *
  from path1.table1 a,
  path1.table2 b
  where a.i = b.i
  union

 select *
  from path1.table3 c,
path1.table4 d
where c.i = d.i;
quit;