Influencing the Join Algorithm

Problem

You want to influence the SAS SQL optimizer to choose the join algorithm that yields the best possible performance for the SQL processing that is included in a SAS Data Integration Studio job. SAS SQL implements several well-known join algorithms: sort-merge, index, and hash.

Solution

Common techniques aid join performance, irrespective of the algorithm chosen. These techniques use options that are found on the SQL Properties pane and the properties panes for the tables found in SAS queries. However, selecting a join algorithm is important enough to merit a dedicated topic. You can use the Debug property on the SQL Join Properties pane to run the _method option, which adds a trace that indicates which algorithm is used when in the Log tab.
You can use the following join types:

Tasks

Sort-Merge Joins

Conditions often cause the SAS SQL optimizer to choose the sort-merge algorithm, and techniques that improve sort performance also improve sort-merge join performance. However, understanding and using index and hash joins can provide performance gains. Sort-merge is the algorithm that is selected most often by the SQL optimizer. When index nested loop and hash join are eliminated as choices, a sort-merge join or simple nested loop join is used. A sort-merge sorts one table, stores the sorted intermediate table, sorts the second table, and finally merges the two to form the join result. Use the Suggest Sort Merge Join property on the SQL Properties pane to encourage a sort-merge. This property adds MAGIC=102 to the PROC SQL invocation, as follows: proc sql _method magic=102;.

Index Joins

An index join looks up each row of the smaller table by querying an index of the large table. When chosen by the optimizer, an index join usually outperforms a sort-merge join on the same data. To get the best join performance, you should ensure that both tables have indexes created on any columns that you want to participate in the join relationship. The SAS SQL optimizer considers an index join when:
  • The join is an equijoin in which tables are related by equivalence conditions on key columns.
  • Joins with multiple conditions are connected by the AND operator.
  • The larger table has an index that includes all the join keys.
Encourage an index nested loop with IDXWHERE=YES as a data set option, as follows: proc sql _method; select ... from smalltable, largetable(idxwhere=yes). You can also turn on the Suggest Index Join property on the properties panes for the tables in the query.

Hash Joins

The optimizer considers a hash join when an index join is eliminated as a possibility. With a hash join, the smaller table is reconfigured in memory as a hash table. SQL sequentially scans the larger table and performs row-by-row hash lookup against the small table to form the result set. A memory-sizing formula, which is not presented here, determines whether a hash join is chosen. The formula is based on the PROC SQL option BUFFERSIZE, whose default value is 64 KB. On a memory-rich system, consider increasing BUFFERSIZE to increase the likelihood that a hash join is chosen. You can also encourage a hash join by increasing the default 64 KB PROC SQL buffer size option. Set the Buffer Size property on the SQL Properties pane to 1048576.