Understanding the WHERE Clause Planner

The WHERE clause planner implemented in SPD Server avoids computation-intensive operations and uses simple computations where possible. WHERE clauses in large database operations can be very resource-intensive. Before SAS SPD Server 4.0, query authors often needed to manually tune queries for performance. The tuning was accomplished using macro variables and index settings. The WHERE clause planner integrated into the server does the tuning work for the user by automatically costing the different approaches to index evaluation.

WHERE-Costing Using Cardinality Ratio and Distribution Values

Two key factors are used to evaluate, or cost, WHERE clause indexes: cardinality ratio and distribution.
The cardinality ratio refers to the proportion expressed by the number of distinct values in the index divided by the number of rows in a table. When many rows in a table hold the same value for a given variable, the variable value is said to have a low cardinality ratio. An example of a table with a low cardinality ratio might be a table of unleaded gasoline prices from service stations in the same area of a large city. Tables that have a low cardinality ratio feature many rows, but only a few unique row values.
Conversely, when a table has only one or a few rows that contain the same variable value, then that table can be described as having a high cardinality ratio. An example of a table with a high cardinality ratio might be an office phone directory, where the variable for phone extension is always unique. Tables that have a high cardinality ratio tend to contain many rows with very few repeating, or non-unique values.
The cardinality ratio for an index is in the range 0–1. Indexes with a high cardinality ratio value of 1.0 are completely unique with no repeated values. Indexes with a low cardinality ratio generate a score that approaches zero as the number of unique variable values diminish. The closer to zero, the lower the cardinality ratio of the index.
Distribution refers to the sequential proximity between rows for values of a variable that are repeated throughout the variable's table distribution. When a certain value for a variable exists in many rows that are scattered uniformly throughout the table, that value is said to have a wide distribution. If a variable value exists in many contiguous or nearly contiguous rows, the distribution is clustered.

WHERE Clause EVAL Strategies

Server indexing keeps track of the cardinality ratio and distribution of variable values in a table and uses them to calculate the cost of a WHERE clause. The WHERE clause planner uses four evaluation strategies to determine the number of rows that will be required to execute a given query. The four evaluation strategies are EVAL 1, EVAL 2, EVAL 3, and EVAL 4. True rows contain the variable values specified in a WHERE clause. False rows do not contain the variable value specified in the clause.
EVAL 1, EVAL 3, EVAL 4, and EVAL 5 evaluate true rows in the table using indexes. EVAL 2 and EVAL 6 evaluate true rows of a table without using indexes.

EVAL 1

evaluates true rows using an index to locate the true rows in each segment of the table. The index evaluation process generates a list of row IDs per segment. EVAL 1 accepts WHERE clause operators for equivalency expressions such as EQ, =, LE, <=, LT, <, GE, >=, GT, >, IN, and BETWEEN. EVAL 1 uses threaded parallel processing across the index segments to permit concurrent evaluation of multiple indexes. EVAL 1 combines multiple segment bitmaps from queries that use multiple indexes to generate the list of row IDs per segment.

EVAL 2

takes true rows as determined by EVAL 1, EVAL 3, or EVAL 4, and then eliminates any rows shown to be false, leaving a table that contains only true rows. EVAL 2 processes all rows of a table when no index evaluation is possible. For example, no index evaluation is possible when an index is not present or when some predecessor function performs an operation that invalidates the index.

EVAL 3

a single index sequential process. Use EVAL 3 when the number of rows returned by an index is unique or nearly unique (when cardinality ratio is high). EVAL 3 returns a list of true rows for the entire table. EVAL 3 supports only the equality operators EQ and =.

EVAL 4

similar to EVAL 3 but supports a larger set of inequality and inclusion operators, such as IN, GT, GE, LT, LE, and BETWEEN.

EVAL 5

can operate when the server index scan facility is used. The EVAL 5 strategy uses index metadata and aggregate SQL functions to evaluate true rows. The EVAL 5 strategy does not require a table scan.

For example, suppose that when X is indexed, the server uses EVAL 5 to evaluate the following SQL expression:
count(*) where x=5
The index metadata is scanned for the condition x = 5 instead of performing table scans. The EVAL 5 strategy supports the MIN(), MAX(), COUNT(), COUNT(distinct), NMISS(), and RANGE() functions. The EVAL 5 strategy cannot be used on SQL expressions, which use functions other than those listed above.

EVAL 6

emulates the behavior of EVAL 2. With EVAL 6, the query is a candidate for Hadoop WHERE processing. If the Hadoop WHERE processing fails, EVAL 6 reverts to the EVAL 2 operation. EVAL 6 takes true rows as determined by EVAL 1, EVAL 3, or EVAL 4, and then eliminates any rows shown to be false, leaving a table that contains only true rows. EVAL 2 processes all rows of a table when no index evaluation is possible. For example, no index evaluation is possible when an index is not present or when some predecessor function performs an operation that invalidates the index.

The WHERE clause planner in SAS SPD Server 3.x relied heavily on EVAL 1 and EVAL 2 threaded strategies to evaluate most clauses. Sometimes the SAS SPD Server 3.x EVAL 1 and EVAL 2 strategies would over-thread and over-manipulate indexes during the WHERE clause evaluation. This resulted in reduced performance or excessive resource consumption. Beginning with SAS SPD Server 5.2, which introduced WHERE clause costing, EVAL 3 and EVAL 4 strategies are more suitable evaluation engines, which conserve resources and boost processor performance.

Assigning EVAL Strategies

Overview of Assigning EVAL Strategies

The server WHERE clause planner uses the following logic when selecting an EVAL strategy to evaluate expressions:
When the planner encounters a WHERE clause, it builds a tree that represents all of the possible predicate expressions. The objective of the WHERE clause planner is to divide the set of predicate expressions into two trees. One tree collects predicate expressions that lack usable indexes and are constrained to EVAL 2 evaluation. The remaining predicate expressions are put in the other tree. Each of the predicate expressions in the second tree is scanned and assigned an evaluation strategy of EVAL 1, EVAL 3, or EVAL 4, depending on the WHERE clause costing values and the syntax used in the predicate expression. With the server WHERE clause costing in place, EVAL 3 and EVAL 4 strategies are more suitable evaluation engines that conserve resources and boost processor performance.
The second tree, which does not use the EVAL 2 method, is scanned for predicate expressions that return values with a low cardinality ratio. When low cardinality ratio predicate expressions are identified, they are ranked. The predicate expression with the lowest cardinality ratio value is set aside for an index-based evaluation. All of the other remaining predicate expressions are evaluated using the EVAL 2 tree strategy. The predicate expression with the highest cardinality ratio is evaluated using either the EVAL 3 or the EVAL 4 strategy. The syntax used in the predicate expression determines which of the two strategies to use. Frequently, the single index EVAL 3 or EVAL 4 is chosen because single index evaluations require smaller processing loads and yield reliable results. With a low processor overhead and a high data yield, there is no reason to include other indexes when a single index is sufficient.
When the WHERE clause planner determines that no predicate expressions meet the low cardinality ratio criteria, it chooses the EVAL 1 strategy. Before the EVAL 1 operation is performed, the costing algorithm is run on the remaining predicates to prune any predicate expressions that represent large processor loads and large data yields. Predicate expressions that will require large processor loads and produce large data yields are moved to the EVAL 2 tree.

Index Scan Facility

When the server invokes the index scan facility, and the SQL aggregate uses the specified supported functions for EVAL 5, the EVAL 5 strategy uses a fast index metadata scan to select SQL statements that meet the aggregate function criterion.

High Yield Predicate Expressions

A large, or high data yield expression has a high percentage of rows containing true segments. The default threshold for a high yield expression is one where less than 25% of the rows evaluated are returned by the predicate. At this point, processor costs related to index use begin increasing without proportional returns on the evaluation results.

High Processing Load Predicate Expressions

Predicate expressions that require high processing loads are predicates that usually require large amounts of index manipulation before they can complete. When the required amount of index work exceeds the work that is required to use an EVAL 2 strategy, the predicate expression will be best evaluated by the EVAL 2 tree. Open-ended predicate expressions that contain many syntax inequality operators (such as GT and LT) or many variations in syntax are good high-work candidates for EVAL 2. High-work predicate expressions are detected by comparing the number of unique values in the predicate expression to the number of unique values contained in the index.

High-Yield and High-Processing Load Predicate Expressions

When all predicate expressions in EVAL 1 are high-yield or high-processor load, the server uses segmented costing. In segmented costing, true segments are passed to EVAL 2 for processing. EVAL 2 processes only table segments that can provide true rows for the WHERE clause.

Turning WHERE Clause Costing Off

You can use the server spdsserv.parm parameter file to configure the default WHERECOSTING parameter setting to ON. To turn off WHERE clause costing within the scope of a job, you can use macros or a DATA step:
  • The SPDSWCST=NO macro setting turns off WHERE clause costing. If you turn off WHERE costing in the spdsserv.parm parameter file, or if you use the macro setting SPDSWCST=NO, the WHERE clause planner reverts to a non-costing, rules-based algorithm.
  • The SPDSWSEQ=YES macro overrides WHERE clause costing, and enables you to force a global EVAL3 or EVAL4 strategy.
  • The WHERECOSTING parameter can be removed or set to NOWHERECOSTING in the spdsserv.parm parameter file to turn off costing for the entire server.
If you turn off WHERE clause costing in the spdsserv.parm parameter file, or if you use the macro setting SPDSWCST=NO, the WHERE clause planner reverts to the rules-based WHERE clause planning of earlier versions of SPD Server.

WHINIT: Indexed and Non-Indexed Predicates

Overview of WHINIT

If the server is not configured to use dynamic WHERE-costing, the WHERE clause planner reverts to the rule-based heuristics of WHINIT. WHINIT uses rules to select indexes for the predicates, and then selects the most appropriate EVAL strategy for the query.
WHINIT splits the WHERE clause, represented as a tree, into non-indexed and indexed parts. Non-indexed predicates include the following:
  • non-indexed columns
  • functions
  • columns that have indexes that WHINIT cannot use
If the WHERE clause planner places indexed predicates in the non-indexed tree, it is usually because the predicates involve an OR expression. Here is an example of a predicate with an OR expression: where x = 1 or y = 2. Even if column x is indexed, WHINIT cannot use the index because the OR is disjunctive. As a result of the disjunctive OR, the planner cannot use the index, and places both the predicates x = 1 and y = 2 into the non-indexed part of the WHERE tree.

Sample WHINIT Output

SAS users can use a server macro variable to view WHERE clause planner output:
%let SPDSWDEB=YES;
The following is what the WHINIT plan might give for the following scenario:
  • a WHERE clause of where a = 1 and b in (1 2 3) and d = 3 and (d + 3 = c)
  • an SPD index IDX_ABC on columns (A B C)
  • an SPD index D on column (D)
Note: The line numbers are for reference. They are NOT part of the actual output.
 1:whinit: WHERE ((A=1) and B in (1, 2, 3) and (D=3) and (C=(D+3)))
 2:whinit: wh-tree presented
 3:
       /-NAME = [A]
 4:           /-CEQ----|
 5:          |
     \-LITN = [1]
 6: --LAND---|
 7:          |
     /-NAME = [B]
 8:          |--IN-----|
 9:          |
    |          /-LITN = [1]
10:          |
    \-SET----|
11:          |
             |--LITN = [2]
12:          |
              \-LITN = [3]
13:          |
    /-NAME = [D]
14:          |--CEQ----|
15:          |
    \-LITN = [3]
16:          |
    /-NAME = [C]
17:           \-CEQ----|
18:
     |          /-NAME = [D]
19:
      \-AADD---|
20:
                \-LITN = [3]
21:whinit: wh-tree after split
22:           /-NAME = [C]
23: --CEQ----|
24:          |
     /-NAME = [D]
25:           \-AADD---|
26:
      \-LITN = [3]
27:whinit: SBM-INDEX D uses 50% of segs (WITHIN maxsegratio 75%)
28:whinit: INDEX tree after split
29:
      /-NAME = [A] <1>SBM-INDEX IDX_ABC (A,B)
30:           /-CEQ----|
31:          |
     \-LITN = [1]
32: --LAND---|
33:          |
     /-NAME = [B]
34:          |--IN-----|
35:          |
    |          /-LITN = [1]
36:          |
     \-SET----|
37:          |
              |--LITN = [2]
38:          |
               \-LITN = [3]
39:          |
     /-NAME = [D] <2>SBM-INDEX D (D)
40:           \-CEQ----|
41:
       \-LITN = [3]
42:whinit returns: ALL EVAL1(w/SEGLIST) EVAL2
Line 1 shows what the WHINIT Planner received. Do not be surprised—what the planner receives can differ from your entries. Sometimes SAS optimizes or transforms a WHERE clause before passing it to the server. For example, it can eliminate entities such as NOT operators, the union of set lists, and so on.
Lines 2 to 20 show the presented WHERE clause in a tree format. The tree format is a user-readable form of the actual WHERE clause that is processed by the server engine.
Lines 21 to 26 show the non-indexed WHERE tree, the result of splitting off the indexed part. The non-indexed WHERE tree can be empty, or it can look the same as lines 2 to 20 if no indexes are selected. Consider that it is the non-indexed part of the WHERE clause that WHINIT uses to filter rows obtained by the indexed strategies (EVAL1, 3 or 4).
Lines 27 to 41 shows that the percentage of segments containing values selected from column D is with the maximum allowed to proceed with pre-segment logic. Therefore, only those segments that contain values that satisfy the WHERE clause for column D will be included in further query processing for that column. Composite index IDX_ABC and simple index D are used to resolve the indexed WHERE clause predicates.
Line 42, the last line in our output, shows which strategies are used. The first keyword ALL indicates that the server can identify correctly ALL resulting rows, without help from the SAS System. First, the server will call EVAL1, an indexed method, to quickly access a list of rows that satisfy where a = 1 and b in (1 2 3) and d = 3, then it will use EVAL2 to determine whether c = d + 3 is true on these rows.
When output from EVAL1 displays the suffix seglist, as it does in the above output, it means that SPD indexes were detected, and that the indexes were used to filter only the segments that satisfy the indexed predicates. When EVAL1 has no suffix, it means that ALL segments will be evaluated.
The server stores the minimum and maximum values for a table index in a global structure. WHINIT can use the numeric range to “prune” predicates when the table index values are out of the mininum/maximum range. WHINIT output keywords can indicate pruning activity. For example, if WHINIT had determined that the values for D (in the WHERE clause) are between 5 and 13, then the predicate where d = 3 could never be true. In this case, WHINIT would have pruned this predicate because it is logically impossible, or FALSE. Pruning can also affect higher nodes. If the d = 3 predicate were deemed FALSE, then the AND subtree would also be FALSE and would also have been pruned.

WHINIT Output Return Keywords

In the last line of the output, ALL is one of the following keywords that the planner can display:
  • ALL - The server can evaluate ALL of the WHERE clause when determining which rows satisfy the clause.
  • SOME - The server can handle SOME or part of the WHERE clause. It will then need SAS to help identify resulting rows.
  • NONE - The server cannot evaluate this WHERE clause. SAS will perform all evaluations.
  • TRUE - The server has determined that the entire WHERE clause is TRUE, and that all the rows satisfy the given WHERE clause.
  • FALSE - The server has determined that the WHERE clause is FALSE. That is, no rows can satisfy the WHERE clause.
  • RC=number - An internal error has occurred. The error number is displayed.
  • EVALx - the EVAL strategies that the planner will use. x can be 1, 2, 3, or 4.

Composite Index Permutations

A composite index can involve one or more in sets of equality predicates, such as an index on columns (a b c). When WHINIT is presented with a WHERE clause that has such a composite index, such as where a = 1 and b in (1 2 3) and c in (4 5), it will generate all permutations of this compound key, probing the index for each value. In the example, six values are generated:
(a b c) = (1 1 4) (1 1 5) (1 2 4) (1 2 5) (1 3 4) (1 3 5)
The permutations start at the back end of the key to take advantage of locality: to locate keys with close values that access the same disk page. This means fewer input/output operations on the index.
Last updated: February 8, 2017