WHERE Clause Planner

The WHERE clause Planner implemented in SPD Server avoids computation-intensive operations and uses simple computations where possible. WHERE clauses in large database operations can be very resource-intensive operations. In SPD Server 3.x and earlier releases, query authors often needed to manually tune queries for performance. The tuning was accomplished using macro variables and index settings. The WHERE clause planner integrated into SPD Server does the tuning work for the user by automatically costing the different approaches to index evaluation.

WHERE-Costing Using Cardinality Ratio and Distribution Values

Two key factors are used to evaluate, or cost WHERE clause indices. The factors are cardinality ratio and distribution.
The cardinality ratio refers to the proportion expressed by the number distinct values in the index divided by the number of rows in a table. When many observations in a table hold the same value for a given variable, the variable value is said to have a low cardinality ratio. An example of a table with a low cardinality ratio might be a table of unleaded gasoline prices from service stations in the same area of a large city. Tables that have a low cardinality ratio feature many observations, but only a few unique observation values.
Conversely, when a table has only one or few observations that contain the same variable value, then that table can be described as having a high cardinality ratio. An example of a table with a high cardinality ratio might be an office phone directory, where the variable for phone extension is always unique. Tables that have a high cardinality ratio tend to contain many observations with very few repeating, or non-unique values.
The cardinality ratio for an index is in the range 0–1. Indices with a high cardinality ratio value of 1.0 are completely unique with no repeated values. Indices with a low cardinality ratio generate a score that approaches zero as the number of unique variable values diminish. The closer to zero, the lower the cardinality ratio of the index.
Distribution refers to the sequential proximity between observations for values of a variable that are repeated throughout the variable's data set distribution. When a certain value for a variable exists in many observations that are scattered uniformly throughout the table, that value is said to have a wide distribution. If a variable value exists in many contiguous or nearly contiguous rows, the distribution is clustered.

WHERE Clause EVAL Strategies

SPD Server indexing keeps track of the cardinality ratio and distribution of variable values in a table and uses them to calculate the cost of a WHERE clause. The WHERE clause planner uses four evaluation strategies to determine the number of rows that will be required to execute a given query. The four evaluation strategies are EVAL 1, EVAL 2, EVAL 3, and EVAL 4. True rows are rows that contain the variable values specified in a WHERE clause. False rows do not contain the variable value specified in the clause.
EVAL 1, EVAL 3, EVAL 4, and EVAL 5 evaluate true rows in the table using indices. EVAL 2 evaluates true rows of a table without using indices.
  • EVAL 1 evaluates true rows using an index to locate the true rows in each segment of the table. The index evaluation process generates a list of row IDs per segment. EVAL 1 accepts WHERE clause operators for equivalency expressions such as EQ, =, LE, <=, LT, <, GE, >=, GT, >, IN, and BETWEEN. EVAL 1 uses threaded parallel processing across the index segments to permit concurrent evaluation of multiple indices. EVAL 1 combines multiple segment bitmaps from queries that use multiple indices to generate the list of row IDs per segment.
  • EVAL 2 takes true rows as determined by EVAL 1, EVAL 3, or EVAL 4, and then uses brute force to eliminate any rows shown to be false, leaving a table that contains only true rows. EVAL 2 processes all rows of a table when no index evaluation is possible. For example, no index evaluation is possible when an index is not present or when some predecessor function performs an operation that invalidates the index.
  • EVAL 3 is a single index sequential process. Use EVAL 3 when the number of rows returned by an index is unique or nearly unique (when cardinality ratio is high). EVAL 3 returns a list of true rows for the entire table. EVAL 3 only supports the equality operators EQ and =.
  • EVAL 4 is similar to EVAL 3 but supports a larger set of inequality and inclusion operators, such as IN, GT, GE, LT, LE, and BETWEEN.
  • EVAL 5 can operate when the SPD Server Index Scan facility is used. The EVAL 5 strategy uses index metadata and aggregate SQL functions to evaluate true rows. The EVAL 5 strategy does not require a table scan.
    For example, when x is indexed, and SPD Server uses EVAL 5 to evaluate the SQL expression
    count(*) where x=5 , 
    the index metadata is scanned for the condition, x = 5 instead of performing table scans. The EVAL 5 strategy supports the min(), max(), count(), count(distint), nmiss(), and range() functions. The EVAL 5 strategy cannot be used on SQL expressions, which use functions other than those listed above.
The WHERE clause planner in SPD Server 3.x relied heavily on EVAL 1 and EVAL 2 threaded strategies to evaluate most clauses. Sometimes the SPD Server 3.x EVAL 1 and EVAL 2 strategies would over-thread and over-manipulate indices during the evaluations during WHERE clause evaluation. This resulted in reduced performance or excessive resource consumption. With SPD Server 5.1's WHERE clause costing in place, EVAL 3 and EVAL 4 strategies are more suitable evaluation engines which conserve resources and boost processor performance.

Assigning EVAL Strategies

Overview of Assigning EVAL Strategies

The SPD Server WHERE clause planner uses the following logic when selecting an EVAL strategy to evaluate expressions:
When the planner encounters a WHERE clause, it builds a tree that represents all of the possible predicate expressions. The objective of the WHERE clause planner is to divide the set of predicate expressions into two trees. One tree collects predicate expressions that lack usable indices and are constrained to EVAL 2 evaluation. The remaining predicate expressions are put in the other tree. Each of the predicate expressions in the second tree is scanned and assigned an evaluation strategy of EVAL 1, EVAL 3, or EVAL 4, depending on the WHERE clause costing values and the syntax used in the predicate expression .
The second tree, which does not use the EVAL 2 method, is scanned for predicate expressions that return values with a low cardinality ratio. When low cardinality ratio predicate expressions are identified, they are ranked. The predicate expression with the lowest cardinality ratio value is set aside for an index-based evaluation. All of the other remaining predicate expressions are evaluated using the EVAL 2 tree strategy. The predicate expression with the highest cardinality ratio is evaluated using either the EVAL 3 or the EVAL 4 strategy. The syntax used in the predicate expression determines which of the two strategies to use. Frequently, the single index EVAL 3 or EVAL 4 is chosen because single index evaluations require smaller processing loads and yield reliable results. With a low processor overhead and a high data yield, there is no reason to include other indices when a single index is sufficient.
When the WHERE clause planner determines that no predicate expressions meet the low cardinality ratio criteria, it chooses the EVAL 1 strategy. Before the EVAL 1 operation is performed, the costing algorithm is run on the remaining predicates to prune any predicate expressions that represent large processor loads and large data yields. Predicate expressions that will require large processor loads and produce large data yields are moved to the EVAL 2 tree.

Index Scan Facility

When SPD Server invokes the Index Scan facility, and the SQL aggregate uses the specified supported functions for EVAL 5, the EVAL 5 strategy uses a fast index metadata scan to select SQL statements that meet the aggregate function criterion.

High Yield Predicate Expressions

A large, or high data yield expression has a high percentage of rows containing true segments. The default threshold for a high yield expression is one where less than 25% of the rows evaluated are returned by the predicate. At this point, processor costs related to index use begin increasing without proportional returns on the evaluation results.

High Processing Load Predicate Expressions

Predicate expressions that require high processing loads are predicates that usually require large amounts of index manipulation before they can complete. When the amount of index work that is required exceeds the work that is required to use an EVAL 2 strategy, the predicate expression will be best evaluated by the EVAL 2 tree. Open-ended predicate expressions that contain many syntax inequality operators such as GT and LT or many variations in syntax are good high work candidates for EVAL 2. High work predicate expressions are detected by comparing the number of unique values in the predicate expression to the number of unique values contained in the index.

High Yield and High Processing Load Predicate Expressions

When all predicate expressions in EVAL 1 are high yield or high processor load, SPD Server uses segmented costing. In segmented costing, true segments are passed to EVAL 2 for processing. EVAL 2 only processes table segments that can provide true rows for the WHERE clause.

Turning WHERE Clause Costing Off

You can use the SPD Server spdsserv.parm parameter file to configure the default WHERECOSTING parameter setting to ON. If you want to turn off WHERE clause costing within the scope of a job, you can use macros or a DATA step to turn WHERE clause costing off and on:
  • The SPDSWCST=NO macro setting turns off WHERE clause costing.
  • The SPDSWSEQ=YES macro overrides WHERE clause costing and enables you to force a global EVAL3 or EVAL4 strategy.
  • The WHERECOSTING parameter can be removed or set to NOWHERECOSTING in the spdsserv.parm file if you want to turn off costing for the entire server.
If you turn off WHERE clause costing in the spdsserv.parm parameter file, or if you use the macro setting SPDSWCST=NO, the WHERE clause planner reverts to the rules-based WHERE clause planning of earlier versions of SPD Server.

WHINIT: Indexed and Non-Indexed Predicates

Overview of WHINIT

If SPD Server is not configured to use dynamic WHERE-costing, the WHERE clause planner reverts to the rule-based heuristics of WHINIT. WHINIT uses rules to select indexes for the predicates, and then select the most appropriate EVAL strategy for the query.
WHINIT splits the WHERE clause, represented as a tree, into non-indexed and indexed parts. Non-indexed predicates include
  • non-indexed columns
  • functions
  • columns that have indexes that WHINIT cannot use
If the WHERE clause planner places indexed predicates in the non-indexed tree, it is usually because the predicates involve an OR expression. An example of a predicate with an OR expression is, where x = 1 or y = 2. Even if column x is indexed, WHINIT cannot use the index because the OR is disjunctive. As a result of the disjunctive OR, the planner cannot use the index, and places both the predicates x = 1 and y = 2 into the non-indexed part of the WHERE tree.

Sample WHINIT Output

SAS users can use an SPD Server macro variable to view WHERE clause planner output:
%let SPDSWDEB=YES;
The following is what the WHINIT plan might give for the following scenario:
  • a WHERE clause of where a = 1 and b in (1 2 3) and d = 3 and (d + 3 = c)
  • an SPD index IDX_ABC on columns (A B C)
  • an SPD index D on column (D)
Note: The line numbers are for reference; they are NOT part of the actual output.
 1:whinit: WHERE ((A=1) and B in (1, 2, 3) and (D=3) and (C=(D+3)))
 2:whinit: wh-tree presented
 3:
       /-NAME = [A]
 4:           /-CEQ----|
 5:          |
     \-LITN = [1]
 6: --LAND---|
 7:          |
     /-NAME = [B]
 8:          |--IN-----|
 9:          |
    |          /-LITN = [1]
10:          |
    \-SET----|
11:          |
             |--LITN = [2]
12:          |
              \-LITN = [3]
13:          |
    /-NAME = [D]
14:          |--CEQ----|
15:          |
    \-LITN = [3]
16:          |
    /-NAME = [C]
17:           \-CEQ----|
18:
     |          /-NAME = [D]
19:
      \-AADD---|
20:
                \-LITN = [3]
21:whinit: wh-tree after split
22:           /-NAME = [C]
23: --CEQ----|
24:          |
     /-NAME = [D]
25:           \-AADD---|
26:
      \-LITN = [3]
27:whinit: SBM-INDEX D uses 50% of segs (WITHIN maxsegratio 75%)
28:whinit: INDEX tree after split
29:
      /-NAME = [A] <1>SBM-INDEX IDX_ABC (A,B)
30:           /-CEQ----|
31:          |
     \-LITN = [1]
32: --LAND---|
33:          |
     /-NAME = [B]
34:          |--IN-----|
35:          |
    |          /-LITN = [1]
36:          |
     \-SET----|
37:          |
              |--LITN = [2]
38:          |
               \-LITN = [3]
39:          |
     /-NAME = [D] <2>SBM-INDEX D (D)
40:           \-CEQ----|
41:
       \-LITN = [3]
42:whinit returns: ALL EVAL1(w/SEGLIST) EVAL2
Line 1 shows what the WHINIT Planner received. Do not be surprised -- what the Planner receives can differ from your entries. Sometimes SAS optimizes or transforms a WHERE clause before passing it to SPD Server. For example, it can eliminate entities such as NOT operators, the union of set lists, and so on.
Lines 2 to 20 show the presented WHERE clause in a tree format. The tree format is a user-readable form of the actual WHERE clause that is processed by the SPD Server engine.
Lines 21 to 26 show the non-indexed WHERE tree, the result of splitting off the indexed part. The non-indexed WHERE tree can be empty or it can look the same as lines 2 to 20 if no indexes are selected. Consider that it is the non-indexed part of the WHERE clause that WHINIT uses to filter records obtained by the indexed strategies (EVAL1, 3 or 4).
Lines 27 to 41 shows that the percentage of segments containing values selected from column D is with the maximum allowed to proceed with pre-segment logic. Therefore, only those segments that contain values that satisfy the WHERE clause for column D will be included in further query processing for that column. Composite index IDX_ABC and simple index D are used to resolve the indexed WHERE clause predicates.
Line 42, the last line in our output, shows which strategies are used. The first keyword ALL indicates that SPD Server can identify correctly ALL resulting records, without help from the SAS System. First, SPD Server will call EVAL1, an indexed method, to quickly access a list of records that satisfy where a = 1 and b in (1 2 3) and d = 3, then it will use EVAL2 to determine whether c = d + 3 is true on these records.
When output from EVAL1 displays the suffix w/ seglist, as it does in the above output, it means that SPD indexes were detected, and that the indexes were used to filter only the segments that satisfy the indexed predicates. When EVAL1 has no suffix, it means that ALL segments will be evaluated.
SPD Server stores the minimum and maximum values for a table index in a global structure. WHINIT can use the numeric range to 'prune' predicates when the table index values are out of the min / max range. WHINIT output keywords can indicate pruning activity. For example, if WHINIT had determined that the values for D (in our WHERE clause) are between 5 and 13, then as a consequence, the predicate where d = 3 could never be true. In this case, WHINIT would have pruned this predicate because it is logically impossible, or FALSE. Pruning can also affect higher nodes. If the d = 3 predicate were deemed FALSE, then the AND sub tree would also be FALSE and would also have been pruned.

WHINIT Output Return Keywords

In the last line of the output, ALL is one of the following keywords that the Planner can display:
  • ALL - SPD Server can evaluate ALL of the WHERE clause when determining which records satisfy the clause.
  • SOME - SPD Server can handle SOME or part of the WHERE clause; it will then need SAS to help identify resulting records.
  • NONE - SPD Server cannot evaluate this WHERE clause; SAS will perform all evaluations.
  • TRUE - SPD Server has determined that the entire WHERE clause is TRUE, and that all the records satisfy the given WHERE clause.
  • FALSE - SPD Server determined that the WHERE clause is FALSE, that is, no records can satisfy the WHERE clause.
  • RC=number - An internal error has occurred; the error number is displayed.
  • EVALx - the EVAL strategies the planner will use; x can be 1, 2, 3, or 4.

Composite Index Permutations

A composite index can involve one or more in set equality predicates, such as an index on columns (a b c). When WHINIT is presented with a WHERE clause that has such a composite index, such as where a = 1 and b in (1 2 3) and c in (4 5), it will generate all permutations of this compound key, probing the index for each value. In our example, six values are generated:
(a b c) = (1 1 4) (1 1 5) (1 2 4) (1 2 5) (1 3 4) (1 3 5)
The permutations start at the back end of the key to take advantage of locality: to locate keys with close values that access the same disk page. This means less input/output operations on the index.