Optimizing Dynamic Cluster Tables

Combining WHERE Clauses with Dynamic Cluster BY Clause Optimization

Dynamic Cluster BY Clause Optimization Example

Dynamic Cluster BY Clause Optimization

When you use server dynamic clusters, you can create huge tables. If a SAS job needs to manipulate a huge table, you can sort the tables for more efficient processing. Traditional processing of huge tables can overuse or overwhelm available resources. Insufficient run-time or processor resources can prohibit you from running full table scans and manipulating table rows, which are required to sort huge tables for subsequent processing.

The server provides dynamic cluster BY clause optimization to reduce the need for a large amount of processor resources when evaluating BY clauses. Dynamic cluster BY clause optimization uses the server to join individually created server member tables so that the tables appear to be a single table. But the individual member tables are also kept intact. Dynamic cluster BY clause optimization uses the SORT attribute of the member tables to bypass most of the sorting that is required to perform the implicit BY clause ordering. The server uses the SORT attribute of each member table to merge the member tables in the dynamic cluster in order by each member table's order. No additional server workspace is required, and the ordered table rows are returned quickly because the member tables do not need to be sorted.

To use dynamic cluster BY clause optimization, you need to build the dynamic cluster table a specific way. All of the member tables in your dynamic cluster table need to be sorted by the same columns that you use in the BY clause. When you build your dynamic cluster table from member tables that are presorted by your BY clause columns, your dynamic cluster table can use the BY clause optimization.

You run a BY clause that matches the SORT attribute column order of the member tables of the dynamic cluster table. The server processes the BY clause without using sort workspace and does not experience first-row latency. The server uses the presorted member tables to perform an instantaneous interleave. Because dynamic cluster BY clause optimization uses the presorted member tables, you can perform operations on huge tables that would be impossible to handle otherwise.

For example, suppose your system has sufficient CPU, memory, and workspace resources to sort a 50 GB table in a reasonable amount of time. However, suppose this system accumulates 50 GB of new data every month. After 12 months, the table requires 600 GB of storage. The system cannot sort 600 GB of data to process queries that are based on the previous 12-month period. To use dynamic cluster BY clause optimization in this situation:

Create a dynamic cluster table from the twelve 50 GB member tables. You have a 600 GB dynamic cluster table.
Store data for each successive month in a server member table.
Sort each table and add it to the 600 GB dynamic cluster table.
Use dynamic cluster BY clause optimization to run SAS steps that use BY clauses on the 600 GB dynamic cluster table.

For example, you can run a DATA step MERGE statement that uses the dynamic cluster table as the master source for the MERGE statement. The BY clause from the MERGE statement triggers the dynamic cluster BY clause optimization. The operation completes in the time that it takes to interleave the individual member tables. The process uses no server workspace and does not cause any implicit BY sort delays.

Dynamic cluster BY clause optimization is triggered when all member tables have an applicable SORT attribute ordering for the BY clause that is asserted. When the SORT attribute ordering is strong (validated), the server does not verify the order of BY columns that are returned from the member table. When the SORT attribute ordering is weak (such as from a SORTEDBY assertion in a data set option), the server verifies the order of BY columns that are returned from the member table. If the server detects an invalid BY column order, it terminates the BY clause and displays the following error message:

ERROR: Clustered BY member violates weaksort order during merge.

Combining WHERE Clauses with Dynamic Cluster BY Clause Optimization

You can use dynamic cluster BY clause optimization to combine BY clause optimization with certain WHERE clauses on dynamic cluster tables. The server must be able to determine whether the WHERE clause is trivially true or trivially false for each member table in the dynamic cluster table. To be trivially true, a WHERE clause must find the clause condition to be true for every row in the member table. To be trivially false, a WHERE clause must find the clause condition to be false for every row in the member table.

The server keeps metadata about indexed values that are in dynamic cluster table member tables. The server can determine whether the WHERE clause criteria are true or false, based on the dynamic cluster table's member table metadata. WHERE clause optimization is possible on a member-by-member basis for the entire dynamic cluster table. Suppose that member tables of a dynamic cluster table all have an index on the column QUARTER (1=JAN-MAR, 2=APR-JUN, 3=JUL-SEP, and 4=OCT-DEC). Suppose you need to run a DATA step MERGE statement that uses the expression WHERE QUARTER=2. Because the QUARTER column is indexed in all of the member tables, the server uses BY clause optimization to determine that the WHERE clause is trivially true. The server evaluates the expression only on the member tables for April, May, and June, and does not use any server workspace. The WHERE clause is determined to be trivially true or trivially false for each member table of the dynamic cluster table in advance. Then BY clause optimization performs BY processing only on the appropriate member tables.

Dynamic Cluster BY Clause Optimization Example

Consider a database of medical patient insurance claims that contains quarterly claims tables that are named ClaimsQ1 and ClaimsQ2. The following code does these tasks:

Sorts each quarterly claims table into columns that are named PatID (for patient ID) and ClaimID (for claim ID).
Combines the member tables into a dynamic cluster table that is named ClaimsAll.

DATA SPDS.ClaimsQ1;
...
run;

DATA SPDS.ClaimsQ2;
...
run;

PROC SORT DATA=SPDS.ClaimsQ1;
 BY PatID ClaimID;
run;

PROC SORT DATA=SPDS.ClaimsQ2;
 BY PatID ClaimID;
run;

PROC SPDO LIB=SPDS;
cluster create ClaimsAll;
quit;

The following DATA step MERGE statement is submitted to the ClaimsAll dynamic cluster table:

DATA SPDS.ToAdd SPDS.ToUpdate;
MERGE SPDS.NewOnes(IN=NEW1)
      SPDS.ClaimsAll(IN=OLD1);
BY PatID ClaimID;

SELECT;
WHEN(NEW1 and OLD1)
 DO;
 OUTPUT SPDS.ToUpdate;
 end;
WHEN(NEW1 and not OLD1)
 DO;
 OUTPUT SPDS.ToAdd;
 end;
run;

If ClaimsAll was not a dynamic cluster table, the DATA step MERGE statement would create an implicit sort from the BY clause on the respective server tables. However, ClaimsAll is a dynamic cluster table with member tables that are presorted. As a result, dynamic cluster BY clause optimization uses BY clause processing to merge the sorted member tables instantaneously without using any server workspace or creating any delays. The example merges the transaction data named NewOnes into new rows that are appended to the data for the next quarter.

The member tables ClaimsQ1 and ClaimsQ2 are indexed on the column Claim_Date:

DATA SPDS.RepClaims;
 SET SPDS.ClaimsAll;
 WHERE Claim_Date BETWEEN '01JAN2007' and '31MAR2007';
 BY PatID ClaimID;
run;

The WHERE clause determines whether each member table is true or false for each quarter. The WHERE clause is trivially true for the table ClaimsQ1 because the WHERE clause is true for all dates in the first quarter. The WHERE clause is trivially false for the table ClaimsQ2 because the WHERE clause is false for all dates in the second quarter. BY clause optimization determines that the member table ClaimsQ1 will be processed because the WHERE clause is true for all of the rows of the ClaimsQ1 table. BY clause optimization skips the table ClaimsQ2 because the WHERE clause is false for all of the rows of the ClaimsQ2 table.

Suppose that the Claim_Date range is changed in the WHERE clause:

DATA SPDS.RepClaims;
 SET SPDS.ClaimsAll;
 WHERE Claim_Date BETWEEN '05JAN2007' and '28JUN2007';
 BY PatID ClaimID;
run;

When the new WHERE clause is evaluated, it is not trivially true for member tables ClaimsQ1 or ClaimsQ2. The WHERE clause is not trivially false for member tables ClaimsQ1 or ClaimsQ2, either. The WHERE clause calls dates that exist in portions of the member table ClaimsQ1, and it calls dates that exist in portions of the member table ClaimsQ2. The dates in the WHERE clause do not match all of the dates that exist in the member table ClaimsQ1. They do not match all of the dates that exist in the member table ClaimsQ2. The dates in the WHERE clause are not totally exclusive of the dates that exist in the member tables ClaimsQ1 or ClaimsQ2. As a result, the server does not use BY clause optimization when it runs the code.

Last updated: February 8, 2017