Dynamic Cluster BY Clause Optimization

Overview of Optimizing BY Clauses

When you use SPD Server dynamic clusters, you can create huge data sets. If a SAS job needs to manipulate a huge data set, you can sort the data sets for more efficient processing. Traditional processing of huge data sets can overuse or overwhelm available resources. Insufficient run-time or processor resources can prohibit you from running full table scans and manipulating table rows, which are required to sort huge data sets for subsequent processing.

SPD Server provides dynamic cluster BY clause optimization to reduce the need for a large amount of processor resources when evaluating BY clauses. Dynamic cluster BY clause optimization uses SPD Server to join individually created SPD Server member data sets so that the data sets appear to be a single data set, but the individual member data sets are also kept intact. Dynamic cluster BY clause optimization uses the SORTEDBY metadata of the member data sets to bypass most of the sorting that is required to perform the implicit BY clause ordering. SPD Server uses the SORTEDBY metadata of each member data set to merge the member data sets in the dynamic cluster in order by each member data set's order. No additional SPD Server workspace is required, and the ordered data set records are returned quickly because the member data sets do not need to be sorted.

To use dynamic cluster BY clause optimization, you need to build the dynamic cluster table a specific way. All of the member tables in your dynamic cluster table need to be sorted by the same columns that you use in the BY clause. When you build your dynamic cluster table from member tables that are presorted by your BY clause columns, your dynamic cluster table can use the BY clause optimization.

When you run a BY clause that matches the SORTEDBY column order of the member tables of the dynamic cluster table, SPD Server processes the BY clause without using sort workspace and does not experience first-record latency. SPD Server uses the presorted member tables to perform an instantaneous interleave. Because dynamic cluster BY clause optimization uses the presorted member tables, you can perform operations on huge data sets that would be impossible to handle otherwise.

For example, suppose your system has sufficient CPU, memory and workspace resources to sort a 50 GB data set in a reasonable amount of time. However, suppose this system accumulates 50 GB of new data every month. After 12 months, the data set requires 600 GB of storage. The system cannot sort 600 GB of data to process queries that are based on the previous 12-month period. To use dynamic cluster BY clause optimization in this situation:

Create a dynamic cluster table from the twelve 50 GB member tables. You have a 600 GB dynamic cluster table.
Store data for each successive month in an SPD Server member table.
Sort each table and add it to the 600 GB dynamic cluster table.
Use dynamic cluster BY clause optimization to run SAS steps that use BY clauses on the 600 GB dynamic cluster table.

For example, you can run a DATA step MERGE statement that uses the dynamic cluster table as the master source for the MERGE statement. The BY clause from the MERGE statement triggers the dynamic cluster BY clause optimization. The operation completes in the time that it takes to interleave the individual member tables. The process uses no SPD Server workspace and does not cause any implicit BY sort delays.

Dynamic cluster BY clause optimization is triggered when all member tables have an applicable SORTEDBY ordering for the BY clause that is asserted. When the SORTEDBY ordering is strong (validated), SPD Server does not verify the order of BY variables that are returned from the member table. When the SORTEDBY ordering is weak (such as from a SORTEDBY assertion that was a data set option), SPD Server verifies the order of BY variables that are returned from the member table. If SPD Server detects an invalid BY variable order, it terminates the BY clause and displays the following error message:

ERROR: Clustered BY member violates weaksort order during merge.

Combining WHERE Clauses with Dynamic Cluster BY Clause Optimization

You can use dynamic cluster BY clause optimization to combine BY clause optimization with certain WHERE clauses on dynamic cluster tables. SPD Server must be able to determine whether the WHERE clause is trivially true or trivially false for each member table in the dynamic cluster table. To be trivially true, a WHERE clause must find the clause condition to be true for every row in the member table. To be trivially false, a WHERE clause must find the clause condition to be false for every row in the member table.

SPD Server keeps metadata about indexed values that are in dynamic cluster table member tables. If SPD Server can determine whether the WHERE clause criteria is true or false, based on the dynamic cluster table's member table metadata, WHERE clause optimization is possible on a member-by-member basis for the entire dynamic cluster table. Suppose that member tables of a dynamic cluster table all have an index on the column QUARTER (1=JAN-MAR, 2=APR-JUN, 3=JUL-SEP, 4=OCT-DEC). Suppose you need to run a DATA step MERGE statement that uses the expression WHERE QUARTER=2. Because the QUARTER column is indexed in all of the member tables, SPD Server uses BY clause optimization to determine that the WHERE clause is trivially true. SPD Server evaluates the expression only on the member tables for April, May, and June, and does not use any SPD Server workspace. If the WHERE clause is determined to be trivially true or trivially false for each member table of the dynamic cluster table in advance, BY clause optimization performs BY processing only on the appropriate member tables.

Dynamic Cluster BY Clause Optimization Example

Consider a database of medical patient insurance claims that contains quarterly claims data sets that are named ClaimsQ1, ClaimsQ2, ClaimsQ3, and ClaimsQ4. The following code does these tasks:

Sorts each quarterly claims table into columns that are named PatID (for patient ID) and ClaimID (for claim ID).
Combines the member tables into a dynamic cluster table that is named ClaimsAll.

DATA SPDS.ClaimsQ1;
...
run;

DATA SPDS.ClaimsQ2;
...
run;

PROC SORT DATA=SPDS.ClaimsQ1;
 BY PatID ClaimID;
run;

PROC SORT DATA=SPDS.ClaimsQ2;
 BY PatID ClaimID;
run;

PROC SPDO LIB=SPDS;
create cluster ClaimsAll;
quit;

The following DATA step MERGE statement is submitted to the ClaimsAll dynamic cluster table:

DATA SPDS.ToAdd SPDS.ToUpdate;
MERGE SPDS.NewOnes(IN=NEW1)
      SPDS.ClaimsAll(IN=OLD1);
BY PatID ClaimID;

SELECT;
WHEN(NEW1 and OLD1)
 DO;
 OUTPUT SPDS.ToUpdate;
 end;
WHEN(NEW1 and not OLD1)
 DO;
 OUTPUT SPDS.ToAdd;
 end;
run;

If ClaimsAll was not a dynamic cluster table, the DATA step MERGE statement would create an implicit sort from the BY clause on the respective SPD Server data sets. However, ClaimsAll is a dynamic cluster table with member tables that are presorted. As a result, dynamic cluster BY clause optimization uses BY clause processing to merge the sorted member tables instantaneously without using any SPD Server workspace or creating any delays. The example merges the transaction data named NewOnes into new rows that are appended to the data for the next quarter.

The member data sets ClaimsQ1 and ClaimsQ2 are indexed on the column Claim_Date:

DATA SPDS.RepClaims;
 SET SPDS.ClaimsAll;
 WHERE Claim_Date BETWEEN '01JAN2007' and '31MAR2007';
 BY PatID ClaimID;
run;

The WHERE clause determines whether each member table is true or false for each quarter. The WHERE clause is trivially true for the data set ClaimsQ1 because the WHERE clause is true for all dates in the first quarter. The WHERE clause is trivially false for the data set ClaimsQ2 because the WHERE clause is false for all dates in the second quarter. BY clause optimization determines that the member table ClaimsQ1 will be processed because the WHERE clause is true for all of the rows of the ClaimsQ1 table. BY clause optimization skips the data set ClaimsQ2 because the WHERE clause is false for all of the rows of the ClaimsQ2 table.

Suppose that the Claim_Date range is changed in the WHERE clause:

DATA SPDS.RepClaims;
 SET SPDS.ClaimsAll;
 WHERE Claim_Date BETWEEN '05JAN2007' and '28JUN2007';
 BY PatID ClaimID;
run;

When the new WHERE clause is evaluated, it is not trivially true for member tables ClaimsQ1 or ClaimsQ2. The WHERE clause is not trivially false for member tables ClaimsQ1 or ClaimsQ2, either. The WHERE clause calls dates that exist in portions of the member table ClaimsQ1, and it calls dates that exist in portions of the member table ClaimsQ2. The dates in the WHERE clause do not match all of the dates that exist in the member table ClaimsQ1, and they do not match all of the dates that exist in the member table ClaimsQ2. The dates in the WHERE clause are not totally exclusive of the dates that exist in the member tables ClaimsQ1 or ClaimsQ2. As a result, SPD Server does not use BY clause optimization when it runs the code.