Dynamic Cluster BY Clause Optimization

Overview of Optimizing BY Clauses

When you use SPD Server dynamic clusters, you can create huge data sets. If a huge data set needs to be manipulated by a SAS job, it might be better to sort the data sets for more efficient processing. Traditional processing of huge data sets can overuse or overwhelm available resources. A lack of available run-time or processor resources can prohibit you from running full table scans and manipulating table rows, which are required to sort huge data sets for subsequent processing.

SPD Server provides dynamic cluster BY clause optimization to reduce the need for a large amount of processor resources when evaluating BY clauses. The dynamic cluster BY clause optimization uses SPD Server to join individually created SPD Server member data sets so that the data sets appear to be a single data set, while keeping the individual member data sets intact. The dynamic cluster BY clause optimization uses the SORTEDBY metadata of the member data sets to bypass most of the sorting that is required to perform the implicit BY clause ordering. With the SORTEDBY metadata of each member data set, SPD Server merges the member data sets in the dynamic cluster by using each member data set's order. No additional SPD Server workspace is required, and the ordered data set records are returned with minimum delay because member data set sorting is eliminated.

To use dynamic cluster BY clause optimization, you need to build the dynamic cluster table a specific way. All of the member tables in your dynamic cluster table need to be sorted by the same columns that you need to use in the BY clause. When you build your dynamic cluster table from member tables that are presorted by your BY clause columns, your dynamic cluster table can use the BY clause optimization.

When a BY clause is run that matches the SORTEDBY column order of the member tables of the dynamic cluster table, SPD Server performs the BY clause without using sort workspace and does not experience first-record latency. SPD Server uses the presorted member tables to perform an instantaneous interleave. By using the presorted member tables, the dynamic cluster BY clause optimization enables you to perform operations on huge data sets that would be impossible to handle otherwise.

For example, suppose you have a system that has sufficient CPU, memory, and workspace resources to sort a 50-GB data set in a reasonable amount of time. However, suppose this system accumulates 50 GB of new data every month. Therefore, after 12 months, the data set requires 600 GB of storage. The system cannot handle sorting 600 GB of data to process queries that are based on the previous 12-month period. If you use SPD Server to create a dynamic cluster table from the 12 50-GB member tables, you can store each rolling month of data in an SPD Server member table. Then, you sort it like the other dynamic cluster table member tables, and add the new member table to the 600-GB dynamic cluster table. Now, you can use dynamic cluster BY clause optimization to run SAS steps that use BY clauses on the 600-GB dynamic cluster table. For example, you can run a DATA step MERGE statement that uses the dynamic cluster table as the master source for the MERGE statement. The BY clause from the MERGE statement triggers the dynamic cluster BY clause optimization. As a result, the operation completes in the time that it takes to interleave the individual member tables. And, it uses no SPD Server workspace and does not experience any implicit BY sort delays.

Dynamic cluster BY clause optimization allows BY clause optimization to be combined with certain WHERE clauses on dynamic cluster tables. For WHERE clause optimization to work, SPD Server must be able to determine whether the WHERE clause is trivially true or trivially false for each member table in the dynamic cluster table. To be trivially true, a WHERE clause must find the clause condition true for every row in the member table. To be trivially false, a WHERE clause must find the clause condition false for every row in the member table.

SPD Server keeps metadata about indexed values in dynamic cluster table member tables. If the WHERE clause criteria can be determined as true or false, based on the dynamic cluster table's member table metadata, WHERE clause optimization is possible on a member-by-member basis for the entire dynamic cluster table. Suppose that member tables of a dynamic cluster table all have an index on the column QUARTER (1=JAN-MAR, 2=APR-JUN, 3=JUL-SEP, 4=OCT-DEC). Suppose you need to run a DATA step MERGE statement that uses the expression WHERE QUARTER=2. Because the QUARTER column is indexed in all of the member tables, SPD Server uses BY clause optimization to determine that the WHERE clause is trivially true. SPD Server evaluates the expression only on the member tables for April, May, and June, and does not use any SPD Server workspace. When the WHERE clause can be determined as trivially true or trivially false for each member table of the dynamic cluster table in advance, BY clause optimization performs BY processing only on the appropriate member tables.

The dynamic cluster BY clause optimization is triggered when member tables all have an applicable SORTEDBY ordering for the BY clause that is asserted. When the SORTEDBY ordering is strong (validated), SPD Server does not perform checks to verify the order of BY variables that are returned from the member table. When the SORTEDBY ordering is weak (such as from a SORTEDBY assertion that was a data set option), additional checking is performed to verify the order of BY variables that are returned from the member table. If an invalid BY variable order is detected, SPD Server terminates the BY clause and displays the following error message:

ERROR: Clustered BY member violates weaksort order during merge.

Dynamic Cluster BY Clause Optimization Example

Consider a database of medical patient insurance claims, with quarterly claims data sets that are named ClaimsQ1, ClaimsQ2, ClaimsQ3, and ClaimsQ4. Each quarterly claims table is sorted by columns that are named PatID (for patient ID) and ClaimID (for claim ID). The member tables are combined into a dynamic cluster table that is named ClaimsAll. The following example shows the code for this example:

DATA SPDS.ClaimsQ1;
...
run;

DATA SPDS.ClaimsQ2;
...
run;

PROC SORT DATA=SPDS.ClaimsQ1;
 BY PatID ClaimID;
run;

PROC SORT DATA=SPDS.ClaimsQ2;
 BY PatID ClaimID;
run;

PROC SPDO LIB=SPDS;
create cluster ClaimsAll;
quit;

Consider the DATA step MERGE statement to be submitted to the ClaimsAll dynamic cluster table:

DATA SPDS.ToAdd SPDS.ToUpdate;
MERGE SPDS.NewOnes(IN=NEW1)
      SPDS.ClaimsAll(IN=OLD1);
BY PatID ClaimID;

SELECT;
WHEN(NEW1 and OLD1)
 DO;
 OUTPUT SPDS.ToUpdate;
 end;
WHEN(NEW1 and not OLD1)
 DO;
 OUTPUT SPDS.ToAdd;
 end;
run;

If ClaimsAll was not a dynamic cluster table, the DATA step MERGE statement would create an implicit sort from the BY clause on the respective SPD Server data sets. However, ClaimsAll is a dynamic cluster table with member tables that are presorted. As a result, the dynamic cluster BY clause optimization uses BY clause processing to merge the sorted member tables instantaneously without using any SPD Server workspace or does not experience any delays. The example merges the transaction data named NewOnes into new rows that are appended to the data for the next quarter.

Consider that the member data sets ClaimsQ1 and ClaimsQ2 are indexed on the column Claim_Date:

DATA SPDS.RepClaims;
 SET SPDS.ClaimsAll;
 WHERE Claim_Date BETWEEN '01JAN2007' and '31MAR2007';
 BY PatID ClaimID;
run;

The WHERE clause determines whether each member table is true or false for each quarter. The WHERE clause is trivially true for the data set ClaimsQ1 because the WHERE clause is true for all dates in the first quarter. The WHERE clause is trivially false for the data set ClaimsQ2 because the WHERE clause is false for all dates in the second quarter. BY clause optimization determines that the member table ClaimsQ1 will be processed because the WHERE clause is true for all of the rows of the ClaimsQ1 table. BY clause optimization skips the data set ClaimsQ2 because the WHERE clause is false for all of the rows of the ClaimsQ2 table.

Suppose that the Claim_Date range is changed in the WHERE clause:

DATA SPDS.RepClaims;
 SET SPDS.ClaimsAll;
 WHERE Claim_Date BETWEEN '05JAN2007' and '28JUN2007';
 BY PatID ClaimID;
run;

When the new WHERE clause is evaluated, it is not trivially true for member tables ClaimsQ1 or ClaimsQ2. The WHERE clause is not trivially false for member tables ClaimsQ1 or ClaimsQ2, either. The WHERE clause calls dates that exist in portions of the member table ClaimsQ1, and it calls dates that exist in portions of the member table ClaimsQ2. The dates in the WHERE clause do not match all of the dates that exist in the member table ClaimsQ1, and they do not match all of the dates that exist in the member table ClaimsQ2. The dates in the WHERE clause are not totally exclusive of the dates that exist in the member tables ClaimsQ1 or ClaimsQ2. As a result, BY clause optimization is not used when SPD Server runs the code.