Consider a database
of medical patient insurance claims that contains quarterly claims
data sets that are named ClaimsQ1, ClaimsQ2, ClaimsQ3, and ClaimsQ4.
The following code does these tasks:
-
Sorts each quarterly
claims table into columns that are named PatID (for patient ID) and
ClaimID (for claim ID).
-
Combines the member
tables into a dynamic cluster table that is named ClaimsAll.
DATA SPDS.ClaimsQ1;
...
run;
DATA SPDS.ClaimsQ2;
...
run;
PROC SORT DATA=SPDS.ClaimsQ1;
BY PatID ClaimID;
run;
PROC SORT DATA=SPDS.ClaimsQ2;
BY PatID ClaimID;
run;
PROC SPDO LIB=SPDS;
create cluster ClaimsAll;
quit;
The following DATA step
MERGE statement is submitted to the ClaimsAll dynamic cluster table:
DATA SPDS.ToAdd SPDS.ToUpdate;
MERGE SPDS.NewOnes(IN=NEW1)
SPDS.ClaimsAll(IN=OLD1);
BY PatID ClaimID;
SELECT;
WHEN(NEW1 and OLD1)
DO;
OUTPUT SPDS.ToUpdate;
end;
WHEN(NEW1 and not OLD1)
DO;
OUTPUT SPDS.ToAdd;
end;
run;
If ClaimsAll was not
a dynamic cluster table, the DATA step MERGE statement would create
an implicit sort from the BY clause on the respective SPD Server data
sets. However, ClaimsAll is a dynamic cluster table with member tables
that are presorted. As a result, dynamic cluster BY clause optimization
uses BY clause processing to merge the sorted member tables instantaneously
without using any SPD Server workspace or creating any delays. The
example merges the transaction data named NewOnes into new rows that
are appended to the data for the next quarter.
The member data sets
ClaimsQ1 and ClaimsQ2 are indexed on the column Claim_Date:
DATA SPDS.RepClaims;
SET SPDS.ClaimsAll;
WHERE Claim_Date BETWEEN '01JAN2007' and '31MAR2007';
BY PatID ClaimID;
run;
The WHERE clause determines
whether each member table is true or false for each quarter. The WHERE
clause is trivially true for the data set ClaimsQ1 because the WHERE
clause is true for all dates in the first quarter. The WHERE clause
is trivially false for the data set ClaimsQ2 because the WHERE clause
is false for all dates in the second quarter. BY clause optimization
determines that the member table ClaimsQ1 will be processed because
the WHERE clause is true for all of the rows of the ClaimsQ1 table.
BY clause optimization skips the data set ClaimsQ2 because the WHERE
clause is false for all of the rows of the ClaimsQ2 table.
Suppose that the Claim_Date
range is changed in the WHERE clause:
DATA SPDS.RepClaims;
SET SPDS.ClaimsAll;
WHERE Claim_Date BETWEEN '05JAN2007' and '28JUN2007';
BY PatID ClaimID;
run;
When the new WHERE clause
is evaluated, it is not trivially true for member tables ClaimsQ1
or ClaimsQ2. The WHERE clause is not trivially false for member tables
ClaimsQ1 or ClaimsQ2, either. The WHERE clause calls dates that exist
in portions of the member table ClaimsQ1, and it calls dates that
exist in portions of the member table ClaimsQ2. The dates in the WHERE
clause do not match all of the dates that exist in the member table
ClaimsQ1, and they do not match all of the dates that exist in the
member table ClaimsQ2. The dates in the WHERE clause are not totally
exclusive of the dates that exist in the member tables ClaimsQ1 or
ClaimsQ2. As a result, SPD Server does not use BY clause optimization
when it runs the code.