When you use SPD Server
dynamic clusters, you can create huge data sets. If a huge data set
needs to be manipulated by a SAS job, it might be better to sort the
data sets for more efficient processing. Traditional processing of
huge data sets can overuse or overwhelm available resources. A lack
of available run-time or processor resources can prohibit you from
running full table scans and manipulating table rows, which are required
to sort huge data sets for subsequent processing.
SPD Server provides
dynamic cluster BY clause optimization to reduce the need for a large
amount of processor resources when evaluating BY clauses. The dynamic
cluster BY clause optimization uses SPD Server to join individually
created SPD Server member data sets so that the data sets appear to
be a single data set, while keeping the individual member data sets
intact. The dynamic cluster BY clause optimization uses the SORTEDBY
metadata of the member data sets to bypass most of the sorting that
is required to perform the implicit BY clause ordering. With the SORTEDBY
metadata of each member data set, SPD Server merges the member data
sets in the dynamic cluster by using each member data set's order.
No additional SPD Server workspace is required, and the ordered data
set records are returned with minimum delay because member data set
sorting is eliminated.
To use dynamic cluster
BY clause optimization, you need to build the dynamic cluster table
a specific way. All of the member tables in your dynamic cluster table
need to be sorted by the same columns that you need to use in the
BY clause. When you build your dynamic cluster table from member tables
that are presorted by your BY clause columns, your dynamic cluster
table can use the BY clause optimization.
When a BY clause is
run that matches the SORTEDBY column order of the member tables of
the dynamic cluster table, SPD Server performs the BY clause without
using sort workspace and does not experience first-record latency.
SPD Server uses the presorted member tables to perform an instantaneous
interleave. By using the presorted member tables, the dynamic cluster
BY clause optimization enables you to perform operations on huge data
sets that would be impossible to handle otherwise.
For example, suppose
you have a system that has sufficient CPU, memory, and workspace resources
to sort a 50-GB data set in a reasonable amount of time. However,
suppose this system accumulates 50 GB of new data every month. Therefore,
after 12 months, the data set requires 600 GB of storage. The system
cannot handle sorting 600 GB of data to process queries that are based
on the previous 12-month period. If you use SPD Server to create a
dynamic cluster table from the 12 50-GB member tables, you can store
each rolling month of data in an SPD Server member table. Then, you
sort it like the other dynamic cluster table member tables, and add
the new member table to the 600-GB dynamic cluster table. Now, you
can use dynamic cluster BY clause optimization to run SAS steps that
use BY clauses on the 600-GB dynamic cluster table. For example, you
can run a DATA step MERGE statement that uses the dynamic cluster
table as the master source for the MERGE statement. The BY clause
from the MERGE statement triggers the dynamic cluster BY clause optimization.
As a result, the operation completes in the time that it takes to
interleave the individual member tables. And, it uses no SPD Server
workspace and does not experience any implicit BY sort delays.
Dynamic cluster BY clause
optimization allows BY clause optimization to be combined with certain
WHERE clauses on dynamic cluster tables. For WHERE clause optimization
to work, SPD Server must be able to determine whether the WHERE clause
is trivially true or trivially false for each member table in the
dynamic cluster table. To be trivially true, a WHERE clause must find
the clause condition true for every row in the member table. To be
trivially false, a WHERE clause must find the clause condition false
for every row in the member table.
SPD Server keeps metadata
about indexed values in dynamic cluster table member tables. If the
WHERE clause criteria can be determined as true or false, based on
the dynamic cluster table's member table metadata, WHERE clause optimization
is possible on a member-by-member basis for the entire dynamic cluster
table. Suppose that member tables of a dynamic cluster table all have
an index on the column QUARTER (1=JAN-MAR, 2=APR-JUN, 3=JUL-SEP, 4=OCT-DEC).
Suppose you need to run a DATA step MERGE statement that uses the
expression WHERE QUARTER=2. Because the QUARTER column is indexed
in all of the member tables, SPD Server uses BY clause optimization
to determine that the WHERE clause is trivially true. SPD Server evaluates
the expression only on the member tables for April, May, and June,
and does not use any SPD Server workspace. When the WHERE clause can
be determined as trivially true or trivially false for each member
table of the dynamic cluster table in advance, BY clause optimization
performs BY processing only on the appropriate member tables.
The dynamic cluster
BY clause optimization is triggered when member tables all have an
applicable SORTEDBY ordering for the BY clause that is asserted. When
the SORTEDBY ordering is strong (validated), SPD Server does not perform
checks to verify the order of BY variables that are returned from
the member table. When the SORTEDBY ordering is weak (such as from
a SORTEDBY assertion that was a data set option), additional checking
is performed to verify the order of BY variables that are returned
from the member table. If an invalid BY variable order is detected,
SPD Server terminates the BY clause and displays the following error
message:
ERROR: Clustered BY member violates weaksort order during merge.