When you use SPD Server
dynamic clusters, you can create huge data sets. If a SAS job needs
to manipulate a huge data set, you can sort the data sets for more
efficient processing. Traditional processing of huge data sets can
overuse or overwhelm available resources. Insufficient run-time or
processor resources can prohibit you from running full table scans
and manipulating table rows, which are required to sort huge data
sets for subsequent processing.
SPD Server provides
dynamic cluster BY clause optimization to reduce the need for a large
amount of processor resources when evaluating BY clauses. Dynamic
cluster BY clause optimization uses SPD Server to join individually
created SPD Server member data sets so that the data sets appear to
be a single data set, but the individual member data sets are also
kept intact. Dynamic cluster BY clause optimization uses the SORTEDBY
metadata of the member data sets to bypass most of the sorting that
is required to perform the implicit BY clause ordering. SPD Server
uses the SORTEDBY metadata of each member data set to merge the member
data sets in the dynamic cluster in order by each member data set's
order. No additional SPD Server workspace is required, and the ordered
data set records are returned quickly because the member data sets
do not need to be sorted.
To use dynamic cluster
BY clause optimization, you need to build the dynamic cluster table
a specific way. All of the member tables in your dynamic cluster table
need to be sorted by the same columns that you use in the BY clause.
When you build your dynamic cluster table from member tables that
are presorted by your BY clause columns, your dynamic cluster table
can use the BY clause optimization.
When you run a BY clause
that matches the SORTEDBY column order of the member tables of the
dynamic cluster table, SPD Server processes the BY clause without
using sort workspace and does not experience first-record latency.
SPD Server uses the presorted member tables to perform an instantaneous
interleave. Because dynamic cluster BY clause optimization uses the
presorted member tables, you can perform operations on huge data sets
that would be impossible to handle otherwise.
For example, suppose
your system has sufficient CPU, memory and workspace resources to
sort a 50 GB data set in a reasonable amount of time. However, suppose
this system accumulates 50 GB of new data every month. After 12 months,
the data set requires 600 GB of storage. The system cannot sort 600
GB of data to process queries that are based on the previous 12-month
period. To use dynamic cluster BY clause optimization in this situation:
-
Create a dynamic cluster
table from the twelve 50 GB member tables. You have a 600 GB dynamic
cluster table.
-
Store data for each
successive month in an SPD Server member table.
-
Sort each table and
add it to the 600 GB dynamic cluster table.
-
Use dynamic cluster
BY clause optimization to run SAS steps that use BY clauses on the
600 GB dynamic cluster table.
For example, you can
run a DATA step MERGE statement that uses the dynamic cluster table
as the master source for the MERGE statement. The BY clause from the
MERGE statement triggers the dynamic cluster BY clause optimization.
The operation completes in the time that it takes to interleave the
individual member tables. The process uses no SPD Server workspace
and does not cause any implicit BY sort delays.
Dynamic cluster BY clause
optimization is triggered when all member tables have an applicable
SORTEDBY ordering for the BY clause that is asserted. When the SORTEDBY
ordering is strong (validated), SPD Server does not verify the order
of BY variables that are returned from the member table. When the
SORTEDBY ordering is weak (such as from a SORTEDBY assertion that
was a data set option), SPD Server verifies the order of BY variables
that are returned from the member table. If SPD Server detects an
invalid BY variable order, it terminates the BY clause and displays
the following error message:
ERROR: Clustered BY member violates weaksort order during merge.