Phase I uses the SQL
join keys from the subsetted Phase I dimension tables to get a smaller
set of candidate rows to query in the central fact table. After optimizing
the candidate rows in the fact table, the Phase I index probe examines
index structures to determine the best STARJOIN strategy to use. There
are two SPD Server STARJOIN strategies: the IN-SET strategy and the
COMPOSITE strategy. In all but a few cases, the IN-SET strategy is
the most robust and efficient processing strategy. The user can determine
which strategy SPD Server chooses by providing the required table
index types in the submitted SQL.
Phase I creates the
smaller set of candidate rows in the central fact table by eliminating
fact table rows that do not match the SQL join keys from the subsetted
Phase I dimension tables. For example, if the SQL query requests information
about transactions that occurred only in North Carolina store locations,
the candidate rows that are retained in the fact table uses the SQL
that subsets the Location dimension table:
WHERE location.STATE = "NC";
If the Sales fact table
contains sales records for all 50 states, Phase I uses the SQL that
subsets the Location dimension table to eliminate the sales records
of all stores in states other than North Carolina from the fact table
candidate rows. The example is simple, but powerful -- reducing the
fact table candidate row set to transactions from only North Carolina
stores eliminates massive amounts of nonproductive data processing.
The Phase I index probe
inventories the number and types of indexes on the fact table and
dimension tables as it attempts to identify the best STARJOIN strategy.
To use the STARJOIN IN-SET strategy, Phase I must find simple indexes
on all SQL join columns in the fact table and dimension tables. Otherwise,
to use the STARJOIN COMPOSITE strategy, Phase I searches for the best
composite index that is available on the fact table. The best composite
index for the fact table is the composite index that spans the largest
set of join predicates from the aggregated Phase I dimension tables.
Based on the fact table
and dimension table index probe, SPD Server selects the STARJOIN strategy
using the following logic:
-
If one or more simple indexes are
found on fact table and dimension table SQL join columns, and no spanning
composite indexes are found on the fact table, SPD Server selects
the STARJOIN IN-SET strategy.
-
If an optimal spanning composite
index is found on the fact table, and no simple indexes are found
on fact table and dimension table SQL join columns, SPD Server selects
the STARJOIN COMPOSITE strategy.
-
If both simple and spanning composite
indexes are found, SPD Server generally selects the STARJOIN IN-SET
strategy, unless the composite index is an exact match for all of
the Phase I join predicates, and only lesser matches are available
with the IN-SET strategy.
-
If no suitable indexes are found
for either STARJOIN strategy, SPD Server does not use STARJOIN; it
joins the sub-tree using the standard SPD Server pairwise join.
The IN-SET and COMPOSITE
join strategies have some underlying differences.
The IN-SET join strategy
will cache temporary Phase 1 probes in memory, when possible, for
use by Phase 2. The caching can result in significant performance
improvements by using in-memory lookups into the dimension tables
for Phase 2 probes, rather than performing more costly file system
probes of the dimension tables, which can result in significant performance
improvements. The amount of memory allocated for Phase 1 IN-SET caching
is controlled by the STARSIZE= server parameter. You can use the STARJOIN
DETAILS option to see which partial results of the Phase 1 IN-SET
strategy are cached, and whether sufficient memory was allocated for
STARJOIN to cache all partial results.
The IN-SET join strategy
uses an IN-SET transformation of dimension table metadata to produce
a powerful compound WHERE clause to be used on the STARJOIN fact table.
The "IN" part of the term "IN-SET" refers to an IN specification in
the SQL WHERE clause. The IN-SET is the set of values that populate
the contents of the SQL IN query expression. For example, in the following
SQL WHERE clause, the cities
Raleigh,
Cary, and
Clayton are the values
of the IN-SET:
WHERE location.CITY in ("Raleigh", "Cary", "Clayton");
For the IN-SET strategy,
Phase I dimension tables are subsetted, and then the resulting set
of join keys form the SQL IN expression for the fact table's corresponding
join column. You must have simple indexes on all SQL join columns
in both the fact table and dimension tables before STARJOIN Phase
I can select the IN-SET strategy.
If the dimension table
Location has six rows for Raleigh, Cary, and Clayton, then six STORE_NUMBER
values are applied to the IN-SET WHERE clause that is used to select
the candidate rows from the central fact table. The STARJOIN IN-SET
facility transforms the dimension table's CITY values into STORE_NUMBER
values that can be used to select candidate rows from the Sales fact
table. The transformed WHERE clause to be applied to the fact table
might resemble the following code:
WHERE fact.STORE_NUMBER in
(100,101,102,103,104,105,106);
You can use IN-SET transformations
in a star schema that has any number of dimension tables and a fact
table. Consider the following example dimension table subsetting statement:
WHERE location.CITY in
("Raleigh","Cary","Clayton")
and Time.SALES_WEEK = 1;
Because the Sales fact
table has no matching CITY column to join with the Location dimension
table, and no matching SALES_WEEK column to join with the Time table,
the IN-SET strategy uses transformations to create a WHERE clause
that the Sales fact table can resolve:
WHERE fact.STORE_NUMBER in
(100,101,102,103,104,105,106)
and Time.SALES_DATE in
('01JAN2005'd,'02JAN2005'd,'03JAN2005'd,
'04JAN2005'd,'05JAN2005'd,'06JAN2005'd,
'07JAN2005'd,);
The advantage of the
STARJOIN facility is that it handles all of the transformations on
a fact table, from dimension table subsetting to IN-SET WHERE clauses.
The COMPOSITE join strategy
uses a composite index on the fact table to exhaustively probe the
full Cartesian product of the combined join keys that is produced
by the aggregated dimension table subsetting. SPD Server compares
the composite indexes on the fact table to the theoretical composite
index that is made from all of the join keys in the Phase I dimension
tables. Phase I selects the best composite index on the fact table,
based on the join requirements of the dimension tables.
A disadvantage of using
the COMPOSITE join strategy is that when more than a few join keys
exist, the Cartesian product map can become large geometric matrixes
that can interfere with processing performance. You must have a composite
index on the fact table that consists of Phase I dimension table join
columns before STARJOIN Phase I can select the COMPOSITE join strategy.
If any Phase I dimension
tables contain join predicates that do not have supporting simple
or composite indexes on the fact table, those Phase I dimension tables
are dropped from Phase I processing and are moved to the Phase II
group.