Phase I uses the SQL
join keys from the subsetted Phase I dimension tables to get a smaller
set of candidate rows to query in the central fact table. After the
Phase I index probe optimizes the candidate rows in the fact table,
the probe examines index structures to determine the best STARJOIN
strategy to use. There are two SPD Server STARJOIN strategies: the
IN-SET strategy and the COMPOSITE strategy. In all but a few cases,
the IN-SET strategy is the most robust and efficient processing strategy.
You can determine which strategy SPD Server chooses by providing the
required table index types in the SQL that you submit.
Phase I creates the
smaller set of candidate rows in the central fact table by eliminating
fact table rows that do not match the SQL join keys from the subsetted
Phase I dimension tables. For example, if the SQL query requests information
about transactions that occurred only in North Carolina store locations,
the candidate rows that are retained in the fact table uses the SQL
that subsets the Location dimension table:
WHERE location.STATE = "NC";
If the Sales fact table
contains sales records for all 50 states, Phase I uses the SQL that
subsets the Location dimension table to eliminate the sales records
of all stores in states other than North Carolina from the fact table
candidate rows. The fact table candidate rowset is reduced to transactions
from only North Carolina stores, which eliminates massive amounts
of nonproductive data processing.
The Phase I index probe
inventories the number and types of indexes on the fact table and
dimension tables as it attempts to identify the best STARJOIN strategy.
To use the STARJOIN IN-SET strategy, Phase I must find simple indexes
on all SQL join columns in the fact table and dimension tables. Otherwise,
to use the STARJOIN COMPOSITE strategy, Phase I searches for the best
composite index that is available on the fact table. The best composite
index for the fact table is the composite index that spans the largest
set of join predicates from the aggregated Phase I dimension tables.
Based on the fact table
and dimension table index probe, SPD Server selects the STARJOIN strategy
by using the following logic:
-
If the probe finds one or more
simple indexes on fact table and dimension table SQL join columns,
and does not find spanning composite indexes on the fact table, SPD
Server selects the STARJOIN IN-SET strategy.
-
If the probe finds an optimal spanning
composite index on the fact table, and does not find simple indexes
on fact table and dimension table SQL join columns, SPD Server selects
the STARJOIN COMPOSITE strategy.
-
If the probe finds both simple
and spanning composite indexes, SPD Server generally selects the STARJOIN
IN-SET strategy. If the composite index is an exact match for all
of the Phase I join predicates, and only lesser matches are available
with the IN-SET strategy, SPD Server selects the IN-SET strategy.
-
If the probe does not find suitable
indexes for either STARJOIN strategy, SPD Server does not use STARJOIN;
it joins the subtree using the standard SPD Server pairwise join.
The IN-SET and COMPOSITE
join strategies have some underlying differences.
The IN-SET join strategy
will cache temporary Phase 1 probes in memory, when possible, for
use by Phase 2. The caching can result in significant performance
improvements by using in-memory lookups into the dimension tables
for Phase 2 probes, rather than performing more costly file system
probes of the dimension tables, which can result in significant performance
improvements. The amount of memory allocated for Phase 1 IN-SET caching
is controlled by the STARSIZE= server parameter. You can use the STARJOIN
DETAILS option to see which partial results of the Phase 1 IN-SET
strategy are cached, and whether sufficient memory was allocated for
STARJOIN to cache all partial results.
The IN-SET join strategy
uses an IN-SET transformation of dimension table metadata to produce
a powerful compound WHERE clause to be used on the STARJOIN fact table.
In the term
IN-SET,
IN refers
to an IN specification in the SQL WHERE clause. The
IN-SET is
the set of values that populate the contents of the SQL IN query expression.
For example, in the following SQL WHERE clause, the cities
Raleigh
,
Cary
,
and
Clayton
are the values of the IN-SET:
WHERE location.CITY in ("Raleigh", "Cary", "Clayton");
For the IN-SET strategy,
Phase I dimension tables are subsetted. Then the resulting set of
join keys form the SQL IN expression for the fact table's corresponding
join column. You must have simple indexes on all SQL join columns
in both the fact table and dimension tables before STARJOIN Phase
I can select the IN-SET strategy.
If the dimension table
Location has six rows for Raleigh, Cary, and Clayton, then six STORE_NUMBER
values are applied to the IN-SET WHERE clause that is used to select
the candidate rows from the central fact table. The STARJOIN IN-SET
facility transforms the dimension table's CITY values into STORE_NUMBER
values that can be used to select candidate rows from the Sales fact
table. The transformed WHERE clause that is applied to the fact table
might resemble the following code:
WHERE fact.STORE_NUMBER in
(100,101,102,103,104,105,106);
You can use IN-SET transformations
in a star schema that has any number of dimension tables and a fact
table. Consider the following example subsetting statement for a dimension
table:
WHERE location.CITY in
("Raleigh","Cary","Clayton")
and Time.SALES_WEEK = 1;
The Sales fact table
has no matching CITY column to join with the Location dimension table,
and no matching SALES_WEEK column to join with the Time table. Therefore,
the IN-SET strategy uses transformations to create a WHERE clause
that the Sales fact table can resolve:
WHERE fact.STORE_NUMBER in
(100,101,102,103,104,105,106)
and Time.SALES_DATE in
('01JAN2005'd,'02JAN2005'd,'03JAN2005'd,
'04JAN2005'd,'05JAN2005'd,'06JAN2005'd,
'07JAN2005'd,);
The advantage of the
STARJOIN facility is that it handles all of the transformations on
a fact table, from dimension table subsetting to IN-SET WHERE clauses.
The COMPOSITE join strategy
uses a composite index on the fact table to exhaustively probe the
full Cartesian product of the combined join keys that is produced
by the subsetting of the aggregated dimension table. SPD Server compares
the composite indexes on the fact table to the theoretical composite
index that is made from all of the join keys in the Phase I dimension
tables. Phase I selects the best composite index on the fact table,
based on the join requirements of the dimension tables.
A disadvantage of using
the COMPOSITE join strategy is that when more than a few join keys
exist, the Cartesian product map can become large geometric matrices
that can interfere with processing performance. You must have a composite
index on the fact table that consists of Phase I dimension table join
columns before STARJOIN Phase I can select the COMPOSITE join strategy.
If any Phase I dimension
tables contain join predicates that do not have supporting simple
or composite indexes on the fact table, those Phase I dimension tables
are dropped from Phase I processing and are moved to the Phase II
group.