SPD Server STARJOIN Optimization

Overview of STARJOIN Optimization

The SPD Server STARJOIN optimization process searches for the most efficient SQL strategy to use for computations. The STARJOIN optimization process consists of three steps, regardless of the number of dimension tables that are joined to the fact table in the star schema.

Classify dimension tables that are called by SQL as Phase I tables or Phase II tables.
Phase I of the process probes fact table indexes and selects a STARJOIN strategy.
Phase II of the process performs index lookups and joins subsetted fact table rows with Phase II tables.

Enabling STARJOIN Optimization in SPD Server

SPD Server STARJOIN optimization is enabled by default. For information about statement options that enable or disable the STARJOIN facility in SPD Server, see STARJOIN RESET Statement Options.

Classify Dimension Tables That Are Called by SQL as Phase I Tables or Phase II Tables

After the STARJOIN Planner validates the join subtree, join optimization begins. Join optimization is the process that searches for the most efficient SQL strategy to use to join the tables in the star schema.

The first step in SPD Server’s join optimization is to examine the dimension tables that were called by SQL for structures that SPD Server can use to improve performance. Each dimension table is classified as a Phase I table or a Phase II table. The structure of a dimension table and whether the SQL that you submit filters or subsets the table's contents determine its classification. SPD Server uses different processes to handle Phase I and Phase II dimension tables.

Phase I tables can improve performance. A Phase I table is a dimension table that is either very small (nine rows or fewer), or a dimension table whose SQL queries contain one or more filtering criteria that are expressed with a WHERE clause. A Phase II table is any dimension table that does not meet Phase I criteria. Rows in Phase II tables that are referenced in the SQL query are not subsetted.

Consider the star schema that is shown in Example Star Schema, which contains the fact table Sales and the dimension tables Products, Supplier, Location, and Time.

Suppose that you submit an SQL query that requests transaction reports for all suppliers and for all products that meet the following criteria from the fact table Sales:

the store location is North Carolina
the time period is the month of January

The SQL query subsets the Location and Time tables, so SPD Server classifies the Location and Time tables as Phase I tables. The query requests information from all of the rows in the Product and Supplier tables. Because those tables are not subsetted by a WHERE clause in the SQL, STARJOIN classifies the Products and Supplier tables in this query as Phase II tables.

Now, using the same star schema, add more detail to the SQL query. Set up a new query that requests transaction reports from the fact table Sales for all stores where the location is the state of North Carolina, for the time period of the month of January, and for products where the supplier is from the state of North Carolina. The subsetted dimension tables Location, Time, and Supplier are classified as Phase I tables. The Products table, unfiltered by the SQL query, is classified as a Phase II table.

Dimension tables are classified as Phase I or Phase II tables because the two types of tables require different index probe methods.

Phase I Probes Fact Table Indexes and Selects a STARJOIN Strategy

Phase I uses the SQL join keys from the subsetted Phase I dimension tables to get a smaller set of candidate rows to query in the central fact table. After the Phase I index probe optimizes the candidate rows in the fact table, the probe examines index structures to determine the best STARJOIN strategy to use. There are two SPD Server STARJOIN strategies: the IN-SET strategy and the COMPOSITE strategy. In all but a few cases, the IN-SET strategy is the most robust and efficient processing strategy. You can determine which strategy SPD Server chooses by providing the required table index types in the SQL that you submit.

Phase I creates the smaller set of candidate rows in the central fact table by eliminating fact table rows that do not match the SQL join keys from the subsetted Phase I dimension tables. For example, if the SQL query requests information about transactions that occurred only in North Carolina store locations, the candidate rows that are retained in the fact table uses the SQL that subsets the Location dimension table:

WHERE location.STATE = "NC";

If the Sales fact table contains sales records for all 50 states, Phase I uses the SQL that subsets the Location dimension table to eliminate the sales records of all stores in states other than North Carolina from the fact table candidate rows. The fact table candidate rowset is reduced to transactions from only North Carolina stores, which eliminates massive amounts of nonproductive data processing.

The Phase I index probe inventories the number and types of indexes on the fact table and dimension tables as it attempts to identify the best STARJOIN strategy. To use the STARJOIN IN-SET strategy, Phase I must find simple indexes on all SQL join columns in the fact table and dimension tables. Otherwise, to use the STARJOIN COMPOSITE strategy, Phase I searches for the best composite index that is available on the fact table. The best composite index for the fact table is the composite index that spans the largest set of join predicates from the aggregated Phase I dimension tables.

Based on the fact table and dimension table index probe, SPD Server selects the STARJOIN strategy by using the following logic:

If the probe finds one or more simple indexes on fact table and dimension table SQL join columns, and does not find spanning composite indexes on the fact table, SPD Server selects the STARJOIN IN-SET strategy.
If the probe finds an optimal spanning composite index on the fact table, and does not find simple indexes on fact table and dimension table SQL join columns, SPD Server selects the STARJOIN COMPOSITE strategy.
If the probe finds both simple and spanning composite indexes, SPD Server generally selects the STARJOIN IN-SET strategy. If the composite index is an exact match for all of the Phase I join predicates, and only lesser matches are available with the IN-SET strategy, SPD Server selects the IN-SET strategy.
If the probe does not find suitable indexes for either STARJOIN strategy, SPD Server does not use STARJOIN; it joins the subtree using the standard SPD Server pairwise join.

The IN-SET and COMPOSITE join strategies have some underlying differences.

The IN-SET join strategy will cache temporary Phase 1 probes in memory, when possible, for use by Phase 2. The caching can result in significant performance improvements by using in-memory lookups into the dimension tables for Phase 2 probes, rather than performing more costly file system probes of the dimension tables, which can result in significant performance improvements. The amount of memory allocated for Phase 1 IN-SET caching is controlled by the STARSIZE= server parameter. You can use the STARJOIN DETAILS option to see which partial results of the Phase 1 IN-SET strategy are cached, and whether sufficient memory was allocated for STARJOIN to cache all partial results.

The IN-SET join strategy uses an IN-SET transformation of dimension table metadata to produce a powerful compound WHERE clause to be used on the STARJOIN fact table. In the term IN-SET, IN refers to an IN specification in the SQL WHERE clause. The IN-SET is the set of values that populate the contents of the SQL IN query expression. For example, in the following SQL WHERE clause, the cities Raleigh, Cary, and Clayton are the values of the IN-SET:

WHERE location.CITY in ("Raleigh", "Cary", "Clayton");

For the IN-SET strategy, Phase I dimension tables are subsetted. Then the resulting set of join keys form the SQL IN expression for the fact table's corresponding join column. You must have simple indexes on all SQL join columns in both the fact table and dimension tables before STARJOIN Phase I can select the IN-SET strategy.

If the dimension table Location has six rows for Raleigh, Cary, and Clayton, then six STORE_NUMBER values are applied to the IN-SET WHERE clause that is used to select the candidate rows from the central fact table. The STARJOIN IN-SET facility transforms the dimension table's CITY values into STORE_NUMBER values that can be used to select candidate rows from the Sales fact table. The transformed WHERE clause that is applied to the fact table might resemble the following code:

WHERE fact.STORE_NUMBER in
 (100,101,102,103,104,105,106);

You can use IN-SET transformations in a star schema that has any number of dimension tables and a fact table. Consider the following example subsetting statement for a dimension table:

WHERE location.CITY in
("Raleigh","Cary","Clayton")
 and Time.SALES_WEEK = 1;

The Sales fact table has no matching CITY column to join with the Location dimension table, and no matching SALES_WEEK column to join with the Time table. Therefore, the IN-SET strategy uses transformations to create a WHERE clause that the Sales fact table can resolve:

WHERE fact.STORE_NUMBER in
 (100,101,102,103,104,105,106)
and Time.SALES_DATE in
 ('01JAN2005'd,'02JAN2005'd,'03JAN2005'd,
  '04JAN2005'd,'05JAN2005'd,'06JAN2005'd,
  '07JAN2005'd,);

The advantage of the STARJOIN facility is that it handles all of the transformations on a fact table, from dimension table subsetting to IN-SET WHERE clauses.

The COMPOSITE join strategy uses a composite index on the fact table to exhaustively probe the full Cartesian product of the combined join keys that is produced by the subsetting of the aggregated dimension table. SPD Server compares the composite indexes on the fact table to the theoretical composite index that is made from all of the join keys in the Phase I dimension tables. Phase I selects the best composite index on the fact table, based on the join requirements of the dimension tables.

A disadvantage of using the COMPOSITE join strategy is that when more than a few join keys exist, the Cartesian product map can become large geometric matrices that can interfere with processing performance. You must have a composite index on the fact table that consists of Phase I dimension table join columns before STARJOIN Phase I can select the COMPOSITE join strategy.

If any Phase I dimension tables contain join predicates that do not have supporting simple or composite indexes on the fact table, those Phase I dimension tables are dropped from Phase I processing and are moved to the Phase II group.

Phase II Performs Index Lookups and Joins Subsetted Fact Table Rows with Phase II Tables

Phase I optimizes the join strategies between the Phase I dimension tables and the candidate rows from the fact table . After Phase I terminates, Phase II takes over. Phase II completes the indicated joins between the candidate rows from the fact table and the corresponding rows in the subsetted Phase I dimension tables. After Phase II completes the joins with the Phase I dimension tables, Phase II performs index lookups from the fact table to the Phase dimension II tables. Phase II dimension tables should have indexes created on all columns that join with the fact table.

When SPD Server completes the STARJOIN Phase I and Phase II tasks, the STARJOIN optimizations have been performed, the STARJOIN strategy has been selected, and the subsetted dimension tables and fact table joins are ready to run and produce the SQL results set that you want.