Using the DBINDEX=, DBKEY=, and MULTI_DATASRC_OPT= Options

When you code a join operation in SAS and the join cannot be passed directly to a DBMS for processing, SAS performs the join. Normally, this processing involves individual queries to each data source that belonged to the join, and SAS performs the join internally. When you join a large DBMS table and a small SAS data set or DBMS table, using the DBKEY=, DBINDEX=, and MULTI_DATASRC_OPT= options might enhance performance. These options let you retrieve a subset of the DBMS data into SAS for the join.

When you specify MULTI_DATASRC_OPT=IN_CLAUSE for DBMS data sources in a PROC SQL join operation, the procedure retrieves the unique values of the join column from the smaller table to construct an IN clause. SAS uses this IN clause when SAS retrieves data from the larger DBMS table. SAS performs the join. If you use a SAS data set, no matter how large it is, it is always in the IN_CLAUSE. For better performance, it is recommended that the SAS data set be smaller than the DBMS table. If not, processing can be extremely slow.

MULTI_DATASRC_OPT= generates a SELECT COUNT to determine the size of data sets that are not SAS data sets. If you know the size of your data set, you can use the DBMASTER= data set option to designate the larger table.

MULTI_DATASRC_OPT= might provide performance improvements over DBKEY=. If you specify options, DBKEY= overrides MULTI_DATASRC_OPT=.

MULTI_DATASRC_OPT= is used only when SAS is processing a join with PROC SQL. It is not used for SAS DATA step processing. For certain joins operations, such as those involving additional subsetting applying to the query, PROC SQL might determine that it is more efficient to process the join internally. In these situations, it does not use the MULTI_DATASRC_OPT= optimization even when specified. If PROC SQL determines it can pass the join directly to the DBMS it also does not use this option even though it is specified.

In this example, the MULTI_DATASRC_OPT= option is used to improve the performance of an SQL join statement. MULTI_DATASRC_OPT= instructs PROC SQL to pass the WHERE clause to the SAS/ACCESS engine with an IN clause built from the SAS table. The engine then passes this optimized query to the DBMS server. The IN clause is built from the unique values of the SAS DeptNo variable. As a result, only rows that match the WHERE clause are retrieved from the DBMS. Without this option, PROC SQL retrieves all rows from the Dept table and applies the WHERE clause during PROC SQL processing in SAS. Processing can be both CPU-, input-, and output-intensive if the Dept Oracle table is large.

data keyvalues;
   deptno=30;
   output;
   deptno=10;
   output;
run;

libname dblib oracle user=testuser password=testpass
     path='myorapath' multi_datasrc_opt=in_clause;

proc sql;
   select bigtab.deptno, bigtab.loc
     from dblib.dept bigtab,
         keyvalues smallds
     where bigtab.deptno=smallds.deptno;
quit;

The SQL statement that SAS/ACCESS creates and passes to the DBMS is similar to this one.

SELECT "DEPTNO", "LOC" FROM DEPT WHERE (("DEPTNO" IN (10,30)))

Using DBKEY or DBINDEX decreases performance when the SAS data set is too large. These options cause each value in the transaction data set to generate a new result set (or open cursor) from the DBMS table. For example, if your SAS data set has 100 observations with unique key values, you request 100 result sets from the DBMS, which might be very expensive. Determine whether use of these options is appropriate, or whether you can achieve better performance by reading the entire DBMS table (or by creating a subset of the table).

DBINDEX= and DBKEY= are mutually exclusive. If you specify them together, DBKEY= overrides DBINDEX=. Both of these options are ignored if you specify the SAS/ACCESS DBCONDITION= data set option or the SAS WHERE=data set option.

DBKEY= does not require that any database indexes be defined; nor does it check the DBMS system tables. This option instructs SAS to use the specified DBMS column name or names in the WHERE clause that is passed to the DBMS in the join.

To improve the performance of joins, you can also use the DBKEY= option in a SAS DATA step, with the KEY= option in the SET statement. In this case, specify a value of KEY=DBKEY. The DATA step below creates a new data file by joining the data file KEYVALUES with the DBMS table MYTABLE. It uses the variable DEPTNO with the DBKEY= option to cause SAS/ACCESS to issue a WHERE clause.

data sasuser.new;
   set sasuser.keyvalues;
   set dblib.mytable(dbkey=deptno) key=dbkey;
run;

Note: When you use DBKEY= with the DATA step MODIFY statement, there is no implied ordering of the data that is returned from the database. If the master DBMS table contains records with duplicate key values, using DBKEY= can alter the outcome of the DATA step. Because SAS regenerates result sets (open cursors) during transaction processing, the changes that you make during processing have an impact on the results of subsequent queries. Therefore, before you use DBKEY= in this context, determine whether your master DBMS file has duplicate values for keys. Remember that the REPLACE, OUTPUT, and REMOVE statements can cause duplicate values to appear in the master table.

The DBKEY= option does not require or check for the existence of indexes created on the DBMS table. Therefore, the DBMS system tables are not accessed when you use this option. The DBKEY= option is preferred over the DBINDEX= option for this reason. If you perform a join and use PROC SQL, you must make sure that the columns that you specify through the DBKEY= option match the columns that you specify in the SAS data set.

CAUTION:

Before you use the DBINDEX= option, take extreme care to evaluate some characteristics of the DBMS data.

The number of rows in the table, the number of rows returned in the query, and the distribution of the index values in the table are among the factors to take into consideration. Some experimentation might be necessary to discover the optimum settings.

You can use the DBINDEX= option instead of the DBKEY= option if you know that the DBMS table has one or more indexes that use the column(s) on which the join is being performed. Use DBINDEX=index-name if you know the name of the index, or use DBINDEX=YES if you do not know the name of the index. Use this option as a data set option, and not a LIBNAME option, because index lookup can potentially be an expensive operation.

DBINDEX= requires that the join table must have a database index that is defined on the columns involved in the join. If there is no index, all processing of the join takes place in SAS, where all rows from each table are read into SAS and SAS performs the join.

Note: NULLCHAR= and NULLCHARVAL= data set options determine how SAS missing character values are handled during DBINDEX= and DBKEY= processing.