When
you code a join operation in SAS and the join cannot be passed directly
to a DBMS for processing, SAS performs the join. Normally, this processing
involves individual queries to each data source that belonged to the
join, and SAS performs the join internally. When you join a large
DBMS table and a small SAS data set or DBMS table,
using
the DBKEY=,
DBINDEX=, and MULTI_DATASRC_OPT= options
might enhance performance. These options let
you retrieve a subset of the DBMS data into SAS for the join.
When you specify MULTI_DATASRC_OPT=IN_CLAUSE
for DBMS data sources in a PROC SQL join operation, the procedure
retrieves the unique values of the join column from the smaller table
to construct an IN clause. SAS uses this IN clause when SAS retrieves
data from the larger DBMS table. SAS performs the join. If you use
a SAS data set, no matter how large it is, it is always in the IN_CLAUSE.
For better performance, it is recommended that the SAS data set be
smaller than the DBMS table. If not, processing can be extremely slow.
MULTI_DATASRC_OPT= generates
a SELECT COUNT to determine the size of data sets that are not SAS
data sets. If you know the size of your data set, you can use the
DBMASTER= data set option to designate the larger table.
MULTI_DATASRC_OPT= might
provide performance improvements over DBKEY=. If you specify options,
DBKEY= overrides MULTI_DATASRC_OPT=.
MULTI_DATASRC_OPT= is
used only when SAS is processing a join with PROC SQL. It is not used
for SAS DATA step processing. For certain joins operations, such as
those involving additional subsetting applying to the query, PROC
SQL might determine that it is more efficient to process the join
internally. In these situations, it does not use the MULTI_DATASRC_OPT=
optimization even when specified. If PROC SQL determines it can pass
the join directly to the DBMS it also does not use this option even
though it is specified.
In this example, the
MULTI_DATASRC_OPT= option is used to improve the performance of an
SQL join statement. MULTI_DATASRC_OPT= instructs PROC SQL to pass
the WHERE clause to the
SAS/ACCESS engine with an IN clause built
from the SAS table. The engine then passes this optimized query to
the DBMS server. The IN clause is built from the unique values of
the SAS DeptNo variable. As a result, only rows that match the WHERE
clause are retrieved from the DBMS. Without this option, PROC SQL
retrieves all rows from the Dept table and applies the WHERE clause
during PROC SQL processing in SAS. Processing can be both CPU-, input-,
and output-intensive if the Dept Oracle table is large.
data keyvalues;
deptno=30;
output;
deptno=10;
output;
run;
libname dblib oracle user=testuser password=testpass
path='myorapath' multi_datasrc_opt=in_clause;
proc sql;
select bigtab.deptno, bigtab.loc
from dblib.dept bigtab,
keyvalues smallds
where bigtab.deptno=smallds.deptno;
quit;
The SQL statement that
SAS/ACCESS creates and passes to the DBMS is similar to this one.
SELECT "DEPTNO", "LOC" FROM DEPT WHERE (("DEPTNO" IN (10,30)))
Using DBKEY or DBINDEX
decreases performance when the SAS data set is too large. These options
cause each value in the transaction data set to generate a new result
set (or open cursor) from the DBMS table. For example, if your SAS
data set has 100 observations with unique key values, you request
100 result sets from the DBMS, which might be very expensive. Determine
whether use of these options is appropriate, or whether you can achieve
better performance by reading the entire DBMS table (or by creating
a subset of the table).
DBINDEX= and DBKEY=
are mutually exclusive. If you specify them together, DBKEY= overrides
DBINDEX=. Both of these options are ignored if you specify the
SAS/ACCESS DBCONDITION=
data set option or the SAS WHERE=data
set option.
DBKEY= does not require
that any database indexes be defined; nor does it check the DBMS system
tables. This option instructs SAS to use the specified DBMS column
name or names in the WHERE clause that is passed to the DBMS in the
join.
To improve the performance
of joins, you can also use the DBKEY= option in a SAS DATA step, with
the KEY= option in the SET statement. In this case, specify a value
of KEY=DBKEY. The DATA step below creates a new data file by joining
the data file KEYVALUES with the DBMS table MYTABLE. It uses the variable
DEPTNO with the DBKEY= option to cause
SAS/ACCESS to issue a WHERE
clause.
data sasuser.new;
set sasuser.keyvalues;
set dblib.mytable(dbkey=deptno) key=dbkey;
run;
Note: When you use DBKEY= with
the DATA step MODIFY statement, there is no implied ordering of the
data that is returned from the database. If the master DBMS table
contains records with duplicate key values, using DBKEY= can alter
the outcome of the DATA step. Because SAS regenerates result sets
(open cursors) during transaction processing, the changes that you
make during processing have an impact on the results of subsequent
queries. Therefore, before you use DBKEY= in this context, determine
whether your master DBMS file has duplicate values for keys. Remember
that the REPLACE, OUTPUT, and REMOVE statements can cause duplicate
values to appear in the master table.
The DBKEY= option does
not require or check for the existence of indexes created on the DBMS
table. Therefore, the DBMS system tables are not accessed when you
use this option. The DBKEY= option is preferred over the DBINDEX=
option for this reason. If you perform a join and use PROC SQL, you
must make sure that the columns that you specify through the DBKEY=
option match the columns that you specify in the SAS data set.
CAUTION:
Before
you use the DBINDEX= option, take extreme care to evaluate some characteristics
of the DBMS data.
The number of rows
in the table, the number of rows returned in the query, and the distribution
of the index values in the table are among the factors to take into
consideration. Some experimentation might be necessary to discover
the optimum settings.
You can use the DBINDEX=
option instead of the DBKEY= option if you know that the DBMS table
has one or more indexes that use the column(s) on which the join is
being performed. Use DBINDEX=
index-name if
you know the name of the index, or use DBINDEX=YES if you do not know
the name of the index. Use this option as a data set option, and not
a LIBNAME option, because index lookup can potentially be an expensive
operation.
DBINDEX= requires that
the join table must have a database index that is defined on the columns
involved in the join. If there is no index, all processing of the
join takes place in SAS, where all rows from each table are read into
SAS and SAS performs the join.
Note: NULLCHAR= and
NULLCHARVAL= data
set options determine how SAS missing character values are handled
during DBINDEX= and DBKEY= processing.