Optimizing Your SQL Usage |
When you code a join operation in SAS, and the join cannot be passed directly to a DBMS for processing, the join is performed by SAS. Normally, this processing will involve individual queries to each data source that belonged to the join, and the join being performed internally by SAS. When you join a large DBMS table and a small SAS data set or DBMS table, using the DBKEY= , DBINDEX=, and MULTI_DATASRC_OPT= options might enhance performance. These options enable you to retrieve a subset of the DBMS data into SAS for the join.
When MULTI_DATASRC_OPT=IN_CLAUSE is specified for DBMS data sources in a PROC SQL join operation, the procedure retrieves the unique values of the join column from the smaller table to construct an IN clause. This IN clause is used when SAS is retrieving the data from the larger DBMS table. The join is performed in SAS. If a SAS data set is used, no matter how large, it is always in the IN_CLAUSE. For better performance, it is recommended that the SAS data set be smaller than the DBMS table. If not, processing can be extremely slow.
MULTI_DATASRC_OPT= generates a SELECT COUNT to determine the size of data sets that are not SAS data sets. If you know the size of your data set, you can use DBMASTER= Data Set Option to designate the larger table.
MULTI_DATASRC_OPT= might provide performance improvements over DBKEY=. If you specify options, DBKEY= overrides MULTI_DATASRC_OPT=.
MULTI_DATASRC_OPT= is used only when SAS is processing a join with PROC SQL. It is not used for SAS DATA step processing. For certain joins operations, such as those involving additional subsetting applying to the query, PROC SQL might determine that it is more efficient to process the join internally. In these situations it does not use the MULTI_DATASRC_OPT= optimization even when specified. If PROC SQL determines it can pass the join directly to the DBMS it also does not use this option even though it is specified.
In this example, the MULTI_DATASRC_OPT= option is used to improve the performance of an SQL join statement. MULTI_DATASRC_OPT= instructs PROC SQL to pass the WHERE clause to the SAS/ACCESS engine with an IN clause built from the SAS table. The engine then passes this optimized query to the DBMS server. The IN clause is built from the unique values of the SAS DeptNo variable. As a result, only rows that match the WHERE clause are retrieved from the DBMS. Without this option, PROC SQL retrieves all rows from the Dept table and applies the WHERE clause during PROC SQL processing in SAS. Processing can be both CPU-intensive and I/O-intensive if the Oracle Dept table is large.
data keyvalues; deptno=30; output; deptno=10; output; run; libname dblib oracle user=testuser password=testpass path='myorapath' multi_datasrc_opt=in_clause; proc sql; select bigtab.deptno, bigtab.loc from dblib.dept bigtab, keyvalues smallds where bigtab.deptno=smallds.deptno; quit;
The SQL statement that SAS/ACCESS creates and passes to the DBMS is similar to the following
SELECT "DEPTNO", "LOC" FROM DEPT WHERE (("DEPTNO" IN (10,30)))
Using DBKEY or DBINDEX decreases performance when the SAS data set is too large. These options cause each value in the transaction data set to generate a new result set (or open cursor) from the DBMS table. For example, if your SAS data set has 100 observations with unique key values, you request 100 result sets from the DBMS, which might be very expensive. Determine whether use of these options is appropriate, or whether you can achieve better performance by reading the entire DBMS table (or by creating a subset of the table).
DBINDEX= and DBKEY= are mutually exclusive. If you specify them together, DBKEY= overrides DBINDEX=. Both of these options are ignored if you specify the SAS/ACCESS data set option DBCONDITION= or the SAS data set option WHERE=.
DBKEY= does not require that any database indexes be defined; nor does it check the DBMS system tables. This option instructs SAS to use the specified DBMS column name or names in the WHERE clause that is passed to the DBMS in the join.
The DBKEY= option can also be used in a SAS DATA step, with the KEY= option in the SET statement, to improve the performance of joins. You specify a value of KEY=DBKEY in this situation. The following DATA step creates a new data file by joining the data file KEYVALUES with the DBMS table MYTABLE. The variable DEPTNO is used with the DBKEY= option to cause SAS/ACCESS to issue a WHERE clause.
data sasuser.new; set sasuser.keyvalues; set dblib.mytable(dbkey=deptno) key=dbkey; run;
Note: When you use DBKEY= with the DATA step MODIFY statement, there is no implied ordering of the data that is returned from the database. If the master DBMS table contains records with duplicate key values, using DBKEY= can alter the outcome of the DATA step. Because SAS regenerates result sets (open cursors) during transaction processing, the changes you make during processing have an impact on the results of subsequent queries. Therefore, before you use DBKEY= in this context, determine whether your master DBMS file has duplicate values for keys. Remember that the REPLACE, OUTPUT, and REMOVE statements can cause duplicate values to appear in the master table.
The DBKEY= option does not require or check for the existence of indexes created on the DBMS table. Therefore, the DBMS system tables are not accessed when you use this option. The DBKEY= option is preferred over the DBINDEX= option for this reason. If you perform a join and use PROC SQL, you must ensure that the columns that are specified through the DBKEY= option match the columns that are specified in the SAS data set.
The number of rows in the table, the number of rows returned in the query, and the distribution of the index values in the table are among the factors to take into consideration. Some experimentation might be necessary to discover the optimum settings.
You can use the DBINDEX= option instead of the DBKEY= option if you know that the DBMS table has one or more indexes that use the column(s) on which the join is being performed. Use DBINDEX=index-name if you know the name of the index, or use DBINDEX=YES if you do not know the name of the index. Use this option as a data set option, and not a LIBNAME option, because index lookup can potentially be an expensive operation.
DBINDEX= requires that the join table must have a database index that is defined on the columns involved in the join. If there is no index, then all processing of the join takes place in SAS, where all rows from each table are read into SAS and SAS performs the join.
Note: The data set options NULLCHAR= and NULLCHARVAL= determine how SAS missing character values are handled during DBINDEX= and DBKEY= processing.
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.