Advanced Topics for Users |
A BY key is similar to a BY group in SAS, which groups observations based on one or more fields. Many SAS procedures process records in BY groups. Also, some updates in the DATA step are performed by matching specified BY variables in different data sets. A similar matching process occurs with BY-key items in the SAS/ACCESS interface to SYSTEM 2000. Use the BY-key capability to eliminate redundancy and to help the interface view engine find an existing path for inserting the new records.
Each time the interface view engine is called to insert an observation, it inspects the changes you made from observation to observation, in order to determine how many data records to insert into the database.
If none of the data changed, or if the changes were only at the lowest level of the view, the engine needs to insert only a single new data record at the lowest level. Because the engine inserts at least one record for any addition, and only one record is called for here, there is no question about how many records to insert, that is, the insert is not ambiguous.
However, if any data values changed in records above the lowest record in the path, an ambiguous situation occurs. A specific number of new records seem to be required by your changes, but some of the new data might already exist in the database records. That is, the actual number of new records to be added to the database might be different.
In insert mode, the engine can determine whether some of the new data already exists in a record. If the data exists, the engine needs to insert records only for the data that does not exist in the database. If the data does not exist, the engine needs to insert a record at every level.
In optimized load mode, the engine ignores the ambiguity; it inserts all of the new data that is at or below the highest-level record that changed. Therefore, when you specify optimized load mode, make sure that your incoming data is always sorted by major-to-minor sort keys at every level (from level 0 down to the lowest level in the view). If the data is not sorted correctly, redundancy will occur.
If you specify a BY key, it should contain one or more database items at each level above the lowest level in your view descriptor.
BY keys cause extra processing time because the engine issues one or more where-clauses to look for already-existing records.
Examples Using a BY Key |
You have a view with C1 and C11 in the BY key and three observations.
Suppose you are using the FSEDIT procedure on observation 1, and you issue the DUP command and enter values A, CCC, and 4. This is not an ambiguous insert; a BY key is not required. The changes in values from observation 1 to your new input are confined to the lowest level of the view. Here is the result.
Now, suppose you are using the FSEDIT procedure on observation 1, and you issue an ADD or a DUP command and enter the values B, DDD, and 5 for C1, C11, and C21, respectively. The insert is ambiguous because all the fields in the new observation are different from observation 1. Without a BY key, the result is
With a BY key, the engine finds the BY key values C1=B and C11=DDD in the database. Then, the result is
BY-Key Effects on Performance |
The recommended way to use BY keys is to
include an item at every level above the lowest level of the view descriptor
standardize all database updates through the same view or through consistent views.
The engine does not enforce that a BY key must contain at least one item at every level above the lowest level in the view descriptor. However, if the BY key does not contain enough unique items, it might be inadequate to help the engine. The engine might behave as if there were no BY key.
The engine does not enforce consistent use of BY keys; one view descriptor might have a BY key and another might not. In this instance, redundant data could be added to the database through the view descriptor that does not have a BY key. Also, some applications that use the QUEST procedure could enter redundant data. PROC QUEST does not call the engine for database updates.
If data is added in any way other than through a view descriptor using a BY key, the engine might find several qualified database records that match the incoming data. The engine would pick one record that works and use it when inserting the new records, and the incoming data might be attached beneath a different existing record than the one you expect.
To avoid this, make sure that all users who update the database follow the same rules. That is, ensure that all data entry is performed through the interface view engine and that all users use the same view descriptor (or consistent view descriptors).
In addition, the content of a prior observation is important during inserts because the engine compares your new data to it. The prior observation is obvious for SAS procedures that pass through a file sequentially, such as the DBLOAD procedure. However, other SAS procedures can pass randomly through a file, such as the FSEDIT procedure.
When you add observations by using procedures that do not use sequential processing, remember that the prior observation is the last observation that the procedure showed you. For example, in the FSVIEW procedure, the prior observation is the last observation that the procedure displayed at the bottom of your monitor before your first update.
In some instances, there is no prior observation, such as when you use the DBLOAD procedure. PROC DBLOAD calls the engine to add an observation without any prior retrieval. If this occurs, the engine issues a GET1 ... LAST command for the record at the top of the view and retrieves the last record that was inserted into the database.
Copyright © 2007 by SAS Institute Inc., Cary, NC, USA. All rights reserved.