Comparing Tables

Problem

You want to detect changes between two tables such as an update table and a master table. For example, a PRICE_COST update table could contain the unit cost and unit price of all products currently being promoted through a channel. At the same time, a PRICE_COST_HIST table could serve as the master table.

You need to accommodate the following inputs and outputs for the comparison:

Input Port 1: The source table
Input Port 2: The comparison table
Output Port 1: Matched: Changed Records
Output Port 2: Unmatched: New Records (source table only)
Output Port 3: Matched: Unchanged Records
Output Port 4: Unmatched: Missing Records (comparison table only)

Solution

You can create a job that contains the Compare Tables transformation. The transformation generates a variety of output for matched and unmatched records. The Compare Tables transformation supports either a direct in-memory lookup (hash object) or a sequential disk-based lookup (DATA step merge). The hash lookup method offers better performance than the DATA step merge lookup method, but it requires that the entire cross-reference table fit into system memory. If sufficient system memory is not available, the DATA step merge lookup method can be used instead.

The sample job uses the direct lookup method. See Use a Sequential Data Merge for information about the data merge method.

Perform the following tasks:

You could use Compare Tables to update the tables as follows:

New PRICE_COST records are added to PRICE_COST_HIST.
Any updates to UNIT_PRICE or UNIT_COST for a PRICE_COST record cause the current historical record to be logically deleted and the updated record is added.
Records that now longer appear on the current PRICE_COST table are logically deleted from the PRICE_COST_HIST table.
All logically deleted records are available for query or reactivation, because they are retained in the table but are marked as deleted.

The transformation can handle new records, changed records, missing records, and unchanged records as output. You can choose to retain or delete any of the possible outputs as needed to increase efficiency. The transformation also generates its results in a single pass of the data.

Tasks

Create and Populate the Job

Perform the following steps to create and populate the job:

Create an empty SAS Data Integration Studio job.
Drag the Compare Tables transformation to the Diagram tab of the job.
Locate the tables that you want to compare, such as CUSTOMER_SOURCE and CUSTOMER_COMPARE. Drag them to the right side of the Diagram tab of the job.
Drag the cursor from each comparison table to Compare Tables transformation. This action connects the comparison tables to the Compare Tables transformations.

The following display shows the completed process flow:

Compare Tables Process Flow

Configure the Job

Perform the following steps to configure the job:

Open the properties window for the Compare Tables transformation and select the Compare Tables tab.
Select the match keys for the comparison in your job in the Available source columns field and move them to the Selected source column mappings field. The sample job uses the column representing the customer number for each record. If necessary, you can map the column to the appropriate target table column on the Mappings tab.
Select one or more source table columns to compare with the comparison table columns. You can also map these columns on the Mappings tab, if needed.

Several columns that contain customer contact information are selected in the sample job, as shown in the following display:

Compare Tables Properties Window
Click Options. Set the Use direct lookup (hash table) option in the General category to Yes. This setting ensures that the Compare Tables transformation will use the hash lookup method, which enhances processing performance. Note that you can also adjust the hash object’s internal table size. Selecting the appropriate value can result in significant performance increases. The internal table size for the sample job is set to 10.

The value of HASHEXP is used as a power-of-two exponent to create the hash table size. For example, a value of 4 for HASHEXP equates to a hash table size of 24, or 16. The maximum value for HASHEXP is 20. You should specify the hash table size relative to the amount of data in the hash object in order to maximize the efficiency of the hash object lookup routines. Try different HASHEXP values until you get the best result.

Note: If your source tables contain any duplicate business keys, remove the keys before you use the hash lookup method by adding a Sort transformation between each source and the transformation. If the source table does not contain duplicate business keys, a sort is not required. A sort is always required, however, when a DATA step merge lookup is performed. Finally, the cross-reference table should be redirected to a register library to improve performance. This approach prevents you from having to recreate the change-digest for each record in the target table on each job execution.
Click OK to save your settings and close the properties window.

Note that you can register the temporary output tables for changed records, new records. unchanged records, and missing records that are attached to the Compare Tables transformation. This capability enables you to create permanent tables to collect these outputs. Right-click each table and click Register Table. Then, use the Register Table window to perform the registration.
Note: You should not delete these temporary output tables and replace them with previously registered tables. In addition, you should always wait until after you have performed the following compare tables configuration:
- adding match keys
- selecting compare columns

Run the Job and View the Output

Perform the following steps to run the job and view the output:

Right-click on an empty area of the job, and click Run in the pop-up menu. SAS Data Integration Studio generates code for the job and submits it to the SAS Application Server for execution.
If error messages are displayed on the Status tab, read and respond to the messages as needed.
If the job completes successfully, you can review the output. Right-click the changed records output table and click Open in the pop-up menu.
Review the output displayed in the View Data window for the changed records table, as shown in the following display:

Changed Records Output
Review the output displayed in the View Data window for the new records table, as shown in the following display:

New Records Output
Review the output displayed in the View Data window for the unchanged records table, as shown in the following display:

Missing Records Output
Review the output displayed in the View Data window for the missing records table, as shown in the following display:

Missing Records Output
Note the following results:
- 1000 changed rows
- 1000 new rows
- 9000 changed rows
- 1000 missing rows

Use a Sequential Data Merge

Perform the following steps to configure the Compare Tables transformation to use a sequential data merge:

Open the properties window for the Compare Tables transformation.
Click Options. Set the Use direct lookup (hash table) option in the General category to No. This setting saves memory at a potential cost to processing performance.
Click OK to save your settings and close the properties window.

Note: The DATA step merge lookup method requires the source table to be presorted by business key with duplicate business keys removed. If the source table is not presorted by business key or the source table contains duplicate business keys, you should add a Sort transformation before you sort and remove duplicate business keys with the Compare Tables transformation. In the Sort transform, navigate to the Options tab and change the first SAS Sort option entitled Remove duplicate records to Remove rows with duplicate keys (NODUPKEY). You should remove duplicate business keys from source tables before you use the Compare Tables transformation to avoid unexpected results.
Insert a Sort transformation between each of the source tables and the Compare Tables transformation.
Open each Sort transformation and click Sort Columns.
Move the columns that you want to use for the match keys in the comparison tables to the Sort by columns field.
Click OK to save your settings and close the properties window.