Using Match Codes to Improve Record Matching

Problem

You want to use match codes to improve the quality of record-matching operations in jobs. Comparing match codes instead of actual data enables you to identify records that are in fact the same entity, despite minor variations in the data.

Solution

There are a number of ways to use match codes in SAS Data Integration Studio jobs. You can select Use Match Definition when this option is available for a transformation, as described in Configure the Apply Lookup Standardization Transformation. You can create a data service in DataFlux Data Management Studio that generates match codes and clustering information, and then call that service in a SAS Data Integration Studio job. For more information, see Using a DataFlux Data Service in a SAS Data Integration Studio Job.
You can also create a job in SAS Data Integration Studio that uses the Create Match Code transformation, as described in the “Tasks” section below. You would perform the following tasks:

Tasks

Verify Prerequisites

The Create Match Code transformation that is used in this topic requires SAS Data Quality Server. One or more locales must be available to SAS Data Integration Studio, as described in Global Options on the Data Quality Tab. Locales have a set of default match definitions that can be used to generate match codes. Assume that the sample job for this topic uses the standard match definitions for the ENUSA locale.

Create and Populate the Job

Match codes can be used to identify members of the same household in a set of demographic data. In order to do that, you could create a job flow that reads a table of demographic data (CONTACTS); generates match codes and cluster numbers for records that have the same last name and street address, and then writes the match codes and cluster numbers to a target table (CONTACTS_OFFICE_CLUSTER). The flow would look similar to the following figure.
Create Match Code Job Flow
Create Match Code Job Flow
  1. Create an empty SAS Data Integration Studio job.
  2. From the Data folder in the Transformations tree, select and drag a Create Match Code transformation and drop it in the empty job on the Diagram tab in the Job Editor window.
  3. Select and drag the source table from its folder and drop it before the Create Match Code transformation on the Diagram tab.
  4. Drag the cursor from the source table to the input port of the Create Match Code transformation. This action connects the transformation to the source. In this example, the source is a table of contact information called CONTACTS, which contains a large number of records. The data has not been standardized, so the spelling of names and addresses might differ while still referring to the same entities. The following display depicts the source data. When the job is run, rows 1004 and 1005 receive the same cluster number, as do rows 1007 and 1008, despite the fact that the data varies in the COMPANY and ADDRESS rows.
    Source Data in the CONTACTS Table
    Source Data in the CONTACTS Table
  5. From the Transformations tab, under Access, drag a Table Loader transformation into the job and drop it after the Create Match Code transformation.
  6. Select and drag from the transformation's temporary output table to the Table Loader transformation. This action connects the output of the transformation to the Table Loader. The Table Loader is used to ensure that the target is always completely overwritten each time the job is run. This default configuration for the Table Loader is depicted in the following display of the Table Loader's Load Technique tab.
    Using the Table Loader to Overwrite the Target
    Using the Table Loader to Overwrite the Target
  7. Select and drag the target table from its folder and drop it after the Table Loader transformation on the Diagram tab. In this example, the target is named CONTACTS_OFFICE_CLUSTER. The target contains the same columns as the source, plus a numeric column named CLUSTER and a character column named MATCH CODE (length 120).
  8. Drag the cursor from the output port of the Table Loader transformation to the target table. This action connects the transformation to the target.
  9. To propagate and map columns, right-click the Create Match Codes transformation and select Propagate Columnsthen selectTo Selected Transformation's Sources Sourcesthen selectFrom Targets. This action maps the source columns to the target and propagates the new columns in the target into the Create Match Codes transformation.
    The job flow should now look similar to Create Match Code Job Flow.

Configure the Create Match Code Transformation

Perform the following steps to configure the Create Match Code transformation:
  1. In the Job Editor, double-click the Create Match Code transformation to display its properties window.
  2. In the properties window, click the Match Code tab.
  3. In the Locale field, select the locale that is most appropriate for your data. In this example, the locale is ENUSA.
  4. In the Cluster Column field, select the new cluster column, which is named CLUSTER in this example.
  5. In the Match code column field, select the new match code column, which is MATCH_CODE in this example.
  6. Set up one or more conditions that determine the assignment of cluster numbers. For this example, in the Match Definition column, for the ADDRESS column, pull down the list of available match definitions and select Address. In the Sensitivity column, leave the default value of 85 . A lower number is a less-exact match.
  7. Repeat step 6 for the COMPANY column. Choose Organization as the match definition and leave the sensitivity value at 85.
  8. For the STATE column, choose the State match definition and leave the sensitivity setting of 85. The following display shows the completed Match Code tab:
    Fully Configured Match Code Tab
    Fully Configured Match Code Tab
  9. Click OK to save your input and close the properties window. The job is now ready to run.

Run the Job and View the Output

Perform the following steps to run the job and view the output:
  1. Run the job.
  2. If the job completes without error, go to the next step. If error messages appear, read and respond to the messages.
  3. Right-click the target table and select View Data. The following display depicts the cluster and match code columns in the target.
    Cluster Numbers and Match Codes in the Target Table
    Cluster Numbers and Match Codes in the Target Table

Usage Notes

ERROR: Failure in the clustering engine. If you run a job that generates clustering information, and the job fails with this error in the log, try increasing the amount of memory that is allocated to the SAS Application Server that executes the job. To increase the memory allocation, set the option –maxmemquery to a higher value in the sasv9_usermods.cfg file. For example, you might set the option as follows:-maxmemquery 600M