Generating an Entity Resolution File

DataFlux Data Management Studio 2.6: User Guide

Generating an Entity Resolution File

Overview

You can merge records from multiple files or duplicate records within a single file so that records referring to the same physical object such as an individual, company, or product are treated as a single record. These records are matched based on the information that they have in common. Of course, the more information held in common among the records, the higher the confidence in the match.

You can create a data job to prepare an entity resolution file by performing the following tasks:

Create an Entity Resolution Data Job
Generate Match Codes
Set Clustering Properties
Identify Surviving Records
Prepare the Entity Resolution File and Run the Job

Create an Entity Resolution Data Job

You can create a data job. Then, you can populate it with the nodes that you need to merge your data and generate an entity resolution file. Perform the following steps:

Open a data job.
Open the Data Inputs folder in the Nodes tree. Select the Data Source node and drop it into the data flow.
Open the Entity Resolution folder in the Nodes tree. Select the Match Codes node and drop it into the data flow. Then, connect the Data Source node to the node.
Select the Clustering node and drop it into the data flow. Then, connect the Match Codes node to the Clustering node.
Select the Surviving Record Identification node and drop it into the data flow. Then, connect the Clustering node to the Surviving Record Identification node.
Open the Data Outputs folder in the Nodes tree. Select the Entity Resolution File Output node and drop it into the data flow. Then, connect the Surviving Record Identification node to the Entity Resolution File Output node.
Double-click the Data Source node to display its properties window. Specify an input table that contains the data that you need to merge and process for entity resolution. For example, you could select a Contacts table that combines data from the other tables in the data connection. You could also move all of its fields to the Selected field in the Output fields section.

Note Note: Because the data source for the Entity Resolution File Output node must have a primary key, only fields from the original source table should be selected as output fields. If a data job pushes copies of rows, then the input to the Entity Resolution Output node would contain more than one row with the same primary key value. The Entity Resolution Editor has no way to retrieve field values for separate rows that have the same primary key, so you must have a single primary key for the Entity Resolution Editor to find all the records for a cluster.

Note Note: You can use a text file as an input instead of a database table. To do this, replace the Data Source node with a Text File node and select a delimited text file as the input. Then click Options in the Entity Resolution File Output node and click Embed field data in the output file. When you view the entity resolution file created in such a job, select Embedded data in the Entity Resolution file in the Data sources section of the Properties tab of the entity resolution viewer.

Click OK to save the properties and close the window.

The following display shows a sample entity resolution flow:

Generate Match Codes

You need to set the properties for the Match Codes node to determine how match codes are generated in the job. Perform the following steps:

Double-click the Match Codes node to display its properties window.
Select a Quality Knowledge Base locale for the match codes.
If you want to enable multiple match codes for source fields, select the Allow generation of multiple match codes per definition for each sensitivity check box. Multiple match codes enable you to assign a source field to multiple clusters. This function can be useful when the clustering algorithm cannot figure out the one best cluster to place a field. Instead, the job can generate multiple target records that can be distributed to multiple related record clusters where they can be resolved in the entity resolution process.
Select your match code fields and move them to the Selected field in the Match code fields section. You need to use the drop-down menus to specify a definition and sensitivity for each match code field. Note that an output name is created for each match code field that you create.
Click Additional Outputs to add the fields to the output. You can add all of the fields in the table for this example.
Click OK to save the properties and close the window.

Set Clustering Properties

You need to set the properties for the Clustering node to set the parameters for the clusters identified in your entity resolution file. Perform the following steps:

Double-click the Clustering node to display its properties window.
Specify a name for the cluster ID field, such as Clusters.
Set cluster field values such as treating blank field values as nulls, sorting output by cluster numbers, and including both single- and multi_row clusters.
Specify your cluster conditions. In this case, I am specifying CONTACT_MatchCode_Score and CONTACT_MatchCode conditions.
Click Additional Outputs to add the fields to the output. You can add all of the available fields.
Click OK to save the properties and close the window.

Identify Surviving Records

You need to set the properties for the Surviving Record Identification node to select a cluster ID field and the output fields for the entity resolution file. Perform the following tasks:

Double-click the Surviving Record Identification node to display its properties window.
Click the drop-down menu in Cluster ID field and select the ID that you specified in the Clustering node (Clusters).
Review the Record rules field to examine the rules associated with selected cluster ID. Add or delete rules as needed.
Move all of the output fields and clusters in the Available field to the Selected field.
Click OK to save the properties and close the window.

Prepare the Entity Resolution File and Run the Job

You need to set the properties for the Entity Resolution File Output node to set parameters for the entity resolution file. Perform the following tasks:

Double-click the Entity Resolution File Output node to display its properties window.
Specify the properties for the entity resolution file. The properties for the sample job are displayed in the following table:

Property	Value
Cluster ID field	Clusters
Source table	Contracts
Output file	Specify a field in an accessible repository
Display file after job runs	Selected
Options	Confidence value field: CONTACT_MatchCode_Score Surviving record ID field: SRID
Target	Source table: Commit every row Data removal: Delete duplicate records Delete flag fields: ID (populated from selected primary keys section in entity resolution properties) Audit file name: Specify any convenient value
Output fields	Specify all of the table fields (but not the clusters, SRID, and match codes. The SRID file (.SRI) must be output to the default location. This happens automatically if you when you click the "..." button.)

Click OK to save the properties and close the window.
Run the job. The following display shows a portion of the log for the job:

Note that if you selected Display file after job runs, the entity resolution file is displayed after a successful job submission. You can inspect the log by clicking the tab for the job.

Documentation Feedback: yourturn@sas.com
Note: Always include the Doc ID when providing documentation feedback.

Doc ID: DMCust_EntityResJob.html