DataFlux Data Management Studio 2.6: User Guide
Working with an Entity Resolution File
Overview
You can examine the entity resolution file generated in an entity resolution job. These files reflect the match code, clustering, survivor identification, and entity resolution output file settings made in the job.
Note: Some entity resolution jobs use a text file as an input and add embedded field data to the Entity Resolution File Output node. When you view the entity resolution file created in such a job, select Embedded data in the Entity Resolution file in the Data sources section of the Properties tab of the entity resolution viewer.
You can review the entity resolution file that is displayed at the end of an entity resolution job, or you can select a file from the Entity Resolution folder in the Folders tree. To examine the file, perform the tasks described in the following sections:
Examine Clusters
You can use the Cluster tab and the Cluster Analysis tab to examine the clusters. Perform the following steps:
- Review the list of clusters. Note that you can go to a specific cluster. You can also filter the clusters list.
- Look for resolved clusters, which are marked by a check mark in the Action column, as shown in the following display:
You can double-click a resolved cluster to examine it in the Cluster Records tab.
- Review the records for the cluster in the Records tab in the Details pane. (If you do not see the Details pane, click Show Details in the toolbar.) Then, review the Related Clusters and Notes tabs as needed.
- Examine the charts in the Cluster Analysis pane. Start with the Record Count Distribution chart, which shows how many clusters in the entity resolution file have a specific record count. For example, when you place your cursor over a bar, you might see that 208 clusters have a record count of four.
- Examine the Record Count/Confidence/Related Clusters chart. The triangles in the chart represent clusters that have related clusters, while the points mark clusters that lack related clusters. You can put your cursor over a triangle or a point to see the cluster number and information about the record count, the confidence level, and the number of related clusters. The chart is as shown in the following display:
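Outside the viewer, the idea behind the Record Count Distribution chart can be illustrated with a short sketch. The example below is not part of DataFlux Data Management Studio. It assumes a hypothetical list of (cluster ID, record ID) pairs, with invented sample data, and counts how many clusters have each record count.

```python
from collections import Counter

# Hypothetical sample data: one (cluster_id, record_id) pair per clustered record.
# The layout and the values are assumptions made for illustration only.
rows = [
    (1, "r1"), (1, "r2"), (1, "r3"), (1, "r4"),
    (2, "r5"),
    (3, "r6"), (3, "r7"), (3, "r8"), (3, "r9"),
    (4, "r10"), (4, "r11"),
]


def record_count_distribution(rows):
    """Map each record count to the number of clusters that have that count."""
    # First count the records in each cluster ...
    records_per_cluster = Counter(cluster_id for cluster_id, _ in rows)
    # ... then count how many clusters share each record count.
    return Counter(records_per_cluster.values())


distribution = record_count_distribution(rows)
for record_count in sorted(distribution):
    clusters_with_count = distribution[record_count]
    print(f"{clusters_with_count} cluster(s) have a record count of {record_count}")
```

Each printed line corresponds to one bar in a chart of this kind: the record count is the bar's position and the number of clusters is its height.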
Review and Resolve Related Clusters
Some of the clusters in the clusters list might have related clusters. You must resolve the related records before you resolve the cluster in the list, because you cannot apply changes to a cluster while it has active related records. To resolve a related record, remove it from either the current cluster or its related clusters. Repeat this process for all of the related records in the current cluster, and then apply changes for the current cluster.
Related clusters are created when you enable the generation of multiple match codes in the properties for the Match Codes node in the entity resolution job. Multiple match codes enable you to assign a source field to multiple clusters.
This function can be useful when the clustering algorithm cannot determine the single best cluster in which to place a field. Instead, the job can generate multiple target records that can be distributed to multiple related record clusters. You can review related clusters from the Cluster tab. You can also use the Cluster Records tab in the entity resolution file to assign related records to the most appropriate clusters. Perform the following steps:
- Click a cluster that has related clusters. You can find them with either the Related Clusters columns in the clusters list or the Record Count/Confidence/Related Clusters chart.
- Click the Related Clusters tab in the Details pane. Then, click one of the related clusters to review its records. Note that you can click Find in Cluster Records to display the record in the Records tab.
- Double-click a cluster that has related clusters to access the Related Clusters tab in conjunction with the Cluster Records tab.
- Click a cluster in the related clusters list and review its related cluster records.
- Click a record in the related cluster records. You can remove the record from the selected cluster or remove the record from the other related clusters. You can also restore all of the records that you have removed and find a selected record in the Records tab.
The following display shows a Related Clusters tab with a cluster selected and a record selected for resolution:
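The logic of resolving related clusters can also be sketched in code. The example below is a simplified model, not the product's implementation. It assumes that two clusters are related when they share at least one record, and every name in it (`related_clusters`, `remove_from_others`, `can_apply_changes`, the sample data) is hypothetical.

```python
def related_clusters(clusters, cluster_id):
    """Return the IDs of other clusters that share a record with cluster_id."""
    records = clusters[cluster_id]
    return sorted(
        other
        for other, members in clusters.items()
        if other != cluster_id and records & members
    )


def remove_from_others(clusters, cluster_id, record):
    """Keep the record in cluster_id and remove it from every other cluster."""
    for other, members in clusters.items():
        if other != cluster_id:
            members.discard(record)


def can_apply_changes(clusters, cluster_id):
    """Assumed rule: changes can be applied only when no related clusters remain."""
    return not related_clusters(clusters, cluster_id)


# Hypothetical sample data: cluster ID -> set of record IDs.
# Record "x" was placed in clusters 10 and 11, so those two clusters are related.
clusters = {
    10: {"a", "b", "x"},
    11: {"c", "x"},
    12: {"d", "e"},
}

print(related_clusters(clusters, 10))    # clusters related to cluster 10
print(can_apply_changes(clusters, 10))   # blocked while "x" is shared

remove_from_others(clusters, 10, "x")    # decide that "x" belongs in cluster 10
print(can_apply_changes(clusters, 10))   # no shared records remain
```

In this model, the alternative resolution (removing the record from the current cluster instead) would be `clusters[10].discard("x")`. Either choice leaves the record in only one cluster.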
Process Cluster Records
You can process cluster records on the Cluster Records tab. Perform the following steps:
- Double-click a cluster to open it in the Cluster Records tab.
- Compare the list of cluster records to the surviving record created in the entity resolution job.
- Select an action to resolve the entity. You can preserve one record, preserve all records, or delete all records. For this example, select a record and click Preserve One Record, as shown in the following display:
- Delete any duplicates of the selected record.
- Click Apply to resolve the cluster record. The following display shows the resolved cluster:
Doc ID: DMCust_EntityResViewer.html