When to Use Suggestion-Based Matching
Switching from combination-based matching to suggestion-based matching might be useful in the following situations:
- An increase in computation time, storage, and/or complexity in the deduplication workflow is acceptable, in exchange for higher accuracy. As an example, when compared to a Match Codes node that uses the combination-based Name match definition, a Match Codes node that uses the Name (with Suggestions) match definition shipped in QKB CI 2011A (ENUSA locale) runs several times more slowly and produces many more match code output rows, but also greatly reduces the error rate in the following clustering step.
- The recommended best practice of using low sensitivities for a single match definition, and then reducing false matches by combining match codes over multiple data types (for example, Name and ZIP code), is not practical. This is common when records contain only a single piece of information, such as an entity name, or when the other data types do not contain information that is reliable or relevant enough for the application at hand. When it is not possible to match over multiple data types, the individual match definition must be more discriminating.
- The records that fail to match (but should match) differ from each other at the sub-token or character level. Some of these differences are handled implicitly by the transformation schemes, regexes, and phonetic reduction rules in traditional match definitions. However, with these libraries alone, it is difficult to address character-level errors comprehensively without over-generalizing. For example, traditional match definitions cannot easily match records that differ by a transposition of characters.
Note: In situations where the difficulty arises at the inter-token level (that is, from transposition of certain tokens or missing tokens), combination-based matching should be used.
- The desired behavior is effectively equivalent to allowing a record to appear in more than one cluster. For example, "LINA" should match "LINDA" and/or "LENA", but "LINDA" should not match "LENA". This is not possible with traditional matching, because each record must appear in exactly one cluster.
- A numeric score is needed to aid decision making when resolving clusters or selecting survivors. Each suggestion-based match code has a score attached, and after clustering, these scores are aggregated for the cluster. These scores can be used during manual review of clusters. The presence of scores also paves the way for business rules that automatically resolve clusters and select surviving records.
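The following Python sketch illustrates, in a simplified form, three of the ideas in the list above: character-level differences (including transpositions), a record that matches two others that do not match each other, and scores attached to match codes. It is not the QKB implementation. The vocabulary, the use of optimal string alignment distance, the scoring formula, and the function names (osa_distance, suggestions, matching_pairs) are all assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical vocabulary of known names (an assumption for this sketch).
VOCABULARY = ["LINDA", "LENA"]


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: each insertion, deletion,
    substitution, or transposition of adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or exact match
            )
            # Adjacent transposition, for example "NI" versus "IN".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggestions(name: str, max_distance: int = 1) -> dict:
    """Return {match_code: score} for one input name.

    A name found in the vocabulary yields only itself, with score 100.
    Any other name yields every vocabulary entry within max_distance.
    The scoring formula (100 - 10 * distance) is invented for this sketch.
    """
    if name in VOCABULARY:
        return {name: 100}
    result = {}
    for word in VOCABULARY:
        dist = osa_distance(name, word)
        if dist <= max_distance:
            result[word] = 100 - 10 * dist
    return result


def matching_pairs(records: list) -> dict:
    """Pair up records that share at least one match code.

    The pair score is the best shared code, where a shared code counts
    as the lower of the two records' scores for that code.
    """
    codes = {r: suggestions(r) for r in records}
    pairs = {}
    for a, b in combinations(records, 2):
        shared = codes[a].keys() & codes[b].keys()
        if shared:
            pairs[(a, b)] = max(min(codes[a][c], codes[b][c]) for c in shared)
    return pairs


if __name__ == "__main__":
    for pair, score in matching_pairs(["LINA", "LINDA", "LENA"]).items():
        print(pair, score)
```

Under these assumptions, "LINA" produces two match codes and therefore pairs with both "LINDA" and "LENA", while "LINDA" and "LENA" share no code and remain unmatched. An input such as "LNIDA" is one transposition away from "LINDA", so it would also receive the "LINDA" code.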
Note: Suggestion-based matching is not available in all locales. In locales that do not support suggestion-based matching, the system will not permit you to add a Suggestions node to a match definition.