When to Use Combination-Based Matching
Switching from legacy matching to combination-based matching can be useful in the following situations.
- An increase in computation time, storage, or complexity in the deduplication workflow is acceptable in exchange for increased flexibility or accuracy. For example, compared to a Matchcodes node that uses the legacy Address (Full) definition, a Matchcodes node that uses the Address (Full) (with Combinations) match definition shipped in CI 2011A for the English (United Kingdom) locale runs several times slower and produces many more matchcode output rows, but it also greatly reduces the error rate in the subsequent clustering step.
- The recommended best practice of using low sensitivities for a single match definition and then reducing false matches by combining matchcodes over multiple data types (e.g. Name and Address) is not practical or does not sufficiently resolve the ambiguity present. This is common when records contain only a single piece of information, such as an entity name, or when the other data types do not contain information that is reliable or relevant enough for the application at hand.
- The records that do not match (but are desired to match) differ from each other at the token level. This occurs, for example, when the differences between the records are due to a transposition of tokens (as in the M/D/Y vs. D/M/Y date example), or to tokens that are missing in one record and present in the other. When the difficulty arises at the sub-token or character level instead, use suggestion-based matching.
- The desired behavior is equivalent to allowing a record to appear in more than one cluster. For example, "ELTON JOHN" should match both "ELTON A JOHN" and "JOHN ELTON", but "JOHN ELTON" should not match "ELTON A JOHN". This is not possible with standard legacy matching, because each record must appear in exactly one cluster.
- A numeric score is needed to aid decision-making when resolving clusters or selecting survivors. Each combination-based matchcode has a score attached, and after clustering, these scores are aggregated for the cluster. These scores can be used to aid manual review of clusters. The presence of scores also paves the way for business rules that automatically resolve clusters and select surviving records.
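The "ELTON JOHN" scenario and the role of scores can be illustrated with a small toy model. The sketch below is not the product's algorithm: the function names `matchcodes` and `pair_score`, the rule for generating combinations (one transposition or one omitted token), the score of 0.8 for a modified combination, and the 0.7 threshold are all assumptions chosen only to show how scored, multiple matchcodes per record can yield matches that a single-cluster assignment cannot represent.

```python
from itertools import combinations

PENALTY = 0.8    # assumed score for a matchcode produced by one token edit
THRESHOLD = 0.7  # assumed minimum pair score that counts as a match


def matchcodes(record):
    """Return {matchcode: score} for a record (toy combination generator)."""
    tokens = tuple(record.split())
    codes = {tokens: 1.0}  # the unmodified token sequence scores highest
    if len(tokens) == 2:
        # Variant with the two tokens transposed.
        codes.setdefault(tokens[::-1], PENALTY)
    if len(tokens) >= 3:
        # Variants with exactly one token omitted.
        for i in range(len(tokens)):
            codes.setdefault(tokens[:i] + tokens[i + 1:], PENALTY)
    return codes


def pair_score(a, b):
    """Best score over the matchcodes two records share, or 0.0 if none."""
    codes_a, codes_b = matchcodes(a), matchcodes(b)
    shared = codes_a.keys() & codes_b.keys()
    return max((codes_a[c] * codes_b[c] for c in shared), default=0.0)


records = ["ELTON JOHN", "ELTON A JOHN", "JOHN ELTON"]
matches = {
    (a, b): pair_score(a, b) >= THRESHOLD
    for a, b in combinations(records, 2)
}

for (a, b), is_match in matches.items():
    print(f"{a!r} vs {b!r}: score={pair_score(a, b):.2f} match={is_match}")
```

In this toy model, "ELTON JOHN" matches each of the other two records through a shared matchcode that required only one edit, while "JOHN ELTON" and "ELTON A JOHN" share a matchcode only after an edit on both sides, so their combined score falls below the threshold. The same pairwise scores are the kind of information that could feed manual review or automated survivor-selection rules.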