
DataFlux Data Management Studio 2.5: User Guide

Overview of Combination-Based Matching

Starting in DM Studio 2.1, a match definition can output multiple matchcodes per input (at a given sensitivity setting) by using token combination rules. This is known as combination-based matching. These rules let the user omit different combinations of tokens or change the order of tokens, so that multiple token assignments are evaluated and multiple matchcodes are produced for a given input string.

One application of token combination rules is to address parsing ambiguity. For example, the date string "2/4/2012" can be read as either the 2nd of April, 2012 or the 4th of February, 2012, depending on whether the D/M/Y or M/D/Y convention is used. When the date convention is not known with certainty, both interpretations may need to be considered. A similar problem arises with personal names, where the given and family names can be interchanged (for example, "ELTON JOHN" and "JOHN ELTON").
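The date ambiguity can be illustrated with a minimal Python sketch. This is not DataFlux code or any part of the match definition itself; the function name `date_interpretations` is hypothetical, and the sketch assumes the input is always a numeric string with two short fields followed by a four-digit year, separated by slashes.

```python
from datetime import date


def date_interpretations(text):
    """Return every valid calendar date that an 'a/b/yyyy' string could denote."""
    first, second, year = (int(part) for part in text.split("/"))
    candidates = set()
    # Try the D/M/Y reading, then the M/D/Y reading.
    for day, month in ((first, second), (second, first)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            # This reading is not a real calendar date (for example, month 13).
            pass
    return sorted(candidates)
```

Under these assumptions, "2/4/2012" yields two candidate dates, whereas "13/4/2012" yields only one, because 13 cannot be a month. Only the first kind of input is genuinely ambiguous.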

Another common application concerns entities where certain tokens can be omitted at will. Addresses written in the style of the United Kingdom are an example. According to postal regulations, the only mandatory elements of a United Kingdom address are Town and Postcode (though it is, naturally, possible to find records that incorrectly omit one or both of those tokens). The two mandatory elements are insufficient to fully identify an address, so some other information must be provided for the address to be complete. Depending on the specific situation, different combinations of tokens can be used to complete the address, and there is no way of knowing in advance which combination of these tokens will be populated for any given record. Several simple examples are shown in the following table; all of these records could refer to the same physical address. Note that the last two records have mandatory tokens omitted.

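| ID | Building   | Number | Street   | District/Village | Town       | County          | Postcode |
|----|------------|--------|----------|------------------|------------|-----------------|----------|
| 0  | The Grange | 32     | Green Rd | Bishops Cleeve   | Cheltenham | Gloucestershire | GL52 8XX |
| 1  | The Grange |        |          | Bishops Cleeve   | Cheltenham |                 | GL52 8XX |
| 2  |            | 32     | Green Rd |                  | Cheltenham | Gloucestershire | GL52 8XX |
| 3  | The Grange |        | Green Rd |                  | Cheltenham |                 | GL52 8XX |
| 4  | The Grange |        |          |                  | Cheltenham | Gloucestershire |          |
| 5  |            | 32     | Green Rd |                  |            |                 | GL52 8XX |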


A token combination rule expresses a mapping between the tokens at the input of the match definition and the tokens that are actually used in the matchcode generation steps within the match definition. A rule consists, broadly, of two main parts: the conditions and the actions. The conditions specify when the rule is applied, and are expressed as a predicate involving one or more incoming tokens. The actions describe the output of the rule: specifically, which incoming tokens are mapped to the effective tokens that are used in the rest of the matchcode generation.
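The two-part structure of a rule can be sketched as a small Python data structure. This is a conceptual illustration only, not the format in which DM Studio stores or evaluates rules; the class `CombinationRule`, its fields, and the token names "Given Name" and "Family Name" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Tokens = Dict[str, str]


@dataclass
class CombinationRule:
    """Hypothetical model of a token combination rule."""
    name: str
    condition: Callable[[Tokens], bool]  # predicate over the incoming tokens
    mapping: Dict[str, str]              # effective token name -> incoming token name

    def apply(self, tokens: Tokens) -> Optional[Tokens]:
        """Return the effective tokens, or None if the conditions are not met."""
        if not self.condition(tokens):
            return None
        return {
            effective: tokens.get(incoming, "")
            for effective, incoming in self.mapping.items()
        }


# Example rule: swap given and family names when both are populated.
swap_names = CombinationRule(
    name="swap given/family",
    condition=lambda t: bool(t.get("Given Name")) and bool(t.get("Family Name")),
    mapping={"Given Name": "Family Name", "Family Name": "Given Name"},
)
```

In this sketch, applying `swap_names` to the tokens of "ELTON JOHN" returns the tokens of "JOHN ELTON", while applying it to a record with no family name returns `None`, so the rule contributes no matchcode for that record.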

Each token combination rule may produce a matchcode, depending on whether its conditions are met for the particular record under consideration. In addition, the "default" matchcode, i.e. the matchcode that would have been produced by the legacy definition with no token combination rules, may still be output. A single input to a combination-based match definition will therefore generally produce multiple output matchcodes. This means that a given record may appear in more than one potential cluster. In the case of combination-based matching, each matchcode corresponds to a different combination of tokens.
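The effect on clustering can be sketched in Python using four of the address records from the table above. Everything here is a simplified assumption: the two rules, the `encode` function (a toy stand-in for real matchcode generation), and the function names are hypothetical and do not reflect how DM Studio computes matchcodes.

```python
from collections import defaultdict

ALL_TOKENS = ("Building", "Number", "Street", "District/Village",
              "Town", "County", "Postcode")

# Records 0, 1, 2 and 5 from the address table; absent tokens are simply missing.
RECORDS = {
    0: {"Building": "The Grange", "Number": "32", "Street": "Green Rd",
        "District/Village": "Bishops Cleeve", "Town": "Cheltenham",
        "County": "Gloucestershire", "Postcode": "GL52 8XX"},
    1: {"Building": "The Grange", "District/Village": "Bishops Cleeve",
        "Town": "Cheltenham", "Postcode": "GL52 8XX"},
    2: {"Number": "32", "Street": "Green Rd", "Town": "Cheltenham",
        "County": "Gloucestershire", "Postcode": "GL52 8XX"},
    5: {"Number": "32", "Street": "Green Rd", "Postcode": "GL52 8XX"},
}

# Hypothetical rules: each keeps a subset of tokens and applies only
# when every token in that subset is populated.
RULES = [
    ("Building", "Postcode"),
    ("Number", "Street", "Postcode"),
]


def encode(record, kept):
    """Toy matchcode: uppercase the kept tokens, strip spaces, and join them."""
    return "|".join(
        f"{name}={record.get(name, '').upper().replace(' ', '')}" for name in kept
    )


def matchcodes(record):
    """Return the default matchcode plus one matchcode per applicable rule."""
    codes = {encode(record, ALL_TOKENS)}  # the "default" matchcode
    for kept in RULES:
        if all(record.get(name) for name in kept):
            codes.add(encode(record, kept))
    return codes


def candidate_clusters(records):
    """Group record IDs by shared matchcode, keeping groups of two or more."""
    clusters = defaultdict(set)
    for record_id, record in records.items():
        for code in matchcodes(record):
            clusters[code].add(record_id)
    return {code: ids for code, ids in clusters.items() if len(ids) > 1}
```

With these assumed rules, record 0 produces three matchcodes and lands in two potential clusters: one with record 1 (shared building and postcode) and one with records 2 and 5 (shared number, street, and postcode). The default matchcodes alone would not have grouped any of these records.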

In DM Studio 2.2, suggestion-based matching was added. This is another, entirely separate, mechanism that allows a match definition to produce multiple output matchcodes for a single input. A match definition may be configured for both combination-based matching and suggestion-based matching, if desired. The diagram below gives a conceptual overview of the match definition flow when both types of functionality are in use. Naturally, the required computation and storage resources increase when these features are used.


Doc ID: dfDMStd_CBM_overview.html