Create Match Codes

Overview

Match codes are encoded representations of character values that are used for analysis, transformation, and standardization of data. Use the following procedures and functions to create match codes:
The DQMATCH procedure
creates match codes for one or more variables or parsed tokens that have been extracted from a variable. The procedure can also assign cluster numbers to values with identical match codes. See DQMATCH Procedure for additional information.
The DQMATCH function
generates a match code for a character value. See DQMATCH Function for additional information.
The DQMATCHPARSED function
generates a match code for tokens that have been parsed from a character value. See DQMATCHPARSED Function for additional information.
The DQMATCH and DQMATCHPARSED functions each return one match code for one input character variable. With the procedure and the functions, you can create match codes for an entire character value or for a parsed token extracted from a character value.
During processing, match codes are generated according to the specified locale, match definition, and sensitivity level:
  • The locale identifies the language and geographical region of the source data. For example, the locale ENUSA specifies that the source data uses the English language as it is used in the United States of America.
  • The match definition in the Quality Knowledge Base (QKB) identifies the category of the data and determines the content of the match codes. Examples of match definitions are ADDRESS, ORGANIZATION, and DATE(YMD).
    To determine the match definitions that are available in a Quality Knowledge Base, consult the QKB documentation from DataFlux (a SAS company). Alternatively, use the DQLOCALEINFOLIST function to return the names of the locale's match definitions. Use the DQLOCALEINFOLIST function if your site has modified the default Quality Knowledge Base using DataFlux dfPower Customize software.
  • The sensitivity level is a value between 50 and 95 that determines the amount of information that is captured in the match code, as described in Sensitivity.
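As a minimal sketch of how the locale, match definition, and sensitivity level fit together, the following DATA step lists the match definitions of the ENUSA locale and then generates match codes for one value at two sensitivity levels. It assumes that a Quality Knowledge Base is installed and that the locale can be loaded with the %DQLOAD autocall macro. The setup path, the sample value, the variable names, and the length of 255 are hypothetical, and the Name match definition is assumed to be present in the QKB in use.

```sas
/* Load the ENUSA locale into memory. The DQSETUPLOC= path is a          */
/* hypothetical placeholder; substitute the QKB location at your site.   */
%dqload(dqlocale=(ENUSA), dqsetuploc='/path/to/qkb');

data _null_;
   /* 255 is an arbitrary choice that stays within the documented limit. */
   length mc_low mc_high $ 255;

   /* Writes the names of the locale's match definitions to the SAS log  */
   /* and returns the number of definitions that were listed.            */
   n_defs = dqLocaleInfoList('match', 'ENUSA');
   put n_defs=;

   /* Same input value and match definition, two sensitivity levels.     */
   /* 'Name' is assumed to be a match definition in this QKB.            */
   mc_low  = dqMatch('Robert Smith', 'Name', 55, 'ENUSA');
   mc_high = dqMatch('Robert Smith', 'Name', 95, 'ENUSA');
   put mc_low= / mc_high=;
run;
```

Comparing the two values in the log gives a quick impression of how the sensitivity level changes the amount of information captured in the match code.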
If two or more match codes are identical, a cluster number can be assigned to a specified variable, as described in Clusters.
Option values determine the content of the output data set. You can include values that generate unique match codes, include blank or missing values and assign them a cluster number, and concatenate multiple match codes.
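The following sketch shows one way that cluster numbers and output options might be combined in the DQMATCH procedure. It assumes that the ENUSA locale is already loaded and that the Name match definition is available. The data set, the variable names, and the sample values are hypothetical, and the CLUSTERS_ONLY and NO_CLUSTER_BLANKS options are shown as they are understood from the DQMATCH procedure syntax, so confirm them against DQMATCH Procedure before relying on them.

```sas
/* Assumes the ENUSA locale is already loaded (for example, with %DQLOAD). */
/* Hypothetical input data.                                                */
data work.customers;
   length name $ 30;
   input name $char30.;
   datalines;
Robert Smith
Bob Smith
Mary Jones
;
run;

/* CLUSTER= names the variable that receives the cluster number.          */
/* CLUSTERS_ONLY keeps only values that share a match code with another   */
/* value; NO_CLUSTER_BLANKS leaves blank values out of the output.        */
proc dqmatch data=work.customers out=work.clustered
     matchcode=match_cd cluster=cluster_id
     clusters_only no_cluster_blanks
     locale='ENUSA';
   criteria matchdef='Name' var=name sensitivity=85;
run;

proc print data=work.clustered;
run;
```

Whether two particular values fall into the same cluster depends on the match definition and the sensitivity level, so no output is shown here.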
Match codes are also generated internally when you create a scheme with the DQSCHEME procedure, as described in Schemes, and when you apply a scheme with the DQSCHEMEAPPLY function or the DQSCHEMEAPPLY CALL routine. In both cases, the match codes are used in the process of creating or applying the scheme.

How to Create a Match Code

You can create two types of match codes:
  • Simple match codes, which are created from a single input character variable.
  • Composite match codes, which are concatenations of the simple match codes from two or more input character variables.
    Use the DELIMITER= option to specify that an exclamation point (!) is inserted as a delimiter between the simple match codes in the composite match code.
To create simple match codes, specify one CRITERIA statement, one input variable identified in the VAR= option, and one output variable identified with the MATCHCODE= option.
Composite match codes are similar, except that you specify multiple CRITERIA statements for multiple variables. All the CRITERIA statements specify the same output variable in their respective MATCHCODE= options.
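The two cases can be sketched as follows. The first step creates a simple match code from one variable. The second step creates a composite match code from two variables and, as an assumption about the procedure syntax, names the output variable once in the PROC DQMATCH statement rather than repeating it in every CRITERIA statement. The data set, the variable names, and the sample values are hypothetical, the ENUSA locale is assumed to be loaded already, and the Name and Address match definitions are assumed to be available in the QKB.

```sas
/* Assumes the ENUSA locale is already loaded (for example, with %DQLOAD). */
/* Hypothetical input data, with fields separated by a vertical bar.       */
data work.cust_db;
   infile datalines dlm='|' truncover;
   length customer $ 22 address $ 31;
   input customer :$22. address :$31.;
   datalines;
Robert Smith|100 Main Street
Bob Smith|100 Main St.
Mary Jones|25 Oak Avenue
;
run;

/* Simple match code: one CRITERIA statement, one input variable (VAR=),  */
/* one output variable (MATCHCODE=).                                      */
proc dqmatch data=work.cust_db out=work.simple_mc locale='ENUSA';
   criteria matchdef='Name' var=customer matchcode=mc_name sensitivity=85;
run;

/* Composite match code: one CRITERIA statement per input variable, all   */
/* feeding the single output variable MATCH_CD.                           */
proc dqmatch data=work.cust_db out=work.composite_mc
     matchcode=match_cd locale='ENUSA';
   criteria matchdef='Name'    var=customer sensitivity=85;
   criteria matchdef='Address' var=address  sensitivity=85;
run;
```

Each CRITERIA statement can carry its own sensitivity level, which is why the sensitivity is stated per variable rather than once for the whole step.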
SAS Data Quality Server software creates match codes using these general steps:
  1. Parse the input character value to identify tokens.
  2. Remove insignificant words.
  3. Remove some of the vowels. Remove fewer vowels when a scheme-build match definition has been specified.
  4. Standardize the format and capitalization of words.
  5. Create the match code by extracting the appropriate amount of information from one or more tokens, based on the specified match definition and level of sensitivity.
Certain match definitions skip some of these steps.
Note: To analyze or join two or more data sets using match codes, create the match codes in each data set with identical sensitivity levels and match definitions.
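The note above can be illustrated with a small join. In this sketch the match definition, sensitivity level, and locale are held in macro variables so that both data sets are guaranteed to use identical settings. The data sets, the variable names, the sample values, and the length of 255 are hypothetical, and the ENUSA locale and the Name match definition are assumed to be available.

```sas
/* Assumes the ENUSA locale is already loaded (for example, with %DQLOAD). */
/* One set of parameters, shared by both data sets.                        */
%let matchdef = Name;
%let sens     = 85;
%let locale   = ENUSA;

/* Hypothetical source data sets. */
data work.crm;
   length name $ 40;
   input name $char40.;
   datalines;
Robert Smith
Mary Jones
;
run;

data work.billing;
   length name $ 40;
   input name $char40.;
   datalines;
Bob Smith
Mary Jones
;
run;

/* Create the match codes with identical settings in each data set. */
data work.crm_mc;
   set work.crm;
   length mc $ 255;
   mc = dqMatch(name, "&matchdef", &sens, "&locale");
run;

data work.billing_mc;
   set work.billing;
   length mc $ 255;
   mc = dqMatch(name, "&matchdef", &sens, "&locale");
run;

/* Join on the match code and skip rows that have no match code. */
proc sql;
   create table work.matched as
   select a.name as crm_name, b.name as billing_name, a.mc
   from work.crm_mc as a
        inner join work.billing_mc as b
        on a.mc = b.mc
   where a.mc ne ' ';
quit;
```

Keeping the parameters in one place is a design choice for this sketch; any approach works as long as both data sets end up with the same match definition and sensitivity level.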

Match Code Length

Match codes vary in length from 1 to 1024 bytes. The length is determined by the specified match definition. If you receive a message in the SAS log that states that match codes have been truncated, increase the length of the match code variable. Truncated match codes do not produce accurate results.
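One way to avoid truncation in a DATA step is to declare the match code variable with an explicit length before the first assignment. The following is a minimal sketch. The data set, the variable names, the sample values, and the length of 255 are hypothetical, and the ENUSA locale and the Name match definition are assumed to be available.

```sas
/* Assumes the ENUSA locale is already loaded (for example, with %DQLOAD). */
data work.names_mc;
   /* Declare the match code variable first. 255 is an arbitrary value;   */
   /* raise it, up to 1024, if the log reports truncated match codes.     */
   length name $ 40 match_cd $ 255;
   input name $char40.;
   match_cd = dqMatch(name, 'Name', 85, 'ENUSA');
   datalines;
Robert Smith
Mary Jones
;
run;
```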