SAS(R) Data Quality Server 9.2: Reference

Using the SAS Data Quality Server Software

Create Match Codes

Match codes are encoded representations of character values that are used for analysis, transformation, and standardization of data. Match codes are created by the following procedures and functions:

PROC DQMATCH

creates match codes for one or more variables or parsed tokens that have been extracted from a variable. The procedure can also assign cluster numbers to values with identical match codes. For syntax information, see DQMATCH Procedure Syntax.

DQMATCH

generates match codes for a variable. See DQMATCH Function.

DQMATCHPARSED

generates match codes for tokens that have been parsed from a variable. See DQMATCHPARSED Function.

The DQMATCH and DQMATCHPARSED functions each return one match code for one input character variable. With these tools, you can create match codes for an entire character value or for a parsed token extracted from a character value.
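The two functions can be sketched in a DATA step as follows. This is a minimal illustration, not a definitive recipe: it assumes that the ENUSA locale has already been loaded into memory (for example, with the %DQLOAD autocall macro) and that the locale provides a match definition and a parse definition that are both named Name. The input value, the variable names, and the sensitivity of 85 are chosen only for illustration.

```sas
/* Assumption: the ENUSA locale is already loaded into memory. */
data _null_;
   /* Variable names and lengths are illustrative. */
   length mcWhole $ 255 parsedName $ 255 mcParsed $ 255;

   /* Match code for an entire character value. */
   mcWhole = dqMatch('Dr. Jim Goodnight', 'Name', 85, 'ENUSA');

   /* Parse the value into tokens, then create a match code from the tokens. */
   parsedName = dqParse('Dr. Jim Goodnight', 'Name', 'ENUSA');
   mcParsed   = dqMatchParsed(parsedName, 'Name', 85, 'ENUSA');

   /* Write both match codes to the SAS log. */
   put mcWhole= mcParsed=;
run;
```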

During processing, match codes are generated according to the specified locale, match definition, and sensitivity.

The locale identifies the language and geographical region of the source data. For example, the locale ENUSA specifies that the source data uses the English language as it is used in the United States of America.

The match definition in the Quality Knowledge Base (QKB) identifies the category of the data and determines the content of the match codes. Examples of match definitions are ADDRESS, ORGANIZATION, and DATE(YMD). To determine the match definitions that are available in a Quality Knowledge Base, consult the QKB documentation from DataFlux (a SAS company), or use the function DQLOCALEINFOLIST to return the names of the match definitions in your locale. The function is the more reliable source if your site has modified the default Quality Knowledge Base using the dfPower Customize software from DataFlux, because it reports the definitions that are actually installed.
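As a brief sketch, the following DATA step lists the match definitions for one locale. It assumes that the ENUSA locale has already been loaded into memory and that the definition-type argument 'match' selects match definitions. The variable name num is illustrative.

```sas
/* Assumption: the ENUSA locale is already loaded into memory. */
data _null_;
   /* Writes the names of the match definitions to the SAS log
      and returns the number of definitions that were found. */
   num = dqLocaleInfoList('match', 'ENUSA');
   put 'Number of match definitions: ' num;
run;
```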

The sensitivity level is a value between 0 and 99 that determines the amount of information that is captured in the match code, as described in About Sensitivity.

If two or more match codes are identical, a cluster number can be assigned to a specified variable, as described in About Clusters. The content of the output data set is determined by option values. You can choose to include values that generate unique match codes and you can choose to include and add a cluster number to blank or missing values. You can also concatenate multiple match codes.
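The following sketch shows one way to request cluster numbers with PROC DQMATCH. The data set names, variable names, and input values are made up, and the step assumes that the ENUSA locale is loaded and contains a match definition named Name. The CLUSTERS_ONLY option is used here to exclude values whose match codes are unique. Omit it to keep all input values in the output.

```sas
/* Hypothetical input data. */
data names;
   length name $ 30;
   infile datalines truncover;
   input name $char30.;
datalines;
Bob Beckett
Robert Beckett
Paul Becker
;
run;

/* Create match codes in MC and assign cluster numbers in CLUSTER_ID. */
/* Values with identical match codes receive the same cluster number. */
proc dqmatch data=names out=clustered matchcode=mc cluster=cluster_id
             clusters_only locale='ENUSA';
   criteria matchdef='Name' var=name sensitivity=70;
run;
```

Whether the first two names fall into the same cluster depends on the match definition and the sensitivity. Lower sensitivities capture less information in each match code, so more values tend to share a code.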

Match codes are also created internally by the DQSCHEME procedure, the DQSCHEMEAPPLY function, and the DQSCHEMEAPPLY CALL routine. These match codes are used in the process of creating or applying a scheme, as described in Transform Data with Schemes.


How Match Codes Are Created

You can create two types of match codes:

  • Simple match codes are created from a single input character variable.

  • Composite match codes consist of a concatenation of match codes from two or more input character variables. A match code is created for each variable, and then the separate match codes are concatenated into a composite match code. You can specify that a delimiter, in the form of an exclamation point (!), is inserted between the simple match codes that make up the composite match code (through the DELIMITER option or argument).

To create simple match codes, you specify one CRITERIA statement, with one input variable identified in the VAR= option and one output variable identified with the MATCHCODE= option. Composite match codes are similar, except that you specify multiple CRITERIA statements for multiple variables, and all of those CRITERIA statements specify the same output variable in their respective MATCHCODE= options.
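The two cases can be sketched with PROC DQMATCH as shown below. The data set, variable names, and values are hypothetical, and the steps assume that the ENUSA locale is loaded and contains match definitions named Name and Address. For the composite case, this sketch names the output variable once, in the MATCHCODE= option of the PROC DQMATCH statement. Treat that placement as an assumption and confirm it against the DQMATCH Procedure Syntax for your release.

```sas
/* Hypothetical input data, with fields delimited by a vertical bar. */
data cust_db;
   length customer $ 22 address $ 31;
   infile datalines dlm='|' truncover;
   input customer $ address $;
datalines;
Bob Beckett|392 S. Main St. PO Box 2270
Robert E. Beckett|392 S. Main St. PO Box 2270
Paul Becker|392 N. Main St. PO Box 7720
;
run;

/* Simple match codes: one input variable, one output variable. */
proc dqmatch data=cust_db out=simple_mc locale='ENUSA';
   criteria matchdef='Name' var=customer sensitivity=85 matchcode=mc_name;
run;

/* Composite match codes: two input variables, one output variable. */
proc dqmatch data=cust_db out=composite_mc matchcode=match_cd locale='ENUSA';
   criteria matchdef='Name' var=customer sensitivity=85;
   criteria matchdef='Address' var=address sensitivity=70;
run;
```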

The SAS Data Quality Server software creates match codes using these general steps:

  1. Parse the input character value to identify tokens.

  2. Remove insignificant words.

  3. Remove some of the vowels. Remove fewer vowels when a scheme-build match definition has been specified, as described in About the Scheme Build Match Definitions.

  4. Standardize the format and capitalization of words.

  5. Create the match code by extracting the appropriate amount of information from one or more tokens, based on the specified match definition and level of sensitivity.

Certain match definitions skip some of these steps.

Note: When you work with two or more data sets that you intend to analyze together or join using match codes, be sure to use identical sensitivities and match definitions when you create the match codes in each data set.
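One way to keep the settings identical is to define them once and reuse them in every step, as in the following sketch. The macro variables, the add_mc helper macro, the data sets, and the values are all hypothetical. The code assumes that the ENUSA locale is loaded and contains a match definition named Name.

```sas
/* Hypothetical settings shared by every match code step. */
%let mdef = Name;
%let sens = 85;
%let loc  = ENUSA;

/* Two small made-up data sets that are to be joined on match codes. */
data crm;
   length name $ 30;
   infile datalines truncover;
   input name $char30.;
datalines;
Bob Beckett
Paul Becker
;
run;

data billing;
   length name $ 30;
   infile datalines truncover;
   input name $char30.;
datalines;
Robert Beckett
Paula Becker
;
run;

/* Hypothetical helper: adds match code variable MC using the shared settings. */
%macro add_mc(in=, out=);
   data &out;
      set &in;
      length mc $ 255;
      mc = dqMatch(name, "&mdef", &sens, "&loc");
   run;
%mend add_mc;

%add_mc(in=crm, out=crm_mc)
%add_mc(in=billing, out=billing_mc)

/* Join the two data sets on their match codes. */
proc sql;
   create table matched as
   select a.name as crm_name, b.name as billing_name, a.mc
   from crm_mc as a inner join billing_mc as b
   on a.mc = b.mc;
quit;
```

Because both data sets pass through the same macro, a later change to the sensitivity or match definition applies to both sets of match codes at once.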


About the Length of Match Codes

Match codes can vary in length between 1 and 1024 bytes. The length is determined by the specified match definition. If you receive a message in the SAS log that states that match codes have been truncated, you should extend the length of your match code variable. Truncated match codes will not produce accurate results.
