Locale Definitions

Parse Definitions

Parse definitions are referenced when you want to create parsed input values. Parsed input values are delimited so that the elements in those values can be associated with named tokens. After parsing, specific contents of the input values can be returned by specifying the names of tokens.
Parse definitions and tokens are referenced by the following routine and functions:
For a brief example of how tokens are assigned and used, see Specify Definitions in SAS Data Cleansing Programs.
Parsing a character value assigns tokens only when the content in the input value meets the criteria in the parse definition. Parsed character values can therefore contain empty tokens. For example, three tokens are empty when you use the DQPARSE function to parse the character value Ian M. Banks. When using the NAME parse definition in the ENUSA locale, the resulting token/value pairs are as follows:
Token             Value
NAME PREFIX       empty
GIVEN NAME        Ian
MIDDLE NAME       M.
FAMILY NAME       Banks
NAME SUFFIX       empty
NAME APPENDAGE    empty
Note: For parse definitions that work with dates, such as DATE (DMY) in the ENUSA locale, input values must be character data rather than SAS dates.
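The relationship between tokens and values, including empty tokens, can be modeled outside SAS. The following is a minimal Python sketch, not SAS code: the token names come from the NAME example above, while the dictionary layout and the helper names build_parsed, token_value, and empty_tokens are assumptions made for illustration.

```python
# Token names follow the NAME parse definition example above.
# The dictionary layout is an assumption for this sketch, not a SAS structure.
NAME_TOKENS = (
    "NAME PREFIX",
    "GIVEN NAME",
    "MIDDLE NAME",
    "FAMILY NAME",
    "NAME SUFFIX",
    "NAME APPENDAGE",
)


def build_parsed(values):
    """Return a token -> value mapping; tokens with no content are empty strings."""
    unknown = set(values) - set(NAME_TOKENS)
    if unknown:
        raise KeyError(f"unknown token(s): {sorted(unknown)}")
    return {token: values.get(token, "") for token in NAME_TOKENS}


def token_value(parsed, token):
    """Look up one token by name; an empty token yields an empty string."""
    return parsed[token]


def empty_tokens(parsed):
    """List the tokens that received no content, in definition order."""
    return [token for token, value in parsed.items() if value == ""]


# Hand-built result for the input value "Ian M. Banks".
parsed = build_parsed(
    {"GIVEN NAME": "Ian", "MIDDLE NAME": "M.", "FAMILY NAME": "Banks"}
)
print(token_value(parsed, "FAMILY NAME"))
print(empty_tokens(parsed))
```

The sketch does not parse anything itself. It only shows that every token in the definition is present in the result and that tokens without matching content are empty rather than missing.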

Global Parse Definitions

Global parse definitions contain a standard set of parse tokens that enable the analysis of similar data from different locales. For example, the ENUSA locale and the DEDEU locale both contain the parse definition ADDRESS (GLOBAL). The parse tokens are the same in both locales. This global parse definition enables the combination of parsed character data from multiple locales.
All global parse definitions are identified by the (GLOBAL) suffix.

Extraction Definitions

Extraction definitions extract parts of an input string and assign them to corresponding tokens of the associated data type. Extraction input values are delimited so that the elements in those values can be associated with named tokens. After extraction, specific contents of the input values can be returned by specifying the names of tokens.
Extraction definitions and tokens are referenced by the following functions:
For a brief example of how tokens are assigned and used, see Specify Definitions in SAS Data Cleansing Programs.
Extracting a character value assigns tokens when the content in the input value meets the criteria in the extraction definition. For example, using the string "100 Slightly used green Acme MAB-6200 telephone $100 including touch-tone buttons" as input results in the following output mapping between tokens and substrings.
Token          Value
QUANTITY       100
BRAND          “ACME”
MODEL          “MAB-6200”
COLOR          “green”
PRICE          “$100”
DESCRIPTION    “slightly used telephone including touch-tone buttons”
Extracted character values can also contain empty tokens. For example, in the illustration above, if the input string did not contain a price, then PRICE would contain an “empty” token.

Match Definitions

Match definitions are referenced during the creation of match codes. Match codes provide a variable method of clustering similar input values as a basis for data cleansing jobs such as the application of schemes.
When you create match codes, you determine the number of clusters (values with the same match code) and the number of members in each cluster by specifying a sensitivity level. The default sensitivity level is specified by the procedure or function, rather than the match definition. For information about sensitivity levels, see Sensitivity.
Match definitions are referenced by the following procedures and functions:
When you create match codes for parsed character values, your choice of match definition depends on the parse definition that was used to parse the input character value. To determine the parse definition that is associated with a given match definition, use the DQMATCHINFOGET Function.
Note: For match definitions that work with dates, such as DATE (MDY) in the ENUSA locale, input values must be character data rather than SAS dates.

Scheme Build Match Definitions

Locales contain certain match definitions that are recommended for use in the DQSCHEME procedure. These match definitions produce more desirable schemes. The names of these scheme-build match definitions always end with “(SCHEME BUILD)”.
Scheme-build match definitions are advantageous because they create match codes that contain more vowels. Match codes that contain more vowels result in more clusters with fewer members in each cluster, which in turn results in a larger, more specific set of transformation values.
When you are using the DQMATCH procedure or function to create simple clusters, it is better to have fewer vowels in the match code. For example, using the CITY match definition in the DQMATCH procedure, the values Baltimore and Boltimore receive the same match code. The match codes would differ if you used the match definition CITY (SCHEME BUILD).
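The effect of vowels on clustering can be illustrated with a toy model. The following Python sketch is not the SAS match-code algorithm, which is not described here. It assumes a simplified code that uppercases the value and, in the simple case, drops every vowel after the first letter. The names toy_match_code and cluster, and the extra value Boston, are hypothetical.

```python
VOWELS = set("AEIOU")


def toy_match_code(value, keep_vowels):
    """Toy stand-in for a match code: uppercase letters, optionally without vowels.

    This is an illustrative assumption, not the encoding that SAS uses.
    """
    letters = [ch for ch in value.upper() if ch.isalpha()]
    if keep_vowels or not letters:
        return "".join(letters)
    first, rest = letters[0], letters[1:]
    return first + "".join(ch for ch in rest if ch not in VOWELS)


def cluster(values, keep_vowels):
    """Group values that share the same toy match code."""
    clusters = {}
    for value in values:
        clusters.setdefault(toy_match_code(value, keep_vowels), []).append(value)
    return clusters


cities = ["Baltimore", "Boltimore", "Boston"]
simple = cluster(cities, keep_vowels=False)        # fewer vowels, broader clusters
scheme_build = cluster(cities, keep_vowels=True)   # more vowels, narrower clusters
print(simple)
print(scheme_build)
```

In this model, Baltimore and Boltimore share a code when vowels are dropped and receive different codes when vowels are kept, which mirrors the difference described above between CITY and CITY (SCHEME BUILD).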

Case and Standardization Definitions

Case and standardization definitions are applied to character values to make them more consistent for the purposes of display or in preparation for transforming those values with a scheme.
Case definitions are referenced by the DQCASE Function. Standardization definitions are referenced by the DQSTANDARDIZE Function.
Case definitions transform the capitalization of character values. For example, the case definition Proper in the ENUSA locale takes as input any general text. It capitalizes the first letter of each word, and uses lowercase for the other letters in the word. It also recognizes and retains or transforms various words and abbreviations into uppercase. Other case definitions, such as PROPER – ADDRESS, apply to specific text content.
Standardization definitions standardize the appearance of specific data values. In general, words are capitalized appropriately based on the content of the input character values. Also, adjacent blank spaces are removed, along with unnecessary punctuation. Additional standardizations might be made for specific content. For example, the standardization definition STATE (FULL NAME) in the locale ENUSA converts abbreviated state names to full names in uppercase.

Standardization of Dates in the EN Locale

In the EN locale, dates are standardized to two-digit days (01–31), two-digit months (01–12), and four-digit years. Input dates must be character values rather than SAS dates.
Spaces separate (delimit) the days, months, and years, as shown in the following table:
Sample Date Standardizations

Input Date    Standardization Definition    Standardized Date
July04, 03    Date (MDY)                    07 04 2003
July 04 04    Date (MDY)                    07 04 1904
July0401      Date (MDY)                    07 04 2001
04.07.02      Date (DMY)                    04 07 2002
04-07-2004    Date (DMY)                    04 07 2004
03/07/04      Date (YMD)                    2003 07 04
Two-digit year values are standardized as follows:
  • If an input year is greater than or equal to 00 and less than or equal to 03, the standardized year is 2000, 2001, 2002, or 2003.
  • Two-digit input year values that are greater than or equal to 04 and less than or equal to 99 are standardized into the range of 1904–1999.
For example, an input year of 03 is standardized as 2003. An input year of 04 is standardized as 1904. These standardizations are not affected by the value of the SAS system option YEARCUTOFF=.
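The two-digit year rule can be written as a short function. The following Python sketch models only the rule stated above and does not call SAS. The function name standardize_two_digit_year is hypothetical, and mapping 00 to 2000 is an assumption based on 2000 appearing in the list of standardized years.

```python
def standardize_two_digit_year(yy):
    """Map a two-digit year string to a four-digit year.

    Rule modeled from the text: 00-03 -> 2000-2003, 04-99 -> 1904-1999.
    Treating 00 as 2000 is an assumption of this sketch.
    """
    if len(yy) != 2 or not (yy.isascii() and yy.isdigit()):
        raise ValueError(f"expected exactly two digits, got {yy!r}")
    number = int(yy)
    century = 2000 if number <= 3 else 1900
    return century + number


for sample in ("00", "03", "04", "99"):
    print(sample, standardize_two_digit_year(sample))
```

The cutoff is fixed in the function, which reflects the statement that the result does not depend on YEARCUTOFF=.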

Gender Analysis, Locale Guess, and Identification Definitions

Gender analysis, locale guess, and identification definitions enable you to make determinations about character values. With these definitions, you can determine the following:
  • the gender of an individual based on a name value
  • the locale that is the most suitable for a given character value
  • the category of a value, which is chosen from a set of available categories
Gender analysis definitions determine the gender of an individual based on that individual's name. The gender is determined to be unknown if the first name is used by both males and females and no other clues are provided in the name, or if conflicting clues are found. Gender analysis definitions are referenced by the DQGENDER Function.
Locale guess definitions allow the software to determine the locale that is most likely represented by a character value. All locales that are loaded into memory as part of the locale list are considered, but only if they contain the specified guess definition. If a definite locale determination cannot be made, the chosen locale is the first locale in the locale list. Locale guess definitions are referenced by the DQLOCALEGUESS Function.
Identification definitions are used to categorize character values. For example, using the Entity identification definition in the ENUSA locale, a name value can apply to an individual or an organization. Identification definitions are referenced by the DQIDENTIFY Function.

Pattern Analysis Definitions

Pattern analysis definitions enable you to determine whether an input character value contains characters that are alphabetic, numeric, non-alphanumeric (punctuation marks or symbols), or a mixture of alphanumeric and non-alphanumeric. The ENUSA locale contains two pattern analysis definitions: WORD and CHARACTER.
The pattern analysis definition WORD is referenced by the DQPATTERN function. This generates one character of analytical information for each word in the input character value. See DQPATTERN Function for additional information. The CHARACTER definition generates one character of analytical information for each character in the input character value.
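The difference between word-level and character-level analysis can be sketched in Python. This is an illustrative model, not the DQPATTERN implementation. The codes A (alphabetic), 9 (numeric), * (non-alphanumeric), and M (mixture), the handling of blanks, and the function names are assumptions of the sketch; see the DQPATTERN Function for the values that SAS actually returns.

```python
def classify_char(ch):
    """Return an assumed one-character code for a single character."""
    if ch.isalpha():
        return "A"   # alphabetic
    if ch.isdigit():
        return "9"   # numeric
    return "*"       # non-alphanumeric (punctuation marks or symbols)


def character_pattern(value):
    """One code per character; blanks are kept so word boundaries stay visible."""
    return "".join(" " if ch.isspace() else classify_char(ch) for ch in value)


def word_pattern(value):
    """One code per blank-delimited word; 'M' marks a word that mixes categories."""
    codes = []
    for word in value.split():
        kinds = {classify_char(ch) for ch in word}
        codes.append(kinds.pop() if len(kinds) == 1 else "M")
    return " ".join(codes)


sample = "Widgets 32 !! A1"
print(word_pattern(sample))       # one code per word
print(character_pattern(sample))  # one code per character
```

Separating the per-character classifier from the two pattern functions keeps the category rules in one place, so both levels of analysis stay consistent with each other.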