Locale Definitions
Parse Definitions
Parse definitions are referenced when you want to create parsed input values. Parsed input values are delimited so that the elements in those values can be associated with named tokens. After parsing, specific contents of the input values can be returned by specifying the names of tokens.
Parse definitions and tokens are referenced by the following functions:
For a brief example of how tokens are assigned and used, see Specify Definitions In SAS Data Cleansing Programs.
Parsing a character value assigns tokens only when the content in the input value meets the criteria in the parse definition. Parsed character values can therefore contain empty tokens. For example, three tokens are empty when you use the DQPARSE function to parse the character value Ian M. Banks. When using the NAME parse definition in the ENUSA locale, the resulting token/value pairs are:
Token | Value |
NAME PREFIX | empty |
GIVEN NAME | Ian |
MIDDLE NAME | M. |
FAMILY NAME | Banks |
NAME SUFFIX | empty |
NAME APPENDAGE | empty |
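To make the idea of empty tokens concrete, the following Python sketch models the token/value pairs above as a plain dictionary. It is an illustration only: the real parsed value is a delimited character string whose layout is not described here, and the helper name `get_token` is hypothetical rather than part of any SAS interface.

```python
# Token/value pairs for "Ian M. Banks" as listed above.
# Empty tokens are modeled as empty strings (an assumption of this sketch).
parsed_name = {
    "NAME PREFIX": "",
    "GIVEN NAME": "Ian",
    "MIDDLE NAME": "M.",
    "FAMILY NAME": "Banks",
    "NAME SUFFIX": "",
    "NAME APPENDAGE": "",
}


def get_token(parsed: dict, token: str) -> str:
    """Return the content stored under a token name (hypothetical helper)."""
    return parsed[token]


# Tokens that received no content during parsing.
empty_tokens = [name for name, value in parsed_name.items() if value == ""]

print(get_token(parsed_name, "FAMILY NAME"))
print(empty_tokens)
```

Looking up a token by name returns only that part of the input, and an empty token simply yields an empty result.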
Note: For parse definitions that work with dates, such as DATE (DMY) in the ENUSA locale, input values must be character data rather than SAS dates.
Global Parse Definitions
Global parse definitions contain a standard set of parse tokens that enable the analysis of similar data from different locales. For example, the ENUSA locale and the DEDEU locale both contain the parse definition ADDRESS (GLOBAL). The parse tokens are the same in both locales. This global parse definition enables the combination of parsed character data from multiple locales.
All global parse definitions are identified by the (GLOBAL) suffix.
Match Definitions
Match definitions are referenced during the creation of match codes. Match codes provide a variable method of clustering similar input values as a basis for data cleansing jobs such as the application of schemes.
When you create match codes, you determine the number of clusters (values with the same match code) and the number of members in each cluster by specifying a sensitivity level. The default sensitivity level is specified by the procedure or function, rather than the match definition. For information about sensitivity levels, see Sensitivity.
Match definitions are referenced by the following procedures and functions:
When you create match codes for parsed character values, your choice of match definition depends on the parse definition that was used to parse the input character value. To determine the parse definition that is associated with a given match definition, use the DQMATCHINFOGET Function.
Note: For match definitions that work with dates, such as DATE (MDY) in the ENUSA locale, input values must be character data rather than SAS dates.
Scheme Build Match Definitions
Locales contain certain match definitions that are recommended for use in the DQSCHEME procedure. These match definitions produce more desirable schemes. The names of these scheme-build match definitions always end with "(SCHEME BUILD)".
Scheme-build match definitions are advantageous because they create match codes that contain more vowels. Match codes that contain more vowels result in more clusters with fewer members in each cluster, which in turn results in a larger, more specific set of transformation values.
When you are using the DQMATCH procedure or function to create simple clusters, it is better to have fewer vowels in the match code. For example, using the CITY match definition in the DQMATCH procedure, the values Baltimore and Boltimore receive the same match codes. The match codes would differ if you used the match definition CITY (SCHEME BUILD).
Case and Standardization Definitions
Case and standardization definitions are applied to character values to make them more consistent for the purposes of display or in preparation for transforming those values with a scheme.
Case definitions are referenced by the DQCASE Function. Standardization definitions are referenced by the DQSTANDARDIZE Function.
Case definitions transform the capitalization of character values. For example, the case definition Proper in the ENUSA locale takes as input any general text. It capitalizes the first letter of each word, and uses lowercase for the other letters in the word. It also recognizes and retains or transforms various words and abbreviations into uppercase. Other case definitions, such as PROPER - ADDRESS, apply to specific text content.
Standardization definitions standardize the appearance of specific data values. In general, words are capitalized appropriately based on the content of the input character values. Also, adjacent blank spaces are removed, along with unnecessary punctuation. Additional standardizations might be made for specific content. For example, the standardization definition STATE (FULL NAME) in the locale ENUSA converts abbreviated state names to full names in uppercase.
Standardization of Dates in the EN Locale
In the EN locale, dates are standardized to two-digit days (01-31), two-digit months (01-12), and four-digit years. Input dates must be character values rather than SAS dates.
Spaces separate (delimit) the days, months, and years, as shown in the following table:
Input Date | Standardization Definition | Standardized Date |
July04, 03 | Date (MDY) | 07 04 2003 |
July 04 04 | Date (MDY) | 07 04 1904 |
July0401 | Date (MDY) | 07 04 2001 |
04.07.02 | Date (DMY) | 04 07 2002 |
04-07-2004 | Date (DMY) | 04 07 2004 |
03/07/04 | Date (YMD) | 2003 07 04 |
Two-digit year values are standardized as follows:
If an input year is greater than or equal to 00 and less than or equal to 03, the standardized year is 2000, 2001, 2002, or 2003.
If an input year is greater than or equal to 04 and less than or equal to 99, the standardized year falls in the range 1904-1999.
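The two-digit year rule can be sketched in a few lines of Python. This is not the locale's implementation, only a restatement of the cutoffs given above; the function names are hypothetical, and an input year of 00 is treated as 2000 because 2000 appears among the listed results.

```python
def standardize_two_digit_year(yy: int) -> int:
    """Expand a two-digit year using the cutoffs described above."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    # 00-03 map to 2000-2003; 04-99 map to 1904-1999.
    return 2000 + yy if yy <= 3 else 1900 + yy


def format_mdy(month: int, day: int, yy: int) -> str:
    """Render month, day, and expanded year separated by spaces."""
    return f"{month:02d} {day:02d} {standardize_two_digit_year(yy)}"


print(format_mdy(7, 4, 3))  # 07 04 2003
print(format_mdy(7, 4, 4))  # 07 04 1904
```

The two printed values match the Date (MDY) rows for the inputs July04, 03 and July 04 04 in the table above.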
Gender Analysis, Locale Guess, and Identification Definitions
Gender analysis, locale guess, and identification definitions enable you to make determinations about character values. With these definitions you can determine:
the gender of an individual based on a name value
the locale that is the most suitable for a given character value
the category of a value, which is chosen from a set of available categories.
Locale guess definitions allow the software to determine the locale that is most likely represented by a character value. All locales that are loaded into memory as part of the locale list are considered, but only if they contain the specified guess definition. If a definite locale determination cannot be made, the chosen locale is the first locale in the locale list. Locale guess definitions are referenced by the DQLOCALEGUESS Function.
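The selection and fallback behavior described above can be sketched as follows. Everything in this example is hypothetical: the scoring function, the definition name EXAMPLE GUESS, and the interpretation of "no definite determination" as a tie or an all-zero score are assumptions made for illustration, not the way the software actually evaluates locales.

```python
from typing import Callable, Dict, List, Set


def guess_locale(
    value: str,
    locale_list: List[str],
    definitions: Dict[str, Set[str]],
    guess_definition: str,
    score: Callable[[str, str], float],
) -> str:
    """Pick the best-scoring locale, falling back to the first in the list."""
    if not locale_list:
        raise ValueError("the locale list must not be empty")
    # Only locales that contain the specified guess definition are considered.
    candidates = [
        loc for loc in locale_list
        if guess_definition in definitions.get(loc, set())
    ]
    scored = [(score(value, loc), loc) for loc in candidates]
    if scored:
        best = max(s for s, _ in scored)
        winners = [loc for s, loc in scored if s == best]
        # Assumption: a single positive top score counts as "definite".
        if best > 0 and len(winners) == 1:
            return winners[0]
    # No definite determination: use the first locale in the locale list.
    return locale_list[0]


# Hypothetical locale list and definition inventory.
locales = ["ENUSA", "DEDEU", "FRFRA"]
definitions = {
    "ENUSA": {"EXAMPLE GUESS"},
    "DEDEU": {"EXAMPLE GUESS"},
    "FRFRA": set(),  # lacks the guess definition, so it is never considered
}


def toy_score(value: str, locale: str) -> float:
    """Toy scorer: 1.0 if a locale-specific hint word appears in the value."""
    hints = {"ENUSA": "street", "DEDEU": "strasse", "FRFRA": "rue"}
    return 1.0 if hints[locale] in value.lower() else 0.0


print(guess_locale("Hauptstrasse 5", locales, definitions, "EXAMPLE GUESS", toy_score))  # DEDEU
print(guess_locale("12 rue Cler", locales, definitions, "EXAMPLE GUESS", toy_score))     # ENUSA
```

In the second call, FRFRA is skipped because it does not contain the guess definition, no remaining locale scores above zero, and the first locale in the list is returned.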
Identification definitions are used to categorize character values. For example, using the Entity identification definition in the ENUSA locale, a name value can apply to an individual or an organization. Identification definitions are referenced by the DQIDENTIFY Function.
Pattern Analysis Definitions
Pattern analysis definitions enable you to determine whether an input character value contains characters that are alphabetic, numeric, non-alphanumeric (punctuation marks or symbols), or a mixture of alphanumeric and non-alphanumeric. The ENUSA locale contains two pattern analysis definitions, WORD and CHARACTER. The WORD definition generates one character of analytical information for each word in the input character value. The CHARACTER definition generates one character of analytical information for each character in the input character value. Pattern analysis definitions are referenced by the DQPATTERN Function.
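As a rough illustration of character-level analysis, the Python sketch below emits one symbol for each input character. The symbols used here (A for alphabetic, N for numeric, * for non-alphanumeric) and the choice to pass blank spaces through unchanged are assumptions made for this sketch; the actual output of the CHARACTER definition is documented with the DQPATTERN function.

```python
def character_pattern(value: str) -> str:
    """Return one analysis symbol per input character (symbols are assumed)."""
    symbols = []
    for ch in value:
        if ch.isspace():
            symbols.append(ch)   # assumption: blanks are kept as-is
        elif ch.isalpha():
            symbols.append("A")  # alphabetic
        elif ch.isdigit():
            symbols.append("N")  # numeric
        else:
            symbols.append("*")  # punctuation marks or symbols
    return "".join(symbols)


print(character_pattern("32CT"))    # NNAA
print(character_pattern("Ian M."))  # AAA A*
```

A word-level analysis would instead summarize each blank-delimited word, which is why the two definitions can return results of different lengths for the same input.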
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.