Identification Analysis Definitions
Identification analysis definitions specify logic that can be used to classify a data string. The output of an identification analysis definition is an Identity that classifies the input data according to the categories specified by the definition. For example:
- A Name (Single/Multiple) definition might return Single for John Smith and Multiple for Mr & Mrs Smith.
- An Address Type definition might return Contact, Extension, PO Box, or Street depending on the type of input.
Because there are many ways that data can be classified, identification analysis definitions have many uses. Some examples include the following:
- Perform analytics on the fields of a database table to determine which fields contain personal information, such as addresses and phone numbers.
- Aggregate the identities of a column in a database using data quality software to determine the type of data stored in that column.
- Analyze input data before applying another type of definition. For example, determine whether a string is an address or the name of a company before applying an Address match definition to the strings that represent addresses.
- Validate the contents of a data string by verifying that it conforms to the format of an email address.
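The column-profiling use case above can be illustrated with a minimal Python sketch. It is not SAS code: `identify` is a hypothetical stand-in for an identification analysis definition that knows only a rough e-mail shape, an EMPTY rule, and a default identity, and `profile_column` is an assumed helper name. The point is only to show how per-value identities can be aggregated to describe a column.

```python
import re
from collections import Counter

# Hypothetical stand-in for an identification analysis definition.
# It knows one rough rule (an e-mail shape), an EMPTY rule, and a default.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def identify(value, default="UNKNOWN"):
    """Return a single identity for one input string."""
    text = "" if value is None else value.strip()
    if text == "" or text.lower() == "null":
        return "EMPTY"
    if EMAIL_SHAPE.match(text):
        return "E-MAIL"
    return default  # nothing matched, so the default identity is returned


def profile_column(values):
    """Count how often each identity occurs in a column of values."""
    return Counter(identify(v) for v in values)


column = ["john.smith@sas.com", "", "jane@example.org", "G&*RE%^W"]
counts = profile_column(column)
print(counts.most_common())
```

The most frequent identity in the counter suggests what kind of data the column holds, while the UNKNOWN and EMPTY counts hint at how clean it is.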
When an identification analysis definition is unable to identify the input, the default identity is returned. The default identity could be any of the possible identities. However, it is often set to UNKNOWN or INVALID.
The number of identities that are returned by an identification analysis definition is determined by the definition and the DQ function that calls the definition. The legacy identification analysis definition returns a single identity determined to be the most likely or the default identity. When two or more identities tie for the most likely identity, such as New York, which is both a CITY and a STATE, the legacy identification analysis definition returns the default identity. New DQ functions that are available in SAS Viya 3.4 or later return all of the identities along with their relative likelihoods.
Field Content Identification Analysis Definition
Field Content identification analysis is a special type of identification analysis definition. It is designed to analyze any input string and determine what type of data it represents.
The identities available in the Field Content identification analysis definition fall into the following three groups:
- Global identities are for data that is global in scope, such as IBAN numbers, email addresses, and website URLs. Processing for these identities is shared among all of the Field Content definitions.
- Language-specific identities are for individual names and organizations. Processing for these identities is shared among Field Content definitions in locales that share the same language.
- Locale-specific identities are for data such as addresses, phone numbers, and cities that are specific to each locale. For example, the string CHICAGO would be identified as a city in the English, United States locale, but not in the French, France locale.
The identities returned by the Field Content identification analysis definition are described below. For examples of language- and locale-specific identities, consult the documentation for the definition within your locale.
Global Identities
Identity | Description | Examples |
---|---|---|
COUNTRY | A string is identified as COUNTRY if it consists of the common name of a country in English, French, or the language of the locale, or the three-character ISO code for a country. | United States of America Espainia AFG |
CURRENCY | A string is identified as CURRENCY if it consists of a currency code or symbol, optionally preceded or followed by a numeric value. | $150.00 |
DATE | A string is identified as DATE if it consists of a valid date with no time component. The position of the month and day are not specific to the locale. For example, 28/12/2017 is identified as DATE in any locale. Fully spelled and abbreviated month names are recognized in English and the language of the locale. | 9/12/2016 16-march-12 24-Avril-2015 20120316 1999 12 31 |
DATE/TIME | A string is identified as DATE/TIME if it consists of a date (as defined above) followed by a time value. The format for the time value is hh:mm:ss.ss optionally followed by time zone information and/or AM or PM. | 09.15.2012 03:53:00 pm 3/23/1998T12:34UTC+5:30 3/23/1998 12:34 CET |
E-MAIL | A string is identified as E-MAIL if it contains an embedded @ character and ends with a period followed by a valid top-level domain. | john.smith@sas.com |
EMPTY | If the input string contains no characters, only spaces, or the word null, it is identified as EMPTY. | <no characters> <whitespace only> null |
GEOGRAPHICAL POINT | A string is identified as GEOGRAPHICAL POINT if it consists of the following: | 35.89421911 139.94637467 50°40'46,461"N 95°48'26,533"W 40° 26.767′ N 79° 58.933′ W 50°40'46,461"N 79° 58.933′ W |
IBAN | A string is identified as IBAN if it consists of a two-character ISO country code followed by 13 to 32 alphanumeric characters. The string can be broken up by delimiters (space, hyphen, or period) after every fourth character. | DE44500105175407324931 SA03-8000-ABCD-6080-1016-7519 |
NETWORK ADDRESS | A string is identified as NETWORK ADDRESS if it consists of an IPV4, IPV6, or MAC formatted address. | 100.64.0.0 2001:0db8:0000:0:0:ff00:42:8329 48:2C:6A:1E:59:3D |
PAYMENT CARD NUMBER | A string is identified as PAYMENT CARD NUMBER if it conforms to the structure of one of the following payment card types: The digits can be separated into logical groupings delimited by spaces, hyphens, or periods. | Omitted for reasons of privacy. |
UNKNOWN | A string is identified as UNKNOWN if a single identity cannot be determined. See Ambiguous Data below. | G&*RE%^W |
URL | A string is identified as URL if it meets one of the following criteria: | http://www.sas.com google.com |
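As an illustration of how a structural rule such as the IBAN description above might be expressed, here is a minimal Python sketch. It is an assumption-laden approximation, not the definition's actual logic: the function name `looks_like_iban` is hypothetical, the country code is checked only as two letters rather than against the ISO list, and no checksum validation is attempted.

```python
import re

# Delimited form: groups of exactly four characters, each followed by a
# space, hyphen, or period, with a final group of one to four characters.
GROUPED = re.compile(r"^(?:[A-Za-z0-9]{4}[ .-])*[A-Za-z0-9]{1,4}$")

# Compact form: two letters (standing in for an ISO country code) followed
# by 13 to 32 alphanumeric characters.
COMPACT = re.compile(r"^[A-Za-z]{2}[A-Za-z0-9]{13,32}$")


def looks_like_iban(text):
    """Structural check only: no ISO country lookup and no checksum test."""
    s = text.strip()
    if re.search(r"[ .-]", s):
        # Delimiters are accepted only after every fourth character.
        if not GROUPED.match(s):
            return False
        s = re.sub(r"[ .-]", "", s)
    return bool(COMPACT.match(s))


for sample in ["DE44500105175407324931", "SA03-8000-ABCD-6080-1016-7519", "G&*RE%^W"]:
    print(sample, looks_like_iban(sample))
```

A structure-only check like this accepts strings that are not real IBANs, which is consistent with identification (classifying the type of data) rather than validation of a specific account number.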
Language Identities
Identity | Description | Examples |
---|---|---|
INDIVIDUAL | A string is identified as INDIVIDUAL if it consists of the name of a person (or in some instances, two persons) that is common in the language of the locale. | Details and examples of these identities are specific to the language. See the locale-level documentation for examples. |
FAMILY NAME | A string is identified as FAMILY NAME if it consists of a known family name in the locale. | |
GIVEN NAME | A string is identified as GIVEN NAME if it consists of a known given name in the locale. | |
MONTH | A string is identified as MONTH if it consists of a month name in the language of the locale or in English. | |
ORGANIZATION | A string is identified as ORGANIZATION if it meets any of the following criteria: | |
Locale Identities
Identity | Description | Examples |
---|---|---|
CITY | A string is identified as CITY if it consists of the name of a well-known city within the locale. | Details and examples of these identities are specific to the locale. See the locale-level documentation for examples. |
CITY-STATE/PROVINCE-POSTAL CODE | A string is identified as CITY-STATE/PROVINCE-POSTAL CODE if it consists of information used by the locale's postal system to identify the city of delivery, which usually includes a city name and a postal code, a state or province name, or both. | |
DELIVERY ADDRESS | A string is identified as DELIVERY ADDRESS if it consists of a street address, post office box, building information, or any combination thereof. Recipient information is optional. | |
FULL ADDRESS | A string is identified as FULL ADDRESS if it consists of a complete address as defined by the locale. This usually includes a DELIVERY ADDRESS (as defined above) followed by CITY-STATE/PROVINCE-POSTAL CODE information (as defined above). | |
PHONE | A string is identified as PHONE if it meets any of the following criteria: Strings of digits can be delimited with spaces, periods, or hyphens. | |
POSTAL CODE | A string is identified as POSTAL CODE if it conforms to the structure of a postal code within the locale. | |
STATE/PROVINCE | A string is identified as STATE/PROVINCE if it consists of the name or abbreviation of a state, province, or the locale's administrative equivalent within the locale. | |
<GOVERNMENT ID TYPE> | The <GOVERNMENT ID TYPE> designation is a placeholder for any sort of government-issued identification used within the locale. Common government IDs include Social Security numbers, voter registration codes, and vehicle registration codes. Each of these returns its own identity value. See the documentation for the locale to determine which government ID types are supported. | |
Ambiguous Data
Some input can be classified under more than one identity. When this occurs, the identity that is returned depends on the calling DQ function and the design of the identification analysis definition that is called.
The DQIDENTIFY function and the Identification Analysis node (currently available in DataFlux Data Management Studio) return only a single identity per input. Unless there is a tie, the identity that has the highest score, indicating that it is the most likely or most reasonable identity, will be returned. If there are multiple identities with the highest score, indicating that they are determined to be equally likely, the default identity is returned. The default identity can be any of the available identities, but it is usually set to UNKNOWN.
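The single-identity behavior described above can be sketched in a few lines of Python. This is not how DQIDENTIFY is implemented; `pick_single_identity` is a hypothetical helper, and the score dictionaries are made-up inputs used only to show the tie rule.

```python
def pick_single_identity(scores, default="UNKNOWN"):
    """Return the top-scoring identity, or the default on a tie or no match."""
    if not scores:
        return default
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    # Exactly one winner: return it. Two or more: fall back to the default.
    return winners[0] if len(winners) == 1 else default


print(pick_single_identity({"CITY": 90, "STATE/PROVINCE": 80}))  # CITY
print(pick_single_identity({"CITY": 80, "STATE/PROVINCE": 80}))  # UNKNOWN
print(pick_single_identity({}))                                  # UNKNOWN
```

The second call shows why tied scores are costly for single-identity callers: useful information (the input is either a city or a state) is replaced by the default identity.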
The following DQ functions are available in SAS Viya and can be used to obtain all of the identities that are generated by the identification analysis definition for each input:
- DQIDENTIFYINFOGET
- DQIDENTIFYMULTI
- DQIDENTIFYIDGET
Field Content identification analysis definitions are designed to work with both the single-identity and multiple-identity types of calling functions by avoiding tie scores. In the case of ambiguous data, where two or more identities are equally likely, a higher score is assigned to the identity that is more likely across the data as a whole. See the documentation for each locale's Field Content definition for the priorities that are used to rank the potential identities.
The following table shows the potential output for a Field Content definition in the English, United States locale that prioritizes CITY over STATE/PROVINCE, COUNTRY, and FAMILY NAME.
Data | Single Result | Multiple Identity Result: Identity | Multiple Identity Result: Score |
---|---|---|---|
NEW YORK | CITY | CITY | 90 |
| | STATE/PROVINCE | 80 |
COLUMBIA | CITY | CITY | 90 |
| | COUNTRY | 70 |
| | FAMILY NAME | 40 |
SEATTLE | CITY | CITY | 90 |
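The effect of a priority ordering on ambiguous input can be sketched as follows. This is a hedged Python illustration rather than the definition's actual scoring: `rank_by_priority` is a hypothetical helper, and the order of STATE/PROVINCE, COUNTRY, and FAMILY NAME relative to each other is an assumption inferred from the example scores, since the text states only that CITY is prioritized over the other three.

```python
# Assumed priority order for this example. Only "CITY first" is stated in
# the text; the order of the remaining identities is an assumption.
PRIORITY = ["CITY", "STATE/PROVINCE", "COUNTRY", "FAMILY NAME"]


def rank_by_priority(candidates, priority=PRIORITY):
    """Order equally likely candidate identities by a priority list."""
    position = {name: i for i, name in enumerate(priority)}
    # Identities missing from the priority list sort last, in input order.
    return sorted(candidates, key=lambda name: position.get(name, len(priority)))


ranked = rank_by_priority(["FAMILY NAME", "COUNTRY", "CITY"])
print(ranked)     # ['CITY', 'COUNTRY', 'FAMILY NAME']
print(ranked[0])  # CITY
```

Because the ranking never produces a tie, a single-identity caller can take the first entry, while a multiple-identity caller can use the whole ordered list.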