SAS Quality Knowledge Base for Contact Information 25
In the SAS Quality Knowledge Base, the Hebrew definitions are shared by all Hebrew-language locales. Shared Hebrew definitions are described below.
Case Definitions
Gender Analysis Definitions
Identification Analysis Definitions
Match Definitions
Parse Definitions
Pattern Analysis Definitions
Standardization Definitions
Inherited Definitions

Case Definitions
None.
Gender Analysis Definitions

Name | ||
---|---|---|
Description | The Name gender analysis definition determines the gender of a name. | |
Possible Outputs | M F U |
|
Input | Output | |
Examples | בן-שששש סיקולובסקי | M |
בת-שששש סיקולובסקי | F | |
דוד סיקולובסקי | M | |
יעל סיקולובסקי | F | |
עדי סיקולובסקי | U | |
אדון עדי סיקולובסקי | M | |
גברת עדי סיקולובסקי | F | |
סיקולובסקי | U | |
מלכה וינשטיין | F | |
שאול וינשטיין | M | |
Remarks | Since a very large percentage of Hebrew names are not gender specific, an 80% threshold is applied. A gender is identified if the name is associated with that gender at least 80% of the time. |
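The 80% threshold described in the Remarks can be illustrated with a minimal Python sketch. The function name, the count-based input, and the sample counts are hypothetical and are not part of the QKB; only the 80% rule and the M, F, and U outputs come from the definition above.

```python
def classify_gender(male_count: int, female_count: int, threshold: float = 0.80) -> str:
    """Return 'M', 'F', or 'U' for a name given how often it is seen with each gender.

    Hypothetical helper: the counts stand in for whatever reference data
    associates a name with a gender.
    """
    total = male_count + female_count
    if total == 0:
        return "U"  # no evidence at all
    if male_count / total >= threshold:
        return "M"
    if female_count / total >= threshold:
        return "F"
    return "U"  # neither gender reaches the threshold


# Made-up counts for illustration only.
print(classify_gender(95, 5))   # M
print(classify_gender(10, 90))  # F
print(classify_gender(55, 45))  # U
```

A name that is split 55/45 between genders stays at U, which mirrors how a name without a title is reported as U in the examples while the same name with a gendered title is resolved.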
Identification Analysis Definitions

Field Name | ||
---|---|---|
Description |
The Field Name identification analysis definition identifies database column names. |
|
Possible Outputs | NAME ORGANIZATION ADDRESS CITY STATE/PROVINCE POSTALCODE COUNTRY PHONE DATE UNKNOWN URL GENDER MATCHCODE PERSONAL_ID ORGANIZATION_ID GENERIC_ID COUNTY MARITAL_STATUS |
|
Input | Output | |
Examples | Company Name | ORGANIZATION |
HEVRA | ORGANIZATION | |
Address | ADDRESS | |
כתובת | ADDRESS | |
Telephone | PHONE | |
MISPAR_AVODAH | PHONE | |
Remarks |
Use this definition to determine the type of data stored in a database column based on the name of the column. |
|
The Field Name (v23) identification analysis definition is deprecated and will be removed in a future release of the QKB. The Field Name identification analysis definition has been replaced with a copy of the Field Name (v23) definition that takes advantage of updated processing. If you changed your jobs to use the Field Name (v23) definition, change them back to use the Field Name definition. |
Field Name (v23) | ||
---|---|---|
Description | The Field Name (v23) identification analysis definition identifies database column names. | |
Possible Outputs | NAME ORGANIZATION ADDRESS CITY STATE/PROVINCE POSTALCODE COUNTRY PHONE DATE UNKNOWN URL GENDER MATCHCODE PERSONAL_ID ORGANIZATION_ID GENERIC_ID COUNTY MARITAL_STATUS |
|
Input | Output | |
Examples | Company Name | ORGANIZATION |
HEVRA | ORGANIZATION | |
Address | ADDRESS | |
כתובת | ADDRESS | |
Telephone | PHONE | |
MISPAR_AVODAH | PHONE | |
Remarks |
Use this definition to determine the type of data stored in a database column based on the name of the column. |
|
The Field Name (v23) identification analysis definition is deprecated and will be removed in a future release of the QKB. The Field Name identification analysis definition has been replaced with a copy of the Field Name (v23) definition that takes advantage of updated processing. If you changed your jobs to use the Field Name (v23) definition, change them back to use the Field Name definition. |
Match Definitions

Date (DMY) | ||
---|---|---|
Description | The Date (DMY) match definition generates match codes which can be used to cluster records containing dates that have the format DMY. | |
Max Length of Match Code | 15 characters | |
Input | Cluster ID | |
Examples | 05March1969 | 1 |
5-3-1969 | 1 | |
Remarks |
Note: The results listed above reflect the default match sensitivity (85). |
Date (DMY) (with Combinations) | ||
---|---|---|
Description | The Date (DMY) (with Combinations) match definition generates match codes, each with a score, that can be used to cluster records containing dates in DMY format. | |
Max Length of Match Code | 15 characters | |
Example 1 | Input | Cluster ID |
Sensitivities 50 - 100 Weight 100 |
01/02/2013 | 0 |
Feb-01-2013 | 0 | |
Example 2 | Input | Cluster ID |
Sensitivities 50 - 100 Weight 40 |
Feb-01-2013 | 1 |
01/02/2013 | 1 | |
02/01/2013 | 1 | |
Remarks |
A date in DMY format can match a date in MDY format.
This definition generates one or more match codes for each input string. The number of match codes generated depends on the content of the string. Each match code represents a combination of different parts of the input string, which enables two strings to be matched even when some parts of one or both strings differ. See the examples above for an illustration of clusters that can be produced using match codes generated by this definition.
Because multiple match codes are generated, a record can be placed in more than one cluster by a subsequent clustering operation. Give special attention to the entity resolution process when using this definition.
Multiple match codes are generated through token-combination rules in the match definition. Each match code is associated with one token-combination rule, and each rule has a weight. The score for a match code equals the weight of the rule that generated it times the sensitivity selected when the definition is executed. When a record is clustered, the score for the record's match code represents the confidence that the record belongs in the cluster.
When different rules lead to identical clustering results, the scores of the match codes generated by those rules can be aggregated using the Cluster Aggregation node in a Data Job. The node supports several aggregation methods, such as minimum, maximum, or mean across instances of a record, or minimum, maximum, or mean across all records in a cluster.
For information on the Cluster Aggregation node, refer to the documentation provided with the DataFlux Data Management Studio installation. |
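The scoring and aggregation described in the Remarks can be sketched in Python. This is only an illustration under stated assumptions: the rule weight is read as a percentage, so the product is divided by 100 (the QKB text says only "weight times sensitivity"), and `match_code_score` and `aggregate_scores` are hypothetical names, not Cluster Aggregation node APIs.

```python
from statistics import mean


def match_code_score(rule_weight: float, sensitivity: float) -> float:
    """Score = rule weight times sensitivity.

    Assumption: the weight is read as a percentage (0-100), so the product
    is divided by 100 to stay on the sensitivity scale.
    """
    return rule_weight * sensitivity / 100.0


def aggregate_scores(scores: list[float], method: str = "mean") -> float:
    """Combine the scores of several match codes for one record."""
    methods = {"min": min, "max": max, "mean": mean}
    if method not in methods:
        raise ValueError(f"unknown aggregation method: {method}")
    return methods[method](scores)


# One record whose match codes came from two rules (weights 100 and 40),
# both generated at sensitivity 85.
scores = [match_code_score(100, 85), match_code_score(40, 85)]
print(scores)                            # [85.0, 34.0]
print(aggregate_scores(scores, "min"))   # 34.0
print(aggregate_scores(scores, "max"))   # 85.0
print(aggregate_scores(scores, "mean"))  # 59.5
```

Choosing the minimum is the conservative option, because a record then inherits the confidence of its weakest rule. The maximum reflects the strongest evidence available.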
Date (MDY) | ||
---|---|---|
Description | The Date (MDY) match definition generates match codes which can be used to cluster records containing dates that have the format MDY. | |
Max Length of Match Code | 15 characters | |
Input | Cluster ID | |
Examples | 8.30.1997 | 0 |
August 30th, 1997 | 0 | |
Remarks |
Note: The results listed above reflect the default match sensitivity (85). |
Date (MDY) (with Combinations) | ||
---|---|---|
Description | The Date (MDY) (with Combinations) match definition generates match codes, each with a score, that can be used to cluster records containing dates in MDY format. | |
Max Length of Match Code | 15 characters | |
Example 1 | Input | Cluster ID |
Sensitivities 50 - 100 Weight 100 |
01/02/2013 | 0 |
Jan-02-2013 | 0 | |
Example 2 | Input | Cluster ID |
Sensitivities 50 - 100 Weight 40 |
Jan-02-2013 | 1 |
02/01/2013 | 1 | |
Remarks |
A date in MDY format can match a date in DMY format.
This definition generates one or more match codes for each input string. The number of match codes generated depends on the content of the string. Each match code represents a combination of different parts of the input string, which enables two strings to be matched even when some parts of one or both strings differ. See the examples above for an illustration of clusters that can be produced using match codes generated by this definition.
Because multiple match codes are generated, a record can be placed in more than one cluster by a subsequent clustering operation. Give special attention to the entity resolution process when using this definition.
Multiple match codes are generated through token-combination rules in the match definition. Each match code is associated with one token-combination rule, and each rule has a weight. The score for a match code equals the weight of the rule that generated it times the sensitivity selected when the definition is executed. When a record is clustered, the score for the record's match code represents the confidence that the record belongs in the cluster.
When different rules lead to identical clustering results, the scores of the match codes generated by those rules can be aggregated using the Cluster Aggregation node in a Data Job. The node supports several aggregation methods, such as minimum, maximum, or mean across instances of a record, or minimum, maximum, or mean across all records in a cluster.
For information on the Cluster Aggregation node, refer to the documentation provided with the DataFlux Data Management Studio installation. |
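To illustrate why a record can land in more than one cluster, the toy Python sketch below generates one key per plausible reading of an ambiguous numeric date and groups records by shared keys. It is not the QKB's token-combination logic; `candidate_keys` and the ISO-date key format are hypothetical stand-ins for match codes.

```python
from collections import defaultdict
from datetime import date


def candidate_keys(text: str) -> set[str]:
    """Return ISO-date keys for each valid reading of 'a/b/yyyy' (MDY and DMY)."""
    a, b, year = (int(part) for part in text.split("/"))
    keys = set()
    for month, day in ((a, b), (b, a)):  # MDY reading, then DMY reading
        try:
            keys.add(date(year, month, day).isoformat())
        except ValueError:
            pass  # not a real calendar date under this reading
    return keys


records = ["01/02/2013", "02/01/2013", "13/02/2013"]
clusters = defaultdict(list)
for record in records:
    for key in candidate_keys(record):
        clusters[key].append(record)

for key in sorted(clusters):
    print(key, clusters[key])
```

In this sketch the first two records each appear in two clusters, which is the situation the Remarks call out, while the third record has only one valid reading and appears once.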
Date (YMD) | ||
---|---|---|
Description | The Date (YMD) match definition generates match codes which can be used to cluster records containing dates that have the format YMD. | |
Max Length of Match Code | 15 characters | |
Input | Cluster ID | |
Examples | 2002dec31 | 1 |
2002.12.31 | 1 | |
Remarks |
Note: The results listed above reflect the default match sensitivity (85). |
Field Name | ||
---|---|---|
Description | The Field Name match definition generates match codes which can be used to cluster records containing database field names. | |
Max Length of Match Code | 15 characters | |
Input | Cluster ID | |
Examples | Company Name | 0 |
HEVRA | 0 | |
Address | 1 | |
כתובת | 1 | |
MISPAR_AVODAH | 2 | |
Phone | 2 | |
Remarks | This definition should be used to find potential matches between database column names. | |
Note: The results listed above reflect the default match sensitivity (85). |
Name | ||
---|---|---|
Description | The Name match definition generates match codes which can be used to cluster records containing names of individuals. | |
Max Length of Match Code | 27 characters | |
Input | Cluster ID | |
Examples | דוד משה נתניהו | 0 |
דוד ת. נתניהו | 0 | |
דוד נתניהו | 0 | |
רונית לפידור-בלו | 1 | |
רונית בלו | 1 | |
בָרוּךְ נָתַן קֵיְיְ | 2 | |
ברוך נתן קיי | 2 | |
ח'אלד זשצ'ירינסקי | 3 | |
חאלד זשצירינסקי | 3 | |
איתן מ' ר' הלוי | 4 | |
איתן מ. ר. הלוי | 4 | |
מיקי מורשת | 5 | |
מיכאל מורשת | 5 | |
Remarks | Note: The results listed above reflect the default match sensitivity (85). |
Organization | ||
---|---|---|
Description | The Organization match definition generates match codes which can be used to cluster records containing organization names. | |
Max Length of Match Code | 35 characters | |
Input | Cluster ID | |
Examples | מ.י.ה. מחשבים בע"מ, סניף הרצליה | 0 |
מ.י.ה. מחשבים בע"מ, תל אביב | 0 | |
מ.י.ה. מחשבים בע"מ | 0 | |
מ.י.ה. מחשבים | 0 | |
החברה ג.ג.ג. | 1 | |
ג.ג.ג. חברה | 1 | |
חברת ג.ג.ג. | 1 | |
ג.ג.ג. ושותפיו | 1 | |
ג.ג.ג. | 1 | |
הבנק הבינ"ל הראשון | 2 | |
הבנק הבינ"ל ה-1 | 2 | |
א.ב.ג. עמותה | 3 | |
א.ב.ג. תאגיד | 3 | |
א.ב.ג. בע"מ | 3 | |
א.ב.ג. | 3 | |
Remarks | Note: The results listed above reflect the default match sensitivity (85). |
Parse Definitions

Date (DMY) | |||
---|---|---|---|
Description | The Date (DMY) parse definition parses dates with format DMY into a set of tokens. | ||
Output Tokens | Day Month Year |
||
Input | Output | ||
Example | 05March1969 | Day | 05 |
Month | March | ||
Year | 1969 | ||
Remarks |
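As a rough illustration of the Day, Month, and Year tokens, the following Python sketch splits a DMY string with a regular expression. The pattern and the `parse_dmy` helper are assumptions made for this example and do not reproduce the QKB parse definition, which likely handles many more input variations.

```python
import re

# Day, optional separator, month (name or number), optional separator, year.
DMY_PATTERN = re.compile(
    r"^\s*(?P<Day>\d{1,2})[\s./-]*(?P<Month>[A-Za-z]+|\d{1,2})[\s./-]*(?P<Year>\d{2,4})\s*$"
)


def parse_dmy(text: str) -> dict[str, str]:
    """Split a DMY date string into Day, Month, and Year tokens (empty if no match)."""
    match = DMY_PATTERN.match(text)
    if match is None:
        return {"Day": "", "Month": "", "Year": ""}
    return match.groupdict()


print(parse_dmy("05March1969"))  # {'Day': '05', 'Month': 'March', 'Year': '1969'}
print(parse_dmy("5-3-1969"))     # {'Day': '5', 'Month': '3', 'Year': '1969'}
```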
Date (MDY) | |||
---|---|---|---|
Description | The Date (MDY) parse definition parses dates with format MDY into a set of tokens. | ||
Output Tokens | Day Month Year |
||
Input | Output | ||
Example | 03-05-1969 | Day | 05 |
Month | 03 | ||
Year | 1969 | ||
Remarks |
Date (YMD) | |||
---|---|---|---|
Description | The Date (YMD) parse definition parses dates with format YMD into a set of tokens. | ||
Output Tokens | Day Month Year |
||
Input | Output | ||
Example | 1969.3.5 | Day | 5 |
Month | 3 | ||
Year | 1969 | ||
Remarks |
Name | |||
---|---|---|---|
Description | The Name parse definition parses names of individuals into a set of tokens. | ||
Output Tokens | Prefix Given Name Middle Name Family Name Suffix Title/Additional Info |
||
Input | Output | ||
Example 1 | ד"ר יוסף יצחק קליין | Prefix | ד"ר |
Given Name | יוסף | ||
Middle Name | יצחק | ||
Family Name | קליין | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 2 | Dr. James Goodnight, CEO | Prefix | Dr. |
Given Name | James | ||
Middle Name | |||
Family Name | Goodnight | ||
Suffix | |||
Title/Additional Info | CEO | ||
Input | Output | ||
Example 3 | יצחק-דוד פרלמן | Prefix | |
Given Name | יצחק-דוד | ||
Middle Name | |||
Family Name | פרלמן | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 4 | בן ציון נתניהו | Prefix | |
Given Name | בן ציון | ||
Middle Name | |||
Family Name | נתניהו | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 5 | רונית לפידור בלו | Prefix | |
Given Name | רונית | ||
Middle Name | |||
Family Name | לפידור בלו | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 6 | חיים משה | Prefix | |
Given Name | חיים | ||
Middle Name | |||
Family Name | משה | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 7 | משה חיים | Prefix | |
Given Name | משה | ||
Middle Name | |||
Family Name | חיים | ||
Suffix | |||
Title/Additional Info | |||
Remarks |
Name (Global) | |||
---|---|---|---|
Description | The Name (Global) parse definition parses names of individuals into a globally recognized set of tokens. | ||
Output Tokens | Prefix Given Name Middle Name Family Name Suffix Title/Additional Info |
||
Input | Output | ||
Example 1 | ד"ר יוסף יצחק קליין | Prefix | ד"ר |
Given Name | יוסף | ||
Middle Name | יצחק | ||
Family Name | קליין | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 2 | Dr. James Goodnight, CEO | Prefix | Dr. |
Given Name | James | ||
Middle Name | |||
Family Name | Goodnight | ||
Suffix | |||
Title/Additional Info | CEO | ||
Input | Output | ||
Example 3 | יצחק-דוד פרלמן | Prefix | |
Given Name | יצחק-דוד | ||
Middle Name | |||
Family Name | פרלמן | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 4 | בן ציון נתניהו | Prefix | |
Given Name | בן ציון | ||
Middle Name | |||
Family Name | נתניהו | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 5 | רונית לפידור בלו | Prefix | |
Given Name | רונית | ||
Middle Name | |||
Family Name | לפידור בלו | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 6 | חיים משה | Prefix | |
Given Name | חיים | ||
Middle Name | |||
Family Name | משה | ||
Suffix | |||
Title/Additional Info | |||
Input | Output | ||
Example 7 | משה חיים | Prefix | |
Given Name | משה | ||
Middle Name | |||
Family Name | חיים | ||
Suffix | |||
Title/Additional Info | |||
Remarks |
Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales. |
Name (Multiple Name) | |||
---|---|---|---|
Description | The Name (Multiple Name) parse definition parses strings that contain the names of two individuals into a set of tokens. | ||
Output Tokens | Name 1 Name 2 |
||
Input | Output | ||
Example 1 | אדון אייל וגברת יעל מורשת | Name 1 | אדון אייל מורשת |
Name 2 | גברת יעל מורשת | ||
Input | Output | ||
Example 2 | מר וגב' נתניהו | Name 1 | מר נתניהו |
Name 2 | גב' נתניהו | ||
Input | Output | ||
Example 3 | מר וגב' בנימין נתניהו | Name 1 | מר בנימין נתניהו |
Name 2 | גב' נתניהו | ||
Input | Output | ||
Example 4 | משה פרלמן ואייל מורשת | Name 1 | משה פרלמן |
Name 2 | אייל מורשת | ||
Input | Output | ||
Example 5 | אייל מורשת | Name 1 | אייל מורשת |
Name 2 | |||
Input | Output | ||
Example 6 | יוסף אהרון ורד | Name 1 | יוסף אהרון ורד |
Name 2 | |||
Remarks | If only one name is present in the input, it is placed in the Name 1 token and the Name 2 token is left empty. |
Organization | |||
---|---|---|---|
Description | The Organization parse definition parses company and organization information into a set of tokens. | ||
Output Tokens | Name Legal Form Site Additional Info |
||
Input | Output | ||
Example 1 | מ.י.ה. מחשבים בע"מ סניף הרצליה (תוכנה) | Name | מ.י.ה. מחשבים |
Legal Form | בע"מ | ||
Site | סניף הרצליה | ||
Additional Info | (תוכנה) | ||
Input | Output | ||
Example 2 | אוניברסיטת בן גוריון, שלוחת אילת | Name | אוניברסיטת בן גוריון |
Legal Form | |||
Site | שלוחת אילת | ||
Additional Info | |||
Input | Output | ||
Example 3 | מסעדת עזרא ובניו, ירושלים | Name | מסעדת עזרא ובניו |
Legal Form | |||
Site | ירושלים | ||
Additional Info | |||
Input | Output | ||
Example 4 | מסעדת ירושלים | Name | מסעדת ירושלים |
Legal Form | |||
Site | |||
Additional Info | |||
Remarks |
Organization (Global) | |||
---|---|---|---|
Description | The Organization (Global) parse definition parses company and organization names into a globally recognized set of tokens. | ||
Output Tokens | Name Legal Form Site Additional Info |
||
Input | Output | ||
Example 1 | מ.י.ה. מחשבים בע"מ סניף הרצליה (תוכנה) | Name | מ.י.ה. מחשבים |
Legal Form | בע"מ | ||
Site | סניף הרצליה | ||
Additional Info | (תוכנה) | ||
Input | Output | ||
Example 2 | אוניברסיטת בן גוריון, שלוחת אילת | Name | אוניברסיטת בן גוריון |
Legal Form | |||
Site | שלוחת אילת | ||
Additional Info | |||
Input | Output | ||
Example 3 | מסעדת עזרא ובניו, ירושלים | Name | מסעדת עזרא ובניו |
Legal Form | |||
Site | ירושלים | ||
Additional Info | |||
Input | Output | ||
Example 4 | מסעדת ירושלים | Name | מסעדת ירושלים |
Legal Form | |||
Site | |||
Additional Info | |||
Remarks |
Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales. |
Pattern Analysis Definitions
None.
Standardization Definitions

Date (DMY) | ||
---|---|---|
Description | The Date (DMY) standardization definition standardizes dates that have format DMY. The output is a zero-padded two-digit day, followed by a zero-padded two-digit month, followed by a four-digit year. The day, month, and year are separated by spaces. | |
Input | Output | |
Examples | 04/07/02 | 04 07 2002 |
04July05 | 04 07 1905 | |
04.07.05 | 04 07 1905 | |
04July2005 | 04 07 2005 | |
04-07-2005 | 04 07 2005 | |
Remarks | If the input year is a two-digit value, it is assumed to fall within the hundred-year span that ends in 2019. For example, a two-digit year of 19 becomes 2019, but a two-digit year of 20 becomes 1920. |
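The two-digit year rule stated in the Remarks can be written as a short Python sketch. The function name and the `span_end` parameter are hypothetical; the sketch follows only the rule as worded in the Remarks (a hundred-year span ending in 2019) and is not taken from the QKB implementation.

```python
def expand_two_digit_year(two_digit_year: int, span_end: int = 2019) -> int:
    """Map a two-digit year into the 100-year span that ends at span_end."""
    if not 0 <= two_digit_year <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    century = span_end - span_end % 100  # e.g. 2000 for span_end 2019
    year = century + two_digit_year
    if year > span_end:                  # past the end of the span
        year -= 100
    return year


print(expand_two_digit_year(19))  # 2019
print(expand_two_digit_year(20))  # 1920
```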
Date (MDY) | ||
---|---|---|
Description | The Date (MDY) standardization definition standardizes dates that have format MDY. The output is a zero-padded two-digit month, followed by a zero-padded two-digit day, followed by a four-digit year. The month, day, and year are separated by spaces. | |
Input | Output | |
Examples | July04, 02 | 07 04 2002 |
07/04/02 | 07 04 2002 | |
July04, 05 | 07 04 1905 | |
07.04.05 | 07 04 1905 | |
July 4, 2005 | 07 04 2005 | |
07-04-2005 | 07 04 2005 | |
Remarks | If the input year is a two-digit value, it is assumed to fall within the hundred-year span that ends in 2019. For example, a two-digit year of 19 becomes 2019, but a two-digit year of 20 becomes 1920. |
Date (YMD) | ||
---|---|---|
Description | The Date (YMD) standardization definition standardizes dates that have format YMD. The output is a four-digit year, followed by a zero-padded two-digit month, followed by a zero-padded two-digit day. The year, month, and day are separated by spaces. | |
Input | Output | |
Examples | 02July04 | 2002 07 04 |
02/07/04 | 2002 07 04 | |
05July04 | 1905 07 04 | |
05.07.04 | 1905 07 04 | |
2005July04 | 2005 07 04 | |
2005-07-04 | 2005 07 04 | |
Remarks | If the input year is a two-digit value, it is assumed to fall within the hundred-year span that ends in 2019. For example, a two-digit year of 19 becomes 2019, but a two-digit year of 20 becomes 1920. |
Name | ||
---|---|---|
Description |
The Name standardization definition standardizes names of individuals. |
|
Input | Output | |
Examples | גונן אייל | אייל גונן |
דר דניאל לוין | ד"ר דניאל לוין | |
מ ש נתניהו | מ.ש. נתניהו | |
ברוך קיי (מנכ"ל) | ברוך קיי, מנכ"ל | |
Remarks |
Nikud Removal | ||
---|---|---|
Description | The Nikud Removal standardization definition removes Hebrew diacritics. | |
Input | Output | |
Examples | נְקֻדּוֹת | נקדות |
חֲטַף סֶגּוֹל | חטף סגול | |
קָמַץ מָלֵא | קמץ מלא | |
שַׁלְשֶׁ֓לֶת | שלשלת | |
פַּשְׁטָא֙ | פשטא | |
דָּוִד בֶּן-גּוּרִיּוֹן | דוד בן-גוריון | |
Remarks |
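A comparable effect can be approximated in Python with the standard `unicodedata` module. This sketch assumes that dropping every nonspacing combining mark (Unicode category Mn) is acceptable; that is broader than Hebrew alone and may differ from what the QKB definition does. The `remove_nikud` name is hypothetical.

```python
import unicodedata


def remove_nikud(text: str) -> str:
    """Drop nonspacing combining marks (Unicode category Mn) from the text.

    In the Hebrew block this covers vowel points and cantillation marks,
    while letters and punctuation such as hyphens are kept.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "nekudot" written with points, spelled out as escapes so the marks are visible.
pointed = "\u05e0\u05b0\u05e7\u05bb\u05d3\u05bc\u05d5\u05b9\u05ea"
print(remove_nikud(pointed) == "\u05e0\u05e7\u05d3\u05d5\u05ea")  # True
```

Filtering by category instead of a hard-coded code point range keeps punctuation such as the hyphen in a compound name intact.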
Organization | ||
---|---|---|
Description | The Organization standardization definition standardizes organization names. | |
Input | Output | |
Examples | מ.י.ה. מחשבים בערבון מוגבל | מ.י.ה. מחשבים בע"מ |
דני וולך, עו"ד | עו"ד דני וולך | |
מ.י.ה. מחשבים בע"מ הרצליה | מ.י.ה. מחשבים בע"מ, הרצליה | |
א.ד.מטלון | א.ד. מטלון | |
דרור אורטס - שפיגל | דרור אורטס-שפיגל | |
Remarks |
Inherited Definitions
In addition to the definitions listed on this page, all Hebrew-language locales also inherit all Global definitions.
Doc ID: QKBCI_HE_defs.html |