SAS Quality Knowledge Base for Contact Information 25

Chinese, China Definitions

Definitions for the Chinese, China locale are described below.

Case Definitions
Gender Analysis Definitions
Identification Analysis Definitions
Match Definitions

Parse Definitions

Pattern Analysis Definitions

Standardization Definitions

Inherited Definitions

Case Definitions

Proper (Name)
Description Propercases names written in the Latin alphabet.
  Input Output
Examples XIANG LI CHEN Xiang Li Chen
(赛仕总经理)LIDAWEI (赛仕总经理)Lidawei
Remarks  

 

Upper (Address)
Description Uppercases Latin characters found in address data.
  Input Output
Examples 益田新村106栋22g 益田新村106栋22G
黄兴路2005弄1号23号楼a座 黄兴路2005弄1号23号楼A座
cbd商务外环路1号蓝码地王大厦3001单元 CBD商务外环路1号蓝码地王大厦3001单元
Remarks  

 

Upper (Organization)
Description Uppercases Latin characters found in organization names. Well-known words are propercased where appropriate.
  Input Output
Examples 上海mwb互感器有限公司 上海MWB互感器有限公司
沃尔玛深国投百货有限公司成都sm广场分店 沃尔玛深国投百货有限公司成都SM广场分店
Remarks Certain well-known company names are propercased.
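
The case definitions above change only Latin letters and leave Chinese characters and digits untouched. The following Python sketch illustrates that behavior on the documented examples. It is not the QKB implementation: the names `upper_latin` and `proper_latin` are hypothetical, only ASCII letters are handled, and the special propercasing of well-known company names is not reproduced.

```python
import re

# Runs of ASCII Latin letters; Chinese characters and digits are never matched.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def upper_latin(text: str) -> str:
    """Uppercase every run of Latin letters (compare the Upper examples)."""
    return LATIN_RUN.sub(lambda m: m.group(0).upper(), text)


def proper_latin(text: str) -> str:
    """Capitalize the first letter of each Latin run and lowercase the rest."""
    return LATIN_RUN.sub(lambda m: m.group(0).capitalize(), text)


if __name__ == "__main__":
    print(upper_latin("益田新村106栋22g"))
    print(proper_latin("(赛仕总经理)LIDAWEI"))
```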

Gender Analysis Definitions

ID Number
Description Determines the gender associated with an ID number.
Possible Outputs M
F
U
  Input Output
Examples 130503196704010012 M
330108198503179268 F
0174685503(D) U
Remarks Gender is determined from the sequence code within the ID number.
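
As an illustration of the remark above, the sketch below assumes the 18-character resident identity number layout (GB 11643-1999), in which the 17th character is the last digit of the sequence code and an odd value denotes male, an even value female. The function name `gender_from_id` is hypothetical, and the actual definition may validate the input differently.

```python
import re

# Assumed layout: 17 digits followed by a check character (a digit or X).
ID18 = re.compile(r"[0-9]{17}[0-9Xx]")


def gender_from_id(id_number: str) -> str:
    """Return 'M', 'F', or 'U' for an ID number string."""
    s = id_number.strip()
    if not ID18.fullmatch(s):
        return "U"  # anything that is not an 18-character ID is unknown
    # The 17th character (index 16) is the last digit of the sequence code.
    return "M" if int(s[16]) % 2 == 1 else "F"


if __name__ == "__main__":
    for value in ("130503196704010012", "330108198503179268", "0174685503(D)"):
        print(value, gender_from_id(value))
```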

Identification Analysis Definitions

Individual/Organization
Description Determines whether a string represents the name of an individual or an organization.
Possible Outputs INDIVIDUAL
ORGANIZATION
UNKNOWN
  Input Output
Examples 张晓东 INDIVIDUAL
李大伟 赛仕(北京) INDIVIDUAL
司徒怀(先生) INDIVIDUAL
深圳海王药业有限公司 ORGANIZATION
李宁 INDIVIDUAL
李宁有限公司 ORGANIZATION
UNKNOWN
Remarks  

Match Definitions

Address
Description The Address match definition generates match codes which can be used to cluster records containing addresses.
Max Length of Match Code 237 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
人民北路群星广场开元大厦A单元2层2306室(底商) 1
人民北路群星广场开元大厦A单元2层2306室(商铺) 2
人民北路群星广场开元大厦A单元2层2307室(商铺) 3
人民北路群星广场开元大厦A单元3层2307室(商铺) 4
Remarks All components of the address are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 2

Sensitivities
90-94
人民北路群星广场开元大厦A单元2层2306室(底商) 1
人民北路群星广场开元大厦A单元2层2306室(商铺) 1
人民北路群星广场开元大厦A单元2层2307室(商铺) 2
人民北路群星广场开元大厦A单元3层2307室(商铺) 3
人民北路群星广场开元大厦B单元3层2307室(商铺) 4
Remarks Street, Block/Lane, Building, Unit, Floor, and Room are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID

Example 3

Sensitivities
85-89
人民北路群星广场开元大厦A单元2层2306室(商铺) 1
人民北路群星广场开元大厦A单元2层2307室(商铺) 1
人民北路群星广场开元大厦A单元3层2307室(商铺) 2
人民北路群星广场开元大厦B单元3层2307室(商铺) 3
人民北路群星广场开天大厦B单元3层2307室(商铺) 4
Remarks Street, Block/Lane, Building, Unit, and Floor are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 4

Sensitivities
80-84
人民北路群星广场开元大厦A单元2层2307室(商铺) 1
人民北路群星广场开元大厦A单元3层2307室(商铺) 1
人民北路群星广场开元大厦B单元3层2307室(商铺) 2
人民北路群星广场开天大厦B单元3层2307室(商铺) 3
Remarks Street, Block/Lane, Building, and Unit are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 5

Sensitivities
75-79
人民北路群星广场开元大厦A单元3层2307室(商铺) 1
人民北路群星广场开元大厦B单元3层2307室(商铺) 1
人民北路群星广场开天大厦B单元3层2307室(商铺) 2
人民北路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Street, Block/Lane, and Building are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 6

Sensitivities
70-74
人民公社北路群星广场开元大厦B单元3层2307室(商铺) 1
人民公社北路群星广场开天大厦B单元3层2307室(商铺) 1
人民公社南路群星广场开天大厦B单元3层2307室(商铺) 2
人民公社北路群众广场开天大厦B单元3层2307室(商铺) 3
人民公社南路群众广场开天大厦B单元3层2307室(商铺) 4
Remarks Street and Block/Lane are evaluated. Different forms of some words will match. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 7

Sensitivities
50-69
人民公社北路群星广场开元大厦B单元3层2307室(商铺) 1
人民公社南路群星广场开天大厦B单元3层2307室(商铺) 1
人民公社北路群众广场开天大厦B单元3层2307室(商铺) 2
人民公社南路群众广场开天大厦B单元3层2307室(商铺) 2
Remarks Fewer characters in Street and the same characters in Block/Lane are evaluated, because Block/Lane occurs more often than Street in the sample data. Different forms of some words will match. Note that fewer characters in the address are considered as the sensitivity is lowered.

 

Address (Full)
Description The Address (Full) match definition generates match codes which can be used to cluster records containing complete two-line addresses.
Max Length of Match Code 223 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2306室(底商) 123456 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2306室(商铺) 123456 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2306室(商铺) 邮编:123456 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2306室(商铺) 100052 2
Remarks All components of the address are evaluated except for additional info. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 2

Sensitivities
90-94
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2306室(商铺) 100052 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2307室(商铺) 100052 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元3层2307室(商铺) 2
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦B单元3层2307室(商铺) 3
Remarks Additional info, postal code, and room info are ignored. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 3

Sensitivities
85-89
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元2层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦B单元3层2307室(商铺) 2
广东省深圳市宝安区西乡镇人民北路群星广场开天大厦B单元3层2307室(商铺) 3
Remarks Additional info, postal code, room, and floor info are ignored. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 4

Sensitivities
80-84
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦A单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群星广场开天大厦B单元3层2307室(商铺) 2
广东省深圳市宝安区西乡镇人民北路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Additional info, postal code, room, floor, and unit info are ignored. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 5

Sensitivities
75-79
广东省深圳市宝安区西乡镇人民北路群星广场开元大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群星广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群众广场开天大厦B单元3层2307室(商铺) 2
广东省深圳市宝安区西乡镇人民南路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Additional info, postal code, room, floor, unit, and building info are ignored. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 6

Sensitivities
70-74
广东省深圳市宝安区西乡镇人民北路群星广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民北路群星广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民南路群众广场开天大厦B单元3层2307室(商铺) 2
广东省深圳市宝安区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Only province, city, district/prefecture/county, town/village, and street info are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 7

Sensitivities
65-69
广东省深圳市宝安区西乡镇人民北路群众广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西乡镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 2
广东省深圳市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Only province, city, district/prefecture/county, and town/village info are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 8

Sensitivities
60-64
广东省深圳市宝安区西乡镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市宝安区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 2
广东省中山市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Only province, city, and district/prefecture/county info are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 9

Sensitivities
55-59
广东省深圳市宝安区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广东省深圳市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广东省中山市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 2
广西省中山市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Only province and city info are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 10

Sensitivities
50-54
广东省深圳市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广东省中山市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 1
广西省中山市南山区西风镇人民南路群众广场开天大厦B单元3层2307室(商铺) 2
Remarks Only province info is evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.

 

Address (PO Box Only)
Description The Address (PO Box Only) match definition generates match codes which can be used to cluster records containing the PO Box portion of an address.
Max Length of Match Code 23 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
北京国际邮电局邮政信箱“100600-9082”号 1
北京国际邮电局邮政信箱第100600-9082号 1
北京邮政信箱100600-9082号 1
北京邮政一零零六零零-九零八二信箱 1
北京邮政100600-9082信箱 1
北京邮政100600-9010信箱 2
北京邮政100600-9999信箱 3
Remarks The first 13 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 2

Sensitivities
90-94
北京邮政100600-9082信箱 1
北京邮政100600-9010信箱 1
北京邮政100600-9999信箱 2
北京邮政100600-8888信箱 3
Remarks The first 9 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 3

Sensitivities
85-89
北京邮政100600-9010信箱 1
北京邮政100600-9999信箱 1
北京邮政100600-8888信箱 2
北京邮政1006007777信箱 3
Remarks The first 8 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 4

Sensitivities
80-84
北京邮政100600-9999信箱 1
北京邮政100600-8888信箱 1
北京邮政1006007777信箱 2
北京邮政100606-6666信箱 3
Remarks The first 7 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 5

Sensitivities
75-79
北京邮政100600-8888信箱 1
北京邮政1006007777信箱 1
北京邮政100606-6666信箱 2
北京邮政100655-5555信箱 3
Remarks The first 6 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 6

Sensitivities
70-74
北京邮政1006007777信箱 1
北京邮政100606-6666信箱 1
北京邮政100655-5555信箱 2
北京邮政100444-4444信箱 3
Remarks The first 5 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 7

Sensitivities
65-69
北京邮政100606-6666信箱 1
北京邮政100655-5555信箱 1
北京邮政100444-4444信箱 2
北京邮政103333-3333信箱 3
Remarks The first 4 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 8

Sensitivities
60-64
北京邮政100655-5555信箱 1
北京邮政100444-4444信箱 1
北京邮政103333-3333信箱 2
北京邮政122222-2222信箱 3
Remarks The first 3 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 9

Sensitivities
55-59
北京邮政100444-4444信箱 1
北京邮政103333-3333信箱 1
北京邮政122222-2222信箱 2
北京邮政211111-1111信箱 3
京211111-1111信箱 3
天津100101-88信箱 4
Remarks The first 2 digits of the PO Box number and the first 2 characters of the city are evaluated.
  Input Cluster ID
Example 10

Sensitivities
50-54
北京邮政103333-3333信箱 1
北京邮政122222-2222信箱 1
北京邮政211111-1111信箱 2
京211111-1111信箱 2
天津100101-88信箱 3
津100101-88信箱 3
上海100101-88信箱 4
Remarks The first digit of the PO Box number and the first 2 characters of the city are evaluated.
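
The remarks for this definition describe the comparison as a prefix of the PO Box number plus the first 2 characters of the city. The Python sketch below illustrates that idea under several assumptions: Chinese numerals map one-to-one to Arabic digits, the hyphen counts toward the prefix length (the example tables only cluster as shown if it does), and the input begins with the city name. The name `po_box_key` is hypothetical, and city abbreviations such as 京 for 北京 are not handled.

```python
# Map Chinese numerals to Arabic digits (零 and 〇 both mean zero).
CN_DIGITS = str.maketrans("零〇一二三四五六七八九", "00123456789")


def po_box_key(text: str, prefix_len: int) -> tuple[str, str]:
    """Return (first 2 characters of the city, prefix of the PO Box number)."""
    city = text[:2]  # assumption: the string starts with the city name
    normalized = text.translate(CN_DIGITS)
    # Keep digits and the hyphen; quotes, 第, 号, and other text are dropped.
    number = "".join(ch for ch in normalized if ch in "0123456789-")
    return city, number[:prefix_len]


if __name__ == "__main__":
    print(po_box_key("北京邮政一零零六零零-九零八二信箱", 13))
```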

 

Address (Street Only)
Description The Address (Street Only) match definition generates match codes which can be used to cluster records containing the street portion of an address. Because addresses containing PO Box information are rare for mainland China, this definition is a copy of the Address match definition.
Max Length of Match Code 237 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
人民北路群星广场开元大厦A单元2层2306室(底商) 1
人民北路群星广场开元大厦A单元2层2306室(商铺) 2
人民北路群星广场开元大厦A单元2层2307室(商铺) 3
人民北路群星广场开元大厦A单元3层2307室(商铺) 4
Remarks All components of the address are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 2

Sensitivities
90-94
人民北路群星广场开元大厦A单元2层2306室(底商) 1
人民北路群星广场开元大厦A单元2层2306室(商铺) 1
人民北路群星广场开元大厦A单元2层2307室(商铺) 2
人民北路群星广场开元大厦A单元3层2307室(商铺) 3
人民北路群星广场开元大厦B单元3层2307室(商铺) 4
Remarks Street, Block/Lane, Building, Unit, Floor, and Room are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 3

Sensitivities
85-89
人民北路群星广场开元大厦A单元2层2306室(商铺) 1
人民北路群星广场开元大厦A单元2层2307室(商铺) 1
人民北路群星广场开元大厦A单元3层2307室(商铺) 2
人民北路群星广场开元大厦B单元3层2307室(商铺) 3
人民北路群星广场开天大厦B单元3层2307室(商铺) 4
Remarks Street, Block/Lane, Building, Unit, and Floor are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 4

Sensitivities
80-84
人民北路群星广场开元大厦A单元2层2307室(商铺) 1
人民北路群星广场开元大厦A单元3层2307室(商铺) 1
人民北路群星广场开元大厦B单元3层2307室(商铺) 2
人民北路群星广场开天大厦B单元3层2307室(商铺) 3
Remarks Street, Block/Lane, Building, and Unit are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 5

Sensitivities
75-79
人民北路群星广场开元大厦A单元3层2307室(商铺) 1
人民北路群星广场开元大厦B单元3层2307室(商铺) 1
人民北路群星广场开天大厦B单元3层2307室(商铺) 2
人民北路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Street, Block/Lane, and Building are evaluated. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 6

Sensitivities
70-74
人民北路群星广场开元大厦B单元3层2307室(商铺) 1
人民北路群星广场开天大厦B单元3层2307室(商铺) 1
人民北路群众广场开天大厦B单元3层2307室(商铺) 2
人民南路群众广场开天大厦B单元3层2307室(商铺) 3
Remarks Street and Block/Lane are evaluated. Different forms of some words will match. Note that fewer characters in the address are considered as the sensitivity is lowered.
  Input Cluster ID
Example 7

Sensitivities
50-69
人民幸福北路群星广场开天大厦B单元3层2307室(商铺) 1
人民幸福北路群众广场开天大厦B单元3层2307室(商铺) 2
人民幸福南路群众广场开天大厦B单元3层2307室(商铺) 2
Remarks Fewer characters in Street and the same characters in Block/Lane are evaluated, because Block/Lane occurs more often than Street in the sample data. Different forms of some words will match. Note that fewer characters in the address are considered as the sensitivity is lowered.

 

City
Description The City match definition generates match codes which can be used to cluster records containing city names.
Max Length of Match Code 85 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
乌鲁木齐市克孜勒苏柯尔克孜自治州 1
乌鲁木齐市克孜勒苏柯尔克孜地区 2
昆明市西双版纳傣族自治州 3
昆明市西双版纳傣族地区 4
Beijing 5
北京 5
Remarks The first 11 Chinese characters of the district/prefecture/county and the first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 2

Sensitivities
90-94
乌鲁木齐市克孜勒苏柯尔克孜自治州 1
乌鲁木齐市克孜勒苏柯尔克孜地区 1
昆明市西双版纳傣族自治州 2
昆明市西双版纳傣族地区 3
呼和浩特市阿拉善左旗 4
Beijing 5
北京 5
Remarks The first 8 Chinese characters of the district/prefecture/county and the first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 3

Sensitivities
85-89
乌鲁木齐市克孜勒苏柯尔克孜自治州 1
乌鲁木齐市克孜勒苏柯尔克孜地区 1
昆明市西双版纳傣族自治州 2
昆明市西双版纳傣族地区 2
呼和浩特市阿拉善左旗 3
Beijing 5
北京 5
Remarks The first 6 Chinese characters of the district/prefecture/county and the first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 4

Sensitivities
80-84
乌鲁木齐市克孜勒苏柯尔克孜自治州 1
乌鲁木齐市克孜勒苏柯尔克孜地区 1
昆明市西双版纳傣族自治州 2
昆明市西双版纳傣族地区 2
呼和浩特市阿拉善左旗 3
呼和浩特市阿拉善右旗 4
Beijing 5
北京 5
Remarks The first 4 Chinese characters of the district/prefecture/county and the first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 5

Sensitivities
75-79
昆明市西双版纳傣族地区 1
呼和浩特市阿拉善左旗 2
呼和浩特市阿拉善右旗 2
北京市昌平区 3
北京市宣武区 4
Beijing 5
北京 5
Remarks The first 3 Chinese characters of the district/prefecture/county and the first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 6

Sensitivities
70-74
昆明市西双版纳傣族地区 1
呼和浩特市阿拉善左旗 2
呼和浩特市阿拉善右旗 2
北京市昌平区 3
北京市宣武区 4
Beijing 5
北京 5
Remarks The first 2 Chinese characters of the district/prefecture/county and the first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 7

Sensitivities
65-69
昆明市西双版纳傣族地区 1
呼和浩特市阿拉善左旗 2
呼和浩特市阿拉善右旗 2
北京市昌平区 3
北京市宣武区 3
Beijing 5
北京 5
Remarks The first 6 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 8

Sensitivities
60-64
北京市昌平区 1
北京市宣武区 1
重庆市合川市 2
重庆市涪陵区 3
张家口市桥西区 4
Beijing 5
北京 5
Remarks The first 4 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 9

Sensitivities
55-59
北京市昌平区 1
北京市宣武区 1
重庆市合川市 2
重庆市涪陵区 2
张家口市桥西区 3
Beijing 5
北京 5
Remarks The first 3 Chinese characters of the city are evaluated.
  Input Cluster ID
Example 10

Sensitivities
50-54
北京市昌平区 1
北京市宣武区 1
重庆市合川市 2
重庆市涪陵区 2
张家口市桥西区 3
Beijing 5
北京 5
Remarks The first 2 Chinese characters of the city are evaluated.

 

City - State/Province - Postal Code
Description The City - State/Province - Postal Code match definition generates match codes which can be used to cluster records containing last line address information.
Max Length of Match Code 36 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
福建省泉州市丰田572000 1
福建省泉州市丰泽572000 1
福建省泉州市丰泽572001 2
福建省泉州市丰泽572012 3
Remarks Province, City, and Postal Code are evaluated.
  Input Cluster ID
Example 2

Sensitivities
90-94
福建省泉州市丰泽572000 1
福建省泉州市丰泽572001 1
福建省泉州市丰泽572012 2
福建省泉州市丰泽572123 3
Remarks Province, City, and the first 5 digits of Postal Code are evaluated.
  Input Cluster ID
Example 3

Sensitivities
85-89
福建省泉州市丰泽572001 1
福建省泉州市丰泽572012 1
福建省泉州市丰泽572123 2
福建省泉州市丰泽573234 3
Remarks Province, City, and the first 4 digits of Postal Code are evaluated.
  Input Cluster ID
Example 4

Sensitivities
80-84
福建省泉州市丰泽572012 1
福建省泉州市丰泽572123 1
福建省泉州市丰泽573234 2
福建省泉州市丰泽584345 3
Remarks Province, City, and the first 3 digits of Postal Code are evaluated.
  Input Cluster ID
Example 5

Sensitivities
75-79
福建省泉州市丰泽572123 1
福建省泉州市丰泽573234 1
福建省泉州市丰泽584345 2
福建省泉州市丰泽FR-12345 3
Remarks Province, City, and the first 2 digits of Postal Code are evaluated.
  Input Cluster ID
Example 6

Sensitivities
70-74
福建省泉州市丰泽573234 1
福建省泉州市丰泽584345 1
福建省泉州市丰泽FR-12345 2
福建省泉州市丰泽12345 2
Remarks Province, City, and the first digit of Postal Code are evaluated.
  Input Cluster ID
Example 7

Sensitivities
65-69
福建省泉州市丰泽012345 1
福建省厦门市朝阳234567 2
厦门市朝阳234567 3
广东省深圳市福田345678 4
Remarks Only Province and City are evaluated.
  Input Cluster ID
Example 8

Sensitivities
60-64
福建省泉州市丰泽012345 1
福建省厦门市朝阳234567 2
厦门市朝阳234567 2
广东省深圳市福田345678 3
Remarks Only City is evaluated.

 

Date
Description The Date match definition generates match codes which can be used to cluster records containing date information.
Max Length of Match Code 15 characters
  Input Cluster ID
Example 1

Sensitivities
85-100
2009-10-21 1
2009年10月21日 1
2009/10/21 1
10/21/2009 1
21-Oct-09 1
二零零九年十月二十一日 1
2009/10/22 2
2009/10/23 3
Remarks All digits of the year, month, and day are evaluated. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. Names of months match the corresponding digits that represent those months. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
  Input Cluster ID
Example 2

Sensitivities
80-84
2009/10/21 1
2009/10/22 1
2009/10/12 2
Remarks All digits of the year and month are evaluated. Only one digit of the day is evaluated. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. Names of months match the corresponding digits that represent those months. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
  Input Cluster ID
Example 3

Sensitivities
75-79
2009/10/15 1
2009/10/22 1
2009/11/12 2
Remarks All digits of the year and month are evaluated. The day is ignored. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. Names of months match the corresponding digits that represent those months. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
  Input Cluster ID
Example 4

Sensitivities
70-74
2009/10/15 1
2009/11/22 1
2009/09/12 2
Remarks All digits of the year are evaluated. Only one digit of the month is evaluated. The day is ignored. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. Names of months match the corresponding digits that represent those months. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
  Input Cluster ID
Example 5

Sensitivities
65-69
2009/10/15 1
2009/11/22 1
2008/09/12 2
Remarks All digits of the year are evaluated. The month and day are ignored. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
  Input Cluster ID
Example 6

Sensitivities
60-64
2009/10/15 1
2008/11/22 1
2012/09/12 2
Remarks Only the first 3 digits of the year are evaluated. The month and day are ignored. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. Names of months match the corresponding digits that represent those months. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
  Input Cluster ID
Example 7

Sensitivities
50-59
2009/10/15 1
2012/11/22 1
1990/09/12 2
Remarks Only the first 2 digits of the year are evaluated. The month and day are ignored. Full-width and half-width characters match. Chinese numerals and Arabic numerals match. Any separators (including Chinese characters) match. When the day and year are ambiguous, it is assumed that the last number is the year. It is assumed that two-digit sequences in the range 00-29 represent years in the range 2000-2029. It is assumed that two-digit sequences in the range 30-99 represent the years 1930-1999.
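
The two-digit year rule and the per-sensitivity truncation described in the Date remarks can be sketched in Python as follows. This is an illustration only: `expand_year` and `date_key` are hypothetical names, the date is assumed to be already parsed into numeric year, month, and day, and "one digit" is read as the leading digit of the zero-padded value, which is what the example tables suggest.

```python
def expand_year(two_digit: int) -> int:
    """Apply the pivot from the remarks: 00-29 -> 2000-2029, 30-99 -> 1930-1999."""
    if not 0 <= two_digit <= 99:
        raise ValueError("expected a value in the range 0-99")
    return 2000 + two_digit if two_digit <= 29 else 1900 + two_digit


def date_key(year: int, month: int, day: int, sensitivity: int) -> str:
    """Return the part of the date that is compared at a given sensitivity."""
    y, m, d = f"{year:04d}", f"{month:02d}", f"{day:02d}"
    if sensitivity >= 85:
        return y + m + d      # all digits of year, month, and day
    if sensitivity >= 80:
        return y + m + d[0]   # one digit of the day
    if sensitivity >= 75:
        return y + m          # day ignored
    if sensitivity >= 70:
        return y + m[0]       # one digit of the month
    if sensitivity >= 65:
        return y              # month and day ignored
    if sensitivity >= 60:
        return y[:3]          # first 3 digits of the year
    return y[:2]              # first 2 digits of the year (50-59)


if __name__ == "__main__":
    # 21-Oct-09 is read as 2009-10-21 under the pivot rule.
    print(date_key(expand_year(9), 10, 21, 90))
```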

 

Name
Description The Name match definition generates match codes which can be used to cluster records containing names of individuals.
Max Length of Match Code 21 characters
  Input Cluster ID
Example 1

Sensitivities
90-100
李友琴先生 1
李友琴 1
李友勤女士 2
黎友琴(总经理) 3
LIYOUQIN 4
Remarks The family name and given name are evaluated. The 5-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 11 digits for the family name pinyin code and the 12th-21st digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 2

Sensitivities
85-89
李期勤 1
李期 2
李奇 3
黎友 4
Remarks The family name and given name are evaluated. The 5-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 10 digits for the family name pinyin code and the 12th-21st digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 3

Sensitivities
80-84
李友琴 1
李友勤 1
黎友琴 2
LIYOUQIN 3
黎又青 4
Remarks The family name and given name are evaluated. The 5-bit pinyin code is applied to the Family Name token and the 3-bit pinyin code to the Given Name token. The first 9 digits for the family name pinyin code and the 12th-17th digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 4

Sensitivities
75-79
黎友琴 1
黎又青 1
李期勤 2
李期 3
黎友 4
Remarks The family name and given name are evaluated. The 3-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 8 digits for the family name pinyin code and the 12th-16th digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 5

Sensitivities
70-74
李期 1
李奇 1
黎友 2
欧阳修 3
欧阳休期 4
Remarks The family name and given name are evaluated. The 3-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 7 digits for the family name pinyin code and the 12th-15th digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 6

Sensitivities
65-69
李期勤 1
李期 1
李奇 1
欧阳修 2
欧阳休期 2
Remarks The family name and given name are evaluated. The 3-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 6 digits for the family name pinyin code and the 12th-14th digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 7

Sensitivities
60-64
李友琴 1
李友勤 1
黎友 1
李期勤 2
李奇 2
Remarks The family name and given name are evaluated. The 3-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 5 digits for the family name pinyin code and the 12th-13th digits for the given name pinyin code are evaluated.
  Input Cluster ID
Example 8

Sensitivities
55-59
李奇 1
黎友 2
Remarks The family name and given name are evaluated. The 3-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 4 digits for the family name pinyin code and the 12th digit for the given name pinyin code are evaluated.
  Input Cluster ID
Example 9

Sensitivities
50-54
李奇 1
黎友 1
Remarks Only the family name is evaluated. The 3-bit pinyin code is applied to both the Family Name token and the Given Name token. The first 3 digits for the family name pinyin code are evaluated.
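
Read literally, the Name remarks say which positions of the 21-character match code are compared at each sensitivity. The Python sketch below tabulates those positions and applies them to a placeholder string. It assumes the positions can be treated as slices of a single code; `BANDS` and `compared_part` are hypothetical names, and generating a real pinyin code is outside the scope of the sketch.

```python
# (lowest sensitivity of the band, family-name digits kept,
#  last given-name position kept, or None when the given name is ignored)
BANDS = [
    (90, 11, 21),
    (85, 10, 21),
    (80, 9, 17),
    (75, 8, 16),
    (70, 7, 15),
    (65, 6, 14),
    (60, 5, 13),
    (55, 4, 12),
    (50, 3, None),
]


def compared_part(match_code: str, sensitivity: int) -> tuple[str, str]:
    """Return the (family, given) portions compared at a sensitivity."""
    for floor, family_len, given_end in BANDS:
        if sensitivity >= floor:
            family = match_code[:family_len]
            # Position 12 (1-based) is index 11 in Python.
            given = match_code[11:given_end] if given_end else ""
            return family, given
    raise ValueError("sensitivities below 50 are not described")


if __name__ == "__main__":
    placeholder = "ABCDEFGHIJKLMNOPQRSTU"  # 21 characters, not a real match code
    print(compared_part(placeholder, 82))
```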

 

Organization
Description The Organization match definition generates match codes which can be used to cluster records containing organization names.
Max Length of Match Code 100 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
中国农业银行上海市分行 1
中国农业银行北京市分行 2
中国石化 3
中国石油化工股份有限公司 3
西门子医疗设备有限公司广东分公司 4
广东西门子医疗设备有限公司 5
  Input Cluster ID
Example 2

Sensitivities
90-94
中国农业银行上海市分行 1
中国农业银行北京市分行 2
中国石化 3
中国石油化工股份有限公司 3
广东西门子医疗设备有限公司 4
西门子医疗设备有限公司广东分公司 4
西门子医疗设备有限公司广东省分公司 5
  Input Cluster ID
Example 3

Sensitivities
85-89
中国农业银行上海市分行 1
中国农业银行北京市分行 2
中国石化 3
中国石油化工股份有限公司 3
广东西门子医疗设备有限公司 4
西门子医疗设备有限公司广东分公司 4
西门子医疗设备有限公司广东省分公司 4
  Input Cluster ID
Example 4

Sensitivities
75-84
中国农业银行上海市分行 1
中国农业银行北京市分行 1
中国石化 2
中国石油化工股份有限公司 2
广东西门子医疗设备有限公司 3
西门子医疗设备有限公司广东分公司 3
西门子医疗设备有限公司广东省分公司 3
上海西门子医疗设备有限公司广东分公司 4
  Input Cluster ID
Example 5

Sensitivities
65-74
中国农业银行上海市分行 1
中国农业银行北京市分行 1
中国石化 2
中国石油化工股份有限公司 2
广东西门子医疗设备有限公司 3
西门子医疗设备有限公司广东分公司 3
西门子医疗设备有限公司广东省分公司 3
上海西门子医疗设备有限公司广东分公司 3
天津三星视界移动有限公司 4
天津三星视界有限公司 5
  Input Cluster ID
Example 6

Sensitivities
55-64
中国农业银行上海市分行 1
中国农业银行北京市分行 1
中国石化 2
中国石油化工股份有限公司 2
广东西门子医疗设备有限公司 3
西门子医疗设备有限公司广东分公司 3
西门子医疗设备有限公司广东省分公司 3
上海西门子医疗设备有限公司广东分公司 3
天津三星视界移动有限公司 4
天津三星视界有限公司 4
上海三星通信技术有限公司 5
  Input Cluster ID
Example 7

Sensitivities
50-54
中国农业银行上海市分行 1
中国农业银行北京市分行 1
中国石化 1
中国石油化工股份有限公司 1
广东西门子医疗设备有限公司 2
西门子医疗设备有限公司广东分公司 2
西门子医疗设备有限公司广东省分公司 2
上海西门子医疗设备有限公司广东分公司 2
天津三星视界移动有限公司 3
天津三星视界有限公司 3
上海三星通信技术有限公司 3
Remarks For sensitivities 85-100, name and site information are evaluated. Legal forms and additional info are ignored. For sensitivities 50-84, only name is evaluated. Note that fewer characters are considered as the sensitivity is lowered.

 

Phone
Description The Phone match definition generates match codes which can be used to cluster records containing phone numbers.
Max Length of Match Code 22 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
+682 356648 1
+683 356648 2
593 6784569 3
594 6784569 4
3456789 5
3456780 6
34567891 7
34567890 8
4473000 ext 12345 9
4473000 ext 12346 9
4473000 ext 12356 10
Remarks First three digits of the country code are evaluated. All digits of the area code are evaluated. All seven digits of a 7-digit base number are evaluated. All eight digits of an 8-digit base number are evaluated. First four digits of the extension are evaluated.
  Input Cluster ID
Example 2

Sensitivities
90-94
+682 356648 1
+683 356648 2
593 6784569 3
594 6784569 4
3456789 5
3456780 6
34567891 7
34567890 8
4473000 ext 12345 9
4473000 ext 12346 9
4473000 ext 12356 9
Remarks First three digits of the country code are evaluated. All digits of the area code are evaluated. All seven digits of a 7-digit base number are evaluated. All eight digits of an 8-digit base number are evaluated. First two digits of the extension are evaluated.
  Input Cluster ID
Example 3

Sensitivities
85-89
+682 356648 1
+683 356648 2
593 6784569 3
594 6784569 4
3456789 5
3456780 5
3456700 6
34567891 7
34567890 7
34567800 8
4473000 ext 12345 9
4473000 ext 12346 9
4473000 ext 12356 9
Remarks First three digits of the country code are evaluated. All digits of the area code are evaluated. First six digits of a 7-digit base number are evaluated. First seven digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 4

Sensitivities
80-84
+682 356648 1
+683 356648 1
+62 356648 2
593 6784569 3
594 6784569 4
3456789 5
3456780 5
3456700 6
34567891 7
34567890 7
34567800 8
4473000 ext 12345 9
4473000 ext 12346 9
4473000 ext 12356 9
Remarks First two digits of the country code are evaluated. All digits of the area code are evaluated. First six digits of a 7-digit base number are evaluated. First seven digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 5

Sensitivities
75-79
+682 356648 1
+683 356648 1
+62 356648 2
593 6784569 3
594 6784569 4
3456789 5
3456780 5
3456700 6
34567891 7
34567890 7
34567800 7
34567000 8
4473000 ext 12345 9
4473000 ext 12346 9
4473000 ext 12356 9
Remarks First two digits of the country code are evaluated. All digits of the area code are evaluated. First six digits of a 7-digit base number are evaluated. First six digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 6

Sensitivities
70-74
+682 356648 1
+683 356648 1
+62 356648 2
593 6784569 3
594 6784569 3
580 6784569 4
3456789 5
3456780 5
3456700 5
3456000 6
34567891 7
34567890 7
34567800 7
34567000 8
4473000 ext 12345 9
4473000 ext 12346 9
4473000 ext 12356 9
Remarks First two digits of the country code are evaluated. First two digits of the area code are evaluated. First five digits of a 7-digit base number are evaluated. First six digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 7

Sensitivities
65-69
+682 356648 1
+683 356648 1
+62 356648 1
593 6784569 2
594 6784569 2
580 6784569 2
633 6784569 3
3456789 4
3456780 4
3456700 4
3456000 4
3450000 5
34567891 6
34567890 6
34567800 6
34567000 6
34560000 7
4473000 ext 12345 8
4473000 ext 12346 8
4473000 ext 12356 8
Remarks Country code is not evaluated. First digit of the area code is evaluated. First four digits of a 7-digit base number are evaluated. First five digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 8

Sensitivities
60-64
+682 356648 1
+683 356648 1
+62 356648 1
593 6784569 2
594 6784569 2
580 6784569 2
633 6784569 2
3456789 3
3456780 3
3456700 3
3456000 3
3450000 5
34567891 6
34567890 6
34567800 6
34567000 6
34560000 6
34500000 7
4473000 ext 12345 8
4473000 ext 12346 8
4473000 ext 12356 8
Remarks Country code is not evaluated. Area code is not evaluated. First four digits of a 7-digit base number are evaluated. First four digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 9

Sensitivities
55-59
+682 356648 1
+683 356648 1
+62 356648 1
593 6784569 2
594 6784569 2
580 6784569 2
633 6784569 2
3456789 3
3456780 3
3456700 3
3456000 3
3450000 3
3400000 5
34567891 6
34567890 6
34567800 6
34567000 6
34560000 6
34500000 6
34000000 7
4473000 ext 12345 8
4473000 ext 12346 8
4473000 ext 12356 8
Remarks Country code is not evaluated. Area code is not evaluated. First three digits of a 7-digit base number are evaluated. First three digits of an 8-digit base number are evaluated. Extension is not evaluated.
  Input Cluster ID
Example 10

Sensitivities
50-54
+682 356648 1
+683 356648 1
+62 356648 1
593 6784569 2
594 6784569 2
580 6784569 2
633 6784569 2
3456789 3
3456780 3
3456700 3
3456000 3
3450000 3
3400000 3
3000000 5
34567891 6
34567890 6
34567800 6
34567000 6
34560000 6
34500000 6
34000000 6
30000000 7
4473000 ext 12345 8
4473000 ext 12346 8
4473000 ext 12356 8
Remarks Country code is not evaluated. Area code is not evaluated. First two digits of a 7-digit base number are evaluated. First two digits of an 8-digit base number are evaluated. Extension is not evaluated.
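
The base-number rules stated in the Remarks for Examples 5 through 10 amount to a lookup from sensitivity range to the number of leading digits compared. The following Python sketch is an illustration only: the names BASE_DIGITS and base_number_key are hypothetical, only 7-digit and 8-digit base numbers are modeled, and the result is not the match code that the QKB actually generates. The number length is kept in the key because the examples place 7-digit and 8-digit numbers in separate clusters.

```python
# Illustrative only: leading base-number digits evaluated per sensitivity range,
# transcribed from the Remarks for Examples 5-10. All names are hypothetical.
BASE_DIGITS = [
    # (low, high, digits kept for a 7-digit number, digits kept for an 8-digit number)
    (75, 79, 6, 6),
    (70, 74, 5, 6),
    (65, 69, 4, 5),
    (60, 64, 4, 4),
    (55, 59, 3, 3),
    (50, 54, 2, 2),
]


def base_number_key(base_number: str, sensitivity: int) -> tuple:
    """Return (length, evaluated prefix) for a 7- or 8-digit base number."""
    # Keep ASCII digits only, so "8319 3355" and "83193355" compare equal.
    digits = "".join(ch for ch in base_number if "0" <= ch <= "9")
    if len(digits) not in (7, 8):
        raise ValueError("only 7- and 8-digit base numbers are modeled")
    for low, high, seven, eight in BASE_DIGITS:
        if low <= sensitivity <= high:
            keep = seven if len(digits) == 7 else eight
            return (len(digits), digits[:keep])
    raise ValueError("sensitivity outside the ranges modeled here (50-79)")
```

Two base numbers that produce the same key at a given sensitivity would share a cluster in the tables above, provided that the country code, area code, and extension rules for that range also agree.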

 

Postal Code
Description The Postal Code match definition generates match codes which can be used to cluster records containing postal codes.
Max Length of Match Code 16 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
654321 1
654321 1
CN-654321 1
邮编654321 1
654322 2
Remarks All 6 digits of the domestic postal code are evaluated.
  Input Cluster ID
Example 2

Sensitivities
90-94
654321 1
654321 1
CN-654321 1
邮编654321 1
654322 1
654332 2
Remarks The first 5 digits of the domestic postal code are evaluated.
  Input Cluster ID
Example 3

Sensitivities
80-89
654321 1
654321 1
CN-654321 1
邮编654321 1
654322 1
654332 1
654432 2
Remarks The first 4 digits of the domestic postal code are evaluated.
  Input Cluster ID
Example 4

Sensitivities
70-79
654321 1
654321 1
CN-654321 1
邮编654321 1
654322 1
654332 1
654432 1
655432 2
Remarks The first 3 digits of the domestic postal code are evaluated.
  Input Cluster ID
Example 5

Sensitivities
55-69
654321 1
654321 1
CN-654321 1
邮编654321 1
654322 1
654332 1
654432 1
655432 1
665432 2
Remarks The first 2 digits of the domestic postal code are evaluated.
  Input Cluster ID
Example 6

Sensitivities
50-54
654321 1
654321 1
CN-654321 1
邮编654321 1
654322 1
654332 1
654432 1
655432 1
665432 1
765432 2
Remarks The first digit of the domestic postal code is evaluated.
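
The Remarks for Examples 1 through 6 reduce to a prefix comparison whose length depends on the sensitivity. The Python sketch below illustrates that rule under stated assumptions: the names POSTAL_DIGITS and postal_key are hypothetical, the input is assumed to contain exactly one 6-digit domestic code, and the returned prefix is a stand-in for, not a copy of, the match code that the QKB generates.

```python
import re

# Leading digits of the domestic postal code evaluated per sensitivity range,
# transcribed from the Remarks for Examples 1-6. Names are hypothetical.
POSTAL_DIGITS = [
    (95, 100, 6),
    (90, 94, 5),
    (80, 89, 4),
    (70, 79, 3),
    (55, 69, 2),
    (50, 54, 1),
]


def postal_key(value: str, sensitivity: int) -> str:
    """Return the evaluated prefix of the 6-digit code found in value."""
    # Find a run of exactly six ASCII digits, ignoring prefixes such as "CN-".
    found = re.search(r"(?<![0-9])[0-9]{6}(?![0-9])", value)
    if found is None:
        raise ValueError("no 6-digit domestic postal code found")
    for low, high, keep in POSTAL_DIGITS:
        if low <= sensitivity <= high:
            return found.group(0)[:keep]
    raise ValueError("sensitivity must be between 50 and 100")
```

For instance, 654321 and 654322 yield different keys at sensitivity 95 but the same key at sensitivity 90, which mirrors the cluster IDs in Examples 1 and 2.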

 

State/Province
Description The State/Province match definition generates match codes which can be used to cluster records containing states and provinces.
Max Length of Match Code 40 characters
  Input Cluster ID
Example 1

Sensitivities
95-100
新疆 1
新疆省 1
新疆维吾尔自治区 1
新 疆 1
澳门特别行政区 2
澳门行政区 2
澳门特区 2
2
广西壮族自治区 3
广西省 3
广西 3
广东省 4
广东 4
Remarks The first 8 Chinese characters of the province are evaluated.
  Input Cluster ID
Example 2

Sensitivities
85-94
新疆 1
新疆省 1
新疆维吾尔自治区 1
新 疆 1
澳门特别行政区 2
澳门行政区 2
澳门特区 2
2
广西壮族自治区 3
广西省 3
广西 3
广东省 4
广东 4
Remarks The first 5 Chinese characters of the province are evaluated.
  Input Cluster ID
Example 3

Sensitivities
80-84
新疆 1
新疆省 1
新疆维吾尔自治区 1
新 疆 1
澳门特别行政区 2
澳门行政区 2
澳门特区 2
2
广西壮族自治区 3
广西省 3
广西 3
广东省 4
广东 4
Remarks The first 4 Chinese characters of the province are evaluated.
  Input Cluster ID
Example 4

Sensitivities
70-79
新疆 1
新疆省 1
新疆维吾尔自治区 1
新 疆 1
澳门特别行政区 2
澳门行政区 2
澳门特区 2
2
广西壮族自治区 3
广西省 3
广西 3
广东省 4
广东 4
Remarks The first 3 Chinese characters of the province are evaluated.
  Input Cluster ID
Example 5

Sensitivities
65-69
新疆 1
新疆省 1
新疆维吾尔自治区 1
新 疆 1
澳门特别行政区 2
澳门行政区 2
澳门特区 2
2
广西壮族自治区 3
广西省 3
广西 3
广东省 4
广东 4
Remarks The first 2 Chinese characters of the province are evaluated.
  Input Cluster ID
Example 6

Sensitivities
50-64
新疆 1
新疆省 1
新疆维吾尔自治区 1
新 疆 1
澳门特别行政区 2
澳门行政区 2
澳门特区 2
2
广西壮族自治区 3
广西省 3
广西 3
广东省 3
广东 3
Remarks The first Chinese character of the province is evaluated.

Parse Definitions

Address
Description The Parse definition for Address parses address information.
Output Tokens Street
Block/Lane
Building
Unit
Floor
Room
Additional Info
  Input Output
Example 1 星光大道62号海王星科技大厦A座6楼 Street 星光大道62号
Block/Lane  
Building 海王星科技大厦
Unit A座
Floor 6楼
Room  
Additional Info  
  Input Output
Example 2 佳和国小区24号楼2单元602 Street  
Block/Lane 佳和国小区
Building 24号楼
Unit 2单元
Floor  
Room 602
Additional Info  
  Input Output
Example 3 建设路295号云南天达光伏科技股份有限公司组装车 Street 建设路295号
Block/Lane  
Building  
Unit  
Floor  
Room  
Additional Info 云南天达光伏科技股份有限公司组装车
  Input Output
Example 4 芳园南里西区8号楼C段 Street  
Block/Lane 芳园南里西区
Building 8号楼
Unit C段
Floor  
Room  
Additional Info  
  Input Output
Example 5 东风西路195号广州医学院教学学术交流中心大厦A座101室、202室 Street 东风西路195号
Block/Lane 广州医学院
Building 教学学术交流中心大厦
Unit A座
Floor  
Room 101室、202室
Additional Info  
Remarks  

 

Address (Full)
Description The Parse definition for Address (Full) parses full two-line addresses.
Output Tokens Province
City
District/Prefecture/County
Town/Village
Street
Block/Lane
Building
Unit
Floor
Room
Additional Info
Postal Code
  Input Output
Example 1 北京宣武区宣武门外大街10号庄胜广场北翼19层205 Province  
City 北京
District/Prefecture/County 宣武区
Town/Village  
Street 宣武门外大街10号
Block/Lane 庄胜广场
Building 北翼
Unit  
Floor 19层
Room 205
Additional Info  
Postal Code  
  Input Output
Example 2 北京市门头沟区永定镇侯庄子村76号 Province  
City 北京市
District/Prefecture/County 门头沟区
Town/Village 永定镇侯庄子村76号
Street  
Block/Lane  
Building  
Unit  
Floor  
Room  
Additional Info  
Postal Code  
  Input Output
Example 3 北京市宣武区新安中里5-1-101号 Province  
City 北京市
District/Prefecture/County 宣武区
Town/Village  
Street  
Block/Lane 新安中里
Building  
Unit  
Floor  
Room 5-1-101号
Additional Info  
Postal Code  
  Input Output
Example 4 深圳龙岗区雅豪苑6栋3单元301号 Province  
City 深圳
District/Prefecture/County 龙岗区
Town/Village  
Street  
Block/Lane 雅豪苑
Building 6栋
Unit 3单元
Floor  
Room 301号
Additional Info  
Postal Code  
  Input Output
Example 5 邮政编码:014000 包头市九原区哈林格尔镇(滨河路河西生态化工基地1号) Province  
City 包头市
District/Prefecture/County 九原区
Town/Village 哈林格尔镇
Street  
Block/Lane  
Building  
Unit  
Floor  
Room  
Additional Info (滨河路河西生态化工基地1号)
Postal Code 邮政编码:014000
  Input Output
Example 6 北京市东城区安定门西滨河路22号(神华大厦)五、六层100000 Province  
City 北京市
District/Prefecture/County 东城区
Town/Village  
Street 安定门西滨河路22号
Block/Lane  
Building  
Unit  
Floor 五、六层
Room  
Additional Info (神华大厦)
Postal Code 100000
  Input Output
Example 7 北京市西城区复兴门内大街28号凯晨世贸中心中座8层 邮编:100000 Province  
City 北京市
District/Prefecture/County 西城区
Town/Village  
Street 复兴门内大街28号
Block/Lane 凯晨世贸中心
Building  
Unit 中座
Floor 8层
Room  
Additional Info  
Postal Code 邮编:100000
  Input Output
Example 8 P.C. (100000) 宣武区鸭子桥路24号510室 Province  
City  
District/Prefecture/County 宣武区
Town/Village  
Street 鸭子桥路24号
Block/Lane  
Building  
Unit  
Floor  
Room 510室
Additional Info  
Postal Code P.C. (100000)
Remarks Province, city, and district/prefecture/county names are recognized with or without an indicator keyword. Postal codes can only be recognized as 6-digit numeric strings at the beginning or end of the full address.

 

Address (Global)
Description

The Address (Global) parse definition parses addresses into a globally recognized set of tokens.

Output Tokens Recipient
Building/Site
Street
Extension
PO Box
Additional Info
  Input Output
Example 1 星光大道62号海王星科技大厦A座6楼 Recipient  
Building/Site 海王星科技大厦A座
Street 星光大道62号
Extension 6楼
PO Box  
Additional Info  
  Input Output
Example 2 佳和国小区24号楼2单元602 Recipient  
Building/Site 24号楼2单元
Street 佳和国小区
Extension 602
PO Box  
Additional Info  
  Input Output
Example 3 建设路295号云南天达光伏科技股份有限公司组装车 Recipient  
Building/Site  
Street 建设路295号
Extension  
PO Box  
Additional Info 云南天达光伏科技股份有限公司组装车
  Input Output
Example 4 芳园南里西区8号楼C段 Recipient  
Building/Site 8号楼C段
Street 芳园南里西区
Extension  
PO Box  
Additional Info  
  Input Output
Example 5 东风西路195号广州医学院教学学术交流中心大厦A座101室、202室 Recipient  
Building/Site 教学学术交流中心大厦A座
Street 东风西路195号广州医学院
Extension 101室、202室
PO Box  
Additional Info  
Remarks Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales.

The Address (Global) (v23) parse definition is now deprecated and will be removed in a future release of the QKB.

The Address (Global) parse definition has been replaced with a copy of the Address (Global) (v23) definition, which takes advantage of the new tokens and updated processing. If you changed your jobs to use Address (Global) (v23), change them back to Address (Global).

 

Address (Global) (v23)
Description

The Address (Global) (v23) parse definition parses addresses into a globally recognized set of tokens.

Output Tokens Recipient
Building/Site
Street
Extension
PO Box
Additional Info
  Input Output
Example 1 星光大道62号海王星科技大厦A座6楼 Recipient  
Building/Site 海王星科技大厦A座
Street 星光大道62号
Extension 6楼
PO Box  
Additional Info  
  Input Output
Example 2 佳和国小区24号楼2单元602 Recipient  
Building/Site 24号楼2单元
Street 佳和国小区
Extension 602
PO Box  
Additional Info  
  Input Output
Example 3 建设路295号云南天达光伏科技股份有限公司组装车 Recipient  
Building/Site  
Street 建设路295号
Extension  
PO Box  
Additional Info 云南天达光伏科技股份有限公司组装车
  Input Output
Example 4 芳园南里西区8号楼C段 Recipient  
Building/Site 8号楼C段
Street 芳园南里西区
Extension  
PO Box  
Additional Info  
  Input Output
Example 5 东风西路195号广州医学院教学学术交流中心大厦A座101室、202室 Recipient  
Building/Site 教学学术交流中心大厦A座
Street 东风西路195号广州医学院
Extension 101室、202室
PO Box  
Additional Info  
Remarks Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales.

The Address (Global) (v23) parse definition is now deprecated and will be removed in a future release of the QKB.

The Address (Global) parse definition has been replaced with a copy of the Address (Global) (v23) definition, which takes advantage of the new tokens and updated processing. If you changed your jobs to use Address (Global) (v23), change them back to Address (Global).

 

City
Description The Parse definition for City parses city and district/prefecture/county names.
Output Tokens City
District/Prefecture/County
Additional Info
  Input Output
Example 1 北京市昌平区* City 北京市
District/Prefecture/County 昌平区
Additional Info  
  Input Output
Example 2 深圳市宝安区3区 City 深圳市
District/Prefecture/County 宝安区3区
Additional Info  
  Input Output
Example 3 北京市(密云县) City 北京市
District/Prefecture/County (密云县)
Additional Info  
  Input Output
Example 4 北京市宣武区南部,西部 City 北京市
District/Prefecture/County 宣武区
Additional Info 南部,西部
Remarks Recognizes city names with or without identifier keywords ("市").

 

City - State/Province - Postal Code
Description The Parse definition for City - State/Province - Postal Code parses address last line data, which typically includes province, city, and postal code information.
Output Tokens City
State/Province
Additional Info
Postal Code
  Input Output
Example 1 北京市100020* City 北京市
State/Province  
Additional Info  
Postal Code 100020
  Input Output
Example 2 江苏扬州(仪征市区西)678300 City 扬州
State/Province 江苏
Additional Info (仪征市区西)
Postal Code 678300
  Input Output
Example 3 邮编:231300 遵义市 City 遵义市
State/Province  
Additional Info  
Postal Code 邮编:231300
  Input Output
Example 4 淄博市 邮政编码242200 City 淄博市
State/Province  
Additional Info  
Postal Code 邮政编码242200
Remarks  

 

City - State/Province - Postal Code (Global)
Description The Parse definition for City - State/Province - Postal Code (Global) parses address last line data into a globally recognized set of tokens.
Output Tokens City
State/Province
Postal Code
Additional Info
  Input Output
Example 1 北京市100020* City 北京市
State/Province  
Postal Code 100020
Additional Info  
  Input Output
Example 2 江苏扬州(仪征市区西)678300 City 扬州
State/Province 江苏
Postal Code 678300
Additional Info (仪征市区西)
  Input Output
Example 3 邮编:231300 遵义市 City 遵义市
State/Province  
Postal Code 邮编:231300
Additional Info  
Remarks Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales.

 

Date
Description The Parse definition for Date parses date information.
Output Tokens Year
Month
Day
  Input Output
Example 1 2009/10/21 Year 2009
Month 10
Day 21
  Input Output
Example 2 二零零九年十月二十一日 Year 二零零九年
Month 十月
Day 二十一日
  Input Output
Example 3 14Mar, 2001 Year 2001
Month Mar
Day 14
  Input Output
Example 4 20091021 Year 2009
Month 10
Day 21
Remarks  

 

ID Number
Description The Parse definition for ID Number parses ID number information.
Output Tokens Province Code
City/Prefecture Code
District/County Code
Birth Year
Birth Month
Birth Day
Sequence Code
Validation Code
  Input Output
Example 1 130503196704010012 Province Code 13
City/Prefecture Code 05
District/County Code 03
Birth Year 1967
Birth Month 04
Birth Day 01
Sequence Code 001
Validation Code 2
  Input Output
Example 2 130503670401001 Province Code 13
City/Prefecture Code 05
District/County Code 03
Birth Year 67
Birth Month 04
Birth Day 01
Sequence Code 001
Validation Code  
Remarks  
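
The two examples above imply fixed character positions for each token. The Python sketch below slices an ID number at those positions. It is an illustration only: the function name parse_id_number is hypothetical, the offsets are inferred from Examples 1 and 2 rather than taken from the definition itself, and no validation of the codes or of the validation digit is attempted.

```python
def parse_id_number(id_number: str) -> dict:
    """Split an 18- or 15-character ID number into the tokens listed above."""
    s = id_number.strip()
    if len(s) == 18:
        year_len, has_validation = 4, True    # four-digit birth year plus validation code
    elif len(s) == 15:
        year_len, has_validation = 2, False   # two-digit birth year, no validation code
    else:
        raise ValueError("expected an 18- or 15-character ID number")
    pos = 6 + year_len  # index of the first birth-month character
    return {
        "Province Code": s[0:2],
        "City/Prefecture Code": s[2:4],
        "District/County Code": s[4:6],
        "Birth Year": s[6:pos],
        "Birth Month": s[pos:pos + 2],
        "Birth Day": s[pos + 2:pos + 4],
        "Sequence Code": s[pos + 4:pos + 7],
        "Validation Code": s[pos + 7:] if has_validation else "",
    }
```

Applied to the inputs of Examples 1 and 2, the sketch returns the same token values as the tables above.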

 

Name
Description The Parse definition for Name parses names of individuals.
Output Tokens Family Name
Given Name
Suffix
Title/Additional Info
  Input Output
Example 1 陈胜华 Family Name 陈
Given Name 胜华
Suffix  
Title/Additional Info  
  Input Output
Example 2 李大伟,博士(中国区总裁) Family Name 李
Given Name 大伟
Suffix  
Title/Additional Info 博士(中国区总裁)
  Input Output
Example 3 司徒怀,先生(中国区总裁) Family Name 司徒
Given Name 怀
Suffix 先生
Title/Additional Info (中国区总裁)
Remarks  

 

Name (Global)
Description The Parse definition for Name (Global) parses names of individuals into a globally recognized set of tokens.
Output Tokens Prefix
Given Name
Middle Name
Family Name
Suffix
Title/Additional Info
  Input Output
Example 陈胜华 Prefix  
Given Name 胜华
Middle Name  
Family Name 陈
Suffix  
Title/Additional Info  
Remarks Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales.

 

Organization
Description The Parse definition for Organization parses organization names.
Output Tokens Name
Legal Form
Site
Additional Info
  Input Output
Example 1 无锡市城市环境卫生有限公司 Name 城市环境卫生
Legal Form 有限公司
Site 无锡市
Additional Info  
  Input Output
Example 2 国华(呼伦贝尔)风电有限公司南京办事处 Name 国华(呼伦贝尔)风电
Legal Form 有限公司
Site 南京办事处
Additional Info  
  Input Output
Example 3 国华(呼伦贝尔)风电有限公司开发一处 Name 国华 风电
Legal Form 有限公司
Site (呼伦贝尔)
Additional Info 开发一处
  Input Output
Example 4 香港华艺设计顾问(深圳)有限公司 Name 香港华艺设计顾问
Legal Form 有限公司
Site (深圳)
Additional Info  
  Input Output
Example 5 神华集团包头矿业有限责任公司运销处集装站 Name 神华集团包头矿业
Legal Form 有限责任公司
Site  
Additional Info 运销处集装站
  Input Output
Example 6 北京中铁特货冷藏物流有限公司(已于2009年1月8日撤消) Name 中铁特货冷藏物流
Legal Form 有限公司
Site 北京
Additional Info (已于2009年1月8日撤消)
  Input Output
Example 7 北京大学计算机学院 Name 北京大学
Legal Form  
Site  
Additional Info 计算机学院
  Input Output
Example 8 赛仕软件研究开发(北京)有限公司(收) Name 赛仕软件研究开发
Legal Form 有限公司
Site (北京)
Additional Info  
Remarks  

 

Organization (Global)
Description The Parse definition for Organization (Global) parses organization names into a globally recognized set of tokens.
Output Tokens Name
Legal Form
Site
Additional Info
  Input Output
Example 1 无锡市城市环境卫生有限公司 Name 城市环境卫生
Legal Form 有限公司
Site 无锡市
Additional Info  
  Input Output
Example 2 长安汽车(集团)有限责任公司北京分公司 Name 长安汽车(集团)
Legal Form 有限责任公司
Site 北京分公司
Additional Info  
  Input Output
Example 3 国华(呼伦贝尔)风电有限公司南京办事处 Name 国华(呼伦贝尔)风电
Legal Form 有限公司
Site 南京办事处
Additional Info  
  Input Output
Example 4 国华(呼伦贝尔)风电有限公司开发一处 Name 国华 风电
Legal Form 有限公司
Site (呼伦贝尔)
Additional Info 开发一处
  Input Output
Example 5 香港华艺设计顾问(深圳)有限公司 Name 香港华艺设计顾问
Legal Form 有限公司
Site (深圳)
Additional Info  
  Input Output
Example 6 神华集团包头矿业有限责任公司运销处集装站 Name 神华集团包头矿业
Legal Form 有限责任公司
Site  
Additional Info 运销处集装站
  Input Output
Example 7 北京中铁特货冷藏物流有限公司(已于2009年1月8日撤消) Name 中铁特货冷藏物流
Legal Form 有限公司
Site 北京
Additional Info (已于2009年1月8日撤消)
  Input Output
Example 8 北京大学计算机学院 Name 北京大学
Legal Form  
Site  
Additional Info 计算机学院
  Input Output
Example 9 赛仕软件研究开发(北京)有限公司(收) Name 赛仕软件研究开发
Legal Form 有限公司
Site (北京)
Additional Info  
Remarks

Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales.

 

Phone
Description The Parse definition for Phone parses phone numbers into a set of tokens.
Output Tokens Country Code
Area Code
Base Number
Extension
Line Type
Additional Info
  Input Output
Example 1 Tel(+86)10 8319 3355-3636 办公电话 Country Code 86
Area Code 10
Base Number 8319 3355
Extension 3636
Line Type Tel
Additional Info 办公电话
  Input Output
Example 2 (+86)10 8319 3355-3636 办公电话 Country Code 86
Area Code 10
Base Number 8319 3355
Extension 3636
Line Type 办公电话
Additional Info  
  Input Output
Example 3 TEL(0319)7456537 Country Code  
Area Code 0319
Base Number 7456537
Extension  
Line Type TEL
Additional Info  
  Input Output
Example 4 手机13412345678 Country Code  
Area Code 134
Base Number 12345678
Extension  
Line Type 手机
Additional Info  
  Input Output
Example 5 +1 919-447-3000 Country Code 1
Area Code  
Base Number 919-447-3000
Extension  
Line Type  
Additional Info  
Remarks Mobile vendor ID is parsed into the Area Code token.

 

Phone (Global)
Description The Parse definition for Phone (Global) parses phone numbers into a globally recognized set of tokens.
Output Tokens Country Code
Area Code
Base Number
Extension
Line Type
Additional Info
  Input Output
Example 1 Tel(+86)10 8319 3355-3636 办公电话 Country Code +86
Area Code 10
Base Number 8319 3355
Extension 3636
Line Type Tel
Additional Info 办公电话
  Input Output
Example 2 (+86)10 8319 3355-3636 办公电话 Country Code +86
Area Code 10
Base Number 8319 3355
Extension 3636
Line Type 办公电话
Additional Info  
  Input Output
Example 3 TEL(0319)7456537 Country Code  
Area Code 0319
Base Number 7456537
Extension  
Line Type TEL
Additional Info  
  Input Output
Example 4 手机13412345678 Country Code  
Area Code 134
Base Number 12345678
Extension  
Line Type 手机
Additional Info  
  Input Output
Example 5 +1 919-447-3000 Country Code +1
Area Code  
Base Number 919-447-3000
Extension  
Line Type  
Additional Info  
Remarks Mobile vendor ID is parsed into the Area Code token.

Parse definitions named with the Global keyword use a set of output tokens that is consistent across every locale. Results obtained from these definitions can be stored in the same database fields as the results obtained from definitions of the same name in other locales.

Pattern Analysis Definitions

None.

Standardization Definitions

Address
Description Standardizes address information.
  Input Output
Examples "青年大道3号" 青年大道3号
临江路2号(工商银行六楼) 临江路2号 工商银行六楼
凯旋路451号1楼 , 4楼 凯旋路451号1层, 4层
益田新村106栋22G 益田新村106栋22G
福中路15号大院肆层4-803 福中路15号大院4层4-803
黄兴路2005弄1号23号楼a座 黄兴路2005弄1号23号楼A座
Remarks Floor identifier is standardized to "层". Full-width alphanumeric characters are converted to half-width characters. Non-logical characters are removed: quotes, blanks, and so on. Chinese numerals within unit, floor, and room information are converted to Arabic numerals. All English letters are converted to uppercase.

 

Address (Full)
Description Standardizes full two-line addresses.
  Input Output
Examples 广东深圳福田益田村 广东省深圳市福田区益田村
杭州市凯旋路451号1楼 , 4楼 杭州市凯旋路451号1层, 4层
广东省深圳市福田区益田村106栋22G 广东省深圳市福田区益田村106栋22G
广东省深圳市福田区福中路15号大院肆层4-803 广东省深圳市福田区福中路15号大院4层4-803
邮政编码:014000 包头市九原区哈林格尔镇(滨河路河西生态化工基地1号) 包头市九原区哈林格尔镇 滨河路河西生态化工基地1号 014000
北京市西城区复兴门内大街28号凯晨世贸中心中座8层 邮编:100000 北京市西城区复兴门内大街28号凯晨世贸中心中座8层 100000
P.C. (100000) 宣武区鸭子桥路24号510室 宣武区鸭子桥路24号510室 100000
Remarks Province, city, and district/prefecture/county identifier keywords are added when possible. Floor identifier is standardized to "层". Full-width alphanumeric characters are converted to half-width characters. Non-logical characters are removed: quotes, blanks, and so on. Chinese numerals within unit, floor, and room information are converted to Arabic numerals. All English letters are converted to uppercase.

 

City
Description Standardizes city and district/prefecture/county names.
  Input Output
Examples 北京市(密云县) 北京市密云县
北京市宣武区南部,西部 北京市宣武区, 南部, 西部
深圳盐田 深圳市盐田区
北京市密云县南行10公里 北京市密云县, 南行10公里
Remarks Adds city identifier keywords when possible. Removes non-logical characters: quotes, blanks, and so on.

 

City - State/Province - Postal Code
Description Standardizes address "last line" data, which typically includes province, city and postal code information.
  Input Output
Examples 北京市100052 北京市 100052
福建省泉州市572000 福建省泉州市 572000
北京市CN-100052 北京市 100052
邮政编码242200 安徽省蚌埠市 安徽省蚌埠市 242200
安徽安庆潜山(邮编)242500 安徽省安庆市 (潜山) 242500
Remarks Adds province and city identifier keywords when possible. Removes non-logical characters: quotes, blanks, and so on.

 

Date (Chinese Calendar)
Description Standardizes date expressions to Chinese calendar format.
  Input Output Explanation
Examples 2009/10/21 2009年10月21日 Standardize calendar identifier to YYYY年MM月DD日 format.
二零零九年十月二十一日 2009年10月21日  
14-Mar-01 2001年03月14日 Standardize month name. When the day and year are ambiguous, consider the last number to be the year.
２００９／１０／２１ 2009年10月21日 Convert full-width to half-width.
20091021 2009年10月21日 8 digits are considered to be YYYYMMDD format.
Remarks Supports dates from 1901 to 2050. Assumes two-digit years 00-29 are 2000-2029. Assumes two-digit years 30-99 are 1930-1999.
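
The two-digit-year rule and the supported range in the Remarks can be sketched as follows. This is a minimal Python illustration, not the definition's implementation: the function names are hypothetical, and treating 1901 and 2050 as inclusive bounds on the year is an assumption.

```python
from datetime import date


def expand_two_digit_year(yy: int) -> int:
    """Apply the pivot from the Remarks: 00-29 -> 2000-2029, 30-99 -> 1930-1999."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 2000 + yy if yy <= 29 else 1900 + yy


def to_chinese_calendar(year: int, month: int, day: int) -> str:
    """Format a date as YYYY年MM月DD日, rejecting years outside 1901-2050."""
    if not 1901 <= year <= 2050:  # assumption: both bounds are inclusive
        raise ValueError("year outside the supported range")
    checked = date(year, month, day)  # raises ValueError for impossible dates
    return f"{checked.year}年{checked.month:02d}月{checked.day:02d}日"
```

For the input 14-Mar-01, the last number is taken as the year, so expand_two_digit_year(1) gives 2001 and the formatted result is 2001年03月14日, as in the table above.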

 

Date (Western Calendar)
Description Standardizes date expressions to Western calendar format.
  Input Output Explanation
Examples (2009/10/21) 2009/10/21 Standardize calendar identifier to YYYY/MM/DD format.
二零零九年十月二十一日 2009/10/21  
14-Mar-01 2001/03/14 Standardize month name. When the day and year are ambiguous, consider the last number to be the year.
２００９／１０／２１ 2009/10/21 Convert full-width to half-width.
20091021 2009/10/21 8 digits are considered to be YYYYMMDD format.
Remarks Supports dates from 1901 to 2050. Assumes two-digit years 00-29 are 2000-2029. Assumes two-digit years 30-99 are 1930-1999.

 

ID Number
Description Standardizes ID numbers.
  Input Output
Examples 130503196704010012 130503196704010012
(130503196704010012) 130503196704010012
Remarks  

 

Name
Description Standardizes names of individuals.
  Input Output
Examples 李大伟先生 李大伟 先生
“刘丽” 刘丽
司徒怀,先生(中国区总裁) 司徒怀 先生 中国区总裁
Remarks  

 

Organization
Description Standardizes organization names.
  Input Output
Examples 碧丽服装有限公司 碧丽服装 有限责任公司
香港华艺设计顾问(深圳)有限公司 香港华艺设计顾问 有限责任公司, 深圳
DATAFLUX, INC DataFlux Inc
中国石化集团洛阳石油化工工程公司 中国石油化工集团, 洛阳, 石油化工工程公司
上海mwb互感器有限公司 MWB互感器 有限责任公司, 上海
碧丽服装(北京)有限公司上海分公司 碧丽服装(北京) 有限责任公司, 上海分公司
Remarks Full-width ASCII characters are transformed to half-width.

 

Phone
Description Standardizes phone numbers for domestic use.
  Input Output
Examples 采购部电话010-12345678 (010) 12345678, 采购部电话
82741510 转 345 82741510 x345
+86 03197456537 (0319) 7456537
13512345678 135 12345678
1082741510 (010) 82741510
0044 (0)20 12345000 +44 2012345000
Remarks  

 

Phone (Electronic)
Description Standardizes phone numbers for automated calling systems.
  Input Output
Example 采购部电话010-12345678 内線123 +861012345678
Remarks  

 

Phone (with Country Code)
Description Standardizes phone numbers for international use.
  Input Output
Example 采购部电话010-12345678 +86 10 12345678, 采购部电话
Remarks  

 

Postal Code
Description Standardizes postal codes.
  Input Output
Examples 邮编242500 242500
242-500 242500
CN-100052 100052
FR 12345 FR-12345
37100 037100
Remarks Identifies domestic postal code patterns with potentially missing leading zeroes and adds them to the input, as in the final example.
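
The domestic cases in the examples above (keyword prefix, embedded hyphen, CN prefix, missing leading zero) can be approximated with a few string operations. The Python sketch below is illustrative only: the function name is hypothetical, and codes for other countries, such as FR 12345, are deliberately out of scope and rejected instead of being reformatted.

```python
import re


def standardize_domestic_postal_code(value: str) -> str:
    """Return a 6-digit domestic postal code, restoring one missing leading zero."""
    text = value.strip()
    # Drop an optional "CN" country prefix such as "CN-" or "CN ".
    text = re.sub(r"^CN[\s-]*", "", text, flags=re.IGNORECASE)
    if re.search(r"[A-Za-z]", text):
        raise ValueError("foreign postal codes are not modeled in this sketch")
    # Remove keywords, hyphens, and blanks; keep ASCII digits only.
    digits = re.sub(r"[^0-9]", "", text)
    if len(digits) == 5:
        digits = "0" + digits  # assume a single leading zero was lost
    if len(digits) != 6:
        raise ValueError("not a recognizable domestic postal code")
    return digits
```

Prefixing the result with "CN-" gives the form shown for domestic codes under Postal Code (with Country Code).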

 

Postal Code (with Country Code)
Description Standardizes postal codes for international use.
  Input Output
Examples 邮编242500 CN-242500
242-500 CN-242500
CN 100052 CN-100052
FR 12345 FR-12345
37100 CN-037100
Remarks Identifies domestic postal code patterns with potentially missing leading zeroes and adds them to the input, as in the final example. Uses international formatting, with no spaces.

 

State/Province
Description Standardizes province information.
  Input Output
Examples 内蒙古 内蒙古自治区
"浙江省" 浙江省
江西 江西省
香港 香港特别行政区
山东省
Remarks Adds province identifier keywords when possible. Converts aliases to full names. Removes non-logical characters: quotes, blanks, and so on.

Inherited Definitions

In addition to the definitions listed on this page, the Chinese, China locale also inherits all definitions for the Chinese language and all Global definitions.