case definition
an algorithm used to change the case of an input data string, accounting for unique values that need to be cased according to context, such as abbreviations and business names.
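As a rough illustration only (not the QKB's actual algorithm), context-aware casing can be sketched as proper-casing every word except those found in an exception list. The `CASE_EXCEPTIONS` table and the `proper_case` function are hypothetical names introduced for this sketch.

```python
# Hypothetical exception list: values that must keep a fixed casing.
CASE_EXCEPTIONS = {"ibm": "IBM", "llc": "LLC", "of": "of"}


def proper_case(text: str) -> str:
    """Capitalize each word unless it appears in the exception list."""
    words = []
    for word in text.split():
        key = word.lower()
        # Use the exception's casing when one exists, else capitalize.
        words.append(CASE_EXCEPTIONS.get(key, key.capitalize()))
    return " ".join(words)


print(proper_case("ibm global services llc"))
print(proper_case("BANK OF AMERICA"))
```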
chop table
a proprietary file type in the Quality Knowledge Base. A chop table contains lexing rules used to separate characters in an input data string into semantically significant substrings.
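The chop table file format is proprietary and is not reproduced here. The sketch below only illustrates the general idea of lexing rules, assuming three hypothetical character actions: keep the character in the current substring, treat it as a separator, or emit it as a substring of its own.

```python
def classify(ch: str) -> str:
    """Hypothetical lexing rule: decide how one character is treated."""
    if ch.isalnum():
        return "keep"        # stays part of the current substring
    if ch.isspace():
        return "separator"   # ends the current substring and is dropped
    return "standalone"      # punctuation becomes its own substring


def chop(text: str) -> list[str]:
    """Split text into substrings according to the rules in classify()."""
    chunks, current = [], ""
    for ch in text:
        action = classify(ch)
        if action == "keep":
            current += ch
            continue
        if current:
            chunks.append(current)
            current = ""
        if action == "standalone":
            chunks.append(ch)
    if current:
        chunks.append(current)
    return chunks


print(chop("123 Main St., Apt-4"))
```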
data type
the semantic nature of a data string. The data type provides a context that determines which set of logic is applied to a data string when performing data cleansing operations. Example data types are Name, Address, and Phone Number.
definition
an algorithm in the Quality Knowledge Base. The definitions in the Quality Knowledge Base are the data management algorithms that are available for use in other SAS applications such as DataFlux Data Management Studio or SAS Data Quality Server.
extraction definition
an algorithm used to extract attributes from an input data string.
gender analysis definition
an algorithm used to determine the probable gender of a name or identity-type input string.
grammar
a file that contains a set of rules that represent expected patterns of words or characters in a given context. The parsing operation uses a vocabulary to identify the basic categories for each word or character. The patterns constructed from the categories of those words or characters are then compared with the rules in the grammar. If a rule that captures these patterns is found, a solution is produced. Rules in a grammar contain patterns that consist of one or more symbols called categories. Categories can be either derived or basic. A derived category is composed of basic categories or other derived categories. Basic categories cannot be defined further. They are the atomic categories assigned to words or characters in the vocabulary.
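The relationship between a vocabulary, basic categories, and derived categories can be sketched as follows. This is a toy model under stated assumptions, not the QKB grammar format: the category names (`PFX`, `GIVEN`, `FAMILY`, `PERSON`, `NAME`), the `VOCABULARY` and `GRAMMAR` tables, and the function names are all hypothetical, and the grammar is assumed to contain no recursive rules.

```python
# Hypothetical vocabulary: each word is assigned a basic category.
VOCABULARY = {"mr": "PFX", "dr": "PFX", "bob": "GIVEN", "brauer": "FAMILY"}

# Hypothetical grammar: each derived category lists the patterns that
# produce it. Any symbol that is not a key here is a basic category.
GRAMMAR = {
    "NAME": [["PFX", "PERSON"], ["PERSON"]],
    "PERSON": [["GIVEN", "FAMILY"]],
}


def categorize(text: str) -> list[str]:
    """Look up the basic category of each word in the vocabulary."""
    words = text.replace(".", "").lower().split()
    return [VOCABULARY.get(word, "UNKNOWN") for word in words]


def derives(category: str, cats: list[str]) -> bool:
    """Return True if `category` can produce exactly the sequence `cats`."""
    if category not in GRAMMAR:  # basic category: must match one-to-one
        return cats == [category]
    return any(matches(pattern, cats) for pattern in GRAMMAR[category])


def matches(pattern: list[str], cats: list[str]) -> bool:
    """Return True if the pattern's symbols, in order, cover all of `cats`."""
    if not pattern:
        return not cats
    head, rest = pattern[0], pattern[1:]
    # Try every split: the first i categories go to `head`, the rest to `rest`.
    return any(
        derives(head, cats[:i]) and matches(rest, cats[i:])
        for i in range(1, len(cats) + 1)
    )


print(derives("NAME", categorize("Mr. Bob Brauer")))
print(derives("NAME", categorize("Brauer Bob")))
```

Here `PERSON` is a derived category built from basic categories, and `NAME` is a derived category built from a basic category plus another derived category.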
identification analysis definition
an algorithm used to identify an input string as a member of a value group or category.
locale
a set of definitions that support data quality operations for a specific country and language combination. For example, Canada has two locales, English, Canada (ENCAN) and French, Canada (FRCAN), which work together to handle both English and French Canadian data.
locale guessing
a process that attempts to identify the country of origin of a particular piece of data based on an address, a country code, or some other field.
match code
the end result of passing data through a match definition. A match code is a normalized, encrypted string that represents the portions of a data string that are considered significant with regard to the semantic identity of the data. Two data strings are said to "match" if the same match code is generated for each of them.
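The idea that two strings match when they yield the same code can be illustrated with a small sketch. The real match code algorithm is not described in this glossary, so everything here is an assumption: the `NOISE_WORDS` set, the normalization steps, and the use of a truncated SHA-1 digest as a stand-in for the encoding step are all hypothetical.

```python
import hashlib
import re

# Hypothetical words treated as insignificant to an organization's identity.
NOISE_WORDS = {"the", "inc", "corp", "corporation", "co"}


def match_code(text: str) -> str:
    """Normalize the string, keep the significant words, then encode them."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    significant = [w for w in cleaned.split() if w not in NOISE_WORDS]
    normalized = " ".join(significant)
    # A truncated hash stands in for the encoding step (an assumption).
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


# Same significant content, so the same code is generated.
print(match_code("Acme Corp.") == match_code("The ACME Corporation"))
# Different significant content, so the codes differ.
print(match_code("Acme Corp.") == match_code("Apex Corp."))
```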
match definition
an algorithm used to generate a match code for a data string of a specific data type.
parse
the process of dividing a data string into a set of token values. For example: Mr. Bob Brauer [Mr. = Prefix, Bob = Given Name, Brauer = Family Name]
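The example above can be reproduced with a deliberately simple sketch. It assumes a hypothetical `PREFIXES` list and a positional rule (last word is the family name), which is far cruder than a vocabulary-and-grammar based parse.

```python
# Hypothetical list of recognized name prefixes.
PREFIXES = {"mr", "mrs", "ms", "dr"}


def parse_name(text: str) -> dict[str, str]:
    """Divide a name string into Prefix, Given Name, and Family Name tokens."""
    words = text.split()
    tokens = {"Prefix": "", "Given Name": "", "Family Name": ""}
    if words and words[0].rstrip(".").lower() in PREFIXES:
        tokens["Prefix"] = words.pop(0)
    if words:
        tokens["Family Name"] = words.pop()  # assume the last word is the family name
    tokens["Given Name"] = " ".join(words)   # whatever remains in the middle
    return tokens


print(parse_name("Mr. Bob Brauer"))
```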
parse definition
a name for a context-specific parsing algorithm. A parse definition determines the names and contents of the substrings that will hold the results of a parse operation.
pattern analysis definition
a regular expression library that forms the basis of a pattern recognition algorithm.
phonetics
an algorithm applied to a data string to reduce it to a value that will match other data strings with similar pronunciations.
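The glossary does not say which phonetic algorithm the QKB uses. As a well-known stand-in, the sketch below implements American Soundex, which reduces a name to a letter followed by three digits so that similarly pronounced names receive the same value.

```python
# Soundex digit for each consonant group.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex value of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (result + "000")[:4]  # pad with zeros or truncate to 4 characters


print(soundex("Smith"), soundex("Smyth"))
print(soundex("Robert"), soundex("Rupert"))
```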
Quality Knowledge Base
a collection of files that store the data and logic that define data management operations such as parsing, standardization, and matching. SAS software products reference the SAS Quality Knowledge Base (QKB) when performing data management operations on your data.
regular expression
a language composed of symbols and operators that enables you to express how a computer application should search for a specified pattern in text. A pattern can then be replaced with another pattern, also described using the regular expression language.
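A short search-and-replace example using Python's standard `re` module shows both halves of this definition: one pattern describes what to find, and a second pattern describes the replacement. The phone number layout and the `reformat_phone` function are illustrative assumptions, and regular expression syntax differs somewhat between tools.

```python
import re

# Search pattern: "(ddd) ddd-dddd", capturing the three digit groups.
PHONE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")


def reformat_phone(text: str) -> str:
    """Rewrite every matched phone number as ddd-ddd-dddd."""
    # \1, \2, \3 in the replacement refer to the captured groups.
    return PHONE.sub(r"\1-\2-\3", text)


print(reformat_phone("Call (919) 555-0100 today"))
```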
sensitivity
in matching procedures, the relative tightness or looseness of the expected match results. A higher sensitivity setting indicates that you want the values in your match results to be very similar to each other. A lower sensitivity setting indicates that you want the match results to be "fuzzier" in nature.
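One way to picture the effect of sensitivity is a toy key that retains fewer characters as sensitivity drops, so that more strings collapse to the same key. This is only an illustration: the 0 to 100 scale, the `match_key` function, and the truncation rule are assumptions made for this sketch and do not describe how match definitions actually apply sensitivity.

```python
def match_key(text: str, sensitivity: int) -> str:
    """Toy key: keep fewer leading letters as sensitivity decreases."""
    letters = "".join(c for c in text.lower() if c.isalpha())
    keep = max(1, sensitivity // 10)  # e.g. 90 keeps 9 letters, 50 keeps 5
    return letters[:keep]


# High sensitivity: the two spellings produce different keys.
print(match_key("Jonathan", 90) == match_key("Jonathon", 90))
# Low sensitivity: the keys coincide, so the result is "fuzzier".
print(match_key("Jonathan", 50) == match_key("Jonathon", 50))
```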
standardization definition
an algorithm used to standardize an input data string.
standardization scheme
a collection of transformation rules that typically apply to one subject area such as company name standardization or province code standardization.
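Conceptually, a scheme for one subject area behaves like a lookup table from observed values to a standard value. The sketch below assumes a hypothetical `PROVINCE_SCHEME` table and a `standardize` helper; it says nothing about how scheme files are actually stored.

```python
# Hypothetical scheme: observed spellings mapped to a standard province code.
PROVINCE_SCHEME = {
    "ontario": "ON",
    "ont": "ON",
    "ont.": "ON",
    "quebec": "QC",
    "québec": "QC",
    "british columbia": "BC",
    "b.c.": "BC",
}


def standardize(value: str, scheme: dict[str, str]) -> str:
    """Return the scheme's standard value, or the input if no rule applies."""
    key = " ".join(value.lower().split())  # ignore case and extra whitespace
    return scheme.get(key, value)


print(standardize("  Ontario ", PROVINCE_SCHEME))
print(standardize("British   Columbia", PROVINCE_SCHEME))
print(standardize("Yukon", PROVINCE_SCHEME))
```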
token
an output substring produced by an extraction or parse definition. A token contains a word or group of words that have a semantically atomic meaning in an input data string. A set of expected tokens is defined for each data type.
vocabulary
a proprietary file type in the Quality Knowledge Base. A vocabulary is a lexicon of words used for categorizing data for context-specific look-ups.