an algorithm used to change the case of an input data string, accounting for unique values that need to be cased according to context, such as abbreviations and business names.
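As a rough illustration of the idea (not the QKB's actual algorithm), a context-aware casing step might propercase each word and then override known exceptions such as abbreviations and business names. The exception table below is hypothetical.

```python
# Minimal sketch of context-aware casing; the exception table is hypothetical,
# not taken from any QKB case definition.
EXCEPTIONS = {"llc": "LLC", "ibm": "IBM", "mcdonald's": "McDonald's"}

def proper_case(value: str) -> str:
    words = []
    for word in value.split():
        override = EXCEPTIONS.get(word.lower())
        words.append(override if override else word.capitalize())
    return " ".join(words)

print(proper_case("ibm corporation llc"))  # -> "IBM Corporation LLC"
```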
a proprietary file type in the Quality Knowledge Base. A chop table contains lexing rules used to separate characters in an input data string into semantically significant substrings.
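Conceptually, lexing rules of this kind decide where one substring ends and the next begins. The sketch below is a simplified stand-in that separates alphanumeric runs from punctuation; it does not reflect the proprietary chop table format.

```python
import re

# Illustrative lexing step: split an input string into word and punctuation
# substrings. Real chop tables apply proprietary, per-character rules.
def chop(value: str) -> list[str]:
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", value)

print(chop("Mr. Bob Brauer, Jr."))
# -> ['Mr', '.', 'Bob', 'Brauer', ',', 'Jr', '.']
```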
the semantic nature of a data string. The data type provides a context that determines which set of logic is applied to a data string during data cleansing operations. Example data types are Name, Address, and Phone Number.
an algorithm in the Quality Knowledge Base. The definitions in the Quality Knowledge Base are the data management algorithms that are available for use in other SAS applications such as DataFlux Data Management Studio or SAS Data Quality Server.
an algorithm used to extract attributes from an input data string.
an algorithm used to determine the probable gender of a name or identity-type input string.
a file that contains a set of rules that represent expected patterns of words or characters in a given context. The parsing operation uses a vocabulary to identify the basic categories for each word or character. The patterns constructed from the categories of those words or characters are then compared with rules in the grammar. If a rule that captures these patterns is found, a solution is produced. Rules in a grammar contain patterns that consist of one or more symbols called categories. Categories can be either derived or basic. A derived category is composed of basic categories or other derived categories. Basic categories cannot be decomposed further; they are the atomic categories assigned to words or characters in the vocabulary.
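The sketch below illustrates the relationship between a vocabulary, basic categories, and a derived category using made-up category names (PRE, GIV, FAM, NAME); it is not the QKB grammar file format.

```python
# Hypothetical categories: PRE, GIV, FAM are basic; NAME is derived.
VOCABULARY = {"mr": "PRE", "mrs": "PRE", "bob": "GIV", "brauer": "FAM"}
GRAMMAR = {"NAME": [["PRE", "GIV", "FAM"], ["GIV", "FAM"]]}

def categorize(words):
    # Look up the basic category for each word, as a vocabulary would.
    return [VOCABULARY.get(w.lower().rstrip("."), "UNK") for w in words]

def matches(words, target="NAME"):
    # A solution is produced when the pattern of categories matches a rule.
    return categorize(words) in GRAMMAR[target]

print(matches("Mr. Bob Brauer".split()))  # True: PRE GIV FAM is a NAME rule
```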
an algorithm used to identify an input string as a member of a value group or category.
a set of definitions that support data quality operations for a specific country and language combination. For example, Canada has two locales, English, Canada (ENCAN) and French, Canada (FRCAN), which work together to handle both English and French Canadian data.
a process that attempts to identify the country of origin of a particular piece of data based on an address, a country code, or some other field.
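A minimal sketch of the idea, assuming a hypothetical mapping from country codes and address keywords to locale codes:

```python
# Hypothetical mapping from country codes to QKB locale codes.
LOCALES = {"CA": ("ENCAN", "FRCAN"), "US": ("ENUSA",)}

def guess_locales(country_code: str, address: str = "") -> tuple[str, ...]:
    # Prefer an explicit country code; fall back to keywords in the address.
    if country_code.upper() in LOCALES:
        return LOCALES[country_code.upper()]
    if "canada" in address.lower():
        return LOCALES["CA"]
    return ()

print(guess_locales("", "123 Rue Principale, Montréal, Canada"))
# -> ('ENCAN', 'FRCAN')
```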
the end result of passing data through a match definition: a normalized, encrypted string that represents the portions of a data string considered significant with regard to the semantic identity of the data. Two data strings are said to "match" if the same match code is generated for each of them.
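The sketch below conveys the general idea only: drop insignificant tokens, normalize what remains, and encode it, so that near-duplicate values produce the same code. The noise-word list and hashing step are hypothetical, not SAS's match-code algorithm.

```python
import hashlib

# Crude illustration only: real match definitions apply parsing, phonetics,
# and per-token rules before encoding.
NOISE_WORDS = {"inc", "corp", "the"}

def match_code(value: str) -> str:
    tokens = [t.strip(".,").lower() for t in value.split()]
    significant = " ".join(t for t in tokens if t and t not in NOISE_WORDS)
    return hashlib.md5(significant.encode("utf-8")).hexdigest()[:12]

print(match_code("ACME Corp.") == match_code("The Acme corp"))  # True
```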
an algorithm used to generate a match code for a data string of a specific data type.
the process of dividing a data string into a set of token values. For example: Mr. Bob Brauer [Mr. = Prefix, Bob = Given Name, Brauer = Family Name]
a name for a context-specific parsing algorithm. A parse definition determines the names and contents of the substrings that will hold the results of a parse operation.
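As a simplified illustration of parsing the example above into named tokens (real parse definitions rely on chop tables, vocabularies, and grammars rather than word position), consider:

```python
# Illustrative only: assigns tokens by position after splitting; real parse
# definitions resolve ambiguity with grammars and vocabularies.
PREFIXES = {"mr", "mrs", "ms", "dr"}

def parse_name(value: str) -> dict[str, str]:
    words = value.replace(".", "").split()
    tokens = {}
    if words and words[0].lower() in PREFIXES:
        tokens["Prefix"] = words.pop(0)
    if words:
        tokens["Family Name"] = words.pop()
    if words:
        tokens["Given Name"] = " ".join(words)
    return tokens

print(parse_name("Mr. Bob Brauer"))
# -> {'Prefix': 'Mr', 'Family Name': 'Brauer', 'Given Name': 'Bob'}
```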
a regular expression library that forms the basis of a pattern recognition algorithm.
an algorithm applied to a data string to reduce it to a value that will match other data strings with similar pronunciations.
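For illustration, a simplified Soundex-style reduction shows the general technique; the QKB's phonetics logic is locale-specific and not necessarily Soundex.

```python
# Simplified Soundex-style reduction for illustration only.
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}

def phonetic_key(word: str) -> str:
    word = word.lower()
    key, last = word[0].upper(), CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != last:
            key += code
        last = code
    return (key + "000")[:4]

print(phonetic_key("Brauer"), phonetic_key("Brower"))  # B660 B660
```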
the SAS Quality Knowledge Base (QKB) is a collection of files that store data and logic that define data management operations such as parsing, standardization, and matching. SAS software products reference the QKB when performing data management operations on your data.
a language composed of symbols and operators that enables you to express how a computer application should search for a specified pattern in text. A pattern can then be replaced with another pattern, also described using the regular expression language.
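For example, using Python's re module (the exact syntax varies slightly between regular expression implementations), a phone-number pattern can be found and rewritten in a standard layout:

```python
import re

# Find North American style phone numbers and rewrite them in one layout.
# The pattern and replacement layout are illustrative choices.
text = "Call 919.555.1234 or (919) 555-9876."
standardized = re.sub(r"\(?(\d{3})\)?[ .-](\d{3})[ .-](\d{4})", r"(\1) \2-\3", text)
print(standardized)  # Call (919) 555-1234 or (919) 555-9876.
```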
in matching procedures, sensitivity refers to the relative tightness or looseness of the expected match results. A higher sensitivity setting indicates that you want the values in your match results to be very similar to each other. A lower sensitivity setting indicates that you want the match results to be "fuzzier" in nature.
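Conceptually (the numeric scale and mapping below are hypothetical, not the scale SAS uses), a higher sensitivity preserves more of each value in the match code, so fewer distinct values collide:

```python
# Illustrative only: higher sensitivity keeps more characters of each word,
# so only very similar values share a match code.
def sensitive_code(value: str, sensitivity: int) -> str:
    keep = max(1, sensitivity // 20)  # hypothetical mapping from 0-100 to characters kept
    return " ".join(w.lower()[:keep] for w in value.split())

print(sensitive_code("Robert Brauer", 50) == sensitive_code("Rob Brauer", 50))  # True
print(sensitive_code("Robert Brauer", 95) == sensitive_code("Rob Brauer", 95))  # False
```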
an algorithm used to standardize an input data string.
a collection of transformation rules that typically apply to one subject area such as company name standardization or province code standardization.
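A scheme behaves like a lookup table of transformation rules. The province-code entries below are made-up examples:

```python
# Hypothetical standardization scheme for Canadian province codes.
PROVINCE_SCHEME = {"ontario": "ON", "ont": "ON", "ont.": "ON",
                   "quebec": "QC", "que": "QC", "pq": "QC"}

def apply_scheme(value: str, scheme: dict[str, str]) -> str:
    # Return the standard value when a rule applies; otherwise pass through.
    return scheme.get(value.strip().lower(), value)

print(apply_scheme("Ont.", PROVINCE_SCHEME))    # ON
print(apply_scheme("Quebec", PROVINCE_SCHEME))  # QC
```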
an output substring produced by an extraction or parse definition. A token contains a word or group of words that have a semantically atomic meaning in an input data string. A set of expected tokens is defined for each data type.
a proprietary file type in the Quality Knowledge Base. A vocabulary is a lexicon of words used for categorizing data for context-specific look-ups.