Glossary

analysis data set
in SAS data quality, a SAS output data set that provides information about the degree of divergence in specified character values.
case definition
a part of a locale that is referenced during data cleansing to impose on character values a consistent usage of uppercase and lowercase letters.
cleanse
to improve the consistency and accuracy of data by standardizing it, reorganizing it, and eliminating redundancy.
cluster
in SAS data quality, a set of character values that have the same match code.
composite match code
a match code that consists of a concatenation of match codes from values from two or more input character variables in the same observation. A delimiter can be specified to separate the individual match codes in the concatenation.
compound match code
a match code that consists of a concatenation of match codes that are created for each token in a delimited or parsed string. Within a compound match code, individual match codes might be separated by a delimiter.
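The difference between a composite match code and a compound match code can be sketched in a few lines of Python. This is only an illustration: `toy_match_code` is a hypothetical stand-in (uppercase the value, keep the first letter, drop later vowels and non-letters) and is not the encoding that SAS generates from a match definition. The `|` delimiter and the sample values are likewise assumptions.

```python
def toy_match_code(value: str) -> str:
    """Hypothetical stand-in for a match code (not the SAS algorithm)."""
    letters = [c for c in value.upper() if c.isalpha()]
    if not letters:
        return ""
    head, tail = letters[0], letters[1:]
    # Keep the first letter, then drop vowels from the remainder.
    return head + "".join(c for c in tail if c not in "AEIOU")


def composite_match_code(values, delimiter="|"):
    """Join the match codes of several variables from one observation."""
    return delimiter.join(toy_match_code(v) for v in values)


def compound_match_code(text, token_delimiter=",", code_delimiter="|"):
    """Join the match codes of each token in one delimited string."""
    tokens = text.split(token_delimiter)
    return code_delimiter.join(toy_match_code(t) for t in tokens)


# One observation with two character variables (assumed sample data).
observation = {"name": "Robert Smith", "city": "Raleigh"}
composite = composite_match_code([observation["name"], observation["city"]])

# One delimited string with two tokens.
compound = compound_match_code("Robert,Smith")

print(composite)  # RBRTSMTH|RLGH
print(compound)   # RBRT|SMTH
```

The composite code combines one code per variable, whereas the compound code combines one code per token within a single string.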
data analysis
in SAS data quality, the process of evaluating input data sets in order to determine whether data cleansing is needed.
data cleansing
the process of eliminating inaccuracies, irregularities, and discrepancies from data.
data quality
the relative value of data, which is based on the accuracy of the knowledge that can be generated using that data. High-quality data is consistent, accurate, and unambiguous, and it can be processed efficiently.
data transformation
in SAS data quality, a cleansing process that applies a scheme to a specified character variable. The scheme creates match codes internally to create clusters. All values in each cluster are then transformed to the standardization value that is specified in the scheme for each cluster.
delimiter
a character that serves as a boundary that separates the elements of a text string.
gender definition
a part of a locale that is referenced during data cleansing to determine the gender of individuals based on the names of those individuals.
guess definition
a part of a locale that is referenced when selecting, from the locale list, the locale that is the best choice for analyzing or cleansing the specified character values.
identification definition
a part of a locale that is referenced during data analysis or data cleansing to determine categories for specified character values.
locale
a setting that reflects the language, local conventions, and culture for a geographic region. Local conventions can include specific formatting rules for paper sizes, dates, times, and numbers, and a currency symbol for the country or region. Some examples of locale values are French_Canada, Portuguese_Brazil, and Chinese_Singapore.
locale list
an ordered list of locales that is loaded into memory prior to data analysis or data cleansing. The first locale in the list is the default locale.
match
a set of values that produce identical match codes or identical match code components. Values that have identical match codes are assigned to the same cluster.
match code
an encoded version of a character value that is created as a basis for data analysis and data cleansing. Match codes are used to cluster and compare character values.
match definition
a part of a locale that is referenced during the creation of match codes. Each match definition is specific to a category of data content. In the ENUSA locale, for example, match definitions are provided for names, e-mail addresses, and street addresses, among others.
name prefix
a title of respect or a professional title that precedes a first name or an initial. For example, Mr., Mrs., and Dr. are name prefixes.
name suffix
a part of a name that follows the last name. For example, Jr. and Sr. are name suffixes.
parse
to analyze text, such as a SAS statement, for the purpose of separating it into its constituent words, phrases, punctuation marks, values, or other types of information. The information can then be analyzed according to a definition or set of rules.
parse definition
a part of a locale that is referenced during the parsing of character values. The parse definition specifies the number and location of the delimiters that are inserted during parsing. The location of the delimiters depends on the content of the character values.
parse token
a named element that can be assigned a value during parsing. The specified parse definition provides the criteria that detect the value in the string. After the value is detected and assigned to the token, the character value can be manipulated using the name of the token.
parsed string
in SAS data quality, a text string in which a delimiter and a name have been inserted at the beginning of each token. The string is automatically parsed by referencing a parse definition.
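As a rough illustration of parse tokens and parsed strings, the following Python sketch splits a personal name into named tokens and then writes a delimiter and a token name in front of each value. Everything here is an assumption made for the example: the token names, the `|` delimiter, the `name=value` layout, and the tiny rule set stand in for a real parse definition and do not reproduce the format that SAS emits.

```python
# Hypothetical lookup lists standing in for a parse definition.
PREFIXES = {"MR.", "MRS.", "DR."}
SUFFIXES = {"JR.", "SR."}
TOKEN_ORDER = ["Name Prefix", "Given Name", "Family Name", "Name Suffix"]


def parse_name(text: str) -> dict:
    """Assign words of a name to named tokens using simple toy rules."""
    words = text.split()
    tokens = {}
    if words and words[0].upper() in PREFIXES:
        tokens["Name Prefix"] = words.pop(0)
    if words and words[-1].upper() in SUFFIXES:
        tokens["Name Suffix"] = words.pop()
    if words:
        tokens["Family Name"] = words.pop()
    if words:
        tokens["Given Name"] = " ".join(words)
    return tokens


def to_parsed_string(tokens: dict, delimiter: str = "|") -> str:
    """Insert a delimiter and the token name before each token value."""
    return "".join(
        f"{delimiter}{name}={tokens[name]}"
        for name in TOKEN_ORDER
        if name in tokens
    )


tokens = parse_name("Dr. Jane Doe Jr.")
parsed = to_parsed_string(tokens)

print(tokens["Family Name"])  # Doe
print(parsed)  # |Name Prefix=Dr.|Given Name=Jane|Family Name=Doe|Name Suffix=Jr.
```

Once the values are held under token names, each one can be retrieved or changed by name, which is the manipulation that the parse token entry describes.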
Quality Knowledge Base
a collection of locales and other information that is referenced during data analysis and data cleansing. For example, to create match codes for a data set that contains street addresses in Great Britain, you would reference the ADDRESS match definition in the ENGBR locale in the Quality Knowledge Base.
scheme
a reusable collection of match codes and standardization values that is applied to input character values for the purposes of transformation or analysis.
sensitivity
in SAS data quality, a value that specifies the amount of information in match codes. Greater sensitivity values result in match codes that contain greater amounts of information. As sensitivity values increase, character values must be increasingly similar to generate the same match codes.
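The effect of sensitivity can be illustrated with a toy model in Python. The sketch assumes a hypothetical match code that simply keeps the first `sensitivity` characters of the normalized value, so the numeric scale used here is invented and does not correspond to the sensitivity values that SAS accepts.

```python
def toy_code_with_sensitivity(value: str, sensitivity: int) -> str:
    """Hypothetical match code: more sensitivity keeps more characters."""
    normalized = "".join(c for c in value.upper() if c.isalnum())
    return normalized[:sensitivity]


values = ["Jonathan", "Jonathon", "Jones"]  # assumed sample data

# Count how many distinct match codes each sensitivity level produces.
distinct_codes = {
    level: len({toy_code_with_sensitivity(v, level) for v in values})
    for level in (3, 5, 8)
}

print(distinct_codes)  # {3: 1, 5: 2, 8: 3}
```

At the lowest level all three values share one code, and at the highest level each value gets its own code, so fewer values fall into the same cluster as sensitivity rises.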
standardization definition
a part of a locale that is referenced during data cleansing to impose a specified format on character values.
standardize
to eliminate unnecessary variation in data in order to maximize the consistency and accuracy of the data.
token
in SAS data quality, a named word or phrase in a parsed or delimited string that can be individually analyzed and cleansed.
transformation
in data integration, an operation that extracts data, transforms data, or loads data into data stores.
transformation value
in SAS data quality, the most frequently occurring value in a cluster. In data cleansing, this value is propagated to all of the values in the cluster.
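The relationship between clusters and transformation values can be sketched as follows. This Python example assumes a hypothetical match code (uppercase letters and digits only) in place of the codes that a match definition would produce, and the input values are invented sample data.

```python
from collections import Counter, defaultdict


def toy_match_code(value: str) -> str:
    """Hypothetical stand-in for a match code (not the SAS algorithm)."""
    return "".join(c for c in value.upper() if c.isalnum())


def transform(values):
    """Replace each value with the most frequent value in its cluster."""
    # Group values that share a match code into clusters.
    clusters = defaultdict(list)
    for v in values:
        clusters[toy_match_code(v)].append(v)

    # The transformation value is the most frequent member of each cluster.
    # Ties, if any, are resolved arbitrarily in this sketch.
    transformation_value = {
        code: Counter(members).most_common(1)[0][0]
        for code, members in clusters.items()
    }

    # Propagate the transformation value to every value in the cluster.
    return [transformation_value[toy_match_code(v)] for v in values]


raw = [
    "SAS Institute", "sas institute", "SAS Institute", "S.A.S. Institute",
    "IBM", "I.B.M.", "IBM",
]
cleansed = transform(raw)

print(cleansed)
```

In this sample, the first four values collapse to "SAS Institute" and the last three collapse to "IBM", because those are the most frequent spellings in their clusters.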