Glossary

analysis data set: in SAS data quality, a SAS output data set that provides information about the degree of divergence in specified character values.
Blue Fusion data format: a file format for schemes that can be created and applied in data quality software from SAS and from DataFlux (a SAS company). Schemes in Blue Fusion data format are sometimes referred to as BFD schemes. Schemes can also be created in SAS data format.
case definition: a part of a locale that is referenced during data cleansing to impose a capitalization scheme on a character variable.
cleanse: to improve the consistency and accuracy of data by standardizing it, reorganizing it, and eliminating redundancy.
cluster: in SAS data quality, a set of character values that have the same match code.
composite match code: a match code that consists of a concatenation of match codes from values from two or more input character variables in the same observation. A delimiter can be specified to separate the individual match codes in the concatenation.
compound match code: a match code that consists of a concatenation of match codes that are created for each token in a delimited or parsed string. Within a compound match code, individual match codes might be separated by a delimiter.
data analysis: in SAS data quality, the process of evaluating input data sets in order to determine whether data cleansing is needed.
data cleansing: the process of eliminating inaccuracies, irregularities, and discrepancies from data.
data definitions: specifications of how categories of data are processed. Data definitions are contained in the Quality Knowledge Base for a number of locales.
data quality: the relative value of data, which is based on the accuracy of the knowledge that can be generated using that data. High-quality data is consistent, accurate, and unambiguous, and it can be processed efficiently.
data transformation: in SAS data quality, a cleansing process that applies a scheme to a specified character variable. The scheme creates match codes internally to create clusters. All values in each cluster are then transformed to the standardization value that is specified in the scheme for each cluster.
delimiter: a character that separates words or phrases in a text string.
gender definition: a part of a locale that is referenced during data cleansing to determine the gender of individuals based on the names of those individuals.
guess definition: a part of a locale that is referenced in order to select, from the locale list, the locale that is the best choice for use in the analysis or cleansing of the specified character values.
identification definition: a part of a locale that is referenced during data analysis or data cleansing to determine categories for specified character values.
locale: a set of data definitions for a national language and geographical region. The locale reflects the language, local conventions, and culture for a geographic region. Local conventions can include specific formatting rules for dates, times, and numbers, and a currency symbol for the country or region. Collating sequences, paper sizes, and conventions for postal addresses and telephone numbers are also typically specified for each locale. Some examples of locale values are French_Canada, Portuguese_Brazil, and Chinese_Singapore.
locale list: an ordered list of locales that is loaded into memory prior to data analysis or data cleansing. The first locale in the list is the default locale.
match: a set of values that produce identical match codes or identical match code components. Values that produce identical match codes are assigned to the same cluster. See also match code, match code component, and cluster.
match code: an encoded version of a character value that is created as a basis for data analysis and data cleansing. Match codes are used to cluster and compare character values.
match definition: a part of a locale that is referenced during the creation of match codes. Each match definition is specific to a category of data content. For example, in the ENUSA locale, match definitions are provided for names, e-mail addresses, and street addresses, among others. See also sensitivity.
name prefix: a title of respect or a professional title that precedes a first name or an initial. For example, Mr., Mrs., and Dr. are name prefixes.
name suffix: a part of a name that follows the last name. For example, Jr. and Sr. are name suffixes.
parse: in SAS data quality, a process that inserts into a character value a series of delimiters, as determined by a specified parse definition.
parse definition: a part of a locale that is referenced during the parsing of character values. The parse definition specifies the number and location of the delimiters that are inserted during parsing. The location of the delimiters depends on the content of the character values. See also token.
parse token: a named element that can be assigned a value during parsing. Tokens are assigned values based on the specified parse definition. The value can then be manipulated using the name of the token. See also token.
parsed string: in SAS data quality, a text string in which a delimiter and a name have been inserted at the beginning of each token. The string is automatically parsed by referencing a parse definition. See also delimited string.
Quality Knowledge Base: a collection of locales and other information that is referenced during data analysis and data cleansing. For example, to create match codes for a data set with addresses in Great Britain, you would reference the ADDRESS match definition, in the ENGBR locale.
SAS data format: a file format for schemes that can be created and applied in data quality software from SAS and from DataFlux (a SAS company). Schemes in SAS data format are sometimes referred to as ??? schemes.
scheme: in SAS data quality, a reusable collection of match codes and standardization values that is applied to input character values for the purposes of transformation or analysis. Schemes can be created in Blue Fusion data format or SAS data format. See also Blue Fusion data format and SAS data format.
sensitivity: in SAS data quality, a value that specifies the amount of information in match codes. Greater sensitivity values result in match codes that contain greater amounts of information. As sensitivity values increase, character values must be increasingly similar to generate the same match codes.
standardization definition: a part of a locale that is referenced during data cleansing to impose a specified format on character values.
standardize: in SAS data quality, to impose a specified format on character values. A standardization definition is used to standardize the data.
token: in SAS data quality, a named word or phrase in a parsed or delimited string that can be individually analyzed and cleansed. See also parse token.
transformation: in SAS data quality, a process that converts a group of similar data values to the single value that is most commonly present in the group.
transformation value: in SAS data quality, the most frequently occurring value in a cluster. In data cleansing, this value is propagated to all of the values in the cluster.
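
As an informal illustration of how the terms match code, cluster, composite match code, and transformation value relate to one another, the following Python sketch may help. It is only a toy model: the function toy_match_code is a hypothetical stand-in that uppercases a value and drops punctuation and spaces, whereas real match codes are generated from a match definition in a locale at a chosen sensitivity. All function names and sample values here are assumptions made for the example.

```python
from collections import Counter, defaultdict


def toy_match_code(value):
    """Hypothetical stand-in for a match code (NOT the SAS algorithm):
    uppercase the value and keep only letters and digits."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def composite_match_code(values, delimiter="|"):
    """Concatenate the toy match codes of several variables from one
    observation, separated by a delimiter."""
    return delimiter.join(toy_match_code(v) for v in values)


def cluster_values(values):
    """Group character values that share the same toy match code."""
    clusters = defaultdict(list)
    for v in values:
        clusters[toy_match_code(v)].append(v)
    return dict(clusters)


def transformation_values(clusters):
    """For each cluster, pick the most frequently occurring value."""
    return {
        code: Counter(members).most_common(1)[0][0]
        for code, members in clusters.items()
    }


def transform(values):
    """Replace every value with the transformation value of its cluster."""
    best = transformation_values(cluster_values(values))
    return [best[toy_match_code(v)] for v in values]


# Sample input values (invented for illustration).
names = [
    "SAS Institute", "SAS Institute", "sas institute", "S.A.S. Institute",
    "DataFlux", "Dataflux", "DataFlux",
]

cleansed = transform(names)
```

In this sketch the first four values fall into one cluster and the last three into another, and each value is replaced by the most frequent member of its cluster. A stricter toy_match_code (for example, one that keeps case) would play the role of a higher sensitivity: values would need to be more alike to land in the same cluster.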
