Previous Page | Next Page

Concepts

Schemes

A scheme is file that you create to transform the values of a character variable. Applying the scheme to the data, transforms similar representations of a data value into a standard representation.

To create and apply multiple schemes in a single procedure, see The DQSCHEME Procedure. Use DQSCHEMEAPPLY Function and DQSCHEMEAPPLY CALL Routine to apply schemes as well.

This only pertains to schemes that are BFD (DataFlux) type of schemes. The Scheme Builder application does not recognize schemes that are stored as SAS data sets.


Create the Schemes

Schemes are created with the CREATE statement in the DQSCHEME procedure. The CREATE statement uses the matching technology, behind the scenes, to effectively group like data values together. A survivor is selected out of each group to be the standard value for that group of data values. The survivor is selected based on highest frequency of occurrence of the data values.

Note:   During scheme creation, the DQSCHEME procedure evaluates the definition of each input variable in each CREATE statement. An error message is generated if the defined length of an input variable exceeds 1024 bytes.  [cautionend]

Scheme data sets are created in SAS format or in Blue Fusion Data format. Blue Fusion Data (BFD) format is recognized by SAS and by DataFlux dfPower Studio software.


Analysis Data Sets

Analysis data sets show the groupings of like data values in the scheme-building process. These are the groupings from which the standard value is selected. The data sets are generated by specifying the ANALYSIS= option in the CREATE statement of the DQSCHEME procedure. The analysis data sets enable you to experiment with different options to create a scheme that provides optimal data cleansing.

The key to optimizing a scheme is to choose a sensitivity value that best suits your data and your goal. You can create a series of analysis data sets using different sensitivity values to compare the results. Changing the sensitivity value changes the clustering of input values, as described in Sensitivity.

When you decide on a sensitivity level, you can create the scheme data set by replacing the ANALYSIS= option with the SCHEME= option in the CREATE statement.

The analysis data set contains one observation for each unique input value. Any adjacent blank spaces are removed from the input values. The COUNT variable describes the number of occurrences of that value.

The CLUSTER variable represents the groupings of data values that are similar based on the selected sensitivity. One standard value is selected from each cluster, based on the value with the highest COUNT (frequency).

Specify the INCLUDE_ALL option in the CREATE statement to include all input values in the scheme. This includes the unique input values that did not receive a cluster number in the analysis data set.

See Create the Schemes.


Applying Schemes

After you create a scheme data set, apply it to an input variable to transform its values. You can apply a scheme with the APPLY statement in the DQSCHEME procedure (see APPLY Statement), or with the DQSCHEMEAPPLY function or CALL routine. Use the DQSCHEMEAPPLY CALL routine if you want to return the number of transformations that occurred during the application of the scheme. See DQSCHEMEAPPLY Function.

The scheme data set consists of the DATA and STANDARD variables. The DATA variable contains the input character values that were used to create the scheme. The STANDARD variable contains the transformation values. All of the DATA values in a given cluster have the same STANDARD value. The STANDARD values are the values that were the most common values in each cluster when the scheme was created.

When you apply a scheme to a SAS data set, an input value is transformed when the it matches a DATA value in the scheme. The transformation replaces the input value with the transformation value.

The lookup method determines how the input value is matched to the DATA values in the scheme. The SCHEME_LOOKUP option or argument specifies that the match must be exact, although case is insensitive. Alternatively the match can consist of a match between the match codes of the input value, and the match codes of the DATA values. When a match occurs, any adjacent blank spaces in the transformation value are replaced with single blank spaces. Then the value is written into the output data set.

If no match is found for an input value, that exact value is written into the output data set.

Specify the MODE argument or the MODE= option to apply schemes in one of two modes: phrase or element. Applying a scheme by phrase compares the entire input value (or the match code of the entire value) to the values (or match codes) in the scheme. Phrase is the default scheme apply mode.

When you apply a scheme by element, each element in the input value (or match code of each element) is compared to the values (or match codes) in the scheme. Applying schemes by element enables you to change one or more elements in an input value, without changing any of the other elements in that value.

The file format of a scheme is important when that scheme is applied. In the z/OS operating environment, schemes must be created and applied in SAS format. Schemes that are stored in a PDS in BFD format can be applied. Schemes in BFD format can be converted to SAS format using the CONVERT statement in the DQSCHEME procedure. Before you apply a scheme, see Applying Schemes.

Note:   Schemes in BFD format cannot be created or displayed in the z/OS operating environment.  [cautionend]


Meta Options

Meta options are stored in the scheme when the scheme is created. The options provide default values for certain options of the DQSCHEME procedure's APPLY statement. The meta options also store default arguments for the DQSCHEMEAPPLY function or CALL routine. Default values are stored for the lookup mode (SCHEME_LOOKUP option or argument), apply mode (MODE option or argument), match definition, and sensitivity level. The values of the meta options are superseded when other values are specified in the APPLY statement or in the DQSCHEMEAPPLY function or CALL routine.

The meta options for the match definition and sensitivity value are valid only when the scheme is applied with match-code lookup, when the value of the SCHEME_LOOKUP option is USE_MATCHDEF.

The meta options are stored differently depending on the scheme format. For schemes in SAS format, the meta options are stored in the data set label. For schemes in BFD format, the meta options are stored within the scheme itself.

Note:   In programs that create schemes in SAS format, do not specify a data set label; doing so deletes the meta options.  [cautionend]

The meta options are stored using the following syntax:

'lookup-method' 'apply-mode' 'sensitivity-level' 'match-definition'
lookup-method

EM specifies that the default value of the SCHEME_LOOKUP option or argument is EXACT. For an input value to be transformed, that value must exactly match a  DATA value in the scheme.

IC specifies SCHEME_LOOKUP=IGNORE_CASE.

UM specifies that SCHEME_LOOKUP=USE_MATCHDEF. Match codes are created and compared for all input values and all DATA values in the scheme.

apply-mode

E specifies that the default value of the MODE option or argument is ELEMENT.

P specifies that MODE=PHRASE.

sensitivity-level

is the amount of information in the match codes that is generated when SCHEME_LOOKUP=USE_MATCHDEF.

Valid values range from 50 to 95.

match-definition

is the name of the default match definition that is used when the value of the SCHEME_LOOKUP option is USE_MATCHDEF.

For example, the following meta options string specifies that the scheme:
'UM' 'P' '80' 'NAME'

Previous Page | Next Page | Top of Page