Schemes

Transport the Values of a Character Variable

A scheme is a file that you create to transform the values of a character variable. Applying the scheme to the data transforms similar representations of a data value into a standard representation.
To create and apply multiple schemes in a single procedure, see DQSCHEME Procedure. Use the information found in DQSCHEMEAPPLY Function and DQSCHEMEAPPLY CALL Routine to apply schemes as well.
This information pertains only to schemes that are BFD (DataFlux) types of schemes. The Scheme Builder application does not recognize schemes that are stored as SAS DATA sets.

Creating Schemes

Schemes are created with the CREATE statement in the DQSCHEME procedure. The CREATE statement uses the matching technology, behind the scenes, to effectively group like data values together. A survivor is selected out of each group to be the standard value for that group of data values. The survivor is selected based on highest frequency of occurrence of the data values.
Note: During scheme creation, the DQSCHEME procedure evaluates the definition of each input variable in each CREATE statement. An error message is generated if the defined length of an input variable exceeds 1024 bytes.
Scheme data sets are created in SAS format or in Blue Fusion Data format. Blue Fusion Data (BFD) format is recognized by SAS and by DataFlux Data Management Studio software.
  • The SAS Data Quality Server software can create and apply schemes. You can also view the schemes with the SAS table viewer.
  • DataFlux Data Management Studio software can create, apply, and edit schemes in BFD format only.
  • In the z/OS operating environment, the SAS Data Quality Server software can create, apply, and display schemes in SAS format. Schemes in Blue Fusion format can be applied.
Note: There is a CONVERT statement that is used to convert schemes between the two formats.

Analysis Data Sets

Analysis data sets show the groupings of like data values in the scheme-building process. These are the groupings from which the standard value is selected. The data sets are generated by specifying the ANALYSIS= option in the CREATE statement of the DQSCHEME procedure. The analysis data sets enable you to experiment with different options to create a scheme that provides optimal data cleansing.
The key to optimizing a scheme is to choose a sensitivity value that best suits your data and your goal. You can create a series of analysis data sets using different sensitivity values to compare the results. Changing the sensitivity value changes the clustering of input values, as described in Sensitivity.
When you decide on a sensitivity level, you can create the scheme data set by replacing the ANALYSIS= option with the SCHEME= option in the CREATE statement.
The analysis data set contains one observation for each unique input value. Any adjacent blank spaces are removed from the input values. The COUNT variable describes the number of occurrences of that value.
The CLUSTER variable represents the groupings of data values that are similar based on the selected sensitivity. One standard value is selected from each cluster, based on the value with the highest COUNT (frequency).
Specify the INCLUDE_ALL option in the CREATE statement to include all input values in the scheme. This includes the unique input values that did not receive a cluster number in the analysis data set.
See Creating Schemes for additional information.

Applying Schemes

After you create a scheme data set, apply it to an input variable to transform its values. You can apply a scheme with the APPLY statement in the DQSCHEME procedure, as described in APPLY Statement. You can also use the DQSCHEMEAPPLY function or CALL routine. Use the DQSCHEMEAPPLY CALL routine if you want to return the number of transformations that occurred during the application of the scheme. See DQSCHEMEAPPLY CALL Routine for additional information.
The scheme data set consists of the DATA and STANDARD variables. The DATA variable contains the input character values that were used to create the scheme. The STANDARD variable contains the transformation values. All of the DATA values in a given cluster have the same STANDARD value. The STANDARD values are the values that were the most common values in each cluster when the scheme was created.
When you apply a scheme to a SAS DATA set, an input value is transformed when it matches a DATA value in the scheme. The transformation replaces the input value with the transformation value.
The lookup method determines how the input value is matched to the DATA values in the scheme. The SCHEME_LOOKUP option or argument specifies that the match must be exact, although case is insensitive. Alternatively, the match can consist of a match between the match codes of the input value and the match codes of the DATA values. When a match occurs, any adjacent blank spaces in the transformation value are replaced with single blank spaces. Then the value is written into the output data set.
If no match is found for an input value, that exact value is written into the output data set.
Specify the MODE argument or the MODE= option to apply schemes in one of two modes: PHRASE or ELEMENT. Applying a scheme by phrase compares the entire input value (or the match code of the entire value) to the values (or match codes) in the scheme. Phrase is the default scheme apply mode.
When you apply a scheme by element, each element in the input value (or match code of each element) is compared to the values (or match codes) in the scheme. Applying schemes by element enables you to change one or more elements in an input value, without changing any of the other elements in that value.
The file format of a scheme is important when that scheme is applied. In the z/OS operating environment, schemes must be created and applied in SAS format. Schemes that are stored in a PDS in BFD format can be applied. Schemes in BFD format can be converted to SAS format using the CONVERT statement in the DQSCHEME procedure.
Note: Schemes in BFD format cannot be created or displayed in the z/OS operating environment.

Meta Options

Meta options are stored in the scheme when the scheme is created. The options provide default values for certain options of the DQSCHEME procedure's APPLY statement. The meta options also store default arguments for the DQSCHEMEAPPLY function or CALL routine. Default values are stored for the lookup mode (SCHEME_LOOKUP option or argument), apply mode (MODE option or argument), match definition, and sensitivity level. The values of the meta options are superseded when other values are specified in the APPLY statement or in the DQSCHEMEAPPLY function or CALL routine.
The meta options for the match definition and sensitivity value are valid only when the scheme is applied with match code lookup, when the value of the SCHEME_LOOKUP option is USE_MATCHDEF.
The meta options are stored differently depending on the scheme format. For schemes in SAS format, the meta options are stored in the data set label. For schemes in BFD format, the meta options are stored within the scheme itself.
Note: In programs that create schemes in SAS format, do not specify a data set label; doing so deletes the meta options.
The meta options are stored using:
'lookup-method' 'apply-mode' 'sensitivity-level' 'match-definition'
lookup-method
EM specifies that the default value of the SCHEME_LOOKUP option or argument is EXACT. For an input value to be transformed, that value must exactly match a  DATA value in the scheme.
IC specifies that SCHEME_LOOKUP=IGNORE_CASE.
UM specifies that SCHEME_LOOKUP=USE_MATCHDEF. Match codes are created and compared for all input values and all DATA values in the scheme.
apply-mode
E specifies that the default value of the MODE option or argument is ELEMENT.
P specifies that MODE=PHRASE.
sensitivity-level
the amount of information in the match codes that is generated when SCHEME_LOOKUP=USE_MATCHDEF.
Valid values range from 50 to 95.
match-definition
the name of the default match definition that is used when the value of the SCHEME_LOOKUP option is USE_MATCHDEF.
For example, the meta options string, 'UM' 'P' '80' 'NAME', specifies that the scheme:
  • lookup method is match code
  • the apply-mode is by phrase
  • the sensitivity-level is 80
  • the match-definition is NAME