| Using the SAS Data Quality Server Software |
A scheme is a data set that you create to transform the values of a character variable. You first create a scheme data set, and then you apply it to the data. The transformation converts related groups of values into a single value.
The DQSCHEME procedure (see The DQSCHEME Procedure) is used to create and apply schemes. You can create and apply multiple schemes in a single procedure step. The DQSCHEMEAPPLY function and CALL routine are also used to apply schemes (see DQSCHEMEAPPLY CALL Routine). You can also display, create, apply, and edit schemes in the dfPower Studio software from DataFlux (a SAS company).
| Creating Schemes |
Before you apply a scheme (see Applying Schemes), you first create the scheme data set. Schemes are created with the CREATE statement in the DQSCHEME procedure. When you submit a CREATE statement, PROC DQSCHEME creates match codes and assigns cluster numbers. A unique cluster number is assigned to each group of two or more input values that generate the same match code. After cluster numbers are assigned, the transformation value is determined. The transformation value is the most common value in each cluster.
Note: During scheme creation, the DQSCHEME procedure evaluates the definition
of each input variable in each CREATE statement. An error message is generated
if the defined length of an input variable exceeds 1024 bytes. ![[cautionend]](../../../../common/61925/HTML/default/images/cautend.gif)
Scheme data sets are created in SAS format or BFD format. BFD stands for Blue Fusion Data, which is a format that is recognized by SAS and by the dfPower Studio software. The SAS Data Quality Server software can create, apply, and display schemes in SAS format or BFD format. The dfPower Studio software can create, apply, display, and edit schemes in BFD format only. In the z/OS operating environment, the SAS Data Quality Server software can create, apply, and display schemes in SAS format; schemes in BFD format can be applied.
| About Analysis Data Sets |
Analysis data sets describe how a transformation would take place, without actually transforming the data; they enable you to experiment with different options to arrive at a scheme that provides optimal data cleansing. Analysis data sets are generated by specifying the ANALYSIS= option in the CREATE statement of the DQSCHEME procedure.
The key to optimizing a scheme is to choose the sensitivity value that best suits your data and your goal. You can create a series of analysis data sets using different sensitivity values to compare the results. Changing the sensitivity value changes the clustering of input values, as described in About Sensitivity.
When you decide on a sensitivity level you can create the scheme data set by replacing the ANALYSIS= option with the SCHEME= option in the CREATE statement. For further information on sensitivity values and how they affect the creation of match codes and clusters, see About Sensitivity.
The analysis data set contains one observation for each unique input value. Any adjacent blank spaces are removed from the input values. The COUNT variable describes the number of occurrences of that value. The CLUSTER variable designates a cluster number for each input character value.
The CLUSTER variable contains a value only when two or more different input values generate the same match code. An input value that is repeated receives a cluster number only if another value generates the same match code.
You can specify the INCLUDE_ALL option in the CREATE statement to include all input values in the scheme, including the unique input values that did not receive a cluster number in the analysis data set.
| Applying Schemes |
After you create a scheme data set (see Creating Schemes), you apply it to an input variable to transform its values. You can apply a scheme with the APPLY statement in the DQSCHEME procedure (see APPLY Statement), or with the DQSCHEMEAPPLY function or CALL routine (see DQSCHEMEAPPLY Function). Use the CALL routine rather than the function if you want to return the number of transformations that occurred during the application of the scheme.
The scheme data set consists of the DATA and STANDARD variables. The DATA variable contains the input character values that were used to create the scheme. The STANDARD variable contains the transformation values. All of the DATA values in a given cluster have the same STANDARD value. The STANDARD values are the values that were the most common values in each cluster when the scheme was created.
When you apply a scheme to a SAS data set, an input value is transformed when the it matches a DATA value in the scheme. The transformation replaces the input value with the transformation value.
The lookup method determines how the input value is matched to the DATA values in the scheme. The SCHEME_LOOKUP option or argument specifies that the match must be exact, case-insensitive, or consist of a match between the match codes of the input value and the match codes of the DATA values in the scheme. When a match occurs, any adjacent blank spaces in the transformation value are replaced with single blank spaces; then the value is written into the output data set. If no match is found for an input value, that exact value is written into the output data set.
You can specify the MODE argument or the MODE= option to apply schemes in one of two modes: phrase or element. Applying a scheme by phrase compares the entire input value (or the match code of the entire value) to the values (or match codes) in the scheme. Phrase is the default scheme apply mode.
When you apply a scheme by element, each element in the input value (or match code of each element) is compared to the values (or match codes) in the scheme. Applying schemes by element enables you to change one or more elements in an input value, without changing any of the other elements in that value.
The file format of a scheme is important when that scheme is applied. In the z/OS operating environment, schemes must be created and applied in SAS format. Schemes that are stored in a PDS in BFD format can be applied, and schemes in BFD format can be converted to SAS format using the CONVERT statement in the DQSCHEME procedure.
Note: Schemes in BFD format cannot be created or displayed in the z/OS
operating environment. ![[cautionend]](../../../../common/61925/HTML/default/images/cautend.gif)
| About Meta Options |
Meta options are stored in the scheme when the scheme is created; they provide default values for certain options of the APPLY statement of the DQSCHEME procedure or default arguments for the DQSCHEMEAPPLY function or CALL routine. Default values are stored for the lookup mode (SCHEME_LOOKUP option or argument), apply mode (MODE option or argument), match definition, and sensitivity level. The values of the meta options are superseded when other values are specified in the APPLY statement or in the DQSCHEMEAPPLY function or CALL routine.
When the scheme is applied, the meta options for the match definition and sensitivity value are valid only when the scheme is applied with match-code lookup (when the value of SCHEME_LOOKUP isUSE_MATCHDEF).
The meta options are stored differently depending on the scheme format. For schemes in SAS format, the meta options are stored in the data set label. For schemes in BFD format, the meta options are stored within the scheme itself.
Note: In programs that
create schemes in SAS format, do not specify a data set label; doing so deletes
the meta options. ![[cautionend]](../../../../common/61925/HTML/default/images/cautend.gif)
The meta options are stored using the following syntax:
| "lookup-method" "apply-mode" "sensitivity-level" "match-definition" |
Valid values are as follows:
| EM |
specifies that the default value of the SCHEME_LOOKUP option or argument is EXACT. For an input value to be transformed, that value must exactly match a DATA value in the scheme. |
| IC |
SCHEME_LOOKUP=IGNORE_CASE. |
| UM |
SCHEME_LOOKUP=USE_MATCHDEF. Match codes are created and compared for all input values and all DATA values in the scheme. |
Valid values are as follows:
| E |
specifies that the default value of the MODE option or argument is ELEMENT. |
| P |
MODE=PHRASE. |
specifies the amount of information in the match codes that is generated when SCHEME_LOOKUP=USE_MATCHDEF. Valid values range from 50 to 95.
specifies the name of the default match definition that will be used when the value of the SCHEME_LOOKUP option or argument is USE_MATCHDEF.
"UM" "P" "80" "NAME"
Copyright © 2007 by SAS Institute Inc., Cary, NC, USA. All rights reserved.