DataFlux Data Management Studio 2.7: User Guide

Scheme Builder - Options for Jobs

The following options are used when a job node, such as the Standardization node, applies a scheme in the context of a job. You can access this dialog by clicking the Options button at the lower right of the Scheme Builder window.

You can select options that control how your scheme will be applied to your data.  The options you select will be stored in the scheme file so they can be used in any Data Job that applies your scheme.  Note, however, that some options can be overridden in the Standardization node user interface.

Type Options

Select Element or Phrase to control whether a lookup is performed on individual elements of an input string or only on the entire input string as a whole.  If you select Element, individual elements of an input string may be transformed by a scheme entry.  For example, suppose you have the following entry in your scheme.

Data            Standard
MISTER          MR


If you apply this scheme to input string MISTER JONES with the Element option selected, the resulting output will be MR JONES.

If you select Phrase, the entire input string must match an entry in your scheme in order for the input string to be transformed.  For instance, if you apply the example scheme above to MISTER JONES with the Phrase option selected, the output will be MISTER JONES.  In other words, no transformation will occur.  In order for a transformation to occur with the Phrase option selected, your scheme would need to contain the following entry:

Data            Standard
MISTER JONES    MR JONES
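
The difference between the two Type options can be sketched in a few lines of Python. This is only an illustration of the lookup behavior described above, not the product's implementation: the function name apply_scheme, the mode parameter, and the choice to split elements on whitespace are all assumptions made for the example.

```python
def apply_scheme(value, scheme, mode="element"):
    """Apply a {data: standard} scheme to one input string.

    mode="phrase":  replace only if the whole string matches an entry.
    mode="element": look up each whitespace-separated element on its own.
    (Hypothetical helper; whitespace tokenization is an assumption.)
    """
    if mode == "phrase":
        return scheme.get(value, value)
    if mode == "element":
        return " ".join(scheme.get(token, token) for token in value.split())
    raise ValueError(f"unknown mode: {mode!r}")


scheme = {"MISTER": "MR"}
print(apply_scheme("MISTER JONES", scheme, mode="element"))  # MR JONES
print(apply_scheme("MISTER JONES", scheme, mode="phrase"))   # MISTER JONES
```

With the entry MISTER JONES / MR JONES added to the scheme, the Phrase call would return MR JONES as well.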


Match Definition Options

Instead of using a simple exact-match lookup with your scheme, you may wish to do a fuzzy lookup using a Match Definition.  A Match Definition is a callable object in the SAS Quality Knowledge Base.  A Match Definition is used to generate matchcodes, which are used for fuzzy matching.  There are different Match Definitions available for use with different types of data.  For more information about Match Definitions, see the online help for your SAS Quality Knowledge Base.

If you select a Match Definition in your Scheme Options, a matchcode is generated for each entry in your scheme when the scheme is applied to your data in a Data Job.  A matchcode is also generated for each input string to which the scheme is applied.  If the matchcode for an input string is the same as the matchcode for a data entry in your scheme, the input string is replaced with the standard associated with that data entry.  This means that the input string does not need to be identical to an entry in your scheme in order for a standard to be applied; the input string needs only to be a fuzzy match for an entry in your scheme.

For example, your scheme might include an entry such as this one:

Data             Standard
GREYWOOD HLDG    GREYWOOD HOLDINGS LLC


If your input string is a close variant of the data entry, such as GREYWOOD HOLDING, an exact match will not occur when you apply the scheme to your data.  But if you select a Match Definition in your Scheme Options, say the Organization Match Definition, GREYWOOD HOLDING might still be deemed to match GREYWOOD HLDG, and so GREYWOOD HOLDING will be replaced with the standard GREYWOOD HOLDINGS LLC.
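
The matchcode comparison described above can be sketched as follows. The toy_matchcode function is a hypothetical stand-in invented for this example (first letter plus consonants, truncated per word); real matchcodes are generated by the Match Definitions in the Quality Knowledge Base and follow far richer rules. The input GREYWOOD HOLDING is likewise an assumed variant chosen for illustration.

```python
import re

VOWELS = set("AEIOU")


def toy_matchcode(text, keep=3):
    """Hypothetical stand-in for a QKB matchcode: for each word, keep the
    first letter plus the following consonants, truncated to `keep` chars."""
    codes = []
    for word in re.findall(r"[A-Z]+", text.upper()):
        skeleton = word[0] + "".join(c for c in word[1:] if c not in VOWELS)
        codes.append(skeleton[:keep])
    return " ".join(codes)


def apply_with_matchcode(value, scheme):
    """Replace `value` with a standard when its matchcode equals the
    matchcode of a data entry in the scheme; otherwise leave it unchanged."""
    by_code = {toy_matchcode(data): standard for data, standard in scheme.items()}
    return by_code.get(toy_matchcode(value), value)


scheme = {"GREYWOOD HLDG": "GREYWOOD HOLDINGS LLC"}
print(apply_with_matchcode("GREYWOOD HOLDING", scheme))  # GREYWOOD HOLDINGS LLC
print(apply_with_matchcode("BLUEWOOD HLDG", scheme))     # BLUEWOOD HLDG
```

Because the comparison is made on matchcodes rather than on the raw strings, any two values that reduce to the same code are treated as equal, which is also how false positives arise.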

Of course, if you use a Match Definition in your Scheme Options, there is a risk that you will encounter false positives – inadvertent transformations that occur due to unwanted matches.  It is recommended that you test your Data Job carefully on a data sample before deploying it to a production environment.

To use a Match Definition with your scheme, select a Match Definition in the Name drop box under the Match Definition area in the Scheme Options dialog.  The Name drop box shows all of the Match Definitions that are available in your active Quality Knowledge Base (QKB).  For information on how to set up an active QKB, see Making a QKB Active. For details about individual Match Definitions, see the online help for your active QKB.

After you have selected a Match Definition, choose a Sensitivity value.  The Sensitivity controls the amount of fuzziness that is allowed when the scheme is applied to your data.  The higher the Sensitivity, the more closely an input string must match an entry in the scheme in order for that input string to be replaced with the standard for that entry.  You may wish to experiment with different Sensitivity values before deploying your Data Job to a production environment.
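
As a rough intuition for Sensitivity, the sketch below uses a toy matchcode in which a higher sensitivity keeps more characters of each word, so fewer distinct inputs collapse to the same code. Everything here is an assumption made for illustration: the toy_matchcode function, the mapping from sensitivity to characters kept, and the sample values 60 and 80. The actual effect of Sensitivity depends on the Match Definition you select.

```python
VOWELS = set("AEIOU")


def toy_matchcode(text, sensitivity):
    """Hypothetical matchcode: higher sensitivity keeps more characters
    per word, so the comparison becomes stricter."""
    keep = max(1, sensitivity // 20)  # assumed mapping, illustration only
    return " ".join(
        (w[0] + "".join(c for c in w[1:] if c not in VOWELS))[:keep]
        for w in text.upper().split()
    )


def is_fuzzy_match(a, b, sensitivity):
    """Two values match when their toy matchcodes are identical."""
    return toy_matchcode(a, sensitivity) == toy_matchcode(b, sensitivity)


for sensitivity in (60, 80):
    print(sensitivity, is_fuzzy_match("GREYWOOD HOLDING", "GREYWOOD HLDG", sensitivity))
# 60 True
# 80 False
```

Running the same comparison at several settings, as the loop does, is one simple way to experiment before choosing a value.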

Case sensitive - Select this checkbox if you want data entries to be looked up in your scheme in a case-sensitive manner.

Trim whitespace in data - Select this checkbox if you want scheme lookups to ignore leading and trailing whitespace and differences in the number of spaces between words.  For example, suppose you add the following entry in your scheme:

DATA             STANDARD
  FOO   BAR      BAZ


Note that there are two spaces before the word FOO and three spaces between FOO and BAR.  If you have the Trim whitespace in data checkbox selected, then when you save the scheme, this entry is converted to:

DATA             STANDARD
FOO BAR          BAZ


Leading and trailing whitespace is removed, and multiple spaces between words are collapsed to a single space. Now when you apply this scheme to the input string FOO BAR, the result will be BAZ.
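
The effect of the Case sensitive and Trim whitespace in data checkboxes on a lookup can be sketched like this. The helper names normalize and lookup and their keyword parameters are hypothetical. Note also that this sketch normalizes entries at lookup time for simplicity, whereas the text above describes the entry being converted when the scheme is saved.

```python
def normalize(text, trim_whitespace=True, case_sensitive=True):
    """Return the form of `text` that is actually compared during lookup."""
    if trim_whitespace:
        # Drop leading/trailing whitespace and collapse runs of spaces.
        text = " ".join(text.split())
    if not case_sensitive:
        text = text.upper()
    return text


def lookup(value, scheme, **options):
    """Look up `value` in a {data: standard} scheme under the given options."""
    normalized = {normalize(d, **options): s for d, s in scheme.items()}
    return normalized.get(normalize(value, **options), value)


scheme = {"  FOO   BAR": "BAZ"}  # two leading spaces, three between the words
print(lookup("FOO BAR", scheme))                         # BAZ
print(lookup("FOO BAR", scheme, trim_whitespace=False))  # FOO BAR
print(lookup("foo bar", scheme, case_sensitive=False))   # BAZ
```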

Allow duplicate data - Select this checkbox if you want to be able to add multiple entries in your scheme with the same data value.  You can use this option in combination with a Match Definition.  For example:

DATA        STANDARD
PRINCIPL    PRINCIPAL
PRINCIPL    PRINCIPLE


If you use a Match Definition when applying a scheme with the above entries, an input string containing a misspelling of either PRINCIPAL or PRINCIPLE could match both of these entries.  If this happens, both matching entries are examined, and the standard that is closest to the value in the input string is chosen to replace that value.  For instance, if the input string is PRINSIPAL SYSTEMS or PRINNCIPAL SYSTEMS, then the output will be PRINCIPAL SYSTEMS.  But if the input is PRINSIPLE SYSTEMS or PRINNCIPLE SYSTEMS, then the output will be PRINCIPLE SYSTEMS.
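
The tie-break between duplicate entries can be sketched as choosing the candidate standard with the smallest edit distance to the input element. Using Levenshtein distance as the measure of "closest" is an assumption made for this example; the guide does not state which measure the product uses, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pick_standard(token, candidates):
    """Among standards whose entries matched, pick the one closest to token."""
    return min(candidates, key=lambda standard: edit_distance(token, standard))


candidates = ["PRINCIPAL", "PRINCIPLE"]
for token in ("PRINSIPAL", "PRINNCIPAL", "PRINSIPLE", "PRINNCIPLE"):
    print(token, "->", pick_standard(token, candidates))
# PRINSIPAL -> PRINCIPAL
# PRINNCIPAL -> PRINCIPAL
# PRINSIPLE -> PRINCIPLE
# PRINNCIPLE -> PRINCIPLE
```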

Note that this is an advanced option that is typically employed only by users who have in-depth knowledge of phonetics and other aspects of the Match Definition that they have selected.


Doc ID: DMCust_SchBldr_20005.html