Using Reference Data Manager Domain Items in a Job

DataFlux Data Management Studio 2.6: User Guide

Using Reference Data Manager Domain Items in a Job

Overview

Reference Data Manager is a separately-licensed DataFlux Web Studio module. It enables you to manage a central repository of reference information. Examples of reference information include a list of NASDAQ stock symbols with their associated company names, or a list of valid zip codes with their associated cities and states. The next display shows a Reference Data Manager domain called US_Postal_Abbreviations.

The US_Postal_Abbreviations domain consists of the names and two-letter abbreviations for states in the United States. The two-letter state abbreviations are the key fields (Item Key) in the US_Postal_Abbreviations domain.

The Domain Items node in DataFlux Data Management Studio enables you to list the items within one or more Reference Data Manager domains. You can use these items to validate incoming data in a job. For example, you could use a Domain Items node to compare incoming address data to the state abbreviations in the US_Postal_Abbreviations domain that is shown above. Nonstandard abbreviations could be listed in a report. The next display illustrates such a job.

In the example job, the Domain Items node could list items in the US_Postal_Abbreviations domain, including the two-letter abbreviations for states.

Job Specific Data nodes are typically used for demonstration jobs when it is not convenient to use real data. In the example job, the Job Specific Data node could specify a few cities and states, where the state is specified as a two-letter abbreviation. Some of the abbreviations could be incorrect, to verify that the job could handle such exceptions. In a production job, a Data Source node or some other data input node would be used instead of the Job Specific Data node.

The Data Joining node uses a key field to join incoming data (from the Job Specific Data node) to a set of reference information (from the Domain Items node). In the example job, the key field would be a field for two-letter state abbreviations.

The Data Validation node evaluates the output of the join. In the example job, only rows where the state abbreviation from the data source does not match the state abbreviation from the reference data will be written to the HTML Report node at the end of the flow (Invalid State Abbreviations).

Before you can use the Domain Items node in a data job:

Appropriate domains must have been created in Reference Data Manager. See the Reference Data Manager chapter in the DataFlux Web Studio User Guide.
The Reference Data Manager domains must be linked to the repository where the data job is stored, as described in Link to a Reference Data Management Repository.

Use Reference Data Manager Domain Items in a Job

If the prerequisites have been met, then you can perform the following steps to create a job that lists the items within a Reference Data Manager domain. You can then use these items to validate incoming data. It is assumed that you are familiar with creating and running data jobs. For more information about data jobs, see Maintaining Data Jobs.

Create the Job

Create an empty data job. For the current example, the goal is to create a job flow similar to the example job above.
In the Nodes tree on the left, expand the Data Integration folder and drag a Domain Items node and a Data Joining node into the job editor on the right.
Expand the Data Inputs folder and drag a Job Specific Data node into the job editor.
Expand the Utilities folder and drag a Data Validation node into the job editor.
Expand the Outputs folder and drag an HTML Reports node into the job editor.
Connect the nodes as shown in the sample job above.

Configure the Domain Items Node

The goal is to list items from the US_Postal_Abbreviations domain, including the two-letter state abbreviations. Assume that this domain has the short name States.

Right-click the Domain Items node and select Properties.
Specify the domain from which you will list items. For the current example, you would select States, which is a short name for the example US_Postal_Abbreviations domain, as shown in the next display.
The options shown above specify that the Generated Output Fields (ID, ITEM_KEY, etc.) will be output from the States domain. The two-letter state abbreviations in the domain will appear in the ITEM_KEY output field.
In the previous display, note the job flow navigation tabs for the job (RDM Domain Items Demo) and the current node (Domain Items. When you are finished specifying options for the Domain Items node, click the tab for the job. Given the display above, you would click RDM Domain Items Demo. You will be returned to the job editor.

Configure the Job Specific Data Node

The next step is to specify the input data for the job. For the example job, you could specify a few cities and states, where the state is specified as a two-letter abbreviation. Some of the abbreviations could be incorrect, to verify that the job could handle such exceptions.

Right-click the Job Specific Data node and select Properties.
Add two fields in the Fields area of the properties window: City and State. Accept the default field type and length. These two fields will become columns in the Values area of the properties window.
Add a few cities and states in the Values area of the properties window. Specify the state as a two-letter abbreviation. Some of the abbreviations could be incorrect, as shown in the next display.
Click OK to save your changes to the Job Specific Data node.

Configure the Data Joining Node

The next step is to join the source data from the Job Specific Data node with the reference data from the Domain Items node. Both the source data and the reference data have a field that contains two-letter state abbreviations. This field will be used to create the join expression.

For the example job, you would right-click the Data Joining node and select Properties. The top half of the properties window would look similar to the next display. The fields that will be used in the join expression are highlighted.
To create the join expression for the current example, select the State field from the source data. Then select the ITEM_KEY field in the reference data.
Click the down arrow that is between the left and right table panels. A join expression will be created in the Expressions panel, as shown in the next display.
The next task is to specify outputs for the join in the Output fields area in the properties window. The outputs you select will vary with your goals. For the current example, you could select the City and State fields from the source data and the ITEM_KEY field from the reference data, as shown in the next display.

These fields will appear in the report which is generated at the end of the job.

Configure the Data Validation Node

The next task is to configure the Data Validation node that will evaluate the output of the join. In the example job, only "bad data" will be reported. That is, only rows where the state abbreviation from the data source does not match the state abbreviation from the reference data will be written to the HTML Report node at the end of the flow. The steps below describe one way to configure the Data Validation node to get the desired result.

The logic is as follows: when the data source and reference data tables are joined, both the State field from the source and the ITEM_KEY field from the reference data will have the same two-letter state abbreviation--unless the state abbreviation from the data source does not match the state abbreviation from the reference data. In that case, the ITEM_KEY field will be null. When the ITEM_KEY field is null for a row, that row will be written to the HTML Report node at the end of the flow.

Perform the following steps to implement this logic for the example job.

Right-click the Data Validation node and select Properties.
Select the ITEM_KEY field in the Field control.
Select "is not null" in the Operation control.
Click Add Condition to add the specified condition to the Expression panel, as shown in the next display.
In the Row validation control, select Passes.
In the Action control, select Remove row from output.
Click OK to save your changes.

Configure the HTML Reports Node

The HTML Reports node at the end of the example job displays rows from the data source that do not have valid state abbreviations. That is, the rows have state abbreviations that are not found in the US_Postal_Abbreviations domain. The steps below describe one way to configure the HTML Reports node.

Right-click the HTML Reports node and select Properties.
Specify a name for the report.
Specify a title for the report.
Specify a path to an HTML file for the report.
Select output columns for the report. At this point, the properties dialog for the HTML Reports node might look similar to the next display.
Click OK to save your changes.
Run the job.
A report will display the source table rows with state abbreviations that are not found in the US_Postal_Abbreviations domain. The next display shows an example report.

Documentation Feedback: yourturn@sas.com
Note: Always include the Doc ID when providing documentation feedback.

Doc ID: dfDMStd_Task_RDM_DomainItems.html