
DataFlux Data Management Studio 2.6: User Guide

Data Job Nodes

Data jobs are the main way to process data in DataFlux Data Management Studio. You can add the following types of nodes to data jobs.

The tables below give a brief description of each node. To display the online Help for a node, click its name in the table. You can also open a data job in the data job editor, select a node in the Nodes tree, and then click the Help link in the pane at the bottom of the Nodes tree.

Data Job

Name Description
Data Job (reference) Points to a data job file (*.dds file) on the file system. Enables you to include an existing data job, with its set of data-processing operations, within the current data job.

Data Inputs

You can use these nodes to specify different types of input to a data job.

Name Description
Data Source Specifies a table as an input in a data job. For an example of how to use this node, see Creating a Data Job in the Folders Tree. See also Reading Named Ranges in an Excel Spreadsheet.
SQL Query Specifies an SQL query that selects data from one or more tables. The results table is used as an input in a data job. For an example of how to use this node, see Adding a Data Job Node to a Process Job.
Text File Input Specifies a delimited text file as an input in a data job. For an example of how to use this node, see Using Text Files in a Data Job.
Fixed Width File Input Specifies a fixed-width text file as an input in a data job. For an example of how to use this node, see Using Text Files in a Data Job.
External Data Provider Provides a landing point for source data that is external to the current job. Accepts source data from another job or from user input that is specified at run time. Can be used as the first node in a data job that is called from another job. Can also be used as the first node in a data job that is deployed as a real-time data service. For an example of how to use this node, see Deploying a Data Job as a Real-Time Service.
Table Metadata Extracts metadata for a specified table. This information can be used to identify changes in the corresponding physical table.
JMS Reader Reads information from a Java Message Service (JMS) and makes this information available in a data job. See Overview of JMS Nodes.
COBOL Copybook Incorporates flat files from a mainframe environment into a data job.
Job Specific Data Sends sample data to the External Data Provider node for the purpose of testing and debugging. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
XML Input Specifies an XML file as an input in a data job. Reads selected pieces of XML and presents them collectively as rows in a table. If the XML source file contains multiple tables, use one XML Input node per table. For a conceptual sketch of this row extraction, see the example after this table.
Work Table Reader Specifies a work table as an input in a data job.
Document Conversion Enables you to address unstructured data by obtaining UTF-8 encoded text for different document types. In addition to the UTF-8 encoded text, the node can output metadata found in the converted document, such as author, title, and the number of pages. For an example of how to use this node, see Converting and Extracting a Document.
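
The following sketch illustrates the row-extraction idea behind the XML Input node: selected XML elements are presented as rows in a table. It is a minimal Python example using the standard xml.etree.ElementTree module; the element names (customers, customer) and the resulting columns are invented for illustration and are not part of the node's configuration.

# One row per selected <customer> element; one column per child element.
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer><id>1</id><name>Acme</name></customer>
  <customer><id>2</id><name>Zenith</name></customer>
</customers>
"""

rows = [{child.tag: child.text for child in customer}
        for customer in ET.fromstring(doc).iter("customer")]
print(rows)  # [{'id': '1', 'name': 'Acme'}, {'id': '2', 'name': 'Zenith'}]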

Data Outputs

You can use these nodes to specify different types of output from a data job.

Name Description
Data Target (Update) Outputs data in a variety of data formats and allows for updating existing data.
Data Target (Insert) Outputs data in a variety of data formats to a new data source (leaving your existing data as-is) or overwriting your existing data.
Delete Record Eliminates records from a data source by using the unique key of those records. For a sketch of this key-based deletion, see the example after this table.
HTML Report Creates an HTML-formatted report from the results of your data job and enables you to edit it. For an example of how to use this node, see Getting a List of Reference Data Manager Domains.
Text File Output Creates a plain-text file from the results of your data job as rows of data are received. The node closes the file as soon as no more rows are received, which makes the file available to other nodes on the same page. For an example of how to use this node, see Converting and Extracting a Document.
Fixed Width File Output Outputs your data to well-defined fixed-width fields in your output file. For an example of how to use this node, see Using Text Files in a Data Job.
Frequency Distribution Chart Creates a chart that shows how selected values are distributed throughout your data. For an example of how to use this node, see Creating a Data Job in the Folders Tree.
Entity Resolution File Output Writes clustered data to an Entity Resolution file. This file can be viewed from the Entity Resolution folder in the Folders tree. For an example of how this node can be used, see Generating an Entity Resolution File.
Match Report Produces a report that lists the duplicate records identified by your match criteria. You can then view the report with the Match Report Viewer.
JMS Writer Writes information from a data job to a Java Message Service (JMS). See Overview of JMS Nodes.
XML Output Writes data from a data job to XML.
Work Table Writer Writes data from a data job to a work table.
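
To make the key-based deletion performed by the Delete Record node more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, key column, and key values are hypothetical and are not part of the node's configuration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Zenith"), (3, "Orion")])

# Each incoming record is removed from the data source by its unique key.
keys_to_delete = [(2,), (3,)]
conn.executemany("DELETE FROM customers WHERE id = ?", keys_to_delete)

print(conn.execute("SELECT * FROM customers").fetchall())  # [(1, 'Acme')]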

Data Integration

You can use the data integration nodes to sort, join, and otherwise integrate data.

Name Description
Data Sorting Re-orders the data set at any point in a data job. For an example of how to use this node, see Deploying a Data Job as a Real-Time Service.
Domain Items Lists the items within one or more Reference Data Manager domains. In order to use this node, domains must have been created in Reference Data Manager, which is a separately licensed DataFlux Web Studio module. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
Domains Lists Reference Data Manager domains. For an example of how to use this node, see Listing Reference Data Manager Domains.
Data Joining Combines two data sets so that the records of one, the other, or both data sets are used as the basis for the resulting data set. The data joining process is similar to SQL joins. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
Data Joining (Non-Key) Joins two tables, each with the same number of records, by location in the file rather than by a unique key. This approach is much quicker than a traditional data joining step that expects to bring records together based on a unique key.
Data Union Combines all of the data from two data sets. Like an SQL union, this node simply adds the two data sets together; the resulting data set contains one record for each record in each of the original data sets.
SQL Lookup Enables finding rows in a database table that have one or more fields matching those in the data job. This style of processing can provide a significant performance advantage, especially with large databases.
SQL Execute Constructs and executes any valid SQL statement (or series of statements).
Parameterized SQL Query Writes an SQL query that contains variable inputs, also known as parameters. Each parameter in the query is represented by a question mark. When a row of data is processed, the parameters are replaced with the values of the input fields that have been designated as parameter inputs, and the query is executed. For a sketch of this parameter substitution, see the example after this table.
Web Service Issues SOAP requests to a SOAP server. The requests are constructed using an XML Template that can be discovered. You can extract fields from the response with an XML Template that defines where the fields occur in the response.
HTTP Request Issues HTTP requests to an HTTP server. The requests, and optional replies, are built by the caller and can be any content that the server is expecting. For an example of how to use this node, see Managing an HTTP Request.
SAP Remote Function Call Provides access to SAP data and metadata in the context of a data job.
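
To make the question-mark parameter convention of the Parameterized SQL Query node more concrete, here is a minimal sketch using Python's sqlite3 module, which happens to use the same ? placeholder style. The table, fields, and input rows are assumptions for illustration only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, state TEXT, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ND', 'Acme')")

# Each ? is a parameter that is filled from the designated input fields.
query = "SELECT name FROM customers WHERE id = ? AND state = ?"

# The query is executed once for each incoming row of parameter values.
for params in [(1, "ND"), (2, "SD")]:
    print(params, conn.execute(query, params).fetchall())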

Quality

You can use the quality nodes to analyze data that is specified in a data job.

Name Description
Gender Analysis Determines a gender value from a list of names. The result is placed in a new field and has three possible values: M (male), F (female), and U (unknown).
Gender Analysis (Parsed) Performs gender analysis on data that has already been parsed into given and last name fields. The gender analysis definition determines a gender value from a list of names.
Identification Analysis Determines the type of data that a string represents. For example, the node could be used to determine whether a string represents an individual's name or the name of an organization.
Parsing Separates multi-part field values into multiple, single-part fields.
Standardization Makes similar items the same. Examples of standardization are correcting misspellings (Mary instead of Mmary), using full company names instead of initials (International Business Machines instead of IBM), and using consistent naming conventions for states (North Dakota instead of ND). For a sketch of scheme-based standardization, see the example after this table.
Standardization (Parsed) Performs standardization on data that has already been parsed into its constituent parts.
Change Case Makes all alphabetical values in a field uppercase, lowercase, or proper case.
Locale Guessing Accesses information from the Quality Knowledge Base (QKB) and compares with your data to guess the country (locale) to which your data applies. This step creates an additional field that contains the locale's code.
Right Fielding Copies data from one field to another based on the data type, as determined by an identification analysis definition.
Create Scheme Generates a custom scheme file that can be used to standardize data outside of the usual DataFlux Data Management Studio standardization definitions.
Dynamic Scheme Application Performs a single scheme standardization on a specific field containing multiple locales.
Field Extraction Pulls information from a free-form text field so you can analyze product information.
Document Extraction Identifies textual entities and their usage within a given text input. The node identifies the terms (words) found in the input text and each term's usage categorization, such as vehicle, person, title, or company. For an example of how to use this node, see Converting and Extracting a Document.
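
The following sketch illustrates the general idea behind scheme-based standardization, as used by the Standardization, Create Scheme, and Dynamic Scheme Application nodes: a scheme acts as a lookup table that maps variant values to a single standard value. The mappings below are invented examples, not an actual DataFlux scheme.

# A toy "scheme": variant spellings map to one standard form.
scheme = {
    "IBM": "International Business Machines",
    "I.B.M.": "International Business Machines",
    "ND": "North Dakota",
    "N. Dakota": "North Dakota",
}

values = ["IBM", "N. Dakota", "Acme Corp"]

# Values found in the scheme are replaced; everything else passes through.
standardized = [scheme.get(v, v) for v in values]
print(standardized)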

Enrichment

You can use the enrichment nodes, together with third-party reference databases, to enrich, standardize, and augment the data that is specified in a data job. The distributed nodes enable the integration of a DataFlux dfIntelliServer with a DataFlux Data Management Server. With this integration, verification processing can be offloaded to another machine to help ease the burden on the Data Management Server.

Name Description
Address Update Lookup Runs the United States Postal Service (USPS) National Change of Address (NCOA) data through NCOALink® processing. NCOALink is validated by the postal service in two steps for certification similar to Coding Accuracy Support System (CASS) certification. This node implements the necessary job step to do the processing. For more information, see Using the Address Update Add-On.
Address Verification (US/Canada) Verifies addresses from the US and Canada.
Address Verification (QAS) Verifies addresses for countries outside the United States (US) and Canada. To use this node, you must obtain your address reference databases and licenses directly from QAS.
Loqate Verifies international addresses using Loqate software.
Address Verification (World) Verifies addresses from outside of the US and Canada. This node is similar to QAS address verification, but it supports verification and correction for addresses from over 200 locales.
Address Verification (World 2) Verifies international addresses using Address Doctor 5 software. For more information, see Working with the Address Verification (World 2) Node.
Geocoding Matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, Federal Information Processing Standards (FIPS), and block information.
Street-Level Geocoding Conducts street-level geographic location analysis.
US City/State/Zip Validation Verifies that a city and state are correct for the ZIP code provided in the data input.
County Matches information from the phone and geocode reference databases with Federal Information Processing Standards (FIPS) codes. To use this node, you must have FIPS codes in your data.
US City/State/Zip Lookup Looks up either the city or state by the ZIP code or the ZIP code by city and state.
Phone Matches information from the phone reference database with telephone numbers in your data to determine related information. To use this node, you must have telephone numbers in your data.
Area Code Matches information from the phone reference database with ZIP codes in your data to calculate Area Code, Overlay, and Result values. To use this node, you must have ZIP codes in your data.
Canadian Postal Code Lookup Looks up Canadian postal codes. The output returns a range of addresses for a postal code, including a range of street numbers, street names, cities, provinces, and postal codes.
Distributed Geocoding Offloads geocode processing to a machine other than the one running the current job.
Distributed Address Verification Offloads address verification to a machine other than the one running the current job.
Distributed Phone Offloads phone data processing to a machine other than the one running the current job.
Distributed Area Code Offloads area code data processing to a machine other than the one running the current job.
Distributed County Offloads processing of county data to a machine other than the one running the current job.

Entity Resolution

You can use the entity resolution nodes to perform record matching. Record matching merges multiple files (or duplicate records within a single file) in such a way that the records referring to the same physical object are treated as a single record. Records are matched based on the information that they have in common. For examples of how several of these nodes can be used together in a data job, see Working with Entity Resolutions, Suggestion-Based Matching, and Combination-Based Matching.

Name Description
Match Codes Generates match codes that are used to identify duplicate records according to your match criteria.
Match Codes (Parsed) Generates match codes for data that has already been parsed into its constituent parts. The match codes are used to identify duplicate records according to your match criteria.
Clustering Creates a cluster ID that is appended to each input row. Many rows can share a cluster ID, which indicates that those rows match according to the specified clustering criteria. For a toy illustration, see the example after this table.
Surviving Record Identification Examines clustered data and determines a surviving record for each cluster. This surviving record identification (SRI) process allows for eliminating duplicate information in a data source. The surviving record is identified using one or more user-configurable record rules.
Cluster Aggregation Accepts the output of a Clustering node and aggregates clusters that share the same membership (based on Primary Key). Since multiple match codes might have different scores, the Cluster Aggregation node is also able to reconcile clusters of identical membership but different scoring.
Cluster Diff Compares sets of clustered records by taking inputs from each of two tables that are referred to as a "left" and a "right" table. From each table, the node takes two inputs: a record ID field and a cluster number field.
Cluster Analysis Compares pairs of rows within a single match cluster to determine whether each pair is really a match.
Sub-Clustering Functions like the Clustering node, but at a lower level.
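
As a toy illustration of how the Clustering node's output can be read, the sketch below assigns a shared cluster ID to rows whose match codes agree. Real clustering conditions are configured in the node and can combine multiple fields; the field names and match codes here are invented.

rows = [
    {"id": 1, "match_code": "SM$$J"},
    {"id": 2, "match_code": "SM$$J"},
    {"id": 3, "match_code": "BR$$A"},
]

cluster_ids = {}
for row in rows:
    # Rows that share a match code receive the same appended cluster ID.
    row["cluster_id"] = cluster_ids.setdefault(row["match_code"],
                                               len(cluster_ids) + 1)

print(rows)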

Utilities

You can use the utility nodes to perform specialized tasks in a data job.

Name Description
Expression Enables you to add a DataFlux Expression Engine Language (EEL) expression to the flow in a data job.
Data Validation Analyzes the content of data by setting validation conditions that are used to filter data for a more accurate view of that data. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
Calculated Field Calculates a score from any number of incoming numeric fields and a selected algorithm. The resulting score is placed in a new column. Each field included in the calculation is assigned a weight that determines how much impact it has on the resulting score.
Concatenate Combines one or more fields into a single field.
Branch Enables up to 32 Expression nodes to simultaneously access data from a single source. Depending on the configuration, data is passed from the Branch node directly to each of the Expression nodes, or it is temporarily stored in memory or disk caches before being passed along.
Realtime Service Accesses a real-time service on a DataFlux Data Management Server from your DataFlux Data Management Studio job.
Sequencer (Autonumber) Creates a sequence of numbers given a starting number and a specified interval. Because this node establishes a sequence without repeating values, this step can be useful when creating entries for primary key values.
Field Layout Renames and reorders fields as they pass out of this node.
Java Plugin Runs a specified Java program in the context of a data job. See Running a Java Program in a Data Job.
Safe String Encode/Decode Makes a string safe for use in SQL statements, FTP transmissions, and other tasks by removing punctuation, spaces, and non-ASCII characters.
Email and FTP Adds a step that sends the output from a DataFlux Data Management Studio job to one or more recipients by e-mail or FTP.
External Program (Delimited Input and Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a delimited text file. The node then takes data from the STDOUT of the executable and parses it as a delimited text file. For a sketch of this STDIN/STDOUT hand-off, see the example after this table.
External Program (Delimited Input, Fixed Width Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a delimited text file. The node then takes data from the STDOUT of the executable and parses it as a fixed-width text file.
External Program (Fixed Width Input, Delimited Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a fixed-width text file. The node then takes data from the STDOUT of the executable and parses it as a delimited text file.
External Program (Fixed Width Input and Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a fixed-width text file. The node then takes data from the STDOUT of the executable and parses it as another fixed-width text file.
External Program (Delimited Output) Takes data from the STDOUT of an external executable and parses it as a delimited text file.
External Program (Fixed Width Output) Takes data from the STDOUT of an external executable and parses it as a fixed-width text file.
XML Column Input Reads XML from a column in a row of data as an input in a data job. It enables you to augment rows with columns pulled from a column containing XML. For example, you could use this node in conjunction with the XML Column Output node and the Web Service node. You could also use XML column input and output when you need to pull data out of data sources that store data as XML. For an example of how to use this node, see Managing an HTTP Request.
XML Column Output Collects the columns in a row into XML that is placed in a single column. For example, you could use this node in conjunction with the XML Column Input node and the Web Service node. You could also use XML column input and output when you need to pull data out of data sources that store data as XML.
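
To show what the External Program nodes expect from an external executable, here is a minimal sketch of a stand-alone filter that reads delimited rows from STDIN and writes delimited rows to STDOUT. The file name, delimiter, column layout, and transformation are hypothetical; the point is only the STDIN-in, STDOUT-out contract.

# hypothetical_filter.py: read comma-delimited rows from STDIN,
# uppercase the second column, and write comma-delimited rows to STDOUT.
import csv
import sys

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

for row in reader:
    if len(row) >= 2:
        row[1] = row[1].upper()
    writer.writerow(row)

Run from a shell as: python hypothetical_filter.py < input.csv > output.csv. The External Program (Delimited Input and Output) node performs the same hand-off, passing the job's fields to the program's STDIN and parsing the program's STDOUT back into fields.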

Monitor

You can use the monitoring nodes to monitor data programmatically and process the output from data monitoring operations.

Name Description
Data Monitoring Applies a task to its input table. Each task specifies one or more business rules and one or more events that can be triggered based on the results that are returned by a rule. The events triggered by the task can be used to monitor data quality. For an example of how to use this node, see Creating a Data Monitoring Job.
Repository Info Lists the repositories that are defined on the Administration riser in DataFlux Data Management Studio. Lists the rules and tasks that are in the current repository.
Repository Primary Lists summary information about the executions of all tasks in the current repository. This information could be used as input to other nodes in a data monitoring job.
Repository Detail Lists detailed information about the executions of all tasks in the current repository. The information for each execution includes the values for each row returned by the task. These rows can be grouped by execution attributes and sorted by rule attributes.
Repository Log Lists fields that are associated with the business rules that are defined in the current repository. This node is used by some DataFlux solutions.
Execute Business Rule Enables you to select an existing, row-based business rule and apply it to rows of data as they flow through a data job. This node bypasses the need to associate business rules to tasks and associated events to triggered rows. For an example of how to use this node, see Using the Execute Business Rule Node.
Execute Custom Metric Enables you to select an existing custom metric and apply it to a set of data in a data job. This node extends the usage of custom metrics beyond group and set rules and data profiling. For an example of how to use this node, see Using the Execute Custom Metric Node.

Profile

You can use the profiling nodes to profile data programmatically and process the output from data profile analysis.

Note: If you are extracting data from XML, first use the XML Input node to extract your data into a text file or database, and then profile that text file or database.

Name Description
Pattern Analysis Looks for patterns in a data source.
Basic Statistics Generates basic statistics about your data sources.
Frequency Distribution Adds frequency distribution profiling to the flow.
Basic Pattern Analysis Runs a simplified version of pattern analysis that enables you to run pattern analysis on character sets, digits, or combined input with both characters and numeric digits.
