
DataFlux Data Management Studio 2.6: User Guide

Data Job Nodes

Data jobs are the main way to process data in DataFlux Data Management Studio. You can add the following types of nodes to data jobs.

The tables below give a brief description of each node. To display the online Help for a node, click its name in the table. You can also open a data job in the data job editor, select a node in the Nodes tree, and then click the Help link in the pane at the bottom of the Nodes tree.

Data Job

Name Description
Data Job (reference) Points to a data job file (*.dds file) on the file system. Enables you to include an existing data job, with its set of data-processing operations, within the current data job.

Data Inputs

You can use these nodes to specify different types of input to a data job.

Name Description
Data Source Specifies a table as an input in a data job. For an example of how to use this node, see Creating a Data Job in the Folders Tree. See also Reading Named Ranges in an Excel Spreadsheet.
SQL Query Specifies an SQL query that selects data from one or more tables. The results table is used as an input in a data job. For an example of how to use this node, see Adding a Data Job Node to a Process Job.
Text File Input Specifies a delimited text file as an input in a data job. For an example of how to use this node, see Using Text Files in a Data Job.
Fixed Width File Input Specifies a fixed-width text file as an input in a data job. For an example of how to use this node, see Using Text Files in a Data Job.
External Data Provider Provides a landing point for source data that is external to the current job. Accepts source data from another job or from user input that is specified at run time. Can be used as the first node in a data job that is called from another job. Can also be used as the first node in a data job that is deployed as a real-time data service. For an example of how to use this node, see Deploying a Data Job as a Real-Time Service.
Table Metadata Extracts metadata for a specified table. This information can be used to identify changes in the corresponding physical table.
JMS Reader Reads information from a Java Message Service (JMS) and makes this information available in a data job. See Overview of JMS Nodes.
COBOL Copybook Incorporates flat files from a mainframe environment into a data job.
Job Specific Data Sends sample data to the External Data Provider node for the purpose of testing and debugging. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
XML Input Specifies an XML file as an input in a data job. Reads selected pieces of XML and presents them collectively as rows in a table. If the XML source file contains multiple tables, use one XML Input node per table. For a conceptual sketch of this row extraction, see the example after this table.
Work Table Reader Specifies a work table as an input in a data job.
Document Conversion Enables you to address unstructured data by obtaining UTF-8 encoded text for different document types. In addition to the UTF-8 encoded text, the node can output metadata found in the converted document, such as author, title, and the number of pages. For an example of how to use this node, see Converting and Extracting a Document.
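
The following sketch illustrates the row-extraction idea behind the XML Input node: selected XML elements are presented as rows in a table. It is a minimal Python example using the standard xml.etree.ElementTree module; the element names (customers, customer) and the resulting columns are invented for illustration and are not part of the node's configuration.

# One row per selected <customer> element; one column per child element.
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer><id>1</id><name>Acme</name></customer>
  <customer><id>2</id><name>Zenith</name></customer>
</customers>
"""

rows = [{child.tag: child.text for child in customer}
        for customer in ET.fromstring(doc).iter("customer")]
print(rows)  # [{'id': '1', 'name': 'Acme'}, {'id': '2', 'name': 'Zenith'}]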

Data Outputs

You can use these nodes to specify different types of output from a data job.

Name Description
Data Target (Update) Outputs data in a variety of data formats and allows for updating existing data.
Data Target (Insert) Outputs data in a variety of data formats to a new data source (leaving your existing data as-is) or overwriting your existing data.
Delete Record Eliminates records from a data source by using the unique key of those records. For a sketch of this key-based deletion, see the example after this table.
HTML Report Creates an HTML-formatted report from the results of your data job and enables you to edit it. For an example of how to use this node, see Getting a List of Reference Data Manager Domains.
Text File Output Creates a plain-text file from the results of your data job as rows of data are received. The node closes the file as soon as no more rows are received, which makes the file available to other nodes on the same page. For an example of how to use this node, see Converting and Extracting a Document.
Fixed Width File Output Outputs your data to well-defined fixed-width fields in your output file. For an example of how to use this node, see Using Text Files in a Data Job.
Frequency Distribution Chart Creates a chart that shows how selected values are distributed throughout your data. For an example of how to use this node, see Creating a Data Job in the Folders Tree.
Entity Resolution File Output Writes clustered data to an Entity Resolution file. This file can be viewed from the Entity Resolution folder in the Folders tree. For an example of how this node can be used, see Generating an Entity Resolution File.
Match Report Produces a report that lists the duplicate records identified by your match criteria. You can then view the report with the Match Report Viewer.
JMS Writer Writes information from a data job to a Java Message Service (JMS). See Overview of JMS Nodes.
XML Output Writes data from a data job to XML.
Work Table Writer Writes data from a data job to a work table.
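
To make the key-based deletion performed by the Delete Record node more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, key column, and key values are hypothetical and are not part of the node's configuration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Zenith"), (3, "Orion")])

# Each incoming record is removed from the data source by its unique key.
keys_to_delete = [(2,), (3,)]
conn.executemany("DELETE FROM customers WHERE id = ?", keys_to_delete)

print(conn.execute("SELECT * FROM customers").fetchall())  # [(1, 'Acme')]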

Data Integration

You can use the data integration nodes to sort, join, and otherwise integrate data.

Name Description
Data Sorting Re-orders the data set at any point in a data job. For an example of how to use this node, see Deploying a Data Job as a Real-Time Service.
Domain Items Lists the items within one or more Reference Data Manager domains. In order to use this node, domains must have been created in Reference Data Manager, which is a separately licensed DataFlux Web Studio module. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
Domains Lists Reference Data Manager domains. For an example of how to use this node, see Listing Reference Data Manager Domains.
Data Joining Combines two data sets so that the records of one, the other, or both data sets are used as the basis for the resulting data set. The data joining process is similar to SQL joins. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
Data Joining (Non-Key) Joins two tables, each with the same number of records, by location in the file rather than by a unique key. This approach is much quicker than a traditional data joining step that expects to bring records together based on a unique key.
Data Union Combines all of the data from two data sets. Like an SQL union, this node simply adds the two data sets together; the resulting data set contains one record for each record in each of the original data sets.
SQL Lookup Enables finding rows in a database table that have one or more fields matching those in the data job. This style of processing can provide a significant performance advantage, especially with large databases.
SQL Execute Constructs and executes any valid SQL statement (or series of statements).
Parameterized SQL Query Writes an SQL query that contains variable inputs, also known as parameters. Each parameter in the query is represented by a question mark. When a row of data is processed, the parameters are replaced with the values of the input fields that have been designated as parameter inputs, and the query is executed. For a sketch of this parameter substitution, see the example after this table.
Web Service Issues SOAP requests to a SOAP server. The requests are constructed using an XML Template that can be discovered. You can extract fields from the response with an XML Template that defines where the fields occur in the response.
HTTP Request Issues HTTP requests to an HTTP server. The requests, and optional replies, are built by the caller and can be any content that the server is expecting. For an example of how to use this node, see Managing an HTTP Request.
SAP Remote Function Call Provides access to SAP data and metadata in the context of a data job.
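
To make the question-mark parameter convention of the Parameterized SQL Query node more concrete, here is a minimal sketch using Python's sqlite3 module, which happens to use the same ? placeholder style. The table, fields, and input rows are assumptions for illustration only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, state TEXT, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ND', 'Acme')")

# Each ? is a parameter that is filled from the designated input fields.
query = "SELECT name FROM customers WHERE id = ? AND state = ?"

# The query is executed once for each incoming row of parameter values.
for params in [(1, "ND"), (2, "SD")]:
    print(params, conn.execute(query, params).fetchall())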

Quality

You can use the quality nodes to analyze data that is specified in a data job.

Name Description
Gender Analysis Determines a gender value from a list of names. The result is placed in a new field and has three possible values: M (male), F (female), and U (unknown).
Gender Analysis (Parsed) Performs gender analysis on data that has already been parsed into given and last name fields. The gender analysis definition determines a gender value from a list of names.
Identification Analysis Determines the type of data that a string represents. For example, the node could be used to determine whether a string represents an individual's name or the name of an organization.
Parsing Separates multi-part field values into multiple, single-part fields.
Standardization Makes similar items the same. Examples of standardization are correcting misspellings (Mary instead of Mmary), using full company names instead of initials (International Business Machines instead of IBM), and using consistent naming conventions for states (North Dakota instead of ND). For a sketch of scheme-based standardization, see the example after this table.
Standardization (Parsed) Performs standardization on data that has already been parsed into its constituent parts.
Change Case Makes all alphabetical values in a field uppercase, lowercase, or proper case.
Locale Guessing Accesses information from the Quality Knowledge Base (QKB) and compares with your data to guess the country (locale) to which your data applies. This step creates an additional field that contains the locale's code.
Right Fielding Copies data from one field to another based on the data type, as determined by an identification analysis definition.
Create Scheme Generates a custom scheme file that can be used to standardize data outside of the usual DataFlux Data Management Studio standardization definitions.
Dynamic Scheme Application Performs a single scheme standardization on a specific field containing multiple locales.
Field Extraction Pulls information from a free-form text field so you can analyze product information.
Document Extraction Identifies textual entities and their usage within a given text input. The node identifies the terms (words) found in the input text and each term's usage categorization, such as vehicle, person, title, or company. For an example of how to use this node, see Converting and Extracting a Document.
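
The following sketch illustrates the general idea behind scheme-based standardization, as used by the Standardization, Create Scheme, and Dynamic Scheme Application nodes: a scheme acts as a lookup table that maps variant values to a single standard value. The mappings below are invented examples, not an actual DataFlux scheme.

# A toy "scheme": variant spellings map to one standard form.
scheme = {
    "IBM": "International Business Machines",
    "I.B.M.": "International Business Machines",
    "ND": "North Dakota",
    "N. Dakota": "North Dakota",
}

values = ["IBM", "N. Dakota", "Acme Corp"]

# Values found in the scheme are replaced; everything else passes through.
standardized = [scheme.get(v, v) for v in values]
print(standardized)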

Enrichment

You can use the enrichment nodes, together with third-party reference databases, to enrich, standardize, and augment the data that is specified in a data job. The distributed nodes enable the integration of a DataFlux dfIntelliServer with a DataFlux Data Management Server. With this integration, verification processing can be offloaded to another machine to help ease the burden on the Data Management Server.

Name Description
Address Update Lookup Runs the United States Postal Service (USPS) National Change of Address (NCOA) data through NCOALink® processing. NCOALink is validated by the postal service in two steps for certification similar to Coding Accuracy Support System (CASS) certification. This node implements the necessary job step to do the processing. For more information, see Using the Address Update Add-On.
Address Verification (US/Canada) Verifies addresses from the US and Canada.
Address Verification (QAS) Verifies addresses for countries outside the United States (US) and Canada. To use this node, you must obtain your address reference databases and licenses directly from QAS.
Loqate Verifies international addresses using Loqate software.
Address Verification (World) Verifies addresses from outside of the US and Canada. This node is similar to QAS address verification, but it supports verification and correction for addresses from over 200 locales.
Address Verification (World 2) Verifies international addresses using Address Doctor 5 software. For more information, see Working with the Address Verification (World 2) Node.
Geocoding Matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, Federal Information Processing Standards (FIPS), and block information.
Street-Level Geocoding Conducts street-level geographic location analysis.
US City/State/Zip Validation Verifies that a city and state are correct for the ZIP code provided in the data input.
County Matches information from the phone and geocode reference databases with Federal Information Processing Standards (FIPS) codes. To use this node, you must have FIPS codes in your data.
US City/State/Zip Lookup Looks up either the city or state by the ZIP code or the ZIP code by city and state.
Phone Matches information from the phone reference database with telephone numbers in your data to determine related information. To use this node, you must have telephone numbers in your data.
Area Code Matches information from the phone reference database with ZIP codes in your data to calculate Area Code, Overlay, and Result values. To use this node, you must have ZIP codes in your data.
Canadian Postal Code Lookup Looks up Canadian postal codes. The output returns a range of addresses for a postal code, including a range of street numbers, street names, cities, provinces, and postal codes.
Distributed Geocoding Offloads geocode processing to a machine other than the one running the current job.
Distributed Address Verification Offloads address verification to a machine other than the one running the current job.
Distributed Phone Offloads phone data processing to a machine other than the one running the current job.
Distributed Area Code Offloads area code data processing to a machine other than the one running the current job.
Distributed County Offloads processing of county data to a machine other than the one running the current job.

Entity Resolution

You can use the entity resolution nodes to perform record matching. Record matching merges multiple files (or duplicate records within a single file) in such a way that the records referring to the same physical object are treated as a single record. Records are matched based on the information that they have in common. For examples of how several of these nodes can be used together in a data job, see Working with Entity Resolutions, Suggestion-Based Matching, and Combination-Based Matching.

Name Description
Match Codes Generates match codes that are used to identify duplicate records according to your match criteria.
Match Codes (Parsed) Generates match codes for data that has already been parsed into its constituent parts. The match codes are used to identify duplicate records according to your match criteria.
Clustering Creates a cluster ID that is appended to each input row. Many rows can share a cluster ID, which indicates that those rows match according to the specified clustering criteria. For a toy illustration, see the example after this table.
Surviving Record Identification Examines clustered data and determines a surviving record for each cluster. This surviving record identification (SRI) process allows for eliminating duplicate information in a data source. The surviving record is identified using one or more user-configurable record rules.
Cluster Aggregation Accepts the output of a Clustering node and aggregates clusters that share the same membership (based on Primary Key). Since multiple match codes might have different scores, the Cluster Aggregation node is also able to reconcile clusters of identical membership but different scoring.
Cluster Diff Compares sets of clustered records by taking inputs from each of two tables that are referred to as a "left" and a "right" table. From each table, the node takes two inputs: a record ID field and a cluster number field.
Cluster Analysis Compares pairs of rows within a single match cluster to determine whether each pair is really a match.
Sub-Clustering Functions like the Clustering node, but at a lower level.
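
As a toy illustration of how the Clustering node's output can be read, the sketch below assigns a shared cluster ID to rows whose match codes agree. Real clustering conditions are configured in the node and can combine multiple fields; the field names and match codes here are invented.

rows = [
    {"id": 1, "match_code": "SM$$J"},
    {"id": 2, "match_code": "SM$$J"},
    {"id": 3, "match_code": "BR$$A"},
]

cluster_ids = {}
for row in rows:
    # Rows that share a match code receive the same appended cluster ID.
    row["cluster_id"] = cluster_ids.setdefault(row["match_code"],
                                               len(cluster_ids) + 1)

print(rows)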

Utilities

You can use the utility nodes to perform specialized tasks in a data job.

Name Description
Expression Enables you to add a DataFlux Expression Engine Language (EEL) expression to the flow in a data job.
Data Validation Analyzes the content of data by setting validation conditions that are used to filter data for a more accurate view of that data. For an example of how to use this node, see Using Reference Data Manager Domain Items in a Job.
Calculated Field Calculates a score from any number of incoming numeric fields and a selected algorithm. The resulting score is placed in a new column. Each field included in the calculation is assigned a weight that determines how much impact it has on the resulting score.
Concatenate Combines one or more fields into a single field.
Branch Enables up to 32 Expression nodes to simultaneously access data from a single source. Depending on the configuration, data is passed from the Branch node directly to each of the Expression nodes, or it is temporarily stored in memory or disk caches before being passed along.
Realtime Service Accesses a real-time service on a DataFlux Data Management Server from your DataFlux Data Management Studio job.
Sequencer (Autonumber) Creates a sequence of numbers given a starting number and a specified interval. Because this node establishes a sequence without repeating values, this step can be useful when creating entries for primary key values.
Field Layout Renames and reorders fields as they pass out of this node.
Java Plugin Runs a specified Java program in the context of a data job. See Running a Java Program in a Data Job.
Safe String Encode/Decode Makes a string safe for use in SQL statements, FTP transmissions, and other tasks by removing punctuation, spaces, and non-ASCII characters.
Email and FTP Adds a step that sends the output from a DataFlux Data Management Studio job to one or more recipients by e-mail or FTP.
External Program (Delimited Input and Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a delimited text file. The node then takes data from the STDOUT of the executable and parses it as a delimited text file. For a sketch of this STDIN/STDOUT hand-off, see the example after this table.
External Program (Delimited Input, Fixed Width Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a delimited text file. The node then takes data from the STDOUT of the executable and parses it as a fixed-width text file.
External Program (Fixed Width Input, Delimited Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a fixed-width text file. The node then takes data from the STDOUT of the executable and parses it as a delimited text file.
External Program (Fixed Width Input and Output) Passes data fields to the STDIN of an executable outside of DataFlux® products as a fixed-width text file. The node then takes data from the STDOUT of the executable and parses it as another fixed-width text file.
External Program (Delimited Output) Takes data from the STDOUT of an external executable and parses it as a delimited text file.
External Program (Fixed Width Output) Takes data from the STDOUT of an external executable and parses it as a fixed-width text file.
XML Column Input Reads XML from a column in a row of data as an input in a data job. It enables you to augment rows with columns pulled from a column containing XML. For example, you could use this node in conjunction with the XML Column Output node and the Web Service node. You could also use XML column input and output when you need to pull data out of data sources that store data as XML. For an example of how to use this node, see Managing an HTTP Request.
XML Column Output Collects the columns in a row into XML that is placed in a single column. For example, you could use this node in conjunction with the XML Column Input node and the Web Service node. You could also use XML column input and output when you need to pull data out of data sources that store data as XML.
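
To show what the External Program nodes expect from an external executable, here is a minimal sketch of a stand-alone filter that reads delimited rows from STDIN and writes delimited rows to STDOUT. The file name, delimiter, column layout, and transformation are hypothetical; the point is only the STDIN-in, STDOUT-out contract.

# hypothetical_filter.py: read comma-delimited rows from STDIN,
# uppercase the second column, and write comma-delimited rows to STDOUT.
import csv
import sys

reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

for row in reader:
    if len(row) >= 2:
        row[1] = row[1].upper()
    writer.writerow(row)

Run from a shell as: python hypothetical_filter.py < input.csv > output.csv. The External Program (Delimited Input and Output) node performs the same hand-off, passing the job's fields to the program's STDIN and parsing the program's STDOUT back into fields.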

Monitor

You can use the monitoring nodes to monitor data programmatically and process the output from data monitoring operations.

Name Description
Data Monitoring Applies a task to its input table. Each task specifies one or more business rules and one or more events that can be triggered based on the results that are returned by a rule. The events triggered by the task can be used to monitor data quality. For an example of how to use this node, see Creating a Data Monitoring Job.
Repository Info Lists the repositories that are defined on the Administration riser in DataFlux Data Management Studio. Lists the rules and tasks that are in the current repository.
Repository Primary Lists summary information about the executions of all tasks in the current repository. This information could be used as input to other nodes in a data monitoring job.
Repository Detail Lists detailed information about the executions of all tasks in the current repository. The information for each execution includes the values for each row returned by the task. These rows can be grouped by execution attributes and sorted by rule attributes.
Repository Log Lists fields that are associated with the business rules that are defined in the current repository. This node is used by some DataFlux solutions.
Execute Business Rule Enables you to select an existing, row-based business rule and apply it to rows of data as they flow through a data job. This node bypasses the need to associate business rules to tasks and associated events to triggered rows. For an example of how to use this node, see Using the Execute Business Rule Node.
Execute Custom Metric Enables you to select an existing custom metric and apply it to a set of data in a data job. This node extends the usage of custom metrics beyond group and set rules and data profiling. For an example of how to use this node, see Using the Execute Custom Metric Node.

Profile

You can use the profiling nodes to profile data programmatically and process the output from data profile analysis.

Note: If you are extracting data from XML, first use the XML Input node to extract your data into a text file or database, and then profile that text file or database.

Name Description
Pattern Analysis Looks for patterns in a data source.
Basic Statistics Generates basic statistics about your data sources.
Frequency Distribution Adds frequency distribution profiling to the flow.
Basic Pattern Analysis Runs a simplified version of pattern analysis that enables you to run pattern analysis on character sets, digits, or combined input with both characters and numeric digits.
