Clustering Node

DataFlux Data Management Studio 2.6: User Guide

You can add a Clustering node to a data job to create a cluster ID that is appended to each input row. The node accepts any number of input fields arranged in any number of logically defined conditions. Cluster IDs indicate matches: rows that share a cluster ID match under the specified clustering criteria. Match code fields are often used for clustering because they provide a degree of fuzziness in the clustering process. The cluster engine uses 64-bit data types and can handle input data sets that result in more than 1 billion clusters.

Clustering is the first step in entity resolution. It groups related records by assigning each record a cluster ID. Records with the same cluster ID are considered to be in one cluster. Each cluster is treated as a group that is processed by other entity resolution nodes after the data is clustered. One node that works with clustered data is the Surviving Record Identification node. This node looks at the cluster, chooses a surviving record (which is an entity), and flags it. After that, the clusters can be written out to an entity resolution file. You can then examine the entity resolution file and manually select the surviving records.

A cluster can be created based on a single entity, where an entity is a single field or a collection of fields concatenated together. It can also be created based on multiple entities, where matches within a condition are made across all entities of that condition. For example, a cluster can be created in which Field 1 is one condition, and Fields 2 and 3 are two separate entities of a second condition. For more information, see Set Clustering Properties.

Once you have added the Clustering node, you can double-click it to open its properties dialog. The properties dialog includes the following elements:

Name - Specifies a name for the node.

Notes - Enables you to open the Notes dialog. You use the dialog to enter optional details or any other relevant information for the input.

Output Cluster ID Field - Specifies the name of the output field in which the new cluster IDs are written.

Output clusters - Specifies which rows are produced on the output: all rows, only rows that are in a cluster by themselves, or only rows that are in a cluster with at least one other row. The Clustering node displays the number of rows that it skipped in the output as part of the final status of the node. This count is shown at the end of the job run. Note that the count is relevant when the node is configured to skip single-row or multi-row clusters.
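The three output choices can be pictured with a short Python sketch. It is an illustration only, not the node's implementation: the function name, the mode names ("all", "single", "multi"), and the CLUSTER_ID field name are hypothetical.

```python
from collections import Counter


def filter_clusters(rows, mode="all", id_field="CLUSTER_ID"):
    """Return (kept_rows, skipped_count) for one of three output modes.

    The mode names are hypothetical labels for the three choices:
    "all", "single" (clusters of one row), "multi" (two or more rows).
    """
    # Count how many rows share each cluster ID.
    sizes = Counter(row[id_field] for row in rows)
    if mode == "all":
        keep = lambda n: True
    elif mode == "single":
        keep = lambda n: n == 1
    elif mode == "multi":
        keep = lambda n: n >= 2
    else:
        raise ValueError(f"unknown mode: {mode}")
    kept = [row for row in rows if keep(sizes[row[id_field]])]
    # The skipped count mirrors the figure reported in the node status.
    return kept, len(rows) - len(kept)


rows = [
    {"name": "Ann", "CLUSTER_ID": 0},
    {"name": "Anne", "CLUSTER_ID": 0},
    {"name": "Bob", "CLUSTER_ID": 1},
]
multi_rows, skipped = filter_clusters(rows, mode="multi")
```

In this sample, the "multi" mode keeps the two rows in cluster 0 and skips the single row in cluster 1.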

Override clustering memory size - When selected, overrides the cluster memory default value.

Note: When a job with clustering nodes is run under DataFlux Data Management Studio, clustering memory size must be configured to be less than four gigabytes regardless of how much memory the machine may have. This limitation is due to the fact that DataFlux Data Management Studio is a 32-bit application. In order to use over four gigabytes of memory, the job must be run on a 64-bit build of DataFlux Data Management Server.

If a memory value greater than four gigabytes is configured for clustering and such a job is run under DataFlux Data Management Studio, the Clustering node displays a misleading error message: "Cluster Memory Size must be an integer." This error message is also displayed when a non-integer value is entered for clustering memory in a configuration file or in a node's advanced properties.

Treat blank field values as nulls - When selected, prevents blank field values from being clustered together.
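The effect of this option can be sketched in Python. The sketch assumes exact matching on a single field and assumes that whitespace-only values count as blank. The function and parameter names are hypothetical, and the real cluster engine may differ.

```python
def cluster_by_field(rows, field, blanks_as_nulls=True):
    """Assign one cluster ID per row based on exact equality of one field."""
    ids_by_value = {}
    next_id = 0
    cluster_ids = []
    for row in rows:
        value = row.get(field)
        # Assumption: None is always null; blanks are null only if requested.
        is_null = value is None or (blanks_as_nulls and str(value).strip() == "")
        if is_null:
            # Null values never match anything, so each gets its own cluster.
            cluster_ids.append(next_id)
            next_id += 1
            continue
        if value not in ids_by_value:
            ids_by_value[value] = next_id
            next_id += 1
        cluster_ids.append(ids_by_value[value])
    return cluster_ids


rows = [{"mc": "A"}, {"mc": ""}, {"mc": "A"}, {"mc": ""}]
with_option = cluster_by_field(rows, "mc", blanks_as_nulls=True)
without_option = cluster_by_field(rows, "mc", blanks_as_nulls=False)
```

With the option selected, the two blank rows land in separate clusters. Without it, they share one cluster because their blank values compare as equal.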

Sort Output by Cluster Number - When selected, sorts records in ascending order according to the cluster number added to each record. Select this check box only if sorting is actually required, since this action can significantly decrease performance.

Note: If you use the Surviving Record Identification node as part of a data job after a Cluster Update node, select this option, because the SRI step expects sorted input.

More Options - Displays the Options dialog, where you can set the following options:

Condition Matched field prefix - When selected, creates a new output field for each clustering condition. The name of each output field is the prefix entered here, followed by the condition index (such as 1, 2, and so on). If you set this option, you must select a radio button beneath the condition matched field to specify how the field is processed. Select Return true or false as to whether the condition was matched across the cluster to return a true or false value. Select Return a count of the number of times the condition was matched across the cluster to return a match count.
Cluster size field name - Enables you to specify the name of a new output field for rows returned from the Clustering node. The value in this field will be the number of rows in the specific cluster being returned.
Do not cluster field - Specifies the name of a field that contains the do not cluster flag. This field can contain a Boolean (true or false) value. It can also contain an integer value. In this case, a value greater than 0 is interpreted as true and a value of 0 or NULL is interpreted as false. If a row is flagged as “do not cluster,” that row is placed in a cluster all by itself (and gets a unique cluster ID). The node will not attempt to match the row to any other rows in the input data.
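The interpretation of the do not cluster flag can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the sample field names (mc, dnc) are hypothetical, matching is exact equality on one field, and the treatment of negative integers is a guess, since the text covers only values greater than 0, 0, and NULL.

```python
def is_do_not_cluster(flag):
    """Interpret a do-not-cluster flag as described in the text."""
    if flag is None:            # NULL is interpreted as false
        return False
    if isinstance(flag, bool):  # Boolean values are used directly
        return flag
    return int(flag) > 0        # integers: greater than 0 is true


def assign_cluster_ids(rows, key_field, flag_field):
    """Give flagged rows a unique cluster ID; group the rest by key_field."""
    ids_by_key = {}
    next_id = 0
    cluster_ids = []
    for row in rows:
        if is_do_not_cluster(row.get(flag_field)):
            # A flagged row sits in a cluster by itself with a unique ID.
            cluster_ids.append(next_id)
            next_id += 1
            continue
        key = row[key_field]
        if key not in ids_by_key:
            ids_by_key[key] = next_id
            next_id += 1
        cluster_ids.append(ids_by_key[key])
    return cluster_ids


rows = [
    {"mc": "X", "dnc": 0},
    {"mc": "X", "dnc": 1},
    {"mc": "X", "dnc": None},
    {"mc": "X", "dnc": True},
]
cluster_ids = assign_cluster_ids(rows, "mc", "dnc")
```

All four sample rows carry the same match value, but the two flagged rows each receive their own cluster ID, while the unflagged rows share one.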

The Conditions section of the dialog includes the Records will be clustered if any condition is satisfied field. This field displays the conditions that are used to determine how the records in your data sources are clustered. You can click the Add a cluster condition button or the Edit cluster condition button to open the Clustering Condition dialog. Then you can use this dialog to add or edit one or more conditions to the list in the Records will be clustered if any condition is satisfied field. Finally, you can use the Edit cluster condition and Delete cluster condition buttons to manage the conditions list.
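The "any condition is satisfied" rule can be illustrated with a small union-find sketch in Python. It is not the cluster engine's algorithm. It assumes that two rows satisfy a condition when they have equal, non-null values in every field of that condition, and that matches chain together transitively. It does not model matching across entities within a condition. The function and field names are hypothetical.

```python
def cluster(rows, conditions):
    """Return one cluster ID per row.

    rows: list of dicts. conditions: list of tuples of field names.
    Rows are linked when they agree on every field of at least one condition.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for fields in conditions:
        first_seen = {}
        for i, row in enumerate(rows):
            key = tuple(row.get(f) for f in fields)
            if any(value is None for value in key):
                continue  # null values never satisfy a condition
            if key in first_seen:
                union(first_seen[key], i)
            else:
                first_seen[key] = i

    # Number the clusters in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(rows))]


rows = [
    {"name_mc": "A", "phone": "1", "email": "x"},
    {"name_mc": "A", "phone": "2", "email": "y"},
    {"name_mc": "B", "phone": "2", "email": "y"},
    {"name_mc": "C", "phone": "3", "email": "z"},
]
cluster_ids = cluster(rows, [("name_mc",), ("phone", "email")])
```

In this sample, the first two rows match on the first condition and the second and third rows match on the second condition, so under the transitivity assumption the first three rows end up in one cluster and the fourth row stands alone.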

The Clustering node writes debug log messages with the status of memory and disk usage during the job run. These messages indicate whether the primary data file was pulled into memory before the node started to output data, which is critical to performance whenever the node is configured to sort output clusters. The debug messages are written via "DF.StepEngine.Plugin.Cluster" and are enabled by editing the dfwfproc.log.xml configuration file to write the process log at debug level.

Note: The Cluster Engine library (in addition to the node) can produce its own log file, which is enabled via the app.cfg file. Enabling this log has no effect on job performance. The log file is created in the temp directory for the process running the job. To control the location of this log file, you can set the following option in app.cfg: CLUSTER/TEMP = c:\CElogs\path. When performance or functionality questions arise for the Clustering node, SAS Tech Support will often request this Cluster Engine library log file (in addition to the process log file discussed earlier).

You can also set the following options in the app.cfg file: BASE/SORTMERGES, BY_GROUP, COLLATION, NODUPS, and STABLE.

Additional Outputs - Displays the Additional Outputs dialog. This dialog enables you to specify the fields that you can make available to the next node in your data job.

You can access the following advanced properties by right-clicking the Clustering node:
