Clustering Node

DataFlux Data Management Studio 2.5: User Guide

You can add a Clustering node to a data job to create a cluster ID that is appended to each input row. The node accepts any number of input fields arranged in any number of logically defined conditions. Rows that share a cluster ID match under the clustering criteria that you specify, and many rows can share the same cluster ID. Match code fields are often used for clustering because they provide a degree of fuzziness in the clustering process.
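The idea of conditions can be illustrated outside the product. The following Python sketch is not the node's implementation; it assumes that a condition is a set of fields that must all be equal, that a match on any one condition is enough to cluster two rows, and that matches chain transitively. The function name `cluster_rows` and the field names `name_mc` and `phone` are hypothetical.

```python
def cluster_rows(rows, conditions):
    """Return one cluster ID per row (illustrative sketch only).

    rows: list of dicts mapping field name -> value.
    conditions: list of tuples of field names. Two rows match when all
    fields of at least one condition are equal (assumed semantics).
    """
    parent = list(range(len(rows)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for fields in conditions:
        first_seen = {}  # condition key -> index of first row with that key
        for idx, row in enumerate(rows):
            key = tuple(row.get(f) for f in fields)
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    # Number the clusters in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(rows))]


rows = [
    {"name_mc": "A1", "phone": "555"},
    {"name_mc": "A1", "phone": "777"},
    {"name_mc": "B2", "phone": "777"},
    {"name_mc": "C3", "phone": "999"},
]
# Two conditions: match on name_mc OR match on phone.
cluster_ids = cluster_rows(rows, [("name_mc",), ("phone",)])
print(cluster_ids)  # [0, 0, 0, 1]
```

In this sketch the first three rows share a cluster because row 1 matches row 0 on `name_mc` and row 2 on `phone`.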

Once you have added the Clustering node, you can double-click it to open its properties dialog. The properties dialog includes the following elements:

Name - Specifies a name for the node.

Notes - Enables you to open the Notes dialog. You use the dialog to enter optional details or any other relevant information for the input.

Output Cluster ID Field - Specifies the name of the output field that receives the cluster ID.

Options - Displays the Options dialog, where you can set the following options:

Condition Matched field prefix - When selected, creates a new Boolean field for each clustering condition. The name of each output field is the prefix entered here, followed by the condition index (such as 1, 2, and so on). The value "true" is entered into the field or fields corresponding to the condition that caused the record to be clustered. Note that enabling this option significantly degrades job performance.
Backup path/prefix - Enables you to enter or navigate to the backup filename prefix to instruct CUE to generate state backup files at the end of the clustering process.

If you do not configure these parameters, or if CUE steps are used in Preview mode, CUE will not generate state files.

Treat blank field values as nulls - When selected, prevents blank field values from being clustered together.

Compact cluster numbers - Specifies whether the final cluster numbers are in a sequence without gaps (missing numbers) or not (for example, cluster IDs can come out as {0, 1, 2, 3} or as {0, 4, 10, 33}). Setting this option can noticeably reduce performance of the job for large data sets (100K rows or more), so use the option only when it really is necessary.
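As a rough illustration of what compacting means, the following Python sketch renumbers an arbitrary set of cluster IDs into a gap-free sequence. It is not the node's own code, and the function name `compact_cluster_ids` is hypothetical.

```python
def compact_cluster_ids(cluster_ids):
    """Renumber cluster IDs to 0..n-1, preserving their relative order."""
    mapping = {old: new for new, old in enumerate(sorted(set(cluster_ids)))}
    return [mapping[c] for c in cluster_ids]


print(compact_cluster_ids([0, 4, 4, 10, 33, 10]))  # [0, 1, 1, 2, 3, 2]
```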

Override clustering memory size - When selected, overrides the cluster memory default value.

Note: When a job with clustering nodes is run under DataFlux Data Management Studio, clustering memory size must be configured to be less than four gigabytes, regardless of how much memory the machine has. This limitation exists because DataFlux Data Management Studio is a 32-bit application. To use more than four gigabytes of memory, the job must be run on a 64-bit build of Data Management Server.

If a memory value greater than four gigabytes is configured for clustering and such a job is run under DataFlux Data Management Studio, the clustering node displays a misleading error message: "Cluster Memory Size must be an integer." This error message also displays when a non-integer value is entered for clustering memory in a configuration file or in a node's advanced properties.

Cluster - Specifies which rows are produced on the output: all rows, only rows that are alone in their cluster, or only rows that share a cluster with at least one other row.
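The three output choices can be pictured as a filter on cluster size. The Python sketch below is only an illustration; the function name `filter_by_cluster_size`, the mode names `"all"`, `"single"`, and `"multi"`, and the field name `cluster_id` are hypothetical and do not correspond to values in the dialog.

```python
from collections import Counter


def filter_by_cluster_size(rows, cluster_field, mode="all"):
    """Keep all rows, only single-row clusters, or only multi-row clusters."""
    sizes = Counter(row[cluster_field] for row in rows)
    if mode == "all":
        return list(rows)
    if mode == "single":
        return [r for r in rows if sizes[r[cluster_field]] == 1]
    if mode == "multi":
        return [r for r in rows if sizes[r[cluster_field]] > 1]
    raise ValueError(f"unknown mode: {mode!r}")


clustered = [
    {"id": 1, "cluster_id": 0},
    {"id": 2, "cluster_id": 0},
    {"id": 3, "cluster_id": 1},
]
print([r["id"] for r in filter_by_cluster_size(clustered, "cluster_id", "single")])  # [3]
print([r["id"] for r in filter_by_cluster_size(clustered, "cluster_id", "multi")])   # [1, 2]
```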

Do not cluster field - Specifies the name of a field that contains the do not cluster flag. This field should be defined as an integer data type in your data source. If the value in this field is 1, the row will not be clustered with other records that have the same clustering criteria. If the value in this field is 0 or Null, the row will be included in the cluster.
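The effect of the flag can be sketched in Python as follows. This is a simplified illustration, not the node's implementation: it clusters on a single key field, and it assumes that a flagged row receives a cluster ID of its own. The names `cluster_with_flag`, `mc`, and `dnc` are hypothetical.

```python
def cluster_with_flag(rows, key_field, flag_field):
    """Assign cluster IDs by key, keeping rows flagged with 1 on their own."""
    ids = {}       # key value -> cluster ID shared by unflagged rows
    next_id = 0
    result = []
    for row in rows:
        if row.get(flag_field) == 1:
            # Flagged row: never shares a cluster (assumed behavior).
            result.append(next_id)
            next_id += 1
        else:
            # Flag is 0 or None (null): cluster normally on the key.
            key = row[key_field]
            if key not in ids:
                ids[key] = next_id
                next_id += 1
            result.append(ids[key])
    return result


flagged_rows = [
    {"mc": "A", "dnc": 0},
    {"mc": "A", "dnc": 1},
    {"mc": "A", "dnc": None},
    {"mc": "B", "dnc": 0},
]
print(cluster_with_flag(flagged_rows, "mc", "dnc"))  # [0, 1, 0, 2]
```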

Sort Output by Cluster Number - When selected, sorts records in ascending order according to the cluster number added to each record. Select this check box only if sorting is actually required, since this action can significantly decrease performance.

Note: If you use the Surviving Record Identification node as part of a data job after a Cluster Update node, select this option, because the SRI step expects sorted input.

The Conditions section of the dialog includes the following elements:

Available - Displays the fields that you can make available for the next step in your data job. Items displayed in this list are dependent on your data sources and any preceding steps in your data job.

Selected - Displays the fields that will be made available to the next node in your data job. You can click OR to add an OR condition to the bottom of the Selected Fields list.

Note: You can enable or disable cluster logging in the app.cfg file. Specify CLUSTER/LOG = 1 to enable logging or CLUSTER/LOG = 0 to disable logging.

Additional Outputs - Displays the Additional Outputs dialog. This dialog enables you to specify the fields that you can make available to the next node in your data job.

You can access the following advanced properties by right-clicking the Clustering node:
