Cluster Aggregation Node

DataFlux Data Management Studio 2.6: User Guide

You can add a Cluster Aggregation node to a data job as a post-clustering step. The node accepts the output of a Cluster node and consolidates clusters that have identical membership, as determined by the Primary Key. The Cluster Aggregation node output is a subset of the input row set, and each output cluster is guaranteed to be distinct from every other output cluster produced in the same job.
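The consolidation can be pictured as grouping rows by cluster ID, comparing the resulting sets of primary-key values, and keeping one cluster for each distinct set. The following Python sketch is illustrative only and is not the node's implementation. The function name `aggregate_clusters`, the `(cluster_id, primary_key)` row layout, and the rule of keeping the first cluster encountered are all assumptions.

```python
from collections import defaultdict


def aggregate_clusters(rows):
    """Keep one cluster per distinct set of primary keys.

    rows: iterable of (cluster_id, primary_key) pairs.
    Returns the rows of the surviving clusters, in input order.
    Assumption: the first cluster seen with a given membership is kept.
    """
    rows = list(rows)

    # Collect the set of primary keys that belongs to each cluster ID.
    members = defaultdict(set)
    for cluster_id, primary_key in rows:
        members[cluster_id].add(primary_key)

    seen_memberships = set()
    kept_ids = set()
    for cluster_id, keys in members.items():  # dicts preserve insertion order
        signature = frozenset(keys)
        if signature not in seen_memberships:
            seen_memberships.add(signature)
            kept_ids.add(cluster_id)

    # The output is a subset of the input rows.
    return [row for row in rows if row[0] in kept_ids]


rows = [
    (1, "A"), (1, "B"),
    (2, "B"), (2, "A"),  # same membership as cluster 1
    (3, "C"), (3, "D"),
]
result = aggregate_clusters(rows)
```

In this sketch, cluster 2 is dropped because its membership matches cluster 1, so `result` contains only the rows of clusters 1 and 3.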

The Cluster Aggregation node also has a special scoring option. Because different match codes can produce different scores, the node must be able to reconcile clusters that have identical membership but different scores. To do this, you can choose among several algorithms in the Scoring Algorithm field.
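This guide does not define how each algorithm is computed, so the following Python sketch shows only one plausible reading. It assumes that scoring by record combines each record's scores across the clusters being reconciled, and that scoring by cluster keeps the scores of one whole cluster selected by a summary statistic. The data, the names `combine_by_record` and `choose_by_cluster`, and both interpretations are assumptions. Mean with Scaling is omitted because its definition is not given here.

```python
from statistics import mean, median

# Hypothetical scores for the same three records, as reported by two
# clusters that have identical membership.
cluster_scores = {
    "c1": {"A": 90, "B": 80, "C": 70},
    "c2": {"A": 100, "B": 60, "C": 95},
}

RECORD_ALGORITHMS = {
    "Maximum": max,
    "Mean": mean,
    "Median": median,
    "Minimum": min,
}

# (how to pick among clusters, how to summarize one cluster)
CLUSTER_ALGORITHMS = {
    "Highest Maximum": (max, max),
    "Highest Mean": (max, mean),
    "Highest Minimum": (max, min),
    "Lowest Maximum": (min, max),
    "Lowest Mean": (min, mean),
    "Lowest Minimum": (min, min),
}


def combine_by_record(cluster_scores, algorithm):
    """Assumed Record behavior: combine each record's scores across clusters."""
    combine = RECORD_ALGORITHMS[algorithm]
    record_keys = next(iter(cluster_scores.values()))
    return {
        key: combine([scores[key] for scores in cluster_scores.values()])
        for key in record_keys
    }


def choose_by_cluster(cluster_scores, algorithm):
    """Assumed Cluster behavior: keep the scores of one whole cluster."""
    pick, summarize = CLUSTER_ALGORITHMS[algorithm]
    chosen = pick(
        cluster_scores,
        key=lambda cid: summarize(list(cluster_scores[cid].values())),
    )
    return dict(cluster_scores[chosen])


by_record_max = combine_by_record(cluster_scores, "Maximum")
by_cluster_highest_mean = choose_by_cluster(cluster_scores, "Highest Mean")
```

Under these assumptions, Maximum takes the larger of each record's two scores, while Highest Mean keeps all of the scores from cluster `c2`, whose mean (85) exceeds that of `c1` (80).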

Once you have added the Cluster Aggregation node, you can double-click it to open its properties dialog. The properties dialog includes the following elements:

Name - Specifies a name for the node.

Notes - Enables you to open the Notes dialog. You use the dialog to enter optional details or any other relevant information for the input.

Cluster ID Field - Identifies the column that contains the cluster ID so that rows can be correlated to the appropriate clusters.

Primary Key Field - Specifies the field to be used as the primary key for comparisons when testing cluster membership against other clusters.

Aggregation Indicator Field - Adds an indicator column to the output. This column is populated with a Boolean value that designates whether a given cluster was aggregated. When the check box is selected, a text box is enabled so that you can name the indicator output column.

Score Field - Displays the column that contains scoring information that can be processed using this node.

Scoring Method - Specifies whether to score the output by cluster or by record. This field is enabled when a value is selected in Score Field.

Preserve Original Scores - Applies the selected scoring algorithm but retains the original score values. This check box is enabled when a value is selected in Scoring Method.

Scoring Algorithm - Displays the algorithm options. This field is enabled when a value is selected in Score Field, and it allows the user to apply one of the scoring algorithms to aggregate clusters. If Record is selected in the Scoring Method field, this field displays options including Maximum, Mean, Mean with Scaling, Median, and Minimum. If Cluster is selected in the Scoring Method field, this field displays options including Highest Maximum, Highest Mean, Highest Minimum, Lowest Maximum, Lowest Mean, and Lowest Minimum.

Remove subclusters - When selected, suppresses any rows whose cluster ID belongs to a subcluster. When multiple match codes are generated, one or more specific records can be in more than one cluster, as shown in the following list:

Cluster 1 contains records A, B, C, and D.
Cluster 2 contains records A, B, and C.
Cluster 3 contains records B, C, and D.
Cluster 4 contains records A, C, and D.

Select Remove subclusters when you want to see only the supercluster, which in this case is Cluster 1. The other clusters are identified as subclusters and removed. Note, however, that selecting the check box affects performance. Even if the data does not contain subclusters, computation time increases because the node evaluates all of the clusters in an attempt to identify subclusters.
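The four-cluster example above can be reproduced with a short Python sketch that treats a subcluster as a cluster whose members form a proper subset of another cluster's members. This is an illustration under that assumption, not the node's actual algorithm, and the function name `remove_subclusters` is hypothetical.

```python
def remove_subclusters(clusters):
    """Drop every cluster whose members are a proper subset of another cluster.

    clusters: dict that maps a cluster ID to a set of record keys.
    """
    return {
        cluster_id: members
        for cluster_id, members in clusters.items()
        # For sets, "<" tests for a proper subset.
        if not any(members < other for other in clusters.values())
    }


clusters = {
    1: {"A", "B", "C", "D"},
    2: {"A", "B", "C"},
    3: {"B", "C", "D"},
    4: {"A", "C", "D"},
}
superclusters = remove_subclusters(clusters)
```

Only Cluster 1 survives. The sketch compares every cluster against every other cluster, which suggests why the option adds computation time even when no subclusters are present.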

You can access the following advanced properties by right-clicking the Cluster Aggregation node:
