Cluster Analysis Node
You can add a Cluster Analysis node to a data job to compare pairs of rows within a single match cluster and determine whether each pair is really a match. After you add the node, double-click it to open its properties dialog. The properties dialog includes the following elements:
Name - Specifies a name for the node.
Notes - Enables you to open the Notes dialog. You can use the dialog to enter optional details or any other relevant information for the input.
Options - Displays the Options dialog, where you can set the following options:
- Memory to use for processing each cluster - Specifies the amount of memory to use for storing a single cluster of rows. Unlike in the Clustering node, increasing this setting yields little or no performance improvement.
- Primary key field - Specifies the field whose unique value identifies each row in a table. This field is optional. If a primary key field is specified, it must be included in the pass-through fields. If it is not specified, the node assigns sequential integers, beginning with 0, as primary key values for all input rows from all input clusters. When a new cluster is loaded, the numbering continues from where it left off in the previous cluster.
- Output Rule Score - When selected, the Cluster Analysis node creates an output field for each configured rule that contains the score for that rule.
- Skip self-compare rows - When selected, self-compare row pairs (pairs in which a row is compared with itself) are omitted from the output whenever one or more of the real row pairs pass.
- Total score field name - Specifies the field name for the total score field in your output file. This field will contain the sum of scores of all the configured rules.
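The default key numbering described under Primary key field can be sketched as follows. This is only an illustration of the documented behavior, not the node's implementation; the function name `assign_primary_keys` and the representation of clusters as lists of rows are assumptions made for the example.

```python
from typing import Any, Iterable, Iterator, Tuple


def assign_primary_keys(
    clusters: Iterable[Iterable[Any]],
) -> Iterator[Tuple[int, Any]]:
    """Yield (key, row) pairs, numbering rows from 0 across all clusters.

    Hypothetical helper: the counter is NOT reset when a new cluster
    starts, mirroring the documented "continues from where it left off".
    """
    next_key = 0
    for cluster in clusters:
        for row in cluster:
            yield next_key, row
            next_key += 1


# Two clusters: the second cluster's rows continue the numbering.
example_clusters = [["row A", "row B"], ["row C"]]
keyed_rows = list(assign_primary_keys(example_clusters))
print(keyed_rows)
```

Because the counter runs across clusters, every input row receives a distinct key even though clusters are loaded one at a time.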
The Record rules section of the dialog enables you to create rules using expressions for the Cluster Analysis node. These rules are applied to pairs of rows. Each rule should return a positive score. The section includes the following elements:
Passing Score - In combination with the Method for passing setting, determines whether a row pair should be considered a match (pass) or not (fail). A row pair is written to the output only if it passes.
Method for passing - Defines how the passing score will be interpreted and how individual rule scores will be used when determining if rows in a pair are considered a match (pass) or not (fail). Select one of the following options from the drop-down list:
- Use total sum of scores
- Pass if any one rule passes
- Fail if any one rule fails
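One plausible reading of the three methods is sketched below. The documentation does not define how an individual rule "passes", so the sketch assumes that a rule passes when its own score is greater than or equal to the passing score, and that the total-sum method compares the sum of all rule scores with the passing score. The function name `pair_passes`, the method keywords, and the use of `>=` rather than `>` are all assumptions made for the example.

```python
from typing import Sequence


def pair_passes(
    rule_scores: Sequence[float], passing_score: float, method: str
) -> bool:
    """Decide whether a row pair passes (hypothetical interpretation).

    method is one of:
      "sum" - Use total sum of scores
      "any" - Pass if any one rule passes
      "all" - Fail if any one rule fails
    """
    if method == "sum":
        # Assumed: compare the total of all rule scores with the threshold.
        return sum(rule_scores) >= passing_score
    if method == "any":
        # Assumed: a single rule reaching the threshold is enough.
        return any(score >= passing_score for score in rule_scores)
    if method == "all":
        # Assumed: every rule must reach the threshold.
        return all(score >= passing_score for score in rule_scores)
    raise ValueError(f"unknown method: {method!r}")


# Scores from three illustrative rules for one row pair.
scores = [40, 10, 70]
print(pair_passes(scores, 60, "sum"))  # True: 120 >= 60
print(pair_passes(scores, 60, "any"))  # True: 70 >= 60
print(pair_passes(scores, 60, "all"))  # False: 40 and 10 are below 60
```

Under these assumptions, the same set of rule scores can pass or fail depending on the selected method, so choose the passing score with the method in mind.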
Record rules table - Displays the current record rules.
Add - Displays the Record Rule dialog. You can use the dialog to create new rules, which you must name and designate as either Basic or Advanced. Use the drop-down menus in the columns to perform the following functions:
- Select the Record 1 field
- Select the comparison
- Select the Record 2 field
You can use the Add and Delete buttons to maintain the list of record rules.
Edit - Enables you to modify the record rules.
The Output fields section of the dialog includes the following elements:
Available - Displays the fields that you can make available for the next step in your data job. Items displayed in this list are dependent on your data sources and any preceding steps in your data job.
Selected - Displays the fields that will be made available to the next node in your data job.
You can access the following advanced properties by right-clicking the Cluster Analysis node: