Clusters

Clusters are numbered groups of values that generate identical match codes or that have an exact match of characters. Clusters are used in the creation of schemes using the DQSCHEME procedure. Within each cluster, the most frequently occurring value becomes the transformation value for the scheme.
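The relationship between clusters and a scheme can be sketched in a few lines of Python. This is an illustration only, not SAS code and not how the DQSCHEME procedure is implemented. The helper `toy_match_code` is a hypothetical stand-in for a real match code (it only lowercases and strips punctuation and spaces), and the sketch assumes that the transformation value is the most frequent value within each cluster.

```python
import re
from collections import Counter, defaultdict

def toy_match_code(value):
    """Hypothetical stand-in for a match code: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def build_scheme(values):
    """Map each clustered value to the most frequent value in its cluster."""
    clusters = defaultdict(list)
    for value in values:
        clusters[toy_match_code(value)].append(value)

    scheme = {}
    for members in clusters.values():
        if len(members) < 2:
            continue  # this sketch treats a single value as unclustered
        target = Counter(members).most_common(1)[0][0]
        for value in members:
            scheme[value] = target
    return scheme

values = ["Acme Corp", "Acme Corp", "ACME Corp.", "acme corp", "Zeta LLC"]
scheme = build_scheme(values)
print(scheme["ACME Corp."])  # Acme Corp
```

Here the four spellings of the same organization share one toy match code, so they form one cluster, and the spelling that occurs twice becomes the value that the others are transformed to.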

Householding with the DQMATCH Procedure

You can use the DQMATCH procedure to generate cluster numbers as it generates match codes. An important application for clustering is commonly referred to as householding. Members of a family or household are identified in clusters that are based on multiple criteria and conditions.
To establish the criteria and conditions for householding, use multiple CRITERIA statements and CONDITION= options within those statements.
  • The integer values of the CONDITION= options are reused across multiple CRITERIA statements to establish groups of criteria.
  • Within each group, match codes are created for each criterion.
  • If a source row is to receive a cluster number, all of the match codes in the group must match all of the codes in another source row.
  • The match codes within a group are therefore evaluated with a logical AND.
If more than one condition number is specified across multiple CRITERIA statements, there are multiple groups of criteria and therefore multiple groups of match codes. In this case, a source row receives a cluster number when any one of its groups of match codes matches the same group in another source row. The groups are therefore evaluated with a logical OR.
For an example of householding, assume that a data set contains customer information. To assign cluster numbers, you use two groups of two CRITERIA statements. One group (condition 1) uses two CRITERIA statements to generate match codes based on the names of individuals and an address. The other group (condition 2) generates match codes based on organization name and address. A cluster number is assigned to a source row when either of its pairs of match codes matches the corresponding pair of match codes from another source row. The code and output for this example are provided in Clustering with Multiple CRITERIA Statements.
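The AND-within-a-group, OR-across-groups logic described above can be sketched in Python. This is an illustration of the logic only, not SAS code and not the DQMATCH implementation. The helper `toy_match_code` is a hypothetical stand-in for a real match code, the variable names and sample rows are invented, and two choices are assumptions of the sketch: rows linked through different conditions are merged into one cluster, and blank values never match.

```python
import re
from collections import defaultdict

def toy_match_code(value):
    """Hypothetical stand-in for a match code: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def assign_clusters(rows, conditions):
    """conditions maps a condition number to the variables in that group.
    Returns one cluster number per row, or None for rows with no match."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # OR across groups: a match within any one group links two rows.
    for variables in conditions.values():
        first_row_with = {}
        for i, row in enumerate(rows):
            # AND within a group: every match code in the group must agree.
            codes = tuple(toy_match_code(row[v]) for v in variables)
            if not all(codes):
                continue  # sketch assumption: blank values never match
            if codes in first_row_with:
                parent[find(i)] = find(first_row_with[codes])
            else:
                first_row_with[codes] = i

    sizes = defaultdict(int)
    for i in range(len(rows)):
        sizes[find(i)] += 1

    numbers, result = {}, []
    for i in range(len(rows)):
        root = find(i)
        if sizes[root] < 2:
            result.append(None)  # no other row matched on any group
        else:
            result.append(numbers.setdefault(root, len(numbers) + 1))
    return result

rows = [
    {"name": "Bob Beckett", "org": "Acme Corp", "addr": "392 S. Main St."},
    {"name": "bob beckett", "org": "Widgets Inc", "addr": "392 S Main St"},
    {"name": "Ann Lee", "org": "ACME Corp.", "addr": "392 S. Main St."},
    {"name": "Ann Lee", "org": "Zeta LLC", "addr": "10 Oak Ave"},
]
conditions = {1: ["name", "addr"], 2: ["org", "addr"]}
print(assign_clusters(rows, conditions))  # [1, 1, 1, None]
```

The second row joins the first through condition 1 (name and address), and the third row joins through condition 2 (organization and address). The fourth row shares a name with the third but not an address, so neither group matches in full and it receives no cluster number.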

Clustering with Exact Criteria

Use the EXACT= option of the DQMATCH procedure's CRITERIA statement to use exact character matches as part of your clustering criteria. Exact character matches are helpful in situations where you want to assign cluster numbers using a logical AND of an exact match on one variable, such as a number, and the match codes of a character variable.
For example, you could assign cluster numbers using two criteria: one using an exact match on customer ID values and the other using a match code generated from customer names. The syntax of the EXACT= option is provided in DQMATCH Procedure.
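The customer ID example can be sketched in Python as follows. This illustrates only the logical AND of an exact comparison and a match-code comparison; it is not SAS code and does not show the CRITERIA statement syntax. The helper `toy_match_code` is a hypothetical stand-in for a real match code, and the field names `custid` and `name` are invented for the illustration.

```python
import re
from collections import defaultdict

def toy_match_code(value):
    """Hypothetical stand-in for a match code: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def cluster_exact_and_match_code(rows):
    """Cluster rows that agree on the exact custid AND on the name match code."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # custid is compared as is; name is compared through its match code.
        key = (row["custid"], toy_match_code(row["name"]))
        groups[key].append(i)

    clusters = [None] * len(rows)
    number = 0
    for members in groups.values():
        if len(members) > 1:  # a row needs at least one matching partner
            number += 1
            for i in members:
                clusters[i] = number
    return clusters

customers = [
    {"custid": 1001, "name": "Mary Jones"},
    {"custid": 1001, "name": "MARY JONES"},
    {"custid": 1002, "name": "Mary Jones"},
    {"custid": 1001, "name": "Mark Jones"},
]
print(cluster_exact_and_match_code(customers))  # [1, 1, None, None]
```

The third row has a matching name but a different ID, and the fourth has a matching ID but a different name, so neither satisfies both criteria and neither receives a cluster number.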