Match-Merge Data

Introduction

Match-Merge Data icon in the SAS Data Loader window
Use the Match-Merge Data directive to combine rows from two or more source tables into a single row in a target table. Rows are combined according to the values of one or more matched columns.
Matched columns are common to all source tables. Matched columns require compatible basic data types of numeric or character.
Rows are merged when values match in specified merge-by columns in two or more source tables. If you specify more than one merge-by column, then target rows are grouped, with one subgroup for each merge-by column.
The source table that provides the data for the merged row is determined by an ordered list of source tables. The source table that contributes merged data is the last table in the list that contains a matching merge-by value.
In the target, the merged row receives values from selected matching columns. Values from specified unmatched columns are also included in the merge.
Unmatched columns from any source table can be renamed and added as target columns. The target receives data from unmatched columns in merged rows and in rows that carry over without a merge.
You can define a filter to exclude unwanted rows from the target. The filter is a user-written SAS DS2 expression. If the expression evaluates to true, then the row is written to the target. The Advanced Editor expression builder in the Filter Rows task displays DS2 functions, provides syntax help, and enables you to add functions to your expression with a click.
You can add new columns to the target that contain calculated values for each target row. The calculations are specified as user-written DS2 expressions.
To learn more about the match-merge process, see How does Merge work? in a new Match-Merge directive. The Help information is provided in the Order Source Tables task, after you select source tables.
Online Help “How does Merge work?

Example

Introduction

Follow through this section using your own source tables to create and run a job using the Match-Merge Data directive.
If you have a question about a particular task in the directive, refer directly to that section.

Select Source Tables

The Match-Merge Data directive opens in the Source Table task. The initial display lists the tables in the Hive default data source, or in the data source that you last accessed in your current session.
Select a source table, and then click Next to select the second table. To add another table, click Add Another Source Table.
To select a source table from a different data source, click Return to Data Sources.
For more information about data sources and tables, see Viewing Data Sources and Tables.
When your list of tables to be merged is complete, click Next to display the Order Source Tables task.

Change Maximum Character Length

At any point in the use of the Match-Merge directive, you can change the maximum length of character values as they are read, processed, and output by SAS. The source values are not changed in Hadoop.
When you change the maximum character length, the change applies to a single source table. The default maximum character length (1024) is not changed. You can change the maximum character length for all of the source tables in the match-merge job.
To change the default maximum character length in SAS, see Change the Maximum Length for SAS Character Columns.
Follow these steps to view or change the maximum character length:
  1. If necessary, click a Source Table taskbar at the top of the directive.
  2. In the taskbar, click Edit Advanced Options Icon.
  3. In the Directive Settings window, view or change the value Maximum length for SAS character columns.

Order Source Tables

Use the Order Source Tables task to determine the source table that provides target data for a given merged row. The source table that provides data for a merged row is the last table in the list that contains a matching merge-by value for the given row.
In the Order Source Tables task, the top table is the first in the list. The bottom table is the last in the list.
The default order of tables in the Order Source Tables task replicates the order in which the source tables were selected. To change the default order of tables, click the up or down arrow icons.
List of source tables, the last table writes the target
When the order of tables is complete, click Next to open the Matched Columns task.

Matched Columns

Use the Matched Columns task to identify the columns in each source table that have similar content. Some or all of the matched columns are merged in the target. One (or more) will be identified as the merge-by column.
To be functional, matched columns need to have the same name, and their data types must resolve to either numeric or character. The Matched Column task helps you rename source columns and match data types.
Note: The Matched Columns task displays by default all columns in all sources that have a matching name and type.
Column data types can differ. For example, an INTEGER column can be matched with a BIGINT column. Similarly, a CHAR type can be matched with a VARCHAR type. The type of the associated target column is the largest or longest of the initial input types.
When column types match, but names do not, select one column in each source table, and then click Rename columns. Enter the name of the column as it will appear in the target, or accept the default name from the initial source table.
To remove a matched column after you define it, click any instance of the column name in Columns that match, and then click Unmatch columns.
To remove all matched column definitions, click Reset.
Defining matched columns
When the matched columns are defined, click Next to open the Merge By task.

Merge By

Use the Merge By task to specify the matched columns that uniquely identify rows in the source tables. Rows that match in the merge-by column are merged into a single row in the target. All rows with unique values in the merge-by column will appear in the target table.
The merge-by column is similar in purpose to a primary key in a database table.
Merge results can be unpredictable when the selected columns present something other than a one-to-one merge. Seek to define your merge-by columns so that a match identifies a single row in the merge tables. If a given merge table contains several matching rows for a single row in the initial table, then the results of the merge can be unpredictable.
If you select more than one merge-by column, then the order of the columns specifies a group-by arrangement of the rows in the target table. The group-by arrangement arranges target rows by group and subgroup.
CAUTION:
Specifying more than two merge-by columns produces incorrect results.
To learn more about merge-by columns, see the SAS DS2 Language Reference.
Note: If a source table contains two or more instances of the same merge-by variables, then the result of the merge is nondeterministic. In other words, the merge can produce more than one correct result. To generate fully repeatable results, ensure that your source tables contain no more than one row for each set of merge-by variables.
Use Columns to merge by to change the order of the merge-by columns. Select columns and click the vertical arrow icons.
Defining a merge-by column
After you select and order your merge-by columns, click Next to open the Input Columns task.

Input Columns

Use the Input Columns task to specify the source columns that will be merged in the target.
Initially, Available columns lists all source columns, except for the merge-by columns. The source tables appear in merge-by order, from left to right, as defined in the Order Source Tables task. The values that are written to the target come from the participating source table that is position farthest to the right.
To specify the columns that will be merged in the target table, move columns from Available columns to Selected columns.
Tip
To see the full name of a source table, position the cursor on the abbreviated name.
To include in the target matched columns or columns with the same name, move the columns to Selected columns and then click Rename to not match.
To automatically rename and not match two or more instances of a matched column, move those instances into Selected columns. This operation adds a repair warning icon to each instance of the selected columns. To rename the columns automatically and remove the warning icon, click Repair warnings.
The following image depicts six columns that will be merged, two of which will be renamed.
Input Columns task
When you have selected your input columns, click Next to open the New Columns task.

New Columns

Use the New Columns task to define target columns that receive the results of user-written DS2 expressions.
New Columns task
To add a new target column and paste or enter a DS2 expression, click Add New Column Icon.
Note: If your expression contains more than one clause, see Develop Expressions for Directives.
To add a new row and create an expression in the Advanced Editor window, click Add Column and Open Advanced Editor Icon
To create an expression using the Advanced Editor window, click a new row and then click Edit Expression in New Column Icon.

Filter Rows

Use the Filter Rows task to exclude rows from the target table by specifying one or more DS2 expressions.
Initially, All Rows is selected, which indicates that no rows will be filtered from the target. To continue to the next task without filtering rows, click Next.
To filter with a SAS DS2 expression, click Specify expression.
To create a rule that limits the rows to which the DS2 expression will be applied, click Specify rule to indicate. The rule tests each matching row to determine the source tables that include or do not include that row. If the rule is true for a row, then the DS2 expression is evaluated for that row. If the rule is not true, then the DS2 expression is not evaluated and the matching row is written to the target. You can create and apply multiple DS2 expressions. Each DS2 expression can have its own rule.
To specify a rule, select In or Not In for a selected source table. Click Add condition to apply a logical AND or OR and to specify a second source table. Continue to add conditions as needed.
To create a DS2 expression, you can paste an existing expression or enter the expression into the DS2 expression text box. When adding an expression, note that the entire expression is inserted into SAS code. The expression should be syntactically correct with all the statements ending properly with a semi-colon. SAS Data Loader does make one exception to this rule: it adds a semi-colon at the end of an expression if there is no semi-colon anywhere in the expression.
To paste a previously copied DS2 expression, click DS2 expression and press Ctrl+V or your equivalent.
Note: If your expression contains more than one clause, see Develop Expressions for Directives.
To enter a DS2 expression, use the column names and DS2 functions in the Resources list box.
To add a second DS2 expression, click Add expression. Multiple expressions are evaluated in the order in which they appear in the Filter Rows task. The top expression is evaluated first, the bottom expression last.
Filter Rows task
When your Filter Rows task is complete, click Next to display the Management Output Columns task.

Manage Output Columns

Use the Manage Output Columns task to reorder or remove the columns in the resulting target table.
Initially, All columns is selected. To not reorder or remove target columns, click Next.
Note: To ensure data integrity, the first column in the target is required to be the first merge-by column, as defined in the Merge By task. The default order of columns is alphanumeric by column name.
Click Specify columns to display an alphabetic list of target columns. Included in the list are columns that were renamed in the Input Columns task. Also included are any new columns that were added to receive the results of DS2 expressions. New columns are added in the New Columns task.
To reorder a column, select it and then click a vertical arrow icon.
To remove a column, select it and click the left arrow icon.
Manage Output Columns task
When your target columns are properly selected, named, and ordered, click Next to move to the Target Table task.

Target Table

Use the Target Table task to select a new or existing target table for your match-merge job. If you select an existing table, it will be completely overwritten by the match-merge job.
To select or create a target table in a different data source, click Return to Data Sources.
For further information, see Browse Tables.
Target Table task
After you select your target table, click Next to open the Result task.

Result

Use the Result task to run your match-merge job, and examine the resulting target table, generated code, error messages, and log file.