Sort and De-Duplicate Data in Hadoop

Introduction

Sort and De-Duplicate Icon in SAS Data Loader Window
Use the Sort and De-Duplicate Data in Hadoop directive to create jobs that include some or all of the following steps:
  1. Group rows based on selected columns and then summarize numeric columns for each group and subgroup.
  2. If not summarizing, specify the removal of duplicate rows and filter rows from the target.
  3. Remove, reposition, and rename the columns in the target table. Add columns that receive the results of Hive expressions.
  4. Sort target rows by selecting one or more columns for ascending or descending values.

Example

Follow these steps to use the Sort and De-Duplicate directive:
  1. Open SAS Data Loader, as described in Getting Started.
  2. In the Source Table task, click a table in your default data source and click Next. To select a source table from another data source, click Return to Data Source Icon.
  3. Use the Summarize Rows task to group rows in the target according to column values, and then summarize numeric values for each group or subgroup.
    If you do not want to generate summary values for groups of rows, or if you want to remove duplicate rows, click Next to display the Filter task and go .
    Note: If your source data is in Hive 13 format or lower, the Summarize task will not handle special characters in column names. To resolve the issue, either rename the columns or move the source table into Hive 14 format.
    Follow these steps to use the Summarize Rows task:
    1. Click Group rows by and select a column. To generate nested groups with additional summary values, click Add Column.
      Group Rows for Summary Values
    2. Click Summarize column and select a column that will be used to generate summary values for each group. The summarized values will appear in a new column in the target.
    3. Click Aggregation and select the summary type. The available summary types are defined as follows:
      Count
      specifies the number of rows that contain values in each group.
      Count Distinct
      specifies the number of rows that contain distinct (or unique) values in each group.
      Max
      specifies the largest value in each group.
      Min
      specifies the smallest value in each group.
      Sum
      specifies the total of the values in each group.
    4. Click New column name to change the default column name for the new target column that will receive summarized data. The new target column will contain a summary value for each group and subgroup.
    5. Click Add Column to specify a second summary and target column.
      Summarizations example
    6. When your groups and summaries are complete, click Next to display the Filter task.
  4. The Filter task enables you to remove duplicate rows and filter or remove unnecessary rows from the target.
    If you specify summaries or if you do not need to filter rows from the target, then click Next to display the Columns task.
    Follow these steps to use the Filter task:
    1. To remove from the target any rows that are identical to another row, click No duplicate rows.
    2. To filter rows from the target using one or more rules, click Specify rules.
      To filter rows using a Hive expression, click Specify expression and go to Step 4d .
    3. To filter rows by specifying one or more rules, follow these steps:
      1. Click Column and choose the source column that forms the basis of your rule.
      2. Click and select a logical Operator. The operators that are available depend on the type of the data in the source column. For example, the following image shows the operators that are available for the character data type:
        Operators available to filter character columns
      3. In the Value field, add the source column value that completes the expression. In the preceding example, the two rules combine to read “Filter from the target all contacts from Texas who were born on or after January 1, 1990.”
      4. Click Add Rule as needed to add another rule. Select a different column, operator, and value. To associate a new rule with the previous rules, either retain the default AND operator or click AND and select OR.
      5. When your rules are complete, click Next to display the Columns task and go to Step 5.
    4. To filter rows using a user-written Hive expression, follow these steps:
      1. In the Hive expression text box, either enter or paste a Hive expression.
      2. To add Hive functions to your expression, click Functions in the Resources box, expand a category, select a function, and click Arrow Right Plus Icon.
        Hive CAST Function selected from Resources list box
        To add column names to your expression, position the cursor in the Hive expression box, click Columns in the Resources box, click a source column, and then click Arrow Right Plus Icon.
      3. When your expression is complete, click Next to open the Columns task.
  5. Use the Columns task to remove, reorder, or rename columns. You can also add columns that receive the results of user-written Hive expressions.
    If you defined summaries in the Summarize Rows task, or if you do not need to modify columns, then click Next to display the Sort task and go to to Step 6.
    Follow these steps to use the Columns task:
    1. Click Specify Columns to display the Selected Columns and Available Columns.
    2. In Selected Columns, click icons to perform the following tasks:
      • To rename columns, click rename columns icon, or click and enter the new name in the Target Name column.
      • To rearrange or reorder columns, click the up and down arrow icons. Note that the top row in Selected Columns is the leftmost column, or column 1, in the target table.
      • To remove columns from the target, click the trash can icon or the left arrow icons.
      • To add a new column and add data to that column using an existing Hive expression, click add icon. In the Hive expression column, paste your Hive expression.
      • To add new columns that contain a Hive expression, and to develop that expression using the Advanced Editor, click Add column using Advanced Editor icon. The Advanced Editor helps you browse and select Hive functions and column names for your expression. To learn how to use the Advanced Editor, see Using the Advanced Editor for Hive Expressions.
      • To edit with the Advanced Editor columns that contain a Hive expression, click Icon to edit column with Advanced Editor.
    3. When your columns are complete, click Next to display the Sort task.
  6. If you have not defined any summaries in the Summarize Rows task, then the Sort task enables you to group rows based on ascending or descending values.
    If you defined summaries, click Next to display the Target task and go to the next step.
    Sorting Rows
  7. In the Target Table task, select a location for the target table. When the table list appears, either select an existing target or click New Table. To generate a temporary table that is not saved to disk, select Save as a View. Click Next.
  8. In the Code task, review and edit the generated Hive code. Note that if you edit the code, you will lose your edits if you change a task and regenerate code. Click Next.
  9. In the Result task, click Save or Save As to save your job. You can then access that job in Saved Directives. Click Start querying data to run your job.

Using the Advanced Editor for Hive Expressions

In the directive Sort and De-Duplicate Data in Hadoop, in the Columns task, you use the Advanced Editor to add or edit user-written Hive expressions. The expressions are run by the Sort and De-Duplicate job to add data to new target columns. The Advanced Editor enables you to insert column names and Hive function syntax into your expressions.
Follow these steps to use the Advanced Editor:
  1. As needed in the Columns task, click Add column using Advanced Editor icon or Icon to edit column with Advanced Editor to open the Advanced Editor.
  2. In the Advanced Editor, in the Column Name field, enter a name for a new column or rename an existing column. The fields Column type and Column length describe the selected column.
  3. To build an expression, you can start by pasting Hive code from your clipboard. To edit or build your expression, click the column names and functions in the Resources box.
    Tip
    When you select a function, syntax help is displayed at the bottom of the Resources box.
    Advanced Editor for Hive Expressions
  4. To save your expression and return to the Columns task, click Save. To save and create another new column and expression, click Save and New. In the Columns task, new columns are displayed at the bottom of the Selected Columns box.