Use the Sort and De-Duplicate Data in Hadoop directive to create jobs that include some or all of the following steps:
- If needed, group rows based on selected columns and then summarize selected numeric columns for each group.
- If not summarizing, specify the removal of duplicate rows.
- Filter rows into the target table by applying rules to selected columns.
- Remove, reposition, and rename the columns in the target table. Add columns for HiveQL expressions as needed.
- Sort target rows by selecting one or more columns for ascending or descending values.
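
The steps above correspond roughly to familiar SQL clauses (GROUP BY with an aggregate, SELECT DISTINCT, WHERE, column aliases, ORDER BY). The following is a minimal sketch of that correspondence, not the code that the directive generates. It assumes a hypothetical `sales` table with `region`, `product`, and `amount` columns, and it uses Python's built-in `sqlite3` module as a stand-in for Hive so that the example runs anywhere. The order in which the directive applies filtering relative to summarizing is also an assumption here.

```python
import sqlite3

# Hypothetical source table; the names and values are illustrative only.
rows = [
    ("east", "widget", 10.0),
    ("east", "widget", 10.0),  # exact duplicate of the previous row
    ("west", "gadget", 25.0),
    ("west", "widget", 5.0),
    ("east", "gadget", 40.0),
]

# SQLite stands in for Hive; only clauses common to both dialects are used.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Path 1: group rows by a column, summarize a numeric column,
# filter with a rule, and sort the result in descending order.
summarized = conn.execute(
    """
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    WHERE amount >= 10
    GROUP BY region
    ORDER BY total_amount DESC
    """
).fetchall()

# Path 2: no summarizing, so remove duplicate rows instead.
# Columns are repositioned, one is renamed, and rows are sorted ascending.
deduplicated = conn.execute(
    """
    SELECT DISTINCT product AS item, region, amount
    FROM sales
    WHERE amount >= 10
    ORDER BY amount ASC
    """
).fetchall()

conn.close()
```

The two queries mirror the choice in the first two steps: a job either summarizes groups or removes duplicate rows, and the remaining steps (filter, column management, sort) apply in both cases.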