Use queries to group rows based on the values in one
or more columns and then summarize selected numeric columns. The summary
data appears in new columns in the target table.
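
As a rough illustration of grouping and summarizing, the sketch below uses Python's built-in sqlite3 module as a stand-in for Hive. The `orders` table and its columns are hypothetical, and the Hive code that the directive generates might look different.

```python
import sqlite3

# In-memory database standing in for a Hadoop source table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("East", "A", 100.0),
        ("East", "B", 50.0),
        ("West", "A", 70.0),
        ("West", "A", 30.0),
    ],
)

# Group rows by the values in one column, then summarize a numeric column.
# Each aggregate appears as a new column in the result.
summary = conn.execute(
    """
    SELECT region,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Each result row holds one group value followed by the summary columns for that group.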
Use joins to combine
source tables. A join compares the values in the “join-on” columns
that you select in each of the source tables. The result
of the join depends on which values match in the join-on columns, and
on the type of join that you select. Four types of joins are available:
inner, left, right, and full.
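
To make the difference between the four join types concrete, here is a minimal pure-Python sketch of standard SQL join semantics. The `join` helper, the sample tables, and the column names are all hypothetical; they are not part of the directive.

```python
def join(left, right, left_key, right_key, how="inner"):
    """Pair rows from two tables whose join-on values match.

    Returns (left_row, right_row) tuples; None marks the missing side
    of an unmatched row that the join type preserves.
    """
    if how not in ("inner", "left", "right", "full"):
        raise ValueError(f"unknown join type: {how}")
    result = []
    matched_right = set()
    for l_row in left:
        found = False
        for i, r_row in enumerate(right):
            # As in SQL, a missing (None) value never matches anything.
            if l_row[left_key] is not None and l_row[left_key] == r_row[right_key]:
                result.append((l_row, r_row))
                matched_right.add(i)
                found = True
        # Left and full joins keep left rows that have no match.
        if not found and how in ("left", "full"):
            result.append((l_row, None))
    # Right and full joins keep right rows that have no match.
    if how in ("right", "full"):
        for i, r_row in enumerate(right):
            if i not in matched_right:
                result.append((None, r_row))
    return result


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"cust_id": 2, "item": "pen"}, {"cust_id": 3, "item": "ink"}]

inner = join(customers, orders, "id", "cust_id", "inner")
left = join(customers, orders, "id", "cust_id", "left")
right = join(customers, orders, "id", "cust_id", "right")
full = join(customers, orders, "id", "cust_id", "full")
```

With this sample data, the inner join keeps only the matching pair, the left join also keeps the unmatched customer, the right join also keeps the unmatched order, and the full join keeps both.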
The Query or Join Data
in Hadoop directive enables you to create jobs that combine multiple
joins or queries. In the resulting table, you can remove unwanted
rows and columns, remove duplicate rows, and rearrange columns. Before
you execute the job, you can edit the generated Hive code and paste in
additional Hive code. The directive follows these steps:
-
Select the initial source table.
-
Join tables to the initial table
as needed.
-
Define queries that group rows by column values
and aggregate numeric values, again as needed.
-
For jobs that do not include queries,
use rules to filter unwanted rows from the target. (Queries require
all rows.)
-
For join-only jobs, select, arrange,
and rename target columns.
-
For join-only jobs, apply Hive
SQL expressions in new or existing target columns.
-
Sort target rows based on specified
target columns.
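
The steps for a join-only job can be sketched as a single SQL statement. The example below again uses Python's sqlite3 module as a stand-in for Hive; the tables, columns, and filter rule are hypothetical, and the Hive code that the directive actually generates is not shown here and might differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT, state TEXT);
    CREATE TABLE orders (cust_id INTEGER, item TEXT, price REAL, qty INTEGER);
    INSERT INTO customers VALUES (1, 'Ann', 'NC'), (2, 'Bob', 'NY'), (3, 'Cy', 'NC');
    INSERT INTO orders VALUES
        (1, 'pen', 2.0, 10), (2, 'ink', 5.0, 1),
        (3, 'pad', 3.0, 4),  (1, 'pen', 2.0, 10);
    """
)

target = conn.execute(
    """
    SELECT DISTINCT                      -- remove duplicate rows
           c.name          AS customer,  -- select, arrange, and rename columns
           o.item          AS product,
           o.price * o.qty AS total      -- SQL expression in a new column
    FROM customers c
    INNER JOIN orders o                  -- join a table to the initial table
            ON c.id = o.cust_id
    WHERE c.state = 'NC'                 -- rule that filters unwanted rows
    ORDER BY total DESC                  -- sort target rows
    """
).fetchall()
conn.close()

for row in target:
    print(row)
```

In this sketch the duplicate order for Ann collapses to one row, and the NY customer is filtered out before sorting.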