Query or Join Data in Hadoop

Introduction

Query or Join Data in Hadoop icon
Use queries to group rows based on the values in one or more columns and then summarize selected numeric columns. The summary data appears in new columns in the target table.
Use joins to combine source tables. The join is based on a comparison of values in “join-on” columns that are selected for each of the source tables. The result of the join depends on matching values in the join-on columns, and on the selected type of the join. Four types of joins are available: inner, left, right, and full.
The Query or Join Data in Hadoop directive enables you to create jobs that combine multiple joins or queries. In the resulting table, you can remove unwanted rows and columns, remove duplicate rows, and rearrange columns. Before you execute the job, you can edit the generated Hive code and paste-in additional Hive code. The process of the directive is defined as follows:
  • Select a source table.
  • Join tables to the initial table as needed.
  • Define queries that group columns and aggregate numeric values, again as needed.
  • For jobs that do not include queries, use rules to filter unwanted rows from the target. (Queries require all rows.)
  • For join-only jobs, select, arrange, and rename target columns.
  • For join-only jobs, apply Hive SQL expressions in new or existing target columns.
  • Sort target rows based on specified target columns.

Example

Follow these steps to use the Query or Join Data In Hadoop directive.
  1. In the SAS Data Loader directives page, click Query or Join Data in Hadoop.
  2. In the Query task, click the browse icon browse icon.
  3. In the Select a Table window, scroll through the Location list and click a schema. Then click a source table in the Table list, and then click OK.
    Select a Table window
  4. If your job includes no joins, click Next to open the Summarize Rows task.
  5. To join your source table with other tables, click Add Join, and then click Next.
  6. In the Join row, click the browse icon browse icon and select the table for the join.
  7. As needed, click the Join field and select a join type other than the default join type Inner.
    Inner
    The inner join finds matching values in the join-on columns and writes one row to the target. The target row contains all columns from both source tables. A row from either source table is not written to the target if it contains a null value in the join-on column. A row is also not written to the target if the value in the join-on column does not match a value in the join-on column in the other source table.
    Left
    The left or left-full join writes to the target all rows from the left table of the join statement. If a match does not exist between the join-on columns, null values are written to the target for the columns of the right table in the join.
    Right
    The right or right-full join reverses the definition of the left join. All rows from the right table appear in the target. If no values match between the join-on columns, then null values are written to the target for the columns of the table on the left side of the join statement.
    Full
    The full join combines the left and right joins. If a match exists between the join-on columns, then a single row is written to the target to represent those two source rows. If the left or right table has a value in the join-on column that does not match, then the data for that row is written to the target. Null values are written into the columns from the other source table.
  8. In the Join-on row, click the left join-on column and select a replacement for the default column, as needed.
    Join-on statement showing left and right column selections
    Note: The left and right designations in the join-on statement define the output that is generated by the available left join and right join.
  9. Click the right join-on column to select a replacement for the column, as needed.
  10. To add more join columns, click the Add icon Add a new join-on column icon at the end of the Join-on row. When you add a second pair of join-on columns, a match between the source tables consists of a match in the first pair of join-on values and a match between the second pair of join-on values.
  11. To join a third table to the joined table that unites the two source tables, click Add join.
    Completed example join
  12. Click Next and wait a moment while the application assembles in memory the names of the joined columns.
  13. In the Summarize Rows task, if you do not need to summarize, click Next.
    Note: If your source data is in Hive 13 format or lower, the Summarize task will not handle special characters in column names. To resolve the issue, either rename the columns or move the source table into Hive 14 format.
  14. To add summarizations, click the Group rows by field, and then click the column that you want to use as the primary grouping in your target table. For example, if you are querying a table of product sales data, then you could group rows by the product type column.
    Note:
    • If your job includes joins, note that the Group rows by list includes all columns from your source tables.
    • If you intend to paste a Hive query into this directive, then you can click Next two times to display the Code task.
  15. To subset the first group with a second group, and to generate a second set of aggregations, click Add column.
  16. To generate multiple aggregations, you can add additional groups. The additional groups will appear in the target table as nested subgroups. Each group that you define will receive its own aggregations.
    To add a group, click Add Column, and then repeat the previous step to select a different column than the first group. In a table of product sales data, you could choose a second group by selecting the column product_code.
  17. In Summarize columns, select the first numeric column that you want to aggregate.
  18. In Aggregation, select one of the following:
    Count
    specifies the number of rows that contain values in each group.
    Count Distinct
    specifies the number of rows that contain distinct (or unique) values in each group.
    Max
    specifies the largest value in each group.
    Min
    specifies the smallest value in each group.
    Sum
    specifies the total of the values in each group.
  19. In New column name, either accept the default name of the aggregation column, or click to specify a new name.
  20. To add an aggregation, click Add Column.
    Completed query window
  21. When the aggregations are complete, click Next.
  22. In the Filter Data task, all source rows are included in the target by default. To accept this default, click Next.
  23. If your job includes joins but no summarizations, then you can select No duplicate rows to remove duplicate rows from the target. Older versions of Hive do not support the selection of both No duplicate rows and All Rows.
  24. To filter rows from the target, choose one of the following:
    1. To filter rows using one or more rules, click Specify rules and proceed to the next step. You can specify multiple rules and apply them using logical AND and OR operators.
    2. To filter rows using a Hive expression, click Specify expression and go to Step 26.
  25. To filter rows by specifying one or more rules, follow these steps:
    1. Click Select a column and choose the source column that forms the basis of your rule.
    2. Click and select a logical Operator. The operators that are available depend on the type of the data in the source column. For example, the following image shows the operators that are available for the date/time data type:
      Operators available for date/time source columns
    3. In the Value field, add the source column value that completes the expression. In the preceding example, the two rules combine to read “Filter from the target all source rows with an income less than 50,000.00 and born on or after May 1, 1985.”
    4. Click Add Rule as needed to add another rule. Select a different column, operator, and value. To associate a new rule with the previous rules, either retain the default AND operator or click AND and select OR.
    5. When your rules are complete, go to Step 27.
  26. To filter rows using a Hive expression, follow these steps:
    1. In the Hive expression text box, either enter or paste a Hive expression.
    2. To add Hive functions to your expression, click Functions in the Resources box, expand a category, select a function, and click Arrow Right Plus Icon.
      Hive BETWEEN Function selected from Resources list box
      To add column names to your expression, position the cursor in the Hive expression box, click Columns in the Resources box, click a source column, and then click Arrow Right Plus Icon.
  27. When your rules or expression are complete, click Next to open the Columns task.
  28. Use the Columns task to select, order, and rename the columns that will be written into the target table. Also use the Columns task to select, order, and rename the columns that will be written into the target task to apply Hive expressions to new or existing columns.
    Note: This Columns task is available only if your job does not contain summaries. If your job does contain summaries, then click Next to display the Sort task.
  29. Use the Columns task to do the following:
    • Select and order the columns in the target using these arrow icons select all columns icon(select all), select one column icon (select one), remove one column icon or delete one column icon(remove one), and remove all columns icon (remove all).
    • Replace the suggested column names as needed by clicking in the Target Name column.
    • Reorder columns by clicking a column and clicking move column to first positon icon (move to first column), move column one position left icon (move column left one position), move column one position right icon (move column right one position), and move column to last position icon (move column to last position).
    • Add new columns for Hive expressions. Click the Add icon Add Hive expression icon, and then specify the column name.
    • Add a new column and open the Advanced Editor to develop a Hive expression. Click Add column, use Advanced Editor icon.The expression uses the values in other columns to populate the new column.
    • Edit an existing column by applying a Hive expression. Click Edit Hive expression in existing column with the Advanced Editor icon.
  30. Follow these steps to specify a Hive expression:
    1. In the Column name field, either enter a new column name or verify the name of edited column.
    2. Paste Hive code into the Hive expression box. Edit the text as needed. Add Hive functions and source column names using the Resources box.
  31. Click Next to close the Column name task and open the Target Table task.
  32. In the Target Table task, to learn about the contents of a table, click the table and click the Table Viewer icon Table Viewer icon.
  33. To write your target data to an existing table, click that table and click Next. Any and all existing data is replaced.
  34. To save data to a new target table, click New Table icon/button, enter a table name in the New Table window, and click OK.
    The names of tables must meet the naming conventions of SAS and Hadoop.
  35. To display your target data as a temporary view, click save as a view icon. Saving as a view displays your target data in the Sample Data Viewer without saving the results to a table on disk.
    When your target selection is complete, click Next to open the Code task
  36. In the Code task, click Edit HiveQL Code to edit the generated code. Click Reset Code to restore the original generated code. Click Next to open the Result task.
    Note: Edit your HiveQL code with care. The code in the editor is the exact code that will be executed by your job, regardless of previous selections.
  37. In the Result task, you can review the previous tasks by clicking on the gray taskbars at the top of the window.
  38. Click Save or Save As to save your job.
  39. Click Start querying data to execute your directive. To monitor the progress of your job, see the Run Statusdirective.