Query or Join Data in Hadoop

Introduction

Use queries to group rows based on the values in one or more columns and then summarize selected numeric columns. The summary data appears in new columns in the target table.
Use joins to combine source tables. The join is based on a comparison of values in “join-on” columns that are selected for each of the source tables. The result of the join depends on matching values in the join-on columns, and on the selected type of the join. Four types of joins are available: inner, left, right, and full.
The Query or Join Data in Hadoop directive enables you to create jobs that combine multiple joins and queries, and then customize the target table to remove unwanted rows and columns, remove duplicate rows, and rearrange columns. Before you execute the job, you can edit the generated Hive SQL code and paste in additional Hive SQL code. The directive follows this process:
  • Select a source table.
  • Join tables to the initial table as needed.
  • Define queries that group columns and aggregate numeric values, again as needed.
  • For jobs that do not include queries, use rules to filter unwanted rows from the target. (Queries require all rows.)
  • For join-only jobs, select, arrange, and rename target columns.
  • For join-only jobs, apply Hive SQL expressions in new or existing target columns.
  • Sort target rows based on specified target columns.
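
To give a sense of the code that this directive generates, the following is a minimal HiveQL sketch of a join combined with a grouped summary and a sort. The table and column names (orders, products, product_id, product_type, sale_amount) are illustrative only and are not part of the directive.

  -- Hypothetical source tables: orders and products.
  -- Join the tables, then group the joined rows and summarize a numeric column.
  SELECT
      p.product_type,                      -- grouping column
      COUNT(o.sale_amount) AS sale_count,  -- summarized numeric column
      SUM(o.sale_amount)   AS total_sales
  FROM orders o
  INNER JOIN products p
      ON o.product_id = p.product_id       -- join-on columns
  GROUP BY p.product_type
  ORDER BY total_sales DESC;               -- sort of the target rows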

Example

Follow these steps to use the Query or Join Data in Hadoop directive.
  1. In the SAS Data Loader page, click Query or Join Data in Hadoop.
  2. In the Query page, click the browse icon.
  3. In the Select a Table window, scroll through the Location list and click a schema. Then click a source table in the Table list, and then click OK.
    Select a Table window
  4. If your job includes no joins, click Next to open the Summarize Rows page.
  5. To join your source table with other tables, click Add Join, and then click Next.
  6. In the Join row, click the browse icon and select the table for the join.
  7. As needed, click the Join field and select a join type other than the default, Inner.
    Inner
    The inner join finds matching values in the join-on columns and writes one row to the target. The target row contains all columns from both source tables. A row from either source table is not written to the target if it contains a null value in the join-on column. A row is also not written to the target if the value in the join-on column does not match a value in the join-on column in the other source table.
    Left
    The left or left outer join writes to the target all rows from the left table of the join statement. If a match does not exist between the join-on columns, null values are written to the target for the columns of the right table in the join.
    Right
    The right or right outer join reverses the definition of the left join. All rows from the right table appear in the target. If no values match between the join-on columns, then null values are written to the target for the columns of the table on the left side of the join statement.
    Full
    The full join combines the left and right joins. If a match exists between the join-on columns, then a single row is written to the target to represent those two source rows. If the left or right table has a value in the join-on column that does not match, then the data for that row is written to the target and null values are written into the columns from the other source table.
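    For reference, the following minimal HiveQL sketch shows how the four join types are typically expressed. The tables left_table and right_table and the column id are hypothetical.

      -- Inner join: only rows with matching join-on values in both tables.
      SELECT l.*, r.*
      FROM left_table l
      INNER JOIN right_table r ON l.id = r.id;

      -- Left join: all rows from the left table; nulls fill unmatched right columns.
      SELECT l.*, r.*
      FROM left_table l
      LEFT OUTER JOIN right_table r ON l.id = r.id;

      -- Right join: all rows from the right table; nulls fill unmatched left columns.
      SELECT l.*, r.*
      FROM left_table l
      RIGHT OUTER JOIN right_table r ON l.id = r.id;

      -- Full join: all rows from both tables; nulls fill the unmatched side.
      SELECT l.*, r.*
      FROM left_table l
      FULL OUTER JOIN right_table r ON l.id = r.id;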
  8. In the Join-on row, click the left join-on column and select a replacement for the default column, as needed.
    Join-on statement showing left and right column selections
    Note: The left and right designations in the join-on statement define the output that is generated by the available left join and right join.
  9. Click the right join-on column to select a replacement for the column, as needed.
  10. To add more join columns, click the Add icon at the end of the Join-on row. When you add a second pair of join-on columns, a match between the source tables consists of a match in the first pair of join-on values and a match between the second pair of join-on values.
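    In generated HiveQL, a second pair of join-on columns typically extends the ON clause with an AND condition. A hedged sketch, using the same hypothetical tables and illustrative column names:

      SELECT l.*, r.*
      FROM left_table l
      INNER JOIN right_table r
          ON  l.customer_id = r.customer_id   -- first pair of join-on columns
          AND l.order_date  = r.order_date;   -- second pair of join-on columns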
  11. To join a third table to the joined table that unites the two source tables, click Add Join.
    Completed example join
  12. Click Next, and wait a moment while the application assembles the names of the joined columns in memory.
  13. In the Summarize Rows page, if you do not need to summarize, click Next.
  14. To add summarizations, click the Group rows by field, and then click the column that you want to use as the primary grouping in your target table. For example, if you are querying a table of product sales data, then you could group rows by the product type column.
    Note:
    • If your job includes joins, note that the Group rows by list includes all columns from your source tables.
    • If you intend to paste a Hive SQL query into this directive, then you can click Next twice to bypass the summary and filter pages and reach the Code page.
  15. To subset the first group with a second group, and to generate a second set of aggregations, click Add column.
  16. To generate multiple aggregations, you can add additional groups. The additional groups will appear in the target table as nested subgroups. Each group that you define will receive its own aggregations.
    To add a group, click Add Column, and then repeat the previous step to select a different column than the first group. In a table of product sales data, you could choose a second group by selecting the column product_code.
  17. In Summarize columns, select the first numeric column that you want to aggregate.
  18. In Aggregations, select one of the following:
    Count
    specifies the number of rows that contain values in each group.
    Count Distinct
    specifies the number of rows that contain distinct (or unique) values in each group.
    Max
    specifies the largest value in each group.
    Min
    specifies the smallest value in each group.
    Sum
    specifies the total of the values in each group.
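    In HiveQL terms, each aggregation corresponds to a function applied over the grouped rows. The following is a hedged sketch, assuming a hypothetical product_sales table with columns product_type, product_code, and sale_amount:

      SELECT
          product_type,                                   -- first group
          product_code,                                   -- nested subgroup
          COUNT(sale_amount)          AS sale_count,      -- Count
          COUNT(DISTINCT sale_amount) AS sale_distinct,   -- Count Distinct
          MAX(sale_amount)            AS sale_max,        -- Max
          MIN(sale_amount)            AS sale_min,        -- Min
          SUM(sale_amount)            AS sale_sum         -- Sum
      FROM product_sales
      GROUP BY product_type, product_code;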
  19. In New column name, either accept the default name of the aggregation column, or click to specify a new name.
  20. To add an aggregation, click Add Column.
    Completed query window
  21. When the aggregations are complete, click Next.
  22. In the Filter Data page, all rows are included in the target by default. To accept this default, click Next.
  23. If your job includes joins but no summarizations, then you can select No duplicate rows to remove duplicate rows from the target.
  24. The field Include rows where applies when you specify multiple rules. The default value Rows for which all of these rules apply writes rows to the target only if all of the rules apply. When you select Rows for which any of these rules apply, rows are written to the target only if one or more of the rules apply.
  25. Click Select a column and choose the column for the rule.
  26. Click Operator to specify a logical operator for your rule. The logical operators that are available for your rules depend on the data type of your column. Columns can be numeric, character, or datetime. To learn about the available logical operators, see Logical Operators in the Filter Transformation.
  27. To add another rule, click Add Rule. When your filter rules are complete, click Next.
    Completed Filter page in Query or Join Data in Hadoop Directive
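    As a point of reference, filter rules correspond to the WHERE clause of the generated HiveQL: Rows for which all of these rules apply combines the rules with AND, and Rows for which any of these rules apply combines them with OR; removing duplicate rows generally corresponds to SELECT DISTINCT. A hedged sketch, reusing the hypothetical product_sales table and illustrative rule values:

      -- Rows for which all of these rules apply (AND);
      -- DISTINCT removes duplicate rows from the result.
      SELECT DISTINCT product_type, product_code, sale_amount
      FROM product_sales
      WHERE sale_amount > 100
        AND product_type = 'electronics';

      -- Rows for which any of these rules apply (OR)
      SELECT product_type, product_code, sale_amount
      FROM product_sales
      WHERE sale_amount > 100
         OR product_type = 'electronics';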
  28. Click Next to open the Columns page. Use the Columns page to select, order, and rename the columns that will be written into the target table. Also use the Columns page to apply Hive SQL expressions to new or existing columns.
    The Columns page is available only if your job does not contain summaries. If your job does contain summaries, then click Next to display the Sort page, and then click Next again to open the Target Table page.
  29. Use the Columns page to do the following:
    • Select and order the columns in the target, using the arrow icons to select all columns, select one column, remove one column, and remove all columns.
    • Replace the suggested column names as needed by clicking in the Target Name column.
    • Reorder columns by clicking a column and then clicking the icons to move the column to the first position, left one position, right one position, or to the last position.
    • Add new columns for Hive SQL expressions. Click the Add icon, and then specify the column name.
    • Add a new column and open the Advanced Editor to develop a Hive SQL expression. Click the Add column, use Advanced Editor icon.
    • Add a Hive SQL expression to an existing column using the Advanced Editor. Click the Edit Hive SQL expression in existing column with the Advanced Editor icon.
  30. Use the Advanced Editor as needed to generate a Hive SQL expression for a new or existing column.
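    For context, the column selections, renames, and expressions built on the Columns page correspond to the SELECT list of the generated HiveQL. A hedged sketch with illustrative names, using the built-in Hive functions upper and round:

      SELECT
          product_code                AS sku,             -- renamed target column
          sale_amount,                                    -- selected unchanged
          upper(product_type)         AS product_type_uc, -- expression in a new column
          round(sale_amount * 0.1, 2) AS estimated_tax    -- another expression column
      FROM product_sales;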
  31. Click Next to close the Columns page and open the Target Table page.
  32. In the Target Table page, to learn about the contents of a table, click the table and click the Table Viewer icon.
  33. To write your target data to an existing table, click that table and click Next. Any existing data in that table will be replaced.
  34. To save data to a new target table, click New Table, enter a table name in the New Table window, and click OK.
    The names of tables must meet the naming conventions of SAS and Hadoop.
  35. To display your target data as a temporary view, click the Save as a view icon. Saving as a view displays your target data in the Sample Data Viewer without saving the results to a target table on disk.
    When your target selection is complete, click Next to open the Code page.
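    For reference, the three target choices correspond roughly to the following HiveQL patterns. The table names are illustrative, and the directive's temporary view may be implemented differently; the CREATE VIEW statement is shown only to illustrate a result that stores no data on disk.

      -- Existing table: replace all of its data with the query result.
      INSERT OVERWRITE TABLE sales_summary
      SELECT product_type, SUM(sale_amount) AS total_sales
      FROM product_sales
      GROUP BY product_type;

      -- New table: create it from the query result.
      CREATE TABLE sales_summary_new AS
      SELECT product_type, SUM(sale_amount) AS total_sales
      FROM product_sales
      GROUP BY product_type;

      -- View: expose the query result without writing data to disk.
      CREATE VIEW sales_summary_view AS
      SELECT product_type, SUM(sale_amount) AS total_sales
      FROM product_sales
      GROUP BY product_type;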
  36. In the Code page, click Edit HiveQL Code to edit the generated code. Click Reset Code to restore the original generated code. Click Next to open the Result page.
    Note: Edit your Hive SQL code with care. The code in the editor is the exact code that will be executed by your job, regardless of previous selections.
  37. In the Result page, you can review the previous pages and make changes by clicking the gray action bars at the top of the window.
  38. Click Save or Save As to save your job.
  39. Click Start querying data to execute your directive. To monitor the progress of your job, see the Run Status directive.