Copy Data from Hadoop

Introduction

The Copy Data from Hadoop directive enables you to copy data from Hadoop into database management systems such as Oracle and Teradata.

Using Copy Data from Hadoop

Follow these steps to copy data from Hadoop into a database:
  1. On the SAS Data Loader page, click the Copy Data from Hadoop directive. The Source Table page is displayed, listing the available data sources.
  2. Click a data source to display its tables.
  3. Select the table from which to copy data.
    Clicking the Action menu icon enables the following actions:
    Open
    opens the current directive page.
    Table Viewer
    enables you to view sample data from a table. Select the table, and then click the data sample icon to display the SAS Table Viewer.
    Advanced Options
    opens a dialog box that enables you to specify the maximum length for SAS columns. Entering a value here overrides the value specified in the Configuration options.
    Note: If the source table has String data types, the resulting SAS data set could be very large. The length of the target field in the SAS data set is determined by the value of this option.
    When table selection is complete, click Next. The Options page is displayed.
  4. Do not change the value on the Options page unless you have advanced knowledge of database operations.
    Note: Setting the number of processes to a value greater than one increases the number of parallel processes and source data connections that are used to import data.
    Click Next. The Target Table page is displayed, listing the target databases.
  5. Click a database to display its data sources.
  6. Click a data source to display its tables.
  7. Select the table to which to copy the data.
    Tip
    • You can create a new table by clicking New Table.
    • If a profile already exists for a table, PROFILED appears next to the table icon. You can view the existing profile by selecting the table and clicking View Profile.
    Clicking the Action menu icon enables the following actions:
    Open
    opens the current directive page.
    Table Viewer
    enables you to view sample data from a table. Select the table, and then click the data sample icon to display the SAS Table Viewer.
    Click Next. The Code page is displayed.
  8. Click Edit Code to modify the generated code. To cancel your modifications, click Reset Code.
    CAUTION:
    Edit code only to implement advanced features.
    Under normal circumstances, code edits are not needed.
  9. Click Next. The Result page is displayed.
  10. Click Start copying data. The Result page displays the results of the copy process.
    The following actions are available:
    View Results
    enables you to view the results of the copy process in the SAS Table Viewer.
    Log
    displays the SAS log that is generated during the copy process.
    Code
    displays the SAS code that copies the data. A rough, hand-written sketch of the kind of Sqoop arguments involved in such a copy appears after these steps.
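The copy itself is carried out by Sqoop over a JDBC connection (see About Drivers and Connections and the Usage Notes). The following is a hand-written sketch, not the directive's actual generated code, of the kind of Sqoop export arguments such a copy involves; the connection string, table name, and export directory are hypothetical placeholders:
    <arg>--connect</arg>
    <arg>jdbc:oracle:thin:@//dbhost.example.com:1521/orcl</arg>
    <arg>--table</arg>
    <arg>SALES_TARGET</arg>
    <arg>--export-dir</arg>
    <arg>/user/hive/warehouse/sales</arg>
    <arg>--input-null-string</arg>
    <arg>null</arg>
    <arg>--input-null-non-string</arg>
    <arg>null</arg>
The --input-null-string and --input-null-non-string arguments shown here are the same arguments referenced in the Usage Notes for handling null value formats.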

About Drivers and Connections

The Copy Data from Hadoop directive uses JDBC drivers to connect from your client machine to databases. The JDBC drivers on your client machine must be the same as those that are installed on the Hadoop cluster. See Install JDBC Drivers and Add Database Connections for more information.
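When you add a database connection, you supply a JDBC connection string for the target database. The exact form depends on the driver and on your site; the following are illustrative examples only, with hypothetical host, port, service, and database names:
  jdbc:oracle:thin:@//dbhost.example.com:1521/orcl
  jdbc:teradata://tdhost.example.com/Database=salesdb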

Usage Notes

  • The hive.resultset.use.unique.column.names entry in the hive-site.xml file on the target Hadoop cluster must be set to false (a sketch of the hive-site.xml entry appears at the end of these notes). Otherwise, you might see an error message in the job history log. If you encounter this error, contact your Hadoop administrator.
  • Source tables with a large number of columns can cause Copy Data from Hadoop jobs to fail. The job runs until the target table reaches the maximum number of columns that are supported in the target database. To resolve the problem, reduce the number of columns that are selected for the target and run the job again.
  • If one or more varchar or string columns from a source Hadoop table contain more string data than the target database column can hold, the Copy Data from Hadoop request times out. For example, a source Hadoop table might contain a string column named myString and a target Oracle table might contain a varchar(4000) column also named myString. If data in the Hadoop myString column has a length greater than 4000, then the copy request fails.
  • When copying a Hadoop table to a database, a column name specified in the array of struct in the Hadoop table is not copied to the database table. This happens because of how structs are mapped to varchars in Sqoop.
  • A copy from Hadoop is likely to fail if the name of a source column is also a reserved word in the target database.
  • When copying a Hadoop table to Oracle, a mixed-case schema name generates an error.
  • When copying a Hadoop table to Oracle, timestamp columns in Hadoop generate errors in Oracle. The Hive timestamp format differs from the Oracle timestamp format. To resolve this issue, change the column type in the Oracle target table from timestamp to varchar2.
  • If a copy fails due to errors in null value formats, edit your code so that null string and null non-string arguments contain the value null:
    <arg>--input-null-string</arg>
    <arg>null</arg>
    <arg>--input-null-non-string</arg>
    <arg>null</arg>
  • To copy Hadoop tables to Teradata, when the source contains a double-byte character set (DBCS) such as Chinese, follow these steps:
    1. Edit the default connection string to include the option charset=utf8, as shown in this example:
      jdbc:teradata://TeradataHost/Database=TeradataDB,charset=utf8
      To edit the connection string, click the Configuration menu icon to open the Configuration window, click Databases, and then select and edit the Teradata connection.
    2. Ensure that the default character type for the Teradata user is UNICODE.
    3. When a new Teradata table is created with the Copy Data from Hadoop directive, set the column type for VARCHAR (and, if necessary, CHAR) columns to CHARACTER SET UNICODE to accommodate wide characters, as sketched below.
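      For step 3, the following is a minimal sketch of a target table definition with a Unicode character column. The database, table, and column names are hypothetical; adjust lengths and other column attributes to match your data:
        CREATE TABLE salesdb.customer_target (
          customer_id INTEGER,
          customer_name VARCHAR(100) CHARACTER SET UNICODE
        );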
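The hive.resultset.use.unique.column.names setting mentioned in the first usage note is made in the hive-site.xml file on the target Hadoop cluster. The following is a minimal sketch of that entry using standard Hadoop property syntax; your Hadoop administrator might manage this file through a cluster manager rather than editing it directly:
  <property>
    <name>hive.resultset.use.unique.column.names</name>
    <value>false</value>
  </property>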