What's New in SAS Data Loader 2.3 for Hadoop

Overview

The main enhancements for SAS Data Loader 2.3 include the following:
  • Migration Support for SAS Data Loader 2.2
  • New Support for Importing Delimited Files into Hadoop
  • Enhanced Support for SAS LASR Analytic Server
  • Profile Enhancements
  • New Features for Data Quality Analysis
  • New Support for Deleting Rows from a Selected Table
  • Usability Enhancements
  • Enhanced Support for Hadoop
  • Enhanced Support for Apache Hive
  • New Support for Active Directory (LDAP) Authentication
  • Documentation Changes

Migration Support for SAS Data Loader 2.2

When you configure the vApp for SAS Data Loader 2.3, you can migrate your jobs, profiles, and configuration settings from SAS Data Loader 2.2. For more information, see the “Migrate From a Previous Version” chapter of the SAS Data Loader for Hadoop: vApp Deployment Guide.

New Support for Importing Delimited Files into Hadoop

The new directive Import a File imports and registers delimited source files in Hadoop. The contents of each imported file are saved to a Hive table. New tables receive columns that are derived from the source file. You can edit the column definitions and store them for reuse in future imports.
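
The idea of deriving column definitions from a delimited source can be illustrated with a short Python sketch. This is not the logic that the Import a File directive uses. The function name derive_columns, the sample data, and the type-narrowing rules (BIGINT, then DOUBLE, then STRING) are assumptions made for illustration only.

```python
import csv
import io

def derive_columns(text, delimiter=","):
    """Infer (name, hive_type) pairs from a header row and the data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]

    def infer(values):
        # Try the narrowest type first and fall back to STRING.
        for hive_type, cast in (("BIGINT", int), ("DOUBLE", float)):
            try:
                for value in values:
                    cast(value)
                return hive_type
            except ValueError:
                continue
        return "STRING"

    return [(name, infer([row[i] for row in data]))
            for i, name in enumerate(header)]

# Hypothetical pipe-delimited source with a header row.
sample = "id|city|temp\n1|Cary|71.5\n2|Austin|88\n"
columns = derive_columns(sample, delimiter="|")
print(columns)  # [('id', 'BIGINT'), ('city', 'STRING'), ('temp', 'DOUBLE')]
```

Because the result is an ordinary list, a derived definition can be edited before it is reused, which mirrors the edit-and-store step described above.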

Enhanced Support for SAS LASR Analytic Server

The directive Load Data to LASR now supports symmetric multiprocessing (SMP) with the SASIOLA engine when loading data into non-grid configurations of the SAS LASR Analytic Server software. In grid configurations, the directive continues to support massively parallel processing (MPP).

Profile Enhancements

The directives Profile Data and Saved Profile Reports have been enhanced:
  • New configuration settings give you more control over how profile jobs are processed. For example, you can specify the number of parallel threads to use when processing a job. You can also choose to minimize the number of threads that are used. For more information about these settings, see Profiles Panel.
  • When viewing data in a profile report, you can view trend graphs for the information in the report to quickly visualize how the profiled data has changed over time. For more information about using trend graphs in reports, see Open Saved Profile Reports.
  • When you generate a profile report, SAS Data Loader now automatically performs data type analysis and includes the results in the report. The analysis detects and reports on the types of content in source columns. For example, the analysis can indicate that a column contains phone numbers or ZIP codes. For more information about profile reports, see About Profile Reports.
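
As a rough illustration of what data type analysis means, the following Python sketch labels a column by testing its values against content patterns. It does not reproduce the analysis that SAS Data Loader performs. The regular expressions (US-style ZIP codes and phone numbers), the 80% threshold, and the name analyze_column are all assumptions.

```python
import re

# Hypothetical content patterns for US-style values.
PATTERNS = {
    "ZIP code": re.compile(r"\d{5}(-\d{4})?"),
    "phone number": re.compile(r"(\(\d{3}\) ?|\d{3}[-.])\d{3}[-.]\d{4}"),
}

def analyze_column(values, threshold=0.8):
    """Return the first label whose pattern matches enough non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return "unknown"
    for label, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            return label
    return "unknown"

zip_result = analyze_column(["27513", "78701-1234", "10001"])
phone_result = analyze_column(["919-555-0100", "(512) 555-0199", "212.555.0123"])
text_result = analyze_column(["Cary", "Austin"])
print(zip_result, phone_result, text_result, sep=" | ")
```

A threshold below 100% lets a column keep its label even when a few values are malformed, which is usually what a profile report needs to show.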

New Features for Data Quality Analysis

The directive Cleanse Data in Hadoop now provides the following new transformations: Change Case, Gender Analysis, Pattern Analysis, and Field Extraction.
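
To give a sense of what a pattern analysis can produce, the following Python sketch reduces each value to a character pattern and counts how often each pattern occurs. The notation used here (9 for a digit, A for an uppercase letter, a for a lowercase letter) is an assumption for illustration and might differ from the output of the Pattern Analysis transformation.

```python
from collections import Counter

def char_pattern(value):
    """Map digits to '9', uppercase to 'A', lowercase to 'a'; keep the rest."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

def pattern_frequencies(values):
    """Count how many values share each character pattern."""
    return Counter(char_pattern(v) for v in values)

freq = pattern_frequencies(["27513", "78701", "2751A"])
print(freq.most_common())             # [('99999', 2), ('9999A', 1)]
print(char_pattern("NC 27513-1234"))  # AA 99999-9999
```

Rare patterns, such as 9999A in a column of ZIP codes, tend to point at values that need cleansing.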

New Support for Deleting Rows from a Selected Table

The new Delete Rows directive deletes specified rows from a source table without writing output to a target table. You can select the rows to delete with one or more rules or with a user-written Hive expression. You can paste and edit existing expressions. An expression builder provides access to Hive function syntax and source column names. To use this feature, the Hadoop cluster must be configured with release 0.14 or later of the Apache Hive data warehouse software. Hive 0.14 and later support transactional tables.
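
The difference between deleting by rules and deleting by a Hive expression can be sketched in Python as two ways of producing the WHERE clause of a Hive DELETE statement. This is a sketch under stated assumptions, not the code that the directive generates. The function build_delete, the rule format (column, operator, value), and the table and column names are hypothetical.

```python
ALLOWED_OPERATORS = {"=", "<>", "<", "<=", ">", ">="}

def quote_literal(value):
    """Render a Python value as a Hive literal (numbers bare, strings quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

def build_delete(table, rules=None, expression=None, match_all=True):
    """Return a Hive DELETE statement built from rules or a raw expression."""
    if bool(rules) == bool(expression):
        raise ValueError("supply exactly one of rules or expression")
    if expression:
        where = expression
    else:
        parts = []
        for column, operator, value in rules:
            if operator not in ALLOWED_OPERATORS:
                raise ValueError(f"unsupported operator: {operator}")
            parts.append(f"`{column}` {operator} {quote_literal(value)}")
        where = (" AND " if match_all else " OR ").join(parts)
    # Hive accepts DELETE only on tables that support transactions.
    return f"DELETE FROM {table} WHERE {where}"

by_rules = build_delete("customers", rules=[("state", "=", "NC"), ("age", "<", 18)])
by_expr = build_delete("customers", expression="year(signup_date) < 2010")
print(by_rules)
print(by_expr)
```

Rules suit simple column comparisons, while an expression allows Hive functions such as year() that a fixed rule format cannot express.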

Usability Enhancements

You can do the following tasks, using features that are new to this release:
  • display available tables and data sources in a list view or in the existing grid view.
  • use a new Search field to locate tables and data sources in the list view.
  • configure the list view and grid view to open the most recently used data source automatically. For more information, see About the SAS Table Viewer.
  • select from the ten most recently accessed tables or data sources.
  • delete individual jobs in the Run Status directive.
  • filter rows using an expression builder or the existing rules mechanism in several directives.
  • view short column names in the SAS Table Viewer instead of names in the form tablename.columnname. The short column names are easier to read.

Enhanced Support for Hadoop

New versions of Cloudera and Hortonworks are supported. Support for MapR has been added. Kerberos is not supported in combination with MapR.
If the version of Hadoop on your cluster has changed, you can now update the version of Hadoop that is specified in the vApp to match. For more information, see “Configuring a New Version of Hadoop” in the SAS Data Loader for Hadoop: vApp Deployment Guide.
If the default temporary storage directory for SAS Data Loader is not appropriate, you can now change that directory. For example, if the default temporary directory is not writable, you can specify another directory. For more information, see the description of the SAS HDFS temporary file location field in the topic Hadoop Configuration Panel.

Enhanced Support for Apache Hive

The new directive Run a Hive Program enables you to paste and edit existing Hive programs, and then run those programs in Hadoop.
The Run a Hive Program directive provides a resource selector that enables you to browse and add Hive function syntax. The resource selector is also provided in the Hive expression builder in the following directives: Delete Rows, Query or Join Tables in Hadoop, and Sort and De-Duplicate Data in Hadoop.
You can now store data in the default Hive warehouse location, or you can specify an alternate location in the Hadoop file system. This location setting can be applied globally for all target tables that are generated. You can also override the global setting on a job-by-job basis for individual target tables. For more information, see Change the File Format of Hadoop Target Tables.
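
The relationship between the global location setting and a job-level override can be sketched in Python as a simple precedence rule applied when a target table is created. The function create_table_ddl, the table name, and the paths are hypothetical, and the DDL that SAS Data Loader actually generates might differ. The sketch relies only on the LOCATION clause of the Hive CREATE TABLE statement.

```python
def create_table_ddl(table, columns, global_location=None, job_location=None):
    """Build a Hive CREATE TABLE statement, honoring a job-level override."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE TABLE {table} ({cols})"
    # A job-level location wins over the global setting. With neither set,
    # no LOCATION clause is emitted and Hive uses its default warehouse.
    location = job_location or global_location
    if location:
        ddl += f" LOCATION '{location}'"
    return ddl

cols = [("id", "BIGINT"), ("city", "STRING")]
default_ddl = create_table_ddl("targets", cols)
global_ddl = create_table_ddl("targets", cols, global_location="/data/loader")
job_ddl = create_table_ddl("targets", cols, global_location="/data/loader",
                           job_location="/data/job42")
print(default_ddl)
print(global_ddl)
print(job_ddl)
```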

New Support for Active Directory (LDAP) Authentication

SAS Data Loader can now connect to a Hadoop cluster that is protected with Active Directory and LDAP (Lightweight Directory Access Protocol). For more information, see Active Directory (LDAP) Authentication.

Documentation Changes

The content that was formerly in the SAS Data Loader for Hadoop: Administrator’s Guide is now in the “Administrator’s Guide for SAS Data Loader for Hadoop” section of the SAS In-Database Products: Administrator’s Guide.