What's New in SAS Data Loader 2.4 for Hadoop

Combine Multiple Tables and Merge Matching Rows

Use the new Match-Merge Data directive to combine columns from multiple source tables into a single target table. You can also merge data in specified columns when rows match in two or more source tables. Matches can occur between columns of either numeric or character data type. Additional features in the directive include the following:
  • Rename specified unmatched columns and copy them into the target.
  • Filter rows from the target using multiple rules that are combined into an expression, or filter using an expression alone.
  • Evaluate target rows using expressions, and write the results into new target columns.
To learn more, see Match-Merge Data.
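Conceptually, a match-merge joins rows on matching key values and then coalesces the data in the merged columns. The following Python sketch illustrates only that idea; it is not the code that the directive generates, and the table names, column names, and rules are hypothetical.

  import pandas as pd

  # Hypothetical source tables that share a key column (CUSTOMER_ID).
  orders = pd.DataFrame({"CUSTOMER_ID": [1, 2, 3], "CITY": ["Cary", None, "Austin"]})
  accounts = pd.DataFrame({"CUSTOMER_ID": [2, 3, 4], "CITY": ["Denver", "Dallas", "Reno"]})

  # Combine rows that match on the key; suffixes keep both versions of CITY.
  merged = orders.merge(accounts, on="CUSTOMER_ID", how="outer",
                        suffixes=("_ORDERS", "_ACCOUNTS"))

  # Merge the data in a specified column: prefer the first table's value and
  # fall back to the second table's value when the first is missing.
  merged["CITY"] = merged["CITY_ORDERS"].combine_first(merged["CITY_ACCOUNTS"])

  # Filter target rows with an expression, then keep the merged columns.
  target = merged.query("CUSTOMER_ID > 1")[["CUSTOMER_ID", "CITY"]]
  print(target)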

Execute Multiple Jobs in Series or in Parallel

The new directive named Chain Directives runs two or more saved directives in series or in parallel. One chain directive can contain another chain directive: a serial chain can execute a parallel chain, and a parallel chain can execute a serial chain. An individual directive can appear more than once in a serial chain. You can view the results of each directive in a chain as soon as those results become available. To learn more, see Chain Directives.
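The serial and parallel behavior can be pictured as nested groups of jobs. The following Python sketch illustrates only that execution model; the run_directive function and the directive names are hypothetical stand-ins, not part of the product.

  from concurrent.futures import ThreadPoolExecutor

  def run_directive(name):
      # Hypothetical stand-in for running one saved directive.
      print(f"running {name}")
      return f"{name}: done"

  def run_serial(names):
      # A serial chain runs its members one at a time, in order.
      return [run_directive(n) for n in names]

  def run_parallel(names):
      # A parallel chain runs its members concurrently.
      with ThreadPoolExecutor() as pool:
          return list(pool.map(run_directive, names))

  # A serial sequence whose middle step is a parallel chain.
  results = run_serial(["Cleanse Customer Data"])
  results += run_parallel(["Profile Orders", "Sort Products"])
  results += run_serial(["Query or Join Results"])
  print(results)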

Improve Your Data with Clustering and Survivorship

The new Cluster-Survive Data directive uses rules to create clusters of similar rows. Additional rules can be used to construct a survivor row that replaces the cluster of rows in the target. The survivor row combines the best values in the cluster. This directive requires the Apache Spark run-time environment. To learn more, see Cluster-Survive Data.
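The idea behind the directive can be summarized briefly: rows that a rule judges to be similar form a cluster, and survivorship rules then pick the best value for each column. The following Python sketch is a conceptual illustration under hypothetical rules (cluster on a normalized name, survive the longest non-missing value); it is not the Spark processing that the directive performs.

  from collections import defaultdict

  rows = [
      {"name": "Jane Doe", "phone": "", "city": "Cary"},
      {"name": "JANE DOE", "phone": "919-555-0100", "city": ""},
      {"name": "John Smith", "phone": "512-555-0101", "city": "Austin"},
  ]

  # Clustering rule (hypothetical): rows with the same normalized name are similar.
  clusters = defaultdict(list)
  for row in rows:
      clusters[row["name"].strip().lower()].append(row)

  # Survivorship rule (hypothetical): for each column, keep the longest value,
  # so the survivor combines the best values from the rows in a cluster.
  survivors = []
  for members in clusters.values():
      survivor = {col: max((m[col] for m in members), key=len) for col in members[0]}
      survivors.append(survivor)

  print(survivors)  # one survivor row replaces each cluster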

Run SQL Programs in Impala or Hive

The directive Run a Hadoop SQL Program replaces the former directive Run a Hive Program. The new directive can use either Impala SQL or HiveQL. In the directive, a Resources box lists the functions that are available in the selected SQL environment. Selecting a function displays syntax help, and a click inserts the selected function into your SQL program. To learn more, see Run a Hadoop SQL Program.

Increase Performance Using Spark and Impala

Support for Apache Spark brings massively parallel in-memory processing to the Cleanse Data, Transform Data, and Cluster-Survive Data directives. Spark-enabled directives use DataFlux EEL functions in user-written expressions. The Spark-enabled EEL functions replace the default DS2 functions, which support MapReduce. The SAS Data Management Accelerator for Spark manages processing in Hadoop.
Support for Cloudera Impala brings Impala SQL processing to the directives Query or Join, Sort and De-Duplicate, and Run a Hadoop SQL Program. User-written expressions use Impala SQL functions instead of the default HiveQL functions.
You can enable Spark and Impala as the defaults for new directives. Individual directives can override the default and always use Spark or MapReduce, or Impala or Hive.
Impala is enabled in the Configuration window, in the Hadoop Configuration panel. Host and Port fields specify both the Hive and Impala servers. Test Connection buttons validate the host and port entries and confirm the operational status of the servers.
Spark support is enabled by selecting Spark in the Preferred runtime target list box.
Saved directives that are enabled for Spark or Impala are indicated as such in the Run Status directive.
Saved directives that were created before the implementation of Impala and Spark will continue to run as usual using Hive and MapReduce.

Clean Up Hadoop Using Table Delete

Hadoop tables can now be selected and deleted in the Source Table and Target Table tasks. To learn more, see Delete Tables.

Added Support for Pivotal HD and IBM BigInsights

Support is now available for the Hadoop distributions Pivotal HD and IBM BigInsights. New versions of Cloudera, Hortonworks, and MapR are also supported. Kerberos is not supported in combination with MapR or IBM BigInsights. To learn about the supported versions of these distributions, see SAS Data Loader 2.4 for Hadoop: System Requirements.

Enhance the Performance of Profile Jobs

In the Configuration window, the Profiles panel now enables you to configure the number of threads that Profile Data directives use. You can also specify the number of MapReduce jobs that those directives run. For more information, see Profiles Panel.

Schedule Jobs Using a REST API

A REST API can now be used in a scheduling application to run saved directives. The API can also return a job’s state, results, log file, or error message. Job controls in the API can cancel running jobs and delete job information. To learn more, see the SAS Data Loader for Hadoop: REST API Reference.
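As an illustration, a scheduling application might call the API with a standard HTTP client, as in the following Python sketch. The host, resource paths, and JSON fields shown here are hypothetical placeholders rather than the documented interface; see the REST API Reference for the actual resources.

  import time
  import requests

  # Hypothetical base URL; the real resource paths are in the REST API Reference.
  BASE = "http://dataloader.example.com/SASDataLoader/rest"

  # Run a saved directive (the endpoint and payload are illustrative assumptions).
  resp = requests.post(f"{BASE}/jobs", json={"directive": "Cleanse Customer Data"})
  resp.raise_for_status()
  job_id = resp.json()["id"]

  # Poll the job's state, then retrieve its log when the job finishes.
  state = "running"
  while state not in ("completed", "failed"):
      time.sleep(10)
      state = requests.get(f"{BASE}/jobs/{job_id}/state").json()["state"]

  print(requests.get(f"{BASE}/jobs/{job_id}/log").text)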

Improved Syntax Editing

The Code task now displays generated code in a syntax editor rather than a text editor. The Code task appears near the Target Table task in the directives that generate code.

Apply and Reload Hadoop Configuration Changes

In the Configuration window, the Hadoop Configuration panel now includes Apply and Reload buttons. Click Apply to validate your changes. Click Reload to restore your previous configuration. For more information, see Hadoop Configuration Panel.

Added Support for VirtualBox and VMware Hypervisors

The following hypervisors can now be used with SAS Data Loader for Hadoop: VMware Workstation 12, VMware Workstation 12 Pro, and Oracle VM VirtualBox 5. Hypervisors run the vApp virtual machine on Windows client computers. For more information, see the SAS Data Loader 2.4 for Hadoop: System Requirements.

Directive Names No Longer Use “in Hadoop”

The words “in Hadoop” no longer appear at the end of directive names. In previous releases, the words appeared in the directive names in the SAS Data Loader window and in the names of new directives.