Increase Performance Using Spark and Impala
Support for Apache Spark
brings massively parallel in-memory processing to the Cleanse Data,
Transform Data, and Cluster-Survive directives. In Spark-enabled directives,
user-written expressions use DataFlux EEL functions, which replace the
default DS2 functions that support MapReduce. The SAS Data Management
Accelerator for Spark manages processing in Hadoop.
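As a sketch of the kind of user-written expression involved, the following fragment uses common EEL string built-ins; the field names are hypothetical and not taken from this document:

```
// Hypothetical EEL expression for a Cleanse Data step.
// full_name is an assumed input field; clean_name is an assumed output field.
string clean_name
clean_name = upper(trim(full_name))
```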
Support for Cloudera
Impala brings Impala SQL processing to the Query or Join,
Sort and De-Duplicate, and Run a Hadoop SQL Program directives.
User-written expressions use Impala SQL functions instead of the
default HiveQL functions.
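As an illustration of the kind of expression affected, the following filter condition uses built-in string functions that Impala SQL provides; the column name and literal are hypothetical:

```sql
-- Hypothetical filter expression for a Query or Join directive.
-- upper() and trim() are Impala SQL built-in string functions;
-- customer_name is an assumed column.
upper(trim(customer_name)) = 'ACME CORP'
```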
Spark and Impala can
be enabled by default for new directives. Individual directives can
override the default to always use Spark or MapReduce, or Impala
or Hive.
Impala is enabled in
the Hadoop Configuration panel of the Configuration
window. The Host and Port fields
specify both the Hive and Impala servers. Test Connection buttons
validate the host and port entries and confirm that the servers
are operational.
Spark support is enabled
by selecting Spark in the Preferred runtime
target list box.
Saved directives that
are enabled for Spark or Impala are identified as such in the Run Status
directive.
Saved directives that
were created before the implementation of Impala and Spark support
continue to run as usual, using Hive and MapReduce.
Directive Names No Longer Use “in Hadoop”
The words “in
Hadoop” no longer appear at the end of directive names.
In previous releases, these words appeared in directive names in the
SAS Data Loader window and in the names of newly
created directives.