You can adjust how the SAS Embedded Process runs by changing or adding its configuration properties. These properties can be added to the mapred-site.xml file on the client side or to the sasep-site.xml file. They can also be changed in the ep-config.xml file located on HDFS, which is created when you install the SAS Embedded Process; changes made there affect all SAS Embedded Process jobs. A number of properties can be adjusted to improve the performance of the SAS Embedded Process.
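For example, a client-side sasep-site.xml file that overrides two of the properties described below might look like the following sketch. The file layout is the standard Hadoop configuration format; the values shown are illustrative, not recommendations.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<!-- run 8 SAS Embedded Process MapReduce tasks per node -->
<property>
<name>sas.ep.superreader.tasks.per.node</name>
<value>8</value>
</property>
<!-- use 4 concurrent input reader threads -->
<property>
<name>sas.ep.input.threads</name>
<value>4</value>
</property>
</configuration>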
-
You can specify the number of SAS
Embedded Process MapReduce tasks per node by changing the sas.ep.superreader.tasks.per.node
property.
The SAS Embedded Process
super reader technology does not use the standard MapReduce split
calculation. Instead of assigning one split per task, it assigns many splits to each task.
The super reader calculates the splits, groups them, and distributes
the groups to a configurable number of mapper tasks based on data
locality.
The default number
of tasks is 6.
<property>
<name>sas.ep.superreader.tasks.per.node</name>
<value>number-of-tasks</value>
</property>
-
You can specify the number of concurrent
nodes that are allowed to run High-Performance Analytics output tasks
by changing the sas.ep.hpa.output.concurrent.nodes property.
If this property is
set to zero, the SAS Embedded Process allocates tasks on all nodes
capable of running a YARN container. If this property is set to -1,
the number of concurrent nodes equates to the number of High-Performance
Analytics worker nodes. If the number of concurrent nodes exceeds
the number of available nodes, the property value is adjusted to the
number of available nodes. The default value is 0.
<property>
<name>sas.ep.hpa.output.concurrent.nodes</name>
<value>number-of-nodes</value>
</property>
-
You can specify the number of High-Performance
Analytics output tasks that are allowed to run per node by changing
the sas.ep.hpa.output.tasks.per.node property.
The default number
of tasks is 1.
<property>
<name>sas.ep.hpa.output.tasks.per.node</name>
<value>number-of-tasks</value>
</property>
-
You can specify the number of concurrent input reader threads by changing the sas.ep.input.threads property.
Each reader thread
takes a file split from the input splits queue, opens the file, positions
itself at the beginning of the split, and starts reading the records.
container. Each record is stored in a native buffer that is shared with the DS2
container. When the native buffer is full, it is pushed to the DS2
container for processing. When a reader thread finishes reading a
file split, it takes another file split from the input splits queue.
The default number of input threads is 3.
<property>
<name>sas.ep.input.threads</name>
<value>number-of-input-threads</value>
</property>
-
You can specify the number of output
writer threads by changing the sas.ep.output.threads property.
The SAS Embedded Process
super writer technology improves performance by writing output data
in parallel, producing multiple parts of the output file per mapper
task. Each writer thread is responsible for writing one part of the
output file. The default number of output threads is 2.
<property>
<name>sas.ep.output.threads</name>
<value>number-of-output-threads</value>
</property>
-
You can specify the number of compute
threads by changing the sas.ep.compute.threads property.
Each compute thread
runs one instance of the DS2 program inside the SAS Embedded Process.
The DS2 code that runs inside the DS2 container processes the records
that it receives. At certain points, DS2 flushes output data to native buffers. The super writer threads take the output data from the DS2 buffers and write it to the super writer thread output file in a designated HDFS location. When all file input splits are processed and all output
data is flushed and written to HDFS, the mapper task ends. The default
number of compute threads is 1.
<property>
<name>sas.ep.compute.threads</name>
<value>number-of-threads</value>
</property>
-
You can specify the number of buffers
that are used for output data by changing the sas.ep.output.buffers
property.
The number of output buffers should not be less than sas.ep.compute.threads plus sas.ep.output.threads, as shown in the sizing sketch after the next property. The default number of buffers is 3.
<property>
<name>sas.ep.output.buffers</name>
<value>number-of-buffers</value>
</property>
-
You can specify the number of native
buffers that are used to cache input data by changing the sas.ep.input.buffers
property. The number of input buffers should not be less than sas.ep.compute.threads plus sas.ep.input.threads. The default number of buffers is 4.
<property>
<name>sas.ep.input.buffers</name>
<value>number-of-buffers</value>
</property>
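If you raise the thread counts, raise the buffer counts to match. For example, assuming sas.ep.compute.threads is 2, sas.ep.input.threads is 4, and sas.ep.output.threads is 3, the input buffers should be at least 2 + 4 = 6 and the output buffers at least 2 + 3 = 5. The following sketch shows these illustrative values together:
<property>
<name>sas.ep.compute.threads</name>
<value>2</value>
</property>
<property>
<name>sas.ep.input.threads</name>
<value>4</value>
</property>
<property>
<name>sas.ep.output.threads</name>
<value>3</value>
</property>
<!-- input buffers >= compute threads + input threads -->
<property>
<name>sas.ep.input.buffers</name>
<value>6</value>
</property>
<!-- output buffers >= compute threads + output threads -->
<property>
<name>sas.ep.output.buffers</name>
<value>5</value>
</property>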
-
You can specify the optimal size of one input buffer, in megabytes, by changing the sas.ep.optimal.input.buffer.size.mb property.
The optimal row array
size is calculated based on the optimal buffer size. The default value
is 1 MB.
<property>
<name>sas.ep.optimal.input.buffer.size.mb</name>
<value>number-in-megabytes</value>
</property>
-
You can specify the amount of memory
in bytes that the SAS Embedded Process is allowed to use with MapReduce
1 by changing the sas.ep.max.memory property in the ep-config.xml
file.
If your Hadoop distribution is running MapReduce 2, this value does not supersede the YARN maximum memory per task. Instead, adjust the YARN container limit to change the amount of memory that the SAS Embedded Process is allowed to use, as shown in the sketch after the note below.
The default value is
2147483647 bytes.
<property>
<name>sas.ep.max.memory</name>
<value>number-of-bytes</value>
</property>
Note: This property
is valid only for Hadoop distributions that are running MapReduce
1.
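On MapReduce 2, the per-task memory limit is set through the standard Hadoop memory properties rather than through sas.ep.max.memory. As a sketch, raising the container limit for map tasks in mapred-site.xml might look like the following; the property names are the standard Hadoop ones, not SAS-specific, and the values are illustrative:
<!-- YARN container size for map tasks, in MB -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<!-- JVM heap for map tasks, kept below the container size -->
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3276m</value>
</property>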