Adjusting the SAS Embedded Process Performance

You can adjust how the SAS Embedded Process runs by changing or adding its configuration properties. SAS Embedded Process configuration properties can be added to the mapred-site.xml file on the client side or to the sasep-site.xml file. If you change the properties in the ep-config.xml file that is located on HDFS, the change affects all SAS Embedded Process jobs. The ep-config.xml file is created when you install the SAS Embedded Process.
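For example, to set a property for all jobs, you might add an entry such as the following to the sasep-site.xml file. This sketch assumes that the file follows the standard Hadoop site-file layout, and the value shown is illustrative:
   <configuration>
      <property>
         <name>sas.ep.superreader.tasks.per.node</name>
         <value>8</value>
      </property>
   </configuration>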
When you use the SAS Embedded Process, there are a number of properties that you can adjust to improve performance. A consolidated sasep-site.xml example follows the list.
  • You can specify the number of SAS Embedded Process MapReduce tasks per node by changing the sas.ep.superreader.tasks.per.node property.
    The SAS Embedded Process super reader technology does not use the standard MapReduce split calculation. Instead of assigning one split per task, it assigns many. The super reader calculates the splits, groups them, and distributes the groups to a configurable number of mapper tasks based on data locality.
    The default number of tasks is 6.
    <property>
       <name>sas.ep.superreader.tasks.per.node</name>
       <value>number-of-tasks</value>
    </property>
  • You can specify the number of concurrent nodes that are allowed to run High-Performance Analytics output tasks by changing the sas.ep.hpa.output.concurrent.nodes property.
    If this property is set to zero, the SAS Embedded Process allocates tasks on all nodes that are capable of running a YARN container. If this property is set to –1, the number of concurrent nodes equals the number of High-Performance Analytics worker nodes. If the number of concurrent nodes exceeds the number of available nodes, the property value is adjusted down to the number of available nodes. The default value is 0.
    <property>
       <name>sas.ep.hpa.output.concurrent.nodes</name>
       <value>number-of-nodes</value>
    </property>
  • You can specify the number of High-Performance Analytics output tasks that are allowed to run per node by changing the sas.ep.hpa.output.tasks.per.node property.
    The default number of tasks is 1.
    <property>
       <name>sas.ep.hpa.output.tasks.per.node</name>
       <value>number-of-tasks</value>
    </property>
  • You can specify the number of concurrent input reader threads by changing the sas.ep.input.threads property.
    Each reader thread takes a file split from the input splits queue, opens the file, positions itself at the beginning of the split, and starts reading the records. Each record is stored in a native buffer that is shared with the DS2 container. When the native buffer is full, it is pushed to the DS2 container for processing. When a reader thread finishes reading a file split, it takes another file split from the input splits queue. The default number of input threads is 3.
    <property>
       <name>sas.ep.input.threads</name>
       <value>number-of-input-threads</value>
    </property>
  • You can specify the number of output writer threads by changing the sas.ep.output.threads property.
    The SAS Embedded Process super writer technology improves performance by writing output data in parallel, producing multiple parts of the output file per mapper task. Each writer thread is responsible for writing one part of the output file. The default number of output threads is 2.
    <property>
       <name>sas.ep.output.threads</name>
       <value>number-of-output-threads</value>
    </property>
  • You can specify the number of compute threads by changing the sas.ep.compute.threads property.
    Each compute thread runs one instance of the DS2 program inside the SAS Embedded Process. The DS2 code that runs inside the DS2 container processes the records that it receives. Periodically, DS2 flushes output data to native buffers. The super writer threads take the output data from the DS2 buffers and write it to the super writer thread output file in a designated HDFS location. When all file input splits are processed and all output data is flushed and written to HDFS, the mapper task ends. The default number of compute threads is 1.
    <property>
       <name>sas.ep.compute.threads</name>
       <value>number-of-threads</value>
    </property>
  • You can specify the number of buffers that are used for output data by changing the sas.ep.output.buffers property.
    The number of output buffers should not be less than sas.ep.compute.threads plus sas.ep.output.threads; see the worked example after the sas.ep.input.buffers property. The default number of buffers is 3.
    <property>
       <name>sas.ep.output.buffers</name>
       <value>number-of-buffers</value>
    </property>
  • You can specify the number of native buffers that are used to cache input data by changing the sas.ep.input.buffers property. The number of input buffers should not be less than sas.ep.compute.threads plus sas.ep.input.threads; a worked example follows this property. The default value is 4.
    <property>
       <name>sas.ep.input.buffers</name>
       <value>number-of-buffers</value>
    </property>
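    For example, with sas.ep.compute.threads=2, sas.ep.input.threads=3, and sas.ep.output.threads=2 (illustrative values, not recommendations), the minimum buffer counts are:
       minimum sas.ep.input.buffers  = 2 + 3 = 5
       minimum sas.ep.output.buffers = 2 + 2 = 4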
  • You can specify the optimal size, in megabytes, of one input buffer by changing the sas.ep.optimal.input.buffer.size.mb property.
    The optimal row array size is calculated based on the optimal buffer size. The default value is 1 MB.
    <property>
       <name>sas.ep.optimal.input.buffer.size.mb</name>
       <value>number-in-megabytes</value>
    </property>
  • You can specify the amount of memory in bytes that the SAS Embedded Process is allowed to use with MapReduce 1 by changing the sas.ep.max.memory property in the ep-config.xml file.
    If your Hadoop distribution is running MapReduce 2, this value does not supersede the YARN maximum memory per task. Adjust the YARN container limit to change the amount of memory that the SAS Embedded Process is allowed to use.
    The default value is 2147483647 bytes.
    <property>
       <name>sas.ep.max.memory</name>
       <value>number-of-bytes</value>
    </property>
    Note: This property is valid only for Hadoop distributions that are running MapReduce 1.
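Putting several of these settings together, a sasep-site.xml file that raises parallelism while keeping the buffer counts consistent with the constraints above might look like the following sketch. All values are illustrative and assume the standard Hadoop site-file layout; tune them for your cluster and workload:
   <configuration>
      <property>
         <name>sas.ep.superreader.tasks.per.node</name>
         <value>8</value>
      </property>
      <property>
         <name>sas.ep.input.threads</name>
         <value>4</value>
      </property>
      <property>
         <name>sas.ep.compute.threads</name>
         <value>2</value>
      </property>
      <property>
         <name>sas.ep.output.threads</name>
         <value>2</value>
      </property>
      <!-- input buffers: at least compute threads (2) + input threads (4) -->
      <property>
         <name>sas.ep.input.buffers</name>
         <value>6</value>
      </property>
      <!-- output buffers: at least compute threads (2) + output threads (2) -->
      <property>
         <name>sas.ep.output.buffers</name>
         <value>4</value>
      </property>
   </configuration>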