Scalability Community: SAS/CONNECT Software

Piping is an extension of the MP CONNECT functionality whose purpose is to address pipeline parallelism. Piping enables you to overlap the execution of SAS data steps and/or certain SAS procedures. This is accomplished by spawning one SAS session to run one data step or proc and piping its output through a TCP/IP socket as input into another SAS session running another data step or proc. This pipeline can be extended to include any number of steps and can even extend between different physical machines. The benefits of piping include:

Piping Syntax and Details

Considerations and Requirements

Sample Code: Piping Between Data Steps

Sample Code: Piping Between Data Step and Proc Sort on SMP Machine

Sample Code: Piping Between Data Step and Proc Sort Across Remote Machines

Sample Code: Multi-Processing with Raw Data Files and Merging

Sample Code: Piping with Raw Data Files and Merging

Sample Code: Piping Between Data Step, Sort, and Merge

Sample Results of Piping Implementations

Piping Syntax and Details

The piping functionality is packaged in the form of an engine which you specify on a SAS libname statement.
libname libref sasesock "port specifier";

where port specifier can be specified in one of two ways:

Explicit port specification has the following syntax:

":explicit port | :service name";
for example:
libname foo SASESOCK ":256";
This example specifies an explicit port of 256 on the machine where an asynchronous RSUBMIT is executing.
libname foo SASESOCK ":port1";
This example specifies a service name of port1 on the machine where an asynchronous RSUBMIT is executing.
libname foo SASESOCK "xyz.finance.com:256";
This example specifies an explicit port of 256 on machine xyz.finance.com.
libname foo SASESOCK "xyz.finance.com:port1";
This example specifies a service name of port1 on machine xyz.finance.com.
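
As a minimal sketch of how the two ends of a pipe fit together, the following uses MP CONNECT to start two SAS sessions on the same machine. One session writes a data set through the SASESOCK engine and the other reads it while it is still being produced. The port number 5001, the process names writer and reader, the /tmp output location, and the sascmd value are assumptions made for illustration; substitute values that are valid at your site.

```sas
/* Assumption: the command "sas" starts a SAS session on this machine. */
options autosignon=yes sascmd="sas";

/* Session 1: writes rows to the pipe instead of to disk. */
rsubmit process=writer wait=no;
   libname outlib sasesock ":5001";   /* hypothetical free port */
   data outlib.numbers;
      do i = 1 to 100000;
         x = i * 2;
         output;
      end;
   run;
endrsubmit;

/* Session 2: reads from the same port while session 1 is still writing. */
rsubmit process=reader wait=no;
   libname inlib sasesock ":5001";
   libname final "/tmp";              /* hypothetical output location */
   data final.result;
      set inlib.numbers;
      y = x + 1;
   run;
endrsubmit;

/* Block until both sessions finish, then end them. */
waitfor _all_ writer reader;
signoff writer;
signoff reader;
```

Both libname statements name the same port, which is what ties the output of the first step to the input of the second. Data should be expected to pass through the pipe only once, so the reading step in this sketch processes inlib.numbers in a single sequential pass.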

Implicit port specification can be used if you do not want to specify a specific port or service or do not know which port or service may be unused. In this case you can specify an alias as the port.

for example:

libname foo SASESOCK "autoport";
This example specifies the alias autoport, which causes an unused port to be associated with this alias. You do not need to know the explicit port number; simply use the autoport alias on both the input and output libname statements. Note that this requires the use of the metadata repository in order to store the number of the specific port that gets used.

Considerations and Requirements

In order to benefit from the use of piping, you should be aware of the following:

Sample Results of Piping Implementations



Sequential Processing                                

The first test exemplifies a typical SAS scenario: using a data step to generate data, writing the output to disk, then reading in and sorting the data and writing the output back out to disk.

Total time: 5374 seconds


Pipeline Parallelism - same machine

The next test accomplishes the same task using parallel processing with piping. The data step generates the data as before, but this time it splits the output and pipes it over TCP/IP ports to 4 SAS sessions, each running in parallel. Each of these SAS sessions sorts its portion of the data and sends the output over a TCP/IP port to a merge step, which merges the sorted data and writes the output to disk.

The test case becomes much more efficient because:

1. The data can be sorted as the data step is processing.

2. The sort process can be divided into 4 separate sorts, running in parallel, enabling the sort to run more effectively.

3. The intermediate write to disk is eliminated, which improves performance and reduces disk usage.

This resulted in a 62% improvement over the sequential test case! One reason for such a dramatic improvement is that the machine on which the test case ran had sufficient CPU and I/O resources. This scenario runs most efficiently on a machine with a minimum of 5 CPUs. However, what if your machine does not have this capacity?

Total time: 2063 seconds (62% improvement over sequential test case)
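
The split, sort, and merge pipeline described above can be sketched as follows. To keep the example short it uses 2 sort sessions rather than the 4 used in the test, and it is not the source code that produced the timings. The ports 5001 through 5004, the process names, the generated key variable, the /tmp output location, and the sascmd value are all assumptions made for illustration.

```sas
/* Assumption: the command "sas" starts a SAS session on this machine. */
options autosignon=yes sascmd="sas";

/* Generator: alternates rows between two pipes instead of writing to disk. */
rsubmit process=gen wait=no;
   libname out1 sasesock ":5001";     /* hypothetical free ports */
   libname out2 sasesock ":5002";
   data out1.part out2.part;
      do i = 1 to 1000000;
         key = ranuni(12345);
         if mod(i, 2) = 1 then output out1.part;
         else output out2.part;
      end;
   run;
endrsubmit;

/* Sort sessions: each reads one pipe and writes its sorted half to another. */
rsubmit process=sort1 wait=no;
   libname in1 sasesock ":5001";
   libname srt1 sasesock ":5003";
   proc sort data=in1.part out=srt1.part;
      by key;
   run;
endrsubmit;

rsubmit process=sort2 wait=no;
   libname in2 sasesock ":5002";
   libname srt2 sasesock ":5004";
   proc sort data=in2.part out=srt2.part;
      by key;
   run;
endrsubmit;

/* Merge session: interleaves the two sorted streams by key, writes to disk. */
rsubmit process=mrg wait=no;
   libname srt1 sasesock ":5003";
   libname srt2 sasesock ":5004";
   libname final "/tmp";              /* hypothetical output location */
   data final.sorted;
      set srt1.part srt2.part;
      by key;
   run;
endrsubmit;

/* Block until every session finishes, then end them all. */
waitfor _all_ gen sort1 sort2 mrg;
signoff _all_;
```

The only data set written to disk is final.sorted; every intermediate result travels over a socket. Each session occupies a CPU while it runs, which is why the 4-sort version of this design is said to favor a machine with at least 5 CPUs.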


Pipeline Parallelism - remote machines

No problem: simply farm the work out to remote machines. In this example, the sorts run on remote machines and pipe their results back to the parent machine. This test case completed with a 61% improvement over the sequential test!

Total time: 2063 seconds (61% improvement over sequential test case)


Performance Results Summary

The results of the 3 test cases are depicted in the graph, demonstrating the dramatic effectiveness of piping.

Test case Scenario #2


Sequential Processing

The next suite of test cases depicts a scenario where two separate raw text files are read in, one containing the sales for the year and the other containing the goals for that year. The data step divides the data into 4 SAS data sets based on the quarter. The sales data and the goals data are then merged together based on the quarter to produce 4 SAS output files.

The sequential test case accomplishes this by executing the following steps in sequence. First, the sales raw data file is read in and divided into 4 SAS data sets based on the quarter, which are stored on disk. Next, the goals raw data file is read in and divided into 4 additional SAS data sets based on the quarter, which are also written to disk. After this step has completed, the Q1 sales and Q1 goals data sets are merged together and written to disk. The Q2, Q3, and Q4 sales and goals data sets are then merged and written to disk in the same way, one quarter at a time. In total, this test required 6 different steps, all executed sequentially, and took 2063 seconds to complete.

Total time:  2063 seconds


MP CONNECT Implementation

MP CONNECT allows independent SAS tasks to run in parallel, each in its own SAS session. The MP CONNECT implementation of this same test case runs the two data steps in parallel. After both have completed, MP CONNECT executes the 4 merges in parallel. This test completed in 977 seconds, which is a 52% improvement over the sequential version of the test case!

Total time: 977 seconds (52% improvement over sequential test case)
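
The two-stage MP CONNECT approach can be sketched as follows. This is an illustration under stated assumptions, not the source that produced the timings: the file paths, the input layout (quarter, region, and an amount or goal value), the BY variable region, and the process names are all hypothetical, and the raw files are assumed to be sorted by region already. Only the Q1 and Q2 merges are written out; Q3 and Q4 would follow the same pattern.

```sas
/* Assumption: the command "sas" starts a SAS session on this machine. */
options autosignon=yes sascmd="sas";

/* Stage 1: read both raw files in parallel, splitting each by quarter. */
rsubmit process=sales wait=no;
   libname perm "/tmp/demo";                 /* hypothetical location */
   data perm.sales_q1 perm.sales_q2 perm.sales_q3 perm.sales_q4;
      infile "/tmp/demo/sales.txt";          /* hypothetical raw file */
      input quarter region $ amount;
      select (quarter);
         when (1) output perm.sales_q1;
         when (2) output perm.sales_q2;
         when (3) output perm.sales_q3;
         otherwise output perm.sales_q4;
      end;
   run;
endrsubmit;

rsubmit process=goals wait=no;
   libname perm "/tmp/demo";
   data perm.goals_q1 perm.goals_q2 perm.goals_q3 perm.goals_q4;
      infile "/tmp/demo/goals.txt";          /* hypothetical raw file */
      input quarter region $ goal;
      select (quarter);
         when (1) output perm.goals_q1;
         when (2) output perm.goals_q2;
         when (3) output perm.goals_q3;
         otherwise output perm.goals_q4;
      end;
   run;
endrsubmit;

/* The merges need the complete data sets on disk, so wait here. */
waitfor _all_ sales goals;

/* Stage 2: one session per quarter, all running in parallel. */
rsubmit process=m1 wait=no;
   libname perm "/tmp/demo";
   data perm.q1;
      merge perm.sales_q1 perm.goals_q1;
      by region;
   run;
endrsubmit;

rsubmit process=m2 wait=no;
   libname perm "/tmp/demo";
   data perm.q2;
      merge perm.sales_q2 perm.goals_q2;
      by region;
   run;
endrsubmit;

/* Q3 and Q4 merges (processes m3 and m4) follow the same pattern. */

waitfor _all_ m1 m2;
signoff _all_;
```

The waitfor between the two stages is the point of the comparison: the merges cannot begin until both data steps have finished writing to disk.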

Piping Implementation

The piping implementation executes the data steps and the merges in parallel. Thus, as the data is separated, it can be piped directly into the merge process. Not only is this more time effective, because the data step feeds its output directly into the merge so that the two processes can execute simultaneously, but it also eliminates the intermediate write to disk. We saw a 46% improvement over the MP CONNECT test case and a 74% improvement over the sequential test case!

Total time: 532 seconds (46% improvement over MP CONNECT test case, 74% improvement over sequential test case)
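
A piped version of the same job might look like the sketch below, in which each data step writes its four quarterly outputs to sockets and each merge reads one sales pipe and one goals pipe. As before, this is an illustration rather than the source that produced the timings: the ports 5011 through 5014 and 5021 through 5024, the file paths, the input layout, the BY variable region, and the process names are hypothetical, and the raw files are assumed to be sorted by region.

```sas
/* Assumption: the command "sas" starts a SAS session on this machine. */
options autosignon=yes sascmd="sas";

/* Sales data step: one output pipe per quarter, nothing written to disk. */
rsubmit process=sales wait=no;
   libname s1 sasesock ":5011";
   libname s2 sasesock ":5012";
   libname s3 sasesock ":5013";
   libname s4 sasesock ":5014";
   data s1.sales s2.sales s3.sales s4.sales;
      infile "/tmp/demo/sales.txt";          /* hypothetical raw file */
      input quarter region $ amount;
      select (quarter);
         when (1) output s1.sales;
         when (2) output s2.sales;
         when (3) output s3.sales;
         otherwise output s4.sales;
      end;
   run;
endrsubmit;

/* Goals data step: same structure on a second set of ports. */
rsubmit process=goals wait=no;
   libname g1 sasesock ":5021";
   libname g2 sasesock ":5022";
   libname g3 sasesock ":5023";
   libname g4 sasesock ":5024";
   data g1.goals g2.goals g3.goals g4.goals;
      infile "/tmp/demo/goals.txt";          /* hypothetical raw file */
      input quarter region $ goal;
      select (quarter);
         when (1) output g1.goals;
         when (2) output g2.goals;
         when (3) output g3.goals;
         otherwise output g4.goals;
      end;
   run;
endrsubmit;

/* One merge session per quarter, each reading its two pipes as data arrives. */
rsubmit process=m1 wait=no;
   libname s sasesock ":5011";
   libname g sasesock ":5021";
   libname perm "/tmp/demo";                 /* hypothetical location */
   data perm.q1; merge s.sales g.goals; by region; run;
endrsubmit;

rsubmit process=m2 wait=no;
   libname s sasesock ":5012";
   libname g sasesock ":5022";
   libname perm "/tmp/demo";
   data perm.q2; merge s.sales g.goals; by region; run;
endrsubmit;

rsubmit process=m3 wait=no;
   libname s sasesock ":5013";
   libname g sasesock ":5023";
   libname perm "/tmp/demo";
   data perm.q3; merge s.sales g.goals; by region; run;
endrsubmit;

rsubmit process=m4 wait=no;
   libname s sasesock ":5014";
   libname g sasesock ":5024";
   libname perm "/tmp/demo";
   data perm.q4; merge s.sales g.goals; by region; run;
endrsubmit;

/* A single wait at the end: there is no barrier between the stages. */
waitfor _all_ sales goals m1 m2 m3 m4;
signoff _all_;
```

All four merge sessions are started here because every pipe needs a reader. A connection attempt may time out if the session at the other end starts much later, so check the SASESOCK engine documentation for your release if sessions are slow to start.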