SAS Scalable Performance Data Server
 

MP CONNECT Combined with SPD Server - Tips and Tricks

The following example iteratively modifies an existing serial SAS job by applying different scalable SAS solutions. The first iteration uses MP CONNECT. The second iteration uses the SPD Server product. The final iteration maximizes the performance gain by combining MP CONNECT with SPD Server.

Hardware Details for this Application

Serial Base SAS Execution

MP CONNECT Implementation

SPD Server Implementation

MP CONNECT and SPD Server Combined

Graphical Results of Performance Gains

Hardware Details for this Application

The following information describes the hardware that was used to run the four iterations of the application described below:

Serial Base SAS Execution

The following serial SAS job consists of sorts, DATA steps, and a final merge that together create the final data sets:



libname base1 '/DATA01/spds301' ;
libname base2 '/DATA02/spds301' ;

/** sort BASE1.TYPE1 to create BASE1.TYPE1_SORT**/
proc sort data=base1.type1 out=base1.type1_sort ;
   by group_key type1_key ;
run ;

/** sort BASE1.FILTER_RECS to create BASE1.FILTER_RECS_SORT **/
proc sort data=base1.filter_recs out=base1.filter_recs_sort ;
   by group_key type1_key ;
run ;

/** merge BASE1.TYPE1_SORT and BASE1.FILTER_RECS_SORT to **/
/**  create BASE1.TYPE1_CALC                             **/
data base1.type1_calc (keep=group_key type1_group_size
                       sortedby=group_key) ;

    merge base1.type1_sort (in=a) base1.filter_recs_sort (in=b) ;

    by group_key type1_key ;

    retain type1_group_size 0;

    if first.type1_key then type1_group_size = .;

    if a and b
    then delete ;
    else if a
         then do ;
            type1_group_size = sum(type1_group_size,size1) ;
            if last.group_key then output base1.type1_calc ;
         end ;
run ;

proc datasets library=base1 ;
   delete type1_sort filter_recs_sort ;
run ;
quit ;

/** sort BASE2.TYPE2  **/
proc sort data=base2.type2 out=base2.type2_sort ;
   by group_key type2_key ;
run ;

/** create BASE2.TYPE2_CALC and BASE2.TYPE2_CNT **/
data base2.type2_calc (sortedby=group_key)
     base2.type2_cnt (sortedby=group_key) ;

   set base2.type2_sort ;

   by group_key type2_key ;

   retain type2_size 0 max_type2 0 type2_cnt 0 ;

   if first.type2_key
   then do ;
      max_type2 = 0 ;
      type2_size = . ;
   end ;

   if first.type2_key
   then do ;
       type2_size = sum(type2_size,size2) ;
       type2_cnt = 0 ;
   end ;

   /** count the observations in the current TYPE2_KEY group **/
   type2_cnt = type2_cnt + 1 ;

   if last.type2_key and max_type2 < type2_cnt then max_type2 = type2_cnt ;

   if last.type2_key
   then do ;
       output base2.type2_calc ;
       if max_type2 > 0 then output base2.type2_cnt ;
   end ;
run ;

proc datasets library=base2 ;
   delete type2_sort ;
run ;
quit ;

/** merge BASE1.TYPE1_CALC and BASE2.TYPE2_CALC **/
data base2.both_types ;
    merge base1.type1_calc (in=a) base2.type2_calc (in=b) ;
    by group_key ;
run ;

MP CONNECT Implementation

The following example uses MP CONNECT. This application can be divided into at least two independent units of work that can run in parallel. In this iteration, MP CONNECT executes the two independent tasks in parallel and then synchronizes execution before the final, dependent task.

libname base1 '/DATA01/spds301' ;
libname base2 '/DATA02/spds301' ;

/** create TASK1 to execute first independent task **/
signon task1 sascmd='sas' ;

  rsubmit task1 wait=no ;

  libname base1 '/DATA01/spds301' ;

  /** sort BASE1.TYPE1 to create BASE1.TYPE1_SORT **/
  proc sort data=base1.type1 out=base1.type1_sort ;
     by group_key type1_key ;
  run ;

  /** sort BASE1.FILTER_RECS to create BASE1.FILTER_RECS_SORT **/
  proc sort data=base1.filter_recs out=base1.filter_recs_sort;
     by group_key type1_key ;
  run ;

  /** merge BASE1.TYPE1_SORT and BASE1.FILTER_RECS_SORT to create **/
  /** BASE1.TYPE1_CALC                                            **/
  data base1.type1_calc (keep=group_key type1_group_size
                         sortedby=group_key) ;

     merge base1.type1_sort (in=a) base1.filter_recs_sort (in=b) ;

     by group_key type1_key ;

     retain type1_group_size 0;

     if first.type1_key then type1_group_size = .;

     if a and b
     then delete ;
     else if a
          then do ;
             type1_group_size = sum(type1_group_size,size1) ;
             if last.group_key then output base1.type1_calc ;
          end ;
  run ;

  proc datasets library=base1 ;
     delete type1_sort filter_recs_sort ;
  run ;
  quit ;

  endrsubmit ;

/** create TASK2 to execute second independent task **/
signon task2 sascmd='sas' ;

  rsubmit task2 wait=no ;

  libname base2 '/DATA02/spds301' ;

  /** sort BASE2.TYPE2 **/
  proc sort data=base2.type2 out=base2.type2_sort ;
     by group_key type2_key ;
  run ;

  /** create BASE2.TYPE2_CALC and BASE2.TYPE2_CNT **/
  data base2.type2_calc (sortedby=group_key)
       base2.type2_cnt (sortedby=group_key) ;

     set base2.type2_sort ;

     by group_key type2_key ;

     retain type2_size 0 max_type2 0 type2_cnt 0 ;

     if first.type2_key
     then do ;
        max_type2 = 0 ;
        type2_size = . ;
     end ;

     if first.type2_key
     then do ;
        type2_size = sum(type2_size,size2) ;
        type2_cnt = 0 ;
     end ;

     /** count the observations in the current TYPE2_KEY group **/
     type2_cnt = type2_cnt + 1 ;

     if last.type2_key and max_type2 < type2_cnt then
        max_type2 = type2_cnt ;

     if last.type2_key
     then do ;
         output base2.type2_calc ;
         if max_type2 > 0 then output base2.type2_cnt ;
     end ;
  run ;

  proc datasets library=base2 ;
     delete type2_sort ;
  run ;
  quit ;

  endrsubmit ;

/** wait for both TASK1 and TASK2 to complete so that **/
/** the merge can be done.                            **/
waitfor _all_ task1 task2 ;

signoff task1 ;
signoff task2 ;

/** merge BASE1.TYPE1_CALC and BASE2.TYPE2_CALC **/
data base2.both_types ;
    merge base1.type1_calc (in=a) base2.type2_calc (in=b) ;
    by group_key ;
run ;
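
The synchronization step can also report how each asynchronous task ended. The following is a minimal sketch, not part of the original job. It assumes that the CMACVAR= option of RSUBMIT is available in the installed SAS/CONNECT release. The macro variable names TASK1_STATUS and TASK2_STATUS are hypothetical, and the task bodies are elided:

```sas
/** start both sessions as in the job above **/
signon task1 sascmd='sas' ;
signon task2 sascmd='sas' ;

/** CMACVAR= names a macro variable in the parent session    **/
/** that tracks the state of the asynchronous RSUBMIT block. **/
/** TASK1_STATUS and TASK2_STATUS are hypothetical names.    **/
rsubmit task1 wait=no cmacvar=task1_status ;
   /* first independent unit of work goes here */
endrsubmit ;

rsubmit task2 wait=no cmacvar=task2_status ;
   /* second independent unit of work goes here */
endrsubmit ;

/** block until both asynchronous tasks have finished **/
waitfor _all_ task1 task2 ;

/** 0 = RSUBMIT complete, 1 = failed to execute, 2 = still in progress **/
%put task1 status: &task1_status ;
%put task2 status: &task2_status ;

/** bring the spooled remote log and output into the parent session **/
rget task1 ;
rget task2 ;

signoff task1 ;
signoff task2 ;
```

A status of 0 indicates only that the RSUBMIT block finished. The retrieved logs should still be checked for errors before the final merge is trusted.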

SPD Server Implementation

The following example uses the SPD Server product, which can perform an implicit sort whenever BY processing is requested. As a result, the explicit PROC SORT steps and their intermediate data sets are no longer needed. This job performs the following tasks:

%let domain=spds302 ;
%let serv=5120 ;

/* assumes that SPD Server has been initialized and is running */
libname &domain sasspds "&domain" host='hostname' serv="&serv"
        user='anonymous' unixdomain=YES netcomp=NO ;

/** create SPDS302.TYPE1_CALC from SPDS302.TYPE1 and SPDS302.FILTER_RECS **/
data &domain..type1_calc (keep=group_key type1_group_size
                          sortedby=group_key) ;

    merge &domain..type1 (in=a)
          &domain..filter_recs (in=b) ;

    by group_key type1_key ;

    retain type1_group_size 0;

    if first.type1_key then type1_group_size = .;

    if a and b
    then delete ;
    else if a
         then do ;
            type1_group_size = sum(type1_group_size,size1) ;
            if last.group_key then output &domain..type1_calc ;
         end ;
run ;

/** create SPDS302.TYPE2_CALC and SPDS302.TYPE2_CNT from SPDS302.TYPE2 **/
data &domain..type2_calc (sortedby=group_key)
     &domain..type2_cnt (sortedby=group_key) ;

   set &domain..type2 ;

   by group_key type2_key ;

   retain type2_size 0 max_type2 0 type2_cnt 0 ;

   if first.type2_key
   then do ;
      max_type2 = 0 ;
      type2_size = . ;
   end ;

   if first.type2_key
   then do ;
       type2_size = sum(type2_size,size2) ;
       type2_cnt = 0 ;
   end ;

   /** count the observations in the current TYPE2_KEY group **/
   type2_cnt = type2_cnt + 1 ;

   if last.type2_key and max_type2 < type2_cnt then max_type2 = type2_cnt ;

   if last.type2_key
   then do ;
       output &domain..type2_calc ;
       if max_type2 > 0 then output &domain..type2_cnt ;
   end ;
run ;

/** merge SPDS302.TYPE1_CALC and SPDS302.TYPE2_CALC **/
data &domain..both_types ;
    merge &domain..type1_calc (in=a) &domain..type2_calc (in=b) ;
    by group_key ;
run ;

MP CONNECT and SPD Server Combined

The final iteration combines the MP CONNECT implementation with the SPD Server implementation for the maximum gain in performance. This solution benefits from the parallel task execution provided by MP CONNECT and from the implicit sort capability of SPD Server, which together minimize the total elapsed time needed to run the job. The final application is structured as follows:

%let domain=spds302 ;
%let serv=5120 ;

/* assumes that SPD Server has been initialized and is running */
libname &domain sasspds "&domain" host='hostname' serv="&serv"
        user='anonymous' unixdomain=YES netcomp=NO ;

/** create TASK1 to execute first independent task **/
signon task1 sascmd='sas' ;

rsubmit task1 wait=no ;

%let domain=spds302 ;
%let serv=5120 ;

/* assumes that SPD Server has been initialized and is running */
libname &domain sasspds "&domain" host='hostname' serv="&serv"
        user='anonymous' unixdomain=YES netcomp=NO ;

/** merge SPDS302.TYPE1 and SPDS302.FILTER_RECS to create **/
/** SPDS302.TYPE1_CALC                                    **/
data &domain..type1_calc (keep=group_key type1_group_size
                          sortedby=group_key) ;

    merge &domain..type1 (in=a)
          &domain..filter_recs (in=b) ;

    by group_key type1_key ;

    retain type1_group_size 0;

    if first.type1_key then type1_group_size = .;

    if a and b
    then delete ;
    else if a
         then do ;
            type1_group_size = sum(type1_group_size,size1) ;
            if last.group_key then output &domain..type1_calc ;
         end ;
run ;
endrsubmit ;

/** create TASK2 to execute second independent task **/
signon task2 sascmd='sas' ;

rsubmit task2 wait=no ;

%let domain=spds302 ;
%let serv=5120 ;

/* assumes that SPD Server has been initialized and is running */
libname &domain sasspds "&domain" host='hostname' serv="&serv"
        user='anonymous' unixdomain=YES netcomp=NO ;

/** create SPDS302.TYPE2_CALC and SPDS302.TYPE2_CNT from SPDS302.TYPE2 **/
data &domain..type2_calc (sortedby=group_key)
     &domain..type2_cnt (sortedby=group_key) ;

   set &domain..type2 ;

   by group_key type2_key ;

   retain type2_size 0 max_type2 0 type2_cnt 0 ;

   if first.type2_key
   then do ;
      max_type2 = 0 ;
      type2_size = . ;
   end ;

   if first.type2_key
   then do ;
       type2_size = sum(type2_size,size2) ;
       type2_cnt = 0 ;
   end ;

   /** count the observations in the current TYPE2_KEY group **/
   type2_cnt = type2_cnt + 1 ;

   if last.type2_key and max_type2 < type2_cnt then
      max_type2 = type2_cnt ;

   if last.type2_key
   then do ;
       output &domain..type2_calc ;
       if max_type2 > 0 then output &domain..type2_cnt ;
   end ;
run ;
endrsubmit ;

/** wait for both TASK1 and TASK2 to complete so that **/
/** the merge can be done.                            **/
waitfor _all_ task1 task2 ;

signoff task1 ;
signoff task2 ;

/** merge SPDS302.TYPE1_CALC and SPDS302.TYPE2_CALC **/
data &domain..both_types ;
    merge &domain..type1_calc (in=a)
          &domain..type2_calc (in=b) ;
    by group_key ;
run ;

Graphical Results of Performance Gains

The following graph shows the relative performance gain for each iteration of this application. The X-axis represents time, and the Y-axis represents the four iterations of the solution that are detailed above. The SPDS/MP bar at the top illustrates the maximum performance gain: the combined job runs in approximately 25% of the time required by the original serial job.
