Parallel Group-By Facility

Overview of the Parallel Group-By Facility

Nested Queries Meet Group-By Syntax Requirements

Overview of the Parallel Group-By Facility

SPD Server SQL Planner optimizations improve the performance of the more frequent query types used in data mining solutions. One of the SQL planner optimizations integrated into SPD Server is tighter integration of the Parallel Group-By capability. Parallel Group-By is a high performance parallel summarization of data executed using SQL. Parallel Group-By is often used in SQL queries (through the use of sub queries) to apply selection lists for inclusion or exclusion. The tighter integration adds performance benefits to nested Group-By syntax.

Parallel Group-By looks for specific patterns in a query that can be performed using parallel processing summarization. Parallel Group-By works against single tables that are used to aggregate data. Parallel processing summarization is limited to the types of functions it can handle.

The Parallel Group-By support in SPD Server has been expanded in many areas. Parallel Group-By is integrated into the WHERE clause planner code so that it will boost the capabilities of the SPD Server SQL engine. Any section of code that matches the Parallel Group-By trigger pattern will use it.

Enhanced Group-By Functions

Parallel Group-By now supports the following functions in syntax: COUNT, FREQ, N, USS, CSS, AVG, MEAN, MAX, MIN, NMISS, RANGE, STD, STDERR, SUM, VAR. These functions all can accept the DISTINCT term. The listed functions are the minimum summary functions that are required in order to support the SAS Enterprise Marketing Automation tool suite.

Table Aliases Supported

Table aliases are now supported in SPD Server in order to better support front end tools such as SAS Enterprise Marketing Automation. Tools such as SAS Enterprise Marketing Automation generate SQL queries that use table aliases. Table aliases enable both shorter coding syntax and a method to select a specific column in a query that has two tables that share common column names.

Nested Queries Meet Group-By Syntax Requirements

Since the Parallel Group-By functionality is integrated into the SPD Server WHERE clause planner, now many sections of queries can take advantage of performance enhancements such as parallel processing. Some common performance enhancements are sub-queries that generate value lists in an IN clause, views that now conform to Parallel Group-By syntax, and views that contain nested Group-By syntax.

General Syntax:

SELECT 'project-list' FROM 'table-name' ;

WHERE [where-expression];

GROUP BY [groupby-list];

HAVING [having-expression];

ORDER BY [orderby-list];

project-list

Items must be either column names (which must appear in the groupby-list) or aggregate (summary) functions involving a single column (with the exception of count(*), which accepts an asterisk argument. At least one aggregate function must be specified. Project items can be aliased (for example, select avg(salary) as avgsal from ) and these aliases can appear in any where-expression, having-expression, groupby-list or orderby-list. The following aggregate functions are supported: count, avg, avg distinct, count distinct, css, max, min, nmiss, sum, sum distinct, supportc, range, std, stderr, uss, var. Mean is a synonym for avg. Freq and n are synonyms for count except they do not accept the asterisk argument.

table-name

Table names can be one- or two-part identifiers (for example, mytable or foo.mytable), the latter requiring a previous libref statement to define the domain identifier (for example, foo).

The where-expression is optional.

The optional groupby-list must be column names or projected aliases.

The optional having-expression must be a Boolean expression composed of aggregate functions, GROUP BY columns, or constants.

The optional orderby-list must be projected column names or aliases or numbers that represent the position of a projected item (for example, select a, count (*) order by 2).

Formatted Parallel Group Select

By default, the columns of a group-by statement are grouped by their unformatted value. SQL pass-through parallel GROUP BY provides the capability to also group data by the columns output data format. For example, you could group by the date column of a table with an input format of mmddyy8 and an output format of monname9. Suppose the column has dates 01/01/04 and 01/02/04. Grouping by the unformatted value would put these dates into two separate groups. However, grouping by the formatted month name, would put these values into the same month grouping of January.

You enable or disable pass-through formatted parallel GROUP BY with the following execute commands:

     PROC SQL;
     connect to sasspds
      (dbq=........);

     /* turn on formatted parallel group-by */
     execute(reset fmtgrpsel)
      by sasspds;

     select *
     from connection
     to sasspds
      (select dte
       from mytable
       groupby dte);

     /* turn off formatted parallel group-by */
     execute(reset nofmtgrpsel)
      by sasspds;

     select *
     from connection
     to sasspds
      (select dte
       from mytable
       groupby dte);

     quit;

The example code below is extracted from a larger block of code, whose purpose is to make computations based on user-defined classes of age, such as Child, Adolescent, Adult, and Pensioner. The code uses SQL Parallel Group-By features to create the user-defined classes and then uses them to perform aggregate summaries and calculations.

/* Use the parallel group-by feature with the   */
/* fmtgrpsel option. This groups the data based */
/* on the output format specified in the table. */
/* This will be executed in parallel.           */

PROC SQL;
connect to sasspds
 (dbq="&domain"
  serv="&serv"
  host="&host"
  user="anonymous");

 /* Explicitly set the fmtgrpsel option */

 execute(reset fmtgrpsel)
  by sasspds;

 title 'Simple Fmtgrpsel Example';
 select *
 from connection to sasspds
  (select age, count(*) as count
   from fmttest group by age);

 disconnect from sasspds;
 quit;

PROC SQL;
connect to sasspds
 (dbq="&domain"
  serv="&serv"
  host="&host"
  user="anonymous");

 /* Explicitly set the fmtgrpsel option */

 execute(reset fmtgrpsel)
  by sasspds;

 title 'Format Both Columns Group Select Example';

 select *
 from connection to sasspds
  (select
    GENDER format=$GENDER.,
    AGE format=AGEGRP.,
   count(*) as count
   from fmttest
   formatted group by GENDER, AGE);

 disconnect from sasspds;

 quit;

PROC SQL;
connect to sasspds
 (dbq="&domain"
  serv="&serv"
  host="&host"
  user="anonymous");

 /* Explicitly set the fmtgrpsel option */

 execute(reset fmtgrpsel)
  by sasspds;

 title1 'To use Format on Only One Column With Group Select';
 title2 'Override Column Format With a Starndard Format';

 select *
 from connection to sasspds
  (select
   GENDER format=$1.,
   AGE format=AGEGRP.,
   count(*) as count
   from fmttest
   formatted group by GENDER, AGE);

 disconnect from sasspds;

 quit;

 /* A WHERE clause that uses a format to subset  */
 /* data is pushed to the server. If it is not   */
 /* pushed to the server, the following warning  */
 /* message will be written to the SAS log:      */
 /* WARNING: Server is unable to execute the     */
 /* where clause.                                */

 data temp;
 set &domain..fmttest;
  where put
   (AGE,AGEGRP.) = 'Child';
 run;

For the complete code example, see User-Defined Formats Example.