Parallel Group-By Facility

Overview of the Parallel Group-By Facility

Nested Queries Meet Group-By Syntax Requirements

Overview of the Parallel Group-By Facility

SPD Server SQL Planner optimizations improve the performance of the more frequent query types used in data mining solutions. One of the SQL Planner optimizations is the Parallel Group-By capability. Parallel Group-By is a high-performance parallel summarization of data that is executed using SQL. Parallel Group-By is often used in SQL queries (through the use of sub queries) to apply selection lists for inclusion or exclusion. The tighter integration adds performance benefits to nested Group-By syntax.

Parallel Group-By looks for specific patterns in a query that can be performed by using parallel processing summarization. Parallel Group-By works against single tables that are used to aggregate data. Parallel processing summarization is limited to the types of functions that it can handle.

The Parallel Group-By support in SPD Server is integrated into the WHERE clause planner code so that it boosts the capabilities of the SPD Server SQL engine. Any section of code that matches the Parallel Group-By trigger pattern will use Parallel Group-By support.

Enhanced Group-By Functions

Parallel Group-By supports the following functions in syntax: COUNT, FREQ, N, USS, CSS, AVG, MEAN, MAX, MIN, NMISS, RANGE, STD, STDERR, SUM, VAR. All these functions can accept the DISTINCT term. These functions are the minimum summary functions that are required in order to support the SAS Marketing Automation tool suite.

Nested Queries Meet Group-By Syntax Requirements

Because the Parallel Group-By functionality is integrated into the SPD Server WHERE clause planner, many sections of queries can take advantage of performance enhancements such as parallel processing. Some common performance enhancements are subqueries that generate value lists in an IN clause, views that conform to Parallel Group-By syntax, and views that contain nested Group-By syntax.

General Syntax:

SELECT 'project-list' FROM 'table-name' ;

WHERE [where-expression];

GROUP BY [groupby-list];

HAVING [having-expression];

ORDER BY [orderby-list];

project-list

Items must be either column names (which must appear in the groupby-list value) or aggregate (summary) functions that involve a single column [with the exception of count(*), which accepts an asterisk argument]. You must specify at least one aggregate function. You can use an alias for project items [for example, SELECT avg(salary) AS avgsal FROM... ]. These aliases can appear in any where-expression, having-expression, groupby-list or orderby-list value. The following aggregate functions are supported: count, avg, avg distinct, count distinct, css, max, min, nmiss, sum, sum distinct, supportc, range, std, stderr, uss, var. Mean is a synonym for avg. Freq and n are synonyms for count, but these values do not accept the asterisk argument.

table-name

Table names can be one- or two-part identifiers (for example, mytable or foo.mytable). Identifiers such as foo.mytable require a previous libref statement to define the domain identifier (for example, foo).

where-expression

This value is optional

groupby-list

This value is optional. The value must be column names or projected aliases.

having-expression

This value is optional. The value must be a Boolean expression composed of aggregate functions, GROUP BY columns, or constants.

orderby-list

This value is option. The value must be projected column names, aliases, or numbers that represent the position of a projected item [for example, SELECT a, COUNT(*) ORDER BY 2].

Formatted Parallel Group Select

By default, the columns of a group-by statement are grouped by their unformatted value. You can use SQL pass-through parallel GROUP BY to group data by the columns output data format. For example, you can group by the date column of a table with an input format of mmddyy8 and an output format of monname9. Suppose the column has dates 01/01/04 and 01/02/04. If you group by the unformatted value, these dates will be put into two separate groups. However, if you group by the formatted month name, these values will be put into the same month grouping of January.

You enable or disable pass-through formatted parallel group-by with the following execute commands:

     PROC SQL;
     connect to sasspds
      (dbq=........);

     /* turn on formatted parallel group-by */
     execute(reset fmtgrpsel)
      by sasspds;

     select *
     from connection
     to sasspds
      (select dte
       from mytable
       groupby dte);

     /* turn off formatted parallel group-by */
     execute(reset nofmtgrpsel)
      by sasspds;

     select *
     from connection
     to sasspds
      (select dte
       from mytable
       groupby dte);

     quit;

The following example code is extracted from a larger block of code, whose purpose is to make computations based on user-defined classes of age, such as Child, Adolescent, Adult, and Pensioner. The code uses SQL Parallel Group-By features to create the user-defined classes, and then uses them to perform aggregate summaries and calculations.

/* Use the parallel group-by feature with the   */
/* fmtgrpsel option. This groups the data based */
/* on the output format specified in the table. */
/* This will be executed in parallel.           */

PROC SQL;
connect to sasspds
 (dbq="&domain"
  serv="&serv"
  host="&host"
  user="anonymous");

 /* Explicitly set the fmtgrpsel option */

 execute(reset fmtgrpsel)
  by sasspds;

 title 'Simple Fmtgrpsel Example';
 select *
 from connection to sasspds
  (select age, count(*) as count
   from fmttest group by age);

 disconnect from sasspds;
 quit;

PROC SQL;
connect to sasspds
 (dbq="&domain"
  serv="&serv"
  host="&host"
  user="anonymous");

 /* Explicitly set the fmtgrpsel option */

 execute(reset fmtgrpsel)
  by sasspds;

 title 'Format Both Columns Group Select Example';

 select *
 from connection to sasspds
  (select
    GENDER format=$GENDER.,
    AGE format=AGEGRP.,
   count(*) as count
   from fmttest
   formatted group by GENDER, AGE);

 disconnect from sasspds;

 quit;

PROC SQL;
connect to sasspds
 (dbq="&domain"
  serv="&serv"
  host="&host"
  user="anonymous");

 /* Explicitly set the fmtgrpsel option */

 execute(reset fmtgrpsel)
  by sasspds;

 title1 'To use Format on Only One Column With Group Select';
 title2 'Override Column Format With a Starndard Format';

 select *
 from connection to sasspds
  (select
   GENDER format=$1.,
   AGE format=AGEGRP.,
   count(*) as count
   from fmttest
   formatted group by GENDER, AGE);

 disconnect from sasspds;

 quit;

 /* A WHERE clause that uses a format to subset  */
 /* data is pushed to the server. If it is not   */
 /* pushed to the server, the following warning  */
 /* message will be written to the SAS log:      */
 /* WARNING: Server is unable to execute the     */
 /* where clause.                                */

 data temp;
 set &domain..fmttest;
  where put
   (AGE,AGEGRP.) = 'Child';
 run;

For the complete code example, see User-Defined Formats.