Grouping Data

The GROUP BY clause groups data by a specified column or columns. When you use a GROUP BY clause, you also use an aggregate function in the SELECT clause or in a HAVING clause to instruct PROC SQL in how to summarize the data for each group. PROC SQL calculates the aggregate function separately for each group.

Grouping by One Column

The following example sums the populations of all countries to find the total population of each continent:
libname sql 'SAS-library';

proc sql;
   title 'Total Populations of World Continents';
   select Continent, sum(Population) format=comma14. as TotalPopulation
      from sql.countries
      where Continent is not missing
      group by Continent;
Note: Countries for which a continent is not listed are excluded by the WHERE clause.
Grouping by One Column
Total Populations of World Continents

Grouping without Summarizing

When you use a GROUP BY clause without an aggregate function, PROC SQL treats the GROUP BY clause as if it were an ORDER BY clause and displays a message in the log that informs you that this has happened. The following example attempts to group high and low temperature information for each city in the SQL.WORLDTEMPS table by country:
libname sql 'SAS-library';

proc sql outobs=12;
   title 'High and Low Temperatures';
   select City, Country, AvgHigh, AvgLow
      from sql.worldtemps
      group by Country;
The output and log show that PROC SQL transforms the GROUP BY clause into an ORDER BY clause.
Grouping without Aggregate Functions
High and Low Temperatures
Grouping without Aggregate Functions (Partial Log)
 WARNING: A GROUP BY clause has been transformed into an ORDER BY clause because

         neither the SELECT clause nor the optional HAVING clause of the 
         associated table-expression referenced a summary function.

Grouping by Multiple Columns

To group by multiple columns, separate the column names with commas within the GROUP BY clause. You can use aggregate functions with any of the columns that you select. The following example groups by both Location and Type, producing total square miles for the deserts and lakes in each location in the SQL.FEATURES table:
libname sql 'SAS-library';

proc sql;
   title 'Total Square Miles of Deserts and Lakes';
   select Location, Type, sum(Area) as TotalArea format=comma16.
      from sql.features
      where type in ('Desert', 'Lake')
      group by Location, Type;
Grouping by Multiple Columns
Total Square Miles of Deserts and Lakes

Grouping and Sorting Data

You can order grouped results with an ORDER BY clause. The following example takes the previous example and adds an ORDER BY clause to change the order of the Location column from ascending order to descending order:
libname sql 'SAS-library';

proc sql;
   title 'Total Square Miles of Deserts and Lakes';
   select Location, Type, sum(Area) as TotalArea format=comma16.
      from sql.features
      where type in ('Desert', 'Lake')
      group by Location, Type
      order by Location desc;
Grouping with an ORDER BY Clause
Total Square Miles of Deserts and Lakes

Grouping with Missing Values

Finding Grouping Errors Caused by Missing Values

When a column contains missing values, PROC SQL treats the missing values as a single group. This can sometimes provide unexpected results.
In this example, because the SQL.COUNTRIES table contains some missing values in the Continent column, the missing values combine to form a single group that has the total area of the countries that have a missing value in the Continent column:
libname sql 'SAS-library';

/* unexpected output */

proc sql outobs=12;
   title 'Areas of World Continents';
   select Name format=$25., 
          Continent, 
          sum(Area) format=comma12. as TotalArea
      from sql.countries
      group by Continent
      order by Continent, Name;
The output is incorrect because Bermuda, Iceland, and Kalaallit Nunaat are not actually part of the same continent. However, PROC SQL treats them that way because they all have a missing character value in the Continent column.
Finding Grouping Errors Caused by Missing Values (Unexpected Output)
Areas of World Continents
To correct the query from the previous example, you can write a WHERE clause to exclude the missing values from the results:
/* modified output */

proc sql outobs=12;
   title 'Areas of World Continents';
   select Name format=$25.,
          Continent, 
          sum(Area) format=comma12. as TotalArea
      from sql.countries
      where Continent is not missing
      group by Continent
      order by Continent, Name;
Adjusting the Query to Avoid Errors Due to Missing Values (Modified Output)
Areas of World Continents
Note: Aggregate functions, such as the SUM function, can cause the same calculation to repeat for every row. This occurs whenever PROC SQL remerges data. See Remerging Summary Statistics for more information about remerging.