Retrieving Data from a Single Table |
The GROUP BY clause groups data by a specified column or columns. When you use a GROUP BY clause, you also use an aggregate function in the SELECT clause or in a HAVING clause to instruct PROC SQL in how to summarize the data for each group. PROC SQL calculates the aggregate function separately for each group.
Grouping by One Column |
The following example sums the populations of all countries to find the total population of each continent:
proc sql; title 'Total Populations of World Continents'; select Continent, sum(Population) format=comma14. as TotalPopulation from sql.countries where Continent is not missing group by Continent;
Note: Countries for which a continent is not listed are excluded by the WHERE clause.
Total Populations of World Continents Total Continent Population --------------------------------------------------- Africa 710,529,592 Asia 3,381,858,879 Australia 18,255,944 Central America and Caribbean 66,815,930 Europe 872,192,202 North America 384,801,818 Oceania 5,342,368 South America 317,568,801
Grouping without Summarizing |
When you use a GROUP BY clause without an aggregate function, PROC SQL treats the GROUP BY clause as if it were an ORDER BY clause and displays a message in the log that informs you that this has happened. The following example attempts to group high and low temperature information for each city in the SQL.WORLDTEMPS table by country:
proc sql outobs=12; title 'High and Low Temperatures'; select City, Country, AvgHigh, AvgLow from sql.worldtemps group by Country;
The output and log show that PROC SQL transforms the GROUP BY clause into an ORDER BY clause.
Grouping without Aggregate Functions
High and Low Temperatures City Country AvgHigh AvgLow ------------------------------------------------------- Algiers Algeria 90 45 Buenos Aires Argentina 87 48 Sydney Australia 79 44 Vienna Austria 76 28 Nassau Bahamas 88 65 Hamilton Bermuda 85 59 Sao Paulo Brazil 81 53 Rio de Janeiro Brazil 85 64 Quebec Canada 76 5 Montreal Canada 77 8 Toronto Canada 80 17 Beijing China 86 17
Grouping without Aggregate Functions (Partial Log)
WARNING: A GROUP BY clause has been transformed into an ORDER BY clause because neither the SELECT clause nor the optional HAVING clause of the associated table-expression referenced a summary function.
Grouping by Multiple Columns |
To group by multiple columns, separate the column names with commas within the GROUP BY clause. You can use aggregate functions with any of the columns that you select. The following example groups by both Location and Type, producing total square miles for the deserts and lakes in each location in the SQL.FEATURES table:
proc sql; title 'Total Square Miles of Deserts and Lakes'; select Location, Type, sum(Area) as TotalArea format=comma16. from sql.features where type in ('Desert', 'Lake') group by Location, Type;
Total Square Miles of Deserts and Lakes Location Type TotalArea -------------------------------------------------- Africa Desert 3,725,000 Africa Lake 50,958 Asia Lake 25,300 Australia Desert 300,000 Canada Lake 12,275 China Desert 500,000 Europe - Asia Lake 143,550 North America Desert 140,000 North America Lake 77,200 Russia Lake 11,780 Saudi Arabia Desert 250,000
Grouping and Sorting Data |
You can order grouped results with an ORDER BY clause. The following example takes the previous example and adds an ORDER BY clause to change the order of the Location column from ascending order to descending order:
proc sql; title 'Total Square Miles of Deserts and Lakes'; select Location, Type, sum(Area) as TotalArea format=comma16. from sql.features where type in ('Desert', 'Lake') group by Location, Type order by Location desc;
Grouping with an ORDER BY Clause
Total Square Miles of Deserts and Lakes Location Type TotalArea -------------------------------------------------- Saudi Arabia Desert 250,000 Russia Lake 11,780 North America Lake 77,200 North America Desert 140,000 Europe - Asia Lake 143,550 China Desert 500,000 Canada Lake 12,275 Australia Desert 300,000 Asia Lake 25,300 Africa Desert 3,725,000 Africa Lake 50,958
Grouping with Missing Values |
When a column contains missing values, PROC SQL treats the missing values as a single group. This can sometimes provide unexpected results.
In this example, because the SQL.COUNTRIES table contains some missing values in the Continent column, the missing values combine to form a single group that has the total area of the countries that have a missing value in the Continent column:
/* incorrect output */ proc sql outobs=12; title 'Areas of World Continents'; select Name format=$25., Continent, sum(Area) format=comma12. as TotalArea from sql.countries group by Continent order by Continent, Name;
The output is incorrect because Bermuda, Iceland, and Kalaallit Nunaat are not actually part of the same continent; however, PROC SQL treats them that way because they all have a missing character value in the Continent column.
Finding Grouping Errors Caused by Missing Values (Incorrect Output)
Areas of World Continents Name Continent TotalArea ----------------------------------------------------------------------- Bermuda 876,800 Iceland 876,800 Kalaallit Nunaat 876,800 Algeria Africa 11,299,595 Angola Africa 11,299,595 Benin Africa 11,299,595 Botswana Africa 11,299,595 Burkina Faso Africa 11,299,595 Burundi Africa 11,299,595 Cameroon Africa 11,299,595 Cape Verde Africa 11,299,595 Central African Republic Africa 11,299,595
To correct the query from the previous example, you can write a WHERE clause to exclude the missing values from the results:
/* corrected output */ proc sql outobs=12; title 'Areas of World Continents'; select Name format=$25., Continent, sum(Area) format=comma12. as TotalArea from sql.countries where Continent is not missing group by Continent order by Continent, Name;
Adjusting the Query to Avoid Errors Due to Missing Values (Corrected Output)
Areas of World Continents Name Continent TotalArea ----------------------------------------------------------------------- Algeria Africa 11,299,595 Angola Africa 11,299,595 Benin Africa 11,299,595 Botswana Africa 11,299,595 Burkina Faso Africa 11,299,595 Burundi Africa 11,299,595 Cameroon Africa 11,299,595 Cape Verde Africa 11,299,595 Central African Republic Africa 11,299,595 Chad Africa 11,299,595 Comoros Africa 11,299,595 Congo Africa 11,299,595
Note: Aggregate functions, such as the SUM function, can cause the same calculation to repeat for every row. This occurs whenever PROC SQL remerges data. See Remerging Summary Statistics for more information about remerging.
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.