Retrieving Data from a Single Table |
You can use an aggregate function (or summary function) to produce a statistical summary of data in a table. The aggregate function instructs PROC SQL in how to combine data in one or more columns. If you specify one column as the argument to an aggregate function, then the values in that column are calculated. If you specify multiple arguments, then the arguments or columns that are listed are calculated.
Note: When more than one argument is used within an SQL aggregate function, the function is no longer considered to be an SQL aggregate or summary function. If there is a like-named Base SAS function, then PROC SQL executes the Base SAS function and the results that are returned are based on the values for the current row. If no like-named Base SAS function exists, then an error will occur. For example, if you use multiple arguments for the AVG function, an error will occur because there is no AVG function for Base SAS.
When you use an aggregate function, PROC SQL applies the function to the entire table, unless you use a GROUP BY clause. You can use aggregate functions in the SELECT or HAVING clauses.
Note: See Grouping Data for information about producing summaries of individual groups of data within a table.
Using Aggregate Functions |
The following table lists the aggregate functions that you can use:
Function | Definition |
---|---|
AVG, MEAN | mean or average of values |
COUNT, FREQ, N | number of nonmissing values |
CSS | corrected sum of squares |
CV | coefficient of variation (percent) |
MAX | largest value |
MIN | smallest value |
NMISS | number of missing values |
PRT | probability of a greater absolute value of Student's t |
RANGE | range of values |
STD | standard deviation |
STDERR | standard error of the mean |
SUM | sum of values |
SUMWGT | sum of the WEIGHT variable values (table note 1) |
T | Student's t value for testing the hypothesis that the population mean is zero |
USS | uncorrected sum of squares |
VAR | variance |
TABLE NOTE 1: In the SQL procedure, each row has a weight of 1.
Note: You can use most other SAS functions in PROC SQL, but they are not treated as aggregate functions.
Summarizing Data with a WHERE Clause |
You can use aggregate, or summary functions, by using a WHERE clause. For a complete list of the aggregate functions that you can use, see Aggregate Functions.
This example uses the MEAN function to find the annual mean temperature for each country in the SQL.WORLDTEMPS table. The WHERE clause returns countries with a mean temperature that is greater than 75 degrees.
proc sql outobs=12; title 'Mean Temperatures for World Cities'; select City, Country, mean(AvgHigh, AvgLow) as MeanTemp from sql.worldtemps where calculated MeanTemp gt 75 order by MeanTemp desc;
Note: You must use the CALCULATED keyword to reference the calculated column.
Using the MEAN Function with a WHERE Clause
Mean Temperatures for World Cities City Country MeanTemp --------------------------------------------- Lagos Nigeria 82.5 Manila Philippines 82 Bangkok Thailand 82 Singapore Singapore 81 Bombay India 79 Kingston Jamaica 78 San Juan Puerto Rico 78 Calcutta India 76.5 Havana Cuba 76.5 Nassau Bahamas 76.5
Displaying Sums |
The following example uses the SUM function to return the total oil reserves for all countries in the SQL.OILRSRVS table:
proc sql; title 'World Oil Reserves'; select sum(Barrels) format=comma18. as TotalBarrels from sql.oilrsrvs;
Note: The SUM function produces a single row of output for the requested sum because no nonaggregate value appears in the SELECT clause.
World Oil Reserves TotalBarrels ------------------ 878,300,000,000
Combining Data from Multiple Rows into a Single Row |
In the previous example, PROC SQL combined information from multiple rows of data into a single row of output. Specifically, the world oil reserves for each country were combined to form a total for all countries. Combining, or rolling up, of rows occurs when
the SELECT clause contains only columns that are specified within an aggregate function
the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.
Remerging Summary Statistics |
The following example uses the MAX function to find the largest population in the SQL.COUNTRIES table and displays it in a column called MaxPopulation. Aggregate functions, such as the MAX function, can cause the same calculation to repeat for every row. This occurs whenever PROC SQL remerges data. Remerging occurs whenever any of the following conditions exist
The SELECT clause references a column that contains an aggregate function that is not listed in a GROUP BY clause.
The SELECT clause references a column that contains an aggregate function and other column or columns that are not listed in the GROUP BY clause.
One or more columns or column expressions that are listed in a HAVING clause are not included in a subquery or a GROUP BY clause.
In this example, PROC SQL writes the population of China, which is the largest population in the table:
proc sql outobs=12; title 'Largest Country Populations'; select Name, Population format=comma20., max(Population) as MaxPopulation format=comma20. from sql.countries order by Population desc;
Largest Country Populations Name Population MaxPopulation ------------------------------------------------------------------------------- China 1,202,215,077 1,202,215,077 India 929,009,120 1,202,215,077 United States 263,294,808 1,202,215,077 Indonesia 202,393,859 1,202,215,077 Brazil 160,310,357 1,202,215,077 Russia 151,089,979 1,202,215,077 Bangladesh 126,387,850 1,202,215,077 Japan 126,345,434 1,202,215,077 Pakistan 123,062,252 1,202,215,077 Nigeria 99,062,003 1,202,215,077 Mexico 93,114,708 1,202,215,077 Germany 81,890,690 1,202,215,077
In some cases, you might need to use an aggregate function so that you can use its results in another calculation. To do this, you need only to construct one query for PROC SQL to automatically perform both calculations. This type of operation also causes PROC SQL to remerge the data.
For example, if you want to find the percentage of the total world population that resides in each country, then you construct a single query that
obtains the total world population by using the SUM function
divides each country's population by the total world population.
proc sql outobs=12; title 'Percentage of World Population in Countries'; select Name, Population format=comma14., (Population / sum(Population) * 100) as Percentage format=comma8.2 from sql.countries order by Percentage desc;
Note: When a query remerges data, PROC SQL displays a note in the log to indicate that data remerging has occurred.
Percentage of World Population in Countries Name Population Percentage --------------------------------------------------------------- China 1,202,215,077 20.88 India 929,009,120 16.13 United States 263,294,808 4.57 Indonesia 202,393,859 3.52 Brazil 160,310,357 2.78 Russia 151,089,979 2.62 Bangladesh 126,387,850 2.20 Japan 126,345,434 2.19 Pakistan 123,062,252 2.14 Nigeria 99,062,003 1.72 Mexico 93,114,708 1.62 Germany 81,890,690 1.42
Using Aggregate Functions with Unique Values |
You can use DISTINCT with an aggregate function to cause the function to use only unique values from a column.
The following query returns the number of distinct, nonmissing continents in the SQL.COUNTRIES table:
proc sql; title 'Number of Continents in the COUNTRIES Table'; select count(distinct Continent) as Count from sql.countries;
Using DISTINCT with the COUNT Function
Number of Continents in the COUNTRIES Table Count -------- 8
Note: You cannot use select count(distinct *) to count distinct rows in a table. This code generates an error because PROC SQL does not know which duplicate column values to eliminate.
Compare the previous example with the following query, which does not use the DISTINCT keyword. This query counts every nonmissing occurrence of a continent in the SQL.COUNTRIES table, including duplicate values:
proc sql; title 'Countries for Which a Continent is Listed'; select count(Continent) as Count from sql.countries;
Effect of Not Using DISTINCT with the COUNT Function
Countries for Which a Continent is Listed Count -------- 206
In the previous two examples, countries that have a missing value in the Continent column are ignored by the COUNT function. To obtain a count of all rows in the table, including countries that are not on a continent, you can use the following code in the SELECT clause:
proc sql; title 'Number of Countries in the SQL.COUNTRIES Table'; select count(*) as Number from sql.countries;
Using the COUNT Function to Count All Rows in a Table
Number of Countries in the SQL.COUNTRIES Table Number -------- 209
Summarizing Data with Missing Values |
When you use an aggregate function with data that contains missing values, the results might not provide the information that you expect because many aggregate functions ignore missing values.
The AVG function returns the average of only the nonmissing values. The following query calculates the average length of three features in the SQL.FEATURES table: Angel Falls and the Amazon and Nile rivers:
/* incorrect output */ proc sql; title 'Average Length of Angel Falls, Amazon and Nile Rivers'; select Name, Length, avg(Length) as AvgLength from sql.features where Name in ('Angel Falls', 'Amazon', 'Nile');
Finding Errors Caused by Missing Values (Incorrect Output)
Average Length of Angel Falls, Amazon and Nile Rivers Name Length AvgLength ------------------------------------ Amazon 4000 4072.5 Angel Falls . 4072.5 Nile 4145 4072.5
Because no length is stored for Angel Falls, the average includes only the Amazon and Nile rivers. The average is therefore incorrect.
Compare the result from the previous example with the following query, which includes a CASE expression to handle missing values:
/* corrected output */ proc sql; title 'Average Length of Angel Falls, Amazon and Nile Rivers'; select Name, Length, case when Length is missing then 0 else Length end as NewLength, avg(calculated NewLength) as AvgLength from sql.features where Name in ('Angel Falls', 'Amazon', 'Nile');
Finding Errors Caused by Missing Values (Corrected Output)
Average Length of Angel Falls, Amazon and Nile Rivers Name Length NewLength AvgLength ----------------------------------------------- Amazon 4000 4000 2715 Angel Falls . 0 2715 Nile 4145 4145 2715
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.