Retrieving Data from a Single Table

# Summarizing Data

You can use an aggregate function (or summary function) to produce a statistical summary of data in a table. The aggregate function instructs PROC SQL in how to combine data in one or more columns. If you specify one column as the argument to an aggregate function, then the values in that column are calculated. If you specify multiple arguments, then the arguments or columns that are listed are calculated.

Note:   When more than one argument is used within an SQL aggregate function, the function is no longer considered to be an SQL aggregate or summary function. If there is a like-named Base SAS function, then PROC SQL executes the Base SAS function and the results that are returned are based on the values for the current row. If no like-named Base SAS function exists, then an error will occur. For example, if you use multiple arguments for the AVG function, an error will occur because there is no AVG function for Base SAS.

When you use an aggregate function, PROC SQL applies the function to the entire table, unless you use a GROUP BY clause. You can use aggregate functions in the SELECT or HAVING clauses.

Note:   See Grouping Data for information about producing summaries of individual groups of data within a table.

The following table lists the aggregate functions that you can use:

Aggregate Functions
Function Definition
AVG, MEAN mean or average of values
COUNT, FREQ, N number of nonmissing values
CSS corrected sum of squares
CV coefficient of variation (percent)
MAX largest value
MIN smallest value
NMISS number of missing values
PRT probability of a greater absolute value of Student's t
RANGE range of values
STD standard deviation
STDERR standard error of the mean
SUM sum of values
SUMWGT sum of the WEIGHT variable values
T Student's t value for testing the hypothesis that the population mean is zero
USS uncorrected sum of squares
VAR variance

TABLE NOTE 1:  In the SQL procedure, each row has a weight of 1.

Note:   You can use most other SAS functions in PROC SQL, but they are not treated as aggregate functions.

You can use aggregate, or summary functions, by using a WHERE clause. For a complete list of the aggregate functions that you can use, see Aggregate Functions.

### Using the MEAN Function with a WHERE Clause

This example uses the MEAN function to find the annual mean temperature for each country in the SQL.WORLDTEMPS table. The WHERE clause returns countries with a mean temperature that is greater than 75 degrees.

```proc sql outobs=12;
title 'Mean Temperatures for World Cities';
select City, Country, mean(AvgHigh, AvgLow)
as MeanTemp
from sql.worldtemps
where calculated MeanTemp gt 75
order by MeanTemp desc;```

Note:   You must use the CALCULATED keyword to reference the calculated column.

```                       Mean Temperatures for World Cities

City                Country          MeanTemp
---------------------------------------------
Lagos               Nigeria              82.5
Manila              Philippines            82
Bangkok             Thailand               82
Singapore           Singapore              81
Bombay              India                  79
Kingston            Jamaica                78
San Juan            Puerto Rico            78
Calcutta            India                76.5
Havana              Cuba                 76.5
Nassau              Bahamas              76.5
```

The following example uses the SUM function to return the total oil reserves for all countries in the SQL.OILRSRVS table:

```proc sql;
title 'World Oil Reserves';
select sum(Barrels) format=comma18. as TotalBarrels
from sql.oilrsrvs;```

Note:   The SUM function produces a single row of output for the requested sum because no nonaggregate value appears in the SELECT clause.

```                               World Oil Reserves

TotalBarrels
------------------
878,300,000,000
```

In the previous example, PROC SQL combined information from multiple rows of data into a single row of output. Specifically, the world oil reserves for each country were combined to form a total for all countries. Combining, or rolling up, of rows occurs when

• the SELECT clause contains only columns that are specified within an aggregate function

• the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.

The following example uses the MAX function to find the largest population in the SQL.COUNTRIES table and displays it in a column called MaxPopulation. Aggregate functions, such as the MAX function, can cause the same calculation to repeat for every row. This occurs whenever PROC SQL remerges data. Remerging occurs whenever any of the following conditions exist

• The SELECT clause references a column that contains an aggregate function that is not listed in a GROUP BY clause.

• The SELECT clause references a column that contains an aggregate function and other column or columns that are not listed in the GROUP BY clause.

• One or more columns or column expressions that are listed in a HAVING clause are not included in a subquery or a GROUP BY clause.

In this example, PROC SQL writes the population of China, which is the largest population in the table:

```proc sql outobs=12;
title 'Largest Country Populations';
select Name, Population format=comma20.,
max(Population) as MaxPopulation format=comma20.
from sql.countries
order by Population desc;```

```                               Largest Country Populations

Name                                           Population         MaxPopulation
-------------------------------------------------------------------------------
China                                       1,202,215,077         1,202,215,077
India                                         929,009,120         1,202,215,077
United States                                 263,294,808         1,202,215,077
Indonesia                                     202,393,859         1,202,215,077
Brazil                                        160,310,357         1,202,215,077
Russia                                        151,089,979         1,202,215,077
Japan                                         126,345,434         1,202,215,077
Pakistan                                      123,062,252         1,202,215,077
Nigeria                                        99,062,003         1,202,215,077
Mexico                                         93,114,708         1,202,215,077
Germany                                        81,890,690         1,202,215,077
```

In some cases, you might need to use an aggregate function so that you can use its results in another calculation. To do this, you need only to construct one query for PROC SQL to automatically perform both calculations. This type of operation also causes PROC SQL to remerge the data.

For example, if you want to find the percentage of the total world population that resides in each country, then you construct a single query that

• obtains the total world population by using the SUM function

• divides each country's population by the total world population.

PROC SQL runs an internal query to find the sum and then runs another internal query to divide each country's population by the sum.

```proc sql outobs=12;
title 'Percentage of World Population in Countries';
select Name, Population format=comma14.,
(Population / sum(Population) * 100) as Percentage
format=comma8.2
from sql.countries
order by Percentage desc;```

Note:   When a query remerges data, PROC SQL displays a note in the log to indicate that data remerging has occurred.

```                  Percentage of World Population in Countries

Name                                     Population  Percentage
---------------------------------------------------------------
China                                 1,202,215,077       20.88
India                                   929,009,120       16.13
United States                           263,294,808        4.57
Indonesia                               202,393,859        3.52
Brazil                                  160,310,357        2.78
Russia                                  151,089,979        2.62
Japan                                   126,345,434        2.19
Pakistan                                123,062,252        2.14
Nigeria                                  99,062,003        1.72
Mexico                                   93,114,708        1.62
Germany                                  81,890,690        1.42
```

You can use DISTINCT with an aggregate function to cause the function to use only unique values from a column.

### Counting Unique Values

The following query returns the number of distinct, nonmissing continents in the SQL.COUNTRIES table:

```proc sql;
title 'Number of Continents in the COUNTRIES Table';
select count(distinct Continent) as Count
from sql.countries;```

```                  Number of Continents in the COUNTRIES Table

Count
--------
8
```

Note:   You cannot use select count(distinct *) to count distinct rows in a table. This code generates an error because PROC SQL does not know which duplicate column values to eliminate.

### Counting Nonmissing Values

Compare the previous example with the following query, which does not use the DISTINCT keyword. This query counts every nonmissing occurrence of a continent in the SQL.COUNTRIES table, including duplicate values:

```proc sql;
title 'Countries for Which a Continent is Listed';
select count(Continent) as Count
from sql.countries;```

```                   Countries for Which a Continent is Listed

Count
--------
206
```

### Counting All Rows

In the previous two examples, countries that have a missing value in the Continent column are ignored by the COUNT function. To obtain a count of all rows in the table, including countries that are not on a continent, you can use the following code in the SELECT clause:

```proc sql;
title 'Number of Countries in the SQL.COUNTRIES Table';
select count(*) as Number
from sql.countries;```

```                 Number of Countries in the SQL.COUNTRIES Table

Number
--------
209
```

When you use an aggregate function with data that contains missing values, the results might not provide the information that you expect because many aggregate functions ignore missing values.

### Finding Errors Caused by Missing Values

The AVG function returns the average of only the nonmissing values. The following query calculates the average length of three features in the SQL.FEATURES table: Angel Falls and the Amazon and Nile rivers:

```/* incorrect output */

proc sql;
title 'Average Length of Angel Falls, Amazon and Nile Rivers';
select Name, Length, avg(Length) as AvgLength
from sql.features
where Name in ('Angel Falls', 'Amazon', 'Nile');```

```             Average Length of Angel Falls, Amazon and Nile Rivers

Name               Length  AvgLength
------------------------------------
Amazon               4000     4072.5
Angel Falls             .     4072.5
Nile                 4145     4072.5
```

Because no length is stored for Angel Falls, the average includes only the Amazon and Nile rivers. The average is therefore incorrect.

Compare the result from the previous example with the following query, which includes a CASE expression to handle missing values:

```/* corrected output */

proc sql;
title 'Average Length of Angel Falls, Amazon and Nile Rivers';
select Name, Length, case
when Length is missing then 0
else Length
end as NewLength,
avg(calculated NewLength) as AvgLength
from sql.features
where Name in ('Angel Falls', 'Amazon', 'Nile');```

```             Average Length of Angel Falls, Amazon and Nile Rivers

Name               Length  NewLength  AvgLength
-----------------------------------------------
Amazon               4000       4000       2715
Angel Falls             .          0       2715
Nile                 4145       4145       2715
```

 Previous Page | Next Page | Top of Page