Retrieving Data from a Single Table

Summarizing Data

You can use an aggregate function (or summary function) to produce a statistical summary of data in a table. The aggregate function instructs PROC SQL in how to combine data in one or more columns. If you specify one column as the argument to an aggregate function, then the values in that column are calculated. If you specify multiple arguments, then the arguments or columns that are listed are calculated.

Note: When more than one argument is used within an SQL aggregate function, the function is no longer considered to be an SQL aggregate or summary function. If there is a like-named Base SAS function, then PROC SQL executes the Base SAS function and the results that are returned are based on the values for the current row. If no like-named Base SAS function exists, then an error will occur. For example, if you use multiple arguments for the AVG function, an error will occur because there is no AVG function for Base SAS. [cautionend]

When you use an aggregate function, PROC SQL applies the function to the entire table, unless you use a GROUP BY clause. You can use aggregate functions in the SELECT or HAVING clauses.

Note: See Grouping Data for information about producing summaries of individual groups of data within a table. [cautionend]

Using Aggregate Functions

The following table lists the aggregate functions that you can use:

Aggregate Functions
Function	Definition
AVG, MEAN	mean or average of values
COUNT, FREQ, N	number of nonmissing values
CSS	corrected sum of squares
CV	coefficient of variation (percent)
MAX	largest value
MIN	smallest value
NMISS	number of missing values
PRT	probability of a greater absolute value of Student's t
RANGE	range of values
STD	standard deviation
STDERR	standard error of the mean
SUM	sum of values
SUMWGT	sum of the WEIGHT variable values (table note 1)
T	Student's t value for testing the hypothesis that the population mean is zero
USS	uncorrected sum of squares
VAR	variance

TABLE NOTE 1: In the SQL procedure, each row has a weight of 1.

Note: You can use most other SAS functions in PROC SQL, but they are not treated as aggregate functions. [cautionend]

Summarizing Data with a WHERE Clause

You can use aggregate, or summary functions, by using a WHERE clause. For a complete list of the aggregate functions that you can use, see Aggregate Functions.

Using the MEAN Function with a WHERE Clause

This example uses the MEAN function to find the annual mean temperature for each country in the SQL.WORLDTEMPS table. The WHERE clause returns countries with a mean temperature that is greater than 75 degrees.

proc sql outobs=12;
   title 'Mean Temperatures for World Cities';
   select City, Country, mean(AvgHigh, AvgLow) 
          as MeanTemp
      from sql.worldtemps
      where calculated MeanTemp gt 75
      order by MeanTemp desc;

Note: You must use the CALCULATED keyword to reference the calculated column. [cautionend]

Using the MEAN Function with a WHERE Clause

                       Mean Temperatures for World Cities

                 City                Country          MeanTemp
                 ---------------------------------------------
                 Lagos               Nigeria              82.5
                 Manila              Philippines            82
                 Bangkok             Thailand               82
                 Singapore           Singapore              81
                 Bombay              India                  79
                 Kingston            Jamaica                78
                 San Juan            Puerto Rico            78
                 Calcutta            India                76.5
                 Havana              Cuba                 76.5
                 Nassau              Bahamas              76.5

Displaying Sums

The following example uses the SUM function to return the total oil reserves for all countries in the SQL.OILRSRVS table:

proc sql;
   title 'World Oil Reserves';
   select sum(Barrels) format=comma18. as TotalBarrels
      from sql.oilrsrvs;

Note: The SUM function produces a single row of output for the requested sum because no nonaggregate value appears in the SELECT clause. [cautionend]

Displaying Sums

                               World Oil Reserves

                                     TotalBarrels
                               ------------------
                                  878,300,000,000

Combining Data from Multiple Rows into a Single Row

In the previous example, PROC SQL combined information from multiple rows of data into a single row of output. Specifically, the world oil reserves for each country were combined to form a total for all countries. Combining, or rolling up, of rows occurs when

the SELECT clause contains only columns that are specified within an aggregate function
the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.

Remerging Summary Statistics

The following example uses the MAX function to find the largest population in the SQL.COUNTRIES table and displays it in a column called MaxPopulation. Aggregate functions, such as the MAX function, can cause the same calculation to repeat for every row. This occurs whenever PROC SQL remerges data. Remerging occurs whenever any of the following conditions exist

The SELECT clause references a column that contains an aggregate function that is not listed in a GROUP BY clause.
The SELECT clause references a column that contains an aggregate function and other column or columns that are not listed in the GROUP BY clause.
One or more columns or column expressions that are listed in a HAVING clause are not included in a subquery or a GROUP BY clause.

In this example, PROC SQL writes the population of China, which is the largest population in the table:

proc sql outobs=12;
   title 'Largest Country Populations';
   select Name, Population format=comma20., 
          max(Population) as MaxPopulation format=comma20.
      from sql.countries
      order by Population desc;

Using Aggregate Functions

                               Largest Country Populations

     Name                                           Population         MaxPopulation
     -------------------------------------------------------------------------------
     China                                       1,202,215,077         1,202,215,077
     India                                         929,009,120         1,202,215,077
     United States                                 263,294,808         1,202,215,077
     Indonesia                                     202,393,859         1,202,215,077
     Brazil                                        160,310,357         1,202,215,077
     Russia                                        151,089,979         1,202,215,077
     Bangladesh                                    126,387,850         1,202,215,077
     Japan                                         126,345,434         1,202,215,077
     Pakistan                                      123,062,252         1,202,215,077
     Nigeria                                        99,062,003         1,202,215,077
     Mexico                                         93,114,708         1,202,215,077
     Germany                                        81,890,690         1,202,215,077

In some cases, you might need to use an aggregate function so that you can use its results in another calculation. To do this, you need only to construct one query for PROC SQL to automatically perform both calculations. This type of operation also causes PROC SQL to remerge the data.

For example, if you want to find the percentage of the total world population that resides in each country, then you construct a single query that

obtains the total world population by using the SUM function
divides each country's population by the total world population.

PROC SQL runs an internal query to find the sum and then runs another internal query to divide each country's population by the sum.

proc sql outobs=12;
   title 'Percentage of World Population in Countries';
   select Name, Population format=comma14.,
          (Population / sum(Population) * 100) as Percentage 
          format=comma8.2
      from sql.countries
      order by Percentage desc;

Note: When a query remerges data, PROC SQL displays a note in the log to indicate that data remerging has occurred. [cautionend]

Remerging Summary Statistics

                  Percentage of World Population in Countries

        Name                                     Population  Percentage
        ---------------------------------------------------------------
        China                                 1,202,215,077       20.88
        India                                   929,009,120       16.13
        United States                           263,294,808        4.57
        Indonesia                               202,393,859        3.52
        Brazil                                  160,310,357        2.78
        Russia                                  151,089,979        2.62
        Bangladesh                              126,387,850        2.20
        Japan                                   126,345,434        2.19
        Pakistan                                123,062,252        2.14
        Nigeria                                  99,062,003        1.72
        Mexico                                   93,114,708        1.62
        Germany                                  81,890,690        1.42

Using Aggregate Functions with Unique Values

You can use DISTINCT with an aggregate function to cause the function to use only unique values from a column.

Counting Unique Values

The following query returns the number of distinct, nonmissing continents in the SQL.COUNTRIES table:

proc sql;
   title 'Number of Continents in the COUNTRIES Table';
   select count(distinct Continent) as Count
      from sql.countries;

Using DISTINCT with the COUNT Function

                  Number of Continents in the COUNTRIES Table

                                       Count
                                    --------
                                           8

Note: You cannot use select count(distinct *) to count distinct rows in a table. This code generates an error because PROC SQL does not know which duplicate column values to eliminate. [cautionend]

Counting Nonmissing Values

Compare the previous example with the following query, which does not use the DISTINCT keyword. This query counts every nonmissing occurrence of a continent in the SQL.COUNTRIES table, including duplicate values:

proc sql;
   title 'Countries for Which a Continent is Listed';
   select count(Continent) as Count
      from sql.countries;

Effect of Not Using DISTINCT with the COUNT Function

                   Countries for Which a Continent is Listed

                                       Count
                                    --------
                                         206

Counting All Rows

In the previous two examples, countries that have a missing value in the Continent column are ignored by the COUNT function. To obtain a count of all rows in the table, including countries that are not on a continent, you can use the following code in the SELECT clause:

proc sql;
   title 'Number of Countries in the SQL.COUNTRIES Table';
   select count(*) as Number
      from sql.countries;

Using the COUNT Function to Count All Rows in a Table

                 Number of Countries in the SQL.COUNTRIES Table

                                      Number
                                    --------
                                         209

Summarizing Data with Missing Values

When you use an aggregate function with data that contains missing values, the results might not provide the information that you expect because many aggregate functions ignore missing values.

Finding Errors Caused by Missing Values

The AVG function returns the average of only the nonmissing values. The following query calculates the average length of three features in the SQL.FEATURES table: Angel Falls and the Amazon and Nile rivers:

/* incorrect output */

proc sql;
   title 'Average Length of Angel Falls, Amazon and Nile Rivers';
   select Name, Length, avg(Length) as AvgLength 
      from sql.features
      where Name in ('Angel Falls', 'Amazon', 'Nile');

Finding Errors Caused by Missing Values (Incorrect Output)

             Average Length of Angel Falls, Amazon and Nile Rivers

                      Name               Length  AvgLength
                      ------------------------------------
                      Amazon               4000     4072.5
                      Angel Falls             .     4072.5
                      Nile                 4145     4072.5

Because no length is stored for Angel Falls, the average includes only the Amazon and Nile rivers. The average is therefore incorrect.

Compare the result from the previous example with the following query, which includes a CASE expression to handle missing values:

/* corrected output */

proc sql;
   title 'Average Length of Angel Falls, Amazon and Nile Rivers';
   select Name, Length, case 
                           when Length is missing then 0
                           else Length 
                         end as NewLength, 
                avg(calculated NewLength) as AvgLength 
      from sql.features
      where Name in ('Angel Falls', 'Amazon', 'Nile');

Finding Errors Caused by Missing Values (Corrected Output)

             Average Length of Angel Falls, Amazon and Nile Rivers

                Name               Length  NewLength  AvgLength
                -----------------------------------------------
                Amazon               4000       4000       2715
                Angel Falls             .          0       2715
                Nile                 4145       4145       2715

Top of Page