The SQL Procedure |
Restriction: | A summary function cannot appear in an ON clause or a WHERE clause. | ||||
See also: |
| ||||
Featured in: |
|
summary-function (<DISTINCT | ALL> sql-expression) |
Arguments |
is one of the following:
arithmetic mean or average of values
number of nonmissing values
corrected sum of squares
coefficient of variation (percent)
largest value
smallest value
number of missing values
is the two-tailed p-value for Student's t statistic, T with degrees of freedom.
range of values
standard deviation
standard error of the mean
sum of values
sum of the WEIGHT variable values(footnote 1)
Student's t value for testing the hypothesis that the population mean is zero
uncorrected sum of squares
variance
For a description and the formulas used for these statistics, see SAS Elementary Statistics Procedures.
specifies that only the unique values of sql-expression be used in the calculation.
specifies that all values of sql-expression be used in the calculation. If neither DISTINCT nor ALL is specified, then ALL is used.
is described in sql-expression.
Summarizing Data |
Summary functions produce a statistical summary of the entire table or view that is listed in the FROM clause or for each group that is specified in a GROUP BY clause. If GROUP BY is omitted, then all the rows in the table or view are considered to be a single group. These functions reduce all the values in each row or column in a table to one summarizing or aggregate value. For this reason, these functions are often called aggregate functions. For example, the sum (one value) of a column results from the addition of all the values in the column.
Counting Rows |
The COUNT function counts rows. COUNT(*) returns the total number of rows in a group or in a table. If you use a column name as an argument to COUNT, then the result is the total number of rows in a group or in a table that have a nonmissing value for that column. If you want to count the unique values in a column, then specify COUNT(DISTINCT column).
If the SELECT clause of a table-expression contains one or more summary functions and that table-expression resolves to no rows, then the summary function results are missing values. The following are exceptions that return zeros:
COUNT(*) | |
COUNT(<DISTINCT> sql-expression) | |
NMISS(<DISTINCT> sql-expression) |
See Creating a View from a Query's Result and Counting Missing Values with a SAS Macro for examples.
Calculating Statistics Based on the Number of Arguments |
The number of arguments that is specified in a summary function affects how the calculation is performed. If you specify a single argument, then the values in the column are calculated. If you specify multiple arguments, then the arguments or columns that are listed are calculated for each row.
Note: When more than one argument is used within an SQL aggregate function, the function is no longer considered to be an SQL aggregate or summary function. If there is a like-named Base SAS function, then PROC SQL executes the Base SAS function, and the results that are returned are based on the values for the current row. If no like-named Base SAS function exists, then an error will occur. For example, if you use multiple arguments for the AVG function, an error will occur because there is no AVG function for Base SAS.
For example, consider calculations on the following table.
proc sql; title 'Summary Table'; select * from summary;
Summary Table X Y Z ---------------------------- 1 3 4 2 4 5 8 9 4 4 5 4
If you use one argument in the function, then the calculation is performed on that column only. If you use more than one argument, then the calculation is performed on each row of the specified columns. In the following PROC SQL step, the MIN and MAX functions return the minimum and maximum of the columns they are used with. The SUM function returns the sum of each row of the columns specified as arguments:
proc sql; select min(x) as Colmin_x, min(y) as Colmin_y, max(z) as Colmax_z, sum(x,y,z) as Rowsum from summary;
Summary Table Colmin_x Colmin_y Colmax_z Rowsum -------------------------------------- 1 3 5 8 1 3 5 11 1 3 5 21 1 3 5 13
Remerging Data |
When you use a summary function in a SELECT clause or a HAVING clause, you might see the following message in the SAS log:
NOTE: The query requires remerging summary statistics back with the original data.
The process of remerging involves two passes through the data. On the first pass, PROC SQL
calculates and returns the value of summary functions. It then uses the result to calculate the arithmetic expressions in which the summary function participates.
groups data according to the GROUP BY clause.
On the second pass, PROC SQL retrieves any additional columns and rows that it needs to show in the output.
Note: To specify that PROC SQL not process queries that use remerging of data, use either the PROC SQL NOREMERGE option or the NOSQLREMERGE system option. If remerging is attempted when the NOMERGE option or the NOSQLREMERGE system option is set, an error is written to the SAS log. For more information, see the REMERGE option and the SQLREMERGE system option in the SAS Language Reference: Dictionary.
The following examples use the PROCLIB.PAYROLL table (shown in Creating a Table from a Query's Result) to show when remerging of data is and is not necessary.
The first query requires remerging. The first pass through the data groups the data by Jobcode and resolves the AVG function for each group. However, PROC SQL must make a second pass in order to retrieve the values of IdNumber and Salary.
proc sql outobs=10; title 'Salary Information'; title2 '(First 10 Rows Only)'; select IdNumber, Jobcode, Salary, avg(salary) as AvgSalary from proclib.payroll group by jobcode;
Salary Information That Required Remerging
Salary Information (First 10 Rows Only) Id Number Jobcode Salary AvgSalary ------------------------------------ 1704 BCK 25465 25794.22 1677 BCK 26007 25794.22 1383 BCK 25823 25794.22 1845 BCK 25996 25794.22 1100 BCK 25004 25794.22 1663 BCK 26452 25794.22 1673 BCK 25477 25794.22 1389 BCK 25028 25794.22 1834 BCK 26896 25794.22 1132 FA1 22413 23039.36
You can change the previous query to return only the average salary for each jobcode. The following query does not require remerging because the first pass of the data does the summarizing and the grouping. A second pass is not necessary.
proc sql outobs=10; title 'Average Salary for Each Jobcode'; select Jobcode, avg(salary) as AvgSalary from proclib.payroll group by jobcode;
Salary Information That Did Not Require Remerging
Average Salary for Each Jobcode Jobcode AvgSalary ------------------ BCK 25794.22 FA1 23039.36 FA2 27986.88 FA3 32933.86 ME1 28500.25 ME2 35576.86 ME3 42410.71 NA1 42032.2 NA2 52383 PT1 67908
When you use the HAVING clause, PROC SQL might have to remerge data to resolve the HAVING expression.
First, consider a query that uses HAVING but that does not require remerging. The query groups the data by values of Jobcode, and the result contains one row for each value of Jobcode and summary information for people in each Jobcode. On the first pass, the summary functions provide values for the Number , Average Age , and Average Salary columns. The first pass provides everything that PROC SQL needs to resolve the HAVING clause, so no remerging is necessary.
proc sql outobs=10; title 'Summary Information for Each Jobcode'; title2 '(First 10 Rows Only)'; select Jobcode, count(jobcode) as number label='Number', avg(int((today()-birth)/365.25)) as avgage format=2. label='Average Age', avg(salary) as avgsal format=dollar8. label='Average Salary' from proclib.payroll group by jobcode having avgage ge 30;
Jobcode Information That Did Not Require Remerging
Summary Information for Each Jobcode (First 10 Rows Only) Average Average Jobcode Number Age Salary ------------------------------------ BCK 9 36 $25,794 FA1 11 33 $23,039 FA2 16 37 $27,987 FA3 7 39 $32,934 ME1 8 34 $28,500 ME2 14 39 $35,577 ME3 7 42 $42,411 NA1 5 30 $42,032 NA2 3 42 $52,383 PT1 8 38 $67,908
In the following query, PROC SQL remerges the data because the HAVING clause uses the SALARY column in the comparison and SALARY is not in the GROUP BY clause.
proc sql outobs=10; title 'Employees who Earn More than the'; title2 'Average for Their Jobcode'; title3 '(First 10 Rows Only)'; select Jobcode, Salary, avg(salary) as AvgSalary from proclib.payroll group by jobcode having salary > AvgSalary;
Jobcode Information That Did Require Remerging
Employees who Earn More than the Average for Their Jobcode (First 10 Rows Only) Jobcode Salary AvgSalary ---------------------------- BCK 26007 25794.22 BCK 25823 25794.22 BCK 25996 25794.22 BCK 26452 25794.22 BCK 26896 25794.22 FA1 23177 23039.36 FA1 23738 23039.36 FA1 23979 23039.36 FA1 23916 23039.36 FA1 23644 23039.36
Keep in mind that PROC SQL remerges data when
the values returned by a summary function are used in a calculation. For example, the following query returns the values of X and the percentage of the total for each row. On the first pass, PROC SQL computes the sum of X, and on the second pass PROC SQL computes the percentage of the total for each value of X:
data summary; input x; datalines; 32 86 49 49 ;
proc sql; title 'Percentage of the Total'; select X, (100*x/sum(X)) as Pct_Total from summary;
Values of X as a Percentage of Total
Percentage of the Total x Pct_Total ------------------- 32 14.81481 86 39.81481 49 22.68519 49 22.68519
the values returned by a summary function are compared to values of a column that is not specified in the GROUP BY clause. For example, the following query uses the PROCLIB.PAYROLL table. PROC SQL remerges data because the column Salary is not specified in the GROUP BY clause:
proc sql; select jobcode, salary, avg(salary) as avsal from proclib.payroll group by jobcode having salary > avsal;
a column from the input table is specified in the SELECT clause and is not specified in the GROUP BY clause. This rule does not refer to columns used as arguments to summary functions in the SELECT clause.
For example, in the following query, the presence of IdNumber in the SELECT clause causes PROC SQL to remerge the data because IdNumber is not involved in grouping or summarizing during the first pass. In order for PROC SQL to retrieve the values for IdNumber, it must make a second pass through the data.
proc sql; select IdNumber, jobcode, avg(salary) as avsal from proclib.payroll group by jobcode;
FOOTNOTE 1: Currently, there is no way to designate a WEIGHT variable for a table in PROC SQL. Thus, each row (or observation) has a weight of 1.
Copyright © 2010 by SAS Institute Inc., Cary, NC, USA. All rights reserved.