Using Subqueries to Select Data

While a table join combines multiple tables into a new table, a subquery (enclosed in parentheses) selects rows from one table based on values in another table. A subquery, or inner query, is a query expression that is nested as part of another query expression. Depending on the clause that contains it, a subquery can return a single value or multiple values. Subqueries are most often used in the WHERE and the HAVING expressions.

Single-Value Subqueries

A single-value subquery returns a single row and column. It can be used in a WHERE or HAVING clause with a comparison operator. The subquery must return only one value, or else the query fails and an error message is printed to the log.
This query uses a subquery in its WHERE clause to select U.S. states that have a population greater than Belgium. The subquery is evaluated first, and then it returns the population of Belgium to the outer query.
libname sql 'SAS-library';

proc sql;
   title 'U.S. States with Population Greater than Belgium';
   select Name 'State' , population format=comma10. 
      from sql.unitedstates 
      where population gt
                  (select population from sql.countries
                      where name = "Belgium");
Internally, this is what the query looks like after the subquery has executed:
proc sql;
   title 'U.S. States with Population Greater than Belgium';
   select Name 'State', population format=comma10. 
      from sql.unitedstates 
      where population gt 10162614;
The outer query lists the states whose populations are greater than the population of Belgium.
Single-Value Subquery
U.S. States with Population Greater than Belgium

Multiple-Value Subqueries

A multiple-value subquery can return more than one value from one column. It is used in a WHERE or HAVING expression that contains IN or a comparison operator that is modified by ANY or ALL. This example displays the populations of oil-producing countries. The subquery first returns all countries that are found in the OILPROD table. The outer query then matches countries in the COUNTRIES table to the results of the subquery.
libname sql 'SAS-library';

proc sql outobs=5;
   title 'Populations of Major Oil Producing Countries';
   select name 'Country', Population format=comma15.
      from sql.countries
      where Name in 
            (select Country from sql.oilprod);
Multiple-Value Subquery Using IN
Populations of Major Oil Producing Countries
If you use the NOT IN operator in this query, then the query result will contain all the countries that are not contained in the OILPROD table.
libname sql 'SAS-library';

proc sql outobs=5;  
   title 'Populations of NonMajor Oil Producing Countries';
   select name 'Country', Population format=comma15. 
      from sql.countries  
      where Name not in 
            (select Country from sql.oilprod);
Multiple-Value Subquery Using NOT IN
Populations of NonMajor Oil Producing Countries

Correlated Subqueries

The previous subqueries have been simple subqueries that are self-contained and that execute independently of the outer query. A correlated subquery requires a value or values to be passed to it by the outer query. After the subquery runs, it passes the results back to the outer query. Correlated subqueries can return single or multiple values.
This example selects all major oil reserves of countries on the continent of Africa.
libname sql 'SAS-library';

proc sql;
   title 'Oil Reserves of Countries in Africa';
   select * from sql.oilrsrvs o
      where 'Africa' =
             (select Continent from sql.countries c
               where c.Name = o.Country);
The outer query selects the first row from the OILRSRVS table and then passes the value of the Country column, Algeria, to the subquery. At this point, the subquery internally looks like this:
(select Continent from sql.countries c
         where c.Name = 'Algeria');
The subquery selects that country from the COUNTRIES table. The subquery then passes the country's continent back to the WHERE clause in the outer query. If the continent is Africa, then the country is selected and displayed. The outer query then selects each subsequent row from the OILRSRVS table and passes the individual values of Country to the subquery. The subquery returns the appropriate values of Continent to the outer query for comparison in its WHERE clause.
Note that the WHERE clause uses an = (equal) operator. You can use an = (equal) operator if the subquery returns only a single value. However, if the subquery returns multiple values, then you must use IN or a comparison operator with ANY or ALL. For detailed information about the operators that are available for use with subqueries, see SQL Procedure.
Correlated Subquery
Oil Reserves of Countries in Africa

Testing for the Existence of a Group of Values

The EXISTS condition tests for the existence of a set of values. An EXISTS condition is true if any rows are produced by the subquery, and it is false if no rows are produced. Conversely, the NOT EXISTS condition is true when a subquery produces an empty table.
This example produces the same result as Correlated Subquery. EXISTS checks for the existence of countries that have oil reserves on the continent of Africa. Note that the WHERE clause in the subquery now contains the condition Continent = 'Africa' that was in the outer query in the previous example.
libname sql 'SAS-library';

proc sql;
   title 'Oil Reserves of Countries in Africa';
   select * from sql.oilrsrvs o
      where exists 
             (select Continent from sql.countries c
                where o.Country = c.Name and
                      Continent = 'Africa');
Testing for the Existence of a Group of Values
Oil Reserves of Countries in Africa

Multiple Levels of Subquery Nesting

Subqueries can be nested so that the innermost subquery returns a value or values to be used by the next outer query. Then, that subquery's value or values are used by the next outer query, and so on. Evaluation always begins with the innermost subquery and works outward.
This example lists cities in Africa that are in countries with major oil reserves.
  1. The innermost query is evaluated first. It returns countries that are located on the continent of Africa.
  2. The outer subquery is evaluated. It returns a subset of African countries that have major oil reserves by comparing the list of countries that was returned by the inner subquery against the countries in OILRSRVS.
  3. Finally, the WHERE clause in the outer query lists the coordinates of the cities that exist in the WORLDCITYCOORDS table whose countries match the results of the outer subquery.
libname sql 'SAS-library';

proc sql;
   title 'Coordinates of African Cities with Major Oil Reserves';
   select * from sql.worldcitycoords 
3   where country in
  2    (select Country from sql.oilrsrvs o
           where o.Country in 
        1 (select Name from sql.countries c
             where c.Continent='Africa'));
Multiple Levels of Subquery Nesting
Coordinates of African Cities with Major Oil Reserves

Combining a Join with a Subquery

You can combine joins and subqueries in a single query. Suppose that you want to find the city nearest to each city in the USCITYCOORDS table. The query must first select a city A, compute the distance from a city A to every other city, and finally select the city with the minimum distance from city A. This can be done by joining the USCITYCOORDS table to itself (self-join) and then determining the closest distance between cities by using another self-join in a subquery.
This is the formula to determine the distance between coordinates:
SQRT(((Latitude2−Latitude1)**2) + ((Longitude2−Longitude1)**2))
Although the results of this formula are not exactly accurate because of the distortions caused by the curvature of the earth, they are accurate enough for this example to determine whether one city is closer than another.
libname sql 'SAS-library';

proc sql outobs=10;
   title 'Neighboring Cities';
   select a.City format=$10., a.State,
          a.Latitude 'Lat', a.Longitude 'Long',
          b.City format=$10., b.State,
          b.Latitude 'Lat', b.Longitude 'Long',
          sqrt(((b.latitude-a.latitude)**2) +
               ((b.longitude-a.longitude)**2)) as dist format=6.1
      from sql.uscitycoords a, sql.uscitycoords b
      where a.city ne b.city and
            calculated dist =
            (select min(sqrt(((d.latitude-c.latitude)**2) +
                             ((d.longitude-c.longitude)**2)))
                from sql.uscitycoords c, sql.uscitycoords d
                where c.city = a.city and 
                      c.state = a.state and
                      d.city ne c.city)
            order by a.city;
Combining a Join with a Subquery
Neighboring Cities
The outer query joins the table to itself and determines the distance between the first city A1 in table A and city B2 (the first city that is not equal to city A1) in Table B. PROC SQL then runs the subquery. The subquery does another self-join and calculates the minimum distance between city A1 and all other cities in the table other than city A1. The outer query tests to see whether the distance between cities A1 and B2 is equal to the minimum distance that was calculated by the subquery. If they are equal, then a row that contains cities A1 and B2 with their coordinates and distance is written.