Previous Page | Next Page

Retrieving Data from Multiple Tables

Using Subqueries to Select Data

While a table join combines multiple tables into a new table, a subquery (enclosed in parentheses) selects rows from one table based on values in another table. A subquery, or inner query, is a query-expression that is nested as part of another query-expression. Depending on the clause that contains it, a subquery can return a single value or multiple values. Subqueries are most often used in the WHERE and the HAVING expressions.


Single-Value Subqueries

A single-value subquery returns a single row and column. It can be used in a WHERE or HAVING clause with a comparison operator. The subquery must return only one value, or else the query fails and an error message is printed to the log.

This query uses a subquery in its WHERE clause to select U.S. states that have a population greater than Belgium. The subquery is evaluated first, and then it returns the population of Belgium to the outer query.

proc sql;
   title 'U.S. States with Population Greater than Belgium';
   select Name 'State' , population format=comma10. 
      from sql.unitedstates 
      where population gt
                  (select population from sql.countries
                      where name = "Belgium");

Internally, this is what the query looks like after the subquery has executed:

proc sql;
   title 'U.S. States with Population Greater than Belgium';
   select Name 'State', population format=comma10. 
      from sql.unitedstates 
      where population gt 10162614;

The outer query lists the states whose populations are greater than the population of Belgium.

Single-Value Subquery

                U.S. States with Population Greater than Belgium

                State                                Population
                -----------------------------------------------
                California                           31,518,948
                Florida                              13,814,408
                Illinois                             11,813,091
                New York                             18,377,334
                Ohio                                 11,200,790
                Pennsylvania                         12,167,566
                Texas                                18,209,994

Multiple-Value Subqueries

A multiple-value subquery can return more than one value from one column. It is used in a WHERE or HAVING expression that contains IN or a comparison operator that is modified by ANY or ALL. This example displays the populations of oil-producing countries. The subquery first returns all countries that are found in the OILPROD table. The outer query then matches countries in the COUNTRIES table to the results of the subquery.

proc sql outobs=5;
   title 'Populations of Major Oil Producing Countries';
   select name 'Country', Population format=comma15.
      from sql.countries
      where Name in 
            (select Country from sql.oilprod);

Multiple-Value Subquery Using IN

                  Populations of Major Oil Producing Countries

              Country                                   Population
              ----------------------------------------------------
              Algeria                                   28,171,132
              Canada                                    28,392,302
              China                                  1,202,215,077
              Egypt                                     59,912,259
              Indonesia                                202,393,859

If you use the NOT IN operator in this query, then the query result will contain all the countries that are not contained in the OILPROD table.

proc sql outobs=5;  
   title 'Populations of NonMajor Oil Producing Countries';
   select name 'Country', Population format=comma15. 
      from sql.countries  
      where Name not in 
            (select Country from sql.oilprod);

Multiple-Value Subquery Using NOT IN

                Populations of NonMajor Oil Producing Countries

              Country                                   Population
              ----------------------------------------------------
              Afghanistan                               17,070,323
              Albania                                    3,407,400
              Andorra                                       64,634
              Angola                                     9,901,050
              Antigua and Barbuda                           65,644

Correlated Subqueries

The previous subqueries have been simple subqueries that are self-contained and that execute independently of the outer query. A correlated subquery requires a value or values to be passed to it by the outer query. After the subquery runs, it passes the results back to the outer query. Correlated subqueries can return single or multiple values.

This example selects all major oil reserves of countries on the continent of Africa.

proc sql;
   title 'Oil Reserves of Countries in Africa';
   select * from sql.oilrsrvs o
      where 'Africa' =
                  (select Continent from sql.countries c
               where c.Name = o.Country);

The outer query selects the first row from the OILRSRVS table and then passes the value of the Country column, Algeria , to the subquery. At this point, the subquery internally looks like this:

(select Continent from sql.countries c
         where c.Name = 'Algeria');

The subquery selects that country from the COUNTRIES table. The subquery then passes the country's continent back to the WHERE clause in the outer query. If the continent is Africa, then the country is selected and displayed. The outer query then selects each subsequent row from the OILRSRVS table and passes the individual values of Country to the subquery. The subquery returns the appropriate values of Continent to the outer query for comparison in its WHERE clause.

Note that the WHERE clause uses an = (equal) operator. You can use an = if the subquery returns only a single value. However, if the subquery returns multiple values, then you must use IN or a comparison operator with ANY or ALL. For detailed information about the operators that are available for use with subqueries, see the section about the SQL procedure in the Base SAS Procedures Guide.

Correlated Subquery

                      Oil Reserves of Countries in Africa

               Country                                   Barrels
               -------------------------------------------------
               Algeria                             9,200,000,000
               Egypt                               4,000,000,000
               Gabon                               1,000,000,000
               Libya                              30,000,000,000
               Nigeria                            16,000,000,000

Testing for the Existence of a Group of Values

The EXISTS condition tests for the existence of a set of values. An EXISTS condition is true if any rows are produced by the subquery, and it is false if no rows are produced. Conversely, the NOT EXISTS condition is true when a subquery produces an empty table.

This example produces the same result as Correlated Subquery. EXISTS checks for the existence of countries that have oil reserves on the continent of Africa. Note that the WHERE clause in the subquery now contains the condition Continent = 'Africa' that was in the outer query in the previous example.

proc sql;
   title 'Oil Reserves of Countries in Africa';
   select * from sql.oilrsrvs o
      where exists 
                  (select Continent from sql.countries c
                where o.Country = c.Name and
                      Continent = 'Africa');

Testing for the Existence of a Group of Values

                      Oil Reserves of Countries in Africa

               Country                                   Barrels
               -------------------------------------------------
               Algeria                             9,200,000,000
               Egypt                               4,000,000,000
               Gabon                               1,000,000,000
               Libya                              30,000,000,000
               Nigeria                            16,000,000,000

Multiple Levels of Subquery Nesting

Subqueries can be nested so that the innermost subquery returns a value or values to be used by the next outer query. Then, that subquery's value or values are used by the next outer query, and so on. Evaluation always begins with the innermost subquery and works outward.

This example lists cities in Africa that are in countries with major oil reserves.

[1] The innermost query is evaluated first. It returns countries that are located on the continent of Africa.

[2] The outer subquery is evaluated. It returns a subset of African countries that have major oil reserves by comparing the list of countries that was returned by the inner subquery against the countries in OILRSRVS.

[3] Finally, the WHERE clause in the outer query lists the coordinates of the cities that exist in the WORLDCITYCOORDS table whose countries match the results of the outer subquery.

proc sql;
   title 'Coordinates of African Cities with Major Oil Reserves';
   select * from sql.worldcitycoords 
    3  where country in
          2  (select Country from sql.oilrsrvs o
                 where o.Country in =
                           1  (select Name from sql.countries c
                           where c.Continent='Africa'));

Multiple Levels of Subquery Nesting

             Coordinates of African Cities with Major Oil Reserves

     City                        Country               Latitude  Longitude
     ---------------------------------------------------------------------
     Algiers                     Algeria                     37          3
     Cairo                       Egypt                       30         31
     Benghazi                    Libya                       33         21
     Lagos                       Nigeria                      6          3

Combining a Join with a Subquery

You can combine joins and subqueries in a single query. Suppose that you want to find the city nearest to each city in the USCITYCOORDS table. The query must first select a city A, compute the distance from city A to every other city, and finally select the city with the minimum distance from city A. This can be done by joining the USCITYCOORDS table to itself (self-join) and then determining the closest distance between cities by using another self-join in a subquery.

This is the formula to determine the distance between coordinates:

SQRT(((Latitude2-Latitude1)**2) + ((Longitude2-Longitude1)**2))

Although the results of this formula are not exactly accurate because of the distortions caused by the curvature of the earth, they are accurate enough for this example to determine whether one city is closer than another.

proc sql outobs=10;
   title 'Neighboring Cities';
   select a.City format=$10., a.State,
          a.Latitude 'Lat', a.Longitude 'Long',
          b.City format=$10., b.State,
          b.Latitude 'Lat', b.Longitude 'Long',
          sqrt(((b.latitude-a.latitude)**2) +
               ((b.longitude-a.longitude)**2)) as dist format=6.1
      from sql.uscitycoords a, sql.uscitycoords b
      where a.city ne b.city and
            calculated dist =
            (select min(sqrt(((d.latitude-c.latitude)**2) +
                             ((d.longitude-c.longitude)**2)))
                from sql.uscitycoords c, sql.uscitycoords d
                where c.city = a.city and 
                      c.state = a.state and
                      d.city ne c.city)
            order by a.city;

Combining a Join with a Subquery

                                 Neighboring Cities

City        State       Lat      Long  City        State       Lat      Long    dist
------------------------------------------------------------------------------------
Albany      NY           43       -74  Hartford    CT           42       -73     1.4
Albuquerqu  NM           36      -106  Santa Fe    NM           36      -106     0.0
Amarillo    TX           35      -102  Carlsbad    NM           32      -104     3.6
Anchorage   AK           61      -150  Nome        AK           64      -165    15.3
Annapolis   MD           39       -77  Washington  DC           39       -77     0.0
Atlanta     GA           34       -84  Knoxville   TN           36       -84     2.0
Augusta     ME           44       -70  Portland    ME           44       -70     0.0
Austin      TX           30       -98  San Antoni  TX           29       -98     1.0
Baker       OR           45      -118  Lewiston    ID           46      -117     1.4
Baltimore   MD           39       -76  Dover       DE           39       -76     0.0

The outer query joins the table to itself and determines the distance between the first city A1 in table A and city B2 (the first city that is not equal to city A1) in Table B. PROC SQL then runs the subquery. The subquery does another self-join and calculates the minimum distance between city A1 and all other cities in the table other than city A1. The outer query tests to see whether the distance between cities A1 and B2 is equal to the minimum distance that was calculated by the subquery. If they are equal, then a row that contains cities A1 and B2 with their coordinates and distance is written.

Previous Page | Next Page | Top of Page