Retrieving Data from Multiple Tables |
While a table join combines multiple tables into a new table, a subquery (enclosed in parentheses) selects rows from one table based on values in another table. A subquery, or inner query, is a query-expression that is nested as part of another query-expression. Depending on the clause that contains it, a subquery can return a single value or multiple values. Subqueries are most often used in the WHERE and the HAVING expressions.
Single-Value Subqueries |
A single-value subquery returns a single row and column. It can be used in a WHERE or HAVING clause with a comparison operator. The subquery must return only one value, or else the query fails and an error message is printed to the log.
This query uses a subquery in its WHERE clause to select U.S. states that have a population greater than Belgium. The subquery is evaluated first, and then it returns the population of Belgium to the outer query.
proc sql; title 'U.S. States with Population Greater than Belgium'; select Name 'State' , population format=comma10. from sql.unitedstates where population gt (select population from sql.countries where name = "Belgium");
Internally, this is what the query looks like after the subquery has executed:
proc sql; title 'U.S. States with Population Greater than Belgium'; select Name 'State', population format=comma10. from sql.unitedstates where population gt 10162614;
The outer query lists the states whose populations are greater than the population of Belgium.
U.S. States with Population Greater than Belgium State Population ----------------------------------------------- California 31,518,948 Florida 13,814,408 Illinois 11,813,091 New York 18,377,334 Ohio 11,200,790 Pennsylvania 12,167,566 Texas 18,209,994
Multiple-Value Subqueries |
A multiple-value subquery can return more than one value from one column. It is used in a WHERE or HAVING expression that contains IN or a comparison operator that is modified by ANY or ALL. This example displays the populations of oil-producing countries. The subquery first returns all countries that are found in the OILPROD table. The outer query then matches countries in the COUNTRIES table to the results of the subquery.
proc sql outobs=5; title 'Populations of Major Oil Producing Countries'; select name 'Country', Population format=comma15. from sql.countries where Name in (select Country from sql.oilprod);
Multiple-Value Subquery Using IN
Populations of Major Oil Producing Countries Country Population ---------------------------------------------------- Algeria 28,171,132 Canada 28,392,302 China 1,202,215,077 Egypt 59,912,259 Indonesia 202,393,859
If you use the NOT IN operator in this query, then the query result will contain all the countries that are not contained in the OILPROD table.
proc sql outobs=5; title 'Populations of NonMajor Oil Producing Countries'; select name 'Country', Population format=comma15. from sql.countries where Name not in (select Country from sql.oilprod);
Multiple-Value Subquery Using NOT IN
Populations of NonMajor Oil Producing Countries Country Population ---------------------------------------------------- Afghanistan 17,070,323 Albania 3,407,400 Andorra 64,634 Angola 9,901,050 Antigua and Barbuda 65,644
Correlated Subqueries |
The previous subqueries have been simple subqueries that are self-contained and that execute independently of the outer query. A correlated subquery requires a value or values to be passed to it by the outer query. After the subquery runs, it passes the results back to the outer query. Correlated subqueries can return single or multiple values.
This example selects all major oil reserves of countries on the continent of Africa.
proc sql; title 'Oil Reserves of Countries in Africa'; select * from sql.oilrsrvs o where 'Africa' = (select Continent from sql.countries c where c.Name = o.Country);
The outer query selects the first row from the OILRSRVS table and then passes the value of the Country column, Algeria , to the subquery. At this point, the subquery internally looks like this:
(select Continent from sql.countries c where c.Name = 'Algeria');
The subquery selects that country from the COUNTRIES table. The subquery then passes the country's continent back to the WHERE clause in the outer query. If the continent is Africa, then the country is selected and displayed. The outer query then selects each subsequent row from the OILRSRVS table and passes the individual values of Country to the subquery. The subquery returns the appropriate values of Continent to the outer query for comparison in its WHERE clause.
Note that the WHERE clause uses an = (equal) operator. You can use an = if the subquery returns only a single value. However, if the subquery returns multiple values, then you must use IN or a comparison operator with ANY or ALL. For detailed information about the operators that are available for use with subqueries, see the section about the SQL procedure in the Base SAS Procedures Guide.
Oil Reserves of Countries in Africa Country Barrels ------------------------------------------------- Algeria 9,200,000,000 Egypt 4,000,000,000 Gabon 1,000,000,000 Libya 30,000,000,000 Nigeria 16,000,000,000
Testing for the Existence of a Group of Values |
The EXISTS condition tests for the existence of a set of values. An EXISTS condition is true if any rows are produced by the subquery, and it is false if no rows are produced. Conversely, the NOT EXISTS condition is true when a subquery produces an empty table.
This example produces the same result as Correlated Subquery. EXISTS checks for the existence of countries that have oil reserves on the continent of Africa. Note that the WHERE clause in the subquery now contains the condition Continent = 'Africa' that was in the outer query in the previous example.
proc sql; title 'Oil Reserves of Countries in Africa'; select * from sql.oilrsrvs o where exists (select Continent from sql.countries c where o.Country = c.Name and Continent = 'Africa');
Testing for the Existence of a Group of Values
Oil Reserves of Countries in Africa Country Barrels ------------------------------------------------- Algeria 9,200,000,000 Egypt 4,000,000,000 Gabon 1,000,000,000 Libya 30,000,000,000 Nigeria 16,000,000,000
Multiple Levels of Subquery Nesting |
Subqueries can be nested so that the innermost subquery returns a value or values to be used by the next outer query. Then, that subquery's value or values are used by the next outer query, and so on. Evaluation always begins with the innermost subquery and works outward.
This example lists cities in Africa that are in countries with major oil reserves.
The innermost query is evaluated first. It returns countries that are located on the continent of Africa. | |
The outer subquery is evaluated. It returns a subset of African countries that have major oil reserves by comparing the list of countries that was returned by the inner subquery against the countries in OILRSRVS. | |
Finally, the WHERE clause in the outer query lists the coordinates of the cities that exist in the WORLDCITYCOORDS table whose countries match the results of the outer subquery. |
proc sql; title 'Coordinates of African Cities with Major Oil Reserves'; select * from sql.worldcitycoords 3 where country in 2 (select Country from sql.oilrsrvs o where o.Country in = 1 (select Name from sql.countries c where c.Continent='Africa'));
Multiple Levels of Subquery Nesting
Coordinates of African Cities with Major Oil Reserves City Country Latitude Longitude --------------------------------------------------------------------- Algiers Algeria 37 3 Cairo Egypt 30 31 Benghazi Libya 33 21 Lagos Nigeria 6 3
Combining a Join with a Subquery |
You can combine joins and subqueries in a single query. Suppose that you want to find the city nearest to each city in the USCITYCOORDS table. The query must first select a city A, compute the distance from city A to every other city, and finally select the city with the minimum distance from city A. This can be done by joining the USCITYCOORDS table to itself (self-join) and then determining the closest distance between cities by using another self-join in a subquery.
This is the formula to determine the distance between coordinates:
SQRT(((Latitude2-Latitude1)**2) + ((Longitude2-Longitude1)**2))
Although the results of this formula are not exactly accurate because of the distortions caused by the curvature of the earth, they are accurate enough for this example to determine whether one city is closer than another.
proc sql outobs=10; title 'Neighboring Cities'; select a.City format=$10., a.State, a.Latitude 'Lat', a.Longitude 'Long', b.City format=$10., b.State, b.Latitude 'Lat', b.Longitude 'Long', sqrt(((b.latitude-a.latitude)**2) + ((b.longitude-a.longitude)**2)) as dist format=6.1 from sql.uscitycoords a, sql.uscitycoords b where a.city ne b.city and calculated dist = (select min(sqrt(((d.latitude-c.latitude)**2) + ((d.longitude-c.longitude)**2))) from sql.uscitycoords c, sql.uscitycoords d where c.city = a.city and c.state = a.state and d.city ne c.city) order by a.city;
Combining a Join with a Subquery
Neighboring Cities City State Lat Long City State Lat Long dist ------------------------------------------------------------------------------------ Albany NY 43 -74 Hartford CT 42 -73 1.4 Albuquerqu NM 36 -106 Santa Fe NM 36 -106 0.0 Amarillo TX 35 -102 Carlsbad NM 32 -104 3.6 Anchorage AK 61 -150 Nome AK 64 -165 15.3 Annapolis MD 39 -77 Washington DC 39 -77 0.0 Atlanta GA 34 -84 Knoxville TN 36 -84 2.0 Augusta ME 44 -70 Portland ME 44 -70 0.0 Austin TX 30 -98 San Antoni TX 29 -98 1.0 Baker OR 45 -118 Lewiston ID 46 -117 1.4 Baltimore MD 39 -76 Dover DE 39 -76 0.0
The outer query joins the table to itself and determines the distance between the first city A1 in table A and city B2 (the first city that is not equal to city A1) in Table B. PROC SQL then runs the subquery. The subquery does another self-join and calculates the minimum distance between city A1 and all other cities in the table other than city A1. The outer query tests to see whether the distance between cities A1 and B2 is equal to the minimum distance that was calculated by the subquery. If they are equal, then a row that contains cities A1 and B2 with their coordinates and distance is written.
Copyright © 2008 by SAS Institute Inc., Cary, NC, USA. All rights reserved.