Subqueries are a powerful tool that you can use in all four SQL data statements. This chapter explores in great detail the many uses of the subquery.
A subquery is a query contained within another SQL statement (which I refer to as the containing statement for the rest of this discussion). A subquery is always enclosed within parentheses, and it is usually executed prior to the containing statement. Like any query, a subquery returns a result set that may consist of:
A single row with a single column
Multiple rows with a single column
Multiple rows and columns
The type of result set the subquery returns determines how it may be used and which operators the containing statement may use to interact with the data the subquery returns. When the containing statement has finished executing, the data returned by any subqueries is discarded, making a subquery act like a temporary table with statement scope (meaning that the server frees up any memory allocated to the subquery results after the SQL statement has finished execution).
You already saw several examples of subqueries in earlier chapters, but here’s a simple example to get started:
mysql>SELECT account_id, product_cd, cust_id, avail_balance->FROM account->WHERE account_id = (SELECT MAX(account_id) FROM account);+------------+------------+---------+---------------+ | account_id | product_cd | cust_id | avail_balance | +------------+------------+---------+---------------+ | 29 | SBL | 13 | 50000.00 | +------------+------------+---------+---------------+ 1 row in set (0.65 sec)
In this example, the subquery returns the maximum value found in the account_id column in the account table, and the containing statement then returns data about
that account. If you are ever confused about what a subquery is doing, you can run
the subquery by itself (without the parentheses) to see what it returns. Here’s the
subquery from the previous example:
mysql> SELECT MAX(account_id) FROM account;
+-----------------+
| MAX(account_id) |
+-----------------+
| 29 |
+-----------------+
1 row in set (0.00 sec)So, the subquery returns a single row with a single column, which allows it to be used as one of the expressions in an equality condition (if the subquery returned two or more rows, it could be compared to something but could not be equal to anything, but more on this later). In this case, you can take the value the subquery returned and substitute it into the righthand expression of the filter condition in the containing query, as in:
mysql>SELECT account_id, product_cd, cust_id, avail_balance->FROM account->WHERE account_id = 29;+------------+------------+---------+---------------+ | account_id | product_cd | cust_id | avail_balance | +------------+------------+---------+---------------+ | 29 | SBL | 13 | 50000.00 | +------------+------------+---------+---------------+ 1 row in set (0.02 sec)
The subquery is useful in this case because it allows you to retrieve information
about the highest numbered account in a single query, rather than retrieving the
maximum account_id using one query and then
writing a second query to retrieve the desired data from the account table. As you will see, subqueries are useful
in many other situations as well, and may become one of the most powerful tools in
your SQL toolkit.
Along with the differences noted previously regarding the type of result set a subquery returns (single row/column, single row/multicolumn, or multiple columns), you can use another factor to differentiate subqueries; some subqueries are completely self-contained (called noncorrelated subqueries), while others reference columns from the containing statement (called correlated subqueries). The next several sections explore these two subquery types and show the different operators that you can employ to interact with them.
The example from earlier in the chapter is a noncorrelated subquery; it may be
executed alone and does not reference anything from the containing statement. Most
subqueries that you encounter will be of this type unless you are writing update or delete
statements, which frequently make use of correlated subqueries (more on this later).
Along with being noncorrelated, the example from earlier in the chapter also returns
a table comprising a single row and column. This type of subquery is known as a
scalar subquery and can appear on either side of a
condition using the usual operators (=, <>, <, >, <=, >=). The next example shows how you can use a scalar subquery
in an inequality condition:
mysql>SELECT account_id, product_cd, cust_id, avail_balance->FROM account->WHERE open_emp_id <> (SELECT e.emp_id->FROM employee e INNER JOIN branch b->ON e.assigned_branch_id = b.branch_id->WHERE e.title = 'Head Teller' AND b.city = 'Woburn');+------------+------------+---------+---------------+ | account_id | product_cd | cust_id | avail_balance | +------------+------------+---------+---------------+ | 7 | CHK | 3 | 1057.75 | | 8 | MM | 3 | 2212.50 | | 10 | CHK | 4 | 534.12 | | 11 | SAV | 4 | 767.77 | | 12 | MM | 4 | 5487.09 | | 13 | CHK | 5 | 2237.97 | | 14 | CHK | 6 | 122.37 | | 15 | CD | 6 | 10000.00 | | 18 | CHK | 8 | 3487.19 | | 19 | SAV | 8 | 387.99 | | 21 | CHK | 9 | 125.67 | | 22 | MM | 9 | 9345.55 | | 23 | CD | 9 | 1500.00 | | 24 | CHK | 10 | 23575.12 | | 25 | BUS | 10 | 0.00 | | 28 | CHK | 12 | 38552.05 | | 29 | SBL | 13 | 50000.00 | +------------+------------+---------+---------------+ 17 rows in set (0.86 sec)
This query returns data concerning all accounts that were not
opened by the head teller at the Woburn branch (the subquery is written using the
assumption that there is only a single head teller at each branch). The subquery in
this example is a bit more complex than in the previous example, in that it joins
two tables and includes two filter conditions. Subqueries may be as simple or as
complex as you need them to be, and they may utilize any and all the available query
clauses (select, from, where, group by, having, and order
by).
If you use a subquery in an equality condition, but the subquery returns more than one row, you will receive an error. For example, if you modify the previous query such that the subquery returns all tellers at the Woburn branch instead of the single head teller, you will receive the following error:
mysql>SELECT account_id, product_cd, cust_id, avail_balance->FROM account->WHERE open_emp_id <> (SELECT e.emp_id->FROM employee e INNER JOIN branch b->ON e.assigned_branch_id = b.branch_id->WHERE e.title = 'Teller' AND b.city = 'Woburn');ERROR 1242 (21000): Subquery returns more than 1 row
If you run the subquery by itself, you will see the following results:
mysql>SELECT e.emp_id->FROM employee e INNER JOIN branch b->ON e.assigned_branch_id = b.branch_id->WHERE e.title = 'Teller' AND b.city = 'Woburn';+--------+ | emp_id | +--------+ | 11 | | 12 | +--------+ 2 rows in set (0.02 sec)
The containing query fails because an expression (open_emp_id) cannot be equated to a set of expressions (emp_ids 11 and 12). In other words, a single thing
cannot be equated to a set of things. In the next section, you will see how to fix
the problem by using a different operator.
If your subquery returns more than one row, you will not be able to use it on one side of an equality condition, as the previous example demonstrated. However, there are four additional operators that you can use to build conditions with these types of subqueries.
While you can’t equate a single value to a set of
values, you can check to see whether a single value can be found
within a set of values. The next example, while it
doesn’t use a subquery, demonstrates how to build a condition that uses the
in operator to search for a value
within a set of values:
mysql>SELECT branch_id, name, city->FROM branch->WHERE name IN ('Headquarters', 'Quincy Branch');+-----------+---------------+---------+ | branch_id | name | city | +-----------+---------------+---------+ | 1 | Headquarters | Waltham | | 3 | Quincy Branch | Quincy | +-----------+---------------+---------+ 2 rows in set (0.03 sec)
The expression on the lefthand side of the condition is the name column, while the righthand side of the
condition is a set of strings. The in
operator checks to see whether either of the strings can be found in the
name column; if so, the condition is
met and the row is added to the result set. You could achieve the same
results using two equality conditions, as in:
mysql>SELECT branch_id, name, city->FROM branch->WHERE name = 'Headquarters' OR name = 'Quincy Branch';+-----------+---------------+---------+ | branch_id | name | city | +-----------+---------------+---------+ | 1 | Headquarters | Waltham | | 3 | Quincy Branch | Quincy | +-----------+---------------+---------+ 2 rows in set (0.01 sec)
While this approach seems reasonable when the set contains only two
expressions, it is easy to see why a single condition using the in operator would be preferable if the set
contained dozens (or hundreds, thousands, etc.) of values.
Although you will occasionally create a set of strings, dates, or numbers
to use on one side of a condition, you are more likely to generate the set
at query execution via a subquery that returns one or more rows. The
following query uses the in operator with
a subquery on the righthand side of the filter condition to see which
employees supervise other employees:
mysql>SELECT emp_id, fname, lname, title->FROM employee->WHERE emp_id IN (SELECT superior_emp_id->FROM employee);+--------+---------+-----------+--------------------+ | emp_id | fname | lname | title | +--------+---------+-----------+--------------------+ | 1 | Michael | Smith | President | | 3 | Robert | Tyler | Treasurer | | 4 | Susan | Hawthorne | Operations Manager | | 6 | Helen | Fleming | Head Teller | | 10 | Paula | Roberts | Head Teller | | 13 | John | Blake | Head Teller | | 16 | Theresa | Markham | Head Teller | +--------+---------+-----------+--------------------+ 7 rows in set (0.01 sec)
The subquery returns the IDs of all employees who supervise other
employees, and the containing query retrieves four columns from the employee table for these employees. Here are
the results of the subquery:
mysql>SELECT superior_emp_id->FROM employee;+-----------------+ | superior_emp_id | +-----------------+ | NULL | | 1 | | 1 | | 3 | | 4 | | 4 | | 4 | | 4 | | 4 | | 6 | | 6 | | 6 | | 10 | | 10 | | 13 | | 13 | | 16 | | 16 | +-----------------+ 18 rows in set (0.00 sec)
As you can see, some employee IDs are listed more than once, since some
employees supervise multiple people. This doesn’t adversely affect the
results of the containing query, since it doesn’t matter whether an employee
ID can be found in the result set of the subquery once or more than once. Of
course, you could add the distinct
keyword to the subquery’s select clause
if it bothers you to have duplicates in the table returned by the subquery,
but it won’t change the containing query’s result set.
Along with seeing whether a value exists within a set of values, you can
check the converse using the not in
operator. Here’s another version of the previous query using not in instead of in:
mysql>SELECT emp_id, fname, lname, title->FROM employee->WHERE emp_id NOT IN (SELECT superior_emp_id->FROM employee->WHERE superior_emp_id IS NOT NULL);+--------+----------+----------+----------------+ | emp_id | fname | lname | title | +--------+----------+----------+----------------+ | 2 | Susan | Barker | Vice President | | 5 | John | Gooding | Loan Manager | | 7 | Chris | Tucker | Teller | | 8 | Sarah | Parker | Teller | | 9 | Jane | Grossman | Teller | | 11 | Thomas | Ziegler | Teller | | 12 | Samantha | Jameson | Teller | | 14 | Cindy | Mason | Teller | | 15 | Frank | Portman | Teller | | 17 | Beth | Fowler | Teller | | 18 | Rick | Tulman | Teller | +--------+----------+----------+----------------+ 11 rows in set (0.00 sec)
This query finds all employees who do not supervise
other people. For this query, I needed to add a filter condition to the
subquery to ensure that null values do
not appear in the table returned by the subquery; see the next section for
an explanation of why this filter is needed in this case.
While the in operator is used to see
whether an expression can be found within a set of expressions, the all operator allows you to make comparisons
between a single value and every value in a set. To build such a condition,
you will need to use one of the comparison operators (=, <>, <, >, etc.) in
conjunction with the all operator. For
example, the next query finds all employees whose employee IDs are not equal
to any of the supervisor employee IDs:
mysql>SELECT emp_id, fname, lname, title->FROM employee->WHERE emp_id <> ALL (SELECT superior_emp_id->FROM employee->WHERE superior_emp_id IS NOT NULL);+--------+----------+----------+----------------+ | emp_id | fname | lname | title | +--------+----------+----------+----------------+ | 2 | Susan | Barker | Vice President | | 5 | John | Gooding | Loan Manager | | 7 | Chris | Tucker | Teller | | 8 | Sarah | Parker | Teller | | 9 | Jane | Grossman | Teller | | 11 | Thomas | Ziegler | Teller | | 12 | Samantha | Jameson | Teller | | 14 | Cindy | Mason | Teller | | 15 | Frank | Portman | Teller | | 17 | Beth | Fowler | Teller | | 18 | Rick | Tulman | Teller | +--------+----------+----------+----------------+ 11 rows in set (0.05 sec)
Once again, the subquery returns the set of IDs for those employees who
supervise other people, and the containing query returns data for each
employee whose ID is not equal to all of the IDs returned by the subquery.
In other words, the query finds all employees who are not supervisors. If
this approach seems a bit clumsy to you, you are in good company; most
people would prefer to phrase the query differently and avoid using the
all operator. For example, this query
generates the same results as the last example in the previous section,
which used the not in operator. It’s a
matter of preference, but I think that most people would find the version
that uses not in to be easier to
understand.
When using not in or <> all to compare a value to
a set of values, you must be careful to ensure that the set of values
does not contain a null value,
because the server equates the value on the lefthand side of the
expression to each member of the set, and any attempt to equate a value
to null yields unknown. Thus, the following query returns
an empty set:
mysql>SELECT emp_id, fname, lname, title->FROM employee->WHERE emp_id NOT IN (1, 2, NULL);Empty set (0.00 sec)
In some cases, the all operator is a
bit more natural. The next example uses all to find accounts having an available balance smaller than
all of Frank Tucker’s accounts:
mysql>SELECT account_id, cust_id, product_cd, avail_balance->FROM account->WHERE avail_balance < ALL (SELECT a.avail_balance->FROM account a INNER JOIN individual i->ON a.cust_id = i.cust_id->WHERE i.fname = 'Frank' AND i.lname = 'Tucker');+------------+---------+------------+---------------+ | account_id | cust_id | product_cd | avail_balance | +------------+---------+------------+---------------+ | 2 | 1 | SAV | 500.00 | | 5 | 2 | SAV | 200.00 | | 10 | 4 | CHK | 534.12 | | 11 | 4 | SAV | 767.77 | | 14 | 6 | CHK | 122.37 | | 19 | 8 | SAV | 387.99 | | 21 | 9 | CHK | 125.67 | | 25 | 10 | BUS | 0.00 | +------------+---------+------------+---------------+ 8 rows in set (0.17 sec)
Here’s the data returned by the subquery, which consists of the available balance from each of Frank’s accounts:
mysql>SELECT a.avail_balance->FROM account a INNER JOIN individual i->ON a.cust_id = i.cust_id->WHERE i.fname = 'Frank' AND i.lname = 'Tucker';+---------------+ | avail_balance | +---------------+ | 1057.75 | | 2212.50 | +---------------+ 2 rows in set (0.01 sec)
Frank has two accounts, with the lowest balance being $1,057.75. The containing query finds all accounts having a balance smaller than any of Frank’s accounts, so the result set includes all accounts having a balance less than $1,057.75.
Like the all operator, the any operator allows a value to be compared to
the members of a set of values; unlike all, however, a condition using the any operator evaluates to true as soon as a single comparison is favorable. This is
different from the previous example using the all operator, which evaluates to true only if comparisons against all
members of the set are favorable. For example, you might want to find all
accounts having an available balance greater than any
of Frank Tucker’s accounts:
mysql>SELECT account_id, cust_id, product_cd, avail_balance->FROM account->WHERE avail_balance > ANY (SELECT a.avail_balance->FROM account a INNER JOIN individual i->ON a.cust_id = i.cust_id->WHERE i.fname = 'Frank' AND i.lname = 'Tucker');+------------+---------+------------+---------------+ | account_id | cust_id | product_cd | avail_balance | +------------+---------+------------+---------------+ | 3 | 1 | CD | 3000.00 | | 4 | 2 | CHK | 2258.02 | | 8 | 3 | MM | 2212.50 | | 12 | 4 | MM | 5487.09 | | 13 | 5 | CHK | 2237.97 | | 15 | 6 | CD | 10000.00 | | 17 | 7 | CD | 5000.00 | | 18 | 8 | CHK | 3487.19 | | 22 | 9 | MM | 9345.55 | | 23 | 9 | CD | 1500.00 | | 24 | 10 | CHK | 23575.12 | | 27 | 11 | BUS | 9345.55 | | 28 | 12 | CHK | 38552.05 | | 29 | 13 | SBL | 50000.00 | +------------+---------+------------+---------------+ 14 rows in set (0.00 sec)
Frank has two accounts with balances of $1,057.75 and $2,212.50; to have a balance greater than any of these two accounts, an account must have a balance of at least $1,057.75.
So far, all of the subquery examples in this chapter have returned a single column and one or more rows. In certain situations, however, you can use subqueries that return two or more columns. To show the utility of multiple-column subqueries, it might help to look first at an example that uses multiple, single-column subqueries:
mysql>SELECT account_id, product_cd, cust_id->FROM account->WHERE open_branch_id = (SELECT branch_id->FROM branch->WHERE name = 'Woburn Branch')->AND open_emp_id IN (SELECT emp_id->FROM employee->WHERE title = 'Teller' OR title = 'Head Teller');+------------+------------+---------+ | account_id | product_cd | cust_id | +------------+------------+---------+ | 1 | CHK | 1 | | 2 | SAV | 1 | | 3 | CD | 1 | | 4 | CHK | 2 | | 5 | SAV | 2 | | 17 | CD | 7 | | 27 | BUS | 11 | +------------+------------+---------+ 7 rows in set (0.09 sec)
This query uses two subqueries to identify the ID of the Woburn branch and the
IDs of all bank tellers, and the containing query then uses this information to
retrieve all checking accounts opened by a teller at the Woburn branch. However,
since the employee table includes information
about which branch each employee is assigned to, you can achieve the same
results by comparing both the account.open_branch_id and account.open_emp_id columns to a single subquery against the
employee and branch tables. To do so, your filter condition must name both
columns from the account table surrounded by
parentheses and in the same order as returned by the subquery, as in:
mysql>SELECT account_id, product_cd, cust_id->FROM account->WHERE (open_branch_id, open_emp_id) IN->(SELECT b.branch_id, e.emp_id->FROM branch b INNER JOIN employee e->ON b.branch_id = e.assigned_branch_id->WHERE b.name = 'Woburn Branch'->AND (e.title = 'Teller' OR e.title = 'Head Teller'));+------------+------------+---------+ | account_id | product_cd | cust_id | +------------+------------+---------+ | 1 | CHK | 1 | | 2 | SAV | 1 | | 3 | CD | 1 | | 4 | CHK | 2 | | 5 | SAV | 2 | | 17 | CD | 7 | | 27 | BUS | 11 | +------------+------------+---------+ 7 rows in set (0.00 sec)
This version of the query performs the same function as the previous example, but with a single subquery that returns two columns instead of two subqueries that each return a single column.
Of course, you could rewrite the previous example simply to join the three
tables instead of using a subquery, but it’s helpful when learning SQL to see
multiple ways of achieving the same results. Here’s another example, however,
that requires a subquery. Let’s say that there have been some customer
complaints regarding incorrect values in the available/pending balance columns
in the account table. Your job is to find all
accounts whose balances don’t match the sum of the transaction amounts for that
account. Here’s a partial solution to the problem:
SELECT 'ALERT! : Account #1 Has Incorrect Balance!'
FROM account
WHERE (avail_balance, pending_balance) <>
(SELECT SUM(<expression to generate available balance>),
SUM(<expression to generate pending balance>)
FROM transaction
WHERE account_id = 1)
AND account_id = 1;As you can see, I have neglected to fill in the expressions used to sum the
transaction amounts for the available and pending balance calculations, but I
promise to finish the job in Chapter 11 after you learn
how to build case expressions. Even so, the
query is complete enough to see that the subquery is generating two sums from
the transaction table
that are then compared to the avail_balance
and pending_balance columns in the account table. Both the subquery and the
containing query include the filter condition account_id = 1, so the query in its present form will check only
a single account at a time. In the next section, you will learn how to write a
more general form of the query that will check all accounts
with a single execution.
All of the subqueries shown thus far have been independent of their containing statements, meaning that you can execute them by themselves and inspect the results. A correlated subquery, on the other hand, is dependent on its containing statement from which it references one or more columns. Unlike a noncorrelated subquery, a correlated subquery is not executed once prior to execution of the containing statement; instead, the correlated subquery is executed once for each candidate row (rows that might be included in the final results). For example, the following query uses a correlated subquery to count the number of accounts for each customer, and the containing query then retrieves those customers having exactly two accounts:
mysql>SELECT c.cust_id, c.cust_type_cd, c.city->FROM customer c->WHERE 2 = (SELECT COUNT(*)->FROM account a->WHERE a.cust_id = c.cust_id);+---------+--------------+---------+ | cust_id | cust_type_cd | city | +---------+--------------+---------+ | 2 | I | Woburn | | 3 | I | Quincy | | 6 | I | Waltham | | 8 | I | Salem | | 10 | B | Salem | +---------+--------------+---------+ 5 rows in set (0.01 sec)
The reference to c.cust_id at the very end of
the subquery is what makes the subquery correlated; the containing query must supply
values for c.cust_id for the subquery to execute.
In this case, the containing query retrieves all 13 rows from the customer table and executes the subquery once for each
customer, passing in the appropriate customer ID for each execution. If the subquery
returns the value 2, then the filter condition is
met and the row is added to the result set.
Along with equality conditions, you can use correlated subqueries in other types of conditions, such as the range condition illustrated here:
mysql>SELECT c.cust_id, c.cust_type_cd, c.city->FROM customer c->WHERE (SELECT SUM(a.avail_balance)->FROM account a->WHERE a.cust_id = c.cust_id)->BETWEEN 5000 AND 10000;+---------+--------------+------------+ | cust_id | cust_type_cd | city | +---------+--------------+------------+ | 4 | I | Waltham | | 7 | I | Wilmington | | 11 | B | Wilmington | +---------+--------------+------------+ 3 rows in set (0.02 sec)
This variation on the previous query finds all customers whose total available balance across all accounts lies between $5,000 and $10,000. Once again, the correlated subquery is executed 13 times (once for each customer row), and each execution of the subquery returns the total account balance for the given customer.
Another subtle difference in the previous query is that the subquery is on the lefthand side of the condition, which may look a bit odd but is perfectly valid.
At the end of the previous section, I demonstrated how to check the available and pending balances of an account against the transactions logged against the account, and I promised to show you how to modify the example to run all accounts in a single execution. Here’s the example again:
SELECT 'ALERT! : Account #1 Has Incorrect Balance!'
FROM account
WHERE (avail_balance, pending_balance) <>
(SELECT SUM(<expression to generate available balance>),
SUM(<expression to generate pending balance>)
FROM transaction
WHERE account_id = 1)
AND account_id = 1;Using a correlated subquery instead of a noncorrelated subquery, you can execute the containing query once, and the subquery will be run for each account. Here’s the updated version:
SELECT CONCAT('ALERT! : Account #', a.account_id,
' Has Incorrect Balance!')
FROM account a
WHERE (a.avail_balance, a.pending_balance) <>
(SELECT SUM(<expression to generate available balance>),
SUM(<expression to generate pending balance>)
FROM transaction t
WHERE t.account_id = a.account_id);The subquery now includes a filter condition linking the transaction’s account ID
to the account ID from the containing query. The select clause has also been modified to concatenate an alert message
that includes the account ID rather than the hardcoded value 1.
While you will often see correlated subqueries used in equality and range
conditions, the most common operator used to build conditions that utilize
correlated subqueries is the exists operator.
You use the exists operator when you want to
identify that a relationship exists without regard for the quantity; for
example, the following query finds all the accounts for which a transaction was
posted on a particular day, without regard for how many transactions were
posted:
SELECT a.account_id, a.product_cd, a.cust_id, a.avail_balance
FROM account a
WHERE EXISTS (SELECT 1
FROM transaction t
WHERE t.account_id = a.account_id
AND t.txn_date = '2008-09-22');Using the exists operator, your subquery
can return zero, one, or many rows, and the condition simply checks whether the
subquery returned any rows. If you look at the select clause of the subquery, you will see that it consists of a
single literal (1); since the condition in
the containing query only needs to know how many rows have been returned, the
actual data the subquery returned is irrelevant. Your subquery can return
whatever strikes your fancy, as demonstrated next:
SELECT a.account_id, a.product_cd, a.cust_id, a.avail_balance
FROM account a
WHERE EXISTS (SELECT t.txn_id, 'hello', 3.1415927
FROM transaction t
WHERE t.account_id = a.account_id
AND t.txn_date = '2008-09-22');However, the convention is to specify either select
1 or select * when using
exists.
You may also use not exists to check for
subqueries that return no rows, as demonstrated by the following:
mysql>SELECT a.account_id, a.product_cd, a.cust_id->FROM account a->WHERE NOT EXISTS (SELECT 1->FROM business b->WHERE b.cust_id = a.cust_id);+------------+------------+---------+ | account_id | product_cd | cust_id | +------------+------------+---------+ | 1 | CHK | 1 | | 2 | SAV | 1 | | 3 | CD | 1 | | 4 | CHK | 2 | | 5 | SAV | 2 | | 7 | CHK | 3 | | 8 | MM | 3 | | 10 | CHK | 4 | | 11 | SAV | 4 | | 12 | MM | 4 | | 13 | CHK | 5 | | 14 | CHK | 6 | | 15 | CD | 6 | | 17 | CD | 7 | | 18 | CHK | 8 | | 19 | SAV | 8 | | 21 | CHK | 9 | | 22 | MM | 9 | | 23 | CD | 9 | +------------+------------+---------+ 19 rows in set (0.99 sec)
This query finds all customers whose customer ID does not appear in the
business table, which is a roundabout way
of finding all nonbusiness customers.
All of the examples thus far in the chapter have been select statements, but don’t think that means that subqueries
aren’t useful in other SQL statements. Subqueries are used heavily in update, delete,
and insert statements as well, with
correlated subqueries appearing frequently in update and delete statements.
Here’s an example of a correlated subquery used to modify the last_activity_date column in the account table:
UPDATE account a SET a.last_activity_date = (SELECT MAX(t.txn_date) FROM transaction t WHERE t.account_id = a.account_id);
This statement modifies every row in the account table (since there is no where clause) by finding the latest transaction date for each
account. While it seems reasonable to expect that every account will have at
least one transaction linked to it, it would be best to check whether an account
has any transactions before attempting to update the last_activity_date column; otherwise, the column will be set to
null, since the subquery would return no
rows. Here’s another version of the update
statement, this time employing a where clause
with a second correlated subquery:
UPDATE account a SET a.last_activity_date = (SELECT MAX(t.txn_date) FROM transaction t WHERE t.account_id = a.account_id) WHERE EXISTS (SELECT 1 FROM transaction t WHERE t.account_id = a.account_id);
The two correlated subqueries are identical except for the select clauses. The subquery in the set clause, however, executes only if the
condition in the update statement’s where clause evaluates to true (meaning that at least one transaction was found for the
account), thus protecting the data in the last_activity_date column from being overwritten with a null.
Correlated subqueries are also common in delete statements. For example, you may run a data maintenance
script at the end of each month that removes unnecessary data. The script might
include the following statement, which removes data from the department table that has no
child rows in the employee table:
DELETE FROM department WHERE NOT EXISTS (SELECT 1 FROM employee WHERE employee.dept_id = department.dept_id);
When using correlated subqueries with delete statements in MySQL, keep in mind that, for whatever
reason, table aliases are not allowed when using delete, which is why I had to use the entire table name in the
subquery. With most other database servers, you could provide aliases for the
department and employee tables, such as:
DELETE FROM department d WHERE NOT EXISTS (SELECT 1 FROM employee e WHERE e.dept_id = d.dept_id);
Now that you have learned about the different types of subqueries and the different operators that you can employ to interact with the data returned by subqueries, it’s time to explore the many ways in which you can use subqueries to build powerful SQL statements. The next three sections demonstrate how you may use subqueries to construct custom tables, to build conditions, and to generate column values in result sets.
Back in Chapter 3, I stated that the from clause of a select statement names the tables to be used
by the query. Since a subquery generates a result set containing rows and
columns of data, it is perfectly valid to include subqueries in your from clause along with tables. Although it might,
at first glance, seem like an interesting feature without much practical merit,
using subqueries alongside tables is one of the most powerful tools available
when writing queries. Here’s a simple example:
mysql>SELECT d.dept_id, d.name, e_cnt.how_many num_employees->FROM department d INNER JOIN->(SELECT dept_id, COUNT(*) how_many->FROM employee->GROUP BY dept_id) e_cnt->ON d.dept_id = e_cnt.dept_id;+---------+----------------+---------------+ | dept_id | name | num_employees | +---------+----------------+---------------+ | 1 | Operations | 14 | | 2 | Loans | 1 | | 3 | Administration | 3 | +---------+----------------+---------------+ 3 rows in set (0.04 sec)
In this example, a subquery generates a list of department IDs along with the number of employees assigned to each department. Here’s the result set generated by the subquery:
mysql>SELECT dept_id, COUNT(*) how_many->FROM employee->GROUP BY dept_id;+---------+----------+ | dept_id | how_many | +---------+----------+ | 1 | 14 | | 2 | 1 | | 3 | 3 | +---------+----------+ 3 rows in set (0.00 sec)
The subquery is given the name e_cnt and is
joined to the department table via the
dept_id column. The containing query then
retrieves the department ID and name from the department table, along with the employee count from the e_cnt subquery.
Subqueries used in the from clause must be
noncorrelated; they are executed first, and the data is held in memory until the
containing query finishes execution. Subqueries offer immense flexibility when
writing queries, because you can go far beyond the set of available tables to
create virtually any view of the data that you desire, and then join the results
to other tables or subqueries. If you are writing reports or generating data
feeds to external systems, you may be able to do things with a single query that
used to demand multiple queries or a procedural language to
accomplish.
Along with using subqueries to summarize existing data, you can use subqueries to generate data that doesn’t exist in any form within your database. For example, you may wish to group your customers by the amount of money held in deposit accounts, but you want to use group definitions that are not stored in your database. For example, let’s say you want to sort your customers into the groups shown in Table 9-1.
|
Group name |
Lower limit |
Upper limit |
|
Small Fry |
0 |
$4,999.99 |
|
Average Joes |
$5,000 |
$9,999.99 |
|
Heavy Hitters |
$10,000 |
$9,999,999.99 |
To generate these groups within a single query, you will need a way to define these three groups. The first step is to define a query that generates the group definitions:
mysql>SELECT 'Small Fry' name, 0 low_limit, 4999.99 high_limit->UNION ALL->SELECT 'Average Joes' name, 5000 low_limit, 9999.99 high_limit->UNION ALL->SELECT 'Heavy Hitters' name, 10000 low_limit, 9999999.99 high_limit;+---------------+-----------+------------+ | name | low_limit | high_limit | +---------------+-----------+------------+ | Small Fry | 0 | 4999.99 | | Average Joes | 5000 | 9999.99 | | Heavy Hitters | 10000 | 9999999.99 | +---------------+-----------+------------+ 3 rows in set (0.00 sec)
I have used the set operator union all
to merge the results from three separate queries into a single result set.
Each query retrieves three literals, and the results from the three queries
are put together to generate a result set with three rows and three columns.
You now have a query to generate the desired groups, and you can place it
into the from clause of another query to
generate your customer groups:
mysql>SELECT groups.name, COUNT(*) num_customers->FROM->(SELECT SUM(a.avail_balance) cust_balance->FROM account a INNER JOIN product p->ON a.product_cd = p.product_cd->WHERE p.product_type_cd = 'ACCOUNT'->GROUP BY a.cust_id) cust_rollup->INNER JOIN->(SELECT 'Small Fry' name, 0 low_limit, 4999.99 high_limit->UNION ALL->SELECT 'Average Joes' name, 5000 low_limit,->9999.99 high_limit->UNION ALL->SELECT 'Heavy Hitters' name, 10000 low_limit,->9999999.99 high_limit) groups->ON cust_rollup.cust_balance->BETWEEN groups.low_limit AND groups.high_limit->GROUP BY groups.name;+---------------+---------------+ | name | num_customers | +---------------+---------------+ | Average Joes | 2 | | Heavy Hitters | 4 | | Small Fry | 5 | +---------------+---------------+ 3 rows in set (0.01 sec)
The from clause contains two
subqueries; the first subquery, named cust_rollup, returns the total deposit balances for each
customer, while the second subquery, named groups, generates the three customer groupings. Here’s the
data generated by cust_rollup:
mysql>SELECT SUM(a.avail_balance) cust_balance->FROM account a INNER JOIN product p->ON a.product_cd = p.product_cd->WHERE p.product_type_cd = 'ACCOUNT'->GROUP BY a.cust_id;+--------------+ | cust_balance | +--------------+ | 4557.75 | | 2458.02 | | 3270.25 | | 6788.98 | | 2237.97 | | 10122.37 | | 5000.00 | | 3875.18 | | 10971.22 | | 23575.12 | | 38552.05 | +--------------+ 11 rows in set (0.05 sec)
The data generated by cust_rollup is
then joined to the groups table via a
range condition (cust_rollup.cust_balance
BETWEEN
groups.low_limit
AND
groups.high_limit). Finally, the joined
data is grouped and the number of customers in each group is counted to
generate the final result set.
Of course, you could simply decide to build a permanent table to hold the group definitions instead of using a subquery. Using that approach, you would find your database to be littered with small special-purpose tables after awhile, and you wouldn’t remember the reason for which most of them were created. I’ve worked in environments where the database users were allowed to create their own tables for special purposes, and the results were disastrous (tables not included in backups, tables lost during server upgrades, server downtime due to space allocation issues, etc.). Armed with subqueries, however, you will be able to adhere to a policy where tables are added to a database only when there is a clear business need to store new data.
In systems used for reporting or data-feed generation, you will often come across queries such as the following:
mysql>SELECT p.name product, b.name branch,->CONCAT(e.fname, ' ', e.lname) name,->SUM(a.avail_balance) tot_deposits->FROM account a INNER JOIN employee e->ON a.open_emp_id = e.emp_id->INNER JOIN branch b->ON a.open_branch_id = b.branch_id->INNER JOIN product p->ON a.product_cd = p.product_cd->WHERE p.product_type_cd = 'ACCOUNT'->GROUP BY p.name, b.name, e.fname, e.lname->ORDER BY 1,2;+------------------------+---------------+-----------------+--------------+ | product | branch | name | tot_deposits | +------------------------+---------------+-----------------+--------------+ | certificate of deposit | Headquarters | Michael Smith | 11500.00 | | certificate of deposit | Woburn Branch | Paula Roberts | 8000.00 | | checking account | Headquarters | Michael Smith | 782.16 | | checking account | Quincy Branch | John Blake | 1057.75 | | checking account | So. NH Branch | Theresa Markham | 67852.33 | | checking account | Woburn Branch | Paula Roberts | 3315.77 | | money market account | Headquarters | Michael Smith | 14832.64 | | money market account | Quincy Branch | John Blake | 2212.50 | | savings account | Headquarters | Michael Smith | 767.77 | | savings account | So. NH Branch | Theresa Markham | 387.99 | | savings account | Woburn Branch | Paula Roberts | 700.00 | +------------------------+---------------+-----------------+--------------+ 11 rows in set (0.00 sec)
This query sums all deposit account balances by account type, the employee
that opened the accounts, and the branches at which the accounts were
opened. If you look at the query closely, you will see that the product, branch, and employee
tables are needed only for display purposes, and that the account table has everything needed to
generate the groupings (product_cd,
open_branch_id, open_emp_id, and avail_balance). Therefore, you could separate out the task of
generating the groups into a subquery, and then join the other three tables
to the table generated by the subquery to achieve the desired end result.
Here’s the grouping subquery:
mysql>SELECT product_cd, open_branch_id branch_id, open_emp_id emp_id,->SUM(avail_balance) tot_deposits->FROM account->GROUP BY product_cd, open_branch_id, open_emp_id;+------------+-----------+--------+--------------+ | product_cd | branch_id | emp_id | tot_deposits | +------------+-----------+--------+--------------+ | BUS | 2 | 10 | 9345.55 | | BUS | 4 | 16 | 0.00 | | CD | 1 | 1 | 11500.00 | | CD | 2 | 10 | 8000.00 | | CHK | 1 | 1 | 782.16 | | CHK | 2 | 10 | 3315.77 | | CHK | 3 | 13 | 1057.75 | | CHK | 4 | 16 | 67852.33 | | MM | 1 | 1 | 14832.64 | | MM | 3 | 13 | 2212.50 | | SAV | 1 | 1 | 767.77 | | SAV | 2 | 10 | 700.00 | | SAV | 4 | 16 | 387.99 | | SBL | 3 | 13 | 50000.00 | +------------+-----------+--------+--------------+ 14 rows in set (0.02 sec)
This is the heart of the query; the other tables are needed only to
provide meaningful strings in place of the product_cd, open_branch_id, and open_emp_id foreign key columns. The next query wraps the
query against the account table in a
subquery and joins the table that results to the other three tables:
mysql>SELECT p.name product, b.name branch,->CONCAT(e.fname, ' ', e.lname) name,->account_groups.tot_deposits->FROM->(SELECT product_cd, open_branch_id branch_id,->open_emp_id emp_id,->SUM(avail_balance) tot_deposits->FROM account->GROUP BY product_cd, open_branch_id, open_emp_id) account_groups->INNER JOIN employee e ON e.emp_id = account_groups.emp_id->INNER JOIN branch b ON b.branch_id = account_groups.branch_id->INNER JOIN product p ON p.product_cd = account_groups.product_cd->WHERE p.product_type_cd = 'ACCOUNT';+------------------------+---------------+-----------------+--------------+ | product | branch | name | tot_deposits | +------------------------+---------------+-----------------+--------------+ | certificate of deposit | Headquarters | Michael Smith | 11500.00 | | certificate of deposit | Woburn Branch | Paula Roberts | 8000.00 | | checking account | Headquarters | Michael Smith | 782.16 | | checking account | Quincy Branch | John Blake | 1057.75 | | checking account | So. NH Branch | Theresa Markham | 67852.33 | | checking account | Woburn Branch | Paula Roberts | 3315.77 | | money market account | Headquarters | Michael Smith | 14832.64 | | money market account | Quincy Branch | John Blake | 2212.50 | | savings account | Headquarters | Michael Smith | 767.77 | | savings account | So. NH Branch | Theresa Markham | 387.99 | | savings account | Woburn Branch | Paula Roberts | 700.00 | +------------------------+---------------+-----------------+--------------+ 11 rows in set (0.01 sec)
I realize that beauty is in the eye of the beholder, but I find this
version of the query to be far more satisfying than the big, flat version.
This version may execute faster, as well, because the grouping is being done
on small, numeric foreign key columns (product_cd, open_branch_id, open_emp_id) instead of potentially lengthy string columns
(branch.name, product.name, employee.fname, employee.lname).
Many of the examples in this chapter used subqueries as expressions in filter
conditions, so it should not surprise you that this is one of the main uses for
subqueries. However, filter conditions using subqueries are not found only in
the where clause. For example, the next query
uses a subquery in the having clause to find
the employee responsible for opening the most accounts:
mysql>SELECT open_emp_id, COUNT(*) how_many->FROM account->GROUP BY open_emp_id->HAVING COUNT(*) = (SELECT MAX(emp_cnt.how_many)->FROM (SELECT COUNT(*) how_many->FROM account->GROUP BY open_emp_id) emp_cnt);+-------------+----------+ | open_emp_id | how_many | +-------------+----------+ | 1 | 8 | +-------------+----------+ 1 row in set (0.01 sec)
The subquery in the having clause finds the
maximum number of accounts opened by any employee, and the containing query
finds the employee that has opened that number of accounts. If multiple
employees tie for the highest number of opened accounts, then the query would
return multiple rows.
For this last section of the chapter, I finish where I began: with
single-column, single-row scalar subqueries. Along with being used in filter
conditions, scalar subqueries may be used wherever an expression can appear,
including the select and order by clauses of a query and the values clause of an insert statement.
In Task-oriented subqueries, I showed you how to use a subquery to separate out the grouping mechanism from the rest of the query. Here’s another version of the same query that uses subqueries for the same purpose, but in a different way:
mysql>SELECT->(SELECT p.name FROM product p->WHERE p.product_cd = a.product_cd->AND p.product_type_cd = 'ACCOUNT') product,->(SELECT b.name FROM branch b->WHERE b.branch_id = a.open_branch_id) branch,->(SELECT CONCAT(e.fname, ' ', e.lname) FROM employee e->WHERE e.emp_id = a.open_emp_id) name,->SUM(a.avail_balance) tot_deposits->FROM account a->GROUP BY a.product_cd, a.open_branch_id, a.open_emp_id->ORDER BY 1,2;+------------------------+---------------+-----------------+--------------+ | product | branch | name | tot_deposits | +------------------------+---------------+-----------------+--------------+ | NULL | Quincy Branch | John Blake | 50000.00 | | NULL | So. NH Branch | Theresa Markham | 0.00 | | NULL | Woburn Branch | Paula Roberts | 9345.55 | | certificate of deposit | Headquarters | Michael Smith | 11500.00 | | certificate of deposit | Woburn Branch | Paula Roberts | 8000.00 | | checking account | Headquarters | Michael Smith | 782.16 | | checking account | Quincy Branch | John Blake | 1057.75 | | checking account | So. NH Branch | Theresa Markham | 67852.33 | | checking account | Woburn Branch | Paula Roberts | 3315.77 | | money market account | Headquarters | Michael Smith | 14832.64 | | money market account | Quincy Branch | John Blake | 2212.50 | | savings account | Headquarters | Michael Smith | 767.77 | | savings account | So. NH Branch | Theresa Markham | 387.99 | | savings account | Woburn Branch | Paula Roberts | 700.00 | +------------------------+---------------+-----------------+--------------+ 14 rows in set (0.01 sec)
There are two main differences between this query and the earlier version
using a subquery in the from clause:
Instead of joining the product,
branch, and employee tables to the account data,
correlated scalar subqueries are used in the select clause to look up the product, branch, and
employee names.
The result set has 14 rows instead of 11 rows, and three of the
product names are null.
The reason for the extra three rows in the result set is that the previous
version of the query included the filter condition p.product_type_cd = 'ACCOUNT'. That filter eliminated rows with
product types of INSURANCE and LOAN, such as small business loans. Since this
version of the query doesn’t include a join to the product table, there is no way to include the filter condition in
the main query. The correlated subquery against the product table does include this filter, but the only effect is to
leave the product name null. If you want to
get rid of the extra three rows, you could join the product table to the account
table and include the filter condition, or you could simply do the
following:
mysql>SELECT all_prods.product, all_prods.branch,->all_prods.name, all_prods.tot_deposits->FROM->(SELECT->(SELECT p.name FROM product p->WHERE p.product_cd = a.product_cd->AND p.product_type_cd = 'ACCOUNT') product,->(SELECT b.name FROM branch b->WHERE b.branch_id = a.open_branch_id) branch,->(SELECT CONCAT(e.fname, ' ', e.lname) FROM employee e->WHERE e.emp_id = a.open_emp_id) name,->SUM(a.avail_balance) tot_deposits->FROM account a->GROUP BY a.product_cd, a.open_branch_id, a.open_emp_id->) all_prods->WHERE all_prods.product IS NOT NULL->ORDER BY 1,2;+------------------------+---------------+-----------------+--------------+ | product | branch | name | tot_deposits | +------------------------+---------------+-----------------+--------------+ | certificate of deposit | Headquarters | Michael Smith | 11500.00 | | certificate of deposit | Woburn Branch | Paula Roberts | 8000.00 | | checking account | Headquarters | Michael Smith | 782.16 | | checking account | Quincy Branch | John Blake | 1057.75 | | checking account | So. NH Branch | Theresa Markham | 67852.33 | | checking account | Woburn Branch | Paula Roberts | 3315.77 | | money market account | Headquarters | Michael Smith | 14832.64 | | money market account | Quincy Branch | John Blake | 2212.50 | | savings account | Headquarters | Michael Smith | 767.77 | | savings account | So. NH Branch | Theresa Markham | 387.99 | | savings account | Woburn Branch | Paula Roberts | 700.00 | +------------------------+---------------+-----------------+--------------+ 11 rows in set (0.01 sec)
Simply by wrapping the previous query in a subquery (called all_prods) and adding a filter condition to
exclude null values of the product column, the query now returns the desired
11 rows. The end result is a query that performs all grouping against raw data
in the account table, and then embellishes
the output using data in three other tables, and without doing any
joins.
As previously noted, scalar subqueries can also appear in the order by clause. The following query retrieves
employee data sorted by the last name of each employee’s boss, and then by the
employee’s last name:
mysql>SELECT emp.emp_id, CONCAT(emp.fname, ' ', emp.lname) emp_name,->(SELECT CONCAT(boss.fname, ' ', boss.lname)->FROM employee boss->WHERE boss.emp_id = emp.superior_emp_id) boss_name->FROM employee emp->WHERE emp.superior_emp_id IS NOT NULL->ORDER BY (SELECT boss.lname FROM employee boss->WHERE boss.emp_id = emp.superior_emp_id), emp.lname;+--------+------------------+-----------------+ | emp_id | emp_name | boss_name | +--------+------------------+-----------------+ | 14 | Cindy Mason | John Blake | | 15 | Frank Portman | John Blake | | 9 | Jane Grossman | Helen Fleming | | 8 | Sarah Parker | Helen Fleming | | 7 | Chris Tucker | Helen Fleming | | 13 | John Blake | Susan Hawthorne | | 6 | Helen Fleming | Susan Hawthorne | | 5 | John Gooding | Susan Hawthorne | | 16 | Theresa Markham | Susan Hawthorne | | 10 | Paula Roberts | Susan Hawthorne | | 17 | Beth Fowler | Theresa Markham | | 18 | Rick Tulman | Theresa Markham | | 12 | Samantha Jameson | Paula Roberts | | 11 | Thomas Ziegler | Paula Roberts | | 2 | Susan Barker | Michael Smith | | 3 | Robert Tyler | Michael Smith | | 4 | Susan Hawthorne | Robert Tyler | +--------+------------------+-----------------+ 17 rows in set (0.01 sec)
The query uses two correlated scalar subqueries: one in the select clause to retrieve the full name of each
employee’s boss, and another in the order by
clause to return just the last name of each employee’s boss for sorting
purposes.
Along with using correlated scalar subqueries in select statements, you can use noncorrelated scalar subqueries to
generate values for an insert statement. For
example, let’s say you are going to generate a new account row, and you’ve been
given the following data:
The product name (“savings account”)
The customer’s federal ID (“555-55-5555”)
The name of the branch where the account was opened (“Quincy Branch”)
The first and last names of the teller who opened the account (“Frank Portman”)
Before you can create a row in the account
table, you will need to look up the key values for all of these pieces of data
so that you can populate the foreign key columns in the account table. You have two choices for how to go about it:
execute four queries to retrieve the primary key values and place those values
into an insert statement, or use subqueries
to retrieve the four key values from within an insert statement. Here’s an example of the latter
approach:
INSERT INTO account (account_id, product_cd, cust_id, open_date, last_activity_date, status, open_branch_id, open_emp_id, avail_balance, pending_balance) VALUES (NULL, (SELECT product_cd FROM product WHERE name = 'savings account'), (SELECT cust_id FROM customer WHERE fed_id = '555-55-5555'), '2008-09-25', '2008-09-25', 'ACTIVE', (SELECT branch_id FROM branch WHERE name = 'Quincy Branch'), (SELECT emp_id FROM employee WHERE lname = 'Portman' AND fname = 'Frank'), 0, 0);
Using a single SQL statement, you can create a row in the account table and look up four foreign key column
values at the same time. There is one downside to this approach, however. When
you use subqueries to generate data for columns that allow null values, your insert statement will succeed even if one of your subqueries
fails to return a value. For example, if you mistyped Frank Portman’s name in
the fourth subquery, a row will still be created in account, but the open_emp_id
would be set to null.
I covered a lot of ground in this chapter, so it might be a good idea to review it. The examples I used in this chapter demonstrated subqueries that:
Return a single column and row, a single column with multiple rows, and multiple columns and rows
Are independent of the containing statement (noncorrelated subqueries)
Reference one or more columns from the containing statement (correlated subqueries)
Are used in conditions that utilize comparison operators as well as the
special-purpose operators in, not in, exists, and not
exists
Can be found in select, update, delete, and insert
statements
Generate result sets that can be joined to other tables (or subqueries) in a query
Can be used to generate values to populate a table or to populate columns in a query’s result set
Are used in the select, from, where, having, and order by clauses of queries
Obviously, subqueries are a very versatile tool, so don’t feel bad if all these concepts haven’t sunk in after reading this chapter for the first time. Keep experimenting with the various uses for subqueries, and you will soon find yourself thinking about how you might utilize a subquery every time you write a nontrivial SQL statement.
These exercises are designed to test your understanding of subqueries. Please see Appendix C for the solutions.
Construct a query against the account table
that uses a filter condition with a noncorrelated subquery against the product table to find all loan accounts (product.product_type_cd = 'LOAN'). Retrieve the
account ID, product code, customer ID, and available balance.
Rework the query from Exercise 9-1 using a correlated
subquery against the product table to achieve
the same results.
Join the following query to the employee
table to show the experience level of each employee:
SELECT 'trainee' name, '2004-01-01' start_dt, '2005-12-31' end_dt UNION ALL SELECT 'worker' name, '2002-01-01' start_dt, '2003-12-31' end_dt UNION ALL SELECT 'mentor' name, '2000-01-01' start_dt, '2001-12-31' end_dt
Give the subquery the alias levels, and
include the employee ID, first name, last name, and experience level (levels.name). (Hint: build a join condition using
an inequality condition to determine into which level the employee.start_date column falls.)