7 min read

Subqueries

1 Introduction

You can usually get pretty far with SQL using simple queries like we’ve seen so far. You can get even further if you know how to use subqueries.

For this post I used the dataset flight from the statistics platform “Kaggle”. You can download it from my GitHub Repository.

2 Preparation

SET LANGUAGE ENGLISH


CREATE DATABASE Subqueries;

USE Subqueries;

2.1 Loading the data set flight

How to load a dataset via SSMS into your database I described in this post: Read and write to files

SELECT *
    FROM flight
    ;

2.2 Creation of the flight_add_info dataset

Below we will still need the following dataset, which I will call flight_add_info.

CREATE TABLE flight_add_info
    (Tai_lNum VARCHAR(100),
    Manufacturer VARCHAR(100),
    Typ VARCHAR(100),
    Maximum_Speed INT,
    Range INT)
    ;
INSERT INTO flight_add_info (Tai_lNum, Manufacturer, Typ, Maximum_Speed, Range) VALUES ('N790AA', 'Airbus', 'A380', 1185, 15200)
INSERT INTO flight_add_info (Tai_lNum, Manufacturer, Typ, Maximum_Speed, Range) VALUES ('N787AA', 'Airbus', 'A340', 870, 10300)
INSERT INTO flight_add_info (Tai_lNum, Manufacturer, Typ, Maximum_Speed, Range) VALUES ('Z793AA', 'Airbus', 'A310', 900, 8000)
SELECT * 
    FROM flight_add_info
    ;

2.3 Creation of the customers dataset

And one more data set that we will need:

CREATE TABLE customers
    (Name VARCHAR(100),
    Age INT,
    Salary INT)
    ;
INSERT INTO customers (Name, Age, Salary) VALUES ('A', 25, 2000)
INSERT INTO customers (Name, Age, Salary) VALUES ('B', 33, 4000)
INSERT INTO customers (Name, Age, Salary) VALUES ('C', 23, 8500)
INSERT INTO customers (Name, Age, Salary) VALUES ('D', 27, 6000)
INSERT INTO customers (Name, Age, Salary) VALUES ('E', 24, 4150)
SELECT * 
    FROM customers
    ;

3 Subqueries

Here is a short explanation of the difference between Subqueries and Joins:

Subquery: When an existing table needs to be manipulated or aggregated to then be joined to a larger table.

Joins: A fully flexible and discretionary use case where a user wants to bring two or more tables together and select and filter as needed.

3.1 Subqueries in the WHERE statement

For our first example, we will use the flight dataset.

There is a column there called ‘Dep_Delay’, which indicates the delay of aircraft departures. Let’s take a look at the mean value for this column:

SELECT AVG(Dep_Delay) AS avg_dep_delay
    FROM flight
    ;

Now, if we want to look at which flights have a greater delay than this mean (22 minutes), we can write it like this:

SELECT Tai_lNum, Origin_Airport, Dep_Delay
    FROM flight
    WHERE Dep_Delay > 22
    ;

But it is not very smart, because the calculated mean value was hard coded here. There is another way to write this and that is with the help of Subqueries:

SELECT Tai_lNum, Origin_Airport, Dep_Delay
    FROM flight
    WHERE Dep_Delay > (SELECT AVG(Dep_Delay)
                        FROM flight)
    ;

Here we insert a subquery that calculates the mean value to which the main query can then reference.

3.2 Nested Subqueries in the WHERE statement

Of course, you can also use several subqueries nested within each other. Let’s take a look at the following data set:

SELECT * 
    FROM flight_add_info
    ;

What do I want to know here?

I want to display the additional information such as manufacturer or aircraft type (data set flight_add_info) for the aircraft whose delayed departure in the data set flight is above the mean.

So we use the above query including subquery only with the addition DISTINCT.

SELECT DISTINCT Tai_lNum
    FROM flight
    WHERE Dep_Delay > (SELECT AVG(Dep_Delay)
                        FROM flight)
    ;

Based on this, I now want to add all information for which Tai_lNum I also have further information available in flight_add_info.

SELECT *
    FROM flight_add_info
    WHERE Tai_lNum IN (SELECT DISTINCT Tai_lNum
                        FROM flight
                        WHERE Dep_Delay > (SELECT AVG(Dep_Delay)
                                            FROM flight))
    ;

For two numbers I had additional information available. Of course, there are other flight numbers (‘Tai_lNum’) in the flight dataset that were above the mean in terms of departure delay, but for which I did not have any additional information (the dataset was of course very very small).

3.3 Subqueries in the FROM statement

So far we have written Subqueries in the WHERE statement. But this is also possible in the FROM statement.

Let’s have a look at this part of the flight dataset:

SELECT Flight_Date, Origin_Airport
    FROM flight
    ;

First I extract the day from the date (‘Flight_Date’) and count the observations grouped by the day from the Flight_Date and the Origin_Airport.

SELECT DAY(Flight_Date) AS day, 
       Origin_Airport, 
       COUNT(*) AS event_count
    FROM flight
    GROUP BY DAY(Flight_Date), Origin_Airport
    ORDER BY DAY(Flight_Date), Origin_Airport
    ;

The output what you see above is my base (and therefore my following Subquery) for my main query.

Tip: When I want to use Subqueries in a SQL statement, I always start writing with them.

SELECT Origin_Airport, AVG(event_count) AS avg_event_count
    FROM (SELECT DAY(Flight_Date) AS day, 
          Origin_Airport, 
          COUNT(*) AS event_count
            FROM flight
            GROUP BY DAY(Flight_Date), Origin_Airport) AS sub
    GROUP BY Origin_Airport
    ;

Based on the previously shown count (now integrated as a Subquery) I can now calculate the mean value of the counted observations grouped by Origin_Airport.

4 EXISTS Operator

The EXISTS Operator is used to test for the existence of any record in a Subquery.

For example, if I want to check if the Tai_lNums in this table:

SELECT * 
    FROM flight_add_info
    ;

are also found in my large flight table (of course also within the column ‘Tai_lNum’), I can use the following command:

SELECT DISTINCT Tai_lNum
    FROM flight
    WHERE EXISTS (SELECT Tai_lNum
                    FROM flight_add_info
                    WHERE flight_add_info.Tai_lNum = flight.Tai_lNum)
    ;

Fits perfectly. Two of the deposited numbers can also be found in flight. Now I display how often these numbers appear in flight:

SELECT Tai_lNum, COUNT(*) AS event_count
    FROM flight
    WHERE EXISTS (SELECT Tai_lNum
                    FROM flight_add_info
                    WHERE flight_add_info.Tai_lNum = flight.Tai_lNum)
    GROUP BY Tai_lNum
    ;

This output can also be obtained by using the IN Clause:

SELECT DISTINCT Tai_lNum
    FROM flight
    WHERE Tai_lNum IN (SELECT DISTINCT Tai_lNum
                        FROM flight_add_info)
    ;

The IN is best used where you have a static list to pass like here.

Based on rule optimizer:

  • EXISTS is much faster than IN, when the sub-query results is very large.
  • IN is faster than EXISTS, when the sub-query results is very small.

EXISTS will be faster because once the engine has found a hit, it will quit looking as the condition has proved true.

With IN, it will collect all the results from the sub-query before further processing.

5 ANY and ALL Operator

Do I need more operators if I already have IN? Yes, because there is a very important difference between ANY/ALL and IN:

  • With ANY or ALL you must place an =, <>, <, >, <=, or >= operator before ANY/ALL.
  • But with the IN operator you cannot use =, <>, <, >, <=, or >=.

To get the last shown outputs (using EXISTS and IN) with ANY, I have to write the syntax as follows:

SELECT DISTINCT Tai_lNum
    FROM flight
    WHERE Tai_lNum = ANY (SELECT Tai_lNum
                            FROM flight_add_info
                            WHERE flight_add_info.Tai_lNum = flight.Tai_lNum)
    ;

To understand ANY and ALL even better, let’s consider the following example (dataset customers):

SELECT *
    FROM customers
    ;

Let’s take a look at the people (we only let the age of the people output) whose income is greater than 5,000.

SELECT Age 
    FROM customers 
    WHERE Salary > 5000
    ;

Now we use ANY once and ALL once in the WHERE statement:

SELECT * 
    FROM customers 
    WHERE Age > ANY(SELECT Age 
                        FROM customers 
                        WHERE Salary > 5000)
    ;

All persons whose age is greater than 23 or 27 are output here.

Now with ALL:

SELECT * 
    FROM customers 
    WHERE Age > ALL(SELECT Age 
                        FROM customers 
                        WHERE Salary > 5000)
    ;

Only persons whose age is greater than 27 are output here.

Here again IN, ANY and ALL in a nutshell:

  • IN: Equals to Anything in the List
  • ANY: Compares Value to Each Value Returned by the Sub Query
  • ALL: Compares Value To Every Value Returned by the Sub Query

6 Conclusion

In this post I have shown how to create subqueries and use them with

  • EXISTS
  • IN
  • ANY and
  • ALL