7 min read

Data Aggregation

1 Introduction

Now that we have got to know the best-known methods from the field of Data Wrangling, we come to the possibilities of Data Aggregation.

For this post I used the dataset flight from the statistics platform “Kaggle”. You can download it from my GitHub Repository.

2 Preparation

For the following examples, I set up a new database.

SET LANGUAGE ENGLISH


CREATE DATABASE Aggregation;

USE Aggregation;

I now import the record flight into this database (‘Aggregation’). I have described how to do this here. Tip: Pay attention to the data types assigned to the columns. For example, Dep_Delay is stored as nvchar by default. Change all data types accordingly.

SELECT *
    FROM flight
    ;

Furthermore, I create a table that contains NULL values among other things:

CREATE TABLE example_Table
    (ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
    Cat1 VARCHAR(100),
    Cat2 INT)
    ;

INSERT INTO example_Table (Cat1, Cat2) VALUES ('A', 1)
INSERT INTO example_Table (Cat1, Cat2) VALUES ('A', 2)
INSERT INTO example_Table (Cat1, Cat2) VALUES ('A', NULL)

INSERT INTO example_Table (Cat1, Cat2) VALUES ('B', 1)
INSERT INTO example_Table (Cat1, Cat2) VALUES ('B', 2)
INSERT INTO example_Table (Cat1, Cat2) VALUES ('B', 3)
INSERT INTO example_Table (Cat1, Cat2) VALUES (NULL, 4)
INSERT INTO example_Table (Cat1, Cat2) VALUES ('B', NULL)
INSERT INTO example_Table (Cat1, Cat2) VALUES ('B', NULL)

3 Data Aggregation

In the following, I will present the best-known aggregation techniques.

3.1 Count

For the examples with Count, let’s first look at the example table that was created:

SELECT *
    FROM example_Table
    ;

As we know from the post Data Wrangling from chapter Count Rows, with the following SELECT COUNT command we have the possibility to determine how many rows the data set contains in total. It does not matter whether some columns contain NULL values or not.

SELECT COUNT(*) AS num_observations
    FROM example_Table
    ;

The method of counting the primary key is just as safe, since it must not contain any NULL values.

SELECT COUNT(ID) AS num_observations
    FROM example_Table
    ;

Now let’s count the observations in column Cat1:

SELECT COUNT(Cat1) AS num_observations_Cat1
    FROM example_Table
    ;

And now the one from Cat2:

SELECT COUNT(Cat2) AS num_observations_Cat2
    FROM example_Table
    ;

The results fit because we have one NULL value in Cat1 and two in Cat2.

I have described here how to specifically filter NULL values. This way you don’t have to do the workaround and compare the number of counted rows of a column with the counted rows of the primary key to determine if and how many NULL values are contained.

3.2 Sum

Next we will deal with the SUM function. As the name suggests, we can get the sum of a numeric column.

In the following I will continue to work with the loaded data set flight. Here, for example, I would like to have the sum of the flight distances travelled output:

SELECT SUM(Distance) AS sum_distance
    FROM flight
    ;

Ok the output doesn’t look very pretty yet. So I’ll have it output in thousands next:

SELECT SUM(Distance)/1000 AS sum_distance_in_thousand
    FROM flight
    ;

Perfect. If I also want to have the number of the underlying observations displayed, I can also use the COUNT command shown above.

SELECT COUNT(*) AS num_departures, 
       SUM(Distance)/1000 AS sum_distance_in_thousand
    FROM flight
    ;

If I do not yet have metrics, such as the planned travel time in my data set, available as a separate column, I can also perform this calculation operation within the SELECT statement without having to modify the original table.

SELECT SUM(Scheduled_Arrival - Scheduled_Departure) AS scheduled_travel_time
    FROM flight
    ;

3.3 Min, Max and Average

Just like the sum of a column, we can also output the minimum, maximum or average of a numerical column.

For now, let’s focus only on the Dep_Delay column, which shows the difference between the planned and the actual departure time.

With MIN we get the highest time an aircraft has taken off before the scheduled departure time.

SELECT MIN(Dep_Delay) AS min_dep_delay
    FROM flight
    ;

MAX is the latest time an aircraft has taken off after the scheduled take-off time.

SELECT MAX(Dep_Delay) AS max_dep_delay
    FROM flight
    ;

With AVG as already suspected, we get the average delay of all departures.

SELECT AVG(Dep_Delay) AS avg_dep_delay
    FROM flight
    ;

3.4 Group By & Order By

Now we come to the probably most used aggregation function: Group By. All of the functions that we have learned about before can be used here and only really develop their added value with Group By.

With the Group By command, we can have values, such as the mean value, output in groups.

3.4.1 Simple Group By with AVG

Let’s take a look at the average delay of departures grouped by originating airports.

SELECT Origin_Airport, 
       AVG(Dep_Delay) AS avg_dep_delay
    FROM flight
    GROUP BY Origin_Airport
    ;

In order for this select statement to work, when using the Group By function, you must ensure that the column after which the grouping takes place is also output as a column (after the SELECT command).

3.4.2 Group and Order By with AVG

After the previously shown output is still a bit confused, I use the ORDER BY command in the following statement to have the departure airports output in alphabetical order.

SELECT Origin_Airport, 
       AVG(Dep_Delay) AS avg_dep_delay
    FROM flight
    GROUP BY Origin_Airport
    ORDER BY Origin_Airport
    ;

3.4.3 Group and Order By with ABS & STDEV

What is also always good to have a measure of position as it is the average value of a measure of dispersion output. Therefore, for the same grouping, I also look at the absolute values behind the groupings and their standard deviation.

SELECT Origin_Airport, 
       COUNT(*) AS num_departures,
       STDEV(Dep_Delay) AS st_dev_dep_delay
    FROM flight
    GROUP BY Origin_Airport
    ORDER BY Origin_Airport
    ;

Now I no longer order by Origin_Airport but by the number of absolute values.

SELECT Origin_Airport, 
       COUNT(*) AS num_departures,
       STDEV(Dep_Delay) AS st_dev_dep_delay
    FROM flight
    GROUP BY Origin_Airport
    ORDER BY num_departures DESC
    ;

3.4.4 Group and Order By with MAX

Let’s take a look at the highest value recorded for each group (Origin_Airport).

SELECT Origin_Airport, 
       max(Dep_Delay) AS max_dep_delay
    FROM flight
    GROUP BY Origin_Airport
    ORDER BY max_dep_delay DESC
    ;

3.4.5 Multiple Group and Order By

Of course, we also have the possibility to use multiple Group By and Order By. Let’s have a look at the absolute numbers of departures grouped by Origin_Airport and DayOfWeek.

SELECT Origin_Airport, DayOfWeek,
       COUNT(*) AS num_departures
    FROM flight
    GROUP BY Origin_Airport, DayOfWeek
    ORDER BY Origin_Airport, DayOfWeek
    ;

3.5 Having

As we can see from the last output by the multiple grouping, quite a few lines are generated quite quickly. Often not all rows are relevant but only those that exceed a certain threshold. We can set this threshold using the HAVING command.

In the following I want to have only the absolute values of the departures grouped by Origin_Airport and DayOfWeek that exceed the number 100. Means in our example if a weekday of a departure airport has less than 100 departures it should not be displayed at all.

SELECT Origin_Airport, DayOfWeek,
       COUNT(*) AS num_departures
    FROM flight
    GROUP BY Origin_Airport, DayOfWeek
    HAVING COUNT(*) > 100
    ORDER BY Origin_Airport, DayOfWeek
    ;

And already we have a clear output of only 21 observations.

3.6 Order By in a specific order

Let’s take a look at this simple Group By and Order By query:

SELECT Origin_Airport, 
       COUNT(*) As total_num_flights
    FROM flight
    WHERE Origin_Airport IN ('LGA', 'PHX', 'SFO')
    GROUP BY Origin_Airport
    ORDER BY Origin_Airport
    ;

As expected, we get the sum of flights divided and ordered by Origin_Airport, which is ordered by alphabet.

However, this is not always desired. For example, the requirement may be that we should not be sorted by alphabet but by the size of the airport. We do not have this information in our dataset and therefore cannot use it. But there is a way, which is admittedly a bit awkward but works, to specify a specific order of the output.

Look at this:

SELECT Origin_Airport, 
       COUNT(*) As total_num_flights
    FROM flight
    WHERE Origin_Airport IN ('LGA', 'PHX', 'SFO')
    GROUP BY Origin_Airport
    ORDER BY CASE Origin_Airport
                WHEN 'PHX' THEN 1
                WHEN 'SFO' THEN 2
                ELSE 3
                END
    ;

4 Conclusion

In this post I showed how to use data aggregation techniques to generate profitable comparisons of values.