7 min read

Data Wrangling

1 Introduction

Now that we know how to load data sets into SSMS, we start with the topic of Data Wrangling.

For this post I used the dataset flight from the statistics platform “Kaggle”. You can download it from my GitHub Repository.

2 Preparation

For the following examples, I set up a new database.

SET LANGUAGE ENGLISH


CREATE DATABASE Wrangling;

USE Wrangling;

I now import the record flight into this database (‘Wrangling’). I have described how to do this here. Tip: Pay attention to the data types assigned to the columns. For example, Dep_Delay is stored as nvchar by default. Change all data types accordingly.

3 Get an Overview of the Data

First of all, I would like to know which tables are stored in my database.

SELECT *
    FROM SYSOBJECTS
    WHERE xtype = 'U';

Now I would like to get an overview of the flight table.

EXEC sp_help 'flight' 

We get a little more detailed view this way:

EXEC sp_columns 'flight'

We can also display the amount of memory used per table.

EXEC sp_spaceused 'flight'

3.1 Count Rows

SELECT COUNT(*)
    FROM flight
    ;

3.2 Count Columns

SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.Columns 
    WHERE TABLE_NAME = 'flight'
    ;

SELECT Count(*) FROM INFORMATION_SCHEMA.Columns 
    WHERE TABLE_NAME = 'flight'
    ;

3.3 Get Distinct Values

3.3.1 Digression: Difference between Unique vs. Distinct

People often ask me what the difference is between Unique and Distinct. In short, they are two completely different use cases:

  • UNIQUE: always take part in data insertion
  • DISTINCT: always concern on data retrieval

A unique constraint is used when the column itself (or set of columns) must be unique according to the schema details (the data). In other words: If the data in that column (or set of columns) must be unique, a unique constraint is used. This is the case, for example, with membership numbers (IDs that usually serve as primary keys). The database must protect itself from questionable data, this is not something that should be left to the applications themselves, as the first incorrect application that comes along would break the database.

If the data does not need to be unique (e.g. specifying a person’s city), then it can be decided to make it “unique” in a select statement, as needed.

3.3.2 Get distinct values

Now let’s have a look at the DISTINCT command in practice and let us output the departure airports in the Origin_Airport column.

SELECT DISTINCT Origin_Airport  
    FROM flight
    ;

As we can see from the output, a total of 18 different airports are stored in this data set (in the Origin_Airport column).

4 Data Wrangling

4.1 Select specific columns

Certain columns can be selected with the Select Statement.

SELECT DayOfWeek, Origin_Airport, Departure_Time, Arrival_Time
    FROM flight
    ;

Select first 5 rows of specific Columns

SELECT TOP 5 DayOfWeek, Origin_Airport, Departure_Time, Arrival_Time
    FROM flight
    ;

4.2 Output columns in a specific order

4.2.1 Order ascending

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    ORDER BY Dep_Delay
    ;

Here are the three flights that took off soonest (ahead of schedule).

SELECT TOP 3 Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    ORDER BY Dep_Delay
    ;

4.2.2 Order descending

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    ORDER BY Dep_Delay DESC
    ;

Here are the three flights that took off the latest.

SELECT TOP 3 Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    ORDER BY Dep_Delay DESC
    ;

4.2.3 Multiple Orders

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    ORDER BY Origin_Airport, Dep_Delay DESC
    ;

First, Origin_Airport is ordered (ascending [alphabetical] here), then Dep_Delay (descending).

4.3 Conditional Statements (WHERE)

With the help of the WHERE statement we can set certain conditions for the select.

4.3.1 With One Condition

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    WHERE Origin_Airport = 'JFK'
    ;

4.3.2 With Multiple Conditions (AND)

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    WHERE Origin_Airport = 'JFK' AND Dep_Delay < -8
    ;

4.3.3 With Multiple Conditions (OR)

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    WHERE Origin_Airport = 'JFK' OR Dep_Delay > 200
    ;

4.3.4 With a Range

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    WHERE Dep_Delay BETWEEN -7 AND 15
    ;

4.3.5 With Lists

Is in List

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    WHERE Origin_Airport IN ('SFO', 'ORD', 'IAH')
    ;

Is not in List

SELECT Tai_lNum, Origin_Airport, Scheduled_Departure, Departure_Time, Dep_Delay
    FROM flight
    WHERE Origin_Airport NOT IN ('SFO', 'ORD', 'IAH')
    ;

4.3.6 With Text

Starts with L

SELECT Tai_lNum, Origin_Airport
    FROM flight
    WHERE Origin_Airport LIKE 'L%'
    ;

Contains an F in the middle

SELECT Tai_lNum, Origin_Airport
    FROM flight
    WHERE Origin_Airport LIKE '%F%'
    ;

Ends with H

SELECT Tai_lNum, Origin_Airport
    FROM flight
    WHERE Origin_Airport LIKE '%H'
    ;

Does not include an A in the middle

SELECT Tai_lNum, Origin_Airport
    FROM flight
    WHERE Origin_Airport NOT LIKE '%A%'
    ;

4.3.7 Filter for NULL Values

Let’s take the testCSV table for this example. I had already used this here once.

When importing this table, I have set that the Gender column may contain NULL values. Now let’s add an example with NULL values to the existing table.

INSERT INTO testCSV (ID_testCSV, First_Name, Last_Name, Gender) VALUES (4, 'Michael', 'Fuchs', NULL)

SELECT * 
    FROM testCSV
    ;

Now we filter for NULL values:

SELECT * 
    FROM testCSV
    WHERE Gender IS NULL
    ;

4.4 Drop Duplicates

For this example, we create an artificial index in the flight dataset:

ALTER TABLE flight ADD Idx INT IDENTITY(1,1) PRIMARY KEY;

SELECT TOP 10 Idx, Origin_Airport
    FROM flight
    ;

Let’s see how many duplicates there are for the Origin_Airport column:

SELECT Origin_Airport,
       COUNT(*) AS DuplicatesCount
    FROM flight
    GROUP BY Origin_Airport
    ;

I have explained the use of the Group By command in more detail here.

4.4.1 Just keep the first element and drop the following duplicates

SELECT Idx, Origin_Airport
    FROM flight
    WHERE Idx IN (SELECT MIN(Idx)
                        FROM flight
                        GROUP BY Origin_Airport);

4.4.2 Just keep the last element and drop the first/s duplicate/s

SELECT Idx, Origin_Airport
    FROM flight
    WHERE Idx IN (SELECT MAX(Idx)
                        FROM flight
                        GROUP BY Origin_Airport);

4.5 Create new Columns

4.5.1 with arithmetic operations

Because it is so convenient and the departure times are very often the same (duplicate values if you want to call them like that) we will use the syntax shown before again for the following example of how to create new columns based on existing columns.

SELECT TOP 5 Idx, Scheduled_Departure, Scheduled_Arrival
    FROM flight
    WHERE Idx IN (SELECT MIN(Idx)
                        FROM flight
                        GROUP BY Scheduled_Departure);

On the one hand, we see the planned departure time and the planned arrival time. So now we calculate the planned travel time.

SELECT TOP 5 Idx, Scheduled_Departure, Scheduled_Arrival,
       Scheduled_Arrival - Scheduled_Departure AS Scheduled_Travel_Time
    FROM flight
    WHERE Idx IN (SELECT MIN(Idx)
                        FROM flight
                        GROUP BY Scheduled_Departure);

This is only one example of what is possible with arithmetic operations. Of course you can do many more and divide the values or multiply them and so on.

4.5.2 with text

Let’s take a look at this section:

SELECT TOP 5 Tai_lNum, Origin_Airport 
    FROM flight
    ;

Text columns can also be connected with each other.

Method 1:

SELECT TOP 5 Tai_lNum + Origin_Airport AS Full_Name
    FROM flight
    ;

Method 2:

SELECT TOP 5 Tai_lNum + ', ' + Origin_Airport AS Full_Name
    FROM flight
    ;

Tip: In MySQL you would write it like this, but it also works in SQL Server

SELECT TOP 5 CONCAT(Tai_lNum,', ',Origin_Airport) AS Full_Name
    FROM flight
    ;

5 Conclusion

In this post I have covered the following topics:

  • How to get an overview of the data
  • Selecting specific columns
  • Conditional Statements
  • Handling Duplicate Values

These are the basics of data wrangling which can be used in all possible combinations.