15 min read

Data Manipulation

1 Introduction

Next post this time about the topic of data manipulation.

Almost no data set in real life does not need to be modified in some way. Therefore, you should familiarize yourself with the most common methods of manipulating data.

2 Preparation

For the following examples I create a new database and a new table.

SET LANGUAGE ENGLISH


CREATE DATABASE Manipulation;

USE Manipulation;
CREATE TABLE example_Table
    (Cat1 VARCHAR(100),
    Cat2 INT,
    Title VARCHAR(100),
    First_Name VARCHAR(100),
    Middle_Name VARCHAR(15),
    Last_Name VARCHAR(100),
    Age INT,
    LastOrderDate DATE,
    LastWebSearchDate DATE,
    LastInquiryDate DATE,
    LastPaymentDate DATE)
    ;
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('A', 1, 'Mr', 'Alex', NULL, 'Smith', 18, '2021-03-02', NULL, NULL, NULL)
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('A', 2, 'Miss', 'Cindy', 'J.', 'Steel', 34, NULL, '2021-03-22', NULL, NULL)
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('A', NULL, 'Mrs', 'Jessy', 'Kathrin', 'Williams', NULL, NULL, NULL, '2021-02-27', NULL)
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('B', 1, 'Miss', 'Sarah', NULL, 'Pike', 37, NULL, NULL, NULL, '2021-02-15')
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('B', 2, 'Ms', 'Emily', 'C.', 'Taylor', 40, '2021-03-27', NULL, NULL, NULL)
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('B', 3, 'Mrs', 'Clara', NULL, 'Wood', 50, NULL, '2021-01-29', NULL, NULL)
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES (NULL, 4, 'Mr', 'Andrew', 'John', 'Evergreen', NULL, NULL, NULL, '2021-03-05', NULL)
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('B', NULL, 'Ms', 'Emilia', 'Alina', 'Fox', 24, NULL, NULL, NULL, '2021-01-18')
INSERT INTO example_Table (Cat1, Cat2, Title, First_Name, Middle_Name, Last_Name, Age, LastOrderDate, LastWebSearchDate, 
    LastInquiryDate, LastPaymentDate) VALUES ('B', NULL, 'Miss', 'Elisabeth', NULL, 'Field', 37, '2021-04-01', NULL, NULL, NULL)

This is what our data set looks like so far:

SELECT *
    FROM example_Table
    ;

3 Set Index

As you may have noticed earlier, our dataset does not yet have an index. Therefore I would like to present two methods that can be used here:

3.1 Create an Index via Primary Key

On the one hand, we can use a primary key, which every table should have anyway.

ALTER TABLE example_Table ADD ID INT IDENTITY(1,1) PRIMARY KEY;

SELECT ID
    FROM example_Table
    ;

The disadvantage here is if you want to use this only for a temporary index, that you can no longer delete it, which of course also makes sense.

ALTER TABLE example_Table DROP COLUMN ID;

4 Replacing Values

What happens very often is that you want / need to replace certain values with others.

4.1 Replace NULL Values (ISNULL)

Let’s take a look at this part of the data set:

SELECT Cat1, Cat2
    FROM example_Table
    ;

With ISNULL we have the possibility to replace NULL values from columns with other values.

SELECT Cat1, Cat2,
       ISNULL(Cat1, 'Unknown') AS non_null_Cat1,
       ISNULL(Cat2, 999) AS non_null_Cat2
    FROM example_Table
    ;

Hint. in other SQL languages the syntax is different:

  • MySQL = IFNULL()
  • MS Access = IsNull()
  • Oracle = NVL()

ISNULL can also be used nested:

SELECT Cat1, 
       AVG(ISNULL(Cat2, 10)) AS avg_non_null_Cat2, 
       AVG(Cat2) AS avg_with_null_Cat2
    FROM example_Table
    GROUP BY Cat1
    HAVING Cat1 = 'A' OR Cat1 = 'B'
    ;

To track the calculation even better see the modified syntax here:

SELECT Cat1, 
       AVG(ISNULL(Cat2, 10)) AS avg_non_null_Cat2, 
       AVG(Cat2) AS avg_with_null_Cat2,
       Count(ISNULL(Cat2, 10)) AS num_abs_non_null_Cat2,
       Count(Cat2) AS num_abs_with_null_Cat2,
       SUM(ISNULL(Cat2, 10)) AS sum_non_null_Cat2,
       SUM(Cat2) AS sum_with_null_Cat2
    FROM example_Table
    GROUP BY Cat1
    HAVING Cat1 = 'A' OR Cat1 = 'B'
    ;

Let’s take a look at Cat1, B (non NULL): 26/5 = 5.2 = 5 Works fine!

4.2 Replace Values with NULL (NULLIF)

The same game goes around in the other direction. Let’s take a look at this part:

SELECT ID
    FROM example_Table
    ;

Now I want to specifically set some values to NULL

SELECT ID,
       NULLIF(ID, 8) AS null_if_ID
    FROM example_Table
    ;

This is of course only possible because I have preset that the column ID may also contain NULL values (which is normally not the case, especially not if it is the primary key):

Limitation: only one value can be set as NULL in this way.

If we want to set several values to NULL or replace several values with new ones we have to resort to another function: CASE.

4.3 IF Else Statemtents in SQL (CASE)

4.3.1 Case with NULL Values

Let’s give this a try:

SELECT ID,
        CASE 
            WHEN ID = 7 THEN NULL
            WHEN ID = 8 THEN NULL
            WHEN ID = 9 THEN NULL
            ELSE ID
        END AS ID_with_null
    FROM example_Table
    ;

4.3.2 Case with other Values

As already mentioned, you can not only set values to ZERO with NULL, but also pass other values. See the following example:

SELECT ID, Title
    FROM example_Table
    ;

SELECT ID, Title,
        CASE 
            WHEN Title = 'Mr' THEN 'Male'
            WHEN Title = 'Mrs' THEN 'Female'
            WHEN Title = 'Miss' THEN 'Female'
            WHEN Title = 'Ms' THEN 'Female'
            ELSE 'Unknown'
        END AS Gender
    FROM example_Table
    ;

4.3.3 Case with Lists

The problem shown before can be solved even easier:

SELECT ID, Title,
        CASE 
            WHEN Title IN ('Mrs', 'Miss','Ms') THEN 'Female'
            ELSE 'Male'
        END AS Gender
    FROM example_Table
    ;

4.3.4 Case with a Range

Have a look at this part of the dataset:

SELECT First_Name, Age
    FROM example_Table
    WHERE Age IS NOT NULL
    ;

And now the CASE Statement with a range:

SELECT First_Name, Age,
        CASE 
            WHEN Age BETWEEN 0 AND 35 THEN 'young' 
            WHEN Age BETWEEN 36 AND 99 THEN 'old'  
        END AS Age_State
    FROM example_Table
    WHERE Age IS NOT NULL
    ;

4.3.5 Case with Multiple Conditions (AND & OR)

Have a look at this part of the dataset:

SELECT First_Name, Age, Cat1
    FROM example_Table
    WHERE Age IS NOT NULL
    ;

And now the CASE Statement with multiple conditions:

SELECT First_Name, Age, Cat1,
        CASE 
            WHEN Age < 30 AND Cat1 = 'A' THEN 'Young_Professional' 
            WHEN Age < 40 OR Cat1 = 'B' THEN 'Expert'
            ELSE 'Normal_Employee'
        END AS Expertise
    FROM example_Table
    ;

CASE is a very useful function but it also requires a lot of typing. If the situation allows it, you can also use a simple If statement.

4.4 IF Statemtents in SQL (IIF)

The syntax is very simple but of course somewhat limited: IIF(boolean_expression,true_value,false_value) It must be a valid boolean value, or the function will raise an error.

Here is a simple example:

SELECT IIF(500<1000, 'true', 'false') AS _500_is_less_than_1000;

Since we have only one title notation for the man in the example shown above, we can also apply the IIF in this case.

SELECT ID, Title,
       IIF(Title = 'Mr', 'M', 'F') AS Gender
    FROM example_Table
    ;

IIF also works wonderfully with local variables:

DECLARE @BankBalance INT = 150;

SELECT IIF(@BankBalance > 100, 'Rich!', 'Poor');

Even nested is no problem:

DECLARE @bankBalance INT = 120;

SELECT IIF(
    @BankBalance > 100, 
        IIF(@BankBalance > 150, 'Rich!', 'Wealthy'), 
    'Poor'
    );

4.5 COALESCE

COALESCE is a more difficult function but super helpful in some situations that I will show you in a moment:

4.5.1 Replace NULL Values with the columns’average

As a Data Scientist, I don’t like to throw away NULL values, as this would result in the loss of unnecessary information (for example, during model training). A smarter solution is to use other metrics, such as the mean value of the column.

Let’s have a look here:

SELECT First_Name, Age
    FROM example_Table
    ;

If you have enough data, you can use the mean value for the missing lines.

SELECT First_Name, Age,
       COALESCE(Age, AVG(Age) over ()) as new_Age
    FROM example_Table
    ;

4.5.2 Coalesce for String Concatenation

Let’s have a look at this part:

SELECT First_Name, Middle_Name, Last_Name
    FROM example_Table
    ;

As I have already written here, it is very easy to link text variables in SQL.

Let’s see if this works in our case as well…

SELECT First_Name, Middle_Name, Last_Name,
       First_Name + ' ' + Middle_Name + ' ' + Last_Name AS Full_Name
    FROM example_Table
    ;

No, unfortunately not. But COALESCE can help here:

SELECT First_Name, Middle_Name, Last_Name,
       First_Name + ' ' + COALESCE(Middle_Name, '') + ' ' + Last_Name AS Full_Name
    FROM example_Table
    ;

4.5.3 Compare SQL Server COALESCE to CASE

Once you understand COALESCE, you can save yourself a lot of typing again. See the example below:

SELECT First_Name, LastOrderDate, LastWebSearchDate, LastInquiryDate, LastPaymentDate 
    FROM example_Table
    ;

Here I want to generate a column that shows the Last Activity Date. I can do this with COALESCE as well as with CASE.

SELECT First_Name, 
       COALESCE(LastOrderDate, LastWebSearchDate, LastInquiryDate, LastPaymentDate) AS LastActivityDate
    FROM example_Table
    ;

SELECT First_Name,
        CASE 
            WHEN LastOrderDate IS NOT NULL THEN LastOrderDate
            WHEN LastWebSearchDate IS NOT NULL THEN LastWebSearchDate
            WHEN LastInquiryDate IS NOT NULL THEN LastInquiryDate
            WHEN LastPaymentDate IS NOT NULL THEN LastPaymentDate
        END AS LastActivityDate
    FROM example_Table
    ;

However, it is simply much faster with COALESCE.

4.5.4 Compare SQL Server COALESCE to ISNULL

I often hear that COALESCE works just like ISNULL. But this is only partly true and you should be aware of the differences, otherwise you may be in for a nasty surprise.

For this reason I have taken the time to explain the difference here with an own example.

I create a small table for this purpose:

CREATE TABLE COALESCE_Table (Char_Variable VARCHAR(10))

INSERT INTO COALESCE_Table VALUES(NULL)

SELECT * FROM COALESCE_Table

Now I use ISNULL to replace the NULL value with a short text.

SELECT Char_Variable, 
       ISNULL(Char_Variable, 'some_Text') AS filled_Char_Var
    FROM COALESCE_Table
    ;

Works as expected. But what about when I want to insert a longer text?

SELECT Char_Variable, 
       ISNULL(Char_Variable, 'some_very_long_Text') AS filled_Char_Var
    FROM COALESCE_Table
    ;

The text will be truncated. This is also intentional, because I have set a limit of 10 characters for this column when I created the table.

Now let’s see what COALESCE does here:

SELECT Char_Variable,
       COALESCE(Char_Variable, 'some_very_long_Text') AS filled_Char_Var
    FROM COALESCE_Table
    ;

Here we get the complete string returned, regardless of the type of Char_Variable.

That is, while ISNULL guarantees us type safety, COALESCE can only do so to a limited extent. That means, if your application depends on a type safety in the return, then you are always on the safe side with ISNULL.

Another example to make it even clearer (this time with a numeric variable):

ALTER TABLE COAlESCE_Table ADD Num_Variable INT 

SELECT * FROM COALESCE_Table

Here I use both ISNULL and COALESCE in one example so the difference is immediately apparent.

SELECT Num_Variable,
       5 / ISNULL(Num_Variable, 2.00) AS _ISNULL, 
       5 / COALESCE(Num_Variable, 2.00) AS _COALESCE
    FROM COALESCE_Table
    ;

What happens now is that due to the type safety of ISNULL, the 2.00 is returned as the INT value 2, which has the consequence that the 5 is divided by the INT 2. Here the SQL server then returns the INT value, so we get the 2 as return value.

In the case of COALESCE, SQL Server decides to convert to the 2.00 data type, i.e. to perform an implicit conversion. This way we divide 5 by a floating point number, which results in a floating point number, so we get the correct value of 2.5 here (with some decimal places).

5 Rename Columns

Of course, you can assign Aliases with each select statement and then you have renamed the column. But you can also rename them permanently!

Caution here the columns of the table are renamed! Other statements that access the old names would no longer work.

For this example we use the table from the previous example:

SELECT * 
    FROM COALESCE_Table
    ;

EXEC sp_rename 'COALESCE_Table.Char_Variable', 'C_Var', 'COLUMN';
EXEC sp_rename 'COALESCE_Table.Num_Variable', 'N_Var', 'COLUMN';
SELECT * 
    FROM COALESCE_Table
    ;

6 Delete Duplicates

Again, the note that lines are completely lost!

I have shown how to select Duplicate in a Select Statement in this post: Data Wrangling - Drop Duplicates

Now we don’t want to drop the rows but delete them completely. With the following statement SELECT MIN(ID) only the first value will be kept in the table, the others will be deleted (except for the observation with the NULL Value, which we will exclude in this example).

SELECT ID, Cat1 
    FROM example_Table
    WHERE Cat1 IS NOT NULL
    ;

DELETE FROM example_Table 
    WHERE ID NOT IN (SELECT MIN(ID)
                    FROM example_Table 
                    GROUP BY Cat1)
    AND Cat1 IS NOT NULL;

Let’s take a look at what’s left:

SELECT ID, Cat1 
    FROM example_Table
    WHERE Cat1 IS NOT NULL
    ;

And now again the data set without restrictions:

SELECT ID, Cat1 
    FROM example_Table
    ;

7 Data Type Conversion (CAST & CONVERT)

Both CAST and CONVERT are functions used by Data Analysts / Scientists to convert variables of one data type into another data type. One difference between CAST and CONVERT is CAST is an ANSI standard while CONVERT is a specific function in the SQL server. This means that the CAST function can be used by many databases. CONVERT is more powerful and flexible than CAST in this respect and therefore more popular especially for time values.

The syntax of CAST is very simple to write. It contains the value to convert and the type of the resulting data type. It has ‘AS’ as a keyword to separate the datatype from the value. Furthermore there is an option to express the length.

Example: CAST(expression AS datatype(length))

In comparison, the CONVERT syntax first specifies the resulting datatype together with the optional length. There is another optional parameter in CONVERT called style. Style allows formatting of the datatype and specifies how the CONVERT function should format the datatype.

Example: CONVERT(data_type(length), expression, style)

7.1 Create Example Data

CREATE TABLE customers
    (Name VARCHAR(100),
    Debts FLOAT,
    Salary INT,
    Age INT,
    Purchasing_Power FLOAT)
    ;
INSERT INTO customers (Name, Debts, Salary, Age, Purchasing_Power) VALUES ('A', 480000, 2000, 25, 150.84)
INSERT INTO customers (Name, Debts, Salary, Age, Purchasing_Power) VALUES ('B', 225000, 4000, 30, 420.31)
INSERT INTO customers (Name, Debts, Salary, Age, Purchasing_Power) VALUES ('C', 407000, 8500, 42, 85.17)
INSERT INTO customers (Name, Debts, Salary, Age, Purchasing_Power) VALUES ('D', 670000, 6000, 22, 320.78)
INSERT INTO customers (Name, Debts, Salary, Age, Purchasing_Power) VALUES ('E', 160000, 4150, 18, 240.45)
SELECT *
    FROM customers
    ;

7.2 CAST

Let’s say we want to get the salary in thousands:

SELECT Salary / 1000 AS Salary_in_thousand
    FROM customers
    ;

Since the Salary column in the customers table is stored as INT, we also receive only INT values as output.

Therefore, we need to use CAST:

SELECT CAST(Salary AS FLOAT) / 1000 AS Salary_in_thousand
    FROM customers
    ;

Now let’s divide the Debt column by the Salary column to see how long it will take me to pay this off.

SELECT Debts / Salary AS num_years
    FROM customers
    ;

Also this issue is just not advantageous for my purpose. Therefore we use CAST again.

SELECT CAST(Debts AS INT) / CAST(Salary AS INT) AS num_years
    FROM customers
    ;

Do I really need to cast both columns?

No!

SQL would not be SQL if this could not be done more simply.

SELECT CAST(Debts / Salary AS INT) AS num_years
    FROM customers
    ;

7.3 CONVERT

As already described above, CONVERT does this job for us as well, is only slightly different from the syntax notation.

Here again the original data set:

SELECT *
    FROM customers
    ;

In the following I will show the two examples from the previous chapter again, only this time with CONVERT:

SELECT CONVERT(FLOAT, Salary) / 1000 AS Salary_in_thousand
    FROM customers
    ;

SELECT CONVERT(INT, Debts / Salary) AS num_years
    FROM customers
    ;

8 Rounding numerical Values

Now that we have seen how to change the data type of numeric variables, we come to the topic of rounding numbers. For the following examples, we again use the customers table created in Chapter 7.

SELECT *
    FROM customers
    ;

8.1 ROUND

With Round we have the possibility to determine the number of decimal places. In our example, we round the Purchasing_Power variable to one decimal place.

SELECT Purchasing_Power, 
       ROUND(Purchasing_Power, 1) AS rounded_power_1
    FROM customers
    ;

Of course, we also have the option to round to zero decimal places. Here, the function rounds the number down to a value of .49 and up from .50.

SELECT Purchasing_Power, 
       ROUND(Purchasing_Power, 0) AS rounded_power_0
    FROM customers
    ;

8.2 CEILING

However, if we want the number to always be rounded up regardless of the amount of decimal places, then we need the CEILING function.

SELECT Purchasing_Power,
       ROUND(Purchasing_Power, 0) AS rounded_power_0,
       CEILING(Purchasing_Power) AS ceiled_power
    FROM customers
    ;

8.3 FLOOR

If we want to have the numbers always rounded off, we need FLOOR.

SELECT Purchasing_Power,
       ROUND(Purchasing_Power, 0) AS rounded_power_0,
       FLOOR(Purchasing_Power) AS floored_power
    FROM customers
    ;

9 Power

Sometimes it makes sense to bring the existing values into a higher dimension. This can be done in different ways, one of the most common methods is the POWER function.

SELECT Age,
       POWER(Age, 2) AS power_Age,
       CONCAT(Age, '*', Age,'=', POWER(Age, 2)) AS [(Calculation:)]
    FROM customers
    ;

10 Conclusion

That’s it for now.

I have not covered all the available data manipulation functions in this post, but the most commonly used ones. I will discuss String Functions in detail in the following post.