top of page
hand-businesswoman-touching-hand-artificial-intelligence-meaning-technology-connection-go-

Data Cleaning using SQL


From Unsplash.com


What is Data Cleaning? Why does it matter? Is it important? And how to do it? Every data analyst should know the answer to these questions. Data cleaning is preparing the data for analysis. It is the process of ensuring that the data is correct, consistent, and usable, by detecting incorrect and inconsistent data. It helps to eliminate any unrelated data from what we need for analysis. If data is not cleaned, it may lead to false insights and assumptions, which could lead to poorly made decisions. Data cleansing improves data quality and overall productivity.

As a data analyst, if we want to get some insights from the data, it is vital to ensure it is clean enough. If we have a small dataset, then we can easily do it manually. But if the dataset is big with millions of rows, cleaning them manually is not possible. And if we want to combine multiple datasets from different sources, then we need SQL functions to help us. Let us see how to use SQL to clean data. We will look at some of the different SQL functions used to clean and transform data.

The steps to clean the data depend on the research methodology, and if the resulting data is qualitative or quantitative. Below listed are the frequently used steps involved in data cleaning.

1) Remove irrelevant data

In most cases, we might need to retrieve only part of the dataset, or combine data from multiple sources, to prepare the data relevant to our analysis. We can use SQL JOIN clauses to effectively join tables from many sources. We need to understand the relationship between tables and how to use given data to create new tables.

For example, we have a COVID-19 dataset, indicating hospital occupancy on different dates in different countries and I have retrieved only a part of it from a larger sample.


From author: Table coviddata

2) Remove duplicate data

Duplicate data is common in datasets, especially if it is collected from multiple sources. If the dataset has a limited number of records, it is easier to delete duplicate rows. If the dataset is enormous, finding duplicate rows is challenging. There are multiple ways to find and delete duplicate rows in data.

We can use SELECT DISTINCT to display only unique entries and keep only one occurrence of each record. However, the query will be computationally expensive if the dataset is huge.


In the first method, let us use COUNT() function, to find the duplicate rows in the table.



This result shows the duplicate rows in the table. Now, to delete these rows from the table ,let us use the following code.



Here, we have joined the ‘coviddata’ table to itself and checked if the two records have the same value in the ‘Day’ column. If it is the same value, then the duplicated record will be deleted using DELETE function.


As we can see, the query has deleted the records with the lesser IDs and retained the records with later IDs. If we want the first recorded ID, then we can use the ‘>’ sign in the WHERE clause.

In the second method, let us see how to delete duplicate rows using ROW_NUMBER() in a sub-query. ROW_NUMBER() is a SQL ranking window function that assigns a sequential integer to each row within the partition of a result set.


The above query performs the following functions.

  • Uses a subquery in the WHERE clause.

  • Within the subquery, the ROW_NUMBER() function assigns a unique integer value to each row of the partitioned set.

  • The subquery will return the duplicate records except for the first row in the partition.

  • The DELETE FROM query deletes the duplicates returned from the subquery.



3) Fix errors

We can see that some of the values in the column Code are entered wrong. The code for Australia is ‘AUS’, whereas in two of the records, it is entered as ‘AUST’. We can use REPLACE() to replace ’AUST’ with ‘AUS’. The REPLACE() function searches and replace all occurrences with a new one.




Similarly, we can use UPPER() and LOWER() to turn texts to uppercase or lowercase(). TRIM(),LTRIM(), RTRIM() to remove unnecessary spaces at the beginning or end of the string.

4) Datatype Conversion

In most cases, we would be required to do datatype conversion. Numbers, in general, might be entered as text, but we need them to be converted into integers to be processed and used for calculations. We can use CAST() to update different date formats to one specific DateTime format. CAST() can also be used to convert string to a date, string to a Boolean, or any other format. In this example, we have converted the column ‘Day’ with datatype date to SQL format date with a timestamp.


5) Missing Data

How to handle missing data? There are many ways to handle missing data, depending on the type of analysis required and the nature of the data. Not all options are optimal due to the loss of data, but we can try to minimize the negative impact.

The first option would be to remove missing records with missing values. But, this method could cause a loss of information. So, we need to review how the missing data might affect our analysis, and if there are enough records left to make a reliable observation.

The second option would be to replace missing values. For numerical data, we can replace missing values based on their mean or median. But, if we are missing a lot of data, using mean or median could result in a loss of variation in the data.

In our example, the column ‘Entity’ is missing two values. But, the code for the country is available. Hence, we can replace the empty values with ‘England’, since the code for the country is listed as ‘ENG’.


The daily hospital occupancy column also has a couple of null values. In some cases, the null values can be replaced with 0. But, in our case, the daily hospital occupancy shows the number of patients admitted for that day in all the hospitals around the country. So, if the previous day has a count of 426, then the count the next day cannot be 0. It would mean all 426 patients were discharged the same day and there were no more new patients the next day, which is practically an impossible scenario. This would weaken our analysis. Therefore, before imputing the null values with 0, we need to understand the dataset and make the best decision for that circumstance. All the changes can be made to the data using UPDATE or ALTER table functions.

6) Identify Outliers

Outliers are data points that don’t fit in with other data. So, what to do with an outlier? If the outlier looks like improper data was entered, then it might help to remove that data. But, in the other case, if the outliers are valid, then we need to consider if they are relevant to our analysis. If they are, then they can be a meaningful part of data, which is very much needed for analysis. We can remove the outliers if they don’t affect our analysis in any way.

7) Standardize the data

Data Standardization is the process of transforming the data taken from multiple sources into a consistent format that adheres to the standards. It involves converting data to a common format to enable us to process and analyze it. Standardizing the data will improve access to the most relevant and current information. Each of the data in the data source will have the same format and labels. This will make the analytics and reporting easier.

8) Validating the data

Data validation is the method of checking the accuracy and quality of data. After going through the above steps, we need to a final quality check before starting data analysis. This is to verify if we missed any corrections and double-check if the data is formatted correctly. Validating data can prevent incorrect analysis based on low-quality data.


These are some of the basic guidelines to follow during the data cleaning process. There is no single method to clean the data. Cleaning the data involves understanding the data and identifying where the inconsistencies are coming from, which makes the process easier. In short, data cleaning is important because it provides more accurate insights and predictions, higher revenues, better employee productivity, lower costs, and more satisfied customers.





918 views0 comments

Comments

Rated 0 out of 5 stars.
No ratings yet

Add a rating
bottom of page