All you need to know about Data Cleaning!
What is data cleaning
Data cleaning is the most important step in any data analysis process. It deals with detecting and removing (or correcting) errors, inconsistencies, and inaccuracies in a dataset and usually takes place before your core analysis. The majority of the work goes into detecting and, wherever possible, correcting dirty data: data that is incomplete, inaccurate, irrelevant, corrupt, or incorrectly formatted.
The objective of data cleaning is to improve data quality and to ensure that the data is accurate, complete, consistent, and reliable, so that it can be used for analysis, decision-making, and other applications. It’s typically done by data quality analysts and engineers or other data management professionals.
Before going further let’s first understand what we mean by dirty data:
1.Duplicate data: Duplicate data is the presence of multiple identical or near-identical copies of data within a dataset or across different datasets. Duplicates most often arise during data collection: when you combine datasets from multiple places, scrape data, or receive data from clients or multiple departments, there are plenty of opportunities to create duplicate data.
Duplicate data can slow down processing and analysis, since time and resources are spent handling the same data multiple times. It also increases the storage requirements for a dataset, which can raise storage costs.
A title or a suffix in contact data can cause duplicates as well. Dr. John Doe, Mr. John Doe, and John Doe could be created as separate records that live in different data systems, even though they are the same person. Such name variations easily create duplicate records in a database.
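As a minimal sketch (the contact records and column names below are made up), pandas can strip common title prefixes so near-duplicate names collapse into exact duplicates, which `drop_duplicates` then removes:

```python
import pandas as pd

# Hypothetical contact records: "Dr. John Doe" and "John Doe" are the same person
contacts = pd.DataFrame({
    "name": ["Dr. John Doe", "John Doe", "Jane Smith", "Jane Smith"],
    "city": ["Boston", "Boston", "Austin", "Austin"],
})

# Strip common titles so near-duplicates become exact duplicates
contacts["name"] = contacts["name"].str.replace(
    r"^(Dr|Mr|Mrs|Ms)\.?\s+", "", regex=True
)

deduped = contacts.drop_duplicates()  # keeps the first of each identical row
print(deduped)
```

Real contact data needs a richer matching strategy (suffixes, nicknames, fuzzy matching), but the idea is the same: normalize first, then deduplicate.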
2.Irrelevant observations: Irrelevant observations are observations that do not fit the specific problem you are trying to analyze.
•If you want to analyze the effectiveness of a new drug, data from patients who did not receive that drug would be irrelevant.
•Data collected outside the study period can also be irrelevant to the problem being analyzed. For example, if you were analyzing the effectiveness of a new drug over a six-month period, data collected before or after that period would be irrelevant.
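Filtering out such observations can be a one-liner; the sketch below assumes a pandas DataFrame with a hypothetical `visit_date` column and a six-month study window:

```python
import pandas as pd

# Hypothetical trial records; the study window is the first half of 2023
records = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "visit_date": pd.to_datetime(
        ["2022-11-02", "2023-02-14", "2023-05-30", "2023-09-01"]
    ),
})

start, end = pd.Timestamp("2023-01-01"), pd.Timestamp("2023-06-30")
# Keep only visits that fall inside the study period (inclusive on both ends)
in_study = records[records["visit_date"].between(start, end)]
print(in_study)
```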
3.Structural errors: Structural errors are errors that occur because of the way data is structured or organized, rather than errors in the actual data values. They can be introduced at various stages of data collection, storage, and processing, and can have a significant impact on the accuracy and reliability of data analysis. These errors usually stem from strange naming conventions, typos, or inconsistent capitalization.
•Dates recorded in different formats (e.g., MM/DD/YYYY vs. DD/MM/YYYY) can create confusion when analyzing time-series data.
•“N/A” and “Not Applicable” can both appear in the same column; although entered differently, they have the same meaning and should be analyzed as the same category.
•The titles VP, V.P., and Vice President are entered in different ways but all fall under the same category.
•A phone number can be entered as 1234567890, 123-456-7890, (123)-456-7890, or 1-123-456-7890. All of these are the same, correct number.
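One way to tame such variations, sketched here for the phone-number case with a hypothetical `normalize_phone` helper, is to reduce every entry to its digits and reformat consistently:

```python
import re

# Hypothetical raw entries that all represent the same phone number
raw_numbers = ["1234567890", "123-456-7890", "(123)-456-7890", "1-123-456-7890"]

def normalize_phone(number: str) -> str:
    """Keep digits only, drop a leading US country code, format as 123-456-7890."""
    digits = re.sub(r"\D", "", number)          # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                     # drop the leading "1"
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

normalized = [normalize_phone(n) for n in raw_numbers]
print(normalized)
```

The same normalize-then-compare pattern works for titles ("VP", "V.P.", "Vice President") via a lookup table.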
4.Outliers: In terms of data analytics, outliers are the values present in a dataset that vary greatly from the others — they’re either much larger, or significantly smaller.
You can remove an outlier only if you have a legitimate reason, such as an obvious data-entry error. Removing such outliers will improve the quality of the data you are working with. However, sometimes it is an outlier that proves the very theory you are working on. Remember: just because an outlier exists doesn’t mean it is incorrect. If an outlier proves to be irrelevant to the analysis or is a mistake, consider removing it.
Examples of outliers are:
•A person who is 7 feet tall may be considered an outlier in a dataset of heights for a population.
•In patient data, ages above 100 can be considered outliers.
•In a hospital dataset, consider length of stay: a patient who stayed in the hospital for several months may be an outlier compared to the other patients in the dataset.
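A common (though not the only) way to flag such values is the 1.5×IQR rule; the sketch below uses Python's standard library and made-up length-of-stay figures:

```python
import statistics

# Hypothetical hospital lengths of stay in days; 150 is far from the rest
stays = [2, 3, 1, 4, 2, 5, 3, 150]

q1, _, q3 = statistics.quantiles(stays, n=4)   # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the usual 1.5*IQR fences

# Flag, rather than silently delete: an outlier may be real and important
outliers = [x for x in stays if x < lower or x > upper]
print(outliers)
```

As the text notes, a flagged value should be investigated before it is removed.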
5.Missing Values: When no data is available for certain variables or participants in a dataset, those entries are considered missing values. There can be many reasons for missing values, such as:
•Data entry errors
•In a survey, a participant may choose not to answer certain questions, such as age, salary, or race. In such cases the data will be incomplete and those fields will have missing values.
•Missing values due to data corruption or data transmission errors.
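A small sketch of how missing values can be detected and, assuming a median fill is acceptable for the analysis, imputed with pandas (the survey columns are hypothetical):

```python
import pandas as pd

# Hypothetical survey responses; None marks questions a participant skipped
survey = pd.DataFrame({
    "age": [34, None, 29, 41],
    "salary": [55000, 62000, None, None],
})

print(survey.isna().sum())  # count of missing values per column

# One common (assumed) strategy: fill numeric gaps with each column's median;
# dropping incomplete rows is the other obvious option
filled = survey.fillna(survey.median())
print(filled)
```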
Why data cleaning is important
With the help of data analytics organizations can improve business performance and gain competitive advantages over rivals. Many business operations and decision-making are dependent on data. As a result, clean data is a must for BI and data science teams, business executives, marketing managers, sales reps and operational workers.
As the saying goes, “garbage in, garbage out”: if you use dirty, uncleaned data for analysis, you will get defective results and insights from that data. Any strong structure needs a really strong foundation. Do quality data cleaning before the analysis and you’ll create something strong, long-lasting, and reliable; do it incorrectly (or skip it) and your analysis will soon crumble!
If data isn’t properly cleansed, analytics applications may provide false information, which can result in inaccurate business decisions, misguided strategies, missed opportunities, and operational problems, all of which can be extremely costly in the long run.
That’s why data experts commonly report spending around 60% of their time on data cleaning.
Uncleaned data used for analysis can produce inaccurate or misleading results:
•Using uncleaned data can distort statistical analysis. For example, calculating the mean or standard deviation on dirty data can produce results that do not reflect the true distribution of the data.
•Using uncleaned data from a dataset with missing values can lead to biased predictions based on incomplete information.
•If a scatter plot is created from uncleaned data containing outliers, the result can be a misleading visualization that does not truly represent the relationship between the variables.
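The first point can be made concrete with a tiny sketch: a single data-entry error (the values and the plausibility cutoff below are made up) drags the mean far from the true center of the data:

```python
import statistics

# Hypothetical ages with one data-entry error (999 instead of, say, 39)
ages = [34, 29, 41, 37, 999]

print(statistics.mean(ages))            # dominated by the bad value

cleaned = [a for a in ages if a < 120]  # assumed plausibility cutoff for age
print(statistics.mean(cleaned))         # much closer to the true picture
```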
Benefits of data cleaning
•Improved Data Quality: Data cleaning enhances the quality of data and makes it more consistent, which leads to better decision-making and allows analytics applications to produce better results. That enables organizations to make more informed decisions about business strategies and operations.
•Improved Data Analysis: When data is clean and consistent, it becomes easier to analyze, identify trends and patterns, and gain insights.
•Increased use of data: Data has become a crucial corporate asset, but unless it is used it can’t generate business value. By making data more trustworthy, data cleansing helps convince business people to rely on it as part of their jobs.
Challenges faced during data cleaning
•Data volume: The volume of data can be a substantial challenge for data cleaning. As the size of the data increases, the time and effort required to clean that data also increases.
•Data complexity: Data can be difficult to understand, with many different variables, formats, and structures. Because of that it becomes difficult to identify errors and inconsistencies in the data.
•Dirty Data: Poor data quality, such as missing values or incomplete data, can make data cleaning more challenging. In some cases, data needs to be corrected manually or imputed, which can be time-consuming.
•Human error: Data cleaning is often performed manually, which can be prone to human error. Data cleaning can also be a tedious and time-consuming job, which can lead to mistakes or omissions.
•Data privacy: Data cleaning can lead to privacy concerns, particularly when dealing with sensitive data. Organizations need to ensure that data is cleaned in such a way that privacy and security of individuals are maintained.
•Data integration: When integrating data from multiple sources, data cleaning can be particularly challenging. Standardizing such data requires additional efforts as data from different sources may be structured differently or have different formats.
To overcome these challenges, organizations need to develop robust data cleaning processes and invest in tools and technologies that can automate and streamline data cleaning. This can help to reduce errors and inconsistencies, improve data quality, and enable more accurate analysis and decision-making.
How is data cleaning done
•Data cleaning goals: The first step in data cleaning is to define the goals of the cleaning process. This includes identifying the types of errors or inconsistencies that need to be corrected, such as missing data, incorrect values, or formatting issues.
•Profiling data: The next step is to profile the data: analyzing it to gain insight into its structure, content, quality, and relationships. The goal of data profiling is to examine the data for patterns, missing values, and outliers, and to identify quality issues that may affect the accuracy and reliability of analytical results.
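In pandas, a first profiling pass might look like the sketch below (the dataset is made up); a few one-liners already expose missing values, inconsistent spellings, and suspect extremes:

```python
import pandas as pd

# Hypothetical dataset to profile before cleaning
df = pd.DataFrame({
    "state": ["NY", "ny", "California", None, "TX"],
    "amount": [120.0, 85.5, 99.0, 10000.0, 76.2],
})

print(df.dtypes)                   # data types per column
print(df.isna().sum())             # missing values per column
print(df["state"].value_counts())  # inconsistent spellings show up here
print(df["amount"].describe())     # min/max/quartiles reveal suspect extremes
```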
•Standardizing data: Once potential errors have been identified, the next step is to standardize the data to ensure its consistency. This involves correcting formatting issues, such as date or time formats, and standardizing data values, such as converting all states to their two-letter abbreviations.
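A sketch of both kinds of standardization, using hypothetical columns: mixed date strings are parsed into one representation, and state spellings are mapped to two-letter abbreviations:

```python
import pandas as pd

# Hypothetical records with mixed date formats and state spellings
df = pd.DataFrame({
    "signup": ["2023-01-05", "01/09/2023", "Jan 12, 2023"],
    "state": ["New York", "NY", "ny"],
})

# Parse each date individually so mixed formats are handled;
# with real data, prefer an explicit format per source
df["signup"] = df["signup"].apply(pd.to_datetime)

# Map every spelling to a two-letter abbreviation via a lookup table
state_map = {"new york": "NY", "ny": "NY"}
df["state"] = df["state"].str.lower().map(state_map)
print(df)
```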
•Data validation: Data validation is a crucial step carried out after data profiling. Once data has been profiled, it is important to validate it against certain quality standards and business requirements:
1. Completeness: confirm that all expected data is present and accounted for, with no missing values or nulls.
2. Consistency: ensure that the data is consistent across different sources and systems, with no contradictions or anomalies, and that it is formatted according to predefined standards and rules. Data can be compared with external sources or previous versions to confirm this.
3. Accuracy: verify that the data is correct and error-free and that it accurately reflects the real world. This includes identifying duplicates and cross-referencing data with external sources.
By validating data after profiling, organizations can ensure the accuracy and usefulness of their analytical results and reduce the risk of errors and inconsistencies, leading to better decision-making.
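Validation rules like these can be expressed as simple boolean checks; the sketch below uses a hypothetical orders table and a reference list of valid states:

```python
import pandas as pd

# Hypothetical cleaned order data to validate against simple rules
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "quantity": [2, 5, 1],
    "state": ["NY", "TX", "CA"],
})

valid_states = {"NY", "TX", "CA", "FL"}  # assumed reference data

# Each check is True only if every row satisfies the rule
checks = {
    "no_missing_ids": orders["order_id"].notna().all(),
    "ids_unique": orders["order_id"].is_unique,
    "quantity_positive": (orders["quantity"] > 0).all(),
    "state_in_reference": orders["state"].isin(valid_states).all(),
}
print(checks)
```

In practice, failing checks would be logged or routed back into the cleaning loop rather than just printed.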
•Data enrichment: Data enrichment enhances the quality and usefulness of data by adding information to it, for example appending demographic data or joining in other external data sources to improve analysis.
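A minimal enrichment sketch, assuming a hypothetical external demographics table keyed by ZIP code that can be joined onto customer records:

```python
import pandas as pd

# Hypothetical customer records to enrich with external demographic data
customers = pd.DataFrame({"customer_id": [1, 2], "zip": ["10001", "73301"]})
demographics = pd.DataFrame({
    "zip": ["10001", "73301"],
    "median_income": [72000, 58000],
})

# A left join keeps every customer even when no demographic match exists
enriched = customers.merge(demographics, on="zip", how="left")
print(enriched)
```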
•Data documentation: Documentation is a crucial step in data cleaning, as it provides a record of the cleaning process and any changes made to the data. This includes documenting the cleaning steps taken, any assumptions made during the process, and any validation or cross-referencing performed, so that the data is accurately and consistently understood and used by those who need to work with it.
We can consider data cleaning as an iterative process that involves quite a few rounds of data profiling, standardization, validation, transformation, enrichment, and documentation. By following a structured approach to data cleaning, organizations can ensure that their data is accurate, complete, and consistent, enabling better analysis and decision-making.
Data Cleaning Basics by OpenRefine — This online tutorial provides a basic introduction to data cleaning using OpenRefine, a free and open-source tool for working with messy data.
The Importance of Data Cleaning by Melissa Harris (2020) — This article provides an overview of the importance of data cleaning and some common issues that can arise when working with messy data.
Data Cleaning and Preprocessing in R by Rafael Irizarry (2015) — This tutorial provides an introduction to cleaning data using the R programming language and some popular data cleaning libraries.