Data Cleaning Techniques
What is Data Cleaning?
Data cleaning, also called data cleansing or data scrubbing, is one of the most important steps in the data management process, and the first step after data collection. Real-world data is almost always messy. As a data analyst or data scientist, you need to make sure the data you are working with is tidy before you derive business insights from it. Cleaning can be done manually with tools, or automated by running the data through a computer program.
Why Do We Do Data Cleaning?
Unclean data usually comes from human error. In most cases, the data we collect from research contains problems such as duplicates and inconsistent formats. As data analysts, it is our job to clean the data and transform it into a form that is ready to produce useful business insights. Data transformation typically involves converting a raw data source into a cleansed, validated, ready-to-use format.
In this blog I will walk you through some basic data cleaning techniques.
Data Cleaning Techniques:
Remove duplicate data
Remove irrelevant data
Convert data types
Handle missing data
Filter unwanted outliers
Validate and QA
Remove Duplicate Data:
When you collect data from different sources, it is likely to contain duplicates. These duplicates often arise from human error when a person enters or fills out the data. Duplicates can distort your data and may lead to false decision making, so it is important to remove them before you proceed to further analysis.
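As a minimal sketch of deduplication, assuming pandas and a small hypothetical table of survey responses:

```python
import pandas as pd

# Hypothetical responses with one exact duplicate row (Alice appears twice).
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Alice", "Carol"],
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
})

# Keep the first occurrence of each duplicated row and drop the rest.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
print(len(deduped))  # 3 rows remain
```

`drop_duplicates` compares whole rows by default; pass `subset=` to deduplicate on specific key columns instead.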
Remove irrelevant Data:
Irrelevant data will slow down and confuse any analysis you want to do, so identify what is necessary and what is not before you start cleaning.
For example: you are calculating the number of sales made by a particular manager.
The data you receive contains Product_name, Manager_name, Email, phone_number, and Total_Sales.
Here you can remove the Email and phone_number of the manager if they are irrelevant to your question.
To remove irrelevant data while cleaning for effective business insights, we must understand both the data and the problem statement.
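The sales example above can be sketched in pandas (the rows here are invented for illustration; the column names follow the example):

```python
import pandas as pd

# Hypothetical sales records including contact details that are
# irrelevant to a sales-per-manager analysis.
sales = pd.DataFrame({
    "Product_name": ["Laptop", "Phone"],
    "Manager_name": ["Priya", "Priya"],
    "Email":        ["p@shop.com", "p@shop.com"],
    "phone_number": ["555-0100", "555-0100"],
    "Total_Sales":  [1200, 800],
})

# Drop the columns we judged irrelevant for this question.
relevant = sales.drop(columns=["Email", "phone_number"])

# Aggregate sales per manager on the cleaned frame.
per_manager = relevant.groupby("Manager_name")["Total_Sales"].sum()
print(per_manager["Priya"])  # 2000
```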
Convert Data Types:
The most common data types we find in data are text, numeric, and date. A text column can accept any mix of values, including letters, digits, and special characters. A person's name, a product name, a store location, an email ID, and a password are all examples of text data.
Numeric data types are int (whole numbers) and float (decimals), on which we can perform aggregate functions such as minimum, maximum, average, median, and count. Storing a numeric column as a string or text type prevents these calculations, so we need to convert the data to a numeric format before we analyze it.
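A minimal sketch of this conversion, assuming pandas and a Total_Sales column that arrived as text:

```python
import pandas as pd

# Total_Sales came in as strings; aggregates like mean() need numbers.
df = pd.DataFrame({"Total_Sales": ["1200", "800", "not available"]})

# errors="coerce" turns unparseable entries into NaN instead of failing,
# so one bad cell does not block the whole conversion.
df["Total_Sales"] = pd.to_numeric(df["Total_Sales"], errors="coerce")
print(df["Total_Sales"].mean())  # 1000.0 — NaN is skipped by default
```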
If a date column's type is identified incorrectly, it ends up being treated as a string or text column. In such cases, we need to explicitly define the data type of the column and the date format used in the data. The same date can be represented in different formats, for example:
July 05, 2017
2017-07-05
05/07/2017
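As a sketch of explicitly defining the format, assuming pandas: the "July 05, 2017" style above corresponds to the format string `%B %d, %Y`.

```python
import pandas as pd

# A date column that was read in as plain text, in "Month DD, YYYY" form.
raw = pd.Series(["July 05, 2017", "August 19, 2017"])

# Explicitly state the format so the column becomes a true datetime type.
dates = pd.to_datetime(raw, format="%B %d, %Y")
print(dates.dt.year.tolist())  # [2017, 2017]
```

Once the column is a datetime type, date arithmetic and grouping by month or year become straightforward.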
The data from your research may also contain linguistic discrepancies. Software used to evaluate data typically relies on monolingual natural language processing (NLP) models, which cannot handle more than one language, so you must translate everything into a single language.
Clear formatting plays a key role when cleaning your data. Many machine learning models cannot process data spread across different document formats, which can make your data confusing and error-prone. Google Sheets and Excel both offer simple ways to standardize formatting.
Structural errors occur when you measure or transfer data: you will observe inconsistent naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes.
Spelling mistakes or extra punctuation in data such as an email address could mean you miss out on communicating with your customers. It could also lead to sending unwanted emails to people who didn't sign up for them.
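Capitalization and whitespace inconsistencies like those above can often be collapsed with simple string normalization. A minimal pandas sketch with made-up category labels:

```python
import pandas as pd

# Hypothetical location labels with inconsistent case and stray spaces:
# three spellings of the same city.
s = pd.Series(["New York", "new york ", "NEW YORK", "Boston"])

# Strip whitespace and lowercase so the variants collapse into one label.
clean = s.str.strip().str.lower()
print(clean.nunique())  # 2 distinct categories remain
```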
Handle Missing Data:
There are two options when it comes to handling missing values.
a. Removing the observations that have missing values.
b. Entering the missing Data.
Removing observations with missing values entirely might discard useful insights from your data. After all, there was a reason you wanted to pull this information in the first place.
You can impute missing values based on other observations; again, there is a risk of losing data integrity, because you may be operating from assumptions rather than actual observations.
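Both options can be sketched in pandas on a small hypothetical column:

```python
import pandas as pd

# A numeric column with one missing entry (hypothetical data).
df = pd.DataFrame({"score": [10.0, None, 30.0]})

# Option (a): remove the observations that have missing values.
dropped = df.dropna()
print(len(dropped))  # 2

# Option (b): impute the missing value, here with the column median —
# an assumption about the data, as the text above cautions.
filled = df.fillna({"score": df["score"].median()})
print(filled["score"].tolist())  # [10.0, 20.0, 30.0]
```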
Filter unwanted Outliers:
You can remove an outlier caused by improper data entry to improve the quality of the data you are working with. In some cases an outlier will not appear to fit the data you are analyzing, but that is not always so: just because an outlier exists doesn't mean it is incorrect.
You need to validate it first. If an outlier proves to be irrelevant to your analysis or is a mistake, consider removing it.
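One common heuristic for flagging outlier candidates (an assumption here, not a method named in this post) is the interquartile-range rule. A minimal pandas sketch with a made-up data-entry error:

```python
import pandas as pd

# Hypothetical measurements with one data-entry error (5000).
s = pd.Series([10, 12, 11, 13, 12, 5000])

# Flag points outside 1.5 * IQR beyond the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
filtered = s[mask]
print(filtered.tolist())  # the 5000 entry is filtered out
```

Note this only flags candidates; per the advice above, inspect each flagged point before deciding it really is a mistake.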
Validate and QA:
At the end of the data cleaning process, you should be able to answer these questions as a part of basic validation:
Does the data make sense?
Does the data follow the appropriate rules for its field?
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help you form your next theory?
If not, is that because of a data quality issue?
Data Cleaning Tools
Microsoft Excel (Popular data cleaning tool)
Programming languages (Python, Ruby, SQL)
Data Visualizations (Data interpreter in Tableau)
Data cleaning is necessary for valid and appropriate decision making. You want the data to be clean before you start your analysis, since unclean data can cause a whole variety of problems and biases in your results. Every data set requires different techniques to clean it. I hope this blog helps you get started with data cleaning for data analysis.