Exploratory Data Analysis (EDA) is crucial to the success of any data science project. So, let’s try to understand what EDA is all about. It is an approach to analyzing and understanding the various aspects of a dataset. Through EDA, we learn how the features relate to one another, and we draw conclusions or gather insights about the data.
So, what is the purpose of doing EDA on any dataset?
The purpose of performing EDA on any dataset is to check that the data is clean, with no redundancies and no missing or null values. We need to identify the significant features and remove unnecessary noise that could hamper the accuracy of our conclusions when we build the model. Before moving on to more complex processes in the data processing lifecycle, we need a proper interpretation of the dataset.
The following steps make up the whole process of Exploratory Data Analysis:
Step 1: Understand the data
Step 2: Clean the data of irregularities
Step 3: Analyze the relationship between the features
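To make the three steps concrete, here is a minimal sketch using pandas. The DataFrame is hypothetical toy data with made-up column names, not the Student dataset used later, and dropping incomplete rows is just one possible cleaning choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions,
# not columns from the Student dataset.
df = pd.DataFrame({
    "hours_studied": [2.0, 4.5, np.nan, 6.0, 4.5, 8.0],
    "attendance_pct": [60, 75, 80, 90, 75, 95],
    "final_score": [50.0, 65.0, 70.0, 82.0, 65.0, 93.0],
})

# Step 1: understand the data (size, column types, summary statistics)
print(df.shape)
print(df.dtypes)
print(df.describe())

# Step 2: clean the data of irregularities
missing_per_column = df.isnull().sum()         # null count per column
duplicate_rows = int(df.duplicated().sum())    # fully repeated rows
# One simple option: drop duplicates and rows with nulls.
clean = df.drop_duplicates().dropna().reset_index(drop=True)

# Step 3: analyze the relationship between the features
corr = clean.corr()  # pairwise Pearson correlation
print(corr.round(2))
```

Whether to drop or fill missing values depends on the dataset, so treat the cleaning line above as a placeholder for whatever strategy suits your data.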
The image below shows where EDA fits in the process of any data science project:
Python code implementation to understand EDA:
Let’s go through this example to perform EDA on the Student dataset. You can find this dataset at the following Kaggle link.
Please read the inline comments in each code cell to understand the implementation.
Note: In another blog, I will include an example of handling null values and explain how to clean the data.
Note: If you are interested in learning more about Seaborn Heatmap, please check out this link.
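As a rough illustration of how a heatmap supports Step 3, the sketch below plots a correlation matrix with `sns.heatmap`. The data and column names are hypothetical, and the output file name is an arbitrary choice for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "hours_studied": [2.0, 4.5, 5.0, 6.0, 7.5, 8.0],
    "attendance_pct": [60, 75, 80, 90, 85, 95],
    "final_score": [50.0, 65.0, 70.0, 82.0, 88.0, 93.0],
})

# Pairwise Pearson correlation between the numeric features
corr = df.corr()

# annot=True writes each coefficient into its cell; vmin/vmax pin the
# colour scale to the full correlation range [-1, 1].
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Feature correlation (toy data)")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")  # arbitrary file name
plt.close()
```

Cells close to 1 or -1 point to strongly related features, which are worth a closer look before modelling.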
Note: If you are interested in learning more about Seaborn Pairplot, please check out this link.
Note: If you are interested in learning more about Seaborn Relplot, please check out this link.
Note: If you are interested in learning more about Seaborn Distplot, please check out this link.