Exploratory Data Analysis (EDA) is a technique used to understand a dataset before we start to model it; some people refer to it as data exploration. In other words, EDA is the process of summarizing the important characteristics of the data in order to gain a better understanding of the dataset.
EDA is performed to check that the data is clean and free of redundancies and missing or null values. We should also identify the most important variables in the dataset and remove unnecessary noise that may reduce accuracy when building the model.
Before we begin exploratory data analysis, let’s understand a few key terms:
Value: A data value is a single piece of information, such as a number or a date.
Variable: A data variable is an attribute which can be measured, such as weight or income.
Distribution: The distribution of a dataset describes how its values are spread out, that is, which values occur and how often.
Outlier: An outlier is a data value that is significantly different from the rest of the dataset; it can be much higher or much lower.
Data model: A data model is a method of organizing data and building relationships between values in a dataset.
How to conduct Exploratory Data Analysis?
The steps involved in EDA are:
1. Understand the Data
The first step in conducting EDA is to understand the dataset at a high level. Start by determining the size of the dataset, that is, the number of rows and columns.
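As a minimal sketch of this first step, the snippet below counts rows and columns using only Python's standard library. The CSV text and its column names are made up for illustration and are not the real dataset; in practice we would read from a file instead.

```python
import csv
import io

# Made-up sample rows for illustration only (not the real dataset).
SAMPLE_CSV = """user_id,subscription_type,monthly_revenue,country,age
1,Basic,10,USA,28
2,Premium,15,Canada,35
3,Standard,12,Spain,42
"""

def dataset_shape(csv_text):
    """Return (row_count, column_count) for CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                  # first line holds the column names
    rows = [row for row in reader if row]  # ignore blank lines
    return len(rows), len(header)

n_rows, n_cols = dataset_shape(SAMPLE_CSV)
print(f"{n_rows} rows x {n_cols} columns")
```

For a file on disk, the same function could be fed the text returned by reading the file.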
2. Clean the Data
Data cleaning refers to removing redundancies from the data and replacing missing values. Redundancy includes irregularities in the data, such as variables or columns that are not needed for our conclusions or interpretations; these columns can simply be removed. Missing values can be replaced with estimates once the cause of the missing data has been identified.
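The two cleaning actions described above can be sketched as follows. The records, the "notes" column, and the choice of the median as the estimate are all assumptions made for illustration; the right estimate depends on why the values are missing.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"user_id": 1, "age": 28, "notes": "n/a"},
    {"user_id": 2, "age": None, "notes": "n/a"},
    {"user_id": 3, "age": 42, "notes": "n/a"},
    {"user_id": 4, "age": 35, "notes": "n/a"},
]

def drop_columns(rows, unwanted):
    """Return copies of the rows without the unwanted columns."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]

def fill_missing_with_median(rows, column):
    """Replace None in `column` with the median of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

# Drop the unneeded column first, then fill the gap in "age".
cleaned = fill_missing_with_median(drop_columns(records, {"notes"}), "age")
print(cleaned)
```

Both helpers return new lists, so the original records stay available for comparison.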
3. Categorize the data
After handling missing values, categorize the variables to determine which statistical methods and charts can be used. Variables are divided into 3 categories.
a) Categorical: These variables take one of a limited set of values, usually labels.
b) Continuous: These variables can take any numeric value within a range, so the set of possible values is infinite.
c) Discrete: These variables take countable numeric values, such as whole numbers.
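A rough way to sort columns into these three categories is sketched below. The rule used here (text means categorical, whole numbers mean discrete, anything else continuous) is a simplifying assumption, and the sample values are made up; real columns often need a manual check.

```python
def categorize(values):
    """Classify a column as 'categorical', 'discrete', or 'continuous'.

    Assumes a non-empty column with no missing values.
    """
    if any(isinstance(v, str) for v in values):
        return "categorical"
    if all(float(v).is_integer() for v in values):
        return "discrete"
    return "continuous"

# Made-up sample values for three of the fields.
columns = {
    "subscription_type": ["Basic", "Premium", "Standard"],
    "age": [28, 35, 42],
    "monthly_revenue": [10.99, 15.49, 12.0],
}

kinds = {name: categorize(vals) for name, vals in columns.items()}
print(kinds)
```

Note that this heuristic would label a numeric code such as a user ID as discrete, even though it behaves more like a label.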
4. Determine the form of the dataset
Analyzing the data distribution, identifying skewness, and locating any gaps will help detect the patterns and trends that characterize the form of the dataset.
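Skewness can also be put into a number. The sketch below uses the population (Fisher-Pearson) coefficient, which is one common definition among several; the sample lists are made up. A value near zero suggests a symmetric distribution, and a positive value suggests a longer tail on the right.

```python
def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5.

    Assumes at least two values that are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))     # zero for perfectly symmetric data
print(skewness(right_skewed))  # positive: long tail on the right
```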
5. Identify relationships in the dataset
Correlations between variables can be used to gain significant insights. Scatter plots are excellent for displaying and understanding these interactions.
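Alongside a scatter plot, a correlation coefficient gives a single number for the strength of a linear relationship. The sketch below computes the Pearson coefficient by hand; the choice of Pearson is an assumption (the text does not name a specific measure) and the inputs are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)  # fails if either variable is constant

increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear, rising
decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, falling
print(increasing, decreasing)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none; correlation alone does not show that one variable causes the other.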
6. Detect Outliers
Detecting outliers in the dataset is another crucial step in EDA. Outliers are significantly different from the rest of the values and can lie on the higher or lower side. They can be located by observing graphs or by sorting the data in numerical order.
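Besides inspecting graphs or sorted values, a simple numeric rule can flag candidates. The sketch below uses the interquartile range (IQR) rule, which is a swapped-in technique not mentioned above; the factor 1.5 is a common convention and the ages are made up.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [27, 29, 31, 33, 35, 36, 38, 41, 95]
print(iqr_outliers(ages))
```

A flagged value is only a candidate; whether it is an error or a genuine extreme still needs a human decision.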
Types of exploratory data analysis
There are four primary types of EDA:
Univariate non-graphical: This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
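As a small illustration of univariate non-graphical analysis, the sketch below describes a single variable with a few summary statistics from Python's standard library. The ages are made up, and the choice of statistics is an assumption.

```python
from statistics import mean, median, mode, stdev

def describe(values):
    """Summarize one numeric variable with basic statistics."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),    # most frequent value
        "stdev": stdev(values),  # sample standard deviation
    }

ages = [28, 35, 42, 35, 30]
summary = describe(ages)
print(summary)
```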
Univariate graphical: The non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of univariate graphics include:
Stem-and-leaf plots, which show all data values and the shape of the distribution.
Histograms, which are bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
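To make the idea behind a histogram concrete, the sketch below groups values into fixed-width bins and prints a text version of the bars. The bin width, starting point, and ages are assumptions chosen for illustration; a plotting tool would normally draw the chart.

```python
BIN_WIDTH = 10
BIN_START = 20

def histogram_counts(values, bin_width, start=0):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = {}
    for v in values:
        lower = start + ((v - start) // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))  # empty bins are simply absent

ages = [27, 29, 31, 33, 35, 36, 38, 41, 45, 52]
counts = histogram_counts(ages, BIN_WIDTH, BIN_START)

# One row per bin, one '#' per value in the bin (labels assume integer ages).
for lower, count in counts.items():
    print(f"{lower}-{lower + BIN_WIDTH - 1}: {'#' * count}")
```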
Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing a level of the other variable.
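A cross-tabulation can be sketched in a few lines. The (country, device) pairs below are made up for illustration and do not come from the real dataset.

```python
from collections import Counter

def crosstab(pairs):
    """Build a nested count table: table[row_value][column_value] -> count."""
    table = {}
    for row_value, col_value in pairs:
        table.setdefault(row_value, Counter())[col_value] += 1
    return table

# Made-up (country, device) observations.
observations = [
    ("USA", "Laptop"), ("USA", "Laptop"), ("USA", "Tablet"),
    ("Spain", "Smart TV"), ("Spain", "Smart TV"),
    ("Canada", "Tablet"),
]

table = crosstab(observations)
for country, device_counts in table.items():
    print(country, dict(device_counts))
```

Because each row is a Counter, looking up a combination that never occurs returns 0 instead of raising an error.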
Other common types of multivariate graphics include:
Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how one variable changes with another.
Run chart, which is a line graph of data plotted over time.
Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Heat map, which is a graphical representation of data where values are depicted by color.
Let’s perform Exploratory Data Analysis on the Netflix userbase dataset.
The dataset contains various fields, including user ID, subscription type, monthly revenue, joining date, last payment date, country, age, gender, device, and plan duration.
To get started, we import the data from a CSV file using the "Get data" feature in Power BI. Once loaded, all fields will be available for visualization.
We will begin our analysis by creating various visualizations:
The “Basic Statistical Analysis” dashboard uses a column chart to display the count and distinct count of records for each primary variable: Age, Gender, Country, Revenue, and Subscription type. Numeric values are displayed as average, minimum, and maximum.
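Outside Power BI, the same basic statistics could be sketched in Python. The helper name and the sample values below are assumptions for illustration and do not reproduce the dashboard's numbers.

```python
def column_profile(values):
    """Count and distinct count for any column; average/min/max if numeric."""
    profile = {"count": len(values), "distinct": len(set(values))}
    if all(isinstance(v, (int, float)) for v in values):
        profile.update(
            average=sum(values) / len(values),
            minimum=min(values),
            maximum=max(values),
        )
    return profile

# Made-up sample values.
country_profile = column_profile(["USA", "Spain", "USA", "Canada"])
age_profile = column_profile([28, 35, 42, 35])
print(country_profile)
print(age_profile)
```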
In our next dashboard, “Fields Analysis”, we will perform exploratory analysis on the main fields or columns in our dataset using histogram, column, and donut charts.
First, we need to import a histogram visual using the import visual feature. Once the import succeeds, add the visual to the dashboard. The dashboard includes:
· A histogram displaying the distribution of customer ages.
· A column chart displaying the number of customers by country.
· A donut chart showing the number of customers by device and gender.
· A histogram analyzing the number of days since joining and last payment.
· A histogram analyzing monthly revenue.
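The day counts behind the histogram of joining and last payment dates can be derived from the two date fields. The sketch below shows one interpretation, the number of days between the join date and the last payment date; the dates are made up, and parsing the dates from text is left out because the file's date format is not stated here.

```python
from datetime import date

def days_between(join_date, last_payment_date):
    """Number of days from the join date to the last payment date."""
    return (last_payment_date - join_date).days

# Made-up (join date, last payment date) pairs.
users = [
    (date(2023, 1, 1), date(2023, 1, 31)),
    (date(2023, 3, 1), date(2023, 3, 11)),
    (date(2022, 12, 25), date(2023, 1, 4)),
]

durations = [days_between(joined, paid) for joined, paid in users]
print(durations)
```

The resulting list of day counts is what a histogram would then group into bins.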
The next dashboard, named “Country Analysis”, shows customers by country.
The key insight from this analysis is that Netflix usage is highest among customers in the USA and France, followed by Canada.
To analyze devices by country, let’s create a dashboard named “Devices by Country”.
In this dashboard, each device type and its count are analyzed across all countries.
The analysis shows that Spain ranks highest for Smart TV, the USA for Laptop, and Canada for Tablet.
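A ranking like this boils down to counting (country, device) pairs and picking the country with the largest count for each device. The sketch below shows the idea in Python; the observations are made up and do not reproduce the dashboard's numbers.

```python
from collections import Counter

def top_country_per_device(pairs):
    """Map each device to the country with the most users of that device."""
    counts = Counter(pairs)  # (country, device) -> number of users
    best = {}                # device -> (country, count)
    for (country, device), n in counts.items():
        # On a tie, the country seen first keeps the top spot.
        if device not in best or n > best[device][1]:
            best[device] = (country, n)
    return {device: country for device, (country, _) in best.items()}

# Made-up (country, device) observations.
observations = [
    ("Spain", "Smart TV"), ("Spain", "Smart TV"), ("USA", "Smart TV"),
    ("USA", "Laptop"), ("USA", "Laptop"), ("Canada", "Laptop"),
    ("Canada", "Tablet"),
]

leaders = top_country_per_device(observations)
print(leaders)
```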
Our next dashboard is “Gender by Country”, which shows the number of users of each gender in every country.
The analysis shows that Spain has more female than male Netflix users, the USA has equal proportions of male and female users, and Canada has more male users.
In the next dashboard, we look at “Subscription Type by Country”.
The analysis shows that Spain ranks highest for Premium subscriptions; Italy, Germany, the USA, Brazil, and Canada rank highest for Basic subscriptions; and Mexico and the United Kingdom rank highest for Standard subscriptions.
The final dashboard covers “Subscription Type by Gender Analysis”.
Our analysis shows that female users rank highest for Basic subscriptions and Laptop usage.
Tools for Exploratory Data Analysis
Some of the most common data science tools for performing EDA are Python, R, Power BI, and Excel.
Why is Exploratory Data Analysis important in Data Science?
The main purpose of EDA is to look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalies, and find interesting relations among the variables.
We can use exploratory analysis to ensure the results we get are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions, and it can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are obtained, its findings can be used for further data analysis.
Conclusion
By identifying and removing redundancies and missing or null values, EDA provides crucial insights that enable data professionals to make informed decisions and drive project success.