What is Exploratory Analysis?
Exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, using statistical graphics and other data visualization methods.
The primary goal of EDA is to understand the data and its underlying structure, patterns, distributions, and relationships between variables. It is often the first step in the data analysis process and helps guide further analysis and hypothesis generation.
It is often not easy to look at a column of numbers or a whole spreadsheet and determine important characteristics of the data. It may be tedious, boring, and/or overwhelming to derive insights by looking at plain numbers. Exploratory data analysis techniques have been devised as an aid in this situation.
Objective of Exploratory Data Analysis
The overall objective of exploratory data analysis is to obtain vital insights and hence usually includes the following sub-objectives:
Identifying and removing data outliers
Identifying trends in time and space
Uncover patterns related to the target
Creating hypotheses and testing them through experiments
Extract critical insights and make decisions.
Steps Involved in Exploratory Data Analysis (EDA)
Here are some key aspects of Exploratory Data Analysis:
2. Finding all Variables and understanding them
3. Cleaning the Dataset
4. Identify Correlated Variables
5. Choosing the Right Statistical Methods
6. Visualizing and Analyzing Results
Lets delve into each step of EDA in more detail with an example using
A. Python programming language with Jupyter Notebook.
B. Sequential Query Language using MySql workbench.
Exploratory Data Analsys with python programming using Jupyter Notebook:
USA cars dataset is used in our example. In this dataset, we will try to analyze the cars data and how EDA focuses on identifying the factors influencing the cars sales.
Data set is avaliable at https://www.kaggle.com/datasets/olivia05144/usa-cars-dataset
Jupyter Notebbok:
Jupyter Notebook is an open-source web application that allows us to create and share documents containing live code, equations, visualizations, and narrative text. It supports various programming languages, including Python, R. While Jupyter runs code in many programming languages, Python is a requirement for installing the Jupyter Notebook.
1. Data Collection
Data collection is the process of gathering information or data from various sources for the purpose of analysis, research, decision-making.
Data collection is not only about gathering raw data but also involves ensuring the quality, accuracy, and reliability of the data collected. Proper data collection methods are crucial to obtaining meaningful and actionable insights from the data.
2. Finding all Variables and Understanding Them
When the analysis process starts, the first focus is on the available data that gives a lot of information.
It requires first identifying the important variables which affect the outcome and their possible impact.
This step is crucial for the final result expected from any analysis.
lets first Import all libraries which are required for our analysis, such as Data Loading, Statistical analysis, Visualizations etc
Let's read USA cars dataset using read_csv method.
head() -> will display the top 5 observations of the dataset, with which we can look at the columns and its values of the dataset.
tail() -> will display the last 5 observations of the dataset
info() -> shows the variables Mileage, color, Brand, Seats and Price. Numeric variables like Mileage, lot are of datatype as float64 and int64.
3. Cleaning the Dataset
The next step is to clean the data set, so that data contains only those values that are relevant and important from the target point of view. This will not only reduce time but also reduces the computational power from an estimation point of view.
Lets get into cleaning our data.
Check for Duplication:
nunique() -> function is used to count the number of unique values in a Series or DataFrame column. It returns the count of distinct elements present in the column.
duplicated() -> shows if the value is duplicate or not.
We can see, we do not have any duplicates in our dataset.
dop_duplicates() -> this method helps remove duplicates.
Duplicates are dropped if any and we could check from the number of entries using info method.
Handling Null Values :
isnull() -> is widely been in all pre-processing steps to identify null values in the data.
4. Identify Correlated Variables
Finding a correlation between variables helps to know how a particular variable is related to another. This gives a clear picture of how different variables correlate, which further helps in understanding vital relationships among them.
From the charts we can see Year and Mileage variables are neagatively correlated.
5. Choosing the Right Statistical Methods
Depending on the data, categorical or numerical, the size, type of variables, and the purpose of analysis, different statistical tools are employed.
Statistical formulae applied for numerical outputs give fair information.
Graphical visuals are more appealing and easier to understand and interpret.
Statistics summary gives a high-level idea to identify whether the data has any outliers, data entry error, distribution of data such as the data is normally distributed or left/right skewed. In python, this can be achieved using describe()
6. Visualizing and Analyzing Results
Visualization is a core component of EDA, making complex relationships and trends within the dataset easily comprehensible. EDA employs visual techniques to represent the statistics graphically.
Using the right charts can help us identify trends within a big dataset and find hidden patterns or outliers
Visualizations consisting of histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying styles, trends, and relationships within the facts.
Data Visualization Tools
Here are some of the the most popular buisiness intelligent tools,
Tableau: is a powerful and widely used data visualization tool that allows users to create interactive and shareable dashboards and reports. It enables users to connect to various data sources, including databases, spreadsheets, and cloud services, and then create visualizations such as charts, graphs, maps, and tables to explore and analyze the data.
Power BI: is a business analytics tool developed by Microsoft. It allows users to visualize and analyze data from various sources to gain insights into their business processes and make informed decisions.
Looker: is a data analytics and business intelligence platform that provides organizations with tools to explore, analyze, and share insights from their data. Looker is known for its innovative approach to data modeling, exploration, and visualization.
Types of Exploratory Data Analysis
There are two main types of EDA:
Univariate
Multivariate
1. Univariate EDA
It is the simplest of all types of data analysis used in practice. As the name suggests, uni means only one variable is considered whose data is compiled and studied. The significant parameters which are estimated from a distribution point of view are as follows:
· Mean, Median, Mode: Central Tendency: This term refers to values located at the data's central position or middle zone.
· Mean is the average of all values in data, while the mode is the value that occurs the maximum number of times. The Median is the middle value with equal observations to its left and right.
· Minimum, Maximum :Range: The range is the difference between the maximum and minimum value in the data, thus indicating how much the data is away from the central value on the higher and lower side.
· Variance and Standard Deviation: Two more useful parameters are standard deviation and variance. Variance is a measure of dispersion that indicates the spread of all data points in a data set. while standard deviation is the square root value of it. The larger the value of standard deviation, the farther the spread of data, while a low value indicates more values clustering near the mean.
· Charts: Histograms (Bar Charts), Box plots,
3. Multivariate EDA
Multivariate Exploratory Data Analysis (EDA) is an approach to analyzing data that involves examining the relationships between multiple variables simultaneously. Multivariate EDA considers the interactions and dependencies between multiple variables to gain deeper insights into the underlying structure of the data.
Non Graphical Analysis:
Here are some of the Multivariant Non Graphical EDA techniques.
Correlation Analysis: Investigate the pairwise correlations between variables to identify patterns of association. You can use correlation coefficients for linear relationships or for non-linear relationships.
Regression Analysis : Regression analysis can be used to model the relationships between multiple independent variables and a dependent variable. Multivariate regression techniques, such as multiple linear regression or logistic regression, allow you to analyze how changes in multiple variables simultaneously affect the outcome variable.
Cluster Analysis : Cluster analysis is used to identify groups or clusters of similar observations in a dataset based on the values of multiple variables. This can help uncover natural groupings or patterns within the data.
Graphical Analysis:
We can create multivariate plots such as Scatter Plot, Bubble Chart,Heat Map, Tree Map, to visualize relationships between multiple variables simultaneously.
These plots allow us to identify patterns, trends, and outliers across multiple dimensions of the data.
Heatmap shows the correlation between the variables price, year, mileage and lot and their correlation coefficient values.
The correlation coefficient, is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where:
correlation coefficient = 1: Indicates a perfect positive linear relationship, meaning that as one variable increases = -1, the other variable also increases proportionally.
correlation coefficient: Indicates a perfect negative linear relationship, meaning that as one variable increases, the other variable decreases proportionally.
correlation coefficient = 0: Indicates no linear relationship between the variables.
From our results we can see that year and mileage are negatively correlated. Price and year are wekaly positively correlated in our dataset.
Exploratory Data Analysis using MySQL:
MySQL:
MySQL is an open-source relational database management system (RDBMS) that is widely used for managing and manipulating structured data. It is one of the most popular database systems in the world, known for its reliability, performance, ease of use, and extensive features.
I am using MySQL workbench for SQL programming.
MySQL workbench :
MySQL Workbench is an integrated development environment (IDE) and graphical user interface (GUI) tool for MySQL database management and administration. It provides a comprehensive set of features for designing, modeling, developing, and administering MySQL databases, making it a popular choice among database administrators, developers, and data professionals.
We are using the same dataset for EDA using SQL.
Firstly we have to load the dataset and we will start with an understanding of the data. So, we perform the following queries.
Head( ): It is primarily used for quick data inspection and provides a way to get a glimpse of the structure and content of our data without displaying the entire dataset, which can be especially useful for large datasets.
limit -> clause is used to restrict the number of rows returned by a query.
2 . Checking for the size of the dataset:Here we count the rows of all the entries to know the exact number of rows in the dataset.
COUNT() -> Count function is used to count the number of rows.
3. Lets understand the percentage of cars sold model wise in our dataset:
I am using COUNT(), ROUND() , group by , order by.
ROUND() -> Function is used to round a numeric value to a specified number of decimal places.
GROUP BY -> Clause is used to group rows that have the same value into summary rows.
ORDER BY -> This clause is used to sort the result set returned by a select statement based on one or more columns.
From the above results, we can see that most of the cars sold are of ford, dodge, nissan and chevrolet.
4. Statestical Analysis on price :
Query helps us understand minimum, maximum, average and standard deviation of price of the cars.
MIN() -> This function is used to find the smallest value in the set of values.
MAX() -> This function is used to find the largest value in the set of values.
AVG() -> This function is used to find the average value in the set of values.
STDDEV_SAMP() ->This function is used to calculate the sample standard deviation of a set of numeric values.
5. Brand wise price analysis:
Query helps us understand minimum, maximum, average and standard deviation of price for each brand.
6. Year wise Brand and price analysis:
From our results we can see for year 2020 bmw, Chevrolet, chrysler are the cars sold and their average price.
Conclusion:
These are the insighful analysis of our cars dataset using Python programming language and SQL.
Exploratory Data Analysis EDA is an essential step in the data analysis workflow, helping to guide data-driven decision-making and drive actionable insights from the data. Whether through the Python programming language or structured querying of SQL , EDA empowers us to unravel the mysteries hidden within our data, guiding us toward informed decisions and actionable insights.
I hope this blog has provided some valuable insights into EDA and performing EDA using python and sql.
Let me know your feedback! Thank you for reading!
Comments