In the field of Data Science, Exploratory Data Analysis (EDA) is a crucial step in deciphering and analyzing the narrative concealed in data. It’s similar to a detective painstakingly going through the evidence in search of trends, anomalies, and hidden insights. In Python, EDA is the process of analyzing and visualizing datasets in order to highlight key features, find patterns, spot errors and acquire new knowledge. It is an essential phase in any data analysis effort, carried out before creating predictive models or making data-driven decisions.
Key Components of EDA:
1. Descriptive Statistics
2. Data Visualization
3. Correlation Analysis
4. Outlier Detection
Importing Data from Files:
Python’s versatility lets us handle the majority of data sources well. This section explains how to import data from several sources, including files, URLs and databases. The snippets throughout assume pandas has been imported with import pandas as pd.
· To read CSV File: df = pd.read_csv('filename.csv')
· To read Excel file: df = pd.read_excel('filename.xlsx')
· To read from SQL Database: df = pd.read_sql(query, connection)
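The readers above can be tried without any files on disk. The following is a minimal sketch, assuming a hypothetical two-column dataset: io.StringIO stands in for a CSV file, and an in-memory SQLite database (with a made-up people table) stands in for a real SQL connection.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = "name,age\nAsha,31\nBen,27\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database with one small table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (name TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO people VALUES (?, ?)", [("Asha", 31), ("Ben", 27)]
)
connection.commit()

# read_sql runs the query and returns the result set as a DataFrame.
query = "SELECT name, age FROM people WHERE age > 30"
df_sql = pd.read_sql(query, connection)
connection.close()

print(df_csv)
print(df_sql)
```

With a real file you would pass its path (or a URL) to pd.read_csv instead of the StringIO object.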
Basic Data Inspection:
Once you have loaded the data into a data frame, you can use the following methods to get an initial sense of your data.
· Display Top Rows: df.head()
· Display Bottom Rows: df.tail()
· Display Data Types: df.dtypes
· Summary Statistics: df.describe()
· Display Index, Columns, Non-Null Counts and Memory Usage: df.info()
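As a quick illustration of the inspection methods above, here is a small sketch on a made-up sales table (the column names and values are assumptions for the example only).

```python
import pandas as pd

# Hypothetical data used only to demonstrate the inspection calls.
df = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Mumbai", "Pune"],
        "sales": [120, 95, 143, 110],
    }
)

print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, quartiles and max of numeric columns
df.info()             # prints index range, column names, non-null counts, memory usage
```

Note that df.info() prints its report directly, while the other calls return objects that you can print or inspect further.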
Data Cleaning:
Working with several data sources increases the likelihood that the data will be inaccurate, duplicated or mislabeled. If the data is incorrect, results and algorithms are unreliable even though they may appear accurate. Pandas, a powerful Python library, offers a wide variety of tools and functions that simplify the data cleaning process. In this blog, we’ll walk through some methods for cleaning data with pandas.
· Check for Missing Values: df.isnull().sum()
· Fill Missing Values: df.fillna(value)
· Drop Missing Values: df.dropna()
· Rename Columns: df.rename(columns={'old_name': 'new_name'})
· Drop Columns: df.drop(columns=['column_name'])
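The cleaning calls above can be chained on a small example. This is only a sketch: the column names (old_name, notes, unused) and the fill values are hypothetical choices, not a recommendation for real data.

```python
import pandas as pd

# Hypothetical frame with one missing number and one missing string.
df = pd.DataFrame(
    {
        "old_name": [1.0, None, 3.0],
        "notes": ["a", "b", None],
        "unused": [0, 0, 0],
    }
)

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Fill missing values with a per-column default (assumed defaults).
filled = df.fillna({"old_name": 0.0, "notes": "unknown"})

# Alternatively, drop every row that contains a missing value.
dropped = df.dropna()

# Rename one column and drop another.
cleaned = filled.rename(columns={"old_name": "new_name"}).drop(columns=["unused"])

print(missing_per_column)
print(cleaned)
```

Each of these methods returns a new DataFrame and leaves df unchanged, so the result has to be assigned to a name to be kept.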
Data Transformation:
Pandas is an effective Python library for analyzing and manipulating data, and data transformation is one of its core uses. Here we look at some common data transformation functions.
· Apply Function: df['column'].apply(lambda x: function(x))
· Group by and Aggregate: df.groupby('column').agg({'column': 'sum'})
· Pivot Tables: df.pivot_table(index='column1', values='column2', aggfunc='mean')
· Merge Dataframes: pd.merge(df1, df2, on='column')
· Concatenate Dataframes: pd.concat([df1, df2])
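To see how these transformations fit together, here is a minimal sketch on two invented tables, orders and customers. The table names, columns and the 10% tax rate are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical order and customer tables.
orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "amount": [10.0, 20.0, 30.0, 40.0],
    }
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["North", "South", "North"]}
)

# Apply a function to every value in a column (assumed 10% tax).
orders["amount_with_tax"] = orders["amount"].apply(lambda x: round(x * 1.1, 2))

# Group by customer and sum the amounts.
totals = orders.groupby("customer_id").agg({"amount": "sum"})

# Merge is a top-level function (or a method called on the left frame).
merged = pd.merge(orders, customers, on="customer_id")

# Pivot table: mean order amount per region.
region_mean = merged.pivot_table(index="region", values="amount", aggfunc="mean")

# Concatenate is also a top-level function; it stacks frames row-wise by default.
more_orders = pd.DataFrame({"customer_id": [2], "amount": [5.0]})
all_orders = pd.concat(
    [orders[["customer_id", "amount"]], more_orders], ignore_index=True
)

print(totals)
print(region_mean)
print(all_orders)
```

The same merge can be written as orders.merge(customers, on="customer_id"). There is no DataFrame.concat method, so concatenation always goes through pd.concat.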
Data Visualization Integration:
Data visualization is the graphical display of data, and Pandas can draw plots directly from a data frame (it uses Matplotlib under the hood). Visualization helps communicate information clearly and efficiently, and helps people grasp the importance of data by condensing large amounts of it into an understandable form. To make your data analysis more intuitive and insightful, we will examine how to use pandas to create a variety of plot types in this blog.
· Histogram: df['column'].hist()
· Boxplot: df.boxplot(column=['column1', 'column2'])
· Scatter Plot: df.plot.scatter(x='col1', y='col2')
· Line Plot: df.plot.line()
· Bar Chart: df['column'].value_counts().plot.bar()
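The plot types above can be combined in one figure. The sketch below assumes Matplotlib is installed (pandas plotting depends on it) and uses an invented height/weight dataset. It selects the non-interactive Agg backend and writes the image to an in-memory buffer so that it runs without a display, which is a choice made for this example rather than a requirement.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements used only for plotting.
df = pd.DataFrame(
    {
        "height": [150, 160, 165, 170, 180, 175],
        "weight": [50, 58, 63, 68, 80, 74],
        "group": ["a", "b", "a", "b", "a", "a"],
    }
)

# One figure with four panels, one per plot type.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
df["height"].hist(ax=axes[0, 0])                            # histogram
df.boxplot(column=["height", "weight"], ax=axes[0, 1])      # boxplot
df.plot.scatter(x="height", y="weight", ax=axes[1, 0])      # scatter plot
df["group"].value_counts().plot.bar(ax=axes[1, 1])          # bar chart of counts
fig.tight_layout()

# Save the figure to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a notebook or interactive session you would usually skip the backend line and the buffer and call plt.show() instead.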
Statistical Analysis:
Statistical analysis in Pandas uses the library’s functions to analyze, explore and summarize the data.
· Correlation Matrix: df.corr()
· Covariance Matrix: df.cov()
· Value counts: df['column'].value_counts()
· Unique values in Column: df['column'].unique()
· Number of Unique Values: df['column'].nunique()
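Here is a small sketch of the statistical helpers above on invented data. The numeric columns are selected before calling corr() and cov(), because recent pandas releases raise an error when these methods meet text columns unless numeric_only=True is passed.

```python
import pandas as pd

# Hypothetical data: y is exactly 2 * x, so the correlation is 1.
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "y": [2, 4, 6, 8],
        "label": ["a", "b", "a", "a"],
    }
)

numeric = df[["x", "y"]]   # keep only the numeric columns
corr = numeric.corr()      # pairwise Pearson correlation
cov = numeric.cov()        # pairwise sample covariance

counts = df["label"].value_counts()   # frequency of each label
unique_labels = df["label"].unique()  # distinct labels, in order of appearance
n_unique = df["label"].nunique()      # number of distinct labels

print(corr)
print(cov)
print(counts)
```

Because y is a linear function of x in this toy data, the off-diagonal correlation comes out as 1.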
Indexing and Selection:
Indexing and selection in Pandas refers to the process of accessing specific rows, columns or elements within a data frame. The index is a label, usually unique, that identifies each row in a data frame. You can access specific data points in a data frame using square brackets ([]) together with the index, a column name, or a combination of both.
· Select Column: df['column']
· Select Multiple Columns: df[['col1', 'col2']]
· Select Rows by Position: df.iloc[0:5]
· Select Rows by Label: df.loc[0:5]
· Conditional Selection: df[df['column'] > value]
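One detail worth seeing in action is that iloc and loc treat the end of a slice differently: iloc[0:5] excludes position 5, while loc[0:5] includes the label 5. The sketch below uses a made-up seven-row frame with the default integer index.

```python
import pandas as pd

# Hypothetical frame with the default index labels 0..6.
df = pd.DataFrame(
    {"col1": [10, 20, 30, 40, 50, 60, 70], "col2": list("abcdefg")}
)

one_column = df["col1"]              # a single column comes back as a Series
two_columns = df[["col1", "col2"]]   # a list of names comes back as a DataFrame

by_position = df.iloc[0:5]   # positions 0..4, end excluded -> 5 rows
by_label = df.loc[0:5]       # labels 0..5, end included -> 6 rows

filtered = df[df["col1"] > 30]   # keep rows where the condition is True

print(len(by_position), len(by_label))
print(filtered)
```

The difference only shows up clearly here because the labels happen to be integers. With string or date labels, loc slices are still inclusive at both ends.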
Data Formatting and Conversion:
The process of converting data inside a DataFrame to a desired structure, type or representation is known as data formatting and conversion in Pandas. This is an essential step when cleaning and preparing data for analysis or display. Data formatting transforms data into a standard format so that users can compare values more easily. An example of unformatted data is the same item appearing in one column under different values, such as “Los Angeles” and “LA”.
· Convert Data Types: df['column'].astype('type')
· String Operations: df['column'].str.lower()
· DateTime Conversion: pd.to_datetime(df['column'])
· Setting Index: df.set_index('column')
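The “Los Angeles” versus “LA” case can be sketched as follows. The data and the mapping from "la" to "los angeles" are assumptions for the example. In practice the mapping has to come from knowledge of your own dataset.

```python
import pandas as pd

# Hypothetical raw data: numbers and dates stored as text, inconsistent city names.
df = pd.DataFrame(
    {
        "city": ["Los Angeles", "LA", "los angeles"],
        "visits": ["3", "5", "2"],
        "date": ["2024-01-05", "2024-01-06", "2024-01-07"],
    }
)

# Convert the text column to integers.
df["visits"] = df["visits"].astype("int64")

# Lowercase the names, then map the abbreviation to the full name (assumed mapping).
df["city"] = df["city"].str.lower().replace({"la": "los angeles"})

# Parse the date strings; to_datetime is a top-level pandas function.
df["date"] = pd.to_datetime(df["date"])

# Use the parsed dates as the index.
df = df.set_index("date")

print(df)
print(df.dtypes)
```

After these steps all three rows share one city value, so grouping or counting by city gives a single entry.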
Advanced Data Transformation:
Pandas provides a robust set of tools for advanced data transformation. The stack function rotates the innermost level of the columns into the index, whereas the unstack function rotates the innermost level of the index into the columns. The crosstab() function computes a frequency table of two or more factors, which makes it a powerful tool for analyzing the relationship between categorical variables. Here are a few key techniques:
· Lambda Functions: df.apply(lambda x: x+1)
· Pivot Longer/Wider Format: df.melt(id_vars=['col1']) for longer, df.pivot(index='col1', columns='col2', values='col3') for wider
· Stack/Unstack: df.stack(), df.unstack()
· Cross Tabulations: pd.crosstab(df['col1'], df['col2'])
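The reshaping functions above are easier to follow on a tiny table. This sketch uses an invented wide table with two quarter columns (q1, q2) and a small survey table. All names and values are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per item, one column per quarter.
wide = pd.DataFrame({"col1": ["a", "b"], "q1": [1, 3], "q2": [2, 4]})

# Element-wise lambda applied to each numeric column.
plus_one = wide[["q1", "q2"]].apply(lambda x: x + 1)

# melt goes wide -> long; the new columns are named 'variable' and 'value' by default.
long = wide.melt(id_vars=["col1"])

# pivot goes long -> wide again.
back = long.pivot(index="col1", columns="variable", values="value")

# stack moves the column labels into the index; unstack moves them back.
stacked = wide.set_index("col1").stack()
unstacked = stacked.unstack()

# crosstab counts how often each pair of categories occurs together.
survey = pd.DataFrame({"col1": ["a", "a", "b"], "col2": ["yes", "no", "yes"]})
table = pd.crosstab(survey["col1"], survey["col2"])

print(long)
print(stacked)
print(table)
```

Printing long and stacked side by side shows that both hold the same four values, once as ordinary columns and once as a two-level index.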
Handling Time Series Data:
Handling time series data in Pandas involves working with data that is indexed by time, for which Pandas provides excellent support.
· Set Datetime Index: df.set_index(pd.to_datetime(df['date']))
· Resampling Data: df.resample('M').mean()
· Rolling Window Operations: df.rolling(window=5).mean()
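A short sketch of the time series steps above, using fourteen days of invented daily values. It resamples by week ('W') rather than by month to keep the data small. Note also that recent pandas releases spell the month-end alias 'ME', so 'M' may produce a deprecation warning depending on your version.

```python
import pandas as pd

# Hypothetical daily data: 14 consecutive days starting on a Monday, values 1..14.
dates = pd.date_range("2024-01-01", periods=14, freq="D").strftime("%Y-%m-%d")
df = pd.DataFrame({"date": list(dates), "value": list(range(1, 15))})

# Parse the date strings and use them as the index; drop the now redundant column.
df = df.set_index(pd.to_datetime(df["date"])).drop(columns=["date"])

# Resample to weekly frequency (weeks ending on Sunday) and average each week.
weekly_mean = df.resample("W").mean()

# Rolling mean over a 5-row window; the first 4 rows have no full window yet.
rolling_mean = df.rolling(window=5).mean()

print(weekly_mean)
print(rolling_mean.head(6))
```

Both resample and rolling need a function such as mean() or sum() after them. On their own they only describe how the rows are grouped.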
File Export:
· Write to CSV: df.to_csv('filename.csv')
· Write to Excel: df.to_excel('filename.xlsx')
· Write to SQL Database: df.to_sql('table_name', connection)
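The export calls can be tried without creating files. In this sketch an io.StringIO buffer stands in for a CSV file and an in-memory SQLite database stands in for a real connection. The data is invented, and index=False is an optional choice that leaves the row index out of the output.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical data to export.
df = pd.DataFrame({"name": ["Asha", "Ben"], "age": [31, 27]})

# Write CSV text to an in-memory buffer instead of a file path.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Write the frame to a table in an in-memory SQLite database, then read it back.
connection = sqlite3.connect(":memory:")
df.to_sql("table_name", connection, index=False)
round_trip = pd.read_sql("SELECT * FROM table_name", connection)
connection.close()

print(csv_text)
print(round_trip)
```

Without index=False, to_csv and to_sql also write the index as an extra column, which is often not wanted when the index is just the default row numbers.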
Conclusion:
Pandas provides a rich set of tools for transforming data. In this blog, we have explored a few of the functions and methods it offers in Python. Gaining proficiency in these methods will prepare you to work with pandas to manipulate and analyze data effectively.