In today's data-driven era, information surrounds us like never before. From online transactions to social media interactions, every digital footprint adds to the vast sea of data that shapes our world.
In the midst of this wealth of information, a significant challenge arises:
How do we effectively navigate, understand and extract valuable insights from this vast sea of data?
In this blog, we will discuss pandas, the hero of data manipulation and analysis in the Python ecosystem.
Although its name comes from "panel data" rather than the friendly bear, pandas offers a powerful toolkit for taming and harnessing the raw power of data. Whether you are a seasoned data scientist, an aspiring analyst or a curious developer, pandas provides the means to transform raw datasets into actionable insights with ease and finesse.
Pandas is an open-source software library built on top of NumPy.
Let's go through the following topics, which serve as the fundamentals of pandas.
What is Python Pandas?
Pandas is used for data manipulation, analysis and cleaning. It provides data structures and functions suited to many kinds of data, and it is an essential tool for data scientists, analysts and developers working with tabular or time-series data.
Pandas introduces two main data structures: Series and Data Frames.
Series: A one-dimensional labeled array capable of holding any data type, such as integers, strings, floats and other Python objects. It is similar to a Python list or a NumPy array, but with additional features such as index labels.
Data Frame: A two-dimensional labeled structure with columns of potentially different types. It is comparable to a spreadsheet or an SQL table, where each column represents a distinct variable and each row corresponds to a specific observation.
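To make the two structures concrete, here is a minimal sketch. The names and values (marks, students and so on) are made up for illustration and are not from any dataset used in this blog.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label.
marks = pd.Series([88, 92, 79], index=["alice", "bob", "carol"], name="marks")

# A data frame: a table whose columns can hold different types.
students = pd.DataFrame(
    {
        "name": ["alice", "bob", "carol"],
        "marks": [88, 92, 79],
        "passed": [True, True, False],
    }
)

print(marks["bob"])     # look up a value by its index label
print(students.dtypes)  # each column keeps its own dtype
print(students.shape)   # (number of rows, number of columns)
```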
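How to install pandas?
To install pandas, open a terminal or command prompt and type:
pip install pandas
(In a Jupyter notebook cell, prefix the command with an exclamation mark: !pip install pandas.) Once the installation finishes, import the library:
import pandas as pd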
How to read data from a dataset?
read_csv and read_excel are pandas functions that read data from a CSV (comma-separated values) file or an Excel file into a data frame, a tabular data structure similar to a spreadsheet.
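As a rough sketch of reading a CSV, the snippet below uses an in-memory string in place of a real file so it runs anywhere. The student names, ages and grades are hypothetical stand-ins; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Name,Age,Grade
Asha,14,A
Ben,15,B
Chen,14,A
Dana,16,C
Eli,15,B
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)  # (rows, columns)
```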
From this output, we can see three columns and five rows: the student data we retrieved from the CSV file.
df.head() is a pandas method that displays the first few rows of the dataset (five by default).
Similarly, df.tail() displays the last few rows of the dataset.
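A small sketch of head and tail on a made-up ten-row data frame (the id and score columns are assumptions for illustration):

```python
import pandas as pd

# Ten illustrative rows so that head() and tail() return different slices.
df = pd.DataFrame(
    {"id": list(range(1, 11)), "score": list(range(10, 110, 10))}
)

first_rows = df.head()   # first 5 rows by default
last_three = df.tail(3)  # pass a number to choose how many rows

print(first_rows)
print(last_three)
```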
Python Pandas Operations
Pandas provides operations for working with Series and data frames, handling missing data, grouping with groupby, and more.
Some of the common operations for data manipulation are listed below:
1. SLICING THE DATA FRAME
To slice data, we first need a data frame, which is a two-dimensional data structure.
In this example, we use two Series objects provided by pandas. We also use the DataFrame constructor, which accepts a dictionary whose keys are "classes" and "grades".
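A minimal sketch of that setup follows. Only the keys "classes" and "grades" come from the text above; the subject names and grade values are assumed for illustration.

```python
import pandas as pd

# Illustrative values; only the keys "classes" and "grades" are from the text.
classes = pd.Series(["Maths", "Physics", "Chemistry", "Biology"])
grades = pd.Series([90, 54, 77, 22])

# The DataFrame constructor accepts a dictionary of column name -> Series.
df = pd.DataFrame({"classes": classes, "grades": grades})
print(df)

# Plain slicing with [] selects rows by position (the end is excluded).
middle_rows = df[1:3]
print(middle_rows)
```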
Now we can see the data frame we created, with two columns and the assigned values.
df.loc[] is used for label-based indexing to access a group of rows and columns.
df.iloc[] works in a similar way, but selects rows and columns by integer position instead of by label.
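Here is a short sketch contrasting the two. The string index labels ("a" to "d") are an assumption added so that the difference between labels and positions is visible.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "classes": ["Maths", "Physics", "Chemistry", "Biology"],
        "grades": [90, 54, 77, 22],
    },
    index=["a", "b", "c", "d"],  # assumed labels, for illustration
)

# loc: label-based. A label slice includes both end points.
by_label = df.loc["b":"c", ["classes", "grades"]]

# loc also accepts a boolean condition to filter rows.
high = df.loc[df["grades"] > 60, "classes"]

# iloc: position-based. The end position is excluded, as in Python slicing.
by_position = df.iloc[1:3]

# A single cell: row position 0, column position 1.
single = df.iloc[0, 1]

print(by_label)
print(high)
print(by_position)
print(single)
```

Note that the label slice "b":"c" and the position slice 1:3 pick the same two rows here, even though one includes its end point and the other does not.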
2. MERGING
The merge operation in pandas, pd.merge(), combines data frame objects based on one or more keys.
To merge two different data frames, we first have to read the datasets LOTR and LOTR 2.
From these two tables, we can see that FellowshipID and FirstName are common columns, while Skills and Age are not.
Let's see how we merge these two data frames now:
2.1 Inner Merge:
The two data frames are now merged with a single .merge() call. Because an inner merge is the default, passing how="inner" explicitly gives the same result.
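The inner merge can be sketched as follows. The rows below are hypothetical stand-ins for the LOTR and LOTR 2 files; only the column names and the FellowshipID values mentioned in this section are taken from the text.

```python
import pandas as pd

# Hypothetical stand-in for the LOTR dataset.
df1 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1003, 1004],
        "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
        "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"],
    }
)

# Hypothetical stand-in for the LOTR 2 dataset.
df2 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1006, 1007, 1008],
        "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
        "Age": [50, 39, 2931, 6520, 51],
    }
)

# With no "on" argument, merge uses every column the two frames share
# (here FellowshipID and FirstName) and keeps only the matching rows.
inner = df1.merge(df2, how="inner")
print(inner)
```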
2.2 Outer Merge:
In this example, the first dataset has no Skills entry for FellowshipIDs 1006, 1007 and 1008, so those cells show NaN (Not a Number). Likewise, 1003 and 1004 are missing from the second dataset, so their Age values are NaN.
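A sketch of the outer merge, again on hypothetical stand-in rows for the two LOTR files:

```python
import pandas as pd

# Hypothetical stand-ins for the LOTR and LOTR 2 datasets.
df1 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1003, 1004],
        "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
        "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"],
    }
)
df2 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1006, 1007, 1008],
        "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
        "Age": [50, 39, 2931, 6520, 51],
    }
)

# An outer merge keeps every row from both frames and fills the gaps with NaN.
outer = df1.merge(df2, how="outer")
print(outer)

# Count the missing values per column.
print(outer[["Skills", "Age"]].isna().sum())
```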
2.3 Left Merge:
In this example, the left merge keeps every row from the left data frame (df1) and adds values from df2 only where the keys match; rows without a match get NaN.
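A sketch of the left merge on the same hypothetical stand-in rows:

```python
import pandas as pd

# Hypothetical stand-ins for the LOTR and LOTR 2 datasets.
df1 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1003, 1004],
        "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
        "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"],
    }
)
df2 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1006, 1007, 1008],
        "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
        "Age": [50, 39, 2931, 6520, 51],
    }
)

# All four rows of df1 survive; Age is NaN where df2 has no matching row.
left = df1.merge(df2, how="left")
print(left)
```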
2.4 Right Merge:
The result contains all rows from the right data frame, with the corresponding values from df1 filled in where there are matches and NaN where there are none.
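A sketch of the right merge, mirroring the left merge on the same hypothetical stand-in rows:

```python
import pandas as pd

# Hypothetical stand-ins for the LOTR and LOTR 2 datasets.
df1 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1003, 1004],
        "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
        "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"],
    }
)
df2 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1006, 1007, 1008],
        "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
        "Age": [50, 39, 2931, 6520, 51],
    }
)

# All five rows of df2 survive; Skills is NaN where df1 has no matching row.
right = df1.merge(df2, how="right")
print(right)
```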
2.5 Cross Merge:
In this example, the first row of df1 is paired with every row of df2, then the second row of df1 is paired with every row of df2, and so on, producing every possible combination of rows (the Cartesian product).
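A sketch of the cross merge on hypothetical stand-in rows. Note that how="cross" needs pandas 1.2 or newer, and that the shared column names receive the default _x and _y suffixes.

```python
import pandas as pd

# Hypothetical stand-ins for the LOTR and LOTR 2 datasets.
df1 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1003, 1004],
        "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
        "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"],
    }
)
df2 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1006, 1007, 1008],
        "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
        "Age": [50, 39, 2931, 6520, 51],
    }
)

# Every row of df1 is paired with every row of df2; no key is compared.
cross = df1.merge(df2, how="cross")

print(cross.head(6))
print(len(cross))  # rows in df1 times rows in df2
```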
Joins
The .join() method combines data frame objects by aligning their indexes.
It is very similar to merge, but by default join matches rows on the index instead of on columns.
The example sets the 'FellowshipID' column as the index for both df1 and df2, then performs an outer join and adds suffixes to tell the overlapping column names apart.
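The join described above might look like the sketch below. The rows are hypothetical stand-ins for the LOTR files, and the suffix strings "_left" and "_right" are my own choice.

```python
import pandas as pd

# Hypothetical stand-ins for the LOTR and LOTR 2 datasets.
df1 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1003, 1004],
        "FirstName": ["Frodo", "Samwise", "Gandalf", "Pippin"],
        "Skills": ["Hiding", "Gardening", "Spells", "Fireworks"],
    }
)
df2 = pd.DataFrame(
    {
        "FellowshipID": [1001, 1002, 1006, 1007, 1008],
        "FirstName": ["Frodo", "Samwise", "Legolas", "Elrond", "Boromir"],
        "Age": [50, 39, 2931, 6520, 51],
    }
)

# join aligns on the index, so move FellowshipID into the index first.
# FirstName exists in both frames, so suffixes are required to avoid a clash.
joined = df1.set_index("FellowshipID").join(
    df2.set_index("FellowshipID"),
    how="outer",
    lsuffix="_left",
    rsuffix="_right",
)
print(joined)
```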
3. CONCATENATION OPERATIONS
In pandas, concatenation is the process of combining two or more data frames along a particular axis.
Concatenation along rows (axis=0)
Concatenation along columns (axis=1)
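Both directions can be sketched with a few made-up rows (the id, city and country columns are assumptions for illustration):

```python
import pandas as pd

top = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Oslo"]})
bottom = pd.DataFrame({"id": [3, 4], "city": ["Lima", "Cairo"]})
extra = pd.DataFrame({"country": ["India", "Norway"]})

# axis=0 stacks the frames on top of each other (more rows).
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1.
rows = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 places the frames side by side (more columns), aligned on the index.
cols = pd.concat([top, extra], axis=1)

print(rows)
print(cols)
```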
4. INDEX OPERATIONS
In pandas, indexing refers to selecting specific rows or columns of a data frame or Series.
Let's create a new data frame.
Let's set an index by making the "Day" column the index of the same data frame.
set_index lets us look up and manipulate rows of the data frame based on the values in the index.
From the output, we can see that the index now holds the values of the 'Day' column.
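A minimal sketch of the steps above. The 'Day' and 'visitors' column names appear in this blog; the 'bounce_rate' column and all the values are assumed for illustration.

```python
import pandas as pd

# Illustrative values; 'bounce_rate' is an assumed extra column.
df = pd.DataFrame(
    {
        "Day": [1, 2, 3, 4],
        "visitors": [200, 100, 230, 300],
        "bounce_rate": [20, 45, 60, 10],
    }
)

# set_index returns a new data frame; the original df is left unchanged.
indexed = df.set_index("Day")
print(indexed)

# Rows can now be looked up by their Day value.
print(indexed.loc[3, "visitors"])
```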
Another example, reading the data frame from a file:
5. CHANGE THE COLUMN HEADERS
The rename method changes the header of a column.
Here, we use the rename method to change the original column name 'visitors' to the new name 'Users'.
The column header has been renamed successfully.
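The rename step can be sketched like this, with made-up values for the 'Day' and 'visitors' columns:

```python
import pandas as pd

# Illustrative values.
df = pd.DataFrame({"Day": [1, 2, 3, 4], "visitors": [200, 100, 230, 300]})

# columns= takes a mapping of old name -> new name.
# rename returns a new data frame unless inplace=True is passed.
renamed = df.rename(columns={"visitors": "Users"})
print(renamed)
```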
Let's see another example, this time with data from a CSV file:
The column header 'Name' has been renamed to 'Students'.
6. DATA MUNGING
Data munging, also known as data wrangling, is the process of cleaning, transforming and preparing raw data into a format suitable for analysis. It involves several steps aimed at improving the quality and usability of the data.
Let's look at an interesting data conversion technique: the 'to_html' method in pandas converts a data frame to an HTML table, which can be saved to a file and then served by a web server.
This is the original data frame:
Now let us see how it is converted to HTML and saved to a file.
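A rough sketch of the conversion follows. The data frame values, the temporary directory and the file name table.html are assumptions made so the example is runnable anywhere; on a real server you would write into the directory that the server publishes.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative values.
df = pd.DataFrame({"Day": [1, 2, 3, 4], "visitors": [200, 100, 230, 300]})

# With no target given, to_html returns the table markup as a string.
html = df.to_html(index=False)
print(html[:60])

# Write the markup to a file; a temporary directory stands in for the
# web server's document folder here.
out_path = Path(tempfile.mkdtemp()) / "table.html"
out_path.write_text(html, encoding="utf-8")
print(out_path)
```

to_html can also write the file directly if you pass a path as its first argument.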
USE CASE TO ANALYSE THE DATASETS
First, let's load the 'titanic' dataset,
which contains information about the passengers aboard the Titanic, and analyze the survival rates.
Let's find the passengers who survived, by age group.
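One way to group survivors by age is sketched below. The eight rows are made up and are not the real Titanic data. The column names 'age' and 'survived' (1 means survived), the bin edges and the group labels are all assumptions, so adjust them to match the dataset you load.

```python
import pandas as pd

# Hypothetical sample rows, NOT the real Titanic data.
passengers = pd.DataFrame(
    {
        "age": [4, 15, 22, 35, 41, 58, 67, 29],
        "survived": [1, 1, 0, 1, 0, 0, 0, 1],
    }
)

# Assumed age bands: (0, 12], (12, 18], (18, 40], (40, 60], (60, 100].
bins = [0, 12, 18, 40, 60, 100]
labels = ["child", "teen", "adult", "middle_aged", "senior"]
passengers["age_group"] = pd.cut(passengers["age"], bins=bins, labels=labels)

grouped = passengers.groupby("age_group", observed=False)["survived"]
survivors = grouped.sum()       # number of survivors per group
survival_rate = grouped.mean()  # share of each group that survived

print(survivors)
print(survival_rate)
```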
CONCLUSION:
In this blog on pandas fundamental operations and data visualization, we have explored, with examples, the key concepts of data loading, data manipulation, data cleaning, indexing, merging, concatenating, data munging and data transformation. We have also seen visualizations using the matplotlib library along with pandas. We gained a few valuable insights from the Titanic dataset, and I would love to explore more datasets in my next blog.
Thank you for joining me on this journey of pandas exploration!
Happy coding!!