Python is a data management tool. To perform join in python,Pandas library is needed.Pandas contain different functions/operations on data frames.Pandas refer tables as data frame.So, each sheet in an excel file is a data frame.
There are three types of join functions in Python.
1.Merge: It merges two data frames on columns or indecies.
2.Join: The join() function will join two data frames together on a key column or index.
3.Concat: The concat() function bonds two data frames across the rows or columns.
It sounds rather similar,so let see what is the difference between three approaches.
Definition: The merge() method combines the content of two data frames by merging them together using the specified method(s). Uses the parameters to control which values to keep and which to replace.
Syntax:
Dataframe.merge(right, how,on,left_on,right_on,left_index,right_index,sort,suffixes,copy,indicator,validate)
All parameters except right, are keyword arguments.
Pandas supports
· Inner Join
. Left Join
· Right Join
· Outer Join
· Cross Join
To perform joins in python , we can either use Visual Studio code /Jupyter Notebook to write the syntax. We are using jupyter notebook here.First we have to import the pandas library using the ‘import pandas as pd’.
Data
We have two data frames .Our first data frame is "Lotr.csv". The table has the following columns. The data preview is shown below.
The second data frame is "Lotr2.csv" and it has the following columns,Also, the data preview is shown below.
Here , we have two dataframes df1 and df2. We used the ‘pd.read_csv(“filename”)’ to read the csv file .
As a default it will do inner join , and returns the specific values or the keys that matches in the df1 and df2 . Here FellowshipID and FirstName 1001 and 1002 matches in both the tables .
Inner Join:
Inner join in pandas is used to merge two data frames at the intersection. It returns the mutual rows of both.
On:Label or list
It’s a column or index level names to join on . These must be found in both dataframes.
Outer Join:
Use intersection of keys from both dataframes. It will match all rows that exists in both dataframes. The rows found only in one of the two data frames will get th NaN value.If there is overlapping data it won’t be duplicated.
Here, it will give us all of the values regardless of if they are same. In this dataframe,if the value doesn’t have so if we can’t join on the fellowshipID then it gives us Nan which means Not a number.
Right Join:
The right join returns all rows from the right data frame and the remaining data from the left.The data which doesnot correspond to right data frame will have NAN's assigned.
Left Join:
Use only keys from left dataframe.The left join returns all rows from the left data frame ,which will be merged with the corresponding rows from the right data frame. It fills the unmatched rows with NAN'S.
Here we are only looking at the left dataframe (df1)and then if there is something that matches in the right dataframe (df2) then those rows we will be given.
Cross Join:
Creates the cartesian product from both the frames, preserves order of the left keys.
It takes each value fro the left dataframe(df1) and compares it to each value in the right dataframe (df2).
JOIN
Join the column with other dataframe either on index or a key column. Join efficiently multiple dataframes objects by index at once by passing the list
If we want to join the key columns,we need to set key to be the index in both the dataframes.The return dataframe will have the key as its index.Another way to use to join the key columns is to use the 'on' parameter.Dataframe .join always uses others index but we can also use any column in dataframe.This method uses the original dataframe index in the results.
Different parameters in Join:
Other : Dataframes,series, or list of Dataframe
On : str,list of str, array-like,optional
how : {'left','right','outer','inner'}
lsuffix : str,default
rsuffix : str,default
sort :bool,default False
Finally , returns the dataframe containing columns from both the caller and the other.
Concat:
Concatenates pandas objects along a particular axis. When concatenating all series along the index(axis=0), a series is returned.When objects contain dataframe, then dataframe is returned.
Parameters used with concat () are:
objs: a sequence or mapping of series or datframe objects
axis: {0/'index',1/'columns'}
join: {'inner','outer'}
ignore_index: bool,default False
keys: sequence,default None
levels: list of sequences, default None
names: list,default None
verify_integrity: bool, default False
copy: bool,default true
Conclusion:
Merge() allows you to perform more flexible joins because it provides you more combinations,yet concat() is less structured. Join() combines data frames on the index but not on columns,yet merge() gives you a chance to specify the column you want to join on.
Thanks for reading my blog. Keep exploring.