Joins
Joins in the data world means combining two tables horizontally. Join functions/queries help us to consolidate two separate tables with different column information into 1 table with all the needed information for data analysis. The two or more tables that we are intending to combine should have at least one common key.
Types of Join
Inner join
Inner join table query/function will return only the common rows that are present in the parent table ( base table on which join is executed) along with the added column information from join table (Secondary table) for those rows.
For example:
Notice the difference in rows in both parent table and join table.
student id 7 which is in the Parent table is not there in the join table. And, student id 9 is there in the join table but not in the Parent table.
Parent Table:
Join table:
Inner join result table:
If you notice the inner join result table, both student id 7 and 9 are missing. It is safe to say that the inner join only returns the common rows .
Outer join:
Outer join ensures to keep all the rows from both parent table and join table irrespective of the information present in them. And, the missing values of the corresponding columns that don't have information are presented as ‘null’.
Here, you can see all the rows from both parent table and join table. Even Though, student id 9 from the join table is absent in the parent table, it's still included.And, the corresponding columns ‘Years’, ‘previous work experience’, ’join date’ appear as ‘Null’. Likewise, the student id from the parent table was absent in the join table hence the columns ‘Age’ and ‘Gender’ are shown as null.
Outer join is crucial to ensure no data loss and for understanding a comprehensive view of both tables.
Left join (Left outer join):
Left join ensures to keep all the rows from the parent table (left table) and takes only the common rows from the join table (right table). And, rows from the parent table that are absent in the join table will still be included and will have missing values in the corresponding columns that are added from the join table. And, these missing values will be represented as ‘null’.
Here in the result table, student id 7 is included with ‘null’ values for columns from the join table and student id 9 row is not included at all.
Right Join (Right outer join):
Almost the same concept as left join but here the right table (join table) will have the upper hand. Right join ensures to keep all the rows from the join table (right table) and takes only the common rows from the parent table (left table). And, rows from the join table that are absent in the parent table will still be included and will have missing values in the corresponding columns that are added from the parent table. And, these missing values will be represented as ‘null’.
Here in the result table, student id 9 is included with ‘null’ values for columns from the parent table and student id 7 row is not included at all.
Now that we understand what joins are and the different types. Let's get into how we can do these joins on different platforms: Tableau, SQL, and Python (Pandas). When I started learning data analysis, I was amazed by the variety of platforms available for performing analysis. I often wondered what differences allowed these software programs to thrive. Here is my research on how to perform joins in the most commonly used software, highlighting their similarities and differences.
Performing Join in Tableau
Tableau is a data visualizing tool and it supports
Inner join
Left join
Right join
Outer join
Self join
Cross-database join.
Join in Tableau is straightforward and requires just a few clicks, drag and drop actions.
First step in data analysis irrespective of the platform, is importing data. So, let's import the data using the 'Connect' in Tableau. The example here uses an excel file hence we are importing it using 'Microsoft Excel'.
Once the data is imported, the dataset name and all the sheet names present in the dataset should become visible in the data source page also known as canvas in Tableau.
Dataset name : Joins . The dataset has three sheets with sheet names 'work experience', 'demographics', 'course'.
Now, drag and drop the sheet that you needed as a parent table to the canvas. And, double click on the sheet name after dropping it in the 'Drag tables here' section to open the join canvas
Another way is to right click and select 'open' from the menu.
Here, from the dataset, 'work experience' table is selected as parent table.
Once, after double clicking the 'work experience', a new window will open that will say how many tables are present in the tableau's data source.This window is called join canvas.
Now, the join table from the sheets shelf is dragged and dropped near 'work experience'.
Here for example, join table is demographics.
By default, inner join will be performed. If a different join is preferred, the joined circles between the two
table name (join icon) should be clicked. And from the listed join operators , select the one that is appropriate for the analysis.
And, Tableau is smart enough to select a join clause by going through the table's column names. In case, a different join clause is need, just click on the suggested column name 'Student id'
List of the column names will appear, select the appropriate column on which the table should be joined.
Note that the 'Data Source' refers to the parent table. Here for example, 'Student name' is used as join clause.
Multiple join clause can be used too using 'Add new join clause'. And multiple data can be joined too by dragging and dropping the sheet near to the already joined table.
If the common column name in join table differs from the parent table, tableau will throw an error. In that case, the join column names should be manually selected. Once after join is created, close the join dialog by clicking on to 'x'.
Join clause usually work based on equal symbol '=', which means the values from the parent table and join table should be exactly same. It is also possible to join tables on '<' , '>' , '<>' (meaning not equal to). Even though non eqi joins are possible, their use cases are limited.
Lets, take a look at the joined table. Joined table is visible in the bottom right corner of the data source page.
Note that even after performing join, tableau still keeps the common columns separately specifying the origin ' table name. Tableau does not merge them in to one column.
Self Join
Joining the table to itself is Self join. Commonly used as a step in data densification for creating advanced charts.
Cross-database Join
Tableau also supports Cross-database joins, which means we can join sheets from completely two different data source files. If a Cross-database join is need for a data analysis, simply connect the other data source by importing the file using the connections and select the sheets to be joined.
In conclusion, performing joins in Tableau is an easy task and does not require any additional syntax or coding interface. However, one drawback is that viewing the entire table without duplicate column is not possible.
Performing Join in Python
Python is a data management tool. To perform join in python, Pandas library is needed. Pandas contain different functions/ operations on dataframes. Pandas refer tables as dataframes. So, each sheet in an excel file is a dataframe. Dataframe uses 'merge' syntax to perform join.
Pandas support
Inner join
Left join
Right join
Outer join and
Index join
To perform join in Python, specific syntax must be written in the jupyter notebook/ visual studio code.
First and foremost important step is to import the Pandas library using 'import pandas as pd'.
Lets first import the data- same excel file 'Joins' is used.
syntax used:
data _parent = pd.read_excel("Joins.xlsx",sheet_name="work experience") ----- 'sheet_name' specification is important because by default pandas will import only the first sheet, excluding other sheet information.
data_parent.head() ----- This step is used to look at the table/dataframe in the python.
Next step is to import the join table sheet in to python.
syntax used: data_join = pd.read_excel("Joins.xlsx",sheet_name="demographics")
data_join.head()
Both tables are imported and ready to be joined.
A. Inner Join
syntax used: df = pd.merge(data_parent, data_join, on='Student id', how='inner')
B. Outer join
syntax used : df_outer= pd.merge(data_parent, data_join, on='Student id', how='inner')
C. Left join
syntax used : df_left = pd.merge(data_parent, data_join, on='Student id', how='left')
df_left
D. Right join
syntax used: df_right = pd.merge(data_parent, data_join, on='Student id', how='right')
df_right
E. Index join
Index join is performed on indices. if u have noted in the above tables, dataframes are automatically assigned to an index starting from 0.
syntax used: df_index = pd.merge(data_parent, data_join, left_index=True, right_index=True)
df_index
What did index join do? Notice the table, index join is nothing but the left join with the display of both common column from parent table and join table just like the display in Tableau.
To join multiple tables in Pandas, the tables should be joined one by one.
for example to join another dataframe, we have to use the already joined table as parent table and should perform join on it.
Lets see an example, import the third sheet.
syntax used: data _third = pd.read_excel("Joins.xlsx",sheet_name="course")
data_third
'df_outer' is the joined table of sheets 'work experience' and 'demographicsAnd, we are joining table 'Course'.
Self join and Cross database join
Performing Self join or cross database join do not need any different approach, can be done using the same syntax as any other join
In conclusion, performing joins in Python needs a library Pandas and a coding interface. In context of joining tables, Tableau was much easier to use. Unlike Python, which requires precise syntax, and can throw errors for minor mistakes such as misspelling or missing brackets, Tableau allows for join to be performed with simple drag and drop action. However, viewing the merged table in Python was more cleaner as it merges the common columns into one providing a more unified table view.
Performing Join in SQL
SQL - Structured Query Language is a programming language for storing and processing information in relational database. SQL is also a data management tool like Python but SQL handles data only in relational databases. SQL is used to retrieve data, analyze it and also manipulates the data whereas Python is mainly used for data cleaning, visualization, and machine learning.
SQL supports
Inner join
Outer join
Left join
Right join
To perform join in SQL, I am using pgAdmin4 SQL server.
First and foremost important step is to import the data.
Importing a data into SQL is a complete different approach, it needs multiple steps. So, for this blog I am goin to concentrate only on how to do join.
Inner Join / Right Join /Left join
Same syntax is used - just replace 'INNER JOIN' with 'LEFT JOIN' or 'RIGHT JOIN' depending on the join desired.
Example Syntax used:
SELECT Work_experience.Student_ID,
Work_experience.StudentName,
Work_experience.previous_experience,
Work_experience.years,
demographics.Student_ID,
demographics.Gender ,
demographics.Age
FROM Work_experience
INNER JOIN demographics ON Work_experience.Student_ID = demographics.Student_ID;
B. Outer Join
Outer join is specified as full join or outer full join in SQL otherwise syntax is same.
Syntax used:
SELECT Work_experience.Student_ID,
Work_experience.StudentName,
Work_experience.previous_experience,
Work_experience.years,
demographics.Student_ID,
demographics.Gender ,
demographics.Age
FROM Work_experience
FULL OUTER JOIN demographics ON Work_experience.Student_ID = demographics.Student_ID;
Self Join in SQL is done by using aliases 'AS' and self join is mainly performed to map two columns or if a hierarchal output is expected.
common syntax example:
SELECT D.StudentName, De.StudentName
from Work_experience as D
join Work_experience as De
on D.StudentName=D.StudentName
order by D.StudentName;
This example illustrates how many combinations are possible when any two students are selected, resulting in clustering within the output.
Conclusion
Let's see the similarity / dissimilarity learnt while performing joins in 3 different software. This table is completely based on my understanding while doing these joins in SQL, TABLEAU and PYTHON.
TOPIC | TABLEAU | PYTHON | SQL |
Coding | User friendly and no coding interface needed | Needs coding knowledge | Needs coding knowledge |
Process | Simple drag and drop concept | Needs correctly written syntax otherwise will throw error | Needs correctly written syntax otherwise will throw error |
Default Join | Inner Join | Inner Join | Inner Join |
Choosing subset of columns | Choosing /filtering columns cannot be done in joining step | Filtering out columns cannot be done in joining step. Pandas needs an extra data step syntax to filter out columns. | SELECT statement in syntax gives the flexibility of choosing column variables |
Joining more than 2 tables | Joining more than 2 tables is a breeze | Joining more than two tables is a multistep process | Joining more than 2 tables, needs just an extra line of syntax in join step. SELECT columns FROM ((parent table INNER JOIN jointable1 ON parenttable.primarykey = jointable1.primarykey) INNER JOIN jointable2 ON jointable1.primarykey = jointable2.primarykey)) |
Table view | Common column (primary key columns) are kept separately. They are not merged in to one. | Common columns are merged to provide a clean view. | Similar to Tableau, primary key columns are not merged into single column |
Data exclusion | Data exclusion is possible as the primary key columns don't merge. In case of analysis needed on primary key column, I think Union is much better option | Since, Primary key columns merges well, data exclusion does not happen when primary key is used in analysis | Data exclusion is possible as the primary key columns don't merge. In case of analysis needed on primary key column, I think Union is much better option |
Creating subset of joined table | Not possible in joining step | Possible but needs an extra data step. | Needs just an additional line of WHERE clause syntax in data joining step |
Cross database join | It's easy to perform- just need to import data using connections in Tableau | No different type of syntax is needed, similar steps should be followed | Since SQL uses relational database system, performing cross database joins uses complete different syntax. |
Result set | output is not stored as a separate table | New dataframe containing joined data | output is not stored as a separate table |
Missing valuerepresentation | null | Nan | [null] |
Mastering joins will enhance your ability to understand complex data relationships. Understanding the syntax and application of joins will help in producing insightful connections between datasets. I hope this blog has provided some valuable insights into performing joins in different software platforms. Let me know your feedback! Thank you for reading!
you explain so well !