

Handling big data requires a good data processing tool. Apache Spark is the leading data processing engine in the big data world, and Databricks was founded by the creators of Spark.

To process data at scale we need a cluster of computers. A cluster is a group of computers (nodes) working together. When we distribute work across the cluster, parallel processing lets a large amount of data be handled at high speed. This is the specialty of Apache Spark. But maintaining a cluster is not easy; installing and configuring it takes a lot of our time.

This is where Databricks comes into action. Databricks lets us define a cluster and maintains it for us: when we are not using the cluster, it is shut down. It has another feature called scaling, which adds or removes nodes as the processing load increases or decreases. Spark runs better on Databricks because we don't need to worry about the infrastructure.

Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers, and data analysts with a simple collaborative environment to run interactive and scheduled data analysis workloads.

In Apache Spark, a DataFrame is a distributed collection of rows organized under named columns. In simple terms, it is the same as a table in a relational database or an Excel sheet with column headers.

Spark DataFrames support all the basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, LEFT SEMI, LEFT ANTI, and CROSS. Spark SQL joins are wide transformations that shuffle data over the network, so they can cause serious performance issues when not designed with care.

So let's look at each join in both the Spark DataFrame API and Spark SQL.


I am using Databricks Community Edition for this explanation.

Now let's jump into Spark SQL join examples. I have created two sample DataFrames in Databricks using PySpark.

Inner Join

Inner join is the default join and the most commonly used. It joins two DataFrames on key columns, and rows whose keys don't match are dropped from both datasets. In our sample data, cid 4 from df1 is removed because it has no match in the right DataFrame.

Left outer Join

Left outer join returns all rows from the left DataFrame regardless of whether a match is found in the right dataset. When the join expression doesn't match, it assigns NULL to the right-side columns for that record, and it drops records from the right where no match is found. In our sample data, the cid from the left table that has no match in the right table shows its right-side values as NULL.

Right Outer Join

Right outer join is the opposite of the left join: it returns all rows from the right relation and the matched rows from the left relation, appending NULL where there is no match.

Full Outer Join

Full outer join returns all rows from both relations, appending NULL values on the side that does not have a match.

Left semi Join

Left semi join is similar to an inner join, the difference being that a left semi join returns all columns from the left DataFrame and ignores all columns from the right dataset. In other words, it returns only left-side columns for records that match the right dataset on the join expression; records that don't match are ignored from both datasets.

Left Anti Join

Left anti join does the exact opposite of the left semi join: it returns only the columns from the left DataFrame, and only for records that have no match in the right dataset.

Cross Join

This join simply combines each row of the first table with each row of the second table. If we have m rows in one table and n rows in the other, the result has m*n rows.



There is no dedicated self-join type in PySpark; you perform a self join by inner-joining a DataFrame with itself, typically using aliases to disambiguate the columns.


