Handling big data calls for a good data processing tool. Apache Spark is the go-to data processing engine in the big data world, and Databricks was founded by the creators of Spark.
To process data at scale we need a cluster of computers. A cluster is a group of computers (nodes) working together. When we distribute work across the cluster, large volumes of data are processed in parallel at high speed; this is the specialty of Apache Spark. But maintaining a cluster is not easy: installing and configuring it takes a lot of time.
This is where Databricks comes into play. Databricks lets us define a cluster and manages it for us: when the cluster is not in use, it is shut down. It also offers scaling, so we can add or remove nodes as the processing load increases or decreases. Spark runs well on Databricks, and we don't need to worry about the infrastructure.
Databricks provides a unified, open platform for all your data. It empowers data scientists, data engineers and data analysts with a simple collaborative environment to run interactive and scheduled data analysis workloads.
In Apache Spark, a DataFrame is a distributed collection of rows organized under named columns. In simple terms, it is the same as a table in a relational database or an Excel sheet with column headers.
Spark DataFrames support all the basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, LEFT SEMI, LEFT ANTI, and CROSS. Spark SQL joins are wide transformations that shuffle data over the network, so they can cause serious performance problems when not designed with care.
So let's go through the join types in Spark and compare each with its SQL counterpart.
Prerequisite
I am using the Databricks Community Edition for this explanation.
Now let's jump into the Spark SQL join examples. I have created two sample DataFrames in Databricks using PySpark.
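Since the original sample data isn't reproduced here, the sketch below builds two small hypothetical DataFrames, df1 (customers) and df2 (orders), sharing a cid key; cid 4 exists only in df1 and cid 5 only in df2, so each join type below has something to show. The names and values are assumptions for illustration.

from pyspark.sql import SparkSession

# On Databricks the SparkSession is already available as `spark`;
# this line only matters if you run the sketch outside Databricks.
spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: df1 holds customers, df2 holds orders.
# cid 4 has no order, and cid 5 has no customer record.
df1 = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave")],
    ["cid", "name"])
df2 = spark.createDataFrame(
    [(1, "laptop"), (2, "phone"), (3, "tablet"), (5, "monitor")],
    ["cid", "product"])

df1.show()
df2.show()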
Inner Join
Inner join is the default join and the most commonly used. It joins two DataFrames on key columns, and rows whose keys don't match are dropped from both datasets. In the example below, you can see that cid 4 from df1 is removed.
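A minimal sketch using the hypothetical df1 and df2 defined above:

# Inner join on the shared "cid" column; rows with cid 4 (only in df1)
# and cid 5 (only in df2) are dropped.
df1.join(df2, on="cid", how="inner").show()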
Left Outer Join
Left outer join returns all rows from the left DataFrame regardless of whether a match is found in the right dataset. When the join expression doesn't match, it assigns NULL to the right-side columns for that record, and records from the right with no match are dropped. In the example below, the cid from the left table that has no match in the right table shows NULL for the right-side values.
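Sketch with the same hypothetical DataFrames:

# Left outer join: every row of df1 is kept; cid 4 gets NULL for df2's columns.
# "left", "leftouter", and "left_outer" are all accepted values for `how`.
df1.join(df2, on="cid", how="left").show()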
Right Outer Join
Right outer join is the opposite of the left join: it returns all rows from the right relation and the matched rows from the left relation, appending NULL where there is no match. It is also referred to simply as a right join.
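Sketch with the same hypothetical DataFrames:

# Right outer join: every row of df2 is kept; cid 5 gets NULL for df1's columns.
df1.join(df2, on="cid", how="right").show()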
Full Outer Join
Full outer join returns all rows from both relations, appending NULL values on the side that does not have a match. It is also referred to simply as an outer join or full join.
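Sketch with the same hypothetical DataFrames:

# Full outer join: rows from both sides are kept; cid 4 and cid 5 each get
# NULLs on the side where they have no match. "full" and "fullouter" also work.
df1.join(df2, on="cid", how="outer").show()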
Left semi Join
Left semi join is similar to an inner join, the difference being that a left semi join returns only the columns from the left DataFrame and ignores all columns from the right dataset. In other words, it returns the left dataset's columns for records that have a match in the right dataset on the join expression; records that don't match are dropped from the result.
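Sketch with the same hypothetical DataFrames:

# Left semi join: only df1's columns come back, and only for cids
# that also appear in df2 (1, 2, and 3).
df1.join(df2, on="cid", how="leftsemi").show()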
Left Anti Join
Left anti join does the exact opposite of the left semi join: it returns only the columns from the left DataFrame, and only for records that have no match in the right dataset.
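Sketch with the same hypothetical DataFrames:

# Left anti join: only df1's columns come back, and only for cids
# that do NOT appear in df2 (just cid 4 here).
df1.join(df2, on="cid", how="leftanti").show()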
Cross Join
This join simply combines each row of the first table with each row of the second table. For example, if we have m rows in one table and n rows in another, the result has m*n rows.
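Sketch with the same hypothetical DataFrames:

# Cross join: 4 rows in df1 x 4 rows in df2 = 16 rows in the result.
df1.crossJoin(df2).show()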
IS THERE A SELF JOIN?
There is no dedicated self join type in PySpark; you perform a self join by joining a DataFrame with itself, typically using an inner join with aliases.
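A minimal sketch, assuming a hypothetical employee DataFrame whose mgr_id column points back at emp_id in the same DataFrame:

from pyspark.sql import functions as F

# Hypothetical employee data; mgr_id refers to another employee's emp_id.
emp = spark.createDataFrame(
    [(1, "Ann", None), (2, "Ben", 1), (3, "Cara", 1)],
    "emp_id int, name string, mgr_id int")

# A self join is just a regular (here inner) join of the DataFrame with itself;
# aliases let us tell the two sides apart in the join condition and select.
(emp.alias("e")
    .join(emp.alias("m"), F.col("e.mgr_id") == F.col("m.emp_id"), "inner")
    .select(F.col("e.name").alias("employee"), F.col("m.name").alias("manager"))
    .show())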