DATABRICKS AS THE DATA SOURCE FOR TABLEAU
In the real world, there are large amounts of data we need to process, and handling this big data requires a good data-processing tool. Apache Spark is the leading data-processing engine in the world of big data, and Databricks was founded by the creators of Spark.
To do the data processing we need a cluster of computers. A cluster is a group of computers (nodes) working together: when we distribute our work across the cluster, a large amount of data can be processed in parallel at high speed. This is the specialty of Apache Spark. But maintaining a cluster is not easy; installing and configuring it takes a lot of our time.
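The idea of splitting work across nodes can be illustrated on a single machine with Python's `multiprocessing` module. This is only a local analogy, not Spark itself: the process pool stands in for the worker nodes, and `map()` distributes chunks of the data among them.

```python
# A single-machine analogy for cluster parallelism: a process pool plays the
# role of the worker nodes, and map() distributes the data among them.
from multiprocessing import Pool

def square(x):
    return x * x

def parallel_sum_of_squares(n, workers=4):
    # 4 local worker processes stand in for 4 cluster nodes
    with Pool(processes=workers) as pool:
        return sum(pool.map(square, range(1, n + 1)))

if __name__ == "__main__":
    print(parallel_sum_of_squares(1000))  # 333833500
```

Spark does the same thing at a much larger scale, distributing partitions of a dataset across machines instead of processes.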
This is where Databricks comes into action. Databricks allows us to define a cluster and maintains it for us; when we are not using the cluster, it is shut down. It also supports scaling: we can add or remove nodes as the processing load increases or decreases. Spark runs well on Databricks because we don't need to worry about the infrastructure.
Databricks can be used as an ETL tool, for data storage, for analyzing data and deriving insights with Spark SQL, and for building predictive models with Spark ML. It also provides active connections to visualization tools such as Power BI and Tableau. Databricks is integrated with Amazon Web Services, Microsoft Azure, and Google Cloud Platform, making it easy to use alongside these major cloud computing infrastructures.
Advantages of Databricks:
Databricks delivers a Unified Data Analytics Platform: data engineers, data scientists, data analysts, and business analysts can all work in tandem on the same notebook.
Data reliability and scalability through Delta Lake.
Basic built-in visualizations.
GitHub and Bitbucket integration.
Pull all your data together into one place.
Easily handle both batch data and real-time data streams.
Transform and organize data.
Support for popular frameworks (scikit-learn, TensorFlow, Keras), libraries (matplotlib, pandas, NumPy), and IDEs (JupyterLab, RStudio).
Use across different ecosystems – AWS, GCP, Azure.
Use the data for machine learning and AI.
Analyze and query the data.
We can use SQL, Python, Scala, Java, and R as scripting languages.
When we talk about Databricks, we have to cover Delta Lake. It is the term used in Databricks for a storage layer that can accommodate structured or unstructured, streaming or batch data: a simple platform to store all your data. In Databricks we can write or read files in different formats such as CSV, ORC, and Parquet, and there is one extra format, the Delta format, which is an extension of the Parquet format.
Delta Lake adds a "transaction log", a list of all operations performed on your data, while the data itself remains in the well-known Parquet format and can be accessed without Databricks or even Spark. The transaction log lets us see everything we have done with the file; using it we get version control and "time travel" in Databricks, meaning you can read or revert to older versions of your data.
Using Delta Lake also provides "ACID compliance" (atomicity, consistency, isolation, and durability) for your stored data, the ability to read and write both batches of data and streams of real-time data to the same place, and schema enforcement or modification, just as we would do with a database.
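The transaction-log idea can be sketched in a few lines of plain Python. This toy class is not Delta's actual implementation; it only illustrates how keeping an append-only list of operations lets you rebuild the table as it looked at any past version.

```python
# Toy sketch of the transaction-log idea behind Delta Lake's "time travel".
# NOT Delta's real implementation -- just the core concept: an append-only
# log of commits from which any past version can be reconstructed.

class ToyDeltaTable:
    def __init__(self):
        self.log = []  # append-only "transaction log": one entry per commit

    def commit(self, operation, rows):
        """Record an operation ('append' or 'overwrite') and its rows."""
        self.log.append({"op": operation, "rows": rows})

    def as_of(self, version):
        """'Time travel': replay the log up to the given version number."""
        data = []
        for entry in self.log[: version + 1]:
            if entry["op"] == "append":
                data.extend(entry["rows"])
            elif entry["op"] == "overwrite":
                data = list(entry["rows"])
        return data

    def latest(self):
        return self.as_of(len(self.log) - 1)

table = ToyDeltaTable()
table.commit("append", [{"id": 1}])      # version 0
table.commit("append", [{"id": 2}])      # version 1
table.commit("overwrite", [{"id": 99}])  # version 2

print(table.as_of(1))  # [{'id': 1}, {'id': 2}] -- the table as of version 1
print(table.latest())  # [{'id': 99}] -- current state
```

In real Databricks, time travel is exposed through the Delta reader, e.g. the `versionAsOf` option or SQL's `VERSION AS OF` clause, and the log lives as files next to the data.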
So now that you have a rough understanding of Databricks, let's come to our topic: how to connect Databricks to Tableau. I am using Azure Databricks for my explanation.
We need to create an Azure account.
Create a resource group, and inside it create Azure Databricks and Azure Blob storage. Once Azure Databricks has started, create a cluster first, because cluster creation takes time.
We need the licensed version of Tableau Desktop to connect to Databricks.
1. Open Tableau Desktop >> Connect >> To a Server >> choose Databricks.
In the window that opens we need to fill in the Server Hostname, HTTP Path, Username, and Password. We will get these values from Azure Databricks, but before that we need to install the driver.
2. To connect to Azure Databricks from Tableau, we need to install the Databricks ODBC driver; click on the link for that.
3. It will take you to the driver download page >> go to the Tableau Desktop / Tableau Server option >> click on Databricks >> Databricks JDBC/ODBC Driver Download >> fill in and submit the form.
4. Once we fill in and submit the form, we reach the page to download the driver for our operating system.
5. Install the Simba Spark ODBC driver.
6. Now we need to get the Server Hostname and HTTP Path; for that, go to Azure Databricks >> click on the Clusters option.
7. Click on Advanced Options >> JDBC/ODBC.
8. Copy the Server Hostname and HTTP Path from this tab.
9. Restart Tableau after installing the driver. Then select Databricks and use the Server Hostname and HTTP Path to fill in the form.
10. We need the Username and Password; for that, go back to Azure Databricks. We need to generate a personal access token: the token will be used as the Password, and the literal word "token" as the Username. To generate a token >> click on the option list (right side of screen) >> User Settings.
11. Click on Generate New Token >> give the token a descriptive name in the comment box.
12. Once we click the Generate button, a window with the token will appear. Copy the token and paste it into a notepad, so even if the screen closes we won't lose the key.
13. Now fill in the Username and Password in the Tableau window.
14. Once Databricks is connected we can see the connection name. Click on Schema >> select the default schema. Currently there are no tables in Databricks, which is why no table names are shown.
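The same Server Hostname, HTTP Path, and token that Tableau asks for can also be used programmatically, for example from Python via the Simba Spark ODBC driver installed above. The sketch below only builds a DSN-less connection string; the hostname, path, and token are placeholders, and actually connecting would additionally require `pyodbc` and a running cluster.

```python
# Build a DSN-less ODBC connection string for the Simba Spark driver that the
# Databricks ODBC download installs. All values passed in below are
# placeholders: take Host and HTTPPath from the cluster's JDBC/ODBC tab, and
# use the personal access token as the password with the user name "token".
def databricks_odbc_conn_string(hostname, http_path, token):
    parts = {
        "Driver": "Simba Spark ODBC Driver",
        "Host": hostname,
        "Port": "443",
        "HTTPPath": http_path,
        "SSL": "1",
        "ThriftTransport": "2",  # HTTP transport
        "AuthMech": "3",         # username/password authentication
        "UID": "token",          # literal string "token" for token auth
        "PWD": token,
    }
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = databricks_odbc_conn_string(
    "adb-1234567890123456.7.azuredatabricks.net",           # placeholder host
    "sql/protocolv1/o/1234567890123456/0123-456789-abcde",  # placeholder path
    "dapiXXXXXXXXXXXXXXXX",                                 # placeholder token
)
print(conn_str)
# With pyodbc installed, you would then call: pyodbc.connect(conn_str, autocommit=True)
```

Tableau fills in the same values through its connection dialog; this is simply the equivalent from code.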
But once we read a CSV file from Azure Blob into Databricks and save the data back as a table, that data will be available in the default database in Azure Databricks. Databricks Delta stores its metadata on the file system: just files in either JSON (one per transaction) or Parquet format (for a snapshot of the table metadata at some version), stored right alongside the data files. In the picture below you can see that once the table is available, we can just click and use it, much like an Excel file.
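The "metadata is just files next to the data" point can be made concrete with a toy directory layout. A real Delta table keeps its commit log as JSON files inside a `_delta_log/` folder beside the Parquet data files; the sketch below mimics that layout with empty placeholder files rather than real Delta output.

```python
# Toy illustration of where Delta keeps its metadata: a _delta_log/ folder of
# JSON commit files sits right next to the Parquet data files. The files
# created here are placeholders, not real Delta output.
import json
import os
import tempfile

root = tempfile.mkdtemp()
table_dir = os.path.join(root, "my_table")
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

# one JSON file per transaction, named by zero-padded version number
commit = {"commitInfo": {"operation": "WRITE", "version": 0}}
with open(os.path.join(log_dir, "00000000000000000000.json"), "w") as f:
    json.dump(commit, f)

# the data itself would sit beside the log as ordinary Parquet files
open(os.path.join(table_dir, "part-00000.parquet"), "wb").close()

print(sorted(os.listdir(table_dir)))  # ['_delta_log', 'part-00000.parquet']
```

Because the log and data are plain files, the Parquet data stays readable by any Parquet reader, with or without Databricks, exactly as described above.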
Databricks is a cloud-based data engineering tool used for processing and transforming massive quantities of data and for exploring that data through machine learning models. Recently added to Azure, it is the latest big data tool for the Microsoft cloud. Available to all organizations, it allows them to easily achieve the full potential of combining their data, ELT processes, and machine learning. It is a very helpful tool for data analysts and data scientists to tackle their data.