What is Databricks?
Databricks is a consolidated cloud-based data platform. Organizations are increasingly moving to cloud-based data storage to reduce costs and maintain their data flows efficiently. Different teams, such as data scientists, data analysts, and data engineers, can ingest and work with data from one cloud-based platform: Databricks.
Databricks can be used as a data source for machine learning, for visualization tools such as Tableau and Power BI, for Python, and more.
Benefits of Databricks:
Databricks behind the GUI - The Architecture:
Databricks uses Delta tables for data storage. Data is ingested from different sources into a single platform, where it is cleaned and pre-processed, and visuals are created in the cloud. Machine learning models can also be built inside Databricks. Databricks is a secure and reliable platform in which workflows can be scheduled and managed.
Databricks for different roles:
Getting Started with Databricks:
To get started with Databricks, go to https://www.databricks.com/, sign up for free, and choose your cloud provider. If you don't have a cloud account, you can try the Community Edition.
Once you click on Community Edition, you will get a puzzle to solve and will be asked to verify your email. You can then set up your password. Now go to https://community.cloud.databricks.com/ to get started. After signing in, you will see the interface below.
Here we can see different components for importing and transforming data and for machine learning. We can create workspaces to collaborate, access data by creating clusters, and manage workflows. To access tables or data in Databricks, we first need to create a cluster.
Cluster in Databricks: A cluster in Databricks is a set of storage, CPU, and memory, that is, the components which are used for data processing and analysis. We can run different workloads for data engineering, data science, and data analytics in Databricks.
Let's start with creating a cluster:
Go to the Compute tab on the left side.
After creating a compute, we will be able to use the notebook. Go to the + icon on the left side of the screen and select 'Create notebook'.
Now give your notebook a name and let's get started. We can use different languages here: Python, SQL, R, and Scala.
I am selecting SQL here.
To start the project, we first need to upload the data. Go to the Data tab and create a new table.
This is the data we will work with in this demo.
After you click on 'Create table', you will get the option of uploading a file.
Click on 'Create table with UI'.
We can preview the uploaded table here.
Now click on 'Create table'.
We can see all tables by using the 'SHOW TABLES' command.
We can look at the schema of the database.
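Since the screenshots only show the result, here is a minimal sketch of what these two steps could look like in code. It assumes the schema is inspected with DESCRIBE TABLE (the exact command used in the demo is not stated) and that the table is called recipes, which is a hypothetical name. The function takes the spark session that Databricks notebooks provide automatically, so nothing needs to be imported.

```python
def show_tables_and_schema(spark, table_name):
    """Run SHOW TABLES and DESCRIBE TABLE through a Spark session.

    `spark` is the SparkSession that a Databricks notebook predefines.
    `table_name` is the name chosen when the table was created
    (an assumption here; use your own table name).
    """
    # Lists the tables in the current database.
    tables = spark.sql("SHOW TABLES")
    # Lists the column names and data types of one table.
    schema = spark.sql(f"DESCRIBE TABLE {table_name}")
    return tables, schema


# Usage inside a Databricks Python cell ("recipes" is a hypothetical name):
# tables, schema = show_tables_and_schema(spark, "recipes")
# tables.show()
# schema.show()
```

In a SQL notebook like the one used here, the two statements can also be typed directly into a cell, or the Python version can be run in a cell that starts with %python.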
Let's create a scenario to see the count of Ingredient 1 across the dataset. Click on the + sign in the cell you want to analyze.
Then select the columns to put in the chart.
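The chart is essentially a count per value of the chosen column. The sketch below shows that aggregation as a SQL query. To keep it runnable on any machine, it uses Python's built-in sqlite3 module as a local stand-in for Databricks. The table name recipes, the column name ingredient_1, and the sample rows are all made up for illustration; only the idea of counting Ingredient 1 comes from the demo.

```python
import sqlite3

# Hypothetical table and column names; the table uploaded in the demo may differ.
COUNT_QUERY = """
SELECT ingredient_1, COUNT(*) AS n
FROM recipes
GROUP BY ingredient_1
ORDER BY n DESC, ingredient_1
"""


def count_ingredients(rows):
    """Load (name, ingredient_1) rows into an in-memory table and count them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE recipes (name TEXT, ingredient_1 TEXT)")
        conn.executemany("INSERT INTO recipes VALUES (?, ?)", rows)
        return conn.execute(COUNT_QUERY).fetchall()
    finally:
        conn.close()


# Made-up sample rows, only here to make the sketch runnable.
sample = [
    ("pancakes", "flour"),
    ("bread", "flour"),
    ("cookies", "sugar"),
    ("cake", "flour"),
    ("caramel", "sugar"),
    ("shortbread", "butter"),
]

for ingredient, n in count_ingredients(sample):
    print(ingredient, n)
```

The query itself is plain SQL, so something very similar should work in a Databricks SQL cell once the table and column names are replaced with the real ones.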
There are different types of charts to choose from.
We can publish the notebook as well.
To conclude, Databricks is a single cloud platform to ingest, clean, and visualize big data. Whether you are a data scientist, data analyst, or data engineer, it is a robust platform for deriving insights from raw data.
Thanks for reading!