Are the phrases “data is the new gold” and “data is the new oil” just marketing slogans to entice businesses to spend more money on technology? Or are these metaphors describing a new paradigm that is revolutionizing the world?
A few facts:
Both oil and gold lay buried deep under the surface of the earth for millions of years, and both became valuable to humans once we learned how to dig them out of the earth and process them into jewelry, or into fuel for light, heating, transportation, and energy.
Similarly, data has been around for ages, from early Sumerian writing scripts, through centuries of printed books, to today’s digital databases.
Data has been a critical resource for understanding the past, making decisions in the present, and planning for the future.
Overview:
Amid this rapid digital evolution, businesses and industries face the challenge of effectively storing, managing, and analyzing vast amounts of data, and they are seeking efficient ways to do so. Three prominent architectures have emerged to meet these needs:
1. Data Warehouse
2. Data Lake
3. Data Lakehouse
A data warehouse, a data lake, and a data lakehouse are related but fundamentally different technologies, each backed by different cloud storage solutions. Each serves a distinct purpose and offers unique advantages.
In this blog, let’s explore these technologies, their key features, their differences, and their benefits in modern data strategies.
A data warehouse stores data in a structured format. It is a central repository of preprocessed data for analytics and business intelligence.
A data lake is a central repository for raw and unstructured data.
A data lakehouse is an architectural approach that combines the features of both the data warehouse and the data lake. It can store all types of data (structured, semi-structured, and unstructured), aiming to provide the best of both technologies.
What is a Data Warehouse?
A data warehouse integrates data and information collected from various sources into one comprehensive database. It is a system that stores data from a company’s operational databases as well as from external sources.
Data warehouse platforms differ from operational databases in that they store historical information, which makes it easier for business leaders to analyze data over a specific period of time.
A data warehouse is a type of data management system that facilitates and supports business intelligence (BI) and is designed for reporting and analysis of data.
History of the Data Warehouse
Listed below are major milestones in the evolution of the data warehouse.
In the 1960s, Dartmouth and General Mills, in a joint research project, developed the concepts of dimensions and facts.
In the 1970s, Nielsen and IRI introduced dimensional data marts for retail sales.
In 1983, Teradata Corporation introduced a database management system designed for decision support.
The term “business data warehouse” was coined by IBM researchers Barry Devlin and Paul Murphy in the late 1980s.
But it was Bill Inmon who truly articulated the idea; he is regarded as the father of the data warehouse for his work on how to build, use, and maintain the warehouse and the corporate information factory.
Fundamental Data Warehouse Concepts and Terminology
ETL: Extract, Transform, and Load; the technique of extracting data from source systems, transforming it into a suitable layout, and loading it into the data warehouse.
Ingestion: The process of collecting and importing data into the data warehouse from external sources.
Data Lake: A centralized repository where large quantities of structured, semi-structured, and unstructured data records are processed, stored, and secured.
Data Transformation: The process of changing, cleaning, and converting data from one structure to another for analysis.
Partitioning: Splitting large, complicated tables into smaller chunks.
Data Mining: One of the most useful techniques for extracting valuable information from huge sets of data; also known as Knowledge Discovery in Databases (KDD).
Data Quality: The state of data, reflected in its accuracy, completeness, reliability, relevance, and timeliness.
Data Cleaning: The process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets; also known as data scrubbing. A short sketch of cleaning in practice follows this list.
Metadata: Data that provides information about one or more aspects of other data.
Data Integration: The process of merging data records from different sources into an integrated warehouse.
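To make “data cleaning” and “data quality” concrete, here is a minimal, hypothetical sketch in Python using pandas; the column names and rules are invented purely for illustration.

```python
# A minimal data-cleaning sketch with pandas; column names and rules are hypothetical.
import pandas as pd

def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple data-quality rules: deduplicate, normalize, fix types, drop bad rows."""
    df = df.drop_duplicates(subset=["customer_id"])            # remove duplicate records
    df["email"] = df["email"].str.strip().str.lower()          # normalize text fields
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # fix types
    df = df.dropna(subset=["customer_id", "signup_date"])      # completeness check
    return df

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, None],
    "email": [" A@X.COM ", " A@X.COM ", "b@y.com", "c@z.com"],
    "signup_date": ["2024-01-05", "2024-01-05", "not a date", "2024-02-10"],
})
print(clean_customers(raw))
```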
Consider the flow shown in the diagram:
Suppose data arrives from an HR system, a sales transaction system, flat files such as CSV or Excel, or any other source, and is to be stored in a centralized data warehouse.
This data warehouse flow can be divided into four stages, as shown.
In the collect stage, we fetch the data from the different source systems in raw form and store it in a staging layer; this is handled by the ETL process. The data is then analyzed for the insights to be gained and transformed into a shape from which business users can derive useful information. Finally, in the consume stage, graphical visualizations and reports are delivered.
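As a rough illustration of this collect, stage, transform, and consume flow, here is a hedged, minimal Python sketch; sqlite3 stands in for the warehouse, and the file, table, and column names are hypothetical.

```python
# A simplified ETL sketch of the collect -> stage -> transform -> consume flow.
import csv
import sqlite3

# Create a tiny source file so the sketch is self-contained (hypothetical data).
with open("sales_export.csv", "w", newline="") as f:
    f.write("sale_id,amount,region\n1,120.50,east\n2,80.00,west\n3,,west\n")

conn = sqlite3.connect("warehouse.db")

# Collect: land raw rows from the source file into a staging table, unchanged.
conn.execute("CREATE TABLE IF NOT EXISTS stg_sales (sale_id TEXT, amount TEXT, region TEXT)")
with open("sales_export.csv", newline="") as f:
    rows = [(r["sale_id"], r["amount"], r["region"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)

# Transform: cast types, drop bad records, and shape the data for business users.
conn.execute("CREATE TABLE IF NOT EXISTS fct_sales (sale_id INTEGER, amount REAL, region TEXT)")
conn.execute("""
    INSERT INTO fct_sales
    SELECT CAST(sale_id AS INTEGER), CAST(amount AS REAL), UPPER(region)
    FROM stg_sales
    WHERE amount <> ''
""")
conn.commit()

# Consume: the kind of reporting query a BI tool or dashboard would run.
for region, total in conn.execute("SELECT region, SUM(amount) FROM fct_sales GROUP BY region"):
    print(region, total)
```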
There are many tools that enable ETL, such as Informatica, SAP Data Services, Apache NiFi, Talend, Microsoft SQL Server Integration Services (SSIS), AWS Glue, Google Cloud Dataflow, Azure Data Factory, Oracle Data Integrator, IBM DataStage, and many others, each suited to different ETL needs and environments.
There are likewise various tools for graphical visualization, such as Tableau, Power BI, Plotly, Matplotlib, Seaborn, ggplot2, and SAP BusinessObjects, for exploring the data with interactive capabilities according to requirements.
DWH Paradigm Components
Data sources
ETL
Data warehouse
Data modeling techniques such as star schema, snowflake schema, and data vault (a star-schema sketch follows this list)
Data integration and management
Data governance, with tools such as Informatica, Collibra, and Alation
Graphical visualization
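To illustrate the star schema mentioned above, here is a small, hypothetical sketch using Python’s built-in sqlite3; the table and column names are invented for this example and not taken from any particular warehouse.

```python
# A minimal star-schema sketch: one fact table joined to several dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

    -- The fact table holds measures plus foreign keys to each dimension.
    CREATE TABLE fct_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        quantity    INTEGER,
        revenue     REAL
    );
""")

# A typical BI query: slice the fact table by dimension attributes.
query = """
    SELECT d.year, p.category, SUM(f.revenue) AS total_revenue
    FROM fct_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
"""
print(conn.execute(query).fetchall())
```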
Taken together, this combined approach allows organizations to efficiently manage large volumes of data, derive meaningful insights, and support decision making across the enterprise.
Though data warehouses have a long history of supporting business intelligence applications and decision making, they were poorly suited to, and very expensive for, handling unstructured and semi-structured data. The concept of the data lake then evolved as a solution to big data challenges.
It gained popularity around 2010-2012.
What is a Data Lake?
Data lake philosophy: “Load first, think later.”
A data lake is a massive repository of structured and unstructured data that has not yet been processed for a defined purpose.
Think of it as a massive storage pool holding data in its natural, raw state, like a lake. It can store raw data in its native format until it is needed for further analysis.
A data lake architecture can handle the huge volumes of data that most organizations produce without the need to structure it first. Data stored in a data lake can be used to build pipelines that make it available to analytics tools, which find insights that inform key business decisions.
A data lake is usually a single store of data, including raw copies of source system data, sensor data, social data, and so on, which is then transformed for tasks such as reporting, visualization, advanced analytics, and machine learning.
It can include structured data from relational databases (rows and columns), semi-structured data (CSV, XML, JSON, logs), unstructured data (emails, documents, PDFs), and binary data (images, audio, and video).
A data lake can be established within the organization’s own data centers (“on premises”) or “in the cloud” with vendors such as Amazon, Microsoft, or Google.
When data is stored, the data lake associates it with identifiers and metadata tags for faster retrieval.
Data Lakes are incredibly useful for Machine Learning (ML) and Artificial Intelligence (AI) applications.
There are numerous possibilities with data lake solutions.
Data lakes are very cost effective compared with data warehouses.
The extract and load phases are fast because business logic is not applied while the data is being fetched; the dataset is transformed according to requirements in later stages.
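Here is a hedged sketch of this “load first, think later” pattern in Python, assuming pandas and pyarrow are installed; the lake path, columns, and sample events are hypothetical.

```python
# "Load first, think later": land raw events in the lake as-is, then apply
# schema and business logic only when the data is read back for analysis.
import pandas as pd

# Load: dump raw events into the lake, partitioned for faster retrieval.
raw_events = pd.DataFrame({
    "event_type": ["click", "view", "click"],
    "user_id": ["u1", "u2", "u1"],
    "ts": ["2024-03-01T10:00:00", "2024-03-01T10:05:00", "2024-03-02T09:30:00"],
})
raw_events.to_parquet("lake/events", partition_cols=["event_type"])

# Think later: read only what an analysis needs and transform it at that point.
clicks = pd.read_parquet("lake/events", filters=[("event_type", "=", "click")])
clicks["ts"] = pd.to_datetime(clicks["ts"])
print(clicks.groupby(clicks["ts"].dt.date).size())
```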
Data Lake Challenges:
There are downsides to data lakes that can potentially turn them into data swamps.
What is a Data Swamp?
A data swamp is a disorganized and poorly managed data lake.
Data becomes disorganized when it is messy, duplicated, and lacking in structure.
Poor governance brings a lack of security and poor data quality.
Querying and analyzing data from a data swamp is slow and inefficient.
Data swamps make it hard to get valuable insights.
For these reasons, traditional data lakes on their own are not sufficient to meet the needs of businesses looking to innovate, which is why businesses often operate complex architectures with data siloed away in different storage environments. Simplifying that architecture by unifying data in a data lake is the first step for companies that aspire to harness the power of ML and AI. Data lakes are still evolving.
How Do Data Lakehouses Solve These Challenges?
Data lakehouses are the most recent open architecture, designed to combine the best elements of data lakes and data warehouses by adding table metadata to files in object storage.
This added metadata gives data lakes additional features, including time travel, ACID transactions, better pruning, and schema enforcement.
Lakehouses use data structures and data management features similar to those in data warehouses, but run them directly on cloud data lakes. Ultimately, data lakehouses allow traditional analytics, data science, and machine learning to coexist in the same system, all in an open format.
In short, lakehouses answer the challenges of data lakes by adding a transactional storage layer on top.
The added metadata layer brings time travel, ACID transactions, and schema enforcement: features that are typical of data warehouses but generally lacking in data lakes.
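As a hedged sketch of what these table-metadata features look like in practice, here is a minimal example using the open-source `deltalake` (delta-rs) Python package as one possible lakehouse table format; the table path and data are hypothetical.

```python
# Sketch of ACID writes, schema enforcement, and time travel on a lakehouse table.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "lake/orders"  # hypothetical table location

# Each write is an ACID transaction recorded in the table's transaction log.
write_deltalake(path, pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.0]}))
write_deltalake(path, pd.DataFrame({"order_id": [3], "amount": [40.0]}), mode="append")

# Schema enforcement: appending data with a mismatched schema is rejected.
# write_deltalake(path, pd.DataFrame({"wrong_column": [1]}), mode="append")  # would raise

# Time travel: read the table as it looked at an earlier version.
print(DeltaTable(path, version=0).to_pandas())   # only the first write
print(DeltaTable(path).to_pandas())              # current state
```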
Storing data in an open table format can be greatly beneficial for interoperability, but it can also introduce tool and version compatibility challenges.
Data analysts can harvest rich insights by querying the data lake using SQL (see the sketch after this list).
Data scientists can join and enrich datasets to generate ML models with greater accuracy.
Data engineers can build automated ETL pipelines.
Business Intelligence analysts can build visual dashboards and reporting tools faster and easier than before.
These use cases can be performed on the data lake simultaneously, without lifting and shifting the data.
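As one hypothetical example of the SQL use case above, DuckDB can run SQL directly over Parquet files sitting in a lake; the path and columns below are invented for illustration.

```python
# Query Parquet files in the lake with plain SQL from Python using DuckDB.
import duckdb

top_regions = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM 'lake/sales/*.parquet'
    GROUP BY region
    ORDER BY total DESC
    LIMIT 5
""").df()
print(top_regions)
```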
Products like Microsoft Azure, Google Cloud, and AWS allow you to build such a data lakehouse very efficiently using cloud technology. Snowflake, Delta Lake, Hadoop, and Databricks are a few examples of technologies that enable these solutions.
Cloud platforms: Microsoft Azure, Google Cloud, AWS
Data lakehouse solutions: Snowflake, Delta Lake, Hadoop, Databricks, Amazon Redshift, Google BigQuery
Challenges with Data Lakehouses
Data governance has again become a challenge with data lakehouses, because data warehouse and data lake workloads must be managed simultaneously while also ensuring high quality in metadata management. As a relatively new concept, data lakehouses are still under development.
Snowflake's Data Cloud
Snowflake's Data Cloud can be used to build various architecture patterns that align with the needs of different use cases. It gives customers the ability to read and write data in cloud object storage, functioning as a data lake query engine.
Regardless of the pattern, Snowflake adheres to its core tenets of strong security, governance, performance, and simplicity.
A flexible platform like Snowflake supports traditional business intelligence tools as well as newer, more advanced technologies devoted to artificial intelligence, machine learning, data science, and applications.
It's a single platform that can be used to power multiple types of workloads.
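As a rough sketch of querying such a platform from Python, here is a minimal example using the snowflake-connector-python package; the account, credentials, warehouse, database, and table names are placeholders, not real values.

```python
# Minimal sketch: run one analytical query through the Snowflake Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # placeholder
    user="my_user",          # placeholder
    password="my_password",  # placeholder
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)
finally:
    conn.close()
```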
Conclusion:
Understanding these technologies and their respective advantages is crucial to developing modern data strategies that meet the diverse needs of today's data-driven businesses.
By leveraging the strengths of data warehouses, data lakes, and data lakehouses, organizations can optimize their data management processes, gain valuable insights quickly, and drive innovation in an increasingly data-driven world.