Database, Data Warehouse and Data Lake

We are generating crazy amount of data every single day and businesses are turning to this data for decision making. Storing this vast ocean of data is very important. Data Storage is a big deal!


In this blog, let’s understand what hot and cold data storage mean and then later let’s explore Database, Data Warehouse and Data lake.


Also, I suggest you read this article to understand Structured data, Unstructured data and Semi-structured data before we proceed.


Hot Storage

This kind of storage is used when data needs to be accessed right away. Hot data is typically business critical information which is used for quick decision making.

Hot storage is most expensive since hot data needs to be accessed frequently.


Cold Storage

On the other hand, cold storage is meant for data that is rarely used. Cold data mostly represents archived files and infrequently accessed project documents.

Purpose for cold storage could be for simple record keeping, compliance or for legal purposes. Data retrieval is much slower compared to Hot storage and hence less expensive.


Now let’s move onto Database, Data Warehouse and Data lake.


Database

Database is a storage location of structured data. Databases are of 2 kinds depending on how the data is stored,

  • Relational Database, which stores data in tables

  • Object Oriented database, which stores data in object classes

Database system follows the ACID compliance (Atomicity, Consistency, Isolation, and Durability). But what is ACID compliance? Let’s take a look.


Atomicity: Database transactions can be broken down into smaller parts like an atom. If one transaction doesn’t work then the other will fail as a result and vice versa. With atomicity, it’s either ‘all or nothing’.

Consistency: For a Database to operate it must follow appropriate data validation rules. Here consistency means data which follows those rules is permitted to be written to the database. If those rules are not followed, it will be rolled back to the previous state which complies with the rules. This ensures all data will be consistent.

Isolation: For a database, isolation refers to the ability to concurrently process multiple transactions in a way that one does not affect another. This guarantees that all transactions will occur in isolation.

Durability: Durability means once the transaction is committed, it will remain in the system. Databases that possess durability will ensure the data is saved even if a power outage or system failure occurs.


Popular Databases are MangoDB, PostgreSQL, Oracle, Apache Cassandra amongst many others.

Use cases: Reporting, analysis and automation


Data Warehouse

Data warehouse is a blend of technologies and components which collects huge amount of data from various sources (internal and external) to gather business insights. Data in Data Warehouse is usually structured (from relational database) but it can be unstructured too.


Popular Data Warehouses are Snowflake, Google Big Query, Amazon redshift, Teradata.

Use cases: Business professionals, Analysts and Operational users.


Data Lake

Data lake is a huge storage repository housing a large amount of raw data in its original form until we need it. It stores all types of data - structured, semi-structured, or unstructured.

Storing data in data lakes is much cheaper than in a data warehouse. Well, Data lakes are more flexible and cheaper compared to Data warehouse. However, they are not a replacement for data warehouses or relational databases.


Popular Data Lakes are Hadoop, Google Cloud Storage, Azure, Amazon S3.

Use cases: Data science and research


Note: When your data lake gets messy and is unmanageable, it becomes a Data swamp.


Let’s use an analogy of a Lake and bottled water in a supermarket to understand this concept better.


Water in a sealed bottle available in a supermarket is purified and packaged for consumption. This is similar to Database which stores only structured data and has a business requirement.


Supermarket sells different brands of bottled water in different shapes and sizes (in small cases and bulk). This is similar to Data Warehouse which is a blend of different technologies and stores huge data from various sources.


Lake is a huge water body. Water in Lakes comes from rain, snow, streams, groundwater seepage among others. Similarly, Data Lake holds data in its rawest form from all data sources.


Lakes are sources of food, has agricultural purposes, means of transportation, and are also a tourist attraction. Similarly, Data Lake architecture serves multiple purposes. It provides a mode to the business users and stakeholders to explore and derive actionable insights. A data scientist can use it to explore the data and discover hidden patterns and further use it for research purpose.


Now let’s compare Database, Data Warehouse and Data Lake.

  • Database and Data Warehouses can only store data that has been structured – Data that is processed and refined. A data lake, stores all types of data: structured, semi-structured, or unstructured – Data is in raw form – Some of the data might never be used.

  • Database is designed to record data whereas the Data warehouse is designed to analyze data. Data lake is like a dump.

  • Database and data Warehouse is less agile and has fixed configuration. Whereas Data Lake is highly agile - Configure and reconfigure is possible when required.

  • Database uses Online Transactional Processing (OLTP) whereas Data warehouse uses Online Analytical Processing (OLAP).

  • Database tables and joins are complicated because they are normalized whereas Data Warehouse tables and joins are easy because they are denormalized.

  • Data which is really required to make business decisions is kept on the Data Warehouse since it gets expensive for large data volumes. Data Lake is designed for low-cost storage. Data Lake is fairly economical and highly scalable.

  • All of them can handle hot and cold data, but cold data is usually best suited in Data lakes.

Conclusion


Typically, an organization will require a Data lake, Data warehouse and Database(s) for different use cases. As the amount of unstructured data rises, Data lake is becoming increasingly popular. But there will always be an essential place for Databases and Data warehouses.


60 views0 comments

Recent Posts

See All