A data pipeline can be thought of as a means of transporting data from one or more sources to a target or destination. The data can be modified along the way based on requirements, but transformation is not mandatory in a data pipeline.
Data pipelines perform data integration: they collect data from multiple sources, transform it, and load it into a target, producing data that is ready for analytics. Consider an employee dataset where the company wants to derive a few insights. The employee data is stored in different sources, such as personal information, performance, salary, and department. To analyze the data and draw insights, we need it combined into a single dataset. A data pipeline gathers the required data from these different sources, applies the data manipulations required by the business during transformation, and loads only the required columns into the target. For example, if the personal information source has the columns FirstName, LastName, Email, Address, PhoneNumber, Age, and MaritalStatus, but the requirement calls only for FirstName, LastName, Age, and MaritalStatus, the pipeline can select just those columns and move them to the target.
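The employee example above can be sketched with pandas: two sources are joined on a shared key, and only the required columns are kept. The source names, columns, and values here are illustrative, not taken from any real system.

```python
import pandas as pd

# Hypothetical employee sources; names and values are illustrative only.
personal = pd.DataFrame({
    "EmployeeId": [1, 2],
    "FirstName": ["Asha", "Ravi"],
    "LastName": ["Rao", "Kumar"],
    "Email": ["asha@example.com", "ravi@example.com"],
    "Age": [29, 34],
    "MaritalStatus": ["Single", "Married"],
})
salary = pd.DataFrame({"EmployeeId": [1, 2], "Salary": [55000, 62000]})

# Join the sources on a common key, then keep only the required columns.
combined = personal.merge(salary, on="EmployeeId")
result = combined[["FirstName", "LastName", "Age", "MaritalStatus"]]
```

After the join, `result` contains just the four requested columns, which is the dataset that would be loaded into the target.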
Data Pipeline Architecture:
The main steps in a data pipeline are as follows. Data is collected from the sources; only the specific data needed can be collected, and data from different sources can be combined using join operations. Standardization can be applied to ensure data coming from different sources is in the same format. Any errors or duplicates in the data can be corrected during transformation. The cleansed data is then loaded into a database, a data warehouse, or another target. Data pipelines can also be automated to run on a schedule. Once the transformed data reaches the target, it can be used for analysis and to derive the required insights. In simple terms, data pipeline architecture involves a Source, Transformations, and a Target.
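The Source, Transformations, Target flow described above can be sketched in a few lines of Python. This is a minimal illustration using in-memory lists; a real pipeline would read from files, APIs, or databases.

```python
# Minimal sketch of the Source -> Transformations -> Target flow.

def extract(source):
    """Source step: collect the raw records."""
    return list(source)

def transform(records):
    """Transformation step: standardize format and remove duplicates."""
    cleaned = [r.strip().lower() for r in records]   # standardization
    return list(dict.fromkeys(cleaned))              # de-duplication, order kept

def load(records, target):
    """Target step: write the cleansed records to the destination."""
    target.extend(records)
    return target

raw = ["  HR ", "Sales", "hr", "Sales "]
warehouse = []
load(transform(extract(raw)), warehouse)
# warehouse now holds ["hr", "sales"]
```

Whitespace and casing differences are standardized away, duplicates are dropped, and only the cleansed records land in the target.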
Data Pipeline Processes:
ETL, Data Replication, Data Virtualization.
1. ETL: ETL stands for Extract, Transform, Load. The data is extracted from the source, the required transformations are performed, such as ensuring the data is in a good format or removing duplicates, and then the data is loaded into the target. ETL can run as a batch process or a stream process.
Batch Processing: Large volumes of data are loaded into the repository in batches at scheduled intervals. In a batch process, jobs run as a sequence of commands in which the output of one job becomes the input of the next. The sequence ends when all transformations are complete, and the data is then loaded into the repository.
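The job chaining described above can be illustrated as follows: each batch passes through a sequence of jobs, with one job's output becoming the next job's input. The job names and record format are hypothetical.

```python
# Sketch of batch processing: jobs run in sequence over each batch,
# and results are loaded only after all transformations complete.

def parse_job(batch):
    """Job 1: parse raw comma-separated lines into rows."""
    return [line.split(",") for line in batch]

def filter_job(rows):
    """Job 2: keep only rows whose status field is 'active'."""
    return [row for row in rows if row[1] == "active"]

def run_batch_pipeline(batches):
    results = []
    for batch in batches:            # one scheduled interval per batch
        rows = parse_job(batch)      # job 1 output ...
        rows = filter_job(rows)      # ... becomes job 2 input
        results.extend(rows)         # load step after transformations finish
    return results

batches = [
    ["1,active", "2,inactive"],
    ["3,active"],
]
loaded = run_batch_pipeline(batches)
# loaded == [["1", "active"], ["3", "active"]]
```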
Stream Processing: Also called real-time processing. Stream processing handles data extracted from streaming or real-time sources; the target is updated as soon as new data arrives, rather than at scheduled intervals.
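The record-at-a-time behavior of stream processing can be sketched with a Python generator standing in for a real-time source such as a message queue. The event fields here are hypothetical.

```python
# Sketch of stream processing: each record is transformed and delivered
# to the target as soon as it arrives, instead of waiting for a full batch.

def event_stream():
    """Stand-in for a real-time source (e.g., a message queue)."""
    for event in [{"user": "a", "clicks": 1}, {"user": "b", "clicks": 3}]:
        yield event

target = []
for event in event_stream():                        # events arrive one at a time
    event["clicks_doubled"] = event["clicks"] * 2   # transform in flight
    target.append(event)                            # target updated immediately
```

The key contrast with batch processing is that the target grows with every incoming event; there is no accumulation step before loading.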
2. Data Replication: In data replication, a copy of the source data is created in a repository, and that copy is used as the source data for building the data pipelines.
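A minimal sketch of the idea: the pipeline works against a deep copy of the source, so transformations never touch the original data. The record contents are illustrative.

```python
import copy

# Sketch of data replication: a copy of the source is kept in a repository,
# and the pipeline reads from the copy so the original is never modified.

source = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]
replica = copy.deepcopy(source)      # replication step

# The pipeline transforms the replica; mutations do not affect the source.
for row in replica:
    row["name"] = row["name"].upper()

# source[0]["name"] is still "Asha"; replica[0]["name"] is "ASHA"
```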
3. Data Virtualization: In data virtualization, the data is not physically moved; users can view and analyze it through a virtual layer instead of accessing the original source directly.
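One way to picture virtualization is a "view" function that reads the live sources on demand and combines them, without copying anything into a repository. The source names and fields here are hypothetical.

```python
# Sketch of data virtualization: a virtual "view" fetches and combines
# data from the sources at query time; nothing is moved or stored.

employees = {1: {"name": "Asha"}, 2: {"name": "Ravi"}}
salaries = {1: 55000, 2: 62000}

def employee_view(emp_id):
    """Virtual view: reads the live sources each time it is queried."""
    emp = employees[emp_id]
    return {"name": emp["name"], "salary": salaries[emp_id]}

record = employee_view(1)
# record == {"name": "Asha", "salary": 55000}
```

Because the view queries the sources at call time, any change in a source is reflected the next time the view is queried, with no replication step.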
Use Cases of Data Pipelines:
The transformed data delivered by data pipelines provides the dataset the business requires, which can be used in business intelligence for analytics, reporting, data and business insights, exploratory data analysis, machine learning, and more.