Every company works with large data sets on a daily basis, be it for triggering production chains, sending order confirmations or following up on existing contracts. Data also plays an important role in internal processes, especially in the area of human resource management.
Data management is one of the core disciplines of IT. Companies run a large number of applications, databases and other information sources, and for that very reason these systems must be able to exchange information with each other. More and more companies are turning to data pipelines to unleash the potential of their data as quickly as possible and meet the needs of their customers.
As the name suggests, a data pipeline acts as a "pipeline system" for data: a methodology for moving data from one system to another. In many organizations, these pipelines are the foundation of data-driven work in IT.
Essentially, when data is moved from the source system to the target system, it passes through a series of steps, which we describe in more detail in our blog post "How does a data pipeline work?". Different types of data pipelines are available to perform these steps.
To achieve the goal of data integration, two main types of data pipelines are most commonly used: batch processing and streaming data.
Batch processing
Batch processing is an important part of creating a reliable and scalable data infrastructure.
Batch processing, as the name implies, involves loading "batches" of data into a repository at specified time intervals. These intervals are typically scheduled outside peak business hours, because the large volume of data in a batch job could otherwise negatively impact other workloads. Batch processing is the optimal method for data pipelines when there is no immediate need to analyze a specific data set, for example in monthly accounting. It is closely associated with the extract, transform, and load (ETL) data integration process.
Batch jobs are an automated workflow of sequentially ordered commands, in which the output of one command becomes the input of the next.
For example, one command starts data ingestion, the next command filters specific columns, and the command after that handles aggregation. This series continues until the data is fully transformed, as the sketch below illustrates.
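To make this chain of commands more concrete, here is a minimal sketch in Python using pandas. The file names, column names and aggregation are purely illustrative assumptions, not part of any specific product; the point is only that each step's output feeds the next step.

```python
# Minimal sketch of a sequential batch job, assuming pandas and an
# illustrative CSV export; file and column names are hypothetical.
import pandas as pd


def ingest(path: str) -> pd.DataFrame:
    """Step 1: load the raw batch from the source system."""
    return pd.read_csv(path)


def filter_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: keep only the columns needed downstream."""
    return df[["order_id", "region", "amount"]]


def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: aggregate order amounts per region."""
    return df.groupby("region", as_index=False)["amount"].sum()


def load(df: pd.DataFrame, path: str) -> None:
    """Step 4: write the transformed batch to the target system."""
    df.to_csv(path, index=False)


if __name__ == "__main__":
    # Each command's output is the next command's input. In practice this
    # script would be scheduled outside peak hours, e.g. as a nightly job.
    raw = ingest("orders_export.csv")
    trimmed = filter_columns(raw)
    summary = aggregate(trimmed)
    load(summary, "orders_by_region.csv")
```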
Streaming data
Streaming data is the method of choice when data needs to be updated continuously. Real-time data is required especially in areas where apps or point-of-sale systems are in use.
Example: A company wants to keep the stock and sales history of its products up to date so that salespeople can tell customers whether a product is in stock. Here, a single action, such as a product sale, is considered an "event", and related events, such as adding an item to checkout, are typically grouped into a "topic" or "data stream". To stream these events, messaging systems or message brokers such as the open source Apache Kafka solution are used, as in the sketch below.
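As a rough illustration of what publishing such an event can look like, the following sketch assumes a local Apache Kafka broker and the kafka-python client library; the topic name and event fields are hypothetical and only stand in for the sale example above.

```python
# Sketch of publishing a sale "event" to a Kafka "topic", assuming a broker
# on localhost:9092 and the kafka-python client; topic and fields are
# hypothetical.
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

sale_event = {
    "event_type": "product_sale",
    "product_id": "SKU-12345",
    "quantity": 1,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Related events (sales, checkout additions, stock changes) would share the
# same topic / data stream, so consumers can update stock in near real time.
producer.send("product-sales", value=sale_event)
producer.flush()
```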
Stream processing systems have lower latency than batch processing systems and are therefore more commonly used to process data events shortly after they occur.
A wide variety of tools can be integrated into a data pipeline, for example to anonymize data. Our blog post "Anonymized data in the data pipeline" provides two practical examples that explain the advantages of seamlessly integrating Libelle DataMasking in more detail.