Kafka with Apache Flink

Maya Horak
31 August 2023

What is Apache Flink? 

Apache Flink is an open-source stream processing framework built for high-performance, scalable cluster computation. It runs self-contained streaming computations that can be deployed on resource managers such as YARN, Mesos, or Kubernetes. Flink jobs consume streams and output data into streams, databases, or the stream processor itself. Flink can handle tens of millions of events per second in moderately sized clusters, with latencies as low as a few tens of milliseconds. With supported sources and sinks, Flink guarantees exactly-once semantics for application state and for end-to-end delivery, producing accurate results.

Flink is based on a cluster architecture with master and worker nodes. It is HA-ready and can be deployed standalone or with resource managers. This architecture allows Flink to use a lightweight checkpointing process to ensure exactly-once results in case of failures, and it enables accurate re-processing through savepoints without impacting latency or throughput. In addition to its stream processing (DataStream API) and batch processing (DataSet API) capabilities, Flink offers various higher-level APIs and libraries such as CEP (Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (machine learning), and Gelly (graph processing). Flink provides expressive APIs, advanced operators, and low-level control, and it scales well in stateful applications, even for relatively complex streaming JOIN queries.
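
To give a rough feel for the DataStream API mentioned above, here is a minimal PyFlink sketch; the sample values and job name are made up, and a real job would read from a connector instead of a small collection:

    from pyflink.datastream import StreamExecutionEnvironment

    # Set up the streaming environment (runs a local mini-cluster when executed standalone).
    env = StreamExecutionEnvironment.get_execution_environment()

    # A small bounded example stream; in a real job this would come from a source connector.
    events = env.from_collection([1, 2, 3, 4, 5])

    # A simple stateless transformation followed by a print sink.
    events.map(lambda value: value * 2).print()

    env.execute("datastream-sketch")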

What is Apache Kafka? 

Apache Kafka is an open-source, distributed streaming platform for building real-time streaming data pipelines and applications. It is highly scalable, fault-tolerant, and very fast. Kafka is used by thousands of companies for a variety of use cases, including real-time data processing, stream processing, messaging, and data integration. Kafka is designed to be highly available and is sometimes described as a message broker, although it is far more than that. Kafka clusters used to be orchestrated by Apache ZooKeeper, but since version 3.0 they can run self-managed using KRaft mode.

Kafka allows applications to exchange data between producers and consumers using a publish-subscribe model. It is horizontally scalable and supports replication, meaning data can be replicated across multiple nodes. Kafka is designed to provide high throughput and low latency at the same time, which makes it an excellent choice for use cases such as log aggregation, stream processing, and data integration. Kafka provides APIs that allow applications to interact with the cluster: producers publish data to topics and the brokers store that data until a consumer requests it, while consumers subscribe to topics and receive data as it is published. For processing data streams, Kafka offers the Kafka Streams API for building data pipelines, and the wider ecosystem provides ksqlDB for querying streams with SQL. This makes Kafka an excellent choice for building data pipelines and applications that require real-time data processing.
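
To make the publish-subscribe model concrete, here is a minimal sketch using the third-party kafka-python client; the broker address, topic, and record contents are assumptions for the example:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish a record to a topic on a local broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", key=b"sensor-1", value=b"temperature=21.5")
    producer.flush()

    # Consumer: subscribe to the same topic and read records as they arrive.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="demo-group",
        auto_offset_reset="earliest",
    )
    for record in consumer:
        print(record.key, record.value)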

Flink & Kafka – Powerful Couple 

Apache Kafka and Apache Flink together form a powerful combination for event streaming and data processing.

One major challenge that Apache Kafka and Apache Flink help to solve is processing data from multiple sources in real time. Apache Kafka can ingest data from more than one source, and Apache Flink can then process that data in a distributed system. This allows real-time processing of data from multiple sources, without having to wait for the data to be processed in a batch. Another challenge the combination addresses is processing data at scale.

From a Kafka-centric point of view, Flink is an alternative to Kafka's own processing library, Kafka Streams. From Flink's perspective, on the other hand, Kafka is a storage layer: Flink writes the results of its stream processing to a Kafka cluster to store them and make them accessible to consumers in a flexible way. Likewise, Kafka is perfectly suited for data ingestion into the Flink cluster in the first place.
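
A minimal PyFlink sketch of this Kafka-in, Kafka-out pattern might look as follows, assuming the Kafka connector JAR is available to the environment; the broker address and topic names are placeholders:

    from pyflink.common import WatermarkStrategy
    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors.kafka import (
        KafkaSource, KafkaOffsetsInitializer, KafkaSink, KafkaRecordSerializationSchema)

    env = StreamExecutionEnvironment.get_execution_environment()

    # Kafka as the ingestion layer: read strings from an input topic.
    source = (KafkaSource.builder()
              .set_bootstrap_servers("localhost:9092")
              .set_topics("input-topic")
              .set_group_id("flink-demo")
              .set_starting_offsets(KafkaOffsetsInitializer.earliest())
              .set_value_only_deserializer(SimpleStringSchema())
              .build())

    stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")

    # Kafka as the storage layer for results: write processed strings to an output topic.
    sink = (KafkaSink.builder()
            .set_bootstrap_servers("localhost:9092")
            .set_record_serializer(
                KafkaRecordSerializationSchema.builder()
                .set_topic("output-topic")
                .set_value_serialization_schema(SimpleStringSchema())
                .build())
            .build())

    stream.map(lambda line: line.upper()).sink_to(sink)
    env.execute("kafka-to-kafka")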

Stream processing is a paradigm that continuously correlates events from one or more data sources. Data is processed in motion, in contrast to traditional processing at rest with a database and a request-response API. Stream processing is either stateless or stateful. If single values are transformed during processing, it is considered stateless. If processing involves sliding windows or aggregations, on the other hand, we call it stateful. State management in particular is very challenging in a distributed stream processing application.
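
To illustrate the difference, here is a small PyFlink sketch: the map step is stateless, while the keyed, windowed sum requires Flink to manage state. The sample records and the window size are made up for the example:

    from pyflink.common import Time, Types
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.window import TumblingProcessingTimeWindows

    env = StreamExecutionEnvironment.get_execution_environment()

    readings = env.from_collection(
        [("sensor-1", 3), ("sensor-2", 5), ("sensor-1", 7)],
        type_info=Types.TUPLE([Types.STRING(), Types.INT()]))

    # Stateless: each record is transformed on its own, no state is kept.
    doubled = readings.map(
        lambda r: (r[0], r[1] * 2),
        output_type=Types.TUPLE([Types.STRING(), Types.INT()]))

    # Stateful: a per-key aggregation over a window requires managed state.
    per_sensor_sum = (doubled
                      .key_by(lambda r: r[0])
                      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                      .reduce(lambda a, b: (a[0], a[1] + b[1])))

    per_sensor_sum.print()
    env.execute("stateless-vs-stateful")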

A major advantage of the Flink engine is its efficiency in stateful applications.

Apache Flink is a separate infrastructure from the Kafka cluster. This has various pros and cons. Obvious disadvantages are the need to maintain two platforms and the higher risk of failure that comes with running additional infrastructure.

The benefit is that teams can deal with data processing logic, code, and problems without interference from the streaming platform's logic, code, and problems. Furthermore, when Flink is used for data processing, the Kafka cluster isn't hit again until the processing of a message is finished. Another advantage of the Kafka-Flink architecture is the possibility to merge topic streams from different Kafka clusters.

If you’re already used to Python, PyFlink will help you get into Flink quite fast. With its help, you can easily build robust batch and streaming workloads, including real-time data processing pipelines, big-data exploration, machine learning pipelines, and ETL processes. With familiar Python libraries such as Pandas, you can enjoy the full potential of Flink’s ecosystem.
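
As a small taste of that integration, here is a sketch that moves a Pandas DataFrame into a PyFlink table, runs a query on it, and converts the result back; the column names and values are made up for the example:

    import pandas as pd
    from pyflink.table import EnvironmentSettings, TableEnvironment
    from pyflink.table.expressions import col

    # A batch Table environment is enough for this small example.
    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

    # Turn an ordinary Pandas DataFrame into a Flink table ...
    pdf = pd.DataFrame({"name": ["a", "b", "a"], "value": [1, 2, 3]})
    table = t_env.from_pandas(pdf)

    # ... aggregate it with the Table API, and bring the result back to Pandas.
    result = table.group_by(col("name")).select(col("name"), col("value").sum.alias("total"))
    print(result.to_pandas())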

Kafka Streams vs Apache Flink

Kafka Streams is a Java library while Flink is a separate cluster infrastructure.

Apache Flink is more complex to deploy and manage. Whether you should go with the lightweight Streams library or with clustered processing depends on the problem your team needs to solve and the dimensions of your data.

Some architectural aspects are very different in Kafka Streams and Flink. These need to be understood and can be a pro or a con for your use case. For instance, Flink’s checkpointing has the advantage of producing a consistent snapshot, but the disadvantage that any local error stops the whole job, and everything has to be rolled back to the last checkpoint. Kafka Streams does not have this concept: local errors can be recovered locally (the affected tasks are moved somewhere else, while the tasks and threads without errors just continue normally). Another example is Kafka Streams’ hot standby for high availability versus Flink’s fault-tolerant checkpointing system.
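
For context, enabling Flink's checkpointing from PyFlink is a one-liner plus some optional tuning; the interval and pause values below are arbitrary example settings, not recommendations:

    from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

    env = StreamExecutionEnvironment.get_execution_environment()

    # Take a consistent snapshot of all operator state every 10 seconds.
    env.enable_checkpointing(10_000)

    # Exactly-once is the default mode; it is made explicit here for clarity.
    env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

    # Require at least 5 seconds of progress between two checkpoints.
    env.get_checkpoint_config().set_min_pause_between_checkpoints(5_000)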

Conclusion

The freedom to choose between these two leading open-source technologies, and the tight integration of Kafka with Flink, enable any kind of stream processing use case. This includes hybrid, global, and multi-cloud deployments, mission-critical transactional workloads, and powerful analytics with embedded machine learning. As always, understand the different options and choose the right tool for your use case and requirements.