Mastering Apache Kafka Monitoring: Key Metrics and Tools for Reliable Platform Performance 

Felix Schneider
24 January 2025
Reading time: 18 min

Imagine a critical e-commerce platform that relies on Apache Kafka to process orders, track inventory, and manage real-time notifications. Suddenly, a bottleneck in one of the Kafka brokers causes delays, leading to missed notifications and frustrated customers. Would you consider purchasing another item from that online store? 

This is where effective monitoring comes in. Monitoring a Kafka platform is vital to ensure seamless operations, identify performance issues, and prevent downtime in such high-stakes environments. In this blog post, we’ll explore how to monitor each key component of the Kafka ecosystem—from brokers and ZooKeeper to Kafka Connect, Schema Registry, and more—using the right tools and strategies. By the end, you’ll understand how to keep your Kafka platform running smoothly and reliably. 

The Importance of Monitoring: Ensuring Seamless Operations 

Monitoring a Kafka platform is essential for maintaining a robust and efficient event-based system. One of the primary reasons is to ensure system reliability. Without proper monitoring, issues like broker failures, high latencies, or message loss can go unnoticed, leading to disruptions in critical workflows. By proactively identifying and resolving these issues, you can keep your system running smoothly. 

Another vital reason is to prevent data loss. Kafka handles vast amounts of data, and any misconfiguration, resource exhaustion, or hardware failure can lead to lost messages. Monitoring the health of producers, consumers, and brokers ensures that data flows reliably through the system. Alongside data loss prevention, monitoring allows you to optimize performance by tracking metrics like throughput, consumer lag, and partition distribution. This enables you to fine-tune Kafka to handle high traffic and data loads effectively. 

When problems do occur, monitoring enables you to troubleshoot issues quickly. Real-time alerts and log insights help diagnose problems such as partition under-replication, unresponsive topics, or ZooKeeper connectivity issues, minimizing downtime. Additionally, monitoring supports scalability, helping you identify bottlenecks, resource contention, or partition imbalances as your Kafka deployment grows. 

For organizations with strict service-level agreements (SLAs), monitoring ensures compliance by tracking critical metrics such as message delivery time, uptime, and overall system health. This also aids in capacity planning, allowing you to forecast resource needs like storage, memory, and CPU to avoid overuse or underutilization of infrastructure. 

Monitoring also plays a key role in security and compliance, detecting unauthorized access or unusual patterns that may indicate security breaches. By keeping a close eye on system activity, you can maintain compliance with regulatory requirements and safeguard sensitive data. Moreover, monitoring provides insights into usage patterns, offering valuable data to improve system performance and inform business decisions. 

Finally, monitoring supports high availability, a critical feature for distributed systems like Kafka. With proper monitoring, you can ensure swift failover and recovery during node failures, maintaining uninterrupted service. 

In summary, monitoring your Kafka platform is not just about maintaining system health—it’s about building a resilient, scalable, and secure infrastructure that supports your organization’s needs. 

Essential Tools for Effective Kafka Monitoring 

Whether you’re using the Confluent Platform or managing your own Kafka deployment, it’s highly recommended to use Prometheus for collecting metrics across your platform components. Prometheus is a powerful open-source monitoring and alerting toolkit designed for reliability and scalability. A Prometheus exporter is a component that collects metrics from a specific system or service and exposes them in a format that Prometheus can scrape and store. With Prometheus, you can also define alerting rules to be notified when thresholds are crossed or anomalies occur. Confluent ships with the JMX exporter, which, combined with the KMinion exporter, gathers a comprehensive set of metrics; Strimzi includes the Kafka Exporter to enhance metric collection for your clusters. Our suggestion is to pick whichever of these exporters (JMX, KMinion, or Kafka Exporter) best fits your setup.

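To make the exporter idea concrete, here is a minimal sketch of the text exposition format that these exporters emit and Prometheus scrapes. The metric name and labels are illustrative examples, not taken verbatim from any specific exporter:

```python
# Sketch of the Prometheus text exposition format an exporter produces.
# The metric name and labels below are illustrative.

def render_metric(name: str, help_text: str, samples: dict[str, float]) -> str:
    """Render one gauge in Prometheus text exposition format.

    `samples` maps a label string like 'topic="orders"' to a value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"

page = render_metric(
    "kafka_consumergroup_lag",  # illustrative metric name
    "Approximate lag of a consumer group per topic",
    {'group="checkout",topic="orders"': 42.0},
)
print(page)
```

Whatever exporter you choose, the output looks like this: one `# HELP`/`# TYPE` header per metric, followed by one sample line per label combination, served over HTTP for Prometheus to scrape.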
To visualize your metrics, Grafana is often the go-to tool. Why? Grafana is an open-source analytics and monitoring platform that allows you to create interactive and customizable dashboards. While it may have some usability challenges, once you’ve set up your initial dashboards and alerts, it provides reliable and stable monitoring for your platform. There are numerous pre-built dashboards available for all components of your Kafka platform, eliminating the need to create one from scratch. That’s great—less effort required on our part! 

Essential General Metrics to Monitor Across Your Kafka Platform 

Now that we’ve covered why monitoring your Kafka platform is essential and explored the tools you can use, let’s focus on the metrics you should be tracking. Before diving into component-specific metrics, let’s look at the foundational metrics that apply across all parts of the platform and are crucial to monitor for every component:

  • CPU: The components of your Kafka platform are resource-intensive and rely on sufficient processing power to handle workloads efficiently. Monitoring CPU usage helps prevent performance degradation caused by spikes or sustained high utilization. 
  • RAM: Memory is vital for caching and smooth operation across platform components. Insufficient RAM can lead to bottlenecks, forcing processes to rely on slower disk operations and reducing overall efficiency. 
  • Storage: Components such as brokers and connectors often require significant disk space for data storage. Monitoring storage usage ensures you avoid running out of space, which could result in system disruptions or data issues. 
  • I/O Usage: Input/output operations are fundamental for handling data across the platform. Monitoring I/O usage helps identify bottlenecks in disk or network operations, which could impact data throughput and platform responsiveness. 
  • JVM Metrics: Many Kafka platform components run on the Java Virtual Machine (JVM), making JVM metrics like garbage collection, heap memory usage, and thread counts critical to monitor. High garbage collection times or excessive memory usage can lead to performance issues and downtime if not addressed. 

By tracking these metrics, you can proactively address potential resource constraints and maintain the reliability and performance of your Kafka components. 
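
As a rough illustration, these general checks can be expressed as simple threshold rules. The thresholds below (80% CPU, 85% heap, 10% of wall time in GC) are illustrative starting points, not recommendations; tune them for your workload:

```python
# Illustrative threshold checks over the general metrics above.
# All thresholds are example values, not recommendations.

def resource_alerts(cpu_pct, heap_used_mb, heap_max_mb, gc_time_pct):
    """Return a list of alert strings for one set of samples."""
    alerts = []
    if cpu_pct > 80:
        alerts.append(f"high CPU: {cpu_pct:.0f}%")
    heap_pct = 100 * heap_used_mb / heap_max_mb
    if heap_pct > 85:
        alerts.append(f"high JVM heap: {heap_pct:.0f}%")
    if gc_time_pct > 10:  # percent of wall time spent in garbage collection
        alerts.append(f"excessive GC time: {gc_time_pct:.0f}%")
    return alerts

print(resource_alerts(cpu_pct=92, heap_used_mb=3500, heap_max_mb=4096, gc_time_pct=12))
```

In practice you would encode rules like these as Prometheus alerting rules rather than application code, but the logic is the same: compare a sampled metric against a threshold and fire when it is crossed.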

Apache Kafka: The Backbone of Your Event-Based System 

Next, let’s take a closer look at the key metrics for Kafka’s core components: brokers, consumers, and producers. The most important metrics for operating the Kafka cluster are as follows: 

  • Brokers Online: Monitoring the number of brokers that are online ensures that your Kafka cluster has sufficient resources to handle incoming traffic. A decrease in the number of online brokers could indicate potential failures or connectivity issues that can impact system availability and reliability. 
  • Active Controller: The active controller is responsible for managing partition leadership and overall cluster coordination, and there should be exactly one per cluster. Monitoring this metric ensures that exactly one active controller is available; a count of zero or more than one signals coordination problems that can lead to system inconsistencies or leadership issues.
  • Offline Partitions: If partitions go offline, they can become inaccessible, leading to data unavailability. Tracking offline partitions allows you to identify and resolve issues before they impact data accessibility or the consumer experience. 
  • Under Replicated Partitions: Kafka relies on replication to ensure fault tolerance and data availability. Monitoring under-replicated partitions helps identify situations where data is not adequately replicated, which could lead to data loss if a broker fails. 
  • Unclean Leader Elections: Unclean leader elections occur when a partition leader is chosen from replicas that are not fully in sync. This can result in data inconsistencies or loss. Monitoring for unclean leader elections helps ensure that data integrity is maintained and that leadership decisions are made from fully synchronized replicas. 
  • Consumer Lag: Consumer lag indicates the delay between when a message is produced and when it is consumed. High consumer lag can point to issues like slow consumers, insufficient resources, or network bottlenecks, and monitoring this metric helps ensure that consumers can keep up with incoming messages to avoid data processing delays. 

It’s also a good idea to set up alerting rules for these critical metrics. To enhance your system over time, particularly for addressing scalability issues, there are additional metrics that are worth monitoring: 

  • Message Conversions: Message conversion is a built-in broker feature that translates records between different message format versions when producers and consumers run on incompatible versions. While this preserves compatibility, the conversion (particularly down-conversion for older consumers) forces the broker to rewrite batches, adding CPU and memory overhead that can affect performance and latency. Monitoring this metric helps you spot unnecessary conversion overhead and eliminate it by aligning client versions.
  • Messages In: This metric shows the volume of messages being produced to the Kafka cluster. Monitoring messages in helps ensure that the system is not overwhelmed by excessive input, which can lead to performance degradation or resource exhaustion. It also assists in capacity planning. 
  • Bytes In / Out: Monitoring the amount of data being ingested (Bytes In) and consumed (Bytes Out) provides insights into overall data throughput. If either of these metrics spikes unexpectedly, it may indicate a resource bottleneck or network issues that could impact Kafka’s performance. 
  • Failed Fetch / Produce Requests: A high number of failed fetch or produce requests can indicate underlying issues with the brokers, network, or client configurations. Tracking these failures helps identify problems early, preventing data loss and ensuring smooth message flow between producers and consumers. 
  • Total Fetch / Produce Requests: Monitoring the total number of fetch and produce requests provides a clear view of the load on your Kafka brokers. It helps assess whether the brokers are efficiently handling the traffic and if any adjustments are needed to optimize throughput. 
  • Log Size: The size of Kafka logs directly affects storage capacity and data retention. Keeping track of log size helps you avoid running out of storage and allows you to manage data retention policies effectively. Large log sizes may also signal that messages are not being consumed quickly enough, potentially leading to lag and inefficiencies. 

By tracking these metrics, you gain better insights into the operational health of your Kafka core components, helping you identify potential performance bottlenecks and take proactive measures for scalability and optimization. 
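
The Bytes In and Log Size metrics feed directly into capacity planning. A back-of-the-envelope estimate multiplies ingest rate by retention and replication factor; the numbers and the 30% headroom factor below are illustrative assumptions, not sizing advice:

```python
# Back-of-the-envelope storage estimate from the throughput metrics above.
# All inputs, including the 1.3 headroom factor, are illustrative.

def required_storage_gb(bytes_in_per_sec, retention_days, replication_factor, headroom=1.3):
    """Cluster-wide disk needed to hold retained, replicated data."""
    seconds = retention_days * 24 * 3600
    raw = bytes_in_per_sec * seconds * replication_factor
    return raw * headroom / 1024**3

gb = required_storage_gb(bytes_in_per_sec=5_000_000, retention_days=7,
                         replication_factor=3)
print(f"~{gb:.0f} GiB across the cluster")
```

Comparing an estimate like this against the actual Log Size metric over time tells you whether retention settings and disk capacity still match your ingest rate.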

Kafka Bridge & REST Proxy: Bridging the Gap 

Now, let’s take a look at REST proxies for Kafka. Strimzi offers the Kafka Bridge, while Confluent provides its own REST Proxy. A REST proxy allows clients to interact with Kafka using simple HTTP requests, enabling communication with Kafka without the need for native Kafka clients. However, it is recommended to use native Kafka clients for better performance and more advanced features when possible. Here’s why monitoring the following metrics for REST proxies is important: 

  • HTTP Connections: Monitoring the number of active HTTP connections provides insights into the load on your REST proxy. A high number of connections can indicate increased traffic, which may require scaling or optimizations to handle the demand. Conversely, a drop in active connections might indicate connectivity issues or underutilization of resources. 
  • HTTP Status Codes: Tracking HTTP status codes helps identify the health and reliability of the REST proxy. A high rate of 4xx or 5xx status codes could indicate problems such as client errors, unauthorized access, or server failures, while 2xx codes indicate successful interactions. This metric allows you to quickly detect and address issues before they impact consumers or producers. 
  • Latency: Latency measures the time it takes for a request to be processed by the REST proxy and returned to the client. High latency can indicate performance bottlenecks, network issues, or resource constraints. Monitoring latency helps ensure smooth and responsive communication between clients and Kafka through the REST proxy, leading to a better overall user experience. 

By keeping an eye on these metrics, you can ensure the health and performance of your REST proxy and promptly address any issues that arise. 
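
The status-code metric is usually consumed as an error rate rather than raw counts. A sketch of that check, where the 5% threshold is an illustrative SLO, not a recommendation:

```python
# Error-rate check over HTTP status-code counts from a REST proxy.
# The counts and the 5% threshold are illustrative examples.

def error_rate(status_counts: dict) -> float:
    """Fraction of responses with a 4xx or 5xx status code."""
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    return errors / total if total else 0.0

counts = {200: 9_000, 404: 400, 503: 600}
rate = error_rate(counts)
print(f"error rate: {rate:.1%}", "ALERT" if rate > 0.05 else "ok")
```

Splitting the same check into separate 4xx and 5xx rates is often worthwhile, since client errors and server failures call for different responses.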

Schema Registry: Managing and Enforcing Data Standards 

Let’s shift focus to the Schema Registry. The Schema Registry is a centralized service that manages the schemas used to serialize and deserialize data in Kafka, ensuring that data is structured and validated consistently across producers and consumers. Just a few metrics are important to consider here: 

  • Active Connections: Monitoring the number of active connections to the Schema Registry helps ensure that the service is handling traffic efficiently. A sudden increase in active connections could indicate high demand or potential scaling needs, while a decrease might point to connectivity issues or underutilization of resources. 
  • Request Latency: Request latency tracks the time it takes for the Schema Registry to respond to a request, such as fetching or registering schemas. High latency can impact the performance of schema validation and storage, leading to delays in producer and consumer operations. Monitoring this metric helps identify potential performance bottlenecks and ensure timely schema operations. 
  • Registered Schemas: The number of registered schemas provides insights into the volume and complexity of the data being processed. A sharp increase in registered schemas might indicate changes in the system’s data structure or schema evolution. Tracking this metric helps manage the growth of schemas and ensures that the registry is performing well as the number of schemas increases. 
  • Schema Types: Monitoring the distribution of schema types (e.g., Avro, JSON) helps assess the variety of data formats being used in your Kafka platform. It provides insights into how data is being serialized and whether there are any issues related to specific formats. This metric can also inform decisions about standardizing or optimizing schema types across the system. 

By keeping track of these key Schema Registry metrics, you can ensure that your schema management processes remain efficient, responsive, and scalable, ultimately supporting smooth data flow and consistency across your Kafka platform. 
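
For request latency, averages hide tail behavior, so alerts are typically set on a high percentile such as p99. One common way to compute it from raw samples is the nearest-rank method; the latency values below are made up:

```python
# Nearest-rank percentile over raw latency samples, the kind of
# aggregation you would alert on for Schema Registry requests.
# Sample values are made up.

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100), at least 1
    return ordered[rank - 1]

latencies_ms = [4, 5, 5, 6, 7, 8, 9, 12, 40, 210]
print("p99:", percentile(latencies_ms, 99), "ms")
```

In a Prometheus setup you would get this from histogram metrics instead of raw samples, but the interpretation is the same: the p99 tells you the latency the slowest one percent of requests exceed.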

Kafka Connect: Integrating with the Wider Ecosystem 

Next, let’s dive into Kafka Connect. Kafka Connect is a tool for integrating Kafka with external systems, allowing for easy and scalable data import and export between Kafka and various data sources and sinks. Here are some metrics to consider when monitoring Kafka Connect: 

  • Cluster Online Nodes: Monitoring the number of online nodes in the Kafka Connect cluster helps ensure high availability and fault tolerance. A decrease in online nodes can indicate node failures or connectivity issues, affecting data integration processes. 
  • Task Count: The number of active tasks indicates how many parallel operations are being performed by Kafka Connect connectors. Monitoring task count helps assess the load and performance of the cluster, ensuring the system can handle the required data throughput. 
  • Task Failure: Tracking task failures is crucial for identifying issues in specific connectors or tasks. Frequent task failures can point to misconfigurations, resource limitations, or connectivity problems that need to be addressed to maintain smooth data integration. 
  • Task Status: Monitoring task status gives insights into whether tasks are running, paused, or in error states. This metric helps quickly identify stalled or misbehaving tasks that might impact data processing and need immediate attention. 
  • Task Read / Write: These metrics track how much data is being read from and written to external systems. Monitoring read/write operations helps evaluate data throughput and ensure that Kafka Connect is keeping up with data processing requirements without bottlenecks. 
  • Task Commit Time: Task commit time measures the time taken for Kafka Connect to commit data changes to external systems. High commit times can signal performance issues or inefficiencies in the connector, leading to delays in data synchronization. 
  • Task Commit Failure: This metric tracks failures in committing data, which can result in data loss or incomplete synchronization. Monitoring commit failures helps quickly detect and resolve issues that could jeopardize data integrity between Kafka and external systems. 
  • Task Batch Size: Task batch size reflects how much data Kafka Connect processes in a single operation. Monitoring this metric helps optimize throughput and resource utilization. Small batch sizes may increase overhead, while large ones could strain system resources or impact latency. 
  • Connector Count: The number of active connectors indicates the scale and scope of your data integration tasks. A sudden increase or decrease in connector count may require adjustments in the Kafka Connect cluster to ensure resources are adequately allocated. 
  • Connector Failure: Tracking connector failures helps identify systemic issues with specific connectors. Frequent failures can indicate problems with configurations, external systems, or Kafka Connect itself, requiring attention to ensure reliable data transfer. 
  • Connector Status: Connector status shows whether connectors are in a healthy state, paused, or in error. This metric is essential for identifying and troubleshooting issues, ensuring that all data integration tasks are running as expected. 
  • Connector Type: Monitoring the types of connectors being used (e.g., source, sink, or transformation) helps in understanding the diversity of data flows and detecting any issues related to specific types of integration tasks. It can also help with optimizing connector configurations based on their function. 

By monitoring these Kafka Connect metrics, you ensure the reliability, performance, and scalability of your data integration processes, enabling smooth and efficient data movement between Kafka and external systems. 
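
Task and connector states are also queryable on demand through the Kafka Connect REST API (the `/connectors/<name>/status` endpoint). The sketch below summarizes a hand-written example payload of that shape; it is not live output:

```python
# Summarize Kafka Connect task states from a status payload shaped like
# the Connect REST API's /connectors/<name>/status response.
# The payload below is a hand-written example.

from collections import Counter

def task_summary(statuses):
    """Count tasks per state and collect the ids of failed tasks."""
    states = Counter(t["state"] for t in statuses)
    failed = [t["id"] for t in statuses if t["state"] == "FAILED"]
    return states, failed

tasks = [
    {"id": 0, "state": "RUNNING"},
    {"id": 1, "state": "RUNNING"},
    {"id": 2, "state": "FAILED"},
]
states, failed = task_summary(tasks)
print(dict(states), "failed:", failed)
```

A summary like this is handy as a quick health check alongside the Prometheus metrics, for example to decide which failed tasks to restart.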

ksqlDB: Real-Time Data Processing Made Easy  

Next, let’s explore ksqlDB. ksqlDB is a streaming SQL engine for Apache Kafka that enables real-time data processing and analytics using SQL-like queries on Kafka topics. It simplifies the creation of real-time applications and analytics without the need for custom code. To effectively monitor its performance, here are some key metrics to consider: 

  • Cache Usage: ksqlDB uses an internal cache to improve query performance by buffering intermediate results. Monitoring cache usage helps ensure the cache is sized appropriately for your query load: a consistently saturated cache suggests you may gain performance by allocating more capacity, while very low usage might signal inefficient queries or over-provisioned resources.
  • Active / Running / Errored Queries: Tracking the number of active, running, and errored queries helps assess the load on the ksqlDB engine and ensures that queries are running as expected. A high number of active or running queries can indicate increased demand, whereas errored queries suggest issues with query execution or resource limitations that need attention to maintain smooth performance. 
  • Messages Consumed / Produced: Monitoring the number of messages consumed and produced by ksqlDB provides insights into data flow and throughput. These metrics help gauge how effectively ksqlDB is processing data and whether it’s keeping up with the incoming stream. A significant mismatch between consumed and produced messages may point to bottlenecks or inefficiencies that require optimization. 
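
The consumed/produced comparison can be reduced to a simple ratio check. The 0.5 threshold below is purely an illustrative example:

```python
# Ratio check for the messages consumed / produced metrics above.
# The sample counts and the 0.5 threshold are illustrative.

def throughput_ratio(consumed: int, produced: int) -> float:
    """Produced-to-consumed ratio; 0.0 when nothing was consumed."""
    return produced / consumed if consumed else 0.0

ratio = throughput_ratio(consumed=10_000, produced=4_000)
print(f"ratio: {ratio:.2f}", "investigate" if ratio < 0.5 else "ok")
```

Note that for queries that filter or aggregate, producing fewer messages than are consumed is expected, so any threshold has to reflect the semantics of the specific query rather than a universal rule.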

By monitoring these key metrics, you can ensure that ksqlDB operates efficiently and continues to provide reliable real-time data processing. Now, let’s move on to the final component of the Kafka platform: ZooKeeper. 

ZooKeeper: Coordinating Kafka Behind the Scenes 

Nearly done! The last part to consider is ZooKeeper. ZooKeeper is a distributed coordination service that Kafka has traditionally used for managing cluster metadata and broker coordination. In newer Kafka versions, ZooKeeper is deprecated and replaced by the KRaft (Kafka Raft) consensus protocol, which improves performance and simplifies Kafka’s architecture. However, if you’re still running an older Kafka deployment with ZooKeeper, it’s important to monitor the following metrics:

  • Online Nodes: Tracking the number of online nodes in the ZooKeeper ensemble helps ensure that the service is highly available and functioning properly. A decrease in the number of online nodes could indicate node failures or communication issues, which might affect Kafka’s overall stability and performance. 
  • Active Connections: Monitoring active connections to ZooKeeper provides insights into how many clients (e.g., Kafka brokers) are connected and interacting with the ZooKeeper service. High or low connection counts can indicate issues such as network congestion, client misconfigurations, or abnormal load. 
  • Disconnects: The disconnects metric tracks the number of times a client disconnects from ZooKeeper. Frequent disconnects could point to underlying connectivity or network issues that might impact Kafka’s ability to maintain coordination and metadata consistency. 
  • Auth Failures: Monitoring authentication failures helps detect potential security issues or misconfigurations in your ZooKeeper setup. A high number of failed authentication attempts could indicate unauthorized access attempts or issues with client credentials that need to be resolved to maintain secure access. 
  • Sync Connects: Sync connects represent connections made to the ZooKeeper ensemble that require synchronization between servers. Monitoring sync connects helps you understand how frequently your clients need to synchronize with ZooKeeper and whether this could be a potential performance bottleneck. High sync connect counts might indicate high demand or latency issues in the coordination process. 
  • Session State: Session state reflects the health of ZooKeeper sessions, which manage the state of client connections. Monitoring session states helps ensure that sessions are active and healthy, and it can help detect issues such as expired or failed sessions that might indicate underlying system problems. 
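
The online-nodes metric matters because of ZooKeeper’s quorum rule: an ensemble of size n stays available only while a strict majority (n // 2 + 1) of its nodes is online, which is also why ensembles use odd sizes. A tiny sketch of that check:

```python
# Quorum check over the ZooKeeper online-nodes metric: an ensemble of
# size n needs a strict majority (n // 2 + 1) of nodes online.

def has_quorum(ensemble_size: int, online_nodes: int) -> bool:
    return online_nodes >= ensemble_size // 2 + 1

for online in (3, 2, 1):
    print(f"{online}/3 online -> quorum: {has_quorum(3, online)}")
```

So a 3-node ensemble tolerates one failure and a 5-node ensemble tolerates two; alerting as soon as the online count drops to the bare majority gives you time to act before the next failure takes the ensemble down.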

After exploring the depths of Kafka metrics, let’s summarize our key findings. 

Conclusion 

Monitoring is crucial because it ensures the health, performance, and reliability of your Kafka platform, enabling you to quickly identify and resolve issues. There’s no need to reinvent the wheel, as there are powerful tools available to monitor every aspect of the Kafka ecosystem. While numerous metrics can be considered to cover all components, this blog post focuses on the key ones. These include general metrics that apply across the platform, as well as specific metrics for each component. And don’t forget to set up alerts—having proactive notifications in place is essential for staying on top of any issues as they arise. By effectively monitoring your Kafka platform and setting up the right alerts, you can ensure smooth operations and quickly address any challenges, keeping your data streaming seamlessly and reliably.