In the first part of this blog, we covered the basics of the Kafka ecosystem and explored the options for exporting Kafka metrics—first using the Jolokia JVM agent and then via the Prometheus JMX agent. Here in this post, we’ll go through some key Kafka metrics that are available on Grafana for building visualizations and alerts. Although Kafka provides hundreds of metrics, as described here, we are going to cover the most important ones to monitor.
Monitoring Key Kafka Performance Metrics
While monitoring Kafka, it is important to cover all of its components, such as the Broker, Producer, and Consumer, as well as ZooKeeper. Kafka metrics are available in the form of MBeans, managed Java objects that follow the design patterns established by the JMX specification. For the purpose of this blog, however, we’re less concerned with the original naming conventions of the metrics and will simply use the metric names available on Grafana, as exported by Prometheus’ JMX agent.
Kafka Server (Broker) Metrics
Kafka Broker metrics are available at multiple levels, such as cluster-wide, per Broker, and per topic. Furthermore, in Kafka terminology, these metrics are grouped into several types, including ReplicaManager, Partition, KafkaController, BrokerTopicMetrics, and many others.
The partitions of a topic are replicated across Broker nodes. When a Broker becomes unavailable, the value of UnderReplicatedPartitions increases, and an alert should be generated as soon as the value is greater than zero.
There should be only one active controller in a cluster, so an alert should be generated if the value of ActiveControllerCount exceeds 1.
OfflinePartitionsCount is a very critical metric that tells you how many partitions don’t have an active leader. An alert must be generated if the value of this metric exceeds zero, since partitions without a leader cannot be used for reads or writes.
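These three checks translate naturally into Prometheus alerting rules. Below is a sketch; the metric names assume a typical jmx_exporter rules configuration (kafka_server_replicamanager_underreplicatedpartitions and the kafka_controller_kafkacontroller_* gauges), so verify the exact names against your own /metrics endpoint before using them:

```yaml
groups:
  - name: kafka-broker-health
    rules:
      # Any partition with fewer in-sync replicas than configured.
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: warning
      # Exactly one active controller should exist cluster-wide.
      - alert: KafkaMultipleActiveControllers
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) > 1
        for: 1m
        labels:
          severity: critical
      # Partitions without a leader cannot serve reads or writes.
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels:
          severity: critical
```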
TotalTimeMs gives the total time taken, in milliseconds, to serve a request. The request can be a Produce, FetchConsumer, or FetchFollower request.
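As a sketch, here is a recording rule tracking 99th-percentile Produce request latency per Broker. The metric name and the quantile label are assumptions based on a jmx_exporter configuration that maps the RequestMetrics percentile attributes onto a quantile label; adjust both to your setup:

```yaml
groups:
  - name: kafka-request-latency
    rules:
      # 99th-percentile total time (ms) for Produce requests, per Broker.
      - record: kafka:produce_totaltimems:p99
        expr: kafka_network_requestmetrics_totaltimems{request="Produce",quantile="0.99"}
```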
BytesInPerSec and BytesOutPerSec
These metrics help you trace any performance bottleneck due to network congestion. The common issues that may impact the network throughput of Kafka Brokers are a slow network, high number of consumers, data synchronization for replication after a lag, etc.
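Since BytesInPerSec and BytesOutPerSec are Meters, their cumulative counts can be turned into per-second throughput with rate(). A sketch, assuming the counters are exported as kafka_server_brokertopicmetrics_bytesin_total and kafka_server_brokertopicmetrics_bytesout_total (the names vary with the jmx_exporter rules you use):

```yaml
groups:
  - name: kafka-throughput
    rules:
      # Per-Broker inbound network throughput (bytes/sec), 5-minute window.
      - record: kafka:bytes_in_per_sec:rate5m
        expr: rate(kafka_server_brokertopicmetrics_bytesin_total[5m])
      # Per-Broker outbound network throughput (bytes/sec), 5-minute window.
      - record: kafka:bytes_out_per_sec:rate5m
        expr: rate(kafka_server_brokertopicmetrics_bytesout_total[5m])
```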
Requests per second is calculated as the sum of the request rates of producers, consumers, and followers.
RequestQueueSize and ResponseQueueSize
Both of these queues belong to the Kafka request channel. An increase in the queue size may halt the processing of requests and result in congestion, delayed response, and memory pressure on the Brokers.
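A sketch of alerts on both queues follows. The metric names assume jmx_exporter output such as kafka_network_requestchannel_requestqueuesize, and the thresholds are illustrative only; tune them to your queued.max.requests setting and observed baseline:

```yaml
groups:
  - name: kafka-request-channel
    rules:
      # Requests waiting to be picked up by request handler threads.
      - alert: KafkaRequestQueueBacklog
        expr: kafka_network_requestchannel_requestqueuesize > 100
        for: 5m
        labels:
          severity: warning
      # Responses waiting to be sent back to clients.
      - alert: KafkaResponseQueueBacklog
        expr: kafka_network_requestchannel_responsequeuesize > 100
        for: 5m
        labels:
          severity: warning
```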
A Purgatory is a holding area for produce and fetch requests that cannot be completed immediately, and it can help you identify latency issues. The more requests sit in a Purgatory, the greater the delay in processing those requests.
For fetch requests, the Purgatory size is high if not enough data is available for consumers to fetch, which happens when you’ve set a large value for either fetch.min.bytes or fetch.max.wait.ms.
For produce requests, this is usually a non-zero value if Producers use acks=all (equivalently, acks=-1): every produce request is kept in the Purgatory until the leader of the Partition receives an acknowledgment from all of its followers.
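Purgatory growth can be watched per request type. A sketch, assuming the exporter produces a kafka_server_delayedoperationpurgatory_purgatorysize gauge with a delayedOperation label (check your /metrics output for the actual name and labels); the threshold is illustrative:

```yaml
groups:
  - name: kafka-purgatory
    rules:
      # Produce requests parked while waiting for follower acknowledgments.
      - alert: KafkaProducePurgatoryGrowing
        expr: kafka_server_delayedoperationpurgatory_purgatorysize{delayedOperation="Produce"} > 500
        for: 10m
        labels:
          severity: warning
```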
Additional Metrics for Brokers
Since a Kafka server is written in Scala and runs inside the JVM, you have to continuously track the JVM usage with jvm_memory_bytes_used. Garbage Collection metrics also need to be monitored since a high utilization of Kafka Brokers may cause a long pause during garbage collection and interrupt the communication with ZooKeeper.
You can use two metrics to monitor Garbage Collection stats: jvm_gc_collection_seconds_count and jvm_gc_collection_seconds_sum. You may also want to look at JVM tuning if you’re seeing frequent or long garbage collections. Check out the guidelines for JVM optimizations in the official Kafka documentation.
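Because jvm_gc_collection_seconds_sum is a counter of seconds spent collecting, rate() over it yields the fraction of wall-clock time spent in GC. A sketch of a recording rule plus an alert (the 10% threshold is illustrative, not a recommendation):

```yaml
groups:
  - name: kafka-jvm-gc
    rules:
      # Fraction of time spent in GC over the last 5 minutes (0.0-1.0).
      - record: kafka:jvm_gc_time_fraction:rate5m
        expr: rate(jvm_gc_collection_seconds_sum[5m])
      # Sustained heavy GC risks pauses long enough to drop the ZooKeeper session.
      - alert: KafkaHighGcTime
        expr: rate(jvm_gc_collection_seconds_sum[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
```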
In addition to the core Kafka metrics, monitoring system stats such as disk utilization, free disk, disk I/O, CPU utilization, and system load is equally important to know the performance of your Kafka cluster and individual Brokers.
It is very important to keep an eye on disk usage, since Kafka persists data on disk; if the disk is full, Kafka will stop working. Each Kafka topic has configurable retention settings that determine how much data is kept before it automatically expires.
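If node_exporter runs alongside the Brokers, free space on the Kafka data volume can be alerted on directly. A sketch; the /var/lib/kafka mountpoint is a hypothetical example, so substitute the location of your log.dirs:

```yaml
groups:
  - name: kafka-disk
    rules:
      # Fires when less than 15% of the Kafka data volume remains free.
      - alert: KafkaDiskAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"} < 0.15
        for: 10m
        labels:
          severity: critical
```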
Along with Kafka metrics, you need to monitor the health of ZooKeeper as well as its communication with Kafka, watching for ZooKeeper state transitions and disconnects. Metrics such as the ZooKeeper client’s SessionState and the ZooKeeperDisconnectsPerSec and ZooKeeperExpiresPerSec rates are useful here.
Kafka Producer Metrics
The performance of a Kafka cluster greatly impacts how Producers communicate with Brokers. There are many scenarios where you can optimize the Producer based on the use case and cluster capacity. We’ll take a look at the important Producer metrics below.
This metric gives you the average number of requests sent by Producers per second. A sudden spike in this metric may impact the performance of Brokers or cause an increase in latency during request processing.
This metric depicts the average number of responses sent by Brokers to Producers. A response is sent as an acknowledgment and depends on the Producer’s acknowledgment setting (acks).
Request Latency Average
Latency is the time from when a Producer sends a message to a Broker to when the Producer receives an acknowledgment from that Broker. Note that high latency is not always a bad thing. Most of the time, latency is related to throughput: if Producers send messages in large batches and set linger.ms to a higher value to achieve high throughput, latency will increase.
Here, you have the total number of active connections created by Producers, which helps you determine how many Producers are connected at a given time.
I/O Wait Time
This measures the average time spent on I/O while waiting for available sockets to send data. A high I/O wait time is an indication of resource saturation, caused by either slow disks or limited network bandwidth.
Average Batch Size
This is the average amount of data, in bytes, accumulated per batch before being sent in a single request. A larger batch size increases throughput but may also affect overall latency.
This metric indicates the average compression rate per batch for a topic. A high compression rate ensures good performance, whereas a lower compression rate may be an indication that there are issues with message structures or that some Producers are not using compression at all.
Average Byte Rate Produced
The average number of bytes sent by Producers per topic helps you determine which topics are the busiest; it can also be a good way to find network bottlenecks in a system.
Kafka Consumer Metrics
Connection Count gives you the total number of active connections created by consumers, which helps you determine how many Consumers are connected at a given time.
Records Lag Max
An indicator of the maximum lag, in records, between the Consumer’s current position and the latest offset produced; an increasing number here indicates that Consumers are not keeping up with the rate at which messages are being produced.
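If your Consumers are also instrumented with the JMX agent, this lag can be alerted on as well. A sketch; the metric name kafka_consumer_fetch_manager_metrics_records_lag_max and the threshold are assumptions to adapt to your exporter configuration and workload:

```yaml
groups:
  - name: kafka-consumer-lag
    rules:
      # Fires when the slowest partition stays persistently far behind the log end.
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_fetch_manager_metrics_records_lag_max > 1000
        for: 10m
        labels:
          severity: warning
```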
Average Bytes Consumed per Second
Similar to the equivalent Producer metric, the average data size consumed by Consumers can help you figure out the reason behind network slowness if there is a sudden decrease in the value.
Average Records Consumed Per Second
This metric gives you a good idea about the average number of records being consumed and can also help in establishing a data consumption trend and setting up alerts accordingly.
Fetch Requests per Second
This is the number of fetch requests per second and should always be non-zero; a decreasing number may indicate that Consumers are failing.
Let’s take a look at a Grafana dashboard with some of the metrics we’ve discussed above:
Monitoring Kafka is a tedious task because of the number of metrics available and the many components in its ecosystem. It is therefore important to know which critical metrics to keep an eye on when setting up basic monitoring.
In this post, we explained all the key metrics that need to be monitored continuously to keep Kafka in good health, find out about bottlenecks, or get an indication on when to scale your Kafka components. There are still more metrics on Grafana that aren’t covered in this post; you are free to explore these and may find something useful for your particular use case.