In the first part of this blog, we covered the basics of the Kafka ecosystem and explored the options for exporting Kafka metrics—first using the Jolokia JVM agent and then via the Prometheus JMX agent. In this post, we’ll go through some key Kafka metrics that are available on Grafana for building visualizations and alerts. Although Kafka provides hundreds of metrics, as described here, we are going to cover the most important ones to monitor.

Monitoring Key Kafka Performance Metrics

While monitoring Kafka, it is important to cover all of its components, such as the Broker, Producer, and Consumer as well as ZooKeeper. Kafka metrics are exposed as MBeans, managed Java objects that follow the design patterns established by the JMX specification. But for the purpose of this blog, we’re less concerned with the original naming conventions of the metrics and will simply use the metric names available on Grafana, as exported by Prometheus’ JMX agent.

Kafka Server (Broker) Metrics

Kafka Broker metrics are available in multiple categories, such as across clusters, per Broker, and per topic. Furthermore, in Kafka terminology, these metrics are categorized into several types including ReplicaManager, Partition, KafkaController, BrokerTopicMetrics, and many others.

UnderReplicatedPartitions

kafka_server_replicamanager_underreplicatedpartitions

Topic partitions are replicated across Broker nodes according to the topic’s replication factor. When a Broker becomes unavailable, the value of UnderReplicatedPartitions increases, and an alert should be generated as soon as the value is greater than zero.
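As a rough sketch, a PromQL alert expression for this could look like the following (assuming the metric is exported under the name shown above; your JMX exporter configuration may differ):

# Fire whenever any Broker reports under-replicated partitions
kafka_server_replicamanager_underreplicatedpartitions > 0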

ActiveControllerCount

kafka_controller_kafkacontroller_activecontrollercount

Exactly one Broker in a cluster should act as the controller at any given time, so an alert should be generated if the sum of this metric across all Brokers is anything other than 1.
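A corresponding PromQL sketch, summing the per-Broker values so that both a missing and a duplicate controller trigger the alert:

# Exactly one active controller is expected across the whole cluster
sum(kafka_controller_kafkacontroller_activecontrollercount) != 1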

OfflinePartitionsCount

kafka_controller_kafkacontroller_offlinepartitionscount

This is a very critical metric that tells you how many partitions don’t have an active leader. An alert must be generated if the value of this metric exceeds zero, since partitions without a leader cannot be used for reads or writes.
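A minimal alert sketch for this condition:

# Any partition without an active leader is a serious problem
sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0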

TotalTimeMs

kafka_network_requestmetrics_totaltimems

This is the total time, in milliseconds, taken to serve a request, whether it is a Produce, FetchConsumer, or FetchFollower request.
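A per-request-type panel sketch; note that the request label (and whether the underlying histogram attributes such as the mean or percentiles are exposed separately) depends entirely on your JMX exporter rules:

# Total request time broken down by request type (label name is an assumption)
kafka_network_requestmetrics_totaltimems{request="Produce"}
kafka_network_requestmetrics_totaltimems{request="FetchConsumer"}
kafka_network_requestmetrics_totaltimems{request="FetchFollower"}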

BytesInPerSec and BytesOutPerSec

kafka_server_brokertopicmetrics_bytesin_total

kafka_server_brokertopicmetrics_bytesout_total

These metrics help you trace performance bottlenecks caused by network congestion. Common issues that may impact the network throughput of Kafka Brokers are a slow network, a high number of consumers, data synchronization for replication after a lag, and so on.
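Since these are counters, they are usually turned into per-second throughput with rate(); a sketch, assuming the metric names shown above:

# Bytes in and out per second, per Broker, over a 5-minute window
sum(rate(kafka_server_brokertopicmetrics_bytesin_total[5m])) by (instance)
sum(rate(kafka_server_brokertopicmetrics_bytesout_total[5m])) by (instance)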

RequestsPerSec

kafka_network_requestmetrics_requests_total

Requests per second is calculated as a total of the request rates of producers, consumers, and followers.
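A simple per-Broker panel sketch for this rate, assuming the counter name shown above:

# Overall request rate per Broker over a 5-minute window
sum(rate(kafka_network_requestmetrics_requests_total[5m])) by (instance)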

RequestQueueSize and ResponseQueueSize

kafka_network_requestchannel_requestqueuesize

kafka_network_requestchannel_responsequeuesize

Both of these queues belong to the Kafka request channel. An increase in queue size may stall the processing of requests and result in congestion, delayed responses, and memory pressure on the Brokers.
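To watch both queues in Grafana, a smoothed panel query such as the following may be useful (a sketch; the 5-minute window is arbitrary):

# Average request and response queue depth over the last 5 minutes
avg_over_time(kafka_network_requestchannel_requestqueuesize[5m])
avg_over_time(kafka_network_requestchannel_responsequeuesize[5m])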

PurgatorySize

kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_fetch

kafka_server_delayedoperationpurgatory_purgatorysize_delayedoperation_produce

A Purgatory is a holding area where both produce and fetch requests wait until they can be completed, and it can help you identify latency issues. The more requests sit in a Purgatory, the greater the delay in processing those requests.

For fetch requests, the Purgatory size is high when there is not yet enough data available for consumers to fetch, which happens when you’ve set a large value for fetch.min.bytes or a long fetch.max.wait.ms.

For produce requests, this is usually a non-zero value if Producers use acks=all (equivalent to acks=-1). In that case, each produce request is kept in the Purgatory until the leader of the Partition receives acknowledgments from all of its in-sync followers.

Additional Metrics for Brokers

JVM Metrics

Since the Kafka server is written in Scala and runs inside the JVM, you should continuously track JVM memory usage with jvm_memory_bytes_used. Garbage Collection metrics also need to be monitored, since heavily utilized Brokers can experience long garbage-collection pauses that interrupt communication with ZooKeeper.

You can use two metrics to monitor Garbage Collection: jvm_gc_collection_seconds_count and jvm_gc_collection_seconds_sum. If you’re seeing frequent or long garbage collections, take a look at JVM tuning; the official Kafka documentation provides some guidelines for JVM optimization.
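One useful derived view is the average garbage-collection pause length, which can be computed from the two summary metrics above; a sketch:

# Average GC pause duration per Broker over the last 5 minutes
rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m])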

System Stats

In addition to the core Kafka metrics, monitoring system stats such as disk utilization, free disk, disk I/O, CPU utilization, and system load is equally important for understanding the performance of your Kafka cluster and individual Brokers.

It is very important to keep an eye on disk usage, since Kafka persists data on disk; if the disk is full, Kafka stops working. Retention is configurable per topic, controlling how much data is kept before it is automatically expired.
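Disk usage is not exposed by the Kafka JMX metrics themselves; assuming you also scrape node_exporter on the Broker hosts, an illustrative alert might look like the following (the mountpoint and the 85% threshold are only examples):

# Alert when the Kafka data disk is more than 85% full
(1 - node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"} / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) > 0.85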

ZooKeeper Metrics

Along with the Kafka metrics, you need to monitor the health of ZooKeeper as well as Kafka’s connection to it, including state transitions, disconnects, and session expirations. You can use the following metrics to do this (a small alerting sketch follows the list):

kafka_server_sessionexpirelistener_zookeeperauthfailures_total

kafka_server_sessionexpirelistener_zookeeperdisconnects_total

kafka_server_sessionexpirelistener_zookeeperexpires_total

kafka_server_sessionexpirelistener_zookeeperreadonlyconnects_total

kafka_server_sessionexpirelistener_zookeepersaslauthentications_total

kafka_server_sessionexpirelistener_zookeepersyncconnects_total

kafka_server_zookeeperclientmetrics_zookeeperrequestlatencyms
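As mentioned above, a minimal alerting sketch in PromQL could watch for recent disconnects or session expirations (assuming these counters are exported under the names listed above):

# Any ZooKeeper disconnects or session expirations in the last 10 minutes deserve attention
increase(kafka_server_sessionexpirelistener_zookeeperdisconnects_total[10m]) > 0
increase(kafka_server_sessionexpirelistener_zookeeperexpires_total[10m]) > 0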

Kafka Producer Metrics

The performance of a Kafka cluster greatly impacts how Producers communicate with Brokers. There are many scenarios where you can optimize the Producer based on the use case and cluster capacity. We’ll take a look at the important Producer metrics below.

Request Rate

kafka_producer_request_rate

This metric gives you the average number of requests sent by Producers per second. A sudden spike in this metric may impact the performance of Brokers or cause an increase in latency during request processing. 

Response Rate

kafka_producer_response_rate

This metric depicts the average number of responses per second sent by Brokers to Producers. A response is sent as an acknowledgment, and its behavior depends on the Producer’s acks setting.

Request Latency Average

kafka_producer_request_latency_avg

Latency is the time from when a Producer sends a message to a Broker until the Producer receives an acknowledgment from that Broker. Note that high latency is not always a bad thing; most of the time, latency trades off against throughput. If Producers send messages in large batches and set linger.ms to a higher value to achieve high throughput, latency will increase.
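To spot the slowest Producers on a dashboard, a simple top-N sketch over this metric can be used (how the metric is broken down, for example by client id, depends on how your Producer applications export their metrics):

# Five highest average request latencies reported by Producer clients
topk(5, kafka_producer_request_latency_avg)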

Connection Count

kafka_producer_connection_count

Here, you have the total number of active connections created by Producers. This helps you determine how many Producers are connected at a given time.

I/O Wait Time

kafka_producer_io_wait_time_ns_avg

This measures the average time spent waiting for a socket to become available for sending data. A high I/O wait time is an indication of resource saturation, caused either by slow disks or by limited network bandwidth.

Average Batch Size

kafka_producer_batch_size_avg

This is the average size, in bytes, of the batches accumulated before being sent in a single request. A larger batch size increases throughput but may also affect overall latency.

Compression Rate

kafka_producer_compression_rate_avg

This metric indicates the average compression rate of record batches for a topic. A good compression rate helps throughput, whereas a poor one may be an indication that there are issues with message structures or that some Producers are not using compression at all.

Average Byte Rate Produced

kafka_producer_outgoing_byte_rate

The average number of bytes sent by Producers per topic helps you determine which topics are the busiest; it can also be a good way to find network bottlenecks in a system.
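A hypothetical panel for finding the busiest topics might look like the following; note that whether a topic label exists at all depends on how the per-topic Producer metrics are exported in your setup:

# Top five topics by produced bytes per second (topic label is an assumption)
topk(5, sum by (topic) (kafka_producer_outgoing_byte_rate))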

Kafka Consumer Metrics

Connection Count

kafka_consumer_connection_count

Connection Count gives you the total number of active connections created by Consumers, which helps you determine how many Consumers are connected at a given time.

Records Lag Max

kafka_consumer_records_lag_max

This indicates the maximum lag, in number of records, between the Consumer’s position and the latest message produced; an increasing value is an indication that Consumers are not keeping up with the rate at which messages are being produced.
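An illustrative alert sketch; the threshold of 1000 records is arbitrary and should be tuned to your workload:

# Consumers are falling behind by more than 1000 records
kafka_consumer_records_lag_max > 1000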

Average Bytes Consumed per Second

kafka_consumer_bytes_consumed_rate

Similar to the equivalent Producer metric, the average data size consumed per second can help you identify network slowness if there is a sudden decrease in its value.

Average Records Consumed Per Second

kafka_consumer_records_consumed_rate

This metric gives you a good idea about the average number of records being consumed and can also help in establishing a data consumption trend and setting up alerts accordingly.

Fetch Requests per Second

kafka_consumer_fetch_rate

This is the number of fetch requests per second; for a healthy Consumer, it should always be non-zero. A decreasing value may indicate that Consumers are failing.
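A minimal alert sketch for a stalled Consumer, assuming the metric name above:

# A fetch rate of zero usually means the Consumer has stopped fetching
kafka_consumer_fetch_rate == 0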

Let’s take a look at a Grafana dashboard with some of the metrics we’ve discussed above:

Figure 1: Kafka Broker metrics visualizations

Figure 2: Kafka Broker metrics visualizations

Summary

Monitoring Kafka is a tedious task because of the sheer number of metrics and the many components in its ecosystem. It is therefore important to know which critical metrics to keep an eye on when setting up basic monitoring.

In this post, we covered the key metrics that need to be monitored continuously to keep Kafka in good health, find bottlenecks, and get an indication of when to scale your Kafka components. There are more metrics available on Grafana that aren’t covered in this post; feel free to explore them, as you may find something useful for your particular use case.

Read More:

Monitoring Managed Cloud Services with Distributed Tracing

Slack Outage: Third-Party Dependencies and Accountability