Network partitions and Kafka

15 October, 2024
by Mark Teisman

I'm building a system that uses a log-compacted Kafka topic as the data store. One of the requirements is that during a network partition, reads should be possible on both sides of the partition, and writes should at least be possible on the majority side. There should furthermore be automatic reconciliation once the network heals. In this post I'll summarize what I learned about Kafka's behavior under network partitions.

A Kafka topic is configured to have a certain number of partitions (not to be confused with network partitions, which occur when there's a loss of connectivity of one part of the network from the other, and you effectively end up with two isolated networks).

Kafka is furthermore configured to aim for a certain level of replication; let's call this N. N-1 is the number of Kafka nodes you can lose without data loss. Each partition of a topic has one leader and N-1 followers that store the data. Writes always go through the leader of the partition and are replicated to the followers. The durability of a write is configurable through the acks setting: with acks=all the write is acknowledged once all in-sync replicas have a copy of the data, with acks=0 the producer doesn't wait for any acknowledgement at all, and with acks=1 the producer receives an acknowledgement as soon as the leader has written the message to its local log. It should be clear how these values trade latency against consistency.
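To make the three acks levels concrete, here is a toy model of when a producer would consider a write acknowledged. This is not the real Kafka client API; the function and its arguments are made up purely for illustration.

```python
def write_acknowledged(acks, leader_written, follower_copies, num_followers):
    """Toy model of Kafka's acks semantics (illustrative, not the real API).

    acks:            "0", "1", or "all"
    leader_written:  True once the leader has appended to its local log
    follower_copies: number of in-sync followers that have replicated the record
    num_followers:   number of in-sync followers the leader expects acks from
    """
    if acks == "0":
        return True                      # fire-and-forget: never waits
    if acks == "1":
        return leader_written            # leader's local append is enough
    if acks == "all":
        # every in-sync replica must have a copy before the producer is acked
        return leader_written and follower_copies == num_followers
    raise ValueError(f"unknown acks value: {acks}")
```

The model makes the risk visible: with acks=1, a write can be acknowledged and then lost if the leader fails before any follower copied it; with acks=all, acknowledgement implies every in-sync replica holds the record.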

When a network partition happens, there are two possible scenarios.

If the partition leader exists in the majority partition, Kafka continues to operate as it normally would for that partition, though with reduced replication. The cluster in the majority partition will notice that hosts in the minority partition are unreachable, and remove them from the set of in-sync replicas (ISR). The time until a host is removed from the ISR is configurable with settings such as replica.lag.time.max.ms. Clients connected to the majority partition can still produce and consume messages.

Writes with acks=all will succeed as long as the min.insync.replicas requirement is still met after nodes are removed from the ISR. Once a node is removed from the in-sync replicas, its acknowledgement is no longer needed. However, if the ISR shrinks below the min.insync.replicas threshold (which defaults to 1, but is often set to 2 in production), producers will receive a NotEnoughReplicasException and writes will fail until sufficient replicas are back in sync.
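The write-admission rule under acks=all can be sketched as follows. The names are made up for illustration; the actual broker logic is of course far more involved.

```python
class NotEnoughReplicasError(Exception):
    """Stand-in for the broker rejecting an acks=all write."""

def admit_write(isr, min_insync_replicas):
    """Accept an acks=all write only while the ISR is large enough."""
    if len(isr) < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"ISR has {len(isr)} members, need {min_insync_replicas}")
    return True

isr = {"broker-1", "broker-2", "broker-3"}
admit_write(isr, 2)                    # accepted: 3 >= 2

isr -= {"broker-2", "broker-3"}        # partition: two brokers drop out of the ISR
# admit_write(isr, 2) would now raise NotEnoughReplicasError,
# and writes fail until a replica catches back up and rejoins the ISR.
```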

If the partition leader exists in the minority partition, Kafka will attempt to elect a new leader from one of the followers in the majority partition. All followers in the ISR are technically eligible for clean leader election, but only followers in the majority partition can actually be elected, since the rest are unreachable from the majority side. If no available follower is in the ISR, then whether a leader can be elected depends on whether unclean.leader.election.enable=true is set. If a new leader is elected, client operations proceed with the new leader. If no leader is elected, the partition becomes unavailable for both reads and writes, and clients may fail with e.g. LeaderNotAvailableException. Note that if both the leader and all followers are in the minority partition, no new leader can be elected in the majority partition. If one or more nodes are permanently lost, a permanent partition reassignment is necessary; the bin/kafka-reassign-partitions.sh tool allows you to reassign partitions to different brokers or disks.
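The election rules above can be sketched roughly like this. All names are illustrative, and real Kafka elects leaders via the controller rather than a standalone function; this is only meant to show the precedence of clean vs. unclean election.

```python
def elect_leader(isr, reachable, unclean_enabled, all_replicas):
    """Toy leader election sketch (illustrative only).

    isr:             replicas currently in the in-sync set
    reachable:       replicas reachable from the majority side
    unclean_enabled: models unclean.leader.election.enable
    all_replicas:    replica list in preference order
    """
    # Clean election: prefer an in-sync replica we can actually reach.
    candidates = [r for r in all_replicas if r in isr and r in reachable]
    if candidates:
        return candidates[0]
    if unclean_enabled:
        # Unclean election: any reachable replica, at the risk of losing
        # records the old leader had committed.
        fallback = [r for r in all_replicas if r in reachable]
        if fallback:
            return fallback[0]
    return None  # partition offline: no eligible leader
```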

With the most typical configurations, all former leaders in the minority partition will soon lose their leadership, and from that point on clients in the minority partition will get NotLeaderForPartitionException or similar errors.

When a network partition causes a leader change, the old leader may hold records that were never replicated to the new leader. This can happen either because the replica in the majority partition was lagging at the moment of the partition, or because the former leader kept accepting writes from within the minority partition for a while. When the network heals, the old leader rejoins as a follower and reconciles its log with the new leader's: it truncates any records that diverged from the new leader's log, and the new leader's log becomes the source of truth.
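The truncation step can be sketched as follows. This is a simplification: a real broker compares leader epochs and offsets rather than record values, but the shape of the reconciliation is the same.

```python
def reconcile(rejoining_log, leader_log):
    """Truncate a rejoining replica's log at the first divergence from the
    new leader, then adopt the leader's log. Returns (converged_log, lost).
    Logs are plain lists of record values for illustration."""
    i = 0
    while (i < min(len(rejoining_log), len(leader_log))
           and rejoining_log[i] == leader_log[i]):
        i += 1
    lost = rejoining_log[i:]       # records the rejoining replica must discard
    return list(leader_log), lost
```

For example, an old leader with log ["a", "b", "x", "y"] rejoining a new leader with log ["a", "b", "c"] truncates "x" and "y" and ends up with the new leader's log.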

The truncation of logs on nodes that were in the minority partition, and the possible election of non-ISR followers, have interesting implications for consumers. A consumer may find that messages it has already consumed no longer exist in the log.
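As a tiny illustration, assume a consumer on the minority side read ahead of what survived reconciliation (the record values here are made up):

```python
consumed = ["a", "b", "x", "y"]    # what the consumer read from the old leader
log_after_heal = ["a", "b", "c"]   # the converged log after truncation

# Records the consumer already processed that no longer exist anywhere:
vanished = [m for m in consumed if m not in log_after_heal]
# vanished == ["x", "y"]
```

The consumer has acted on "x" and "y", yet no replica retains them, which is exactly the reconciliation hazard an application built on top of Kafka has to account for.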