15 October, 2024
by Mark Teisman
I'm building a system that uses a log-compacted Kafka topic as the data store. One of the requirements is that during a network partition, reads should be possible on both sides of the partition, and writes should at least be possible on the majority side. There should furthermore be automatic reconciliation once the network heals. In this post I'll summarize my learnings about Kafka's behavior in relation to network partitions.
A Kafka topic is configured to have a certain number of partitions (not to be confused with network partitions, which occur when one part of the network loses connectivity to the other and you effectively end up with two isolated networks).
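For concreteness, here's a minimal sketch of creating such a topic with Kafka's Java `Admin` client. The topic name `events`, the broker address, and the partition and replication counts are placeholders for this example; the `cleanup.policy=compact` config makes it a log-compacted topic like the one I'm using.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // A log-compacted topic with 3 partitions, each replicated 3 times.
            NewTopic topic = new NewTopic("events", 3, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```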
Kafka is furthermore configured to aim for a certain level of replication; let's call this N. Each partition of a topic has one leader and N-1 followers that store the data, so you can lose up to N-1 brokers without losing data. Writes always go through the leader of the partition and are replicated to the followers. The level of consistency of a write is configurable through the `acks` setting: `acks=all` means the write is acknowledged once all in-sync replicas have a copy of the data, `acks=0` means the producer won't wait for any acknowledgement, and `acks=1` means the producer receives an acknowledgement as soon as the leader has written the message to its local log. It should be clear how these values affect consistency.
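To illustrate, a minimal producer sketch in Java; the topic name and broker address are placeholders, and `acks` is the setting discussed above:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "all": wait for every in-sync replica; "1": leader only; "0": fire and forget.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // .get() blocks until the broker acknowledges according to the acks setting.
            producer.send(new ProducerRecord<>("events", "key-1", "value-1")).get();
        }
    }
}
```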
When a network partition happens, there are two possible scenarios.
If the partition leader is in the majority partition, Kafka continues to operate as it normally would for that partition, though with reduced replication. The brokers in the majority partition will notice that the hosts in the minority partition are unreachable and remove them from the in-sync replicas (ISR). The time until a host is removed from the ISR is configurable with settings such as `replica.lag.time.max.ms`. Clients connected to the majority partition can still produce and consume messages. Even writes with `acks=all` will succeed, because once a node is removed from the in-sync replicas, its acknowledgement is no longer required for the write to be acknowledged.
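You can watch the ISR shrink from the majority side. A hedged sketch using the Java `Admin` client (topic name and broker address are again placeholders):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("events"))
                    .allTopicNames().get().get("events");
            for (TopicPartitionInfo p : desc.partitions()) {
                // After replica.lag.time.max.ms, unreachable replicas drop out of the ISR.
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```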
If the partition leader is in the minority partition, Kafka will attempt to elect a new leader from one of the followers in the majority partition. Any follower in the majority partition that is still in the ISR qualifies. If no such follower exists, then whether a leader can be elected depends on whether `unclean.leader.election.enable=true` is set. If a new leader is elected, client operations proceed with the new leader. If no leader is elected, the partition becomes unavailable for both reads and writes, and clients may fail with e.g. LeaderNotAvailableException. It should be clear that if the leader and all of its followers are in the minority partition, no new leader can be elected in the majority partition.

If one or more nodes are lost permanently, a permanent partition reassignment is necessary. The `bin/kafka-reassign-partitions.sh` tool allows you to reassign partitions to different brokers or disks; the same can be done programmatically, as sketched below.
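Here's a hedged sketch with the Java `Admin` client, equivalent to editing the topic config and running the reassignment tool. The topic name and the broker IDs 4, 5 and 6 are made up for the example:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class PartitionOps {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Allow an out-of-sync follower to become leader
            // (trading consistency for availability).
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            AlterConfigOp enableUnclean = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "true"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(enableUnclean))).all().get();

            // Move partition 0 of "events" onto brokers 4, 5 and 6.
            admin.alterPartitionReassignments(Map.of(
                    new TopicPartition("events", 0),
                    Optional.of(new NewPartitionReassignment(List.of(4, 5, 6)))
            )).all().get();
        }
    }
}
```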
When a network partition causes a leader change, the old leader may hold records that were never written to the new leader. This can happen either because the follower in the majority partition lagged behind at the moment of the network partition, or because the former leader kept accepting writes from within the minority partition. When the network heals, the old leader rejoins as a follower: it truncates its log back to the point where it diverged from the new leader and then fetches the new leader's records until the logs converge. Records that existed only on the old leader are discarded, so in effect the new leader's log wins.
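To make the reconciliation concrete, here's a toy sketch of the truncate-and-catch-up step. It compares two in-memory logs record by record; real Kafka replicas locate the divergence point using leader epochs and offsets rather than by scanning, but the effect is the same: the rejoining replica's divergent suffix is dropped and replaced by the leader's.

```java
import java.util.ArrayList;
import java.util.List;

public class Reconcile {
    // Toy model: truncate the rejoining replica's log at the first offset
    // where it diverges from the new leader, then copy the leader's suffix.
    static List<String> reconcile(List<String> leaderLog, List<String> oldLeaderLog) {
        int i = 0;
        while (i < leaderLog.size() && i < oldLeaderLog.size()
                && leaderLog.get(i).equals(oldLeaderLog.get(i))) {
            i++; // logs agree up to offset i
        }
        // Drop the old leader's divergent suffix, then fetch the leader's records.
        List<String> healed = new ArrayList<>(oldLeaderLog.subList(0, i));
        healed.addAll(leaderLog.subList(i, leaderLog.size()));
        return healed; // the new leader's log wins
    }

    public static void main(String[] args) {
        List<String> leader = List.of("a", "b", "c", "d");    // new leader's log
        List<String> oldLeader = List.of("a", "b", "x", "y"); // divergent writes x, y
        System.out.println(reconcile(leader, oldLeader));     // prints [a, b, c, d]
    }
}
```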