15 October, 2024
by Mark Teisman
I'm building a system that uses a log-compacted Kafka topic as the data store. One of the requirements is that during a network partition, reads should be possible in both partitions, and writes should at least be possible in the majority partition. There should furthermore be automatic reconciliation once the network heals. In this post I'll summarize my learnings about Kafka's behavior in relation to network partitions.
A Kafka topic is configured to have a certain number of partitions (not to be confused with network partitions, which occur when connectivity is lost between one part of the network and the other, and you effectively end up with two isolated networks).
Kafka is furthermore configured to aim for a certain level of
replication, let's call this N. N-1 is the number of Kafka brokers you
can lose without data loss. Each partition of a topic has one
leader and N-1 followers that store copies of the data. Writes always go
through the leader of the partition and are replicated to the followers.
The durability of a write is configurable through the acks
setting: with acks=all the write is acknowledged only once all in-sync
replicas have a copy of the data; with acks=0 the producer
won't wait for any acknowledgement at all; and with acks=1 the producer
receives an acknowledgement as soon as the leader has written the
message to its local log.
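As a minimal sketch of this trade-off (the helper function and broker address are my own, and the keys follow the standard Kafka producer config names):

```python
def make_producer_config(acks: str) -> dict:
    """Build a producer config for a given acknowledgement level."""
    assert acks in ("0", "1", "all")
    return {
        "bootstrap.servers": "localhost:9092",  # hypothetical broker
        # "0": fire-and-forget, "1": leader's log only, "all": all in-sync replicas
        "acks": acks,
        # With acks=all, idempotence additionally guards against duplicates on retry.
        "enable.idempotence": acks == "all",
    }

durable = make_producer_config("all")  # acknowledged once all ISR members have the write
fast = make_producer_config("0")       # no acknowledgement at all
```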
When a network partition happens, there are two possible scenarios.
If the partition leader is in the majority partition, then Kafka
continues to operate as it normally would for that partition, though
with reduced replication. The cluster in the majority partition will
notice unreachable hosts in the minority partition, and remove the hosts
from the in-sync replicas (ISR). The time until a host is removed from
the ISR is configurable with settings such as
replica.lag.time.max.ms.
Clients connected to the majority partition can still produce and
consume messages.
Writes with acks=all will succeed as long as the min.insync.replicas requirement is still met after nodes are removed from the ISR. Once a node is removed from the in-sync replicas, its acknowledgement is no longer needed. However, if the ISR shrinks below the min.insync.replicas threshold (which defaults to 1, but is often set to 2 in production), producers will receive a NotEnoughReplicasException and writes will fail until sufficient replicas are back in sync.
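To make the arithmetic concrete, here is a sketch of the acks=all acceptance rule (the function name is my own):

```python
def acks_all_write_accepted(isr_size: int, min_insync_replicas: int) -> bool:
    """An acks=all write succeeds only while the ISR is at least min.insync.replicas large."""
    return isr_size >= min_insync_replicas

# Replication factor 3 with min.insync.replicas=2, a common production setting:
assert acks_all_write_accepted(isr_size=3, min_insync_replicas=2)      # healthy cluster
assert acks_all_write_accepted(isr_size=2, min_insync_replicas=2)      # one node partitioned away
assert not acks_all_write_accepted(isr_size=1, min_insync_replicas=2)  # NotEnoughReplicasException
```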
If the partition leader is in the minority partition, then Kafka
will attempt to elect a new leader from one of the followers in the
majority partition. Any follower that is still in the ISR is in
principle eligible for clean leader election, but only followers in the
majority partition remain reachable by the controller and can actually
be elected. If no reachable follower is in the ISR, then whether a leader
can be elected depends on whether unclean.leader.election.enable=true
is set. If a new leader is elected, client operations will proceed with
the new leader. If no leader is elected, then the partition becomes
unavailable for both reads and writes and clients may fail with e.g.
LeaderNotAvailableException. Note that if the leader and all of its
followers are in the minority partition, no new leader can be
elected in the majority partition. If there is permanent loss of one or
more nodes, a permanent partition reassignment is necessary. The
bin/kafka-reassign-partitions.sh tool allows you to reassign
partitions to different brokers or disks.
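For the permanent-loss case, the tool consumes a JSON plan; a sketch of generating one (the topic name and broker IDs are made up):

```python
import json

def reassignment_plan(topic: str, moves: dict) -> str:
    """Build the JSON plan consumed by kafka-reassign-partitions.sh --execute.

    moves maps a partition number to its desired replica list (broker IDs)."""
    return json.dumps({
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": p, "replicas": replicas}
            for p, replicas in sorted(moves.items())
        ],
    }, indent=2)

# Move partition 0 of a hypothetical topic off lost broker 3 onto broker 4:
plan = reassignment_plan("events", {0: [1, 2, 4]})
```

You would write this string to a file and pass it via --reassignment-json-file.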
With the most typical configurations, former leaders in the minority partition will soon stop acting as leaders once they can no longer reach the controller quorum, and from that point on clients in the minority partition will receive NotLeaderForPartitionException or similar errors.
When a network partition in a Kafka cluster causes a leader change, the old leader may hold records that were never replicated to the new leader. This can happen because the followers in the majority partition lagged behind at the moment of the partition, or because the former leader kept accepting writes from within the minority partition for a while. When the network heals, the old leader rejoins as a follower: it truncates any uncommitted messages that diverged from the new leader's log, and the new leader's log becomes the source of truth.
The truncation of commit logs on nodes that were in the minority partition, and the possible election of non-ISR followers, have interesting implications for consumers: a consumer may have read messages that, after reconciliation, no longer exist in the log.
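A sketch of the resulting consumer-side hazard (pure logic, not a real client API): after truncation, the new log end can be behind the consumer's position, and the client then applies its auto.offset.reset policy.

```python
def resolve_position_after_truncation(consumer_offset: int, log_end_offset: int,
                                      auto_offset_reset: str = "latest") -> int:
    """Decide where a consumer lands when its offset no longer exists in the log."""
    if consumer_offset <= log_end_offset:
        return consumer_offset  # position still valid
    if auto_offset_reset == "latest":
        return log_end_offset   # skip to the new end; the already-read messages are gone
    if auto_offset_reset == "earliest":
        return 0                # re-read from the beginning of the log
    raise ValueError("out-of-range offset with auto.offset.reset=none")

# The consumer had read up to offset 120, but after truncation the log ends at 100:
assert resolve_position_after_truncation(120, 100) == 100
```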