Kafka ZooKeeper to KRaft: The Next Chapter in Apache Kafka (MSK and Kafka 3.9 Support)

TL;DR  

Kafka relies on consensus for coordination, leader election, and metadata consistency. ZooKeeper handled this but added complexity and scaling limits. KRaft embeds Raft directly in Kafka, offering simpler operations, faster failover, and higher scalability. For MSK, migrate by creating a new KRaft cluster, replicating data, updating clients, testing, cutting over, and decommissioning the old cluster.


Why Kafka Needs Consensus (and Why It’s Hard)  

For years, ZooKeeper has been a necessary but operationally heavy dependency for Kafka. With Kafka 3.9, that finally changes, especially for Amazon MSK users. 

At the heart of any distributed system lies the problem of agreement: multiple independent components must coordinate and agree on the system’s state even in the face of failures, network partitions, or latency.  

This is critical for tasks such as: 

  • Determining leader nodes for managing shared responsibilities (diagram below). 
  • Ensuring that all nodes agree on cluster membership. 
  • Maintaining consistent metadata about topics, partitions, and configuration. 
[Figure: Consensus diagram showing leader node selection]

Without a reliable consensus algorithm, systems can suffer from:

  • Split-brain scenarios 
  • Inconsistent state 
  • Data loss 


Protocols such as Raft and Paxos were introduced to guarantee safety, liveness, and fault tolerance in distributed environments, ensuring a single source of truth despite failures and concurrent updates.

In Apache Kafka, metadata such as topic configurations, partition placements, and ACLs must be shared and agreed upon by all brokers. Achieving this efficiently and reliably is what a consensus layer enables.

 

What ZooKeeper Did for Kafka (and Where It Fell Short) 

From Kafka’s early versions up through 3.9, Apache ZooKeeper served as the authoritative consensus and coordination layer.  

ZooKeeper’s responsibilities included: 

  • Maintaining cluster membership (which brokers are alive). 
  • Performing controller election to select a leader for cluster coordination. 
  • Storing critical metadata about topics, partitions, and replicas. 

ZooKeeper uses its own consensus algorithm (Zab) to replicate state across an ensemble. For many years, Kafka depended on this external ensemble to bootstrap and coordinate distributed state, which worked exceptionally well, but introduced operational complexity and scaling limitations as Kafka clusters grew. 

Key operational challenges of ZooKeeper in Kafka included: 

  • Separate system to manage: Operators needed to configure, monitor, and maintain a ZooKeeper cluster in addition to Kafka. 
  • Metadata bottleneck: ZooKeeper’s ability to serve metadata was a limiting factor for cluster scaling as partition counts grew. 
  • Failover complexity: Controller election and metadata propagation relied on cross-system interactions, potentially slowing down recovery.

“KRaft embeds Raft consensus directly within Kafka brokers—no ZooKeeper, faster failover, higher scalability.”

What Is KRaft and How It Replaces ZooKeeper 

KRaft stands for Kafka Raft, an internal consensus protocol that replaces ZooKeeper within Kafka’s architecture. 

Key Features of KRaft 

Integrated Consensus Within Kafka 

KRaft embeds the consensus mechanism directly within Kafka brokers. A group of controller nodes forms a Raft quorum, responsible for metadata storage and replication. Metadata is stored as a special Kafka topic (e.g., __cluster_metadata), and changes are replicated using Raft instead of external ZooKeeper ensembles.
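
To make this concrete, the Kafka Admin API (Java clients 3.3 and later, via KIP-836) can describe the metadata quorum directly. Below is a minimal sketch; the bootstrap address is a placeholder for your own cluster:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class QuorumCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder: point this at your cluster's bootstrap brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Describe the KRaft metadata quorum: current Raft leader and voter state.
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("Controller (Raft leader) id: " + quorum.leaderId());
            for (QuorumInfo.ReplicaState voter : quorum.voters()) {
                System.out.println("Voter " + voter.replicaId()
                        + " log end offset: " + voter.logEndOffset());
            }
        }
    }
}
```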

Simplified Architecture 

Kafka no longer requires a separate ZooKeeper ensemble. This reduces operational complexity and lowers the maintenance burden associated with a separate distributed coordination service. 

Improved Scalability 

With Raft, Kafka clusters can scale beyond the partition and broker limits imposed by ZooKeeper-based metadata bottlenecks. In Amazon MSK, for example, KRaft mode enables up to 60 brokers per cluster, compared to a default of 30 brokers in ZooKeeper mode.

Faster Failover and Recovery 

Raft’s built-in leader election and metadata replication provide quicker controller failover and metadata propagation, improving availability during broker restarts and topology changes. 

Unified Metadata Handling 

Metadata is treated as just another Kafka log, leveraging Kafka’s replication, partitioning, and log management semantics. This improves consistency and throughput for metadata operations. 

 

ZooKeeper vs. KRaft in Apache Kafka 

The differences between ZooKeeper and KRaft become clear when compared side by side. 

| Aspect | ZooKeeper Mode | KRaft Mode |
| --- | --- | --- |
| Architecture | External ZooKeeper cluster manages metadata and leadership. | Integrated Raft quorum inside Kafka brokers manages metadata. |
| Consensus Protocol | ZooKeeper's Zab protocol. | Raft consensus tailored for Kafka. |
| Operational Complexity | Requires managing ZooKeeper nodes separately. | No separate ZooKeeper, so simpler operations. |
| Metadata Storage | Stored in ZooKeeper znodes. | Stored as a Kafka log (__cluster_metadata). |
| Scaling (Broker Count) | Limited (e.g., 30 brokers by default on MSK). | Higher scalability (e.g., 60 brokers on MSK). |
| Performance | Cross-system coordination adds latency. | Metadata local to Kafka brokers, giving lower latency and faster failure recovery. |
| Lifecycle | Dependent on external system health. | Internal and unified with Kafka's lifecycle. |

“Kafka’s move to KRaft simplifies operations while unlocking future innovations for real-time workloads.”

Why KRaft Is the Better Approach 

In summary, Kafka’s move to KRaft offers: 

  • Lower operational overhead: Single system to manage, replacing two. 
  • Improved metadata performance and failover times. 
  • Higher scalability limits, which matter for enterprises with intense streaming workloads. 
  • Simplified client connections: Modern Kafka clients now use bootstrap.servers exclusively, with the older ZooKeeper connection string deprecated (example below). 
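
For example, a producer needs nothing beyond bootstrap.servers to connect. In the minimal sketch below, the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BootstrapProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // bootstrap.servers is the only connection setting required;
        // there is no ZooKeeper connection string to configure.
        props.put("bootstrap.servers", "b-1.example.kafka.amazonaws.com:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created")); // placeholder topic
            producer.flush();
        }
    }
}
```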

This evolution simplifies both development and operations while positioning Kafka for future innovations that depend on a robust internal consensus.

 

Practical Migration Plan for Amazon MSK 

Because ZooKeeper mode and KRaft mode represent fundamentally different metadata architectures, you cannot convert an existing MSK ZooKeeper cluster in place. Instead, follow a practical migration strategy: 

1. Prepare a New MSK KRaft Cluster 

  • Create a new MSK cluster with Kafka version 3.9 (or later), specifying KRaft mode at provisioning. 
  • Configure broker and controller counts based on expected throughput and partition count (a provisioning sketch follows).
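
As a sketch only, provisioning such a cluster with the AWS SDK for Java v2 might look like the following. The cluster name, subnets, security group, and the exact Kafka version string that selects KRaft mode are assumptions; verify them against the current MSK documentation before use:

```java
import software.amazon.awssdk.services.kafka.KafkaClient;
import software.amazon.awssdk.services.kafka.model.CreateClusterResponse;

public class CreateKRaftCluster {
    public static void main(String[] args) {
        try (KafkaClient kafka = KafkaClient.create()) {
            CreateClusterResponse resp = kafka.createCluster(r -> r
                .clusterName("orders-kraft")   // hypothetical name
                .kafkaVersion("3.9.x")         // assumed KRaft-mode version string; confirm in the MSK console
                .numberOfBrokerNodes(3)
                .brokerNodeGroupInfo(b -> b
                    .instanceType("kafka.m5.large")
                    .clientSubnets("subnet-aaaa", "subnet-bbbb", "subnet-cccc") // placeholders
                    .securityGroups("sg-0123456789abcdef0")));                  // placeholder
            System.out.println("Cluster ARN: " + resp.clusterArn());
        }
    }
}
```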

2. Synchronise Data

  • Use MirrorMaker 2 (or similar replication tools) to mirror topics and consumer group state from the old ZooKeeper cluster to the new KRaft cluster. 
  • Validate topic configurations, security settings, and ACLs in the target cluster (see the validation sketch below). 
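
One way to spot-check the result is to compare topic sets and partition counts across the two clusters with the Admin API. The sketch below uses placeholder bootstrap addresses; note that MirrorMaker 2's default replication policy prefixes replicated topic names with the source cluster alias, so names may need mapping before a direct comparison:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class MigrationTopicCheck {
    // Build an AdminClient for the given bootstrap address (placeholder hosts in main()).
    static Admin connect(String bootstrap) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        return Admin.create(props);
    }

    public static void main(String[] args) throws Exception {
        try (Admin source = connect("old-zk-cluster:9092");
             Admin target = connect("new-kraft-cluster:9092")) {

            Set<String> srcTopics = source.listTopics().names().get();
            Set<String> dstTopics = target.listTopics().names().get();

            // Report topics that have not been replicated yet.
            srcTopics.stream()
                     .filter(t -> !dstTopics.contains(t))
                     .forEach(t -> System.out.println("Missing on target: " + t));

            // Compare partition counts for topics present on both sides.
            Set<String> common = new HashSet<>(srcTopics);
            common.retainAll(dstTopics);
            Map<String, TopicDescription> src = source.describeTopics(common).allTopicNames().get();
            Map<String, TopicDescription> dst = target.describeTopics(common).allTopicNames().get();
            for (String t : common) {
                int a = src.get(t).partitions().size();
                int b = dst.get(t).partitions().size();
                if (a != b) System.out.printf("Partition mismatch on %s: source=%d target=%d%n", t, a, b);
            }
        }
    }
}
```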

3. Update Clients

  • Ensure producer and consumer applications are updated to use bootstrap.servers for connection. 

4. Test and Validate

  • Perform extensive integration testing. 
  • Validate lag, throughput, and end-to-end behaviour under load (a lag-check sketch follows). 
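
Consumer lag, for instance, can be computed with the Admin API by subtracting each partition's committed offset from its log-end offset. A minimal sketch, with a placeholder group id and bootstrap address:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "new-kraft-cluster:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the consumer group under test ("orders-service" is a placeholder).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("orders-service")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(request).all().get();

            // Lag = log-end offset minus committed offset, per partition.
            committed.forEach((tp, om) ->
                System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - om.offset()));
        }
    }
}
```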

5. Switch Production Traffic

  • Once validation is complete, switch clients to point to the new cluster.
  • Monitor key health and performance metrics during the cut-over window.

6. Decommission Old Cluster

  • After successful migration and validation, safely decommission the ZooKeeper-based MSK cluster. 

Note: Because you are provisioning a new cluster, expect some planning around capacity, testing, and potential cut-over coordination. This phased approach minimises risk and maximises confidence in the migration. 

 

Closing Thoughts 

Apache Kafka’s shift from ZooKeeper to KRaft represents a pivotal moment in the evolution of distributed streaming platforms. With simplified architecture, better scalability, and faster metadata operations, Kafka becomes easier to operate and more capable of supporting tomorrow’s real-time workloads. 

For AWS MSK users, Kafka 3.9 with KRaft means you can now harness these benefits in a fully managed environment, provided you plan and execute a thoughtful migration strategy. 

Should you choose to adopt KRaft today? Yes, especially for new clusters.  

Cevo can help assess your Kafka readiness for KRaft and design a low-risk MSK migration plan.  Get in touch with our team to find out how.
