Kafka Architecture Migration: Zookeeper to KRaft
- Author: Prathik Shetty (@pshettydev)
- Date: 2025-04-09
- Status: Accepted
Abstract
This document details the architectural decision and implementation process for migrating the Apache Kafka cluster coordination mechanism within the Automation Service from a Zookeeper-based quorum to the Kafka Raft (KRaft) consensus protocol. This change eliminates the external Zookeeper dependency, simplifying the infrastructure stack and leveraging the native consensus mechanism introduced in recent Kafka versions.
1. Introduction
Apache Kafka has historically relied on Apache Zookeeper for critical cluster metadata management, including controller election, topic configuration, and access control lists (ACLs). While robust, this dependency introduces operational complexity, requiring the management and maintenance of a separate distributed system alongside Kafka.
Recent Kafka versions have introduced KRaft, a built-in consensus protocol based on the Raft algorithm. KRaft allows Kafka brokers to manage cluster metadata internally, removing the need for Zookeeper. This migration represents a strategic shift towards a more streamlined, self-contained Kafka deployment.
2. Motivation
The primary motivations for migrating from a Zookeeper-based Kafka deployment to KRaft are:
- Simplified Architecture: Eliminating Zookeeper removes an entire distributed system component, reducing infrastructure footprint, configuration overhead, and potential points of failure.
- Operational Efficiency: Managing a single system (Kafka in KRaft mode) is simpler than managing two separate systems (Kafka and Zookeeper). This simplifies deployment, monitoring, upgrades, and troubleshooting.
- Improved Scalability & Performance: KRaft is designed to handle a larger number of partitions and scale more efficiently than Zookeeper-based metadata management, potentially offering faster controller failover times.
- Future-Proofing: KRaft is the future direction for Kafka cluster management. Adopting it ensures alignment with the latest Kafka advancements and community best practices.
3. Problem Statement / Context
The previous architecture relied on an external Zookeeper ensemble to manage the Kafka cluster's state and metadata. This involved:
- Deploying and configuring separate Zookeeper nodes (or a single node in development environments).
- Configuring Kafka brokers to connect to the Zookeeper ensemble (`KAFKA_ZOOKEEPER_CONNECT`).
- Managing potential inconsistencies or operational issues arising from the interaction between the two separate systems.
- Increased resource consumption due to running an additional service.
This dependency added complexity, particularly for development and testing environments, and represented an operational overhead that could be eliminated with KRaft.
4. Previous Architecture: Zookeeper-based Kafka
In the Zookeeper-based model:
- Metadata Storage: All critical cluster metadata (broker status, topic configurations, partition assignments, ACLs) resided within Zookeeper.
- Controller Election: Zookeeper managed the election process for the Kafka controller broker, which is responsible for managing partition leadership and broker state.
- Broker Discovery: Brokers registered themselves in Zookeeper, allowing them and clients to discover active members of the cluster.
- Configuration: Kafka brokers required the `KAFKA_ZOOKEEPER_CONNECT` setting to locate the Zookeeper ensemble. Clients sometimes also interacted directly with Zookeeper, although this is less common now.
- `docker-compose.yml`: The setup included a dedicated `zookeeper` service, and the `kafka` service had `depends_on: [zookeeper]` and the `KAFKA_ZOOKEEPER_CONNECT` environment variable configured.
Old docker-compose:
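A minimal sketch of the previous Zookeeper-based setup follows; image tags, ports, and hostnames are illustrative assumptions, not taken from the original file:

```yaml
# Illustrative sketch of the Zookeeper-based setup (tags/hosts are assumptions).
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

  kafka:
    image: confluentinc/cp-kafka:7.6.0
    depends_on: [zookeeper]          # Kafka could not start without Zookeeper
    ports:
      - "29092:29092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181   # the dependency removed by KRaft
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,EXTERNAL://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

Note the two coupled services: `kafka` cannot start until `zookeeper` is healthy, which is exactly the operational coupling this migration removes.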
5. New Architecture: KRaft-based Kafka
What is KRaft and How It Works
Kafka KRaft (Kafka Raft) is Apache Kafka's built-in metadata management system that eliminates the dependency on ZooKeeper. The configuration described here sets up a single-node Kafka cluster in KRaft mode.
Key Components and Process Flow:
1. Combined Roles: The configuration runs a single node with both broker and controller roles:
   - Broker Role: handles client requests and manages message storage.
   - Controller Role: manages cluster metadata (replaces ZooKeeper).
2. Initialization Process:
   - Generate a unique cluster ID.
   - Format storage with this cluster ID before first startup.
   - Start Kafka with both the controller and broker roles activated.
3. Communication Flow:
   - External clients connect via the EXTERNAL listener (port 29092).
   - Internal broker communication uses the PLAINTEXT listener (port 9092).
   - Controller communication uses the CONTROLLER listener (port 9093).
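The three listeners above translate into environment variables along these lines (a minimal sketch for a dev setup; the hostnames and the use of the PLAINTEXT security protocol everywhere are assumptions):

```yaml
# Illustrative listener configuration for a single combined broker/controller node.
environment:
  # All sockets the node binds to: internal, controller quorum, and external.
  KAFKA_LISTENERS: PLAINTEXT://kafka:9092,CONTROLLER://kafka:9093,EXTERNAL://0.0.0.0:29092
  # What clients are told to connect to; the CONTROLLER listener is never advertised.
  KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,EXTERNAL://localhost:29092
  KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT
  KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
  KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
```

Note that the controller listener appears in `KAFKA_LISTENERS` but not in `KAFKA_ADVERTISED_LISTENERS`: it is used only by the controller quorum, never by clients.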
In the KRaft-based model:
- Metadata Storage: Cluster metadata is stored internally within a dedicated Kafka topic (`__cluster_metadata`) replicated across a quorum of controller nodes using the Raft consensus protocol.
- Controller Quorum: Instead of a single controller elected via Zookeeper, KRaft uses a quorum of nodes designated with the `controller` role. These nodes manage the cluster state using the Raft algorithm for consensus. Brokers designated with the `broker` role fetch metadata updates directly from the active controller within the quorum. (In our single-node setup, one instance fulfills both the `broker` and `controller` roles.)
- Self-Contained: Kafka operates independently without any external coordination service like Zookeeper.
- Configuration:
  - The `process.roles` property (or `KAFKA_PROCESS_ROLES` env var) defines whether a node acts as a `broker`, a `controller`, or both.
  - `controller.quorum.voters` (or `KAFKA_CONTROLLER_QUORUM_VOTERS`) specifies the nodes participating in the controller quorum.
  - A unique `cluster.id` (or `CLUSTER_ID`) must be generated and assigned to the cluster before its first startup.
  - The Kafka storage directory must be formatted using the `kafka-storage format` command before the first startup.
- `docker-compose.yml`:
  - The `zookeeper` service is removed.
  - The `kafka` service no longer depends on Zookeeper.
  - The `KAFKA_ZOOKEEPER_CONNECT` variable is removed.
  - New KRaft-specific environment variables (`KAFKA_PROCESS_ROLES`, `KAFKA_NODE_ID`, `KAFKA_CONTROLLER_QUORUM_VOTERS`, `KAFKA_LISTENERS`, `KAFKA_CONTROLLER_LISTENER_NAMES`, etc.) are added.
  - Comments highlight the need for manual `CLUSTER_ID` generation and storage formatting before initial startup.
  - A persistent volume (`kafka_data`) is now strongly recommended and configured to ensure metadata stored by KRaft persists across restarts.
  - The `kafka-ui` service configuration is updated to remove the Zookeeper connection string, connecting directly via bootstrap servers.

Flow diagram to better understand the process:
6. Benefits of KRaft Migration
- Reduced Complexity: Single system management simplifies operations.
- Lower Resource Usage: No separate Zookeeper service consuming resources.
- Faster Recovery: Potentially faster controller failover times compared to Zookeeper-based election.
- Enhanced Scalability: Designed to handle significantly more topics and partitions.
- Alignment with Kafka Roadmap: Positions the service to leverage future Kafka enhancements built upon KRaft.
7. Implementation Details & Considerations
KRaft Configuration Variables
| Variable | Description | Importance |
|---|---|---|
| `KAFKA_NODE_ID` | Unique identifier for the node (uses `KAFKA_BROKER_ID` or defaults to 1) | Critical: every node must have a unique ID |
| `KAFKA_PROCESS_ROLES` | Defines the node's responsibilities (`broker`, `controller`) | Critical: determines whether the node acts as broker, controller, or both |
| `KAFKA_CONTROLLER_QUORUM_VOTERS` | List of controller nodes forming the quorum | Critical: defines the consensus group for metadata management |
| `CLUSTER_ID` | Unique identifier for the entire Kafka cluster | Critical: must be generated before first start and remain consistent |
Listener Configuration Variables
| Variable | Description | Importance |
|---|---|---|
| `KAFKA_LISTENERS` | Internal socket addresses Kafka binds to | Critical: defines all communication endpoints |
| `KAFKA_ADVERTISED_LISTENERS` | Addresses clients will use to connect | Critical: must be reachable by clients |
| `KAFKA_LISTENER_SECURITY_PROTOCOL_MAP` | Maps listener names to security protocols | High: defines security for each listener |
| `KAFKA_INTER_BROKER_LISTENER_NAME` | Listener used for broker-to-broker communication | High: must be properly configured for broker communication |
| `KAFKA_CONTROLLER_LISTENER_NAMES` | Listener(s) used for controller communication | High: required for controller quorum operation |
General Kafka Settings
| Variable | Description | Importance |
|---|---|---|
| `KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR` | Replication factor for the consumer offsets topic | Medium: set to 1 for a single node; higher in production |
| `KAFKA_NUM_PARTITIONS` | Default partition count for auto-created topics | Medium: affects parallelism and throughput |
| `KAFKA_AUTO_CREATE_TOPICS_ENABLE` | Whether topics can be auto-created | Medium: convenient in dev but risky in production |
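For a single-node development setup, these general settings might look like the following (the partition count of 3 is an illustrative assumption, not a value from the original file):

```yaml
# Dev-oriented defaults for the general settings above; tune upward for production.
environment:
  KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1   # only one node, so replication factor 1
  KAFKA_NUM_PARTITIONS: 3                     # illustrative default for auto-created topics
  KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"     # convenient locally; disable in production
```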
Port Mappings
| Variable | Description | Importance |
|---|---|---|
| `KAFKA_INTERNAL_PORT` | Maps to the internal broker port (9092) | High: used for internal communication |
| `KAFKA_EXTERNAL_PORT` | Maps to the external client port (29092) | High: how clients connect to Kafka |
Storage Configuration
The volume mapping `kafka_data:/var/lib/kafka/data` is critical: it persists Kafka data between container restarts, which matters even more under KRaft, where Kafka itself stores both message data and cluster metadata.
8. Important Setup Steps
Before first startup:
- Generate a cluster ID using `kafka-storage random-uuid`.
- Set this ID in `KAFKA_CLUSTER_ID`.
- Format storage with `kafka-storage format -t <YOUR_CLUSTER_ID> -c /etc/kafka/kafka.properties`.
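The steps above can be run as a short shell sequence (a sketch, assuming the `kafka-storage` CLI from the Kafka distribution is on the PATH inside the container and the properties file lives at the path shown):

```shell
# 1. Generate a unique cluster ID once, before the first start.
CLUSTER_ID="$(kafka-storage random-uuid)"

# 2. Export it so docker-compose can inject it as KAFKA_CLUSTER_ID.
export KAFKA_CLUSTER_ID="$CLUSTER_ID"

# 3. Format the storage directory with that ID before the first broker start.
kafka-storage format -t "$CLUSTER_ID" -c /etc/kafka/kafka.properties
```

Re-running `kafka-storage format` against an already-formatted directory with a different ID will fail, which is intentional: the cluster ID must remain consistent for the life of the cluster.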
This configuration gives you a single-node Kafka cluster in KRaft mode, which is suitable for development and testing. For production, you would typically have multiple nodes with separated controller and broker roles.
9. Sample Docker Compose
The `docker-compose.yml` for the Kafka image with KRaft configuration:
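A minimal single-node KRaft compose file might look like the following; the image tag, hostnames, and the `kafka-ui` details are illustrative assumptions rather than the exact original file:

```yaml
# Sketch of a single-node KRaft setup (tags/hosts are assumptions).
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports:
      - "29092:29092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller      # combined roles on one node
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      # Generate once with `kafka-storage random-uuid` and keep it stable.
      CLUSTER_ID: ${KAFKA_CLUSTER_ID}
      KAFKA_LISTENERS: PLAINTEXT://kafka:9092,CONTROLLER://kafka:9093,EXTERNAL://0.0.0.0:29092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,EXTERNAL://localhost:29092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    volumes:
      - kafka_data:/var/lib/kafka/data   # persists KRaft metadata and message data

  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    depends_on: [kafka]
    ports:
      - "8080:8080"
    environment:
      KAFKA_CLUSTERS_0_NAME: local
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:9092   # no Zookeeper connection needed

volumes:
  kafka_data:
```

Compared with the old file, there is no `zookeeper` service, no `depends_on: [zookeeper]`, and no `KAFKA_ZOOKEEPER_CONNECT`; the `kafka-ui` service connects purely via bootstrap servers.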
10. Conclusion
The migration from Zookeeper to KRaft modernizes the Automation Service's Kafka infrastructure, simplifying its architecture and operational management. While requiring specific initialization steps (cluster ID generation and storage formatting), the long-term benefits of reduced complexity, improved scalability, and alignment with Kafka's future direction justify the transition.