Cluster Balancing

Cluster balancing is crucial for optimal performance. Unbalanced clusters can saturate resources on one or more brokers, impacting throughput and latency. Furthermore, a cluster with replicas on a down broker risks availability loss if more brokers fail, and a cluster that keeps losing brokers without healing eventually risks data loss. Redpanda provides various topic-aware tools to balance clusters for best performance.

Topic-aware data balancer Description

Partition leadership balancing

This balancer transfers the leadership of a broker’s partitions to other replicas to avoid topic leadership hotspots on one or a few specific brokers in your cluster.

The partition leader regularly sends heartbeats to its followers. If a follower does not receive a heartbeat within a timeout, it triggers a new leader election. Redpanda also provides leadership balancing when brokers are added or decommissioned.

Partition replica balancing

This balancer moves partition replicas to avoid topic replica hotspots on one or a few specific brokers in your cluster.

Redpanda prioritizes balancing a topic’s partition replica count evenly across all brokers while it’s balancing the cluster’s overall partition count. Because different topics in a cluster can have vastly different load profiles, this better distributes the workload evenly across brokers.

Redpanda provides partition replica balancing when brokers are added or decommissioned.

Intra-broker partition balancing

This balancer moves partition replicas across CPU cores in an individual broker. Redpanda maintains balanced partition replica assignments between cores to avoid topic hotspots on one or a few specific cores within a broker.

Continuous Intra-Broker Partition Balancing (core_balancing_continuous) requires an enterprise license.

Continuous Data Balancing

This balancer monitors broker and rack availability, as well as disk usage, to avoid topic hotspots when moving data off brokers with fuller disks. Continuous Data Balancing enables self-healing clusters that dynamically balance partitions. It also ensures adherence to rack-aware replica placement policy and self-heals after rack (or availability zone) failure or replacement. This balancer does not keep the relative fullness of each broker within a defined range, it just prevents hitting the fullness threshold of each individual broker.

Continuous Data Balancing requires an enterprise license.

If a topic already has messages and you add partitions, the existing messages won’t be redistributed to the new partitions. If you require messages to be redistributed, then you must create a new topic with the new partition count, then stream the messages from the old topic to the new topic so they are appropriately distributed according to the new partition hashing.

Partition leadership balancing

Every Redpanda topic partition forms a Raft group with a single elected leader. This leader manages all writes for the partition. Raft uses a heartbeat mechanism to maintain leadership authority and to initiate leader elections. The partition leader regularly sends heartbeats (raft_heartbeat_interval_ms) to its followers. If a follower does not receive a heartbeat within a timeout (raft_heartbeat_timeout_ms), it triggers a new leader election. For more information, see Raft consensus algorithm and partition leadership elections.

By default, Redpanda enables topic-aware leadership balancing with the enable_leader_balancer property. Automatic partition leadership balancing improves cluster performance by transferring partition leadership from one broker to other replicas. This transfer changes where data is written to first, but leadership transfer does not involve any data transfer.

The leader_balancer_mode property ensures that each shard in a cluster has an equal number of partitions. It determines the movement of leadership for the replica set of a partition. It supports two modes:

  • random_hill_climbing: This mode randomly searches for potential leadership movements. If it identifies a movement that improves the balance of leaders per shard and the leaders of a given topic per broker, then the leadership change is applied to the cluster. This is the default.

  • greedy_balanced_shards: This mode uses a heuristic approach to find leadership movements that better balance leaders per shard. It applies any beneficial movements it finds.

In addition to the periodic heartbeat, leadership balancing can also occur when a broker restarts or when the controller leader changes (such as when a controller partition changes leader). The controller leader manages the entire cluster. For example, when decommissioning a broker, the controller leader creates a reallocation plan for all partition replicas allocated to that broker. The partition leader then handles the reconfiguration for its Raft group.

Manually change leadership

Despite an even distribution of leaders, sometimes the write pattern is not even across topics, and a set of traffic-heavy partitions could land on one broker and cause a latency spike. For information about metrics to monitor, see Partition health.

To manually change leadership, use the Admin API:

curl -X POST http://<broker_address>:9644/v1/partitions/kafka/<topic>/<partition>/transfer_leadership?target=<destination-broker-id>

For example, to change leadership to broker 2 for partition 0 on topic test:

curl -X POST "http://localhost:9644/v1/partitions/kafka/test/0/transfer_leadership?target=2"
In Kubernetes, run the transfer_leadership request on the Pod that is running the current partition leader.

Partition replica balancing

While leadership balancing doesn’t move any data, partition balancing does move partition replicas to alleviate disk pressure and maintain the configured replication factor across brokers and the additional redundancy across failure domains (such as racks). Depending on the data volume, this process may take some time. Partition balancing is invoked periodically as determined by the partition_autobalancing_tick_interval_ms property.

For predictable and stable performance, Redpanda ensures an even distribution of a topic’s partition replicas across all brokers in a cluster. It allocates partitions to random healthy brokers to prevent topic hotspots, without waiting for a batch of moves to finish before scheduling the next batch.

Different topics in a cluster can have vastly different load profiles, but partitions of a single topic are often similar. To distribute the workload of heavily-used topics, Redpanda prioritizes an equal count of a topic’s partition replicas on each broker while it aims for an equal overall partition count on each broker.

For example, suppose you have a Redpanda cluster that has 3 brokers and 1 topic, light, with 100 partitions and replication factor=3. Partitions are balanced, so each broker hosts 100 replicas of partitions of topic light. You expand the cluster with 3 new brokers and, before partition balancing moves partitions to the new broker, you create another topic, heavy, which serves 10x more produce/consume requests than light. The heavy topic also has 100 partitions and replication factor=3. With topic-aware partition balancing, when creating heavy, Redpanda prioritizes an equal count of partitions of that topic: brokers 1, 2, and 3 host 100 replicas of light and 50 replicas of heavy, and brokers 4, 5, and 6 host 50 replicas of heavy. Partition balancing then kicks in and evens out the distribution of replicas of light, so that there are 50 replicas of light and 50 replicas of heavy on each broker.

This topic-aware partition balancing is enabled by default with the partition_autobalancing_topic_aware property.

Redpanda supports flexible use of network bandwidth for replicating under-replicated partitions. For example, if only one partition is moving, it can use the entire bandwidth for the broker. Redpanda detects which shards are idle, so other shards can essentially steal bandwidth from them. Total bandwidth is controlled by the raft_learner_recovery_rate property.

Redpanda’s default partition balancing includes the following:

  • When a broker is added to the cluster, some replicas are moved from other brokers to the new broker to take advantage of the additional capacity.

  • When a broker is down for a configured timeout, existing online replicas are used to construct a replacement replica on a new broker.

  • When a broker’s free storage space drops below its low disk space threshold, some of the replicas from the broker with low disk space are moved to other brokers.

Monitoring unavailable brokers lets Redpanda self-heal clusters by moving partitions from a failed broker to a healthy broker. Monitoring low disk space lets Redpanda distribute partitions across brokers with enough disk space. If free disk space reaches a critically low level, Redpanda blocks clients from producing. For information about the disk space threshold and alert, see Handle full disks.

Partition balancing settings

Select your partition balancing setting with the partition_autobalancing_mode property.

Setting Description

node_add

Partition balancing happens when brokers (nodes) are added. To avoid hotspots, Redpanda allocates brokers to random healthy brokers.

This is the default setting.

continuous

Redpanda continuously monitors the cluster for broker failures and high disk usage and automatically redistributes partitions to maintain optimal performance and availability. It also monitors rack availability after failures, and for a given partition, it tries to move excess replicas from racks that have more than one replica to racks where there are none. See Configure Continuous Data Balancing.

This requires an enterprise license.

off

All partition balancing from Redpanda is turned off.

This mode is not recommended for production clusters. Only set to off if you need to move partitions manually.

Intra-broker partition balancing

In Redpanda, every partition replica is assigned to a CPU core on a broker. While Redpanda’s default partition balancing monitors cluster-level events, such as the addition of new brokers or broker failure to balance partition assignments, it does not account for the distribution of partitions within an individual broker.

Topic-aware intra-broker partition balancing allows for dynamically reassigning partitions within a broker. Redpanda prioritizes an even distribution of a topic’s partition replicas across all cores in a broker. If a broker’s core count changes, when the broker starts back up, Redpanda can check partition assignments across the broker’s cores and reassign partitions, so that a balanced assignment is maintained across all cores. Redpanda can also check partition assignments when partitions are added to or removed from a broker, and rebalance the remaining partitions between cores.

To determine when to use intra-broker partition balancing, use the public metrics for CPU usage described in the Monitoring guide.

Configure the following properties to trigger intra-broker partition balancing:

Cluster configuration property Description

core_balancing_on_core_count_change

Set to true to rebalance partition assignments across cores after broker startup, if core count increases or decreases. Default value: true.

core_balancing_continuous

Set to true to rebalance partition assignments across cores in runtime, for example when partitions are moved to or away from brokers. Default value: false.

This requires an enterprise license.

You can also manually trigger intra-broker partition balancing with the Admin API:

curl -X POST http://localhost:9644/v1/partitions/rebalance_cores

To check the new partition assignments, make a GET request to the /v1/partitions Admin API endpoint:

curl http://localhost:9644/v1/partitions

Manually move partitions

As an alternative to Redpanda partition balancing, you can change partition assignments explicitly with rpk cluster partitions move.

To reassign partitions with rpk:

  1. Set the partition_autobalancing_mode property to off. If Redpanda partition balancing is enabled, Redpanda may change partition assignments regardless of what you do with rpk.

    rpk cluster config set partition_autobalancing_mode off
  2. Show initial replica sets. For example, for topic test:

    rpk topic describe test -p
    PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
    0          1       1      [1 2 3]   0                 645
    1          1       1      [0 1 2]   0                 682
    2          3       1      [0 1 3]   0                 672
  3. Change partition assignments. For example, to change the replica set of partition 1 from [0 1 2] to [3 1 2], and to change the replica set of partition 2 from [0 1 3] to [2 1 3], run:

    rpk cluster partitions move test -p 1:3,1,2 -p 2:2,1,3
    NAMESPACE  TOPIC  PARTITION  OLD-REPLICAS     NEW-REPLICAS      ERROR
    kafka      test   1          [0-1, 1-1, 2-0]  [1-1, 2-0, 3-0]
    kafka      test   2          [0-0, 1-0, 3-1]  [1-0, 2-0, 3-1]
    
    Successfully began 2 partition movement(s).
    
    Check the movement status with 'rpk cluster partitions move-status' or see new assignments with 'rpk topic describe -p TOPIC'.

    or

    rpk cluster partitions move -p test/1:3,1,2 -p test/2:2,1,3
  4. Verify that the reassignment is complete with move-status:

    rpk cluster partitions move-status
    ONGOING PARTITION MOVEMENTS
    ===========================
    NAMESPACE-TOPIC  PARTITION  MOVING-FROM  MOVING-TO  COMPLETION-%  PARTITION-SIZE  BYTES-MOVED  BYTES-REMAINING
    kafka/test       1          [0 1 2]      [1 2 3]    57            87369012        50426326     36942686
    kafka/test       2          [0 1 3]      [1 2 3]    52            83407045        43817575     39589470

    Alternatively, run rpk topic describe again to show your reassigned replica sets:

    rpk topic describe test -p
    PARTITION  LEADER  EPOCH  REPLICAS  LOG-START-OFFSET  HIGH-WATERMARK
    0          1       2      [1 2 3]   0                 645
    1          1       2      [1 2 3]   0                 682
    2          3       1      [1 2 3]   0                 672

    To cancel all in-progress partition reassignments, run move-cancel:

    rpk cluster partitions move-cancel

    To cancel specific movements to or from a given node, run:

    rpk cluster partitions move-cancel --node 2
If you prefer, Redpanda also supports the use of the AlterPartitionAssignments Kafka API and using standard kafka tools such as kafka-reassign-partitions.sh.

Differences in partition balancing between Redpanda and Kafka

  • In a partition reassignment, you must provide the broker ID for each replica. Kafka validates the broker ID for any new replica that wasn’t in the previous replica set against the list of alive brokers. Redpanda validates all replicas against the list of alive brokers.

  • When there are two identical partition reassignment requests, Kafka cancels the first one without returning an error code, while Redpanda rejects the second one with Partition configuration update in progress or update_in_progress.

  • In Kafka, attempts to add partitions to a topic during in-progress reassignments result in a reassignment_in_progress error, while Redpanda successfully adds partitions to the topic.

  • Kafka doesn’t support shard-level (core) partition assignments, but Redpanda does. For help specifying a shard for partition assignments, see rpk cluster partitions move --help.

Assign partitions at topic creation

To manually assign partitions at topic creation, run:

kafka-topics.sh --create --bootstrap-server 127.0.0.1:9092 --topic custom-assignment --replica-assignment 0:1:2,0:1:2,0:1:2