Decommission Brokers

When you decommission a broker, its partition replicas are reallocated across the remaining brokers and it is removed from the cluster. You may want to decommission a broker in the following circumstances:

  • The broker has lost its storage and you need a new broker with a new node ID (broker ID).

  • You are replacing a broker, for example to upgrade the Linux kernel or to replace the hardware.

  • You are removing a broker to decrease the size of the cluster.

When a broker is decommissioned, it cannot rejoin the cluster. If a broker with the same ID tries to rejoin the cluster, it is rejected.

What happens when a broker is decommissioned?

When a broker is decommissioned, the controller leader creates a reallocation plan for all partition replicas that are allocated to that broker. By default, this reallocation is done in batches of 50 to avoid overwhelming the remaining brokers with Raft recovery. See partition_autobalancing_concurrent_moves.

The reallocation of each partition is translated into a Raft group reconfiguration and executed by the controller leader. The partition leader then handles the reconfiguration for its Raft group. After the reallocation for a partition is complete, it is recorded in the controller log and the status is updated in the topic tables of each broker.

The decommissioning process is successful only when all partition reallocations have been completed successfully. The controller leader polls for the status of all the partition-level reallocations to ensure that everything completes as expected.

During the decommissioning process, new partitions are not allocated to the broker that is being decommissioned. After all the reallocations have been completed successfully, the broker is removed from the cluster.

The decommissioning process is designed to tolerate controller leadership transfers.

This guide uses jq to make parsing JSON output easier. For additional details, see jq downloads.

Should you decommission brokers?

There are several considerations for determining your cluster’s minimum broker count, and whether or not to decommission any brokers. For the purposes of this section, the focus is on a cluster with seven brokers. In subsequent sections, the output from the given commands provides additional details to help you determine the minimum number of brokers.

Availability

You should have a sufficient number of brokers to properly span across each rack or availability zone. Run the following command to determine whether rack awareness is enabled in your cluster:

rpk cluster config get enable_rack_awareness
true

When enabled, you can view which rack each broker is assigned to by running the following command:

rpk cluster info
Example output
CLUSTER
=======
redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c

BROKERS
=======
ID    HOST                          PORT   RACK
0     redpanda-0.testcluster.local  32180  A
1     redpanda-1.testcluster.local  32180  A
4     redpanda-3.testcluster.local  32180  B
5*    redpanda-2.testcluster.local  32180  B
6     redpanda-4.testcluster.local  32180  C
8     redpanda-6.testcluster.local  32180  C
9     redpanda-5.testcluster.local  32180  D

The output shows four racks (A/B/C/D), so you might want to have at least four brokers to make use of all racks.
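
The rack count can also be derived from the command output instead of being read by eye. The following is a minimal sketch, assuming the rpk cluster info layout shown in the example output above (a BROKERS section with the rack in the fourth column). The function name count_racks is a hypothetical helper, not part of rpk.

```shell
# count_racks: hypothetical helper, not part of rpk.
# Reads `rpk cluster info` output on stdin and prints the number of distinct racks.
# Assumes the BROKERS section layout shown above: ID HOST PORT RACK.
count_racks() {
  awk '
    /^BROKERS/               { in_brokers = 1; next }  # start of the broker table
    in_brokers && /^=+$/     { next }                  # underline row
    in_brokers && $1 == "ID" { next }                  # column header row
    in_brokers && NF == 0    { in_brokers = 0 }        # blank line ends the table
    in_brokers && NF >= 4    { racks[$4] = 1 }         # record the RACK column
    END { n = 0; for (r in racks) n++; print n }
  '
}

# Usage (requires a reachable cluster):
#   rpk cluster info | count_racks
```

Rows without a fourth column (for example, when no rack is assigned) are ignored, so a cluster without rack assignments reports 0.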

Rack awareness is just one aspect of availability. Check out High Availability for details on deploying Redpanda for high availability.

Cost

Infrastructure costs increase with each broker, so adding a broker means an additional instance to pay for. In this example we deploy to GKE on seven n2-standard-8 GCP instances. This means that the instance cost of the cluster is around $1.9K per month. Dropping down to 5 brokers would save over $500 per month, and dropping down to 3 brokers would save around $1100 per month. Of course, there are other costs to consider, but they won’t be as impacted by changing the broker count.
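
The savings figures above follow from dividing the monthly instance cost evenly across brokers. The following sketch reproduces that arithmetic; monthly_savings is a hypothetical helper, and the $1,900 input is the approximate figure from this example, not a quoted price.

```shell
# monthly_savings: hypothetical helper for the arithmetic above.
#   $1 = current monthly instance cost (approximate, in dollars)
#   $2 = current broker count
#   $3 = target broker count
# Assumes every broker runs on the same instance type.
monthly_savings() {
  awk -v cost="$1" -v cur="$2" -v tgt="$3" \
    'BEGIN { printf "%.0f\n", cost / cur * (cur - tgt) }'
}

# Usage:
#   monthly_savings 1900 7 5    # scale from seven brokers to five
#   monthly_savings 1900 7 3    # scale from seven brokers to three
```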

Data retention

Local data retention is determined by the storage capacity of each broker and how much data is being produced over a given period (that is, producer throughput). When decommissioning, storage capacity must take into account both the free storage space and the amount of space already used by existing partitions.

Run the following command to determine how much storage is being used (in bytes) on each broker:

rpk cluster logdirs describe --aggregate-into broker
Example output
BROKER  SIZE          ERROR
0       263882790656
1       256177979648
2       257698037504
3       259934992896
4       254087316992
5       258369126144
6       255227998208

The example output shows that each broker contains roughly 240GB of data, which means scaling down to five brokers would require each broker to have at least 337GB to hold current data.
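
The per-broker figure after scaling down can be computed directly from the command output. The following is a minimal sketch, assuming the BROKER/SIZE column layout shown above and perfectly even redistribution. The function name per_broker_gib is hypothetical. Note that the figures in this section line up with the byte counts when a GB is read as 2^30 bytes, so the sketch reports GiB.

```shell
# per_broker_gib: hypothetical helper, not part of rpk.
# Reads `rpk cluster logdirs describe --aggregate-into broker` output on stdin.
#   $1 = target broker count after decommissioning
# Prints the data each remaining broker would hold, in GiB (2^30 bytes),
# assuming the data is spread evenly.
per_broker_gib() {
  awk -v target="$1" '
    # Data rows have a numeric broker ID and a numeric size in bytes.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { total += $2 }
    END { printf "%.1f\n", total / target / (1024 * 1024 * 1024) }
  '
}

# Usage (requires a reachable cluster):
#   rpk cluster logdirs describe --aggregate-into broker | per_broker_gib 5
```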

Throughput is the primary measurement required to calculate future data storage requirements. In the example cluster there is a throughput of 200MB/sec, which means it will generate 0.72TB/hour (or 17.28TB/day, or 120.96TB/wk). Divide this amount by the target number of brokers to get an estimate of how much storage is needed to retain that much data for various periods of time:

Retention  Disk size (on each of the 5 brokers)
30mins     (200MB/sec * 30mins * 1.1) = 0.396TB / 5 brokers = 79.2GB
6hrs       (200MB/sec * 6hrs * 1.1) = 4.752TB / 5 brokers = 950.4GB
1d         (200MB/sec * 1d * 1.1) = 19.008TB / 5 brokers = 3.8TB
3d         (200MB/sec * 3d * 1.1) = 57.024TB / 5 brokers = 11.4TB

In the example cluster, only 6 hours of data locally must be retained (any older data is moved to Tiered Storage with a retention of 1 year). So each broker should have available storage of around 1.2TB, taking into account both throughput and current data.
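
The retention arithmetic in the table above can be reproduced for other throughputs, retention periods, or broker counts. The following is a sketch under the same assumptions as the table: constant throughput, even balancing, decimal units (1GB = 1000MB), and a 1.1 padding factor by default. The function name retention_gb is hypothetical.

```shell
# retention_gb: hypothetical helper for the retention table above.
#   $1 = producer throughput in MB/sec
#   $2 = local retention period in seconds
#   $3 = target broker count
#   $4 = padding factor (optional, defaults to 1.1 as in the table)
# Prints the local disk needed on each broker, in GB (1GB = 1000MB).
retention_gb() {
  awk -v mbps="$1" -v secs="$2" -v brokers="$3" -v pad="${4:-1.1}" \
    'BEGIN { printf "%.1f\n", mbps * secs * pad / brokers / 1000 }'
}

# Usage:
#   retention_gb 200 1800 5     # 30 minutes of retention
#   retention_gb 200 21600 5    # 6 hours of retention
```

Passing a larger fourth argument (for example, 1.3) models extra headroom for hot partitions.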

Cost and use case requirements dictate how much to spend on local disk capacity. Tiered Storage can help to both decrease costs and expand data retention capabilities. For details, see Tiered Storage.

At this point in the example, it remains unclear whether or not it makes sense to scale down to five brokers. Current calculations are based on five brokers. You can consider other broker counts later as needed.

Additionally, these calculations assume constant throughput and perfect data balancing. In practice, throughput fluctuates across partitions, which causes data imbalance. The calculations above account for this by padding disk size by 10% (the 1.1 factor). You can increase this buffer (for example, if you expect hot spot partitions). For details on sizing, see Sizing Guidelines.

Durability

The brokers in a Redpanda cluster are part of a Raft group that requires sufficient brokers to form a quorum-based majority (minimally, three brokers). Each topic’s partitions are also Raft groups, so your cluster also needs to have at least as many brokers as the highest replication factor across all topics. One way to find the highest replication factor across all topics in a cluster is to run the following command:

rpk topic list | tail -n +2 | awk '{print $3}' | sort -n | tail -1
5

In this example the highest replication factor is 5, which means at least 5 brokers are required in this cluster.

Generally, a cluster can withstand a higher number of brokers going down if there are more brokers in the cluster. For details, see Raft consensus algorithm.

Partition count

It is a best practice to make sure the total partition count does not exceed 1K per core. This maximum also depends on many other factors (such as memory per core, CPU performance, throughput, and latency requirements). Exceeding 1K partitions per core can lead to increased latency, more partition leadership elections, and generally reduced stability. Run the following command to get the total partition count:

curl -sk http://<broker-url>:<admin-api-port>/v1/partitions/local_summary | jq .count
3018

To determine the number of cores that are available across the remaining brokers:

rpk redpanda admin brokers list
Example output
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        8          active             true      v23.1.8
1        8          active             true      v23.1.8
2        8          active             true      v23.1.8
3        8          active             true      v23.1.8
4        8          active             true      v23.1.8
5        8          active             true      v23.1.8
6        8          active             true      v23.1.8

In this example each broker has 8 cores available. If you plan to scale down to five brokers, then you would have 40 cores available, which means that your cluster is limited by core count to 40K partitions (well above the current 3018 partitions).
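
The core-based limit can be computed from the broker list while excluding the brokers you plan to remove. The following is a minimal sketch, assuming the NODE-ID/NUM-CORES column layout shown above and the 1K-partitions-per-core guideline. The function name partition_limit is hypothetical.

```shell
# partition_limit: hypothetical helper, not part of rpk.
# Reads `rpk redpanda admin brokers list` output on stdin.
#   $1 = comma-separated IDs of the brokers to be removed, for example "5,6"
# Prints the partition limit for the remaining brokers at 1000 partitions per core.
partition_limit() {
  awk -v remove="$1" '
    BEGIN {
      n = split(remove, ids, ",")
      for (i = 1; i <= n; i++) gone[ids[i]] = 1
    }
    # Data rows start with a numeric node ID; skip brokers marked for removal.
    $1 ~ /^[0-9]+$/ && !($1 in gone) { cores += $2 }
    END { print cores * 1000 }
  '
}

# Usage (requires a reachable cluster):
#   rpk redpanda admin brokers list | partition_limit 5,6
```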

To best ensure the stability of the cluster, stay under 50K partitions per cluster.

Decommission assessment

The considerations tested above yield the following:

  • At least four brokers are required based on availability.

  • Cost is not a limiting factor in this example, but a lower broker count reduces cost.

  • At least 1.2TB of data resides on each broker (if spread across five brokers). This falls within the 1.5TB of local storage available in this example.

  • At least five brokers are required based on the highest replication factor across all topics.

  • At 3018 partitions, the partition count is so low as to not be a determining factor in broker count (a single broker in this example environment could handle many more partitions).

The limiting factor is therefore the replication factor of five, which means that you can scale down to a minimum of five brokers.
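
The assessment above amounts to taking the largest of the individual lower bounds, with the three-broker quorum minimum as a floor. The following sketch illustrates that reasoning; min_brokers is a hypothetical helper, and storage and partition count still need to be checked separately against the resulting broker count.

```shell
# min_brokers: hypothetical helper for the assessment above.
# Arguments: one lower bound per consideration (for example, the rack count
# and the highest replication factor).
# Prints the largest bound, never less than the three brokers needed for quorum.
min_brokers() {
  min=3
  for n in "$@"; do
    if [ "$n" -gt "$min" ]; then
      min=$n
    fi
  done
  echo "$min"
}

# Usage:
#   min_brokers 4 5    # four racks, highest replication factor of five
```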

Decommission a broker

  1. List your brokers and their associated broker IDs:

    rpk cluster info \
      -X brokers=<broker-url>:<kafka-api-port>

  2. Decommission the broker with your selected broker ID:

    rpk redpanda admin brokers decommission <broker-id> \
      --hosts <broker-url>:<admin-api-port> \
      --force

    The --force flag is required only if the broker is not running.

    If you see the message "Success, broker <broker-id> has been decommissioned!", the broker is decommissioned. Otherwise, the decommissioning process is still in progress. You can monitor the decommissioning status to follow its progress.

  3. Monitor the decommissioning status:

    rpk redpanda admin brokers decommission-status <broker-id> \
      -X admin.hosts=<broker-url>:<admin-api-port>

    The output uses cached cluster health data that is refreshed every 10 seconds.

    When the completion column for all rows is 100%, the broker is decommissioned.

If you add a new broker, make sure to give it a unique ID. Do not reuse the ID of the decommissioned broker.

Troubleshooting

If the decommissioning process is not making progress, investigate the following potential issues:

  • Absence of a controller leader or partition leader: The controller leader serves as the orchestrator for decommissioning. Additionally, if one of the partitions undergoing reconfiguration does not have a leader, the reconfiguration process may stall. Make sure that an elected leader is present for all partitions.

  • Bandwidth limitations for partition recovery: Try increasing the value of raft_learner_recovery_rate, and monitor the status using the redpanda_raft_recovery_partition_movement_available_bandwidth metric.

If these steps do not allow the decommissioning process to complete, enable TRACE level logging on the controller leader to investigate any other issues.