Decommission Brokers in Kubernetes

When you decommission a broker, its partition replicas are reallocated across the remaining brokers and it is removed from the cluster. You may want to decommission a broker in the following circumstances:

  • The broker has lost its storage and you need a new broker with a new node ID (broker ID).

  • You are replacing a worker node, for example to upgrade the Kubernetes cluster or to replace the hardware.

  • You are removing a broker to decrease the size of the cluster.

When a broker is decommissioned, it cannot rejoin the cluster. If a broker with the same ID tries to rejoin the cluster, it is rejected.

What happens when a broker is decommissioned?

When a broker is decommissioned, the controller leader creates a reallocation plan for all partition replicas that are allocated to that broker. By default, this reallocation is done in batches of 50 to avoid overwhelming the remaining brokers with Raft recovery. See partition_autobalancing_concurrent_moves.
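
You can check or adjust the number of concurrent partition moves with rpk cluster config. The following is a minimal sketch; 50 is the default value, and you should only raise it if the remaining brokers have headroom for the extra Raft recovery traffic:

rpk cluster config get partition_autobalancing_concurrent_moves
rpk cluster config set partition_autobalancing_concurrent_moves 50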

The reallocation of each partition is translated into a Raft group reconfiguration and executed by the controller leader. The partition leader then handles the reconfiguration for its Raft group. After the reallocation for a partition is complete, it is recorded in the controller log and the status is updated in the topic tables of each broker.

The decommissioning process is successful only when all partition reallocations have been completed successfully. The controller leader polls for the status of all the partition-level reallocations to ensure that everything completes as expected.

During the decommissioning process, new partitions are not allocated to the broker that is being decommissioned. After all the reallocations have been completed successfully, the broker is removed from the cluster.
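
If you want to watch the reallocations as they progress, you can poll the decommission status for the broker (the same command is shown with its full kubectl invocation later in this guide):

rpk redpanda admin brokers decommission-status <broker-id> \
  --api-urls <broker-url>:<admin-api-port>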

The decommissioning process is designed to tolerate controller leadership transfers.

This guide uses jq to make parsing JSON output easier. For details, see jq downloads.

Should you decommission brokers?

There are several considerations for determining your cluster's minimum broker count and whether to decommission any brokers. This section uses a cluster with seven brokers as an example. In the following sections, the output from the given commands provides additional details to help you determine the minimum number of brokers.

Availability

You should have a sufficient number of brokers to properly span across each rack or availability zone. Run the following command to determine whether rack awareness is enabled in your cluster:

rpk cluster config get enable_rack_awareness
true
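
If rack awareness is not yet enabled and your brokers are assigned to racks, you can turn it on before reasoning about availability:

rpk cluster config set enable_rack_awareness true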

When enabled, you can view which rack each broker is assigned to by running the following command:

rpk cluster info
Example output
CLUSTER
=======
redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c

BROKERS
=======
ID    HOST                          PORT   RACK
0     redpanda-0.testcluster.local  32180  A
1     redpanda-1.testcluster.local  32180  A
4     redpanda-3.testcluster.local  32180  B
5*    redpanda-2.testcluster.local  32180  B
6     redpanda-4.testcluster.local  32180  C
8     redpanda-6.testcluster.local  32180  C
9     redpanda-5.testcluster.local  32180  D

The output shows four racks (A/B/C/D), so you might want to have at least four brokers to make use of all racks.
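
As a convenience, you can count the distinct racks directly from the rpk cluster info output. This one-liner is a sketch that assumes the four-column BROKERS table shown above:

rpk cluster info | awk 'NF == 4 && $1 != "ID" {print $4}' | sort -u | wc -l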

Rack awareness is just one aspect of availability. Refer to High Availability for more details on deploying Redpanda for high availability.

Cost

Infrastructure costs increase with each broker, so adding a broker means an additional instance to pay for. In this example, the cluster is deployed to GKE on n2-standard-8 GCP instances, which puts the instance cost of the seven-broker cluster at $1925 per month (roughly $275 per broker). Dropping down to 5 brokers would save $550 per month, and dropping down to 3 brokers would save $1100 per month. There are other costs to consider, but they are not as directly affected by the broker count.
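
A quick back-of-the-envelope check of those numbers (the ~$275 per-instance figure below is derived from the $1925 total in this example, not a quoted GCP price):

PER_INSTANCE=275   # approximate monthly cost per n2-standard-8 instance in this example
for BROKERS in 7 5 3; do
  echo "$BROKERS brokers: \$$((BROKERS * PER_INSTANCE)) per month"
done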

Data retention

Local data retention is determined by the storage capacity of each broker and how much data is being produced over a given period (that is, producer throughput). When decommissioning, storage capacity must take into account both the free storage space and the amount of space already used by existing partitions.

Run the following command to determine how much storage is being used (in bytes) on each broker:

rpk cluster logdirs describe --aggregate-into broker
Example output
BROKER  SIZE          ERROR
0       263882790656
1       256177979648
2       257698037504
3       259934992896
4       254087316992
5       258369126144
6       255227998208

The example output shows that each broker contains roughly 240GB of data. This means scaling down to five brokers would require each broker to have at least 337GB to hold current data.
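
To compute this yourself, total the reported sizes and divide by the target broker count. The following sketch assumes the SIZE column is in bytes, as in the output above:

rpk cluster logdirs describe --aggregate-into broker | tail -n +2 \
  | awk '{ sum += $2 } END { printf "total: %.0f GiB, per broker across 5 brokers: %.0f GiB\n", sum/2^30, sum/2^30/5 }'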

Keep in mind that actual space used on disk will be greater than the data size reported by Redpanda. Redpanda reserves a certain amount of data on disk per partition, and reserves less space per partition as available disk space decreases. Incoming data for each partition is then written to disk in the form of segments (actual files). The time when segments are written to disk is based on a number of factors, including the topic’s segment configuration, broker restarts, and changes in Raft leadership.

Throughput is the primary measurement required to calculate future data storage requirements. In the example cluster there is a throughput of 200MB/sec, which means it will generate 0.72TB/hour (or 17.28TB/day, or 120.96TB/wk). Divide this amount by the target number of brokers to get an estimate of how much storage is needed to retain that much data for various periods of time:

Retention  Disk size (on each of the 5 brokers)
30mins     (200MB/sec * 30mins * 1.1) = 0.396TB / 5 brokers = 79.2GB
6hrs       (200MB/sec * 6hrs * 1.1) = 4.752TB / 5 brokers = 950.4GB
1d         (200MB/sec * 1d * 1.1) = 19.008TB / 5 brokers = 3.8TB
3d         (200MB/sec * 3d * 1.1) = 57.024TB / 5 brokers = 11.4TB
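
The same arithmetic as a small script, in case you want to plug in your own throughput, padding factor, or broker count (the values here mirror the example: 200MB/sec, 10% padding, five brokers):

THROUGHPUT_MB_S=200
PADDING=1.1
BROKERS=5
for HOURS in 0.5 6 24 72; do
  awk -v t="$THROUGHPUT_MB_S" -v p="$PADDING" -v b="$BROKERS" -v h="$HOURS" \
    'BEGIN { printf "%4.1fh retention: %.3f TB per broker\n", h, t*3600*h*p/1e6/b }'
done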

In the example cluster, only six hours of data must be retained locally (any older data is moved to Tiered Storage with a retention of one year). So each broker should have around 1.2TB of available storage (taking into account both throughput and current data).

Cost and use case requirements will dictate how much to spend on local disk capacity. Tiered Storage can help to both decrease costs and expand data retention capabilities. For details, see Tiered Storage.

At this point in the example, it remains unclear whether you can scale down to five brokers. The calculations so far assume five brokers; you can substitute other broker counts as needed.

Additionally, these calculations assume a constant throughput and perfect data balancing. In practice, throughput fluctuates across partitions, which causes data imbalance. The calculations above account for this by padding the disk size by 10%. You can increase this buffer (for example, if you expect hot-spot partitions). For details on sizing, see Sizing Guidelines.

Durability

The brokers in a Redpanda cluster are part of a Raft group that requires at least enough brokers to form a quorum-based majority (a minimum of three brokers). Each topic's partitions are also Raft groups, so your cluster also needs at least as many brokers as the highest replication factor across all topics. To find the highest replication factor across all topics in a cluster, run the following command:

rpk topic list | tail -n +2 | awk '{print $3}' | sort -n | tail -1
5

In this example the highest replication factor is five, which means at least five brokers are required in this cluster.

Generally, a cluster can withstand a higher number of brokers going down if there are more brokers in the cluster. For details, see Raft consensus algorithm.

Partition count

It is a best practice to make sure the total partition count does not exceed 1K per core. This max partition count depends on many other factors (such as memory per core, CPU performance, throughput, and latency requirements). Exceeding 1K partitions per core can lead to increased latency, increased number of partition leadership elections, and general reduced stability. Run the following command to get the total partition count:

curl -sk http://<broker-url>:<admin-api-port>/v1/partitions/local_summary | jq .count
3018

Next, determine the number of cores that are available across the remaining brokers:

rpk redpanda admin brokers list
Example output
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        8          active             true      v23.1.8
1        8          active             true      v23.1.8
2        8          active             true      v23.1.8
3        8          active             true      v23.1.8
4        8          active             true      v23.1.8
5        8          active             true      v23.1.8
6        8          active             true      v23.1.8

In this example each broker has 8 cores available. If you plan to scale down to five brokers, then you would have 40 cores available, which means that your cluster is limited by core count to roughly 40K partitions (well above the current 3018 partitions).
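
To compute this from the command output, you can sum the NUM-CORES column and multiply by 1,000, then subtract the cores of any brokers you plan to remove. This helper is a sketch against the table format shown above:

rpk redpanda admin brokers list | tail -n +2 \
  | awk '{ cores += $2 } END { print cores " cores -> roughly " cores * 1000 " partitions" }'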

To best ensure the stability of the cluster, keep under 50K partitions per cluster.

Decommission assessment

The considerations above yield the following:
  • At least four brokers are required based on availability.

  • Cost is not a limiting factor in this example, but lower cost (and lower broker count) is always best.

  • Each broker needs around 1.2TB of local storage if data is spread across five brokers. This falls within the 1.5TB of local storage available per broker in the example.

  • At least five brokers are required based on the highest replication factor across all topics.

  • At 3018 partitions, the partition count is so low as to not be a determining factor in broker count (a single broker in this example environment could handle many more partitions).

So the primary limiting consideration is the replication factor of five, which means you can scale down to a minimum of five brokers.

Decommission a broker

The preceding sections should help you determine how many brokers you can scale down to. In this example, the cluster started with seven brokers and is scaled down to five.

  1. List your brokers and their associated broker IDs:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster info \
        --brokers <broker-url>:<kafka-api-port> \
        --tls-enabled \
        --tls-truststore <path-to-kafka-api-ca-certificate>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster info \
        --brokers <broker-url>:<kafka-api-port>
    Example output
    CLUSTER
    =======
    redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c
    
    BROKERS
    =======
    ID    HOST                          PORT   RACK
    0     redpanda-0.testcluster.local  32180  A
    1     redpanda-1.testcluster.local  32180  A
    4     redpanda-3.testcluster.local  32180  B
    5*    redpanda-2.testcluster.local  32180  B
    6     redpanda-4.testcluster.local  32180  C
    8     redpanda-6.testcluster.local  32180  C
    9     redpanda-5.testcluster.local  32180  D

    The output shows that the IDs don’t match the StatefulSet ordinal, which appears in the hostname. In this example, two brokers are decommissioned: redpanda-6 (ID 8) and redpanda-5 (ID 9).

    When decommissioning brokers, you may want to choose specific brokers to decommission (for example, to keep brokers spread across racks), but this can be an issue because Redpanda is deployed within a StatefulSet on Kubernetes. The StatefulSet controls which Pods are destroyed, and when scaling down it always removes the Pod with the highest ordinal first. In this example, the StatefulSet ordinals match the number in the hostname, so the first broker to be destroyed when the StatefulSet is scaled down is ID 8 (redpanda-6.testcluster.local). A better approach is to reassign the remaining brokers to appropriate racks after the decommission process is complete.
  2. Decommission the broker with your selected broker ID:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission <broker-id> \
        --admin-api-tls-enabled \
        --admin-api-tls-truststore <path-to-admin-api-ca-certificate> \
        --hosts <broker-url>:<admin-api-port>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission <broker-id> \
        --hosts <broker-url>:<admin-api-port>

    The following message is displayed as soon as the request is accepted, which may be before the decommission process is complete:

    Success, broker <broker-id> has been decommissioned!
    If the broker is not running, use the --force flag.
  3. Monitor the decommissioning status:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission-status <broker-id> \
        --admin-api-tls-enabled \
        --admin-api-tls-truststore <path-to-admin-api-ca-certificate> \
        --api-urls <broker-url>:<admin-api-port>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission-status <broker-id> \
        --api-urls <broker-url>:<admin-api-port>

    The output uses cached cluster health data that is refreshed every 10 seconds. When the completion column for all rows is 100%, the broker is decommissioned.
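
    If you prefer not to re-run the command by hand, a simple polling loop such as the following works. It is a sketch that reuses the same placeholders and polls at the same 10-second cadence as the cached health data (add the TLS flags shown above if your Admin API requires them):

    while true; do
      kubectl -n redpanda exec redpanda-0 -c redpanda -- \
        rpk redpanda admin brokers decommission-status <broker-id> \
          --api-urls <broker-url>:<admin-api-port>
      sleep 10
    done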

    Another way to verify decommission is complete is by running the following command:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster health \
        --admin-api-tls-enabled \
        --admin-api-tls-truststore <path-to-admin-api-ca-certificate> \
        --api-urls <broker-url>:<admin-api-port>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster health \
        --api-urls <broker-url>:<admin-api-port>

    Be sure to verify that the decommissioned broker’s ID does not appear in the list of IDs. In this example, ID 9 is missing, which means the decommission is complete.

    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:                     true
    Controller ID:               0
    All nodes:                   [4 1 0 5 6 8]
    Nodes down:                  []
    Leaderless partitions:       []
    Under-replicated partitions: []
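
    As a shortcut, you can grep for the node list and confirm that the decommissioned ID is gone (this assumes the same output format shown above; add the TLS flags if needed):

    kubectl -n redpanda exec redpanda-0 -c redpanda -- \
      rpk cluster health --api-urls <broker-url>:<admin-api-port> \
      | grep 'All nodes'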
  4. Decommission any other brokers.

    After decommissioning one broker and verifying that the process is complete, either continue decommissioning another broker (by repeating the previous two steps) or continue to the next step. In this example, two brokers (8 and 9) were decommissioned, so the previous two steps were repeated. Repeat as many times as needed to reach the desired number of brokers.

    Be sure to take into account everything in this section, and verify that your cluster and use cases will not be negatively impacted by removing brokers.
  5. Update the StatefulSet replica value.

    The last step is to update the StatefulSet replicas value to reflect the new broker count. In this example the count was updated to five. If you deployed with the Helm chart, run the following command:

    helm upgrade redpanda redpanda/redpanda -n redpanda --wait --reuse-values --set statefulset.replicas=5

    This triggers a rolling restart of each Pod, which is required because each broker's configuration (seed_servers) is updated to reflect the new broker list.
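
    After the upgrade completes, you can confirm that only the expected Pods remain. This is a basic check, assuming the StatefulSet is named redpanda, as the Pod names in this example suggest:

    kubectl -n redpanda get statefulset redpanda
    kubectl -n redpanda get pods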

Troubleshooting

If the decommissioning process is not making progress, investigate the following potential issues:

  • Absence of a controller leader or partition leader: The controller leader serves as the orchestrator for decommissioning. Additionally, if one of the partitions undergoing reconfiguration does not have a leader, the reconfiguration process may stall. Make sure that an elected leader is present for all partitions.

  • Bandwidth limitations for partition recovery: Try increasing the value of raft_learner_recovery_rate, and monitor the status using the redpanda_raft_recovery_partition_movement_available_bandwidth metric.
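
    For example, to raise the recovery rate (the value below is illustrative; the setting is in bytes per second, so 209715200 is 200MiB/sec):

    rpk cluster config set raft_learner_recovery_rate 209715200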

If these steps do not allow the decommissioning process to complete, enable TRACE level logging in the Helm chart to investigate any other issues.
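
One way to enable TRACE logging is through the chart's logging value, assuming your chart version exposes logging.logLevel as in recent releases:

helm upgrade redpanda redpanda/redpanda -n redpanda --wait --reuse-values --set logging.logLevel=trace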

For default values and documentation for configuration options, see the values.yaml file.