Decommission Brokers in Kubernetes

When you decommission a broker, its partition replicas are reallocated across the remaining brokers and it is removed from the cluster. You may want to decommission a broker in the following circumstances:

  • You are removing a broker to decrease the size of the cluster, also known as scaling down.

  • The broker has lost its storage and you need a new broker with a new node ID (broker ID).

  • You are replacing a worker node, for example to upgrade the Kubernetes cluster or to replace the hardware.

When a broker is decommissioned, it cannot rejoin the cluster. If a broker with the same ID tries to rejoin the cluster, it is rejected.

Prerequisites

You must have the following:

  • Kubernetes cluster: Ensure you have a running Kubernetes cluster, either locally, such as with minikube or kind, or remotely.

  • Kubectl: Ensure you have the kubectl command-line tool installed and configured to communicate with your cluster.

  • jq: This guide uses jq to make parsing JSON output easier.

What happens when a broker is decommissioned?

When a broker is decommissioned, the controller leader creates a reallocation plan for all partition replicas that are allocated to that broker. By default, this reallocation is done in batches of 50 to avoid overwhelming the remaining brokers with Raft recovery. See partition_autobalancing_concurrent_moves.

The reallocation of each partition is translated into a Raft group reconfiguration and executed by the controller leader. The partition leader then handles the reconfiguration for its Raft group. After the reallocation for a partition is complete, it is recorded in the controller log and the status is updated in the topic tables of each broker.

The decommissioning process is successful only when all partition reallocations have been completed successfully. The controller leader polls for the status of all the partition-level reallocations to ensure that everything completes as expected.

During the decommissioning process, new partitions are not allocated to the broker that is being decommissioned. After all the reallocations have been completed successfully, the broker is removed from the cluster.

The decommissioning process is designed to tolerate controller leadership transfers.

Should you decommission brokers?

Deciding whether to decommission brokers requires careful evaluation of various factors that contribute to the overall health of your cluster. For the purposes of this section, the focus is on a cluster with seven brokers. In subsequent sections, the output from the given commands provides additional details to help you determine the minimum number of brokers required in a cluster before it’s safe to decommission brokers.

Availability

You should have enough brokers to span across each rack or availability zone. Run the following command to determine whether rack awareness is enabled in your cluster:

rpk cluster config get enable_rack_awareness

When rack awareness is enabled, you can view which rack each broker is assigned to by running the following command:

rpk cluster info
Example output
CLUSTER
=======
redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c

BROKERS
=======
ID    HOST                          PORT   RACK
0     redpanda-0.testcluster.local  32180  A
1     redpanda-1.testcluster.local  32180  A
4     redpanda-3.testcluster.local  32180  B
5*    redpanda-2.testcluster.local  32180  B
6     redpanda-4.testcluster.local  32180  C
8     redpanda-6.testcluster.local  32180  C
9     redpanda-5.testcluster.local  32180  D

The output shows four racks (A/B/C/D), so you might want to have at least four brokers to make use of all racks.
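
As a minimal sketch of the rack count above, the following Python snippet groups the brokers from the example output by rack. The function name brokers_by_rack is hypothetical, and the snippet assumes the whitespace-separated table layout shown in the example output, which is not a stable interface.

```python
from typing import Dict, List

# Broker table copied from the `rpk cluster info` example output above.
BROKERS_TABLE = """\
ID    HOST                          PORT   RACK
0     redpanda-0.testcluster.local  32180  A
1     redpanda-1.testcluster.local  32180  A
4     redpanda-3.testcluster.local  32180  B
5*    redpanda-2.testcluster.local  32180  B
6     redpanda-4.testcluster.local  32180  C
8     redpanda-6.testcluster.local  32180  C
9     redpanda-5.testcluster.local  32180  D
"""


def brokers_by_rack(table: str) -> Dict[str, List[int]]:
    """Group broker IDs by the value in the RACK column."""
    racks: Dict[str, List[int]] = {}
    rows = [line.split() for line in table.splitlines() if line.strip()]
    for fields in rows[1:]:  # skip the header row
        if len(fields) < 4:
            continue  # this broker has no rack assigned
        broker_id = int(fields[0].rstrip("*"))  # drop the "*" marker on the ID
        racks.setdefault(fields[3], []).append(broker_id)
    return racks


racks = brokers_by_rack(BROKERS_TABLE)
print(f"{len(racks)} racks: {racks}")
```

The number of keys in the result is the minimum broker count needed to keep one broker in every rack.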

Rack awareness is just one aspect of availability. Refer to High Availability for more details on deploying Redpanda for high availability.

Cost

Infrastructure costs increase with each broker, so adding a broker means an additional instance cost. For example, if you deploy Redpanda on GKE on n2-standard-8 GCP instances, the instance cost of the cluster is $1925 per month. Reducing the number of brokers to five would save $550 per month, and reducing it further to three brokers would save $1100 per month. You must also consider other costs, but they won’t be as impacted by changing the broker count.
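
The savings figures above follow from dividing the monthly cluster cost evenly across the seven brokers. A minimal Python sketch of that arithmetic, using only the numbers from this example (the function name is hypothetical):

```python
MONTHLY_CLUSTER_COST_USD = 1925  # seven-broker example cluster
CURRENT_BROKERS = 7


def monthly_savings(target_brokers: int) -> float:
    """Instance cost saved per month by scaling down to target_brokers."""
    cost_per_broker = MONTHLY_CLUSTER_COST_USD / CURRENT_BROKERS
    return cost_per_broker * (CURRENT_BROKERS - target_brokers)


for target in (5, 3):
    print(f"{target} brokers: save ${monthly_savings(target):.0f} per month")
```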

Data retention

Local data retention is determined by the storage capability of each broker and producer throughput, which is the amount of data being produced over a given period. When decommissioning, storage capability must take into account both the free storage space and the amount of space already in use by existing partitions.

Run the following command to determine how much storage is being used, in bytes, on each broker:

rpk cluster logdirs describe --aggregate-into broker
Example output
BROKER  SIZE          ERROR
0       263882790656
1       256177979648
2       257698037504
3       259934992896
4       254087316992
5       258369126144
6       255227998208

This example shows that each broker has roughly 240GB of data, about 1,681GB in total (counting 1GB as 1024^3 bytes). This means scaling in to five brokers would require each broker to have at least 337GB to store that same data.
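
The per-broker estimate can be reproduced with a short Python sketch. It assumes the byte counts from the example output, perfectly even balancing, and gigabytes of 1024^3 bytes; the function name is hypothetical.

```python
import math

# Bytes per broker, copied from the `rpk cluster logdirs describe` example.
LOGDIR_BYTES = {
    0: 263882790656,
    1: 256177979648,
    2: 257698037504,
    3: 259934992896,
    4: 254087316992,
    5: 258369126144,
    6: 255227998208,
}
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def required_gb_per_broker(sizes: dict, target_brokers: int) -> int:
    """Storage each broker needs if existing data is spread evenly."""
    total_bytes = sum(sizes.values())
    return math.ceil(total_bytes / target_brokers / BYTES_PER_GB)


current_gb = round(sum(LOGDIR_BYTES.values()) / len(LOGDIR_BYTES) / BYTES_PER_GB)
print(f"current average: {current_gb}GB per broker")
print(f"with five brokers: {required_gb_per_broker(LOGDIR_BYTES, 5)}GB per broker")
```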

Keep in mind that actual space used on disk will be greater than the data size reported by Redpanda. Redpanda reserves some space on disk per partition, and reserves less space per partition as available disk space decreases. Incoming data for each partition is then written to disk in the form of segments (files). The time when segments are written to disk is based on a number of factors, including the topic’s segment configuration, broker restarts, and changes in Raft leadership.

Throughput is the primary measurement required to calculate future data storage requirements. For example, if throughput is at 200MB/sec, the application will generate 0.72TB/hour (or 17.28TB/day, or 120.96TB/wk). Divide this amount by the target number of brokers to get an estimate of how much storage is needed to retain that much data for various periods of time:

Retention  Disk size (on each of the 5 brokers)
30mins     (200MB/sec * 30mins * 1.1) = 0.396TB / 5 brokers = 79.2GB
6hrs       (200MB/sec * 6hrs * 1.1) = 4.752TB / 5 brokers = 950.4GB
1d         (200MB/sec * 1d * 1.1) = 19.008TB / 5 brokers = 3.8TB
3d         (200MB/sec * 3d * 1.1) = 57.024TB / 5 brokers = 11.4TB
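
The retention figures above can be recomputed for other throughputs or broker counts with a small Python sketch. It assumes constant throughput, even balancing, decimal units (1TB = 1,000,000MB), and the same 1.1 padding factor; the function name is hypothetical.

```python
def disk_gb_per_broker(
    throughput_mb_per_sec: float,
    retention_seconds: int,
    brokers: int,
    padding: float = 1.1,
) -> float:
    """Padded local storage needed on each broker, in decimal GB."""
    total_mb = throughput_mb_per_sec * retention_seconds * padding
    return total_mb / brokers / 1000


RETENTION_PERIODS = {
    "30mins": 30 * 60,
    "6hrs": 6 * 60 * 60,
    "1d": 24 * 60 * 60,
    "3d": 3 * 24 * 60 * 60,
}

for label, seconds in RETENTION_PERIODS.items():
    print(f"{label}: {disk_gb_per_broker(200, seconds, 5):.1f}GB per broker")
```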

In the example cluster, only six hours of data must be retained locally. Any older data can be moved to Tiered Storage with a retention of one year. So each broker should have 1.2TB of storage available, taking into account both throughput and current data.

Cost and use case requirements determine how much to spend on local disk capacity. Tiered Storage can help to both decrease costs and expand data retention capabilities.

At this point in the example, it remains unclear whether it is safe to scale down to five brokers. Current calculations are based on five brokers.

Additionally, some assumptions have been made regarding a constant throughput and perfect data balancing. Throughput fluctuates across all partitions, which causes data imbalance. The calculations presented as examples attempt to accommodate this by padding disk size by 10% (the 1.1 multiplier). You can increase this buffer, for example in the case of expected hot spot partitions. For details on sizing, see Sizing Guidelines.

Durability

The brokers in a Redpanda cluster are part of a Raft group that requires at least enough brokers to form a quorum-based majority (a minimum of three brokers). Each topic’s partitions are also Raft groups, so your cluster also needs to have at least as many brokers as the highest replication factor across all topics. To find the maximum replication factor across all topics in a cluster, run the following command:

rpk topic list | tail -n +2 | awk '{print $3}' | sort -n | tail -1

Example output:

5

In this example the highest replication factor is five, which means at least five brokers are required in this cluster.
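
For readers who prefer to post-process the output in a script, the following Python sketch mirrors the awk pipeline above by reading the third column of each row. The topic names and counts in SAMPLE_OUTPUT are hypothetical, and the column layout is assumed from the pipeline (replication factor in column three after one header row).

```python
# Hypothetical `rpk topic list` output; only the column layout matters here.
SAMPLE_OUTPUT = """\
NAME      PARTITIONS  REPLICAS
orders    12          3
payments  6           5
logs      24          3
"""


def max_replication_factor(topic_list_output: str) -> int:
    """Return the largest value in the third column, skipping the header."""
    rows = [line.split() for line in topic_list_output.splitlines() if line.strip()]
    return max(int(fields[2]) for fields in rows[1:])


print(max_replication_factor(SAMPLE_OUTPUT))
```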

Generally, a cluster can withstand a higher number of brokers going down if more brokers exist in the cluster. For details, see Raft consensus algorithm.

Partition count

It is best practice to make sure the total partition count does not exceed 1K per core. This maximum partition count depends on many other factors, such as memory per core, CPU performance, throughput, and latency requirements. Exceeding 1K partitions per core can lead to increased latency, more partition leadership elections, and generally reduced stability.

Run the following command to get the total partition count for your cluster:

curl -sk http://<broker-url>:<admin-api-port>/v1/partitions/local_summary | jq .count

Example output:

3018

Next, determine the number of cores that are available across the remaining brokers:

rpk redpanda admin brokers list
Example output
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        8          active             true      v23.1.8
1        8          active             true      v23.1.8
2        8          active             true      v23.1.8
3        8          active             true      v23.1.8
4        8          active             true      v23.1.8
5        8          active             true      v23.1.8
6        8          active             true      v23.1.8

In this example, each broker has eight cores available. If you plan to scale down to five brokers, then you would have 40 cores available, which means that your cluster is limited by core count to 40K partitions, which exceeds the current 3018 partitions.

To best ensure the stability of the cluster, maintain fewer than 50K partitions per cluster.
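
The per-core check above can be sketched in Python as follows. It uses only the 1K-partitions-per-core guideline and the numbers from this example; the function names are hypothetical.

```python
import math

PARTITIONS_PER_CORE = 1000  # best-practice guideline from this section


def partition_limit(brokers: int, cores_per_broker: int) -> int:
    """Partition count the cluster can host under the per-core guideline."""
    return brokers * cores_per_broker * PARTITIONS_PER_CORE


def min_brokers_for_partitions(partitions: int, cores_per_broker: int) -> int:
    """Fewest brokers that keep the cluster within the per-core guideline."""
    return math.ceil(partitions / (cores_per_broker * PARTITIONS_PER_CORE))


current_partitions = 3018
limit = partition_limit(brokers=5, cores_per_broker=8)
print(f"limit with five brokers: {limit}")
print(f"within limit: {current_partitions <= limit}")
print(f"minimum brokers by partition count: {min_brokers_for_partitions(current_partitions, 8)}")
```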

Decommission assessment

The considerations tested above yield the following for the example case:

  • At least four brokers are required based on availability.

  • Cost is not a limiting factor in this example, but a lower cost and a lower broker count are always preferable.

  • At least 1.2TB of data resides on each broker when spread across five brokers. This falls within the 1.5TB of local storage available in the example.

  • At least five brokers are required based on the highest replication factor across all topics.

  • At 3018 partitions, the partition count is so low as to not be a determining factor in broker count (a single broker in this example environment could handle many more partitions).

So the primary limiting consideration is the replication factor of five, meaning that you could scale down to five brokers at minimum.
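
One way to view the assessment is that each consideration yields a minimum broker count, and the largest of those minimums wins. A minimal Python sketch using the values from this example (the dictionary keys are descriptive labels chosen for illustration):

```python
import math

# Minimum broker count implied by each consideration in this example.
constraints = {
    "raft quorum": 3,
    "availability (one broker per rack)": 4,
    "durability (highest replication factor)": 5,
    "partition count (3018 partitions, 8 cores per broker)": math.ceil(3018 / (8 * 1000)),
}

limiting_factor = max(constraints, key=constraints.get)
min_brokers = constraints[limiting_factor]

# Storage is checked at the chosen broker count rather than solved for.
required_tb_per_broker = 1.2
available_tb_per_broker = 1.5
storage_ok = required_tb_per_broker <= available_tb_per_broker

print(f"minimum brokers: {min_brokers} (limited by {limiting_factor})")
print(f"storage sufficient at that size: {storage_ok}")
```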

Decommission a broker

To decommission a broker, you can use one of the following methods:

  • Use the Decommission controller

  • Manually decommission a broker

This example shows how to scale a cluster from seven brokers to five brokers.

Use the Decommission controller

The Decommission controller is responsible for monitoring the StatefulSet for changes in the number of replicas. When the number of replicas is reduced, the controller decommissions brokers, starting from the highest Pod ordinal, until the number of brokers matches the number of replicas. For example, you have a Redpanda cluster with the following brokers:

ID    HOST
0     redpanda-0.testcluster.local
1     redpanda-1.testcluster.local
4     redpanda-3.testcluster.local
5*    redpanda-2.testcluster.local
6     redpanda-4.testcluster.local
8     redpanda-6.testcluster.local
9     redpanda-5.testcluster.local

The IDs are the broker IDs. The output shows that the IDs don’t match the StatefulSet ordinal, which appears in the hostname. In this example, the Pod with the highest ordinal is redpanda-6 (ID 8).

You cannot choose which broker is decommissioned. Redpanda is deployed as a StatefulSet in Kubernetes. The StatefulSet controls which Pods are destroyed and always starts with the Pod that has the highest ordinal. So the first broker to be destroyed when the controller decommissions the brokers in this example is redpanda-6 (ID 8).

When you reduce the number of replicas, the controller terminates the Pod with the highest ordinal, removes its PVC, and then attempts to set the reclaim policy of the PV to Retain. Finally, the controller waits for the cluster state to become healthy before committing to decommissioning the broker that was running in the terminated Pod.
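
Because broker IDs and Pod ordinals differ, it can help to work out in advance which broker IDs a scale-down removes. The following Python sketch derives that order from the example broker list, assuming the Pod name is the first label of the hostname and ends in the ordinal; the function names are hypothetical.

```python
# Broker ID to hostname, copied from the example broker list above.
BROKERS = {
    0: "redpanda-0.testcluster.local",
    1: "redpanda-1.testcluster.local",
    4: "redpanda-3.testcluster.local",
    5: "redpanda-2.testcluster.local",
    6: "redpanda-4.testcluster.local",
    8: "redpanda-6.testcluster.local",
    9: "redpanda-5.testcluster.local",
}


def pod_ordinal(host: str) -> int:
    """Extract the StatefulSet ordinal from a hostname such as redpanda-6.example."""
    pod_name = host.split(".", 1)[0]
    return int(pod_name.rsplit("-", 1)[1])


def decommission_order(brokers: dict, count: int) -> list:
    """Broker IDs removed when scaling down by count, highest ordinal first."""
    by_ordinal = sorted(
        brokers.items(), key=lambda item: pod_ordinal(item[1]), reverse=True
    )
    return [broker_id for broker_id, _ in by_ordinal[:count]]


print(decommission_order(BROKERS, 2))
```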

Always decommission one broker at a time.
  1. Install the Decommission controller:

    Helm + Operator:

    You can install the Decommission controller as part of the Redpanda Operator or as a sidecar on each Pod that runs a Redpanda broker.

    When you install the controller as part of the Redpanda Operator, it monitors all Redpanda clusters running in the same namespace as the Redpanda Operator.

    If you want the controller to manage only a single Redpanda cluster, install it as a sidecar on each Pod that runs a Redpanda broker, using the Redpanda resource.

    To install the Decommission controller as part of the Redpanda Operator:

    1. Deploy the Redpanda Operator with the Decommission controller:

      helm repo add redpanda https://charts.redpanda.com
      helm upgrade --install redpanda-controller redpanda/operator \
        --namespace <namespace> \
        --set image.tag=v2.1.14-23.3.4 \
        --create-namespace \
        --set additionalCmdFlags={--additional-controllers="decommission"} \
        --set rbac.createAdditionalControllerCRs=true
      • --additional-controllers="decommission": Enables the Decommission controller.

      • rbac.createAdditionalControllerCRs=true: Creates the required RBAC rules for the Redpanda Operator to monitor the StatefulSet and update PVCs and PVs.

    2. Configure a Redpanda resource with seven Redpanda brokers:

      redpanda-cluster.yaml
      apiVersion: cluster.redpanda.com/v1alpha1
      kind: Redpanda
      metadata:
        name: redpanda
      spec:
        chartRef: {}
        clusterSpec:
          statefulset:
            replicas: 7
      • statefulset.replicas: This example starts with a seven-broker Redpanda cluster.

    3. Apply the Redpanda resource:

      kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

    To install the Decommission controller as a sidecar:

    1. Configure a Redpanda resource with the sidecar controller enabled:

      redpanda-cluster.yaml
      apiVersion: cluster.redpanda.com/v1alpha1
      kind: Redpanda
      metadata:
        name: redpanda
      spec:
        chartRef: {}
        clusterSpec:
          statefulset:
            replicas: 7
            sideCars:
              controllers:
                enabled: true
              run:
                - "decommission"
          rbac:
            enabled: true
      • statefulset.replicas: This example starts with a seven-broker Redpanda cluster.

      • statefulset.sideCars.controllers.enabled: Enables the controllers sidecar.

      • statefulset.sideCars.controllers.run: Enables the Decommission controller.

      • rbac.enabled: Creates the required RBAC rules for the controller to monitor the StatefulSet and update PVCs and PVs.

    2. Apply the Redpanda resource:

      kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

    Helm:

    If you deploy the Redpanda Helm chart with Argo CD, you cannot use the Decommission controller.

    Using --values:

    decommission-controller.yaml
    statefulset:
      replicas: 7
      sideCars:
        controllers:
          enabled: true
          run:
            - "decommission"
    rbac:
      enabled: true
    • statefulset.replicas: This example starts with a seven-broker Redpanda cluster.

    • statefulset.sideCars.controllers.enabled: Enables the controllers sidecar.

    • statefulset.sideCars.controllers.run: Enables the Decommission controller.

    • rbac.enabled: Creates the required RBAC rules for the controller to monitor the StatefulSet and update PVCs and PVs.

    Using --set:

    helm upgrade --install redpanda redpanda/redpanda \
      --namespace <namespace> \
      --create-namespace \
      --set statefulset.replicas=7 \
      --set statefulset.sideCars.controllers.enabled=true \
      --set statefulset.sideCars.controllers.run={"decommission"} \
      --set rbac.enabled=true
    • statefulset.replicas: This example starts with a seven-broker Redpanda cluster.

    • statefulset.sideCars.controllers.enabled: Enables the controllers sidecar.

    • statefulset.sideCars.controllers.run: Enables the Decommission controller.

    • rbac.enabled: Creates the required RBAC rules for the controller to monitor the StatefulSet and update PVCs and PVs.

  2. Verify that your cluster is in a healthy state:

    kubectl exec redpanda-0 --namespace <namespace> -- rpk cluster health
  3. Decrease the number of replicas by one:

    Helm + Operator:

    redpanda-cluster.yaml
    apiVersion: cluster.redpanda.com/v1alpha1
    kind: Redpanda
    metadata:
      name: redpanda
    spec:
      chartRef: {}
      clusterSpec:
        statefulset:
          replicas: 6
          sideCars:
            controllers:
              enabled: true
              run:
                - "decommission"
        rbac:
          enabled: true
    kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

    Helm:

    Using --values:

    replicas.yaml
    statefulset:
      replicas: 6
    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
      --values replicas.yaml --reuse-values

    Using --set:

    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
      --set statefulset.replicas=6 \
      --set statefulset.sideCars.controllers.enabled=true \
      --set statefulset.sideCars.controllers.run={"decommission"} \
      --set rbac.enabled=true

    The Decommission controller detects when the number of replicas decreases and decommissions the brokers, starting from the Pod with the highest ordinal. This process triggers a rolling restart of each Pod so that each broker has an up-to-date seed_servers configuration to reflect the new list of brokers.

  4. Verify that your cluster is in a healthy state:

    kubectl exec redpanda-0 --namespace <namespace> -- rpk cluster health

    It may take some time for the Decommission controller to reconcile. You can check the progress by looking at the Decommission controller logs:

    If you’re running the Decommission controller as part of the Redpanda Operator:

    kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>

    If you’re running the Decommission controller as a sidecar:

    kubectl logs <pod-name> --namespace <namespace> -c redpanda-controllers

You can repeat this procedure to scale down to five brokers.
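
The one-broker-at-a-time rule in the procedure above can be expressed as a small loop. This Python sketch is only an outline: set_replicas and is_healthy are hypothetical callables that you would implement yourself, for example by applying the updated replica count and by checking rpk cluster health until the controller has reconciled.

```python
from typing import Callable, List


def scale_down_one_at_a_time(
    current: int,
    target: int,
    set_replicas: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> List[int]:
    """Lower the replica count one step at a time, stopping if unhealthy."""
    applied: List[int] = []
    while current > target:
        if not is_healthy():
            break  # never start another decommission on an unhealthy cluster
        current -= 1
        set_replicas(current)
        applied.append(current)
    return applied


# Dry run with stand-in callables: record the requested replica counts.
requested: List[int] = []
steps = scale_down_one_at_a_time(7, 5, requested.append, lambda: True)
print(steps)
```

In a real script, is_healthy would also need to wait for the previous decommission to finish before returning True.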

Manually decommission a broker

If you don’t want to use the Decommission controller, follow these steps to manually decommission a broker before reducing the number of StatefulSet replicas:

  1. List your brokers and their associated broker IDs:

    kubectl --namespace <namespace> exec -ti redpanda-0 -c redpanda -- \
      rpk cluster info
    Example output
    CLUSTER
    =======
    redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c
    
    BROKERS
    =======
    ID    HOST                          PORT   RACK
    0     redpanda-0.testcluster.local  32180  A
    1     redpanda-1.testcluster.local  32180  A
    4     redpanda-3.testcluster.local  32180  B
    5*    redpanda-2.testcluster.local  32180  B
    6     redpanda-4.testcluster.local  32180  C
    8     redpanda-6.testcluster.local  32180  C
    9     redpanda-5.testcluster.local  32180  D

    The output shows that the IDs don’t match the StatefulSet ordinal, which appears in the hostname. In this example, two brokers will be decommissioned: redpanda-6 (ID 8) and redpanda-5 (ID 9).

    When scaling in a cluster, you cannot choose which broker is decommissioned. Redpanda is deployed as a StatefulSet in Kubernetes. The StatefulSet controls which Pods are destroyed and always starts with the Pod that has the highest ordinal. So the first broker to be destroyed when updating the StatefulSet in this example is redpanda-6 (ID 8).
  2. Decommission the broker with your selected broker ID:

    kubectl --namespace <namespace> exec -ti <pod-name> -c <container-name> -- \
      rpk redpanda admin brokers decommission <broker-id>

    This message is displayed before the decommission process is complete.

    Success, broker <broker-id> has been decommissioned!

    If the broker is not running, use the --force flag.
  3. Monitor the decommissioning status:

    kubectl --namespace <namespace> exec -ti <pod-name> -c <container-name> -- \
      rpk redpanda admin brokers decommission-status <broker-id>

    The output uses cached cluster health data that is refreshed every 10 seconds. When the completion column for all rows is 100%, the broker is decommissioned.

    Another way to verify decommission is complete is by running the following command:

    kubectl --namespace <namespace> exec -ti <pod-name> -c <container-name> -- \
      rpk cluster health

    Be sure to verify that the decommissioned broker’s ID does not appear in the list of IDs. In this example, ID 9 is missing, which means the decommission is complete.

    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:               true
    Controller ID:               0
    All nodes:                   [4 1 0 5 6 8]
    Nodes down:                  []
    Leaderless partitions:       []
    Under-replicated partitions: []
  4. Decommission any other brokers.

    After decommissioning one broker and verifying that the process is complete, continue decommissioning another broker by repeating the previous two steps.

    Be sure to take into account everything in this section, and verify that your cluster and use cases will not be negatively impacted by losing brokers.
  5. Update the StatefulSet replica value.

    The last step is to update the StatefulSet replica value to reflect the new broker count. In this example, the count is updated to five. If you deployed with the Helm chart, run the following command:

    helm upgrade redpanda redpanda/redpanda --namespace <namespace> --wait --reuse-values --set statefulset.replicas=5

    This process triggers a rolling restart of each Pod so that each broker has an up-to-date seed_servers configuration to reflect the new list of brokers.
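
The completion check from step 3 can be automated by parsing the rpk cluster health output. The following Python sketch assumes the "Key: value" layout shown in the example output, which is human-readable text rather than a stable interface; the function names are hypothetical.

```python
# Copied from the `rpk cluster health` example output in step 3.
HEALTH_OUTPUT = """\
CLUSTER HEALTH OVERVIEW
=======================
Healthy:               true
Controller ID:               0
All nodes:                   [4 1 0 5 6 8]
Nodes down:                  []
Leaderless partitions:       []
Under-replicated partitions: []
"""


def parse_health(output: str) -> dict:
    """Split each 'Key: value' line into a dictionary entry."""
    fields = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


def decommission_complete(output: str, broker_id: int) -> bool:
    """True when the cluster is healthy and broker_id is no longer listed."""
    fields = parse_health(output)
    if "All nodes" not in fields:
        raise ValueError("unexpected output: no 'All nodes' line")
    nodes = [int(node) for node in fields["All nodes"].strip("[]").split()]
    return fields.get("Healthy") == "true" and broker_id not in nodes


print(decommission_complete(HEALTH_OUTPUT, 9))
```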

Troubleshooting

If the decommissioning process is not making progress, investigate the following potential issues:

  • Absence of a controller leader or partition leader: The controller leader serves as the orchestrator for decommissioning. Additionally, if one of the partitions undergoing reconfiguration does not have a leader, the reconfiguration process may stall. Make sure that an elected leader is present for all partitions.

  • Bandwidth limitations for partition recovery: Try increasing the value of raft_learner_recovery_rate, and monitor the status using the redpanda_raft_recovery_partition_movement_available_bandwidth metric.

If these steps do not allow the decommissioning process to complete, enable TRACE level logging in the Helm chart to investigate any other issues.

For default values and documentation for configuration options, see the values.yaml file.
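
When judging whether a decommission is stalled or merely bandwidth-bound, a rough lower bound on the time needed can help. This Python sketch assumes raft_learner_recovery_rate is expressed in bytes per second and that it is the only bottleneck; the rate of 100MiB per second is an example value, not a recommendation, and real moves are typically slower.

```python
def min_recovery_seconds(bytes_to_move: int, recovery_rate_bytes_per_sec: int) -> float:
    """Lower bound on the time to move the data at the given recovery rate."""
    if recovery_rate_bytes_per_sec <= 0:
        raise ValueError("recovery rate must be positive")
    return bytes_to_move / recovery_rate_bytes_per_sec


# Data held by broker 0 in the earlier logdirs example.
bytes_on_broker = 263882790656
example_rate = 100 * 1024 * 1024  # assumed example value: 100MiB per second

seconds = min_recovery_seconds(bytes_on_broker, example_rate)
print(f"at least {seconds / 60:.0f} minutes")
```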

Next steps

If you have rack awareness enabled, you may want to reassign the remaining brokers to appropriate racks after the decommission process is complete. See Enable Rack Awareness in Kubernetes.