Decommission Brokers in Kubernetes

When you decommission a broker, its partition replicas are reallocated across the remaining brokers and it is removed from the cluster. You may want to decommission a broker in the following circumstances:

  • The broker has lost its storage and you need a new broker with a new node ID (broker ID).

  • You are replacing a worker node, for example to upgrade the Kubernetes cluster or to replace the hardware.

  • You are removing a broker to decrease the size of the cluster.

When a broker is decommissioned, it cannot rejoin the cluster. If a broker with the same ID tries to rejoin the cluster, it is rejected.

What happens when a broker is decommissioned?

When a broker is decommissioned, the controller leader creates a reallocation plan for all partition replicas that are allocated to that broker. By default, this reallocation is done in batches of 50 to avoid overwhelming the remaining brokers with Raft recovery. See partition_autobalancing_concurrent_moves.
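
You can check or adjust the number of concurrent partition moves with rpk cluster config. The following is a minimal sketch; 50 is the default value, and you should only raise it if the remaining brokers have headroom for the extra Raft recovery traffic:

rpk cluster config get partition_autobalancing_concurrent_moves
rpk cluster config set partition_autobalancing_concurrent_moves 50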

The reallocation of each partition is translated into a Raft group reconfiguration and executed by the controller leader. The partition leader then handles the reconfiguration for its Raft group. After the reallocation for a partition is complete, it is recorded in the controller log and the status is updated in the topic tables of each broker.

The decommissioning process is successful only when all partition reallocations have been completed successfully. The controller leader polls for the status of all the partition-level reallocations to ensure that everything completes as expected.

During the decommissioning process, new partitions are not allocated to the broker that is being decommissioned. After all the reallocations have been completed successfully, the broker is removed from the cluster.
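
If you want to watch the reallocations as they progress, you can poll the decommission status for the broker (the same command is shown with its full kubectl invocation later in this guide):

rpk redpanda admin brokers decommission-status <broker-id> \
  --api-urls <broker-url>:<admin-api-port>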

The decommissioning process is designed to tolerate controller leadership transfers.

This guide uses jq to make parsing JSON output easier. For details, see jq downloads.

Should you decommission brokers?

There are several considerations for determining your cluster's minimum broker count and whether to decommission any brokers. This section uses a cluster with seven brokers as an example. In the following sections, the output from the given commands provides additional details to help you determine the minimum number of brokers.

Availability

You should have a sufficient number of brokers to properly span across each rack or availability zone. Run the following command to determine whether rack awareness is enabled in your cluster:

rpk cluster config get enable_rack_awareness
true
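
If rack awareness is not yet enabled and your brokers are assigned to racks, you can turn it on before reasoning about availability:

rpk cluster config set enable_rack_awareness true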

When enabled, you can view which rack each broker is assigned to by running the following command:

rpk cluster info
Example output
CLUSTER
=======
redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c

BROKERS
=======
ID    HOST                          PORT   RACK
0     redpanda-0.testcluster.local  32180  A
1     redpanda-1.testcluster.local  32180  A
4     redpanda-3.testcluster.local  32180  B
5*    redpanda-2.testcluster.local  32180  B
6     redpanda-4.testcluster.local  32180  C
8     redpanda-6.testcluster.local  32180  C
9     redpanda-5.testcluster.local  32180  D

The output shows four racks (A/B/C/D), so you might want to have at least four brokers to make use of all racks.
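
As a convenience, you can count the distinct racks directly from the rpk cluster info output. This one-liner is a sketch that assumes the four-column BROKERS table shown above:

rpk cluster info | awk 'NF == 4 && $1 != "ID" {print $4}' | sort -u | wc -l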

Rack awareness is just one aspect of availability. Refer to High Availability for more details on deploying Redpanda for high availability.

Cost

Infrastructure costs increase with each broker, so adding a broker means an additional instance to pay for. In this example, the cluster is deployed to GKE on n2-standard-8 GCP instances, which puts the instance cost of the seven-broker cluster at $1925 per month (roughly $275 per broker). Dropping down to 5 brokers would save $550 per month, and dropping down to 3 brokers would save $1100 per month. There are other costs to consider, but they are not as directly affected by the broker count.
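
A quick back-of-the-envelope check of those numbers (the ~$275 per-instance figure below is derived from the $1925 total in this example, not a quoted GCP price):

PER_INSTANCE=275   # approximate monthly cost per n2-standard-8 instance in this example
for BROKERS in 7 5 3; do
  echo "$BROKERS brokers: \$$((BROKERS * PER_INSTANCE)) per month"
done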

Data retention

Local data retention is determined by the storage capacity of each broker and how much data is being produced over a given period (that is, producer throughput). When decommissioning, storage capacity must take into account both the free storage space and the amount of space already used by existing partitions.

Run the following command to determine how much storage is being used (in bytes) on each broker:

rpk cluster logdirs describe --aggregate-into broker
Example output
BROKER  SIZE          ERROR
0       263882790656
1       256177979648
2       257698037504
3       259934992896
4       254087316992
5       258369126144
6       255227998208

The example output shows that each broker contains roughly 240GB of data. This means scaling down to five brokers would require each broker to have at least 337GB to hold current data.
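
To compute this yourself, total the reported sizes and divide by the target broker count. The following sketch assumes the SIZE column is in bytes, as in the output above:

rpk cluster logdirs describe --aggregate-into broker | tail -n +2 \
  | awk '{ sum += $2 } END { printf "total: %.0f GiB, per broker across 5 brokers: %.0f GiB\n", sum/2^30, sum/2^30/5 }'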

Keep in mind that actual space used on disk will be greater than the data size reported by Redpanda. Redpanda reserves a certain amount of data on disk per partition, and reserves less space per partition as available disk space decreases. Incoming data for each partition is then written to disk in the form of segments (actual files). The time when segments are written to disk is based on a number of factors, including the topic’s segment configuration, broker restarts, and changes in Raft leadership.

Throughput is the primary measurement required to calculate future data storage requirements. In the example cluster there is a throughput of 200MB/sec, which means it will generate 0.72TB/hour (or 17.28TB/day, or 120.96TB/wk). Divide this amount by the target number of brokers to get an estimate of how much storage is needed to retain that much data for various periods of time:

Retention  Disk size (on each of the 5 brokers)
30mins     (200MB/sec * 30mins * 1.1) = 0.396TB / 5 brokers = 79.2GB
6hrs       (200MB/sec * 6hrs * 1.1) = 4.752TB / 5 brokers = 950.4GB
1d         (200MB/sec * 1d * 1.1) = 19.008TB / 5 brokers = 3.8TB
3d         (200MB/sec * 3d * 1.1) = 57.024TB / 5 brokers = 11.4TB
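
The same arithmetic as a small script, in case you want to plug in your own throughput, padding factor, or broker count (the values here mirror the example: 200MB/sec, 10% padding, five brokers):

THROUGHPUT_MB_S=200
PADDING=1.1
BROKERS=5
for HOURS in 0.5 6 24 72; do
  awk -v t="$THROUGHPUT_MB_S" -v p="$PADDING" -v b="$BROKERS" -v h="$HOURS" \
    'BEGIN { printf "%4.1fh retention: %.3f TB per broker\n", h, t*3600*h*p/1e6/b }'
done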

In the example cluster, only six hours of data must be retained locally (any older data is moved to Tiered Storage with a retention of one year). So each broker should have around 1.2TB of available storage (taking into account both throughput and current data).

Cost and use case requirements will dictate how much to spend on local disk capacity. Tiered Storage can help to both decrease costs and expand data retention capabilities. For details, see Tiered Storage.

At this point in the example, it remains unclear whether you can scale down to five brokers. The calculations so far assume five brokers; you can substitute other broker counts as needed.

Additionally, these calculations assume a constant throughput and perfect data balancing. In practice, throughput fluctuates across partitions, which causes data imbalance. The calculations above account for this by padding the disk size by 10%. You can increase this buffer (for example, if you expect hot-spot partitions). For details on sizing, see Sizing Guidelines.

Durability

The brokers in a Redpanda cluster are part of a Raft group that requires at least enough brokers to form a quorum-based majority (a minimum of three brokers). Each topic's partitions are also Raft groups, so your cluster also needs at least as many brokers as the highest replication factor across all topics. To find the highest replication factor across all topics in a cluster, run the following command:

rpk topic list | tail -n +2 | awk '{print $3}' | sort -n | tail -1
5

In this example the highest replication factor is five, which means at least five brokers are required in this cluster.

Generally, a cluster can withstand a higher number of brokers going down if there are more brokers in the cluster. For details, see Raft consensus algorithm.

Partition count

It is a best practice to make sure the total partition count does not exceed 1K per core. This max partition count depends on many other factors (such as memory per core, CPU performance, throughput, and latency requirements). Exceeding 1K partitions per core can lead to increased latency, increased number of partition leadership elections, and general reduced stability. Run the following command to get the total partition count:

curl -sk http://<broker-url>:<admin-api-port>/v1/partitions/local_summary | jq .count
3018

Next, determine the number of cores that are available across the remaining brokers:

rpk redpanda admin brokers list
Example output
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        8          active             true      v23.1.8
1        8          active             true      v23.1.8
2        8          active             true      v23.1.8
3        8          active             true      v23.1.8
4        8          active             true      v23.1.8
5        8          active             true      v23.1.8
6        8          active             true      v23.1.8

In this example each broker has 8 cores available. If you plan to scale down to five brokers, then you would have 40 cores available, which means that your cluster is limited by core count to roughly 40K partitions (well above the current 3018 partitions).
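
To compute this from the command output, you can sum the NUM-CORES column and multiply by 1,000, then subtract the cores of any brokers you plan to remove. This helper is a sketch against the table format shown above:

rpk redpanda admin brokers list | tail -n +2 \
  | awk '{ cores += $2 } END { print cores " cores -> roughly " cores * 1000 " partitions" }'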

To best ensure the stability of the cluster, keep under 50K partitions per cluster.

Decommission assessment

The considerations above yield the following:
  • At least four brokers are required based on availability.

  • Cost is not a limiting factor in this example, but lower cost (and lower broker count) is always best.

  • Each broker needs around 1.2TB of local storage if data is spread across five brokers. This falls within the 1.5TB of local storage available per broker in the example.

  • At least five brokers are required based on the highest replication factor across all topics.

  • At 3018 partitions, the partition count is so low as to not be a determining factor in broker count (a single broker in this example environment could handle many more partitions).

So the primary limiting consideration is the replication factor of five, which means you can scale down to a minimum of five brokers.

Decommission a broker

The preceding sections should help you determine how many brokers you can scale down to. In this example, the cluster started with seven brokers and is scaled down to five.

  1. List your brokers and their associated broker IDs:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster info \
        --brokers <broker-url>:<kafka-api-port> \
        --tls-enabled \
        --tls-truststore <path-to-kafka-api-ca-certificate>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster info \
        --brokers <broker-url>:<kafka-api-port>
    Example output
    CLUSTER
    =======
    redpanda.560e2403-3fd6-448c-b720-7b456d0aa78c
    
    BROKERS
    =======
    ID    HOST                          PORT   RACK
    0     redpanda-0.testcluster.local  32180  A
    1     redpanda-1.testcluster.local  32180  A
    4     redpanda-3.testcluster.local  32180  B
    5*    redpanda-2.testcluster.local  32180  B
    6     redpanda-4.testcluster.local  32180  C
    8     redpanda-6.testcluster.local  32180  C
    9     redpanda-5.testcluster.local  32180  D

    The output shows that the IDs don’t match the StatefulSet ordinal, which appears in the hostname. In this example, two brokers are decommissioned: redpanda-6 (ID 8) and redpanda-5 (ID 9).

    When decommissioning brokers, you may want to choose specific brokers to decommission (for example, to keep brokers spread across racks), but this can be an issue because Redpanda is deployed within a StatefulSet on Kubernetes. The StatefulSet controls which Pods are destroyed, and when scaling down it always removes the Pod with the highest ordinal first. In this example, the StatefulSet ordinals match the number in the hostname, so the first broker to be destroyed when the StatefulSet is scaled down is ID 8 (redpanda-6.testcluster.local). A better approach is to reassign the remaining brokers to appropriate racks after the decommission process is complete.
  2. Decommission the broker with your selected broker ID:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission <broker-id> \
        --admin-api-tls-enabled \
        --admin-api-tls-truststore <path-to-admin-api-ca-certificate> \
        --hosts <broker-url>:<admin-api-port>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission <broker-id> \
        --hosts <broker-url>:<admin-api-port>

    The following message is displayed as soon as the request is accepted, which may be before the decommission process is complete:

    Success, broker <broker-id> has been decommissioned!
    If the broker is not running, use the --force flag.
  3. Monitor the decommissioning status:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission-status <broker-id> \
        --admin-api-tls-enabled \
        --admin-api-tls-truststore <path-to-admin-api-ca-certificate> \
        --api-urls <broker-url>:<admin-api-port>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk redpanda admin brokers decommission-status <broker-id> \
        --api-urls <broker-url>:<admin-api-port>

    The output uses cached cluster health data that is refreshed every 10 seconds. When the completion column for all rows is 100%, the broker is decommissioned.
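
    If you prefer not to re-run the command by hand, a simple polling loop such as the following works. It is a sketch that reuses the same placeholders and polls at the same 10-second cadence as the cached health data (add the TLS flags shown above if your Admin API requires them):

    while true; do
      kubectl -n redpanda exec redpanda-0 -c redpanda -- \
        rpk redpanda admin brokers decommission-status <broker-id> \
          --api-urls <broker-url>:<admin-api-port>
      sleep 10
    done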

    Another way to verify decommission is complete is by running the following command:

    • TLS Enabled

    • TLS Disabled

    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster health \
        --admin-api-tls-enabled \
        --admin-api-tls-truststore <path-to-admin-api-ca-certificate> \
        --api-urls <broker-url>:<admin-api-port>
    kubectl -n redpanda exec -ti redpanda-0 -c redpanda -- \
      rpk cluster health \
        --api-urls <broker-url>:<admin-api-port>

    Be sure to verify that the decommissioned broker’s ID does not appear in the list of IDs. In this example, ID 9 is missing, which means the decommission is complete.

    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:                     true
    Controller ID:               0
    All nodes:                   [4 1 0 5 6 8]
    Nodes down:                  []
    Leaderless partitions:       []
    Under-replicated partitions: []
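
    As a shortcut, you can grep for the node list and confirm that the decommissioned ID is gone (this assumes the same output format shown above; add the TLS flags if needed):

    kubectl -n redpanda exec redpanda-0 -c redpanda -- \
      rpk cluster health --api-urls <broker-url>:<admin-api-port> \
      | grep 'All nodes'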
  4. Decommission any other brokers.

    After decommissioning one broker and verifying that the process is complete, either continue decommissioning another broker (by repeating the previous two steps) or continue to the next step. In this example, two brokers (8 and 9) were decommissioned, so the previous two steps were repeated. Repeat as many times as needed to reach the desired number of brokers.

    Be sure to take into account everything in this section, and verify that your cluster and use cases will not be negatively impacted by removing brokers.
  5. Update the StatefulSet replica value.

    The last step is to update the StatefulSet replicas value to reflect the new broker count. In this example the count was updated to five. If you deployed with the Helm chart, run the following command:

    helm upgrade redpanda redpanda/redpanda -n redpanda --wait --reuse-values --set statefulset.replicas=5

    This triggers a rolling restart of each Pod, which is required because each broker's configuration (seed_servers) is updated to reflect the new broker list.
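
    After the upgrade completes, you can confirm that only the expected Pods remain. This is a basic check, assuming the StatefulSet is named redpanda, as the Pod names in this example suggest:

    kubectl -n redpanda get statefulset redpanda
    kubectl -n redpanda get pods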

Troubleshooting

If the decommissioning process is not making progress, investigate the following potential issues:

  • Absence of a controller leader or partition leader: The controller leader serves as the orchestrator for decommissioning. Additionally, if one of the partitions undergoing reconfiguration does not have a leader, the reconfiguration process may stall. Make sure that an elected leader is present for all partitions.

  • Bandwidth limitations for partition recovery: Try increasing the value of raft_learner_recovery_rate, and monitor the status using the redpanda_raft_recovery_partition_movement_available_bandwidth metric.
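
    For example, to raise the recovery rate (the value below is illustrative; the setting is in bytes per second, so 209715200 is 200MiB/sec):

    rpk cluster config set raft_learner_recovery_rate 209715200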

If these steps do not allow the decommissioning process to complete, enable TRACE level logging in the Helm chart to investigate any other issues.
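
One way to enable TRACE logging is through the chart's logging value, assuming your chart version exposes logging.logLevel as in recent releases:

helm upgrade redpanda redpanda/redpanda -n redpanda --wait --reuse-values --set logging.logLevel=trace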

For default values and documentation for configuration options, see the values.yaml file.