Upgrade Redpanda in Kubernetes

To benefit from Redpanda’s new features and enhancements, use rolling upgrades to upgrade to the latest version. New features are available after all brokers (Pods) in the cluster are upgraded and restarted.

Redpanda platform version numbers follow the convention AB.C.D, where AB is the two-digit year, C is the feature release, and D is the patch release. For example, version 22.3.1 indicates the first patch release on the third feature release of the year 2022. Patch releases include bug fixes and minor improvements, with no change to user-facing behavior. New and enhanced features are documented with each feature release.

  • New features are enabled after all brokers (nodes) in the cluster are upgraded. You can stop the upgrade process and roll back to the original version as long as you have not upgraded every broker and restarted the cluster.

  • Redpanda supports upgrading only one sequential feature release at a time. For example, you can upgrade from the 22.2 feature release to 22.3. You cannot skip feature releases.

  • Redpanda supports downgrading only between patch releases of the same feature release. For example, you can downgrade from the 22.2.2 patch release to 22.2.1, but you cannot downgrade to 22.1.7.

  • Tiered Storage: When upgrading to Redpanda 23.2, uploads to object storage are paused until all brokers in the cluster are upgraded. If the cluster gets stuck while upgrading, roll it back to the original version. In a mixed-version state, the cluster could run out of disk space. If you need to force a mixed-version cluster to upload, move partition leadership to brokers running the original version.

  • Remote Read Replicas: Upgrade the Remote Read Replica cluster before upgrading the origin cluster. The Remote Read Replica cluster must run on the same version of Redpanda as the origin cluster, or just one feature release ahead of the origin cluster. When upgrading to Redpanda 23.2, metadata from object storage is not synchronized until all brokers in the cluster are upgraded. If you need to force a mixed-version cluster to sync read replicas, move partition leadership to brokers running the original version.

  • Controller snapshots are disabled in upgraded clusters. To enable them, contact Redpanda Support.

Prerequisites

  • A running Redpanda cluster.

  • jq for listing available versions.

  • An understanding of the impact of broker restarts on clients, node CPU, and any alerting systems you use.

  • Review incompatible changes in new versions (see the next section).

Review incompatible changes

The 22.3.14 and 23.1.2 patch releases changed the behavior when remote read is disabled and the requested Raft term falls below the beginning of the local log. In earlier versions, Redpanda returned an offset of -1. With the patch, when you request a value older than the lowest offset, Redpanda returns the lowest offset instead of -1.

Find a new version

Before you perform a rolling upgrade, you must find out which Redpanda version you are currently running, whether you can upgrade straight to the new version, and what’s changed since your original version.

  1. Find your current version:

    • TLS Enabled

    • TLS Disabled

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk redpanda admin brokers list \
        -X admin.tls.enabled=true \
        -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
        -X admin.hosts=<broker-url>:<admin-api-port>
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk redpanda admin brokers list \
        -X admin.hosts=<broker-url>:<admin-api-port>

    For all available flags, see the rpk redpanda admin brokers list command reference.

    Expected output:

    The Redpanda version for each broker is listed under BROKER-VERSION.

    NODE-ID  BROKER-VERSION
    0        v22.2.10
    1        v22.2.10
    2        v22.2.10
  2. Find the Redpanda version that’s used in the latest Redpanda Helm chart:

    helm repo update && \
    helm show chart redpanda/redpanda | grep appVersion

    Example output:

    appVersion:	v22.2.10

    If your current version is more than one feature release behind the version in the latest Redpanda Helm chart, you must first upgrade to an intermediate version. To list all available versions:

    curl -s 'https://hub.docker.com/v2/repositories/redpandadata/redpanda/tags/?ordering=last_updated&page=1&page_size=50' | jq -r '.results[].name'
  3. Check the release notes to find information about what has changed between Redpanda versions.

Impact of broker restarts

When brokers restart, clients may experience higher latency, nodes may experience CPU spikes when the broker becomes available again, and you may receive alerts about under-replicated partitions. Topics that weren’t using replication (that is, topics that had replication.factor=1) will be unavailable.

Temporary increase in latency on clients (producers and consumers)

When you restart one or more brokers in a cluster, clients (consumers and producers) may experience higher latency due to partition leadership reassignment. Because clients must communicate with the leader of a partition, they may send a request to a broker whose leadership has been transferred, and receive NOT_LEADER_FOR_PARTITION. In this case, clients must request metadata from the cluster to find out the address of the new leader. Clients refresh their metadata periodically, or when the client receives some retryable errors that indicate that the metadata may be stale. For example:

  1. Broker A shuts down.

  2. Client sends a request to broker A, and receives NOT_LEADER_FOR_PARTITION.

  3. Client requests metadata, and learns that the new leader is broker B.

  4. Client sends the request to broker B.
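
To see which broker currently leads each partition, for example to confirm that leadership has moved off a restarting broker, you can inspect a topic's partition metadata. A minimal sketch, assuming a hypothetical topic named my-topic (add the --tls-enabled and --tls-truststore flags shown elsewhere in this topic if TLS is enabled):

kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
  rpk topic describe my-topic --print-partitions

The output includes the current leader for each partition, which is the broker that clients send produce and fetch requests to.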

CPU spikes upon broker restart

When a restarted broker becomes available again, you may see a temporary increase in CPU usage on your nodes while the cluster rebalances the partition replicas.

Under-replicated partitions

When a broker is in maintenance mode, Redpanda continues to replicate updates to that broker. When a broker is taken offline during a restart, partitions with replicas on the broker could become out of sync until it is brought back online. Once the broker is available again, data is copied to its under-replicated replicas until all affected partitions are in sync with the partition leader.

Perform a rolling upgrade

A rolling upgrade involves putting a broker into maintenance mode, upgrading the broker, taking the broker out of maintenance mode, and then repeating the process on the next broker in the cluster. Placing brokers into maintenance mode ensures a smooth upgrade of your cluster while reducing the risk of interruption or degradation in service.

When a broker is placed into maintenance mode, it reassigns its partition leadership to other brokers for all topics that have a replication factor greater than one. Reassigning partition leadership involves draining leadership from the broker and transferring that leadership to another broker.
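
You can check whether any broker is in maintenance mode, and how far leadership draining has progressed, with rpk. A minimal sketch, using the same Admin API connection details as the other commands in this topic (add the TLS flags if TLS is enabled):

kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
  rpk cluster maintenance status \
    -X admin.hosts=<broker-url>:<admin-api-port>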

  1. Check whether any topics have a replication factor of 1.

    If you have topics with replication.factor=1, and if you have sufficient disk space, Redpanda Data recommends temporarily increasing the replication factor. This can help limit outages for these topics during the rolling upgrade. Do this before the upgrade to make sure there’s time for the data to replicate to other brokers. For more information, see Change topic replication factor.
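
    As a sketch, you can list each topic's replication factor with rpk and raise it for any single-replica topic before you start (my-topic is a hypothetical topic name; see Change topic replication factor for the authoritative steps):

    # List topics along with their partition and replica counts.
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk topic list

    # Temporarily raise the replication factor of a single-replica topic (hypothetical name).
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk topic alter-config my-topic --set replication.factor=3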

  2. Deploy an upgraded StatefulSet with your desired Redpanda version.

  3. Upgrade and restart the brokers separately, one after the other.

Redpanda Data does not recommend using the kubectl rollout restart command to perform rolling upgrades. Although the chart’s preStop lifecycle hook puts the broker into maintenance mode before a Pod is deleted, the terminationGracePeriod may not be long enough to allow maintenance mode to finish. If maintenance mode does not finish before the Pod is deleted, you may lose data. After the terminationGracePeriod, the container is forcefully stopped using a SIGKILL command.

If you want to use kubectl rollout restart, it can be a challenge to determine the necessary value for the terminationGracePeriod. In common cases, 30 seconds should be sufficient. For large clusters, 90 seconds should be sufficient. You can test different values in a development environment. To configure the terminationGracePeriod, use the statefulset.terminationGracePeriodSeconds setting.
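
For example, a minimal sketch of setting this during a Helm upgrade (90 seconds is an illustrative value; include the rest of your configuration overrides as usual):

helm upgrade --install redpanda redpanda/redpanda \
  --namespace <namespace> \
  --set statefulset.terminationGracePeriodSeconds=90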

Deploy an upgraded StatefulSet

To deploy an upgraded StatefulSet, you need to delete the existing StatefulSet, then upgrade the Redpanda Helm chart deployment with your desired Redpanda version.

  1. Delete the existing StatefulSet, but leave the Pods running:

    kubectl delete statefulset redpanda --cascade=orphan --namespace <namespace>
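
    Because --cascade=orphan leaves the Pods in place, they should still be listed as Running even though the StatefulSet itself is gone. A quick check:

    # The Redpanda Pods remain, but no StatefulSet is listed until you redeploy it.
    kubectl get statefulset,pods --namespace <namespace>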
  2. Upgrade the Redpanda version by overriding the image.tag setting. Replace <new-version> with a valid version tag.

    helm upgrade --install redpanda redpanda/redpanda \
      --namespace <namespace> \
      --create-namespace \
      --set image.tag=<new-version> \
      --set statefulset.updateStrategy.type=OnDelete

    Make sure to include all your configuration overrides in the helm upgrade command. Otherwise, the upgrade may fail. For example, if you already enabled SASL, include the same SASL overrides.

    Do not use the --reuse-values flag, otherwise Helm won’t include any new values from the upgraded chart.

The statefulset.updateStrategy.type=OnDelete setting stops the StatefulSet from upgrading all the Pods automatically. Changing the updateStrategy to OnDelete allows you to keep the existing Pods running and upgrade each broker separately. For more details, see the Kubernetes documentation.
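
To confirm that the redeployed StatefulSet is using the OnDelete strategy, you can read the value back from the cluster; a minimal sketch:

kubectl get statefulset redpanda --namespace <namespace> \
  -o jsonpath='{.spec.updateStrategy.type}'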

To use the Redpanda version in the latest version of the Redpanda Helm chart, set image.tag to "" (empty string).
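
For example, a sketch of the same upgrade command that keeps the chart's default Redpanda version (remember to include all of your other configuration overrides):

helm upgrade --install redpanda redpanda/redpanda \
  --namespace <namespace> \
  --set image.tag="" \
  --set statefulset.updateStrategy.type=OnDelete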

Upgrade and restart the brokers

To upgrade the Redpanda brokers, you must do the following to each broker, one at a time:

  1. Place the broker into maintenance mode.

  2. Wait for maintenance mode to finish.

  3. Delete the Pod that the broker was running in.

Before placing a broker into maintenance mode, you may want to temporarily disable or ignore alerts related to under-replicated partitions. When a broker is taken offline during a restart, replicas can become under-replicated.

  1. Check that all brokers are healthy:

    • TLS Enabled

    • TLS Disabled

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster health \
        -X admin.tls.enabled=true \
        -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
        -X admin.hosts=<broker-url>:<admin-api-port>
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster health \
        -X admin.hosts=<broker-url>:<admin-api-port>

    Example output:

    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:                     true (1)
    Controller ID:               0
    All nodes:                   [0 1 2] (2)
    Nodes down:                  [] (3)
    Leaderless partitions:       [] (3)
    Under-replicated partitions: [] (3)

    (1) The cluster is either healthy (true) or unhealthy (false).
    (2) The node IDs of all brokers in the cluster.
    (3) If the cluster is unhealthy, these fields contain data.
  2. Find the Pod that is running the broker with the ID that you want to upgrade:

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster info \
        -X brokers=<broker-url>:<kafka-api-port> \
        --tls-enabled \
        --tls-truststore <path-to-kafka-api-ca-certificate>
    Example output:
    BROKERS
    =======
    ID    HOST                                         PORT
    0     redpanda-0.redpanda.test.svc.cluster.local.  9093
    1*    redpanda-1.redpanda.test.svc.cluster.local.  9093
    2     redpanda-2.redpanda.test.svc.cluster.local.  9093

    Here, redpanda-0 is running a broker with the ID 0. In this example, the ordinal of the StatefulSet replica (0 in redpanda-0) is the same as the broker’s ID. However, this is not always the case.

  3. Select a broker that has not been upgraded yet and place it into maintenance mode.

    In this example, the command is executed on a Pod called redpanda-0.

    You can execute the command on any Pod. It doesn't have to be the Pod that's running the broker you want to place into maintenance mode.

    • TLS Enabled

    • TLS Disabled

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster maintenance enable 0 --wait \
        -X admin.tls.enabled=true \
        -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
        -X admin.hosts=<broker-url>:<admin-api-port>
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster maintenance enable 0 --wait \
        -X admin.hosts=<broker-url>:<admin-api-port>

    The --wait flag ensures that the cluster is healthy before putting the broker into maintenance mode.

    The draining process won’t start until the cluster is healthy. The amount of time it takes to drain a broker and reassign partition leadership depends on the number of partitions and how healthy the cluster is. For healthy clusters, draining leadership should take less than a minute. If the cluster is unhealthy, such as when a follower is not in sync with the leader, then draining the broker can take even longer.

    Example output:

    NODE-ID  DRAINING  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
    0        true      true      false   1           0         1             0
    ...
  4. Wait until the cluster is healthy before continuing:

    • TLS Enabled

    • TLS Disabled

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster health \
        -X admin.tls.enabled=true \
        -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
        -X admin.hosts=<broker-url>:<admin-api-port> \
        --watch --exit-when-healthy

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster health \
        -X admin.hosts=<broker-url>:<admin-api-port> \
        --watch --exit-when-healthy

    The combination of the --watch and --exit-when-healthy flags tells rpk to monitor the cluster health and exit only when the cluster is back in a healthy state.

  5. Check the following metrics:

    redpanda_kafka_under_replicated_replicas
      Description: The number of under-replicated Kafka replicas. A non-zero value means replication is lagging; zero means all replicas are fully replicated.
      Recommendation: Pause the upgrade if this metric is non-zero.

    redpanda_cluster_unavailable_partitions
      Description: The number of partitions that are currently unavailable. Zero means all partitions are available; a non-zero value is the count of unavailable partitions.
      Recommendation: Make sure this metric shows zero unavailable partitions before restarting the next broker.

    redpanda_kafka_request_bytes_total
      Description: The total bytes processed for Kafka requests.
      Recommendation: Make sure the produce and consume rate for each broker recovers to its pre-upgrade value before restarting the next broker.

    redpanda_kafka_request_latency_seconds
      Description: The latency of processing Kafka requests: the delay between a Kafka request being initiated and completed.
      Recommendation: Make sure the p99 histogram value recovers to its pre-upgrade level before restarting the next broker.

    redpanda_rpc_request_latency_seconds
      Description: The latency of processing RPC requests: the delay between an RPC request being initiated and completed.
      Recommendation: Make sure the p99 histogram value returns to its pre-upgrade level before restarting the next broker.

    redpanda_cpu_busy_seconds_total
      Description: CPU utilization for a given second, expressed as a decimal between 0.0 and 1.0. A value of 1.0 means the CPU was busy for the entire second (100% capacity), 0.5 means it was busy for half of the second (500 milliseconds), and 0.0 means it was idle for the entire second.
      Recommendation: If you see consistently high values, investigate the cause. It could be high traffic or another system bottleneck.
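
    One way to spot-check these metrics is to scrape a broker's metrics endpoint directly. A minimal sketch, assuming the Admin API and its /public_metrics endpoint listen on the default port 9644:

    # Forward one broker's Admin API port to your workstation.
    kubectl port-forward pod/redpanda-0 9644:9644 --namespace <namespace> &

    # The under-replicated replicas metric should report 0 before you continue.
    curl -s localhost:9644/public_metrics | grep redpanda_kafka_under_replicated_replicas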

    If the cluster has any issues, take the broker out of maintenance mode by running the following command before proceeding with other operations, such as decommissioning or retrying the rolling upgrade:

    rpk cluster maintenance disable <node-id>
  6. Delete the Pod in which the broker in maintenance mode was running:

    kubectl delete pod redpanda-0 --namespace <namespace>
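
    The StatefulSet controller recreates the Pod with the upgraded image. As a sketch, you can watch it return to the Running state before moving on:

    kubectl get pod redpanda-0 --namespace <namespace> --watch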
  7. When the Pod restarts, make sure that it’s now running the upgraded version of Redpanda:

    • TLS Enabled

    • TLS Disabled

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk redpanda admin brokers list \
        -X admin.tls.enabled=true \
        -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
        -X admin.hosts=<broker-url>:<admin-api-port>
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk redpanda admin brokers list \
        -X admin.hosts=<broker-url>:<admin-api-port>
  8. Repeat this process for all the other brokers in the cluster.

Verify that the upgrade was successful

When you’ve upgraded all brokers, verify that the cluster is healthy. If the cluster is unhealthy, the upgrade may still be in progress. Try waiting a few moments, then run the command again.

  • TLS Enabled

  • TLS Disabled

kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
  rpk cluster health \
    -X admin.tls.enabled=true \
    -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
    -X admin.hosts=<broker-url>:<admin-api-port>
kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
  rpk cluster health \
    -X admin.hosts=<broker-url>:<admin-api-port>

Expected output:
CLUSTER HEALTH OVERVIEW
=======================
Healthy:               true
Controller ID:         1
All nodes:             [2,1,0]
Nodes down:            []
Leaderless partitions: []

Rollbacks

If something does not go as planned during a rolling upgrade, you can roll back to the original version as long as you have not upgraded every broker. By default, the StatefulSet uses the RollingUpdate strategy (statefulset.updateStrategy.type), which means all Pods in the StatefulSet are restarted in reverse-ordinal order during the rollback. For details, see the Kubernetes documentation.

  1. Find the previous revision:

    helm history redpanda --namespace <namespace>

    Example output

    REVISION	UPDATED                 	STATUS    	CHART          	APP VERSION	DESCRIPTION
    1       	Fri Mar  3 15:16:24 year	superseded	redpanda-2.12.2	v22.3.13   	Install complete
    2       	Fri Mar  3 15:19:41 year	deployed	  redpanda-2.12.2	v22.3.13   	Upgrade complete
  2. Roll back to the previous revision:

    helm rollback redpanda <previous-revision> --namespace <namespace>
  3. Verify that the cluster is healthy. If the cluster is unhealthy, the rollback may still be in progress. The command exits when the cluster is healthy.

    • TLS Enabled

    • TLS Disabled

    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster health \
        -X admin.tls.enabled=true \
        -X admin.tls.ca=<path-to-admin-api-ca-certificate> \
        -X admin.hosts=<broker-url>:<admin-api-port> \
        --watch --exit-when-healthy
    kubectl exec redpanda-0 --namespace <namespace> -c redpanda -- \
      rpk cluster health \
        -X admin.hosts=<broker-url>:<admin-api-port> \
        --watch --exit-when-healthy
    Expected output:
    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:               true
    Controller ID:         1
    All nodes:             [2,1,0]
    Nodes down:            []
    Leaderless partitions: []

Suggested reading

To set up a real-time dashboard to monitor your cluster health, see Monitor Redpanda.