Upgrade Redpanda in Kubernetes

To benefit from Redpanda’s new features and enhancements, upgrade to the latest version. New features are available after all brokers in the cluster are upgraded and restarted.

Redpanda platform version numbers follow the convention AB.C.D, where AB is the two-digit year, C is the feature release, and D is the patch release. For example, version 22.3.1 indicates the first patch release of the third feature release of the year 2022. Patch releases include bug fixes and minor improvements, with no change to user-facing behavior. New and enhanced features are documented with each feature release. You can find a list of all releases on GitHub.
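
The AB.C.D convention can be split mechanically, which is handy when scripting upgrade checks. The following is a minimal sketch, not part of any Redpanda tooling: parse_version is a hypothetical helper name, and it assumes a complete three-part version string with an optional leading "v".

```shell
# Hypothetical helper: split an AB.C.D version string into its parts.
# Assumes a full three-part version such as "v22.3.1" or "22.3.1".
parse_version() {
  local version="${1#v}"          # drop an optional leading "v"
  local year="${version%%.*}"     # AB: two-digit year
  local rest="${version#*.}"      # C.D
  local feature="${rest%%.*}"     # C: feature release
  local patch="${rest#*.}"        # D: patch release
  echo "year=20${year} feature=${feature} patch=${patch}"
}
```

For example, parse_version v22.3.1 prints year=2022 feature=3 patch=1.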

Limitations

The following limitations ensure a smooth transition between versions and help to maintain the stability of your cluster.

  • Broker upgrades:

    • New features are enabled only after upgrading all brokers in the cluster.

    • You can upgrade only one feature release at a time, for example from 22.2 to 22.3. Skipping feature releases is not supported.

  • Rollbacks: You can roll back to the original version only if at least one broker is still running the original version (not yet upgraded) and the cluster hasn’t yet restarted.

  • Downgrades: Downgrades are possible only between patch releases of the same feature release. For example, you can downgrade from 22.2.2 to 22.2.1. Downgrading to previous feature releases, such as 22.1.x, is not supported.

  • Tiered Storage: If you have Tiered Storage enabled and you’re upgrading to 23.2, object storage uploads are paused until all brokers are upgraded. If the cluster cannot upgrade, roll it back to the original version.

    In a mixed-version state, the cluster could run out of disk space. If you need to force a mixed-version cluster to upload, transfer partition leadership to brokers running the original version.
  • Remote Read Replicas: Upgrade the Remote Read Replica cluster first, ensuring it’s on the same version as the origin cluster or one feature release ahead of the origin cluster. When upgrading to Redpanda 23.2, metadata from object storage is not synchronized until all brokers in the cluster are upgraded. If you need to force a mixed-version cluster to sync read replicas, transfer partition leadership to brokers running the original version.

  • Controller snapshots: Controller snapshots are disabled in upgraded clusters. To re-enable them, contact Redpanda Support.
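
The upgrade and downgrade rules above can be turned into a pre-flight check. This is a sketch under stated assumptions: year_of, feature_of, patch_of, can_downgrade, and upgrade_step are hypothetical helper names, and versions are assumed to be complete AB.C.D strings. Version numbers alone do not reveal whether two feature releases in different years are adjacent, so that case is reported as "check-release-list" rather than guessed.

```shell
# Hypothetical helpers: split "v22.3.1" into year 22, feature 3, patch 1.
year_of()    { local v="${1#v}"; echo "${v%%.*}"; }
feature_of() { local v="${1#v}"; v="${v#*.}"; echo "${v%%.*}"; }
patch_of()   { local v="${1#v}"; echo "${v##*.}"; }

# Succeeds only for a downgrade between patch releases of the same
# feature release. Usage: can_downgrade <current> <target>
can_downgrade() {
  [ "$(year_of "$1")" -eq "$(year_of "$2")" ] &&
    [ "$(feature_of "$1")" -eq "$(feature_of "$2")" ] &&
    [ "$(patch_of "$2")" -lt "$(patch_of "$1")" ]
}

# Classifies an upgrade. Usage: upgrade_step <current> <target>
upgrade_step() {
  local cur="$1" new="$2"
  local cur_year new_year cur_feat new_feat
  cur_year="$(year_of "$cur")"; new_year="$(year_of "$new")"
  cur_feat="$(feature_of "$cur")"; new_feat="$(feature_of "$new")"
  if [ "$new_year" -gt "$cur_year" ]; then
    # Adjacency across years cannot be derived from the numbers alone.
    echo "check-release-list"
  elif [ "$new_year" -lt "$cur_year" ]; then
    echo "not-an-upgrade"
  elif [ "$new_feat" -eq "$cur_feat" ]; then
    if [ "$(patch_of "$new")" -gt "$(patch_of "$cur")" ]; then
      echo "patch"
    else
      echo "not-an-upgrade"
    fi
  elif [ "$new_feat" -eq $((cur_feat + 1)) ]; then
    echo "next-feature"      # supported: one feature release at a time
  elif [ "$new_feat" -gt $((cur_feat + 1)) ]; then
    echo "skips-feature"     # not supported: upgrade in stages instead
  else
    echo "not-an-upgrade"
  fi
}
```

With these definitions, upgrade_step 22.2.10 22.3.13 prints next-feature, upgrade_step 22.1.4 22.3.1 prints skips-feature, and can_downgrade 22.2.2 22.2.1 succeeds.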

Prerequisites

Impact of broker restarts

When brokers restart, clients may experience higher latency, nodes may experience CPU spikes when the broker becomes available again, and you may receive alerts about under-replicated partitions. Topics that don’t use replication (that is, topics with replication.factor=1) are unavailable while the broker that hosts them restarts.

Temporary increase in latency on clients (producers and consumers)

When you restart one or more brokers in a cluster, clients (consumers and producers) may experience higher latency due to partition leadership reassignment. Because clients must communicate with the leader of a partition, they may send a request to a broker whose leadership has been transferred, and receive NOT_LEADER_FOR_PARTITION. In this case, clients must request metadata from the cluster to find out the address of the new leader. Clients refresh their metadata periodically, or when the client receives some retryable errors that indicate that the metadata may be stale. For example:

  1. Broker A shuts down.

  2. Client sends a request to broker A, and receives NOT_LEADER_FOR_PARTITION.

  3. Client requests metadata, and learns that the new leader is broker B.

  4. Client sends the request to broker B.
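
The sequence above can be sketched as a small simulation. Nothing here talks to a real cluster: CURRENT_LEADER, send_request, request_metadata, and send_with_metadata_refresh are hypothetical stand-ins that only illustrate why a leadership change costs the client an extra round trip. Real Kafka clients implement this retry logic internally.

```shell
# Simulated cluster state: broker A has shut down, so leadership of the
# partition has moved to broker B.
CURRENT_LEADER="B"

# Stand-in for a produce or fetch request sent to a specific broker.
send_request() {
  if [ "$1" = "$CURRENT_LEADER" ]; then
    echo "OK"
  else
    echo "NOT_LEADER_FOR_PARTITION"
  fi
}

# Stand-in for a metadata request: returns the current partition leader.
request_metadata() {
  echo "$CURRENT_LEADER"
}

# Send to the cached leader; on a retryable error, refresh metadata and retry.
# Usage: send_with_metadata_refresh <cached-leader> [max-attempts]
send_with_metadata_refresh() {
  local leader="$1" max_attempts="${2:-3}" attempt=1 response
  while [ "$attempt" -le "$max_attempts" ]; do
    response="$(send_request "$leader")"
    if [ "$response" = "OK" ]; then
      echo "delivered to broker $leader on attempt $attempt"
      return 0
    fi
    leader="$(request_metadata)"   # stale metadata: look up the new leader
    attempt=$((attempt + 1))
  done
  echo "gave up after $max_attempts attempts" >&2
  return 1
}
```

A client that still believes broker A is the leader needs two attempts: send_with_metadata_refresh A prints "delivered to broker B on attempt 2". The extra metadata round trip is the source of the temporary latency increase.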

CPU spikes upon broker restart

When a restarted broker becomes available again, you may see your nodes' CPU usage increase temporarily. This temporary increase in CPU usage is due to the cluster rebalancing the partition replicas.

Under-replicated partitions

When a broker is in maintenance mode, Redpanda continues to replicate updates to that broker. When a broker is taken offline during a restart, partitions with replicas on the broker could become out of sync until it is brought back online. Once the broker is available again, data is copied to its under-replicated replicas until all affected partitions are in sync with the partition leader.

Incompatible changes

Patch releases 22.3.14 and 23.1.2 changed the behavior when remote read is disabled and the requested Raft term falls below the beginning of the local log. In earlier versions, Redpanda returned an offset of -1. With these patches, when you request a value older than the lowest offset, Redpanda returns the lowest offset instead of -1.

Find a new version

Before you perform a rolling upgrade, you must find out:

  • Which Redpanda version you are currently running.

  • Whether you can upgrade directly to the new version.

  • What’s changed since your original version.

  1. Find your current version of Redpanda:

    kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
      rpk redpanda admin brokers list

    For all available flags, see the rpk redpanda admin brokers list command reference.

    Expected output:

    The Redpanda version for each broker is listed under BROKER-VERSION.

    NODE-ID  BROKER-VERSION
    0        v22.2.10
    1        v22.2.10
    2        v22.2.10
  2. Find the Redpanda version that’s used in the latest version of the Redpanda Helm chart:

    helm repo update && \
    helm show chart redpanda/redpanda | grep appVersion

    Example output:

    appVersion:	v22.2.10

    If your current version is more than one feature release behind the version in the latest Redpanda Helm chart, you must first upgrade to an intermediate version. To list all available versions:

    curl -s 'https://hub.docker.com/v2/repositories/redpandadata/redpanda/tags/?ordering=last_updated&page=1&page_size=50' | jq -r '.results[].name'
  3. Check the release notes to find information about what has changed between Redpanda versions.
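
When you need an intermediate version, it can help to pick the newest patch release of a given feature release from the tag list. The following is a minimal sketch: latest_patch is a hypothetical helper that reads one tag per line from standard input (for example, the output of the curl and jq pipeline above) and assumes release tags have the form vAB.C.D. Tags with any other shape are ignored.

```shell
# Hypothetical helper: print the tag with the highest patch number for one
# feature release. Usage: <tag list on stdin> | latest_patch 22.3
latest_patch() {
  local feature="$1"
  awk -v prefix="v${feature}." '
    index($0, prefix) == 1 {                  # tag starts with, e.g., "v22.3."
      patch = substr($0, length(prefix) + 1)  # text after the prefix
      if (patch ~ /^[0-9]+$/ && patch + 0 >= best) {
        best = patch + 0
        tag = $0
      }
    }
    END { if (tag != "") print tag }
  '
}
```

The tag list is paginated, so a single page may not contain every patch release of an older feature release.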

Upgrade the Redpanda Operator

This section provides guidance on upgrading the Redpanda Operator. If you do not use the Redpanda Operator to manage your clusters, you can skip this section.

When a new version of the Redpanda Helm chart is released, you must assess its compatibility with the current Redpanda Operator. Changes in the Redpanda Helm chart, especially modifications to the values file, may introduce new features or configuration options that the Redpanda Operator should be aware of.

  1. Review the release notes of the Redpanda Helm chart to understand the nature of the changes and their potential impact on your Redpanda clusters.

  2. Determine if the current version of the Redpanda Operator is compatible with the new Redpanda Helm chart version. Not all changes in the Redpanda Helm chart require an upgrade of the Redpanda Operator. See Kubernetes Compatibility.

If the new Redpanda Helm chart version introduces significant changes that affect how the Redpanda Operator manages the Redpanda clusters:

  • Upgrade the CRDs to match the latest version of the Redpanda Operator.

    kubectl kustomize "https://github.com/redpanda-data/redpanda-operator//src/go/k8s/config/crd?ref=v2.1.14-23.3.4" \
        | kubectl apply -f -

    This step ensures that the Redpanda Operator can correctly interpret and manage the configurations defined in the Redpanda resource.

  • Ensure you are using the latest version of the Redpanda Operator that supports the new Redpanda Helm chart version.

    helm repo add redpanda https://charts.redpanda.com
    helm upgrade --install redpanda-controller redpanda/operator \
      --namespace <namespace> \
      --set image.tag=v2.1.14-23.3.4

    Make sure to include all existing overrides. Otherwise, the upgrade may fail.

Perform a rolling upgrade

A rolling upgrade involves updating the version of Redpanda in the Redpanda Helm chart, which triggers a rolling restart.

The Helm chart’s preStop lifecycle hook puts each broker into maintenance mode before the Pod is terminated. If maintenance mode does not complete before the terminationGracePeriod expires, the container is forcefully terminated with a SIGKILL signal.

The default terminationGracePeriod is 90 seconds, which should be long enough for large clusters. You can test different values in a development environment. To configure the terminationGracePeriod, use the statefulset.terminationGracePeriodSeconds setting.

To upgrade:

  1. Check for topics that have a replication factor of 1.

    If you have topics with a replication factor of 1, and if you have sufficient disk space, temporarily increase the replication factor to limit outages for these topics during the rolling upgrade.

    Increase the replication factor before you upgrade to ensure that Redpanda has time to replicate data to other brokers.

  2. Ensure that the cluster is healthy:

    kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
      rpk cluster health

    The draining process won’t start until the cluster is healthy.

    Example output:
    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:                     true (1)
    Controller ID:               0
    All nodes:                   [0 1 2] (2)
    Nodes down:                  [] (3)
    Leaderless partitions:       [] (3)
    Under-replicated partitions: [] (3)
    1 The cluster is either healthy (true) or unhealthy (false).
    2 The node IDs of all brokers in the cluster.
    3 These fields contain data only when the cluster is unhealthy.
  3. If you’re using the Redpanda Helm chart without the Redpanda Operator, list all your existing overrides of the Helm values:

    helm get values redpanda --namespace <namespace>

    You’ll need to apply these overrides in the next step.

  4. Upgrade the Redpanda version by overriding the image.tag setting. Replace <new-version> with a valid version tag.

    Helm + Operator:

    redpanda-cluster.yaml
    apiVersion: cluster.redpanda.com/v1alpha1
    kind: Redpanda
    metadata:
      name: redpanda
    spec:
      chartRef: {}
      clusterSpec:
        image:
          tag: <new-version>

    kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

    Helm, using --values:

    redpanda-version.yaml
    image:
      tag: <new-version>

    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
      --values redpanda-version.yaml --reuse-values

    Helm, using --set:

    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
      --set image.tag=<new-version>

    Make sure to include all existing overrides. Otherwise, the upgrade may fail. For example, if you already enabled SASL, include the same SASL overrides. Do not use the --reuse-values flag, because Helm then won’t include any new values from the upgraded chart.
  5. Wait for the Pods to be terminated and recreated with the new version of Redpanda.

    kubectl get pod --namespace <namespace> --watch

    Each Pod in the StatefulSet is terminated one at a time, starting from the one with the highest ordinal.

    Example output:
    NAME                                    READY   STATUS
    redpanda-controller-operator            2/2     Running
    redpanda-0                              2/2     Running
    redpanda-1                              2/2     Running
    redpanda-2                              0/2     Init:0/3
    redpanda-configuration-88npt            0/1     Completed
    redpanda-console-7cf85cf87f-rmtnj       1/1     Running
    redpanda-post-upgrade-ljqpr             0/1     Completed
  6. When all of the Pods are ready and have a Running status, verify that the brokers are now running the upgraded version of Redpanda:

    kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
      rpk redpanda admin brokers list
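
The final verification step can be automated by checking the BROKER-VERSION column for every broker. The following is a minimal sketch: all_brokers_on_version is a hypothetical helper that reads the output of rpk redpanda admin brokers list from standard input. It locates the BROKER-VERSION column by its header name, so it assumes only that the first line is a header row and that fields are separated by whitespace.

```shell
# Hypothetical helper: succeed only if every broker row reports the target
# version. Usage: <brokers list output on stdin> | all_brokers_on_version v22.3.13
# Exit codes: 0 = all brokers match, 1 = mismatch or no brokers, 2 = no header.
all_brokers_on_version() {
  local target="$1"
  awk -v target="$target" '
    NR == 1 {
      # Find the BROKER-VERSION column in the header row.
      for (i = 1; i <= NF; i++) if ($i == "BROKER-VERSION") col = i
      if (!col) exit 2
      next
    }
    NF > 0 {
      rows++
      if ($col != target) mismatched++
    }
    END {
      if (!col) exit 2
      if (rows == 0 || mismatched > 0) exit 1
    }
  '
}
```

For example, pipe the command from the last step into the helper: kubectl exec <pod-name> --namespace <namespace> -c redpanda -- rpk redpanda admin brokers list | all_brokers_on_version <new-version>. A non-zero exit status means at least one broker is not yet on the new version, or the output could not be parsed.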

Rollbacks

If something does not go as planned during a rolling upgrade, you can roll back to the original version as long as you have not upgraded every broker.

The Redpanda Operator rolls back automatically after three failed attempts to upgrade the cluster.

If you are using the Redpanda Helm chart without the Redpanda Operator, you can roll back manually.

The StatefulSet uses the RollingUpdate strategy by default in statefulset.updateStrategy.type, which means all Pods in the StatefulSet are restarted in reverse-ordinal order. For details, see the Kubernetes documentation.

  1. Find the previous revision:

    helm history redpanda --namespace <namespace>

    Example output:

    REVISION	UPDATED                 	STATUS    	CHART          	APP VERSION	DESCRIPTION
    1       	Fri Mar  3 15:16:24 year	superseded	redpanda-2.12.2	v22.3.13   	Install complete
    2       	Fri Mar  3 15:19:41 year	deployed	  redpanda-2.12.2	v22.3.13   	Upgrade complete
  2. Roll back to the previous revision:

    helm rollback redpanda <previous-revision> --namespace <namespace>
  3. Verify that the cluster is healthy. If the cluster is unhealthy, the rollback may still be in progress. The command exits when the cluster is healthy.

    kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
      rpk cluster health \
      --watch --exit-when-healthy
    Example output:
    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:               true
    Controller ID:         1
    All nodes:             [2,1,0]
    Nodes down:            []
    Leaderless partitions: []
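
If you script a rollback or an upgrade, you may want to gate later steps on the health report instead of reading it by eye. The following is a minimal sketch: cluster_is_healthy is a hypothetical helper that reads rpk cluster health output from standard input and assumes the report contains a "Healthy:" line followed by true or false, as in the example output above.

```shell
# Hypothetical helper: succeed only if the health report says "Healthy: true".
# Usage: <rpk cluster health output on stdin> | cluster_is_healthy
cluster_is_healthy() {
  awk '
    $1 == "Healthy:" { found = 1; healthy = ($2 == "true") }
    END { exit !(found && healthy) }   # non-zero if missing or not true
  '
}
```

For example: kubectl exec <pod-name> --namespace <namespace> -c redpanda -- rpk cluster health | cluster_is_healthy. The helper fails if the "Healthy:" line is missing, so an empty or unexpected report is treated as unhealthy.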

Suggested reading

To set up a real-time dashboard to monitor your cluster health, see Monitor Redpanda.