Upgrade Redpanda

To benefit from Redpanda’s new features and enhancements, upgrade to the latest version. Redpanda Data recommends that you perform a rolling upgrade on production clusters. In a rolling upgrade, each broker is placed into maintenance mode and restarted separately, one after the other.

Redpanda Self-Managed version numbers follow the convention AB.C.D, where AB is the two-digit year, C is the feature release, and D is the patch release. For example, version 22.3.1 indicates the first patch release of the third feature release of the year 2022. Patch releases include bug fixes and minor improvements, with no change to user-facing behavior. New and enhanced features are documented with each feature release. You can find a list of all releases on GitHub.
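As a minimal sketch of the AB.C.D convention, the following Python snippet splits a version string into its three parts. The names RedpandaVersion and parse_version are hypothetical helpers introduced for illustration, and tolerating a leading "v" is an assumption based on how versions appear in command output elsewhere on this page.

```python
import re
from typing import NamedTuple


class RedpandaVersion(NamedTuple):
    year: int      # AB: two-digit year
    feature: int   # C: feature release within that year
    patch: int     # D: patch release within that feature release


# Hypothetical pattern: two-digit year, then feature and patch numbers.
_VERSION_RE = re.compile(r"^v?(\d{2})\.(\d+)\.(\d+)$")


def parse_version(text: str) -> RedpandaVersion:
    """Parse 'AB.C.D'; an optional leading 'v' is tolerated."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a Redpanda version string: {text!r}")
    year, feature, patch = (int(part) for part in match.groups())
    return RedpandaVersion(year, feature, patch)


print(parse_version("22.3.1"))
# prints: RedpandaVersion(year=22, feature=3, patch=1)
```

Because the result is a tuple, two parsed versions compare numerically, so 22.3.2 sorts before 22.3.11.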

Limitations

The following limitations ensure a smooth transition between versions and help to maintain the stability of your cluster.

  • Broker upgrades:

    • New features are enabled only after upgrading all brokers in the cluster.

    • You can upgrade only one feature release at a time, for example from 22.2 to 22.3. Skipping feature releases is not supported.

  • Rollbacks: You can roll back to the original version only if at least one broker is still running the original version (not yet upgraded) and the cluster hasn’t yet restarted.

  • Downgrades: Downgrades are possible only between patch releases of the same feature release. For example, you can downgrade from 22.2.2 to 22.2.1. Downgrading to previous feature releases, such as 22.1.x, is not supported.

  • Tiered Storage: If you have Tiered Storage enabled and you’re upgrading to 23.2, object storage uploads are paused until all brokers are upgraded. If the cluster cannot upgrade, roll it back to the original version.

    In a mixed-version state, the cluster could run out of disk space. If you need to force a mixed-version cluster to upload, transfer partition leadership to brokers running the original version.

  • Remote Read Replicas: Upgrade the Remote Read Replica cluster first, ensuring it’s on the same version as the origin cluster or one feature release ahead of the origin cluster. When upgrading to Redpanda 23.2, metadata from object storage is not synchronized until all brokers in the cluster are upgraded. If you need to force a mixed-version cluster to sync read replicas, transfer partition leadership to brokers running the original version.
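The broker upgrade and downgrade rules above can be sketched as a small check. This is an illustration under stated assumptions, not a Redpanda tool: classify_change is a hypothetical helper, and the ordered list of feature releases must be supplied by the caller (for example, from the release list), because this page does not say how many feature releases each year contains. The rollback conditions are not modeled.

```python
from typing import Sequence, Tuple

Version = Tuple[int, int, int]  # (year, feature, patch), for example (22, 3, 1)


def parse(text: str) -> Version:
    year, feature, patch = text.lstrip("v").split(".")
    return int(year), int(feature), int(patch)


def classify_change(current: str, target: str, feature_releases: Sequence[str]) -> str:
    """Classify a version change against the limitations listed above.

    feature_releases is a caller-supplied list ordered oldest to newest,
    such as ["22.2", "22.3", "23.1"]. A feature release missing from the
    list raises ValueError.
    """
    cur, tgt = parse(current), parse(target)
    cur_feature = f"{cur[0]}.{cur[1]}"
    tgt_feature = f"{tgt[0]}.{tgt[1]}"
    if cur_feature == tgt_feature:
        if tgt[2] > cur[2]:
            return "patch upgrade: supported"
        if tgt[2] < cur[2]:
            return "patch downgrade: supported"
        return "no change"
    step = feature_releases.index(tgt_feature) - feature_releases.index(cur_feature)
    if step == 1:
        return "feature upgrade: supported"
    if step > 1:
        return "unsupported: skips a feature release"
    return "unsupported: downgrade to an earlier feature release"


releases = ["22.1", "22.2", "22.3", "23.1"]  # illustrative list
print(classify_change("22.2.1", "22.3.1", releases))  # feature upgrade: supported
print(classify_change("22.2.2", "22.1.7", releases))  # unsupported: downgrade to an earlier feature release
```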

Prerequisites

  • A running Redpanda cluster.

  • jq for listing available versions.

  • An understanding of the impact of broker restarts on clients, node CPU, and any alerting systems you use.

  • Review incompatible changes in new versions.

Review incompatible changes

  • Starting in version 24.2, when managing configuration properties using the AlterConfigs API directly, Redpanda resets all unspecified values to the default values. This aligns more closely with the behavior in Apache Kafka. There is no change if you’re managing your configuration with tools like rpk, Redpanda Console, Kubernetes, Helm, or Terraform.

    This does not apply to the redpanda.remote.* topic properties, such as redpanda.remote.delete. The AlterConfigs API does not reset these properties to their defaults, which avoids unintentionally disabling Tiered Storage for a topic. Disabling it by accident could cause significant operational issues for clusters designed to use Tiered Storage exclusively or with object storage as the only durable storage tier.

  • Starting in version 24.2, client throughput limits are compatible with the AlterClientQuotas and DescribeClientQuotas Kafka APIs. Redpanda determines client throughput limits on a per-broker basis. In earlier versions, client throughput quotas were applied from cluster configurations on a per-shard basis.

  • Starting in version 24.2, transaction_max_timeout_ms defaults to 15 minutes, which limits the timeout that a transactional producer can set during initialization. If you upgrade to version 24.2 and have clients with timeouts larger than transaction_max_timeout_ms, new producer initialization requests fail because this limit is enforced. Either reduce the transaction timeout set by the client to less than 15 minutes, or increase transaction_max_timeout_ms on the server side.

  • Patch releases in 22.3.14 and 23.1.2 changed the behavior when remote read is disabled and the requested Raft term falls below the local log’s beginning. In earlier versions, Redpanda returned an offset -1. With the patch, when you request a value older than the lowest offset, Redpanda returns the lowest offset, not -1.
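To make the AlterConfigs change in the first item concrete, here is a toy model of the described behavior. It is not Redpanda's implementation: apply_alter_configs is a hypothetical function, and the default values in the example are illustrative only.

```python
from typing import Dict

REMOTE_PREFIX = "redpanda.remote."


def apply_alter_configs(
    current: Dict[str, str],
    defaults: Dict[str, str],
    request: Dict[str, str],
) -> Dict[str, str]:
    """Toy model of the 24.2 AlterConfigs behavior described above."""
    # Every property that the request leaves out falls back to its default...
    result = dict(defaults)
    # ...except redpanda.remote.* properties, which keep their current value.
    for name, value in current.items():
        if name.startswith(REMOTE_PREFIX):
            result[name] = value
    # Values named explicitly in the request always win.
    result.update(request)
    return result


# Illustrative values only; real defaults depend on your cluster configuration.
defaults = {"cleanup.policy": "delete", "retention.ms": "604800000", "redpanda.remote.delete": "true"}
current = {"cleanup.policy": "compact", "retention.ms": "3600000", "redpanda.remote.delete": "false"}
print(apply_alter_configs(current, defaults, {"retention.ms": "7200000"}))
```

In this model, cleanup.policy returns to its default because the request did not mention it, while redpanda.remote.delete keeps the value the topic already had.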

Find a new version

Before you upgrade, find out which Redpanda version you are currently running, whether you can upgrade straight to the new version, and what’s changed since your original version. To find your current version, run:

  • Linux

    rpk redpanda admin brokers list

    For all available flags, see the rpk redpanda admin brokers list command reference.

  • Docker

    Running Redpanda directly on Docker is not supported for production usage. This platform should only be used for testing.

    docker exec -it <container_name><container_tag> rpk version

    Remember to replace the variables <container_name> and <container_tag>. The container tag determines which version of rpk to use. The release process bundles rpk and Redpanda into the same container tag with the same version.

  • macOS

    brew list --versions | grep redpanda

Example output:

v24.2.2 (rev 72ba3d3)

If your current version is more than one feature release behind the latest Redpanda version, you must first upgrade to an intermediate version. To list all available versions:

curl -s 'https://hub.docker.com/v2/repositories/redpandadata/redpanda/tags/?ordering=last_updated&page=1&page_size=50' | jq -r '.results[].name'

Check the release notes to find information about what has changed between Redpanda versions.
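Once you have the list of tag names, a short script can group them by feature release, which helps when you need to pick an intermediate version. The sketch below assumes that release tags look like v24.2.2 and that any other tag (for example latest, or a tag with a suffix) should be ignored. The function name and the sample tags are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed release tag shape: "v" followed by AB.C.D and nothing else.
_RELEASE_TAG = re.compile(r"^v(\d{2})\.(\d+)\.(\d+)$")


def latest_patch_per_feature(tags: Iterable[str]) -> Dict[Tuple[int, int], str]:
    """Map each (year, feature) pair to the tag with the highest patch number."""
    best: Dict[Tuple[int, int], Tuple[int, str]] = {}
    for tag in tags:
        match = _RELEASE_TAG.match(tag)
        if match is None:
            continue  # skip tags that are not plain release versions
        year, feature, patch = map(int, match.groups())
        key = (year, feature)
        if key not in best or patch > best[key][0]:
            best[key] = (patch, tag)
    # Return the feature releases from oldest to newest.
    return {key: tag for key, (_patch, tag) in sorted(best.items())}


sample_tags = ["latest", "v24.2.2", "v24.2.2-amd64", "v24.1.13", "v24.1.9", "v23.3.18"]
for feature, tag in latest_patch_per_feature(sample_tags).items():
    print(feature, tag)
```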

Impact of broker restarts

When brokers restart, clients may experience higher latency, nodes may experience CPU spikes when the broker becomes available again, and you may receive alerts about under-replicated partitions. Topics that weren’t using replication (that is, topics that had replication.factor=1) will be unavailable.

Temporary increase in latency on clients (producers and consumers)

When you restart one or more brokers in a cluster, clients (consumers and producers) may experience higher latency due to partition leadership reassignment. Because clients must communicate with the leader of a partition, they may send a request to a broker whose leadership has been transferred, and receive NOT_LEADER_FOR_PARTITION. In this case, clients must request metadata from the cluster to find out the address of the new leader. Clients refresh their metadata periodically, or when the client receives some retryable errors that indicate that the metadata may be stale. For example:

  1. Broker A shuts down.

  2. Client sends a request to broker A, and receives NOT_LEADER_FOR_PARTITION.

  3. Client requests metadata, and learns that the new leader is broker B.

  4. Client sends the request to broker B.
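The four steps above can be simulated with a toy client and cluster. This is a simplified model for illustration, not a real Kafka client: ToyCluster, ToyClient, and the exception class are hypothetical, and a real client also refreshes metadata periodically and applies backoff between retries.

```python
class NotLeaderForPartition(Exception):
    """Stands in for the NOT_LEADER_FOR_PARTITION error code."""


class ToyCluster:
    def __init__(self, leader: str) -> None:
        self.leader = leader

    def metadata(self) -> str:
        return self.leader  # the current partition leader

    def produce(self, broker: str, record: str) -> str:
        if broker != self.leader:
            raise NotLeaderForPartition(broker)
        return f"{record} accepted by {broker}"


class ToyClient:
    def __init__(self, cluster: ToyCluster) -> None:
        self.cluster = cluster
        self.cached_leader = cluster.metadata()
        self.metadata_refreshes = 0

    def send(self, record: str, max_attempts: int = 3) -> str:
        for _ in range(max_attempts):
            try:
                return self.cluster.produce(self.cached_leader, record)
            except NotLeaderForPartition:
                # Stale metadata: ask the cluster who the leader is now, then retry.
                self.cached_leader = self.cluster.metadata()
                self.metadata_refreshes += 1
        raise RuntimeError("no leader found after retries")


cluster = ToyCluster(leader="A")
client = ToyClient(cluster)
cluster.leader = "B"             # broker A shuts down; leadership moves to broker B
print(client.send("event-1"))    # prints: event-1 accepted by B
print(client.metadata_refreshes) # prints: 1
```

The extra round trip for the metadata refresh is the source of the temporary latency increase.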

CPU spikes upon broker restart

When a restarted broker becomes available again, you may see your nodes' CPU usage increase temporarily. This temporary increase in CPU usage is due to the cluster rebalancing the partition replicas.

Under-replicated partitions

When a broker is in maintenance mode, Redpanda continues to replicate updates to that broker. When a broker is taken offline during a restart, partitions with replicas on the broker could become out of sync until it is brought back online. Once the broker is available again, data is copied to its under-replicated replicas until all affected partitions are in sync with the partition leader.

Perform a rolling upgrade

A rolling upgrade involves putting a broker into maintenance mode, upgrading the broker, taking the broker out of maintenance mode, and then repeating the process on the next broker in the cluster. Placing brokers into maintenance mode ensures a smooth upgrade of your cluster while reducing the risk of interruption or degradation in service.

When a broker is placed into maintenance mode, it reassigns its partition leadership to other brokers for all topics that have a replication factor greater than one. Reassigning partition leadership involves draining leadership from the broker and transferring that leadership to another broker.
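As a rough outline of the whole procedure, the following sketch builds the ordered list of steps for a set of brokers. It only produces text and does not run anything. The rpk commands are the ones used in the sections that follow; the entries in angle brackets are placeholders for manual actions, and rolling_upgrade_plan is a hypothetical helper name.

```python
from typing import Iterable, List


def rolling_upgrade_plan(node_ids: Iterable[int]) -> List[str]:
    """Return the ordered steps of a rolling upgrade, one broker at a time."""
    plan = ["rpk redpanda admin brokers list"]  # confirm that all brokers are active
    for node_id in node_ids:
        plan += [
            "rpk cluster health",
            f"rpk cluster maintenance enable {node_id} --wait",
            "rpk cluster maintenance status",
            "rpk cluster health --watch --exit-when-healthy",
            f"<upgrade the Redpanda package on broker {node_id}>",
            f"<check metrics before restarting broker {node_id}>",
            "rpk redpanda stop",
            "rpk redpanda start",
            f"rpk cluster maintenance disable {node_id}",
            "rpk cluster maintenance status",
        ]
    # Post-upgrade verification once every broker is done.
    plan += ["rpk cluster health", "rpk redpanda admin brokers list"]
    return plan


for step in rolling_upgrade_plan([0, 1, 2]):
    print(step)
```

The point of the outline is the ordering: a broker leaves maintenance mode before the next broker enters it, so at most one broker is drained at any time.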

  1. Check that your topics have a replication factor greater than one.

    If you have topics with replication.factor=1, and if you have sufficient disk space, Redpanda Data recommends temporarily increasing the replication factor. This can help limit outages for these topics during the rolling upgrade. Do this before the upgrade to make sure there’s time for the data to replicate to other brokers. For more information, see Change topic replication factor.

  2. Ensure that all brokers are active before upgrading:

    rpk redpanda admin brokers list

    All brokers should show active for MEMBERSHIP-STATUS and true for IS-ALIVE.

    Example output:
    NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
    0        1          active             true      v22.3.11
    1        1          active             true      v22.3.11
    2        1          active             true      v22.3.11
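If you want to script the check in step 2, the broker list can be parsed as a whitespace-separated table. This sketch assumes that the output has exactly the column layout shown in the example above, with no empty cells. The function names are hypothetical.

```python
from typing import Dict, List


def parse_table(text: str) -> List[Dict[str, str]]:
    """Turn a whitespace-separated table with a header row into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def brokers_not_ready(brokers_output: str) -> List[str]:
    """Return the NODE-ID of every broker that is not active or not alive."""
    return [
        row["NODE-ID"]
        for row in parse_table(brokers_output)
        if row["MEMBERSHIP-STATUS"] != "active" or row["IS-ALIVE"] != "true"
    ]


example = """
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        1          active             true      v22.3.11
1        1          active             true      v22.3.11
2        1          active             true      v22.3.11
"""
print(brokers_not_ready(example))  # prints: []
```

An empty list means that every broker is ready and the upgrade can start.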

New features in a version are enabled after all brokers in the cluster are upgraded. If problems occur, the upgrade is not committed.

Enable maintenance mode

  1. Check that all brokers are healthy:

    rpk cluster health

    Example output:

    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:                     true (1)
    Controller ID:               0
    All nodes:                   [0 1 2] (2)
    Nodes down:                  [] (3)
    Leaderless partitions:       [] (3)
    Under-replicated partitions: [] (3)

    (1) The cluster is either healthy (true) or unhealthy (false).
    (2) The node IDs of all brokers in the cluster.
    (3) If the cluster is unhealthy, these fields contain data.

  2. Select a broker that has not been upgraded yet and place it into maintenance mode:

    rpk cluster maintenance enable <node-id> --wait

    The --wait option tells the command to wait until the selected broker (node 0 in this example) finishes draining all partitions it originally served. The command returns after the partition draining completes.

    Expected output:

    Successfully enabled maintenance mode for node 0
    Waiting for node to drain...

  3. Verify that the broker is in maintenance mode:

    rpk cluster maintenance status

    Expected output:

    NODE-ID  DRAINING  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
    0        true      true      false   3           0         2             0
    1        false     false     false   0           0         0             0
    2        false     false     false   0           0         0             0

    The FINISHED column should read true for the broker that you put into maintenance mode.

  4. Validate the health of the cluster again:

    rpk cluster health --watch --exit-when-healthy

    The combination of the --watch and --exit-when-healthy flags tells rpk to monitor the cluster health and exit only when the cluster is back in a healthy state.

    You can also evaluate metrics to determine cluster health. If the cluster has any issues, take the broker out of maintenance mode by running the following command before proceeding with other operations, such as decommissioning or retrying the rolling upgrade:

    rpk cluster maintenance disable <node-id>
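The health overview from step 1 can also be checked programmatically. The sketch below assumes that the output has the key and value layout of the example above, without the numbered callouts, which are documentation annotations. parse_health and is_safe_to_proceed are hypothetical names. If a field is missing or named differently, the check reports that it is not safe to proceed.

```python
from typing import Dict, List, Union

Value = Union[bool, str, List[str]]


def parse_health(text: str) -> Dict[str, Value]:
    """Parse 'Key: value' lines; '[a b c]' becomes a list, true/false a bool."""
    report: Dict[str, Value] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # title and separator lines contain no colon
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            report[key.strip()] = value[1:-1].split()
        elif value in ("true", "false"):
            report[key.strip()] = value == "true"
        else:
            report[key.strip()] = value
    return report


def is_safe_to_proceed(health_output: str) -> bool:
    report = parse_health(health_output)
    return (
        report.get("Healthy") is True
        and report.get("Nodes down") == []
        and report.get("Leaderless partitions") == []
        and report.get("Under-replicated partitions") == []
    )


example = """CLUSTER HEALTH OVERVIEW
=======================
Healthy:                     true
Controller ID:               0
All nodes:                   [0 1 2]
Nodes down:                  []
Leaderless partitions:       []
Under-replicated partitions: []
"""
print(is_safe_to_proceed(example))  # prints: True
```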

Upgrade your version

  • Linux

    For Linux distributions, the process changes according to the distribution:

    • Fedora/RedHat

      In the terminal, run:

      sudo yum update redpanda

    • Debian/Ubuntu

      In the terminal, run:

      sudo apt update
      sudo apt install redpanda

  • Docker

    Running Redpanda directly on Docker is not supported for production usage. This platform should only be used for testing.

    To perform an upgrade, you must replace the current image with a new one.

    First, check which image is currently running in Docker:

    docker ps

    Stop and remove the containers:

    docker stop <container_id>
    docker rm <container_id>

    Remove current images:

    docker rmi <image_id>

    Pull the desired Redpanda version, or adjust the setting to latest in the version tag:

    docker pull docker.redpanda.com/redpandadata/redpanda:<version>

    After the pull completes, restart the cluster:

    docker restart <container_name>

    The docker restart command works only on a container that still exists. If you removed the container in the earlier step, create it again from the new image with the same options that you used originally.

    For more information, see the Redpanda Quickstart.

  • macOS

    If you previously installed Redpanda with brew, run:

    brew upgrade redpanda-data/tap/redpanda

    For installations from binary files, download the preferred version from the release list and then overwrite the current rpk file in the installed location.

Check metrics

Before continuing with the upgrade, check these important metrics to make sure the cluster is healthy and working as expected.

  • redpanda_kafka_under_replicated_replicas

    Description: Measures the number of under-replicated Kafka replicas. A non-zero value means that replication is lagging. Zero means that all replicas are replicated.

    Recommendation: Pause upgrades if non-zero.

  • redpanda_cluster_unavailable_partitions

    Description: Represents the number of partitions that are currently unavailable. Zero indicates that all partitions are available. A non-zero value is the count of unavailable partitions.

    Recommendation: Ensure that the metric shows zero unavailable partitions before restart.

  • redpanda_kafka_request_bytes_total

    Description: Total bytes processed for Kafka requests.

    Recommendation: Ensure that the produce and consume rate for each broker recovers to its pre-upgrade value before restart.

  • redpanda_kafka_request_latency_seconds

    Description: Latency for processing Kafka requests. Indicates the delay between a Kafka request being initiated and completed.

    Recommendation: Ensure that the p99 histogram value recovers to its pre-upgrade level before restart.

  • redpanda_rpc_request_latency_seconds

    Description: Latency for processing RPC requests. Shows the delay between an RPC request initiation and completion.

    Recommendation: Ensure that the p99 histogram value recovers to its pre-upgrade level before restart.

  • redpanda_cpu_busy_seconds_total

    Description: CPU utilization for a given second. The value is a decimal between 0.0 and 1.0. A value of 1.0 means that the CPU was busy for the entire second, operating at 100% capacity. A value of 0.5 implies that the CPU was busy for half the time (or 500 milliseconds) in the given second. A value of 0.0 indicates that the CPU was idle during the entire second.

    Recommendation: If you’re seeing high values consistently, investigate the reasons. It could be due to high traffic or other system bottlenecks.
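The first two metrics listed above act as a gate: both must be zero before you restart a broker. The sketch below sums metric samples from text in the Prometheus exposition format and reports which gate metrics are non-zero. It assumes that sample lines carry no trailing timestamp, and it treats a missing metric as zero, which you may prefer to flag instead. The function names, label names, and sample values are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

MUST_BE_ZERO = (
    "redpanda_kafka_under_replicated_replicas",
    "redpanda_cluster_unavailable_partitions",
)


def sum_metrics(exposition_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, across all label combinations."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")  # value is the last field
        name = series.split("{", 1)[0]           # drop the label set, if any
        totals[name] += float(value)
    return dict(totals)


def blocking_metrics(exposition_text: str) -> List[str]:
    """Return the gate metrics whose summed value is not zero."""
    totals = sum_metrics(exposition_text)
    return [name for name in MUST_BE_ZERO if totals.get(name, 0.0) != 0.0]


sample = """\
# TYPE redpanda_kafka_under_replicated_replicas gauge
redpanda_kafka_under_replicated_replicas{redpanda_topic="orders",redpanda_partition="0"} 0
redpanda_kafka_under_replicated_replicas{redpanda_topic="orders",redpanda_partition="1"} 2
redpanda_cluster_unavailable_partitions 0
"""
print(blocking_metrics(sample))  # prints: ['redpanda_kafka_under_replicated_replicas']
```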

Restart the broker

Restart the broker’s Redpanda service with rpk redpanda stop, then rpk redpanda start.

Disable maintenance mode

After you’ve successfully upgraded the broker:

  1. Take the broker out of maintenance mode:

    rpk cluster maintenance disable <node-id>

    Expected output:

    Successfully disabled maintenance mode for node 0

  2. Ensure that the broker is no longer in maintenance mode:

    rpk cluster maintenance status

    Expected output:
    NODE-ID  DRAINING  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
    0        false     false     false   0           0         0             0
    1        false     false     false   0           0         0             0
    2        false     false     false   0           0         0             0

Post-upgrade tasks

To verify that the cluster is running properly, run:

rpk cluster health

To view additional information about your brokers, run:

rpk redpanda admin brokers list
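Because new features are enabled only after every broker is upgraded, it can be useful to confirm that the broker list no longer shows a mix of versions. This sketch counts the values in the BROKER-VERSION column, assuming the whitespace-separated layout of the broker list example earlier on this page. The function names and the sample version numbers are illustrative.

```python
from collections import Counter


def version_counts(brokers_output: str) -> Counter:
    """Count brokers per value of the BROKER-VERSION column."""
    header, *rows = [
        line.split() for line in brokers_output.strip().splitlines() if line.strip()
    ]
    column = header.index("BROKER-VERSION")
    return Counter(row[column] for row in rows)


def is_mixed_version(brokers_output: str) -> bool:
    return len(version_counts(brokers_output)) > 1


mid_upgrade = """
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        1          active             true      v23.1.1
1        1          active             true      v22.3.11
2        1          active             true      v22.3.11
"""
print(is_mixed_version(mid_upgrade))  # prints: True
```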
