Upgrade Redpanda in Kubernetes
To benefit from Redpanda’s new features and enhancements, upgrade to the latest version. New features are available after all brokers in the cluster are upgraded and restarted.
Redpanda Self-Managed version numbers follow the convention AB.C.D, where AB is the two-digit year, C is the feature release, and D is the patch release. For example, version 22.3.1 indicates the first patch release of the third feature release of the year 2022. Patch releases include bug fixes and minor improvements, with no change to user-facing behavior. New and enhanced features are documented with each feature release. You can find a list of all releases on GitHub.
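As a minimal sketch of the AB.C.D convention, the following shell function splits a version string into its three parts. The name parse_version is a hypothetical helper, not part of rpk, and the sketch assumes a full three-part version with an optional leading "v".

```shell
# parse_version VERSION: split a version such as "v22.3.1" into its parts.
# Hypothetical helper for illustration; assumes the full AB.C.D form.
parse_version() (
  v="${1#v}"               # drop an optional leading "v"
  year="${v%%.*}"          # AB: two-digit year
  rest="${v#*.}"
  feature="${rest%%.*}"    # C: feature release
  patch="${rest#*.}"       # D: patch release
  printf 'year=%s feature=%s patch=%s\n' "$year" "$feature" "$patch"
)
```

For example, parse_version v22.3.1 prints year=22 feature=3 patch=1.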
Limitations
The following limitations ensure a smooth transition between versions and help to maintain the stability of your cluster.
-
Broker upgrades:
-
New features are enabled only after upgrading all brokers in the cluster.
-
You can upgrade only one feature release at a time, for example from 22.2 to 22.3. Skipping feature releases is not supported.
-
Rollbacks: You can roll back to the original version only if at least one broker is still running the original version (not yet upgraded) and the cluster hasn’t yet restarted.
-
Downgrades: Downgrades are possible only between patch releases of the same feature release. For example, you can downgrade from 22.2.2 to 22.2.1. Downgrading to previous feature releases, such as 22.1.x, is not supported.
-
Tiered Storage: If you have Tiered Storage enabled and you’re upgrading to 23.2, object storage uploads are paused until all brokers are upgraded. If the cluster cannot upgrade, roll it back to the original version.
In a mixed-version state, the cluster could run out of disk space. If you need to force a mixed-version cluster to upload, transfer partition leadership to brokers running the original version.
-
Remote Read Replicas: Upgrade the Remote Read Replica cluster first, ensuring it’s on the same version as the origin cluster or one feature release ahead of the origin cluster. When upgrading to Redpanda 23.2, metadata from object storage is not synchronized until all brokers in the cluster are upgraded. If you need to force a mixed-version cluster to sync read replicas, transfer partition leadership to brokers running the original version.
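The upgrade and downgrade rules above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the function names are hypothetical, and the script cannot know how many feature releases a given year has, so a step into the first feature release of the following year is reported with exit status 2 (check the release list manually) rather than accepted or rejected.

```shell
# feature_of VERSION: print the "AB.C" feature release of a version.
feature_of() (
  v="${1#v}"
  year="${v%%.*}"
  rest="${v#*.}"
  printf '%s.%s\n' "$year" "${rest%%.*}"
)

# can_downgrade FROM TO: succeed only if both versions are patch releases
# of the same feature release.
can_downgrade() {
  [ "$(feature_of "$1")" = "$(feature_of "$2")" ]
}

# is_single_step FROM TO: status 0 if TO is in the same feature release or
# the next one in the same year, 2 if TO is the first feature release of
# the following year (cannot be verified here), 1 otherwise.
is_single_step() (
  from="$(feature_of "$1")"; to="$(feature_of "$2")"
  fy="${from%%.*}"; ff="${from#*.}"
  ty="${to%%.*}";   tf="${to#*.}"
  if [ "$fy" -eq "$ty" ]; then
    [ "$tf" -eq "$ff" ] || [ "$tf" -eq $((ff + 1)) ]
  elif [ "$ty" -eq $((fy + 1)) ] && [ "$tf" -eq 1 ]; then
    exit 2   # valid only if FROM is the last feature release of its year
  else
    exit 1
  fi
)
```

For example, is_single_step 22.2.10 22.3.1 succeeds, while is_single_step 22.1.1 22.3.1 fails because it skips a feature release.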
Prerequisites
-
The default RollingUpdate strategy configured in the Helm values.
Impact of broker restarts
When brokers restart, clients may experience higher latency, nodes may experience CPU spikes when the broker becomes available again, and you may receive alerts about under-replicated partitions. Topics that weren’t using replication (that is, topics that had replication.factor=1) will be unavailable.
Temporary increase in latency on clients (producers and consumers)
When you restart one or more brokers in a cluster, clients (consumers and producers) may experience higher latency due to partition leadership reassignment. Because clients must communicate with the leader of a partition, they may send a request to a broker whose leadership has been transferred, and receive NOT_LEADER_FOR_PARTITION. In this case, clients must request metadata from the cluster to find out the address of the new leader. Clients refresh their metadata periodically, or when the client receives some retryable errors that indicate that the metadata may be stale. For example:
-
Broker A shuts down.
-
Client sends a request to broker A, and receives NOT_LEADER_FOR_PARTITION.
-
Client requests metadata, and learns that the new leader is broker B.
-
Client sends the request to broker B.
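The sequence above can be imitated with a toy simulation. No real Kafka client or broker is involved, and every name here (leader, send_request, fetch_metadata, produce_with_retry) is hypothetical; the sketch only shows why one extra round trip, and therefore extra latency, follows a leadership change.

```shell
# Toy simulation only; all names are hypothetical.
leader="B"   # broker A has shut down, so leadership has moved to broker B

# A broker accepts the request only if it currently leads the partition.
send_request() {
  if [ "$1" = "$leader" ]; then
    echo "OK"
  else
    echo "NOT_LEADER_FOR_PARTITION"
  fi
}

# A metadata request returns the current leader.
fetch_metadata() { echo "$leader"; }

# Send to the cached leader; on a retryable error, refresh metadata and retry.
produce_with_retry() (
  target="$1"   # the leader that the client has cached
  attempt=1
  while [ "$attempt" -le 3 ]; do
    if [ "$(send_request "$target")" = "OK" ]; then
      echo "delivered to broker $target on attempt $attempt"
      exit 0
    fi
    target="$(fetch_metadata)"
    attempt=$((attempt + 1))
  done
  exit 1
)
```

Calling produce_with_retry A prints "delivered to broker B on attempt 2": the first attempt goes to the stale leader, and the second succeeds after the metadata refresh.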
CPU spikes upon broker restart
When a restarted broker becomes available again, you may see your nodes' CPU usage increase temporarily. This temporary increase in CPU usage is due to the cluster rebalancing the partition replicas.
Under-replicated partitions
When a broker is in maintenance mode, Redpanda continues to replicate updates to that broker. When a broker is taken offline during a restart, partitions with replicas on the broker could become out of sync until it is brought back online. Once the broker is available again, data is copied to its under-replicated replicas until all affected partitions are in sync with the partition leader.
Incompatible changes
Patch releases 22.3.14 and 23.1.2 changed the behavior when remote read is disabled and the requested Raft term falls below the local log’s beginning. In earlier versions, Redpanda returned an offset of -1. With the patch, when you request a value older than the lowest offset, Redpanda returns the lowest offset, not -1.
Check your current Redpanda version
Before you perform a rolling upgrade:
-
Find the Redpanda version that you are currently running.
To find your current version of Redpanda, use rpk redpanda admin brokers list:
kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
  rpk redpanda admin brokers list
Expected output:
The Redpanda version for each broker is listed under BROKER-VERSION.
NODE-ID  BROKER-VERSION
0        v22.2.10
1        v22.2.10
2        v22.2.10
-
Review the Kubernetes compatibility matrix to find out if you need to upgrade the Helm chart or the Redpanda Operator to use your chosen version of Redpanda.
If your current version of Redpanda is more than one feature release behind the one to which you want to upgrade, you must first upgrade to an intermediate version of Redpanda.
-
Learn what’s changed since your original version.
To find information about what has changed between Redpanda versions, check the release notes.
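To confirm that every broker reports the same version before you start, you can filter the broker list through a small helper. This is a sketch: distinct_broker_versions is a hypothetical name, and it assumes only that the output has a header row containing a BROKER-VERSION column, as in the expected output above.

```shell
# distinct_broker_versions: read `rpk redpanda admin brokers list` output on
# stdin and print each distinct value found in the BROKER-VERSION column.
distinct_broker_versions() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "BROKER-VERSION") col = i; next }
    col && NF >= col { print $col }
  ' | sort -u
}
```

Pipe the output of the kubectl exec command shown in step 1 into distinct_broker_versions. A single line of output means all brokers run the same version; more than one line means the cluster is in a mixed-version state.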
Prepare your cluster
Before you upgrade, you must make sure that your cluster is in a healthy state and that your topics are configured to limit outages during the upgrade process.
-
Check whether any topics have a replication factor of 1.
If you have topics with a replication factor of 1, and if you have sufficient disk space, temporarily increase the replication factor to limit outages for these topics during the rolling upgrade.
Increase the replication factor before you upgrade to ensure that Redpanda has time to replicate data to other brokers.
-
Ensure that the cluster is healthy:
kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
  rpk cluster health
The draining process won’t start until the cluster is healthy.
Example output:
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                     true (1)
Controller ID:               0
All nodes:                   [0 1 2] (2)
Nodes down:                  [] (3)
Leaderless partitions:       [] (3)
Under-replicated partitions: [] (3)
(1) The cluster is either healthy (true) or unhealthy (false).
(2) The node IDs of all brokers in the cluster.
(3) These fields contain data only when the cluster is unhealthy.
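Both preparation checks can be scripted by filtering command output. The helpers below are a sketch with hypothetical names. single_replica_topics assumes that rpk topic list prints a header row with NAME first and a REPLICAS column; cluster_is_healthy relies only on the "Healthy:" line shown in the example output above.

```shell
# single_replica_topics: read `rpk topic list` output on stdin and print the
# name of each topic whose REPLICAS column is 1.
# Assumption: the header row contains a REPLICAS column and NAME comes first.
single_replica_topics() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "REPLICAS") col = i; next }
    col && $col == 1 { print $1 }
  '
}

# cluster_is_healthy: read `rpk cluster health` output on stdin and succeed
# only if it contains the line "Healthy: true".
cluster_is_healthy() {
  awk '
    $1 == "Healthy:" { found = 1; ok = ($2 == "true") }
    END { exit !(found && ok) }
  '
}
```

Run rpk topic list and rpk cluster health through kubectl exec as in the steps above, and pipe the output into the matching helper. Any topic that single_replica_topics prints is a candidate for a temporary replication factor increase.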
Perform a rolling upgrade
Performing a rolling upgrade allows you to update the version of Redpanda managed by the Redpanda Helm chart without downtime. This process ensures that each broker is sequentially updated and restarted, minimizing the impact on your environment.
You can use two methods to upgrade a Redpanda cluster in Kubernetes. The first method is to upgrade the Helm release to a newer version of the Redpanda Helm chart that uses the desired Redpanda version as a default. The second method is to update the existing Helm release to use a newer Redpanda image. The first method is preferred because upgrading the entire chart ensures that any new parameters required to configure the cluster are defined.
Upgrading a Redpanda cluster in Kubernetes triggers a sequential restart of the Pods managed by the StatefulSet. During each broker’s restart, the following steps occur:
-
The preStop lifecycle hook is executed to place the broker into maintenance mode. This step ensures that the broker stops accepting new connections and finishes processing its current tasks.
-
Kubernetes then terminates the Pod. If the broker does not shut down within the allowed grace period (default 90 seconds), Kubernetes forcefully terminates it using a SIGKILL signal.
-
After the Pod is terminated and recreated, the postStart lifecycle hook is executed to take the broker out of maintenance mode, allowing it to rejoin the cluster.
Helm + Operator
-
Review the Kubernetes compatibility matrix and determine the version of the Redpanda Operator that is compatible with the Helm chart version you plan to use. The Redpanda Operator must be able to understand and manage the Helm chart and the Redpanda version you are deploying. If you need to upgrade, see Upgrade the Redpanda Operator.
-
Check the default Redpanda version of a chart to make sure that it uses the version that you want to upgrade your cluster to.
helm show chart --version <chart-version> redpanda/redpanda | grep "appVersion"
Replace <chart-version> with the version number of a newer chart.
-
Upgrade the Redpanda version by either updating the Helm chart version or the Redpanda image.
redpanda-cluster.yaml
apiVersion: cluster.redpanda.com/v1alpha1
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef:
    chartVersion: <helm-chart-version> # (1)
  clusterSpec:
    image: # Optional
      tag: <new-version> # (2)
    statefulset: # Optional
      terminationGracePeriodSeconds: <grace-period> # (3)
(1) The version of the Redpanda Helm chart to deploy.
(2) If you need to upgrade to an intermediate version of Redpanda, use this setting to specify the version of Redpanda to deploy. This version overrides the default one in the Helm chart. Replace <new-version> with a valid version tag.
(3) The statefulset.terminationGracePeriodSeconds setting defines how long Kubernetes will wait for the broker to shut down gracefully before forcefully terminating it. The default value is 90 seconds, which is enough for most clusters, but might require adjustment based on your workload. Modify this setting in your Helm values file if your Redpanda brokers have high loads or hold large amounts of data, as they might need more time to shut down gracefully.
-
Apply the Redpanda resource to deploy the Redpanda cluster:
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
Helm
-
Review the Kubernetes compatibility matrix and verify which version of the Redpanda Helm chart supports the Redpanda version you plan to upgrade to. The Helm chart version can dictate which configurations and Kubernetes resources are available or required for that specific version of Redpanda.
-
Check the default Redpanda version of a chart to make sure that it uses the version that you want to upgrade your cluster to.
helm show chart --version <chart-version> redpanda/redpanda | grep "appVersion"
Replace <chart-version> with the version number of a newer chart.
-
Back up your current Helm values for the Redpanda Helm chart:
helm get values redpanda --namespace <namespace> > redpanda-values-backup.yaml
You’ll need to apply these overrides in the next step.
-
Optional: Update the following settings:
redpanda-version.yaml
image:
  tag: <new-version> # (1)
statefulset:
  terminationGracePeriodSeconds: <grace-period> # (2)
(1) If you need to upgrade to an intermediate version of Redpanda, use this setting to specify the version of Redpanda to deploy. This version overrides the default one in the Helm chart. Replace <new-version> with a valid version tag.
(2) The statefulset.terminationGracePeriodSeconds setting defines how long Kubernetes will wait for the broker to shut down gracefully before forcefully terminating it. The default value is 90 seconds, which is enough for most clusters, but might require adjustment based on your workload. Modify this setting in your Helm values file if your Redpanda brokers have high loads or hold large amounts of data, as they might need more time to shut down gracefully.
-
Deploy Redpanda with the new Helm chart version:
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> \
  --create-namespace \
  --version <helm-chart-version> \
  --values redpanda-version.yaml
Make sure to include all existing overrides, otherwise the upgrade may fail. For example, if you already enabled SASL, include the same SASL overrides.
Do not use the --reuse-values flag, otherwise Helm won’t include any new values from the upgraded chart.
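One way to carry the existing overrides into the upgrade without --reuse-values is to pass the backup file and the new settings as two --values files. Helm merges multiple --values files from left to right, so the later file takes precedence. The sketch below only prints the command so that you can review it before running it; print_upgrade_command is a hypothetical helper, and the file names are the ones used in the steps above.

```shell
# print_upgrade_command NAMESPACE CHART_VERSION: print (do not run) a helm
# upgrade command that layers the new settings over the backed-up overrides.
print_upgrade_command() {
  printf 'helm upgrade --install redpanda redpanda/redpanda'
  printf ' --namespace %s --create-namespace --version %s' "$1" "$2"
  printf ' --values redpanda-values-backup.yaml'   # existing overrides first
  printf ' --values redpanda-version.yaml\n'       # new settings last, so they win
}
```

If the backup file begins with a header line rather than plain YAML, consider re-exporting it with helm get values redpanda --namespace <namespace> --output yaml before using it as a values file.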
Verify the upgrade
After upgrading, verify that your Redpanda cluster is functioning correctly:
-
Wait for the Pods to be terminated and recreated with the new version of Redpanda.
kubectl get pod --namespace <namespace> --watch
Each Pod in the StatefulSet is terminated one at a time, starting from the one with the highest ordinal.
Example output
NAME                               READY  STATUS
redpanda-controller-operator       2/2    Running
redpanda-0                         2/2    Running
redpanda-1                         2/2    Running
redpanda-2                         0/2    Init:0/3
redpanda-configuration-88npt       0/1    Completed
redpanda-console-7cf85cf87f-rmtnj  1/1    Running
redpanda-post-upgrade-ljqpr        0/1    Completed
-
When all of the Pods are ready and have a Running status, verify that the brokers are now running the upgraded version of Redpanda:
kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
  rpk redpanda admin brokers list
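Rather than reading the Pod list by eye, you can filter it for Pods that are not ready yet. This is a sketch: pods_not_ready is a hypothetical helper that assumes the NAME, READY, STATUS column order shown in the example output above, and it treats Completed Pods (finished Jobs) as done.

```shell
# pods_not_ready: read `kubectl get pod` output on stdin and print each Pod
# that is not Running with all of its containers ready.
pods_not_ready() {
  awk '
    NR > 1 {
      if ($3 == "Completed") next            # finished Jobs are fine
      split($2, ready, "/")                  # READY looks like "1/2"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}
```

Pipe kubectl get pod --namespace <namespace> (without --watch) into pods_not_ready. Empty output means every Pod is ready, and you can go on to check the broker versions.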
Roll back
If something does not go as planned during a rolling upgrade, you can roll back to the original version as long as you have not upgraded all brokers.
The StatefulSet uses the RollingUpdate strategy by default in statefulset.updateStrategy.type, which means all Pods in the StatefulSet are restarted in reverse-ordinal order. For details, see the Kubernetes documentation.
Helm + Operator
The Redpanda Operator rolls back automatically after three failed attempts to upgrade the cluster.
Helm
-
Find the previous revision:
helm history redpanda --namespace <namespace>
Example output
REVISION  UPDATED                  STATUS      CHART            APP VERSION  DESCRIPTION
1         Fri Mar 3 15:16:24 year  superseded  redpanda-2.12.2  v22.3.13     Install complete
2         Fri Mar 3 15:19:41 year  deployed    redpanda-2.12.2  v22.3.13     Upgrade complete
-
Roll back to the previous revision:
helm rollback redpanda <previous-revision> --namespace <namespace>
-
Verify that the cluster is healthy. If the cluster is unhealthy, the upgrade may still be in progress. The command exits when the cluster is healthy.
kubectl exec <pod-name> --namespace <namespace> -c redpanda -- \
  rpk cluster health \
  --watch --exit-when-healthy
Example output:
CLUSTER HEALTH OVERVIEW
=======================
Healthy:               true
Controller ID:         1
All nodes:             [2,1,0]
Nodes down:            []
Leaderless partitions: []
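If you script the Helm rollback, the previous revision can be read from the helm history table. This sketch assumes that helm history lists revisions oldest first with the revision number in the first column, as in the example output above. previous_revision is a hypothetical helper that prints the revision listed immediately before the latest one; check that it is the revision you expect before rolling back.

```shell
# previous_revision: read `helm history` output on stdin and print the
# revision number on the second-to-last row. Fails if there are fewer than
# two revisions.
previous_revision() {
  awk '
    NR > 1 { prev = last; last = $1 }    # skip the header row
    END { if (prev == "") exit 1; print prev }
  '
}
```

With the example history above, the helper prints 1, which is the value to use for <previous-revision>.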
Troubleshooting
HelmRelease is not ready
If you are using the Redpanda Operator, you may see the following message while waiting for a Redpanda custom resource to be deployed:
NAME READY STATUS
redpanda False HelmRepository 'redpanda/redpanda-repository' is not ready
redpanda False HelmRelease 'redpanda/redpanda' is not ready
While the deployment process can sometimes take a few minutes, a prolonged 'not ready' status may indicate an issue. Follow the steps below to investigate:
-
Check the status of the HelmRelease:
kubectl describe helmrelease <redpanda-resource-name> --namespace <namespace>
-
Review the Redpanda Operator logs:
kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>
HelmRelease retries exhausted
The HelmRelease retries exhausted error occurs when the Helm Controller has tried to reconcile the HelmRelease a number of times, but these attempts have failed consistently.
The Helm Controller watches for changes in HelmRelease objects. When changes are detected, it tries to reconcile the state defined in the HelmRelease with the state in the cluster. The process of reconciliation includes installation, upgrade, testing, rollback or uninstallation of Helm releases.
You may see this error due to:
-
Incorrect configuration in the HelmRelease.
-
Issues with the chart, such as a non-existent chart version or the chart repository not being accessible.
-
Missing dependencies or prerequisites required by the chart.
-
Issues with the underlying Kubernetes cluster, such as insufficient resources or connectivity issues.
To debug this error do the following:
-
Check the status of the HelmRelease:
kubectl describe helmrelease <cluster-name> --namespace <namespace>
-
Review the Redpanda Operator logs:
kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>
When you find and fix the error, you must use the Flux CLI, flux, to suspend and resume the reconciliation process:
-
Suspend the HelmRelease:
flux suspend helmrelease <cluster-name> --namespace <namespace>
-
Resume the HelmRelease:
flux resume helmrelease <cluster-name> --namespace <namespace>
Crash loop backoffs
If a broker crashes after startup, or gets stuck in a crash loop, it could produce progressively more stored state that uses additional disk space and takes more time for each restart to process.
To prevent infinite crash loops, the Redpanda Helm chart sets the crash_loop_limit node property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. After Redpanda reaches this limit, it will not start until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a CrashLoopBackOff state until its internal consecutive crash counter is reset to zero.
To troubleshoot a crash loop backoff:
-
Check the Redpanda logs from the most recent crashes:
kubectl logs <pod-name> --namespace <namespace>
Kubernetes retains logs only for the current and the previous instance of a container. This limitation makes it difficult to access logs from earlier crashes, which may contain vital clues about the root cause of the issue. Given these log retention limitations, setting up a centralized logging system is crucial. Systems such as Loki or Datadog can capture and store logs from all containers, ensuring you have access to historical data.
-
Resolve the issue that led to the crash loop backoff.
-
Reset the crash counter to zero to allow Redpanda to restart. You can do any of the following to reset the counter:
-
Update the redpanda.yaml configuration file. You can make changes to any of the following sections in the Redpanda Helm chart to trigger an update:
-
config.cluster
-
config.node
-
config.tunable
-
Delete the startup_log file in the broker’s data directory.
kubectl exec <pod-name> --namespace <namespace> -- rm /var/lib/redpanda/data/startup_log
It might be challenging to execute this command within a Pod that is in a CrashLoopBackOff state due to the limited time during which the Pod is available before it restarts. Wrapping the command in a loop might work.
-
Wait one hour since the last crash. The crash counter resets after one hour.
-
To avoid future crash loop backoffs and manage the accumulation of small segments effectively:
-
Monitor the size and number of segments regularly.
-
Optimize your Redpanda configuration for segment management.
-
Consider implementing Tiered Storage to manage data more efficiently.
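For the step that deletes startup_log, wrapping the command in a loop could look like the following sketch. The retry function is a hypothetical helper: it reruns any command until it succeeds or a fixed number of attempts have failed. The attempt count and delay in the usage note are illustrative values, not recommendations.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds,
# sleeping DELAY_SECONDS between tries. Fails after ATTEMPTS failed runs.
retry() (
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && exit 1
    n=$((n + 1))
    sleep "$delay"
  done
)
```

For example, retry 30 2 kubectl exec <pod-name> --namespace <namespace> -- rm /var/lib/redpanda/data/startup_log tries the deletion up to 30 times, two seconds apart, which gives it a chance to run during the short window in which the container is up.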
StatefulSet never rolls out
If the StatefulSet Pods remain in a pending state, they are waiting for resources to become available.
To identify the Pods that are pending, use the following command:
kubectl get pod --namespace <namespace>
The response includes a list of Pods in the StatefulSet and their status.
To view logs for a specific Pod, use the following command.
kubectl logs -f <pod-name> --namespace <namespace>
You can use the output to debug your deployment.
Didn’t match pod anti-affinity rules
If you see this error, your cluster does not have enough nodes to satisfy the anti-affinity rules:
Warning FailedScheduling 18m default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
The Helm chart configures default podAntiAffinity rules to make sure that only one Pod running a Redpanda broker is scheduled on each worker node. To learn why, see Number of workers.
To resolve this issue, do one of the following:
-
Create additional worker nodes.
-
Modify the anti-affinity rules (for development purposes only).
If adding nodes is not an option, you can modify the podAntiAffinity rules in your StatefulSet to be less strict.
Helm + Operator
redpanda-cluster.yaml
apiVersion: cluster.redpanda.com/v1alpha1
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    statefulset:
      podAntiAffinity:
        type: soft
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
Helm
--values
docker-repo.yaml
statefulset:
  podAntiAffinity:
    type: soft
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values docker-repo.yaml --reuse-values
--set
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --set statefulset.podAntiAffinity.type=soft
Unable to mount volume
If you see volume mounting errors in the Pod events or in the Redpanda logs, ensure that each of your Pods has a volume available in which to store data.
-
If you’re using StorageClasses with dynamic provisioners (default), ensure they exist:
kubectl get storageclass
-
If you’re using PersistentVolumes, ensure that you have one PersistentVolume available for each Redpanda broker, and that each one has the storage capacity that’s set in storage.persistentVolume.size:
kubectl get persistentvolume --namespace <namespace>
To learn how to configure different storage volumes, see Configure Storage.
Failed to pull image
When deploying the Redpanda Helm chart, you may encounter Docker rate limit issues because the default registry URL is not recognized as a Docker Hub URL. The domain docker.redpanda.com is used for statistical purposes, such as tracking the number of downloads. It mirrors Docker Hub’s content while providing specific analytics for Redpanda.
Failed to pull image "docker.redpanda.com/redpandadata/redpanda:v<version>": rpc error: code = Unknown desc = failed to pull and unpack image "docker.redpanda.com/redpandadata/redpanda:v<version>": failed to copy: httpReadSeeker: failed open: unexpected status code 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
To fix this error, do one of the following:
-
Replace the image.repository value in the Helm chart with docker.io/redpandadata/redpanda. Switching to Docker Hub avoids the rate limit issues associated with docker.redpanda.com.
Helm + Operator
redpanda-cluster.yaml
apiVersion: cluster.redpanda.com/v1alpha1
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    image:
      repository: docker.io/redpandadata/redpanda
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
Helm
--values
docker-repo.yaml
image:
  repository: docker.io/redpandadata/redpanda
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values docker-repo.yaml --reuse-values
--set
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --set image.repository=docker.io/redpandadata/redpanda
-
Authenticate to Docker Hub by logging in with your Docker Hub credentials. The docker.redpanda.com site acts as a reflector for Docker Hub. As a result, when you log in with your Docker Hub credentials, you will bypass the rate limit issues.
Dig not defined
This error means that you are using an unsupported version of Helm:
Error: parse error at (redpanda/templates/statefulset.yaml:203): function "dig" not defined
To fix this error, ensure that you are using the minimum required version: 3.10.0.
helm version
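To compare the installed version against the minimum without reading it by eye, you can compare the version parts numerically. A plain string comparison would wrongly rank 3.9 above 3.10. This is a sketch: version_at_least is a hypothetical helper, and it assumes version strings of the form printed by helm version --short, such as v3.12.0+gc9f554d.

```shell
# version_at_least VERSION MINIMUM: succeed if VERSION >= MINIMUM, comparing
# major, minor, and patch as numbers. A leading "v" and any "+build" or
# "-suffix" part are ignored.
version_at_least() (
  normalize() { v="${1#v}"; v="${v%%[+-]*}"; echo "$v" | tr '.' ' '; }
  # After this line: $1 $2 $3 = VERSION parts, $4 $5 $6 = MINIMUM parts.
  set -- $(normalize "$1") $(normalize "$2")
  [ "$1" -gt "$4" ] && exit 0
  [ "$1" -lt "$4" ] && exit 1
  [ "$2" -gt "$5" ] && exit 0
  [ "$2" -lt "$5" ] && exit 1
  [ "$3" -ge "$6" ]
)
```

For example, version_at_least "$(helm version --short)" 3.10.0 succeeds only when the installed Helm meets the minimum required version.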
Repository name already exists
If you see this error, remove the redpanda chart repository, then try installing it again.
helm repo remove redpanda
helm repo add redpanda https://charts.redpanda.com
helm repo update
Fatal error during checker "Data directory is writable" execution
This error appears when Redpanda does not have write access to your configured storage volume under storage in the Helm chart.
Error: fatal error during checker "Data directory is writable" execution: open /var/lib/redpanda/data/test_file: permission denied
To fix this error, set statefulset.initContainers.setDataDirOwnership.enabled to true so that the initContainer can set the correct permissions on the data directories.
Cannot patch "redpanda" with kind StatefulSet
This error appears when you run helm upgrade with the --values flag but do not include all your previous overrides.
Error: UPGRADE FAILED: cannot patch "redpanda" with kind StatefulSet: StatefulSet.apps "redpanda" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
To fix this error, do one of the following:
-
Include all the value overrides from the previous installation or upgrade using either the --set or the --values flags.
-
Use the --reuse-values flag.
Do not use the --reuse-values flag to upgrade from one version of the Helm chart to another. This flag stops Helm from using any new values in the upgraded chart.
Cannot patch "redpanda-console" with kind Deployment
This error appears if you try to upgrade your deployment and you already have console.enabled set to true.
Error: UPGRADE FAILED: cannot patch "redpanda-console" with kind Deployment: Deployment.apps "redpanda-console" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"redpanda", "app.kubernetes.io/name":"console"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
To fix this error, set console.enabled to false so that Helm doesn’t try to deploy Redpanda Console again.
Helm is in a pending-rollback state
An interrupted Helm upgrade process can leave your Helm release in a pending-rollback state. This state prevents further actions like upgrades, rollbacks, or deletions through standard Helm commands. To fix this:
-
Identify the Helm release that’s in a pending-rollback state:
helm list --namespace <namespace> --all
Look for releases with a status of pending-rollback. These are the ones that need intervention.
-
Verify the Secret’s status to avoid affecting the wrong resource:
kubectl --namespace <namespace> get secret --show-labels
Identify the Secret associated with your Helm release by its pending-rollback status in the labels.
Ensure you have correctly identified the Secret to avoid unintended consequences. Deleting the wrong Secret could impact other deployments or services.
-
Delete the Secret to clear the pending-rollback state:
kubectl --namespace <namespace> delete secret -l status=pending-rollback
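To reduce the risk of matching a Secret that belongs to another release, you can narrow the label selector. Helm 3 labels its release Secrets with owner=helm, name=<release-name>, and status=<status>, so a selector that combines all three matches only the intended release. This is a sketch; helm_secret_selector is a hypothetical helper that builds the selector string.

```shell
# helm_secret_selector RELEASE STATUS: print a label selector that matches
# only the Helm release Secrets of RELEASE that are in the given STATUS.
helm_secret_selector() {
  printf 'owner=helm,name=%s,status=%s\n' "$1" "$2"
}
```

Preview the match first with kubectl --namespace <namespace> get secret -l "$(helm_secret_selector redpanda pending-rollback)", and use the same selector with delete only after the list shows the Secret you expect. The release name redpanda here is the one used throughout this page.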
After clearing the pending-rollback state:
-
Retry the upgrade: Restart the upgrade process. You should investigate the initial failure to avoid getting into the pending-rollback state again.
-
Perform a rollback: If you need to roll back to a previous release, use helm rollback <release-name> <revision> to revert to a specific, stable release version.
Suggested reading
To set up a real-time dashboard to monitor your cluster health, see Monitor Redpanda.