Resolve Errors in Kubernetes

This section describes errors or issues you might encounter while deploying Redpanda in Kubernetes and explains how to troubleshoot them.

Kubernetes-related issues

This section addresses common issues that may occur when deploying Redpanda in a Kubernetes environment.

Helm v3.18.0 is not supported (json.Number error)

If you are using Helm v3.18.0, you may encounter errors such as:

Error: INSTALLATION FAILED: execution error at (redpanda/templates/entry-point.yaml:17:4): invalid Quantity expected string or float64 got: json.Number (1)

This is due to a bug in Helm v3.18.0. To avoid similar errors, upgrade to a later version. For more details, see the Helm GitHub issue.

StatefulSet never rolls out

If the StatefulSet Pods remain in a pending state, they are waiting for resources to become available.

To identify the Pods that are pending, use the following command:

kubectl get pod --namespace <namespace>

The response includes a list of Pods in the StatefulSet and their status.

To view logs for a specific Pod, use the following command:

kubectl logs -f <pod-name> --namespace <namespace>

You can use the output to debug your deployment.

Didn’t match pod anti-affinity rules

If you see this error, your cluster does not have enough nodes to satisfy the anti-affinity rules:

Warning  FailedScheduling  18m  default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

The Helm chart configures default podAntiAffinity rules to make sure that only one Pod running a Redpanda broker is scheduled on each worker node. To learn why, see Number of workers.
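
One way to confirm the mismatch is to compare the number of nodes with the number of brokers that the StatefulSet requests. The following is a minimal sketch, not part of the Redpanda tooling: the function name check_node_capacity is hypothetical, the StatefulSet is assumed to be named redpanda, and the node count includes every node, even ones that cannot run Redpanda Pods because of taints.

```shell
# Compare the node count with the broker count requested by the StatefulSet.
# Arguments: namespace, StatefulSet name (assumed default: redpanda).
check_node_capacity() {
  cap_ns=$1
  cap_sts=${2:-redpanda}
  # One output line per node; strip padding that some wc versions add.
  cap_nodes=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  cap_brokers=$(kubectl get statefulset "$cap_sts" --namespace "$cap_ns" \
    --output jsonpath='{.spec.replicas}')
  echo "nodes=$cap_nodes brokers=$cap_brokers"
  # Succeeds only when every broker can get its own node.
  [ "$cap_nodes" -ge "$cap_brokers" ]
}
```

For example, running check_node_capacity with your namespace prints both counts and returns a non-zero status when there are fewer nodes than brokers.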

To resolve this issue, do one of the following:

Create additional worker nodes.

Modify the anti-affinity rules (for development purposes only).

If adding nodes is not an option, you can modify the podAntiAffinity rules in your StatefulSet to be less strict.

Operator:

redpanda-cluster.yaml

apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    statefulset:
      podAntiAffinity:
        type: soft

kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

Helm, using the --values flag:

docker-repo.yaml

statefulset:
  podAntiAffinity:
    type: soft

helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values docker-repo.yaml

Helm, using the --set flag:

helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --set statefulset.podAntiAffinity.type=soft
Unable to mount volume

If you see volume mounting errors in the Pod events or in the Redpanda logs, ensure that each of your Pods has a volume available in which to store data.

If you’re using StorageClasses with dynamic provisioners (default), ensure they exist:
```
kubectl get storageclass
```
If you’re using PersistentVolumes, ensure that you have one PersistentVolume available for each Redpanda broker, and that each one has the storage capacity that’s set in storage.persistentVolume.size:
```
kubectl get persistentvolume
```

To learn how to configure different storage volumes, see Configure Storage.

Failed to pull image

When deploying the Redpanda Helm chart, you may encounter Docker rate limit issues because the default registry URL is not recognized as a Docker Hub URL. The domain docker.redpanda.com is used for statistical purposes, such as tracking the number of downloads. It mirrors Docker Hub’s content while providing specific analytics for Redpanda.

Failed to pull image "docker.redpanda.com/redpandadata/redpanda:v<version>": rpc error: code = Unknown desc = failed to pull and unpack image "docker.redpanda.com/redpandadata/redpanda:v<version>": failed to copy: httpReadSeeker: failed open: unexpected status code 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

To fix this error, do one of the following:

Replace the image.repository value in the Helm chart with docker.io/redpandadata/redpanda. Switching to Docker Hub avoids the rate limit issues associated with docker.redpanda.com.

Operator:

redpanda-cluster.yaml

apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    image:
      repository: docker.io/redpandadata/redpanda

kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

Helm, using the --values flag:

docker-repo.yaml

image:
  repository: docker.io/redpandadata/redpanda

helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values docker-repo.yaml

Helm, using the --set flag:

helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --set image.repository=docker.io/redpandadata/redpanda

Authenticate to Docker Hub by logging in with your Docker Hub credentials. The docker.redpanda.com site acts as a reflector for Docker Hub. As a result, when you log in with your Docker Hub credentials, you will bypass the rate limit issues.

Dig not defined

This error means that you are using an unsupported version of Helm:

Error: parse error at (redpanda/templates/statefulset.yaml:203): function "dig" not defined

To fix this error, ensure that you are using at least the minimum required version of Helm, 3.10.0. To check your installed version, run:

helm version
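
To automate this check, for example in a deployment script, you could compare the installed version against the two constraints mentioned on this page: at least 3.10.0, and not 3.18.0. The following is a sketch under those assumptions. The function names version_ge and check_helm_version are hypothetical, and the parsing assumes that helm version --short prints a string such as v3.14.2+gc309b6f.

```shell
# Succeeds if dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    for (i = 1; i <= 3; i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Checks the installed Helm client against the constraints described above.
check_helm_version() {
  helm_ver=$(helm version --short) || return 1
  helm_ver=${helm_ver#v}      # drop the leading "v"
  helm_ver=${helm_ver%%+*}    # drop the "+g<commit>" build suffix
  if [ "$helm_ver" = "3.18.0" ]; then
    echo "Helm $helm_ver has the json.Number bug; upgrade to a later version"
    return 1
  fi
  if ! version_ge "$helm_ver" "3.10.0"; then
    echo "Helm $helm_ver is older than the required 3.10.0"
    return 1
  fi
  echo "Helm $helm_ver is supported"
}
```

The comparison is numeric per field, so 3.9.4 is correctly treated as older than 3.10.0, which a plain string comparison would get wrong.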

Repository name already exists

If you see this error, remove the redpanda chart repository, then try installing it again.

helm repo remove redpanda
helm repo add redpanda https://charts.redpanda.com
helm repo update

Fatal error during checker "Data directory is writable" execution

This error appears when Redpanda does not have write access to your configured storage volume under storage in the Helm chart.

Error: fatal error during checker "Data directory is writable" execution: open /var/lib/redpanda/data/test_file: permission denied

To fix this error, set statefulset.initContainers.setDataDirOwnership.enabled to true so that the initContainer can set the correct permissions on the data directories.

Cannot patch "redpanda" with kind StatefulSet

This error appears when you run helm upgrade with the --values flag but do not include all your previous overrides.

Error: UPGRADE FAILED: cannot patch "redpanda" with kind StatefulSet: StatefulSet.apps "redpanda" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden

To fix this error, include all the value overrides from the previous installation using either the --set or the --values flags.

Do not use the --reuse-values flag to upgrade from one version of the Helm chart to another. This flag stops Helm from using any new values in the upgraded chart.
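
If you no longer have the values file from the previous installation, one option is to export the overrides that Helm stored for the release and pass them back explicitly. The following sketch assumes the chart reference redpanda/redpanda used elsewhere on this page; the function name upgrade_with_previous_values is hypothetical.

```shell
# Upgrade a release while passing its stored overrides back explicitly.
# Arguments: release name, namespace, then any extra helm upgrade flags.
upgrade_with_previous_values() {
  up_release=$1
  up_ns=$2
  shift 2
  up_file=$(mktemp) || return 1
  # helm get values prints only user-supplied overrides, not chart defaults.
  helm get values "$up_release" --namespace "$up_ns" --output yaml > "$up_file" || return 1
  # Flags given later on the command line take precedence over the exported file.
  helm upgrade "$up_release" redpanda/redpanda --namespace "$up_ns" \
    --values "$up_file" "$@"
  up_status=$?
  rm -f "$up_file"
  return "$up_status"
}
```

Unlike --reuse-values, this approach still lets the upgraded chart apply its new default values, because only your own overrides are passed in.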

Cannot patch "redpanda-console" with kind Deployment

This error appears if you try to upgrade your deployment and you already have console.enabled set to true.

Error: UPGRADE FAILED: cannot patch "redpanda-console" with kind Deployment: Deployment.apps "redpanda-console" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"redpanda", "app.kubernetes.io/name":"console"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

To fix this error, set console.enabled to false so that Helm doesn’t try to deploy Redpanda Console again.

Helm is in a pending-rollback state

An interrupted Helm upgrade process can leave your Helm release in a pending-rollback state. This state prevents further actions like upgrades, rollbacks, or deletions through standard Helm commands. To fix this:

Identify the Helm release that’s in a pending-rollback state:
```
helm list --namespace <namespace> --all
```
Look for releases with a status of pending-rollback. These are the ones that need intervention.
Verify the Secret’s status to avoid affecting the wrong resource:
```
kubectl --namespace <namespace> get secret --show-labels
```
Identify the Secret associated with your Helm release by its pending-rollback status in the labels.

Ensure you have correctly identified the Secret to avoid unintended consequences. Deleting the wrong Secret could impact other deployments or services.

Delete the Secret to clear the pending-rollback state:

kubectl --namespace <namespace> delete secret -l status=pending-rollback
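
The selector above matches every Helm release in the namespace that is in a pending-rollback state. To reduce the risk of deleting the wrong Secret, you could narrow the selector to a single release. The following sketch assumes that Helm labels its release Secrets with owner=helm and name=<release-name> in addition to status; verify the labels with --show-labels before deleting anything. The function names are hypothetical.

```shell
# Show the pending-rollback Secrets that belong to one Helm release.
# Arguments: release name, namespace.
list_pending_rollback_secrets() {
  kubectl --namespace "$2" get secret \
    --selector "owner=helm,name=$1,status=pending-rollback" --show-labels
}

# Delete only those Secrets. Run the list function first and review its output.
delete_pending_rollback_secrets() {
  kubectl --namespace "$2" delete secret \
    --selector "owner=helm,name=$1,status=pending-rollback"
}
```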

After clearing the pending-rollback state:

Retry the upgrade: Restart the upgrade process. You should investigate the initial failure to avoid getting into the pending-rollback state again.
Perform a rollback: If you need to roll back to a previous release, use helm rollback <release-name> <revision> to revert to a specific, stable release version.

Deployment issues

This section addresses common deployment issues encountered during Redpanda setup or upgrades.

Crash loop backoffs

If a broker crashes after startup, or gets stuck in a crash loop, it can accumulate an increasing amount of stored state. This accumulated state not only consumes additional disk space but also prolongs the time required for each subsequent restart to process it.

To prevent infinite crash loops, the Redpanda Helm chart sets the crash_loop_limit broker configuration property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. By default, the broker terminates immediately after hitting the crash_loop_limit. The Pod running Redpanda remains in a CrashLoopBackOff state until its internal consecutive crash counter is reset to zero.

To facilitate debugging in environments where a broker is stuck in a crash loop, you can also set the crash_loop_sleep_sec broker configuration property. This setting determines how long the broker sleeps before terminating the process after reaching the crash loop limit. This provides a window during which the Pod remains available, so that you can open a shell in it, for example with kubectl exec, and troubleshoot the issue.

Example configuration:

config:
  node:
    crash_loop_limit: 5
    crash_loop_sleep_sec: 60

In this example, when the broker hits the crash_loop_limit of 5, it will sleep for 60 seconds before terminating the process. This delay allows administrators to access the Pod and troubleshoot.

To troubleshoot a crash loop backoff:

Check the Redpanda logs from the most recent crashes:

kubectl logs <pod-name> --namespace <namespace>

Kubernetes retains logs only for the current and the previous instance of a container. This limitation makes it difficult to access logs from earlier crashes, which may contain vital clues about the root cause of the issue. Given these log retention limitations, setting up a centralized logging system is crucial. Systems such as Loki or Datadog can capture and store logs from all containers, ensuring you have access to historical data.

Resolve the issue that led to the crash loop backoff.

Reset the crash counter to zero to allow Redpanda to restart. You can do any of the following to reset the counter:

Make changes to any of the following sections in the Redpanda Helm chart to trigger an update:
- config.node
- config.tunable
For example:
```
config:
  node:
    crash_loop_limit: <new-integer>
```

Delete the startup_log file in the broker’s data directory.

kubectl exec <pod-name> --namespace <namespace> -- rm /var/lib/redpanda/data/startup_log

It might be challenging to execute this command within a Pod that is in a CrashLoopBackOff state because the Pod is available only for a short time before it restarts. Wrapping the command in a loop might work.

Wait one hour since the last crash. The crash counter resets after one hour.
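
Two of the steps above can be scripted: saving the logs of both the current and the previous container instance, and retrying the startup_log deletion until the Pod is up long enough for the command to succeed. The following is a minimal sketch, not an official procedure. The function names are hypothetical, the container is assumed to be named redpanda as in the other commands on this page, and the default retry count and delay are arbitrary.

```shell
# Save logs from the current and the previous container instance to files.
# Arguments: Pod name, namespace.
save_crash_logs() {
  kubectl logs "$1" --namespace "$2" --container redpanda > "$1-current.log"
  # --previous returns the logs of the last terminated instance, if one exists.
  kubectl logs "$1" --namespace "$2" --container redpanda --previous > "$1-previous.log"
}

# Retry deleting startup_log until it succeeds or the attempts run out.
# Arguments: Pod name, namespace, max attempts (default 30), delay in seconds (default 2).
remove_startup_log() {
  rs_attempt=1
  rs_max=${3:-30}
  while [ "$rs_attempt" -le "$rs_max" ]; do
    if kubectl exec "$1" --namespace "$2" --container redpanda -- \
        rm /var/lib/redpanda/data/startup_log; then
      echo "startup_log removed on attempt $rs_attempt"
      return 0
    fi
    rs_attempt=$((rs_attempt + 1))
    sleep "${4:-2}"
  done
  echo "gave up after $rs_max attempts" >&2
  return 1
}
```

Capture the logs before you reset the counter, because Kubernetes discards the older instance's logs on the next restart.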

To avoid future crash loop backoffs and manage the accumulation of small segments effectively:

Monitor the size and number of segments regularly.
Optimize your Redpanda configuration for segment management.
Consider implementing Tiered Storage to manage data more efficiently.

A Redpanda Enterprise Edition license is required

During a Redpanda upgrade, if enterprise features are enabled and a valid Enterprise Edition license is missing, Redpanda logs a warning and aborts the upgrade process on the first broker. This issue prevents a successful upgrade.

A Redpanda Enterprise Edition license is required to use the currently enabled features. To apply your license, downgrade this broker to the pre-upgrade version and provide a valid license key via rpk using 'rpk cluster license set <key>', or via Redpanda Console. To request an enterprise license, please visit <redpanda.com/upgrade>. To try Redpanda Enterprise for 30 days, visit <redpanda.com/try-enterprise>. For more information, see <https://docs.redpanda.com/current/get-started/licenses>.

If you encounter this message, follow these steps to recover:

Roll back the affected broker to the original version.
Do one of the following:
- Apply a valid Redpanda Enterprise Edition license to the cluster.
- Disable enterprise features.
  
  If you do not have a valid license and want to proceed without using enterprise features, you can disable the enterprise features in your Redpanda configuration.
Retry the upgrade.

Networking issues

This section provides insights into diagnosing network-related errors, such as connection timeouts, DNS misconfigurations, and network stability.

I/O timeout

This error appears when your worker nodes are unreachable through the given address.

Check the following:

The address and port are correct.
Your DNS records point to addresses that resolve to your worker nodes.

TLS issues

This section covers common TLS errors, their causes, and solutions, including certificate issues and correct client configuration.

Redpanda not applying TLS changes

Enabling or disabling TLS for the RPC listener requires you to delete all Pods that run Redpanda. When you change the rpc.tls.enabled setting, or if it is not overridden and you change the global tls.enabled option, Redpanda cannot safely apply the change because RPC listener configurations must be the same across all brokers. To apply the change, all Redpanda Pods must be deleted simultaneously so that they all start with the updated RPC listener. This action results in temporary downtime of the cluster.

Although you can use the --force option to speed up the rollout, it may result in data loss as Redpanda will not be given time to shut down gracefully.

kubectl delete pod -l app=redpanda --namespace <namespace>

Invalid large response size

This error appears when your cluster is configured to use TLS, but you don’t specify that you are connecting over TLS.

unable to request metadata: invalid large response size 352518912 > limit 104857600; the first three bytes received appear to be a tls alert record for TLS v1.2; is this a plaintext connection speaking to a tls endpoint?

If you’re using rpk, make sure to add the -X tls.enabled flag, along with any other necessary TLS flags such as the TLS certificate:

kubectl exec <pod-name> -c redpanda --namespace <namespace> -- \
rpk cluster info -X tls.enabled=true

For all available flags, see the rpk options reference.

Malformed HTTP response

This error appears when a cluster has TLS enabled, and you try to access the admin API without passing the required TLS parameters.

Retrying POST for error: Post "http://127.0.0.1:9644/v1/security/users": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x03\x00\x02\x02"

If you’re using rpk, make sure to include the TLS flags.

For all available flags, see the rpk options reference.
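
As an illustration, the following sketch wraps an rpk command so that it always passes TLS options for the Admin API. It assumes that your rpk version supports the admin.tls.enabled and admin.tls.ca options for -X; confirm this against the rpk options reference. The function name rpk_admin_tls is hypothetical, and the CA path is whatever location the CA certificate has inside your container.

```shell
# Run an rpk command inside a Pod with TLS enabled for the Admin API.
# Arguments: Pod name, namespace, CA certificate path inside the container,
# followed by the rpk subcommand and its arguments.
rpk_admin_tls() {
  at_pod=$1 at_ns=$2 at_ca=$3
  shift 3
  kubectl exec "$at_pod" --container redpanda --namespace "$at_ns" -- \
    rpk "$@" -X admin.tls.enabled=true -X admin.tls.ca="$at_ca"
}
```

For example, passing cluster health as the trailing arguments runs that subcommand with the TLS options appended.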

x509: certificate signed by unknown authority

This error appears when the Certificate Authority (CA) that signed your certificates is not trusted by your system.

Check the following:

Ensure you have installed the root CA certificate correctly on your local system.
If using a self-signed certificate, ensure it is properly configured and included in your system’s trust store.
If you are using a certificate issued by a CA, ensure the issuing CA is included in your system’s trust store.
If you are using cert-manager, ensure it is correctly configured and running properly.
Check the validity of your certificates. They might have expired.
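
To check the validity period of a certificate that is stored in a Kubernetes Secret, you could decode it and inspect it with openssl. This sketch assumes a TLS Secret that stores the certificate under the tls.crt key, as cert-manager does. The Secret name is yours to supply, and the function name check_cert_validity is hypothetical.

```shell
# Print the validity window of the certificate in a TLS Secret and
# fail if the certificate has already expired.
# Arguments: Secret name, namespace.
check_cert_validity() {
  cv_pem=$(kubectl get secret "$1" --namespace "$2" \
    --output jsonpath='{.data.tls\.crt}' | base64 --decode) || return 1
  # Prints the notBefore and notAfter dates.
  printf '%s\n' "$cv_pem" | openssl x509 -noout -dates
  # Returns non-zero if the certificate expires within 0 seconds, that is, has expired.
  printf '%s\n' "$cv_pem" | openssl x509 -noout -checkend 0
}
```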

x509: certificate is not valid for any names

This error indicates that the certificate you are using is not valid for the specific domain or IP address you are trying to use it with. This error typically occurs when there is a mismatch between the certificate’s Subject Alternative Name (SAN) or Common Name (CN) field and the name being used to access the broker.

To fix this error, you may need to obtain a new certificate that is valid for the specific domain or IP address you are using. Ensure that the certificate’s SAN or CN entry matches the name being used, and that the certificate is not expired or revoked.
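
Before requesting a new certificate, it can help to see which names the broker's current certificate actually covers. The following sketch uses openssl to fetch the certificate that the endpoint presents and prints its SAN entries. The function name show_presented_sans is hypothetical, and the host and port are the ones your client uses.

```shell
# Print the Subject Alternative Name entries of the certificate that
# an endpoint presents. Arguments: host name used by the client, port.
show_presented_sans() {
  # s_client prints the server certificate; closing stdin ends the session.
  openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A 1 'Subject Alternative Name'
}
```

If the name or IP address that your client connects to is missing from the output, the certificate does not cover it.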

cannot validate certificate for 127.0.0.1

This error appears if you are using a CA certificate when you try to establish an internal connection using localhost. For example:

unable to request metadata: unable to dial: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

To fix this error, you must either specify the URL with a public domain or use self-signed certificates:

kubectl exec redpanda-0 -c redpanda --namespace <namespace> -- \
rpk cluster info \
-X brokers=<redpanda-url>:<port> \
-X tls.enabled=true

SASL issues

This section addresses errors related to SASL (Simple Authentication and Security Layer), focusing on connection and authentication problems.

Unable to continue with update: Secret

When you use a YAML list to specify superusers, the Helm chart creates a Secret using the value of auth.sasl.secretRef as the Secret’s name, and stores those superusers in the Secret. If the Secret already exists in the namespace when you deploy Redpanda, the following error is displayed:

Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: Secret

To fix this error, ensure that you use only one of the following methods to create superusers:

auth.sasl.secretRef
auth.sasl.users

Is SASL missing?

This error appears when you try to interact with a cluster that has SASL enabled without passing a user’s credentials.

unable to request metadata: broker closed the connection immediately after a request was issued, which happens when SASL is required but not provided: is SASL missing?

If you’re using rpk, make sure to specify the -X user, -X pass, and -X sasl.mechanism flags.
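
The following sketch shows one way to pass these three flags when running rpk inside a Pod. The function name rpk_with_sasl is hypothetical, the container is assumed to be named redpanda, and the mechanism must match what your cluster is configured to use, for example SCRAM-SHA-256.

```shell
# Run an rpk command inside a Pod with SASL credentials.
# Arguments: Pod name, namespace, user, password, SASL mechanism,
# followed by the rpk subcommand and its arguments.
rpk_with_sasl() {
  sa_pod=$1 sa_ns=$2 sa_user=$3 sa_pass=$4 sa_mech=$5
  shift 5
  kubectl exec "$sa_pod" --container redpanda --namespace "$sa_ns" -- \
    rpk "$@" -X user="$sa_user" -X pass="$sa_pass" -X sasl.mechanism="$sa_mech"
}
```

If the cluster also has TLS enabled, add -X tls.enabled=true to the trailing arguments.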

For all available flags, see the rpk options reference.

pattern_type is unspecified

When creating a shadow link with rpk shadow create, you may see:

Invalid cluster link configuration: pattern_type is unspecified

Ensure pattern_type values are uppercase: LITERAL, PREFIX.

broker_not_available with TLS enabled

When creating a shadow link with TLS enabled, you may see:

Cluster link unreachable, preflight check failed - { node: -1 }, { error_code: broker_not_available [8] }

The shadow cluster cannot verify the source cluster’s TLS certificate. This is the most common issue when using TLS with self-signed certificates (the default for Kubernetes deployments with tls.certs.default.caEnabled=true).

Ensure that the shadow link configuration includes the source cluster’s CA certificate.

Wrong SSL version number

When creating a shadow link, you may see in the source cluster logs:

Disconnected (applying protocol, Wrong SSL Version number: ensure client is configured to use TLS)

The source cluster requires TLS but your shadow link configuration is missing TLS settings or has tls_settings.enabled: false.

broker_not_available without TLS

When creating a shadow link without TLS, you may see:

Cluster link unreachable, preflight check failed - { node: -1 }, { error_code: broker_not_available [8] }

Verify that bootstrap_servers addresses are reachable from the shadow cluster and that ports are correct.

Test connectivity from the shadow pod:

kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  curl -v telnet://<source-address>:<port>

Connection timeout

When creating a shadow link, the command may hang or time out without completing.

Check network connectivity between shadow and source clusters. Verify firewall rules and network policies allow traffic between the namespaces.

Topics in FAULTED state

When monitoring shadow links, you may see topics showing FAULTED state in status output.

Check shadow cluster logs for specific error messages:

kubectl logs --namespace <shadow-namespace> <shadow-pod-name> --container redpanda | grep -i "shadow\|error"

Common causes include:

Source topic deleted: topic no longer exists on source cluster
Permission denied: shadow link service account lacks required permissions
Network interruption: temporary connectivity issues

If the source topic still exists and should be replicated, delete and recreate the shadow link to reset the faulted state.

High replication lag

When monitoring shadow links, you may see LAG values continuously increasing in rpk shadow status.

Check the following:

Check source cluster load: high produce rate may exceed replication capacity
Check shadow cluster resources: CPU, memory, or disk constraints
Check network bandwidth: verify sufficient bandwidth between clusters

To resolve:

Scale shadow cluster resources if constrained
Verify network connectivity and bandwidth
Review topic configuration for optimization opportunities

Task shows LINK_UNAVAILABLE

When monitoring shadow links, you may see tasks showing LINK_UNAVAILABLE state with "No brokers available" message.

Common causes include:

Source cluster requires SASL authentication but shadow link not configured for it
Source cluster unreachable from shadow cluster
Network policy blocking traffic between clusters

To resolve:

Verify SASL configuration if source cluster requires authentication
Test network connectivity: kubectl exec into shadow pod and try connecting to source cluster
Check Kubernetes NetworkPolicies and firewall rules

ShadowLink resource stuck

When using the Operator, the ShadowLink resource may not delete or show errors.

Check the Redpanda Operator logs:

kubectl logs --namespace <operator-namespace> -l app.kubernetes.io/name=operator --tail=100

Check the Operator logs for specific errors preventing cleanup. Contact Redpanda support if the resource remains stuck after addressing any logged errors.

Application connection failures after failover

Applications may not be able to connect to the shadow cluster after failover.

Verify shadow cluster Kubernetes Service endpoints:

kubectl get service --namespace <shadow-namespace>

Check NetworkPolicy if using network policies:

kubectl get networkpolicy --namespace <shadow-namespace>

Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.

Consumer group offset issues after failover

After failover, consumers may start from the beginning or wrong positions.

Verify consumer group offsets were replicated (check your shadow link filters):

kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk group describe <consumer-group-name>

If necessary, manually reset offsets to appropriate positions. See How to manage consumer group offsets in Redpanda for detailed reset procedures.
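
As a rough illustration of a manual reset, the following sketch wraps rpk group seek, which moves a group's committed offsets to the start, the end, or a timestamp. It assumes that your rpk version provides this command with a --to flag and that the consumer group has no active members while you run it; check the linked procedure before relying on it. The function name reset_group_offsets is hypothetical.

```shell
# Move a consumer group's committed offsets on the shadow cluster.
# Arguments: namespace, Pod name, consumer group, target (start, end, or a timestamp).
reset_group_offsets() {
  kubectl exec --namespace "$1" "$2" --container redpanda -- \
    rpk group seek "$3" --to "$4"
}
```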
