Docs Self-Managed Manage Kubernetes Troubleshooting Troubleshoot Redpanda in Kubernetes You are viewing the Self-Managed v24.3 beta documentation. We welcome your feedback at the Redpanda Community Slack #beta-feedback channel. To view the latest available version of the docs, see v24.2. Troubleshoot Redpanda in Kubernetes This topic provides guidance on how to diagnose and troubleshoot issues with Redpanda deployments in Kubernetes. Prerequisites Before troubleshooting Redpanda, ensure that Kubernetes isn’t the cause of the issue. For information about debugging applications in a Kubernetes cluster, see the Kubernetes documentation. Collect all debugging data If you’re unsure of what is wrong, you can generate a diagnostics bundle that contains a wide range of data to help debug and diagnose a Redpanda cluster or the nodes on which the brokers are running. See Diagnostics Bundles in Kubernetes. View Helm chart configuration To check the overrides that were applied to your deployment: helm get values <chart-name> --namespace <namespace> If you’re using the Redpanda Operator, the chart name matches the name of your Redpanda resource. To check all the values that were set in the Redpanda Helm chart, including any overrides: helm get values <chart-name> --namespace <namespace> --all View recent events To understand the latest events that occurred in your Redpanda cluster’s namespace, you can sort events by their creation timestamp: kubectl get events --namespace <namespace> --sort-by='.metadata.creationTimestamp' View Redpanda logs Logs are crucial for monitoring and troubleshooting your Redpanda clusters. Redpanda brokers output logs to STDOUT, making them accessible via kubectl. To access logs for a specific Pod: List all Pods to find the names of those that are running Redpanda brokers: kubectl get pods --namespace <namespace> View logs for a particular Pod by replacing <pod-name> with the name of your Pod: kubectl logs <pod-name> --namespace <namespace> For a comprehensive overview, you can view aggregated logs from all Pods in the StatefulSet: kubectl logs --namespace <namespace> -l app.kubernetes.io/component=redpanda-statefulset Change the default log level To change the default log level for all Redpanda subsystems, use the logging.logLevel configuration. Valid values are trace, debug, info, warn, error. Changing the default log level to debug can provide more detailed logs for diagnostics. This logging level increases the volume of generated logs. To set different log levels for individual subsystems, see Override the default log level for Redpanda subsystems. Helm + Operator Helm Apply the new log level: redpanda-cluster.yaml apiVersion: cluster.redpanda.com/v1alpha2 kind: Redpanda metadata: name: redpanda spec: chartRef: {} clusterSpec: logging: logLevel: debug Then, apply this configuration: kubectl apply -f redpanda-cluster.yaml --namespace <namespace> Choose between using a custom values file or setting values directly: --values --set Specify logging settings in logging.yaml, then upgrade: logging.yaml logging: logLevel: debug helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --values logging.yaml --reuse-values Directly set the log level during upgrade: helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --set logging.logLevel=debug After applying these changes, verify the log level by checking the initial output of the logs for the Redpanda Pods. Override the default log level for Redpanda subsystems You can override the log levels for individual subsystems, such as rpc and kafka, for more detailed logging control. Overrides exist for the entire length of the running Redpanda process. To temporarily override the log level for individual subsystems, you can use the rpk-redpanda-admin-config-log-level-set command. List all available subsystem loggers: kubectl exec -it --namespace <namespace> <pod-name> -c redpanda -- rpk redpanda start --help-loggers Set the log level for one or more subsystems. In this example, the rpc and kafka subsystem loggers are set to debug. Helm + Operator Helm Apply the new log level: redpanda-cluster.yaml apiVersion: cluster.redpanda.com/v1alpha2 kind: Redpanda metadata: name: redpanda spec: chartRef: {} clusterSpec: statefulset: additionalRedpandaCmdFlags: - '--logger-log-level=rpc=debug:kafka=debug' Then, apply this configuration: kubectl apply -f redpanda-cluster.yaml --namespace <namespace> Choose between using a custom values file or setting values directly: --values --set Specify logging settings in logging.yaml, then upgrade: logging.yaml statefulset: additionalRedpandaCmdFlags: - '--logger-log-level=rpc=debug:kafka=debug' helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --values logging.yaml --reuse-values Directly set the log level during upgrade: helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --set statefulset.additionalRedpandaCmdFlags="{--logger-log-level=rpc=debug:kafka=debug}" Overriding the log levels for specific subsystems provides enhanced visibility into Redpanda’s internal operations, facilitating better debugging and monitoring. View Redpanda Operator logs To learn what’s happening with the Redpanda Operator and the associated Redpanda resources, you can inspect a combination of Kubernetes events and the resource manifests. By monitoring these events and resources, you can troubleshoot any issues that arise during the lifecycle of a Redpanda deployment. kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace> Inspect Helm releases The Redpanda Operator uses Flux to deploy the Redpanda Helm chart. By inspecting the helmreleases.helm.toolkit.fluxcd.io resource, you can get detailed information about the Helm installation process for your Redpanda resource: kubectl get helmreleases.helm.toolkit.fluxcd.io -o yaml <redpanda-resource-name> --namespace <namespace> To check the Redpanda resource: kubectl get redpandas.cluster.redpanda.com -o yaml --namespace <namespace> In both the HelmRelease and the Redpanda resource, the condition section reveals the ongoing status of the Helm installation. These conditions provide information on the success, failure, or pending status of various operations. Troubleshoot known issues This section describes issues you might encounter while deploying Redpanda in Kubernetes and explains how to troubleshoot them. HelmRelease is not ready If you are using the Redpanda Operator, you may see the following message while waiting for a Redpanda custom resource to be deployed: NAME READY STATUS redpanda False HelmRepository 'redpanda/redpanda-repository' is not ready redpanda False HelmRelease 'redpanda/redpanda' is not ready While the deployment process can sometimes take a few minutes, a prolonged 'not ready' status may indicate an issue. Follow the steps below to investigate: Check the status of the HelmRelease: kubectl describe helmrelease <redpanda-resource-name> --namespace <namespace> Review the Redpanda Operator logs: kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace> HelmRelease retries exhausted The HelmRelease retries exhausted error occurs when the Helm Controller has tried to reconcile the HelmRelease a number of times, but these attempts have failed consistently. The Helm Controller watches for changes in HelmRelease objects. When changes are detected, it tries to reconcile the state defined in the HelmRelease with the state in the cluster. The process of reconciliation includes installation, upgrade, testing, rollback or uninstallation of Helm releases. You may see this error due to: Incorrect configuration in the HelmRelease. Issues with the chart, such as a non-existent chart version or the chart repository not being accessible. Missing dependencies or prerequisites required by the chart. Issues with the underlying Kubernetes cluster, such as insufficient resources or connectivity issues. To debug this error do the following: Check the status of the HelmRelease: kubectl describe helmrelease <cluster-name> --namespace <namespace> Review the Redpanda Operator logs: kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace> When you find and fix the error, you must use the Flux CLI, fluxctl, to suspend and resume the reconciliation process: Install Flux CLI. Suspend the HelmRelease: flux suspend helmrelease <cluster-name> --namespace <namespace> Resume the HelmRelease: flux resume helmrelease <cluster-name> --namespace <namespace> Crash loop backoffs If a broker crashes after startup, or gets stuck in a crash loop, it could produce progressively more stored state that uses additional disk space and takes more time for each restart to process. To prevent infinite crash loops, the Redpanda Helm chart sets the crash_loop_limit node property to 5. The crash loop limit is the number of consecutive crashes that can happen within one hour of each other. After Redpanda reaches this limit, it will not start until its internal consecutive crash counter is reset to zero. In Kubernetes, the Pod running Redpanda remains in a CrashLoopBackoff state until its internal consecutive crash counter is reset to zero. To troubleshoot a crash loop backoff: Check the Redpanda logs from the most recent crashes: kubectl logs <pod-name> --namespace <namespace> Kubernetes retains logs only for the current and the previous instance of a container. This limitation makes it difficult to access logs from earlier crashes, which may contain vital clues about the root cause of the issue. Given these log retention limitations, setting up a centralized logging system is crucial. Systems such as Loki or Datadog can capture and store logs from all containers, ensuring you have access to historical data. Resolve the issue that led to the crash loop backoff. Reset the crash counter to zero to allow Redpanda to restart. You can do any of the following to reset the counter: Make changes to any of the following sections in the Redpanda Helm chart to trigger an update: config.node config.tunable For example: config: node: crash_loop_limit: <new-integer> Delete the startup_log file in the broker’s data directory. kubectl exec <pod-name> --namespace <namespace> -- rm /var/lib/redpanda/data/startup_log It might be challenging to execute this command within a Pod that is in a CrashLoopBackoff state due to the limited time during which the Pod is available before it restarts. Wrapping the command in a loop might work. Wait one hour since the last crash. The crash counter resets after one hour. To avoid future crash loop backoffs and manage the accumulation of small segments effectively: Monitor the size and number of segments regularly. Optimize your Redpanda configuration for segment management. Consider implementing Tiered Storage to manage data more efficiently. StatefulSet never rolls out If the StatefulSet Pods remain in a pending state, they are waiting for resources to become available. To identify the Pods that are pending, use the following command: kubectl get pod --namespace <namespace> The response includes a list of Pods in the StatefulSet and their status. To view logs for a specific Pod, use the following command. kubectl logs -f <pod-name> --namespace <namespace> You can use the output to debug your deployment. Didn’t match pod anti-affinity rules If you see this error, your cluster does not have enough nodes to satisfy the anti-affinity rules: Warning FailedScheduling 18m default-scheduler 0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod. The Helm chart configures default podAntiAffinity rules to make sure that only one Pod running a Redpanda broker is scheduled on each worker node. To learn why, see Number of workers. To resolve this issue, do one of the following: Create additional worker nodes. Modify the anti-affinity rules (for development purposes only). If adding nodes is not an option, you can modify the podAntiAffinity rules in your StatefulSet to be less strict. Helm + Operator Helm redpanda-cluster.yaml apiVersion: cluster.redpanda.com/v1alpha2 kind: Redpanda metadata: name: redpanda spec: chartRef: {} clusterSpec: statefulset: podAntiAffinity: type: soft kubectl apply -f redpanda-cluster.yaml --namespace <namespace> --values --set docker-repo.yaml statefulset: podAntiAffinity: type: soft helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --values docker-repo.yaml --reuse-values helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --set statefulset.podAntiAffinity.type=soft Unable to mount volume If you see volume mounting errors in the Pod events or in the Redpanda logs, ensure that each of your Pods has a volume available in which to store data. If you’re using StorageClasses with dynamic provisioners (default), ensure they exist: kubectl get storageclass If you’re using PersistentVolumes, ensure that you have one PersistentVolume available for each Redpanda broker, and that each one has the storage capacity that’s set in storage.persistentVolume.size: kubectl get persistentvolume --namespace <namespace> To learn how to configure different storage volumes, see Configure Storage. Failed to pull image When deploying the Redpanda Helm chart, you may encounter Docker rate limit issues because the default registry URL is not recognized as a Docker Hub URL. The domain docker.redpanda.com is used for statistical purposes, such as tracking the number of downloads. It mirrors Docker Hub’s content while providing specific analytics for Redpanda. Failed to pull image "docker.redpanda.com/redpandadata/redpanda:v<version>": rpc error: code = Unknown desc = failed to pull and unpack image "docker.redpanda.com/redpandadata/redpanda:v<version>": failed to copy: httpReadSeeker: failed open: unexpected status code 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit To fix this error, do one of the following: Replace the image.repository value in the Helm chart with docker.io/redpandadata/redpanda. Switching to Docker Hub avoids the rate limit issues associated with docker.redpanda.com. Helm + Operator Helm redpanda-cluster.yaml apiVersion: cluster.redpanda.com/v1alpha2 kind: Redpanda metadata: name: redpanda spec: chartRef: {} clusterSpec: image: repository: docker.io/redpandadata/redpanda kubectl apply -f redpanda-cluster.yaml --namespace <namespace> --values --set docker-repo.yaml image: repository: docker.io/redpandadata/redpanda helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --values docker-repo.yaml --reuse-values helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \ --set image.repository=docker.io/redpandadata/redpanda Authenticate to Docker Hub by logging in with your Docker Hub credentials. The docker.redpanda.com site acts as a reflector for Docker Hub. As a result, when you log in with your Docker Hub credentials, you will bypass the rate limit issues. Dig not defined This error means that you are using an unsupported version of Helm: Error: parse error at (redpanda/templates/statefulset.yaml:203): function "dig" not defined To fix this error, ensure that you are using the minimum required version: 3.10.0. helm version Repository name already exists If you see this error, remove the redpanda chart repository, then try installing it again. helm repo remove redpanda helm repo add redpanda https://charts.redpanda.com helm repo update Fatal error during checker "Data directory is writable" execution This error appears when Redpanda does not have write access to your configured storage volume under storage in the Helm chart. Error: fatal error during checker "Data directory is writable" execution: open /var/lib/redpanda/data/test_file: permission denied To fix this error, set statefulset.initContainers.setDataDirOwnership.enabled to true so that the initContainer can set the correct permissions on the data directories. Cannot patch "redpanda" with kind StatefulSet This error appears when you run helm upgrade with the --values flag but do not include all your previous overrides. Error: UPGRADE FAILED: cannot patch "redpanda" with kind StatefulSet: StatefulSet.apps "redpanda" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden To fix this error, do one of the following: Include all the value overrides from the previous installation or upgrade using either the --set or the --values flags. Use the --reuse-values flag. Do not use the --reuse-values flag to upgrade from one version of the Helm chart to another. This flag stops Helm from using any new values in the upgraded chart. Cannot patch "redpanda-console" with kind Deployment This error appears if you try to upgrade your deployment and you already have console.enabled set to true. Error: UPGRADE FAILED: cannot patch "redpanda-console" with kind Deployment: Deployment.apps "redpanda-console" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"redpanda", "app.kubernetes.io/name":"console"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable To fix this error, set console.enabled to false so that Helm doesn’t try to deploy Redpanda Console again. Helm is in a pending-rollback state An interrupted Helm upgrade process can leave your Helm release in a pending-rollback state. This state prevents further actions like upgrades, rollbacks, or deletions through standard Helm commands. To fix this: Identify the Helm release that’s in a pending-rollback state: helm list --namespace <namespace> --all Look for releases with a status of pending-rollback. These are the ones that need intervention. Verify the Secret’s status to avoid affecting the wrong resource: kubectl --namespace <namespace> get secret --show-labels Identify the Secret associated with your Helm release by its pending-rollback status in the labels. Ensure you have correctly identified the Secret to avoid unintended consequences. Deleting the wrong Secret could impact other deployments or services. Delete the Secret to clear the pending-rollback state: kubectl --namespace <namespace> delete secret -l status=pending-rollback After clearing the pending-rollback state: Retry the upgrade: Restart the upgrade process. You should investigate the initial failure to avoid getting into the pending-rollback state again. Perform a rollback: If you need to roll back to a previous release, use helm rollback <release-name> <revision> to revert to a specific, stable release version. A Redpanda Enterprise Edition license is required During a Redpanda upgrade, if enterprise features are enabled and a valid Enterprise Edition license is missing, Redpanda logs a warning and aborts the upgrade process on the affected broker. This issue prevents a successful upgrade. If you encounter this issue, follow these steps to recover: Roll back the affected broker to the original version. Do one of the following: Apply a valid Redpanda Enterprise Edition license to the cluster. Disable enterprise features. If you do not have a valid license and want to proceed without using enterprise features, you can disable the enterprise features in your Redpanda configuration. Retry the upgrade. Invalid large response size This error appears when your cluster is configured to use TLS, but you don’t specify that you are connecting over TLS. unable to request metadata: invalid large response size 352518912 > limit 104857600; the first three bytes received appear to be a tls alert record for TLS v1.2; is this a plaintext connection speaking to a tls endpoint? If you’re using rpk, ensure to add the -X tls.enabled flag, and any other necessary TLS flags such as the TLS certificate: kubectl exec <pod-name> -c redpanda --namespace <namespace> -- \ rpk cluster info -X tls.enabled=true For all available flags, see the rpk command reference. Malformed HTTP response This error appears when a cluster has TLS enabled, and you try to access the admin API without passing the required TLS parameters. Retrying POST for error: Post "http://127.0.0.1:9644/v1/security/users": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x03\x00\x02\x02" If you’re using rpk, ensure to include the TLS flags. For all available flags, see the rpk command reference. x509: certificate signed by unknown authority This error appears when the Certificate Authority (CA) that signed your certificates is not trusted by your system. Check the following: Ensure you have installed the root CA certificate correctly on your local system. If using a self-signed certificate, ensure it is properly configured and included in your system’s trust store. If you are using a certificate issued by a CA, ensure the issuing CA is included in your system’s trust store. If you are using cert-manager, ensure it is correctly configured and running properly. Check the validity of your certificates. They might have expired. x509: certificate is not valid for any names This error indicates that the certificate you are using is not valid for the specific domain or IP address you are trying to use it with. This error typically occurs when there is a mismatch between the certificate’s Subject Alternative Name (SAN) or Common Name (CN) field and the name being used to access the broker. To fix this error, you may need to obtain a new certificate that is valid for the specific domain or IP address you are using. Ensure that the certificate’s SAN or CN entry matches the name being used, and that the certificate is not expired or revoked. cannot validate certificate for 127.0.0.1 This error appears if you are using a CA certificate when you try to establish an internal connection using localhost. For example: unable to request metadata: unable to dial: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs To fix this error, you must either specify the URL with a public domain or use self-signed certificates: kubectl exec redpanda-0 -c redpanda --namespace <namespace> -- \ rpk cluster info \ -X brokers=<redpanda-url>:<port> \ -X tls.enabled=true Redpanda not applying TLS changes Enabling or disabling TLS for the RPC listener requires you to delete all Pods that run Redpanda. When you change the rpc.tls.enabled setting, or if it is not overridden and you change the global tls.enabled option, Redpanda cannot safely apply the change because RPC listener configurations must be the same across all brokers. To apply the change, all Redpanda Pods must be deleted simultaneously so that they all start with the updated RPC listener. This action results in temporary downtime of the cluster. Although you can use the --force option to speed up the rollout, it may result in data loss as Redpanda will not be given time to shut down gracefully. kubectl delete pod -l app=redpanda --namespace <namespace> I/O timeout This error appears when your worker nodes are unreachable through the given address. Check the following: The address and port are correct. Your DNS records point to addresses that resolve to your worker nodes. Is SASL missing? This error appears when you try to interact with a cluster that has SASL enabled without passing a user’s credentials. unable to request metadata: broker closed the connection immediately after a request was issued, which happens when SASL is required but not provided: is SASL missing? If you’re using rpk, ensure to specify the -X user, -X pass, and -X sasl.mechanism flags. For all available flags, see the rpk command reference. Unable to continue with update: Secret When you use a YAML list to specify superusers, the Helm chart creates a Secret using the value of auth.sasl.secretRef as the Secret’s name, and stores those superusers in the Secret. If the Secret already exists in the namespace when you deploy Redpanda, the following error is displayed: Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: Secret To fix this error, ensure that you use only one of the following methods to create superusers: auth.sasl.secretRef auth.sasl.users Back to top × Simple online edits For simple changes, such as fixing a typo, you can edit the content directly on GitHub. Edit on GitHub Or, open an issue to let us know about something that you want us to change. Open an issue Contribution guide For extensive content updates, or if you prefer to work locally, read our contribution guide . Was this helpful? thumb_up thumb_down group Ask in the community mail Share your feedback group_add Make a contribution Troubleshooting Diagnostics Bundle