Troubleshoot Redpanda in Kubernetes

This topic provides guidance on how to diagnose and troubleshoot issues with Redpanda deployments in Kubernetes.

Prerequisites

Before troubleshooting Redpanda, ensure that Kubernetes isn’t the cause of the issue. For information about debugging applications in a Kubernetes cluster, see the Kubernetes documentation.

Collect all debugging data

If you’re unsure of what is wrong, you can generate a diagnostics bundle that contains a wide range of data to help debug and diagnose a Redpanda cluster or the nodes on which the brokers are running.

View Helm chart configuration

To check the overrides that were applied to your deployment:

helm get values <chart-name> --namespace <namespace>

If you’re using the Redpanda Operator, the chart name matches the name of your Redpanda resource.

To check all the values that were set in the Redpanda Helm chart, including any overrides:

helm get values <chart-name> --namespace <namespace> --all

View Redpanda logs

Logs for each Redpanda broker running inside a Pod are sent to STDOUT. Use kubectl logs to get logs from STDOUT.

To view logs for a particular Pod:

kubectl logs <pod-name> --namespace <namespace>

To view logs for all Pods in the StatefulSet:

kubectl logs --namespace <namespace> -l app.kubernetes.io/component=redpanda-statefulset

To change the log level of the Redpanda brokers to debug:

  • Helm + Operator

  • Helm

redpanda-cluster.yaml
apiVersion: cluster.redpanda.com/v1alpha1
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    logging:
      logLevel: debug
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
  • --values

  • --set

logging.yaml
logging:
  logLevel: debug
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values logging.yaml --reuse-values
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --set logging.logLevel=debug

View Redpanda Operator logs

To learn what’s happening with the Redpanda Operator and the associated Redpanda resources, you can inspect a combination of Kubernetes events and the resource manifests. By monitoring these events and resources, you can troubleshoot any issues that arise during the lifecycle of a Redpanda deployment.

kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>

View recent events

To understand the latest events that occurred in your Redpanda cluster’s namespace, you can sort events by their creation timestamp:

kubectl get events --namespace <namespace> --sort-by='.metadata.creationTimestamp'

Inspect Helm releases

The Redpanda Operator uses Flux to deploy the Redpanda Helm chart. By inspecting the helmreleases.helm.toolkit.fluxcd.io resource, you can get detailed information about the Helm installation process for your Redpanda resource:

kubectl get helmreleases.helm.toolkit.fluxcd.io -o yaml <redpanda-resource-name> --namespace <namespace>

To check the Redpanda resource:

kubectl get redpandas.cluster.redpanda.com -o yaml --namespace <namespace>

In both the HelmRelease and the Redpanda resource, the condition section reveals the ongoing status of the Helm installation. These conditions provide information on the success, failure, or pending status of various operations.

Troubleshoot known issues

This section describes issues you might encounter while deploying Redpanda in Kubernetes and explains how to troubleshoot them.

HelmRelease is not ready

If you are using the Redpanda Operator, you may see the following message while waiting for a Redpanda custom resource to be deployed:

NAME       READY   STATUS
redpanda   False   HelmRepository 'redpanda/redpanda-repository' is not ready
redpanda   False   HelmRelease 'redpanda/redpanda' is not ready

While the deployment process can sometimes take a few minutes, a prolonged 'not ready' status may indicate an issue. Follow the steps below to investigate:

  1. Check the status of the HelmRelease:

    kubectl describe helmrelease <redpanda-resource-name> --namespace <namespace>
  2. Review the Redpanda Operator logs:

    kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>

HelmRelease retries exhausted

The HelmRelease retries exhausted error occurs when the Helm Controller has tried to reconcile the HelmRelease a number of times, but these attempts have failed consistently.

The Helm Controller watches for changes in HelmRelease objects. When changes are detected, it tries to reconcile the state defined in the HelmRelease with the state in the cluster. The process of reconciliation includes installation, upgrade, testing, rollback or uninstallation of Helm releases.

You may see this error due to:

  • Incorrect configuration in the HelmRelease.

  • Issues with the chart, such as a non-existent chart version or the chart repository not being accessible.

  • Missing dependencies or prerequisites required by the chart.

  • Issues with the underlying Kubernetes cluster, such as insufficient resources or connectivity issues.

To debug this error do the following:

  1. Check the status of the HelmRelease:

    kubectl describe helmrelease <cluster-name> --namespace <namespace>
  2. Review the Redpanda Operator logs:

    kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>

When you find and fix the error, you must use the Flux CLI, fluxctl, to suspend and resume the reconciliation process:

  1. Install Flux CLI.

  2. Suspend the HelmRelease:

    flux suspend helmrelease <cluster-name> --namespace <namespace>
  3. Resume the HelmRelease:

    flux resume helmrelease <cluster-name> --namespace <namespace>

StatefulSet never rolls out

If the StatefulSet Pods remain in a pending state, they are waiting for resources to become available.

To identify the Pods that are pending, use the following command:

kubectl get pod --namespace <namespace>

The response includes a list of Pods in the StatefulSet and their status.

To view logs for a specific Pod, use the following command.

kubectl logs -f <pod-name> --namespace <namespace>

You can use the output to debug your deployment.

Unable to mount volume

If you see volume mounting errors in the Pod events or in the Redpanda logs, ensure that each of your Pods has a volume available in which to store data.

  • If you’re using StorageClasses with dynamic provisioners (default), ensure they exist:

    kubectl get storageclass
  • If you’re using PersistentVolumes, ensure that you have one PersistentVolume available for each Redpanda broker, and that each one has the storage capacity that’s set in storage.persistentVolume.size:

    kubectl get persistentvolume --namespace <namespace>

To learn how to configure different storage volumes, see Configure Storage.

Dig not defined

This error means that you are using an unsupported version of Helm:

Error: parse error at (redpanda/templates/statefulset.yaml:203): function "dig" not defined

Ensure that you are using the minimum required version: 3.6.0.

helm version

Repository name already exists

If you see this error, remove the redpanda chart repository, then try installing it again.

helm repo remove redpanda
helm repo add redpanda https://charts.redpanda.com
helm repo update

Fatal error during checker "Data directory is writable" execution

This error appears when Redpanda does not have write access to your configured storage volume under storage in the Helm chart.

Error: fatal error during checker "Data directory is writable" execution: open /var/lib/redpanda/data/test_file: permission denied

To fix this error, set statefulset.initContainers.setDataDirOwnership.enabled to true so that the initContainer can set the correct permissions on the data directories.

Cannot patch "redpanda" with kind StatefulSet

This error appears when you run helm upgrade with the --values flag but do not include all your previous overrides.

Error: UPGRADE FAILED: cannot patch "redpanda" with kind StatefulSet: StatefulSet.apps "redpanda" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden

Ensure to do one of the following:

  • Include all the value overrides from the previous installation or upgrade using either the --set or the --values flags.

  • Use the --reuse-values flag.

    Do not use the --reuse-values flag to upgrade from one version of the Helm chart to another. This flag stops Helm from using any new values in the upgraded chart.

Cannot patch "redpanda-console" with kind Deployment

This error appears if you try to upgrade your deployment and you already have console.enabled set to true.

Error: UPGRADE FAILED: cannot patch "redpanda-console" with kind Deployment: Deployment.apps "redpanda-console" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"redpanda", "app.kubernetes.io/name":"console"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

To fix this error, set console.enabled to false so that Helm doesn’t try to deploy Redpanda Console again.

Invalid large response size

This error appears when your cluster is configured to use TLS, but you don’t specify that you are connecting over TLS.

unable to request metadata: invalid large response size 352518912 > limit 104857600; the first three bytes received appear to be a tls alert record for TLS v1.2; is this a plaintext connection speaking to a tls endpoint?

If you’re using rpk, ensure to add the -X tls.enabled flag, and any other necessary TLS flags such as the TLS certificate:

kubectl exec <pod-name> -c redpanda --namespace <namespace> -- rpk cluster info -X brokers=<subdomain>.<domain>:<external-port> -X tls.enabled=true

For all available flags, see the rpk command reference.

Malformed HTTP response

This error appears when a cluster has TLS enabled, and you try to access the admin API without passing the required TLS parameters.

Retrying POST for error: Post "http://127.0.0.1:9644/v1/security/users": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x03\x00\x02\x02"

If you’re using rpk, ensure to include the TLS flags.

For all available flags, see the rpk command reference.

x509: certificate signed by unknown authority

This error appears when the Certificate Authority (CA) that signed your certificates is not trusted by your system.

Check the following:

  • Ensure you have installed the root CA certificate correctly on your local system.

  • If using a self-signed certificate, ensure it is properly configured and included in your system’s trust store.

  • If you are using a certificate issued by a CA, ensure the issuing CA is included in your system’s trust store.

  • If you are using cert-manager, ensure it is correctly configured and running properly.

  • Check the validity of your certificates. They might have expired.

x509: certificate is not valid for any names

This error indicates that the certificate you are using is not valid for the specific domain or IP address you are trying to use it with. This error typically occurs when there is a mismatch between the certificate’s Subject Alternative Name (SAN) or Common Name (CN) field and the name being used to access the broker.

To resolve this issue, you may need to obtain a new certificate that is valid for the specific domain or IP address you are using. Ensure that the certificate’s SAN or CN entry matches the name being used, and that the certificate is not expired or revoked.

cannot validate certificate for 127.0.0.1

This error appears if you are using a CA certificate when you try to establish an internal connection using localhost. For example:

unable to request metadata: unable to dial: x509: cannot validate certificate for 127.0.0.1 because it doesn't contain any IP SANs

To resolve this issue, you must either specify the public domain or use self-signed certificates:

kubectl exec redpanda-0 -c redpanda --namespace <namespace> -- \
  rpk cluster info \
  -X brokers=<subdomain>.<domain>:<external-port> \
  -X tls.enabled=true

I/O timeout

This error appears when your worker nodes are unreachable through the given address.

Check the following:

  • The address and port are correct.

  • Your DNS records point to addresses that resolve to your worker nodes.

Is SASL missing?

This error appears when you try to interact with a cluster that has SASL enabled without passing a user’s credentials.

unable to request metadata: broker closed the connection immediately after a request was issued, which happens when SASL is required but not provided: is SASL missing?

If you’re using rpk, ensure to specify the -X user, -X pass, and -X sasl.mechanism flags.

For all available flags, see the rpk command reference.

Unable to continue with update: Secret

When you use a YAML list to specify superusers, the Helm chart creates a Secret using the value of auth.sasl.secretRef as the Secret’s name, and stores those superusers in the Secret. If the Secret already exists in the namespace when you deploy Redpanda, the following error is displayed:

Error: UPGRADE FAILED: rendered manifests contain a resource that already exists. Unable to continue with update: Secret

To resolve this issue, ensure that you use only one of the following methods to create superusers:

  • auth.sasl.secretRef

  • auth.sasl.users