# Run Cluster Diagnostics in Kubernetes

Use this guide to diagnose and troubleshoot issues in a Redpanda cluster running in Kubernetes.

## Prerequisites

Before troubleshooting Redpanda, ensure that Kubernetes itself isn't the cause of the issue. For information about debugging applications in a Kubernetes cluster, see the Kubernetes documentation.

## Collect all debugging data

For a comprehensive diagnostic snapshot, generate a debug bundle that collects detailed data for cluster, broker, or node analysis. See Generate a Debug Bundle in Kubernetes for details.

## View Helm chart configuration

To check the overrides that were applied to your deployment:

```bash
helm get values <chart-name> --namespace <namespace>
```

If you're using the Redpanda Operator, the chart name matches the name of your Redpanda resource.

To check all the values that were set in the Redpanda Helm chart, including any overrides:

```bash
helm get values <chart-name> --namespace <namespace> --all
```

## View recent events

To understand the latest events that occurred in your Redpanda cluster's namespace, sort events by their creation timestamp:

```bash
kubectl get events --namespace <namespace> --sort-by='.metadata.creationTimestamp'
```

## View Redpanda logs

Logs are crucial for monitoring and troubleshooting your Redpanda clusters. Redpanda brokers write their logs to STDOUT, so you can access them with kubectl.

To access logs for a specific Pod:

1. List all Pods to find the names of those that are running Redpanda brokers:

   ```bash
   kubectl get pods --namespace <namespace>
   ```

2. View logs for a particular Pod, replacing `<pod-name>` with the name of your Pod:

   ```bash
   kubectl logs <pod-name> --namespace <namespace>
   ```

For a comprehensive overview, view aggregated logs from all Pods in the StatefulSet:

```bash
kubectl logs --namespace <namespace> -l app.kubernetes.io/component=redpanda-statefulset
```

### Change the default log level

To change the default log level for all Redpanda subsystems, use the `logging.logLevel` configuration. Valid values are `trace`, `debug`, `info`, `warn`, and `error`.

Changing the default log level to `debug` provides more detailed logs for diagnostics, but it also increases the volume of generated logs.

To set different log levels for individual subsystems, see Override the default log level for Redpanda subsystems.

**Helm + Operator**

Add the new log level to the Redpanda resource (`redpanda-cluster.yaml`):

```yaml
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    logging:
      logLevel: debug
```

Then, apply this configuration:

```bash
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
```

**Helm**

Choose between using a custom values file (`--values`) or setting values directly (`--set`):

- `--values`: Specify the logging settings in `logging.yaml`, then upgrade:

  ```yaml
  logging:
    logLevel: debug
  ```

  ```bash
  helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
    --values logging.yaml --reuse-values
  ```

- `--set`: Set the log level directly during the upgrade:

  ```bash
  helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
    --set logging.logLevel=debug
  ```

After applying these changes, verify the log level by checking the initial output of the logs for the Redpanda Pods.
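For example, a quick spot check is to read the first lines of a broker's log stream after the change rolls out. This is a minimal sketch using the same `<pod-name>` and `<namespace>` placeholders as above; the number of lines is arbitrary, and the exact wording of the startup output varies between Redpanda versions:

```bash
# Read the first lines of a broker's logs to confirm the new default log level
# took effect. Increase the line count if your startup output is longer.
kubectl logs <pod-name> --namespace <namespace> | head -n 40
```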
### Override the default log level for Redpanda subsystems

You can override the log levels for individual subsystems, such as `rpc` and `kafka`, for more detailed logging control. Overrides set this way persist for the lifetime of the running Redpanda process. To temporarily override the log level for individual subsystems instead, use the `rpk redpanda admin config log-level set` command.

1. List all available subsystem loggers:

   ```bash
   kubectl exec -it --namespace <namespace> <pod-name> -c redpanda -- rpk redpanda start --help-loggers
   ```

2. Set the log level for one or more subsystems. In this example, the `rpc` and `kafka` subsystem loggers are set to `debug`.

**Helm + Operator**

Add the flag to the Redpanda resource (`redpanda-cluster.yaml`):

```yaml
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    statefulset:
      additionalRedpandaCmdFlags:
        - '--logger-log-level=rpc=debug:kafka=debug'
```

Then, apply this configuration:

```bash
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
```

**Helm**

Choose between using a custom values file (`--values`) or setting values directly (`--set`):

- `--values`: Specify the flag in `logging.yaml`, then upgrade:

  ```yaml
  statefulset:
    additionalRedpandaCmdFlags:
      - '--logger-log-level=rpc=debug:kafka=debug'
  ```

  ```bash
  helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
    --values logging.yaml --reuse-values
  ```

- `--set`: Set the flag directly during the upgrade:

  ```bash
  helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
    --set statefulset.additionalRedpandaCmdFlags="{--logger-log-level=rpc=debug:kafka=debug}"
  ```

Overriding the log levels for specific subsystems provides enhanced visibility into Redpanda's internal operations, which makes debugging and monitoring easier.

## View Redpanda Operator logs

To learn what's happening with the Redpanda Operator and the associated Redpanda resources, inspect the Operator logs together with Kubernetes events and the resource manifests. By monitoring these events and resources, you can troubleshoot issues that arise during the lifecycle of a Redpanda deployment.

```bash
kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>
```

## Inspect Helm releases

The Redpanda Operator uses Flux to deploy the Redpanda Helm chart. By inspecting the `helmreleases.helm.toolkit.fluxcd.io` resource, you can get detailed information about the Helm installation process for your Redpanda resource:

```bash
kubectl get helmreleases.helm.toolkit.fluxcd.io <redpanda-resource-name> -o yaml --namespace <namespace>
```

To check the Redpanda resource:

```bash
kubectl get redpandas.cluster.redpanda.com -o yaml --namespace <namespace>
```

In both the HelmRelease and the Redpanda resource, the conditions section reveals the ongoing status of the Helm installation. These conditions report the success, failure, or pending status of various operations.
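If you only want the conditions rather than the full YAML, a jsonpath query can print them directly. This is a minimal sketch; it assumes a Redpanda resource named `redpanda` and the standard Kubernetes condition fields (`type`, `status`, `message`):

```bash
# Print each condition of the Redpanda resource as: type, status, message.
# "redpanda" is an assumed resource name; replace it with your resource name.
kubectl get redpandas.cluster.redpanda.com redpanda --namespace <namespace> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
```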
## Self-test benchmarks

When anomalous behavior arises in a cluster, you can determine whether it's caused by hardware issues, such as disk drives or network interfaces (NICs), by running `rpk cluster self-test` to assess their performance and compare it to vendor specifications.

The `rpk cluster self-test` command runs a set of benchmarks to gauge the maximum performance of a machine's disks and network connections:

- Disk tests: Measure throughput and latency by performing concurrent sequential operations.
- Network tests: Select unique pairs of Redpanda brokers as client/server pairs and run throughput tests between them.

Each benchmark runs for a configurable duration and returns IOPS, throughput, and latency metrics. This helps you determine whether hardware performance aligns with expected vendor specifications.

### Cloud storage tests

You can also use the self-test command to confirm that your cloud storage is configured correctly for Tiered Storage. Self-test performs the following checks to validate the cloud storage configuration:

- Upload an object (a random buffer of 1024 bytes) to the cloud storage bucket/container.
- List objects in the bucket/container.
- Download the uploaded object from the bucket/container.
- Download the uploaded object's metadata from the bucket/container.
- Delete the uploaded object from the bucket/container.
- Upload and then delete multiple objects (random buffers) at once from the bucket/container.

For more information about the cloud storage tests, see the rpk cluster self-test start reference.

### Start self-test

To start self-test, run the `self-test start` command. Only initiate `self-test start` when system resources are available, as this operation can be resource-intensive.

```bash
rpk cluster self-test start
```

For command help, run `rpk cluster self-test start -h`. For additional command flags, see the rpk cluster self-test start reference.

Before `self-test start` begins, it asks you to confirm that you want to run its potentially large workload. Example start output:

```
? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
    rpk cluster self-test status
```

The `self-test start` command returns immediately, and self-test runs its benchmarks asynchronously.

### Check self-test status

To check the status of self-test, run the `self-test status` command.

```bash
rpk cluster self-test status
```

For command help, run `rpk cluster self-test status -h`. For additional command flags, see the rpk cluster self-test status reference.

If benchmarks are currently running, `self-test status` returns a test-in-progress message. Example status output:

```
$ rpk cluster self-test status
Nodes [0 1 2] are still running jobs
```

The status command can output results in JSON format for automated checks or script integration. Use the `--format=json` option:

```bash
rpk cluster self-test status --format=json
```
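For script integration, one option is to poll the status command and wait until no brokers report running jobs before collecting results. This is a minimal sketch: the "still running" string is taken from the in-progress message shown above and may change between rpk versions.

```bash
# Wait until no brokers report running self-test jobs, then print the results.
# The grep pattern matches the in-progress message shown in the example above.
while rpk cluster self-test status | grep -q "still running"; do
  sleep 30
done
rpk cluster self-test status
```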
If benchmarks have completed, `self-test status` returns their results. Test results are grouped by broker ID. Each test returns the following:

- Name: Description of the test.
- Info: Details about the test run, attached by Redpanda.
- Type: The test type: disk, network, or cloud.
- Test ID: Unique identifier given to the jobs of a run. All IDs in a test should match. If they don't match, then newer or older test results have been included erroneously.
- Timeouts: Number of timeouts incurred during the test.
- Start time: Time that the test started, in UTC.
- End time: Time that the test ended, in UTC.
- Avg Duration: Duration of the test.
- IOPS: Number of operations per second. For disk, these are `seastar::dma_read` and `seastar::dma_write` operations. For network, they are `rpc.send()` operations.
- Throughput: For disk, the throughput rate is in bytes per second. For network, the throughput rate is in bits per second. The difference in notation (GiB for disk, Gib for network) in the displayed output is intentional.
- Latency: The 50th, 90th, 99th, and 99.9th percentiles of operation latency, plus the maximum, reported in microseconds (μs) as P50, P90, P99, P999, and MAX.

If Tiered Storage is not enabled, the cloud storage tests do not run, a warning displays ("Cloud storage is not enabled."), and all results are shown as 0.

Example status output:

```
$ rpk cluster self-test status
NODE ID: 0 | STATUS: IDLE
=========================
NAME          512KB sequential r/w
INFO          write run (iodepth: 4, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:02:45 UTC 2024
END TIME      Fri Jul 19 15:03:15 UTC 2024
AVG DURATION  30002ms
IOPS          1182 req/sec
THROUGHPUT    591.4MiB/sec
LATENCY       P50      P90      P99      P999     MAX
              3199us   3839us   9727us   12799us  21503us

NAME          512KB sequential r/w
INFO          read run
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:03:15 UTC 2024
END TIME      Fri Jul 19 15:03:45 UTC 2024
AVG DURATION  30000ms
IOPS          6652 req/sec
THROUGHPUT    3.248GiB/sec
LATENCY       P50      P90      P99      P999     MAX
              607us    671us    831us    991us    2431us

NAME          4KB sequential r/w, low io depth
INFO          write run (iodepth: 1, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:03:45 UTC 2024
END TIME      Fri Jul 19 15:04:15 UTC 2024
AVG DURATION  30000ms
IOPS          406 req/sec
THROUGHPUT    1.59MiB/sec
LATENCY       P50      P90      P99      P999     MAX
              2431us   2559us   2815us   5887us   9215us

NAME          4KB sequential r/w, low io depth
INFO          read run
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:04:15 UTC 2024
END TIME      Fri Jul 19 15:04:45 UTC 2024
AVG DURATION  30000ms
IOPS          430131 req/sec
THROUGHPUT    1.641GiB/sec
LATENCY       P50      P90      P99      P999     MAX
              1us      2us      12us     28us     511us

NAME          4KB sequential write, medium io depth
INFO          write run (iodepth: 8, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:04:45 UTC 2024
END TIME      Fri Jul 19 15:05:15 UTC 2024
AVG DURATION  30013ms
IOPS          513 req/sec
THROUGHPUT    2.004MiB/sec
LATENCY       P50      P90      P99      P999     MAX
              15871us  16383us  21503us  32767us  40959us

NAME          4KB sequential write, high io depth
INFO          write run (iodepth: 64, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:05:15 UTC 2024
END TIME      Fri Jul 19 15:05:45 UTC 2024
AVG DURATION  30114ms
IOPS          550 req/sec
THROUGHPUT    2.151MiB/sec
LATENCY       P50       P90       P99       P999      MAX
              118783us  118783us  147455us  180223us  180223us

NAME          4KB sequential write, very high io depth
INFO          write run (iodepth: 256, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:05:45 UTC 2024
END TIME      Fri Jul 19 15:06:16 UTC 2024
AVG DURATION  30460ms
IOPS          558 req/sec
THROUGHPUT    2.183MiB/sec
LATENCY       P50       P90       P99       P999      MAX
              475135us  491519us  507903us  524287us  524287us

NAME          4KB sequential write, no dsync
INFO          write run (iodepth: 64, dsync: false)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:06:16 UTC 2024
END TIME      Fri Jul 19 15:06:46 UTC 2024
AVG DURATION  30000ms
IOPS          424997 req/sec
THROUGHPUT    1.621GiB/sec
LATENCY       P50      P90      P99      P999     MAX
              135us    183us    303us    543us    9727us

NAME          16KB sequential r/w, high io depth
INFO          write run (iodepth: 64, dsync: false)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:06:46 UTC 2024
END TIME      Fri Jul 19 15:07:16 UTC 2024
AVG DURATION  30000ms
IOPS          103047 req/sec
THROUGHPUT    1.572GiB/sec
LATENCY       P50      P90      P99      P999     MAX
              511us    1087us   1343us   1471us   15871us

NAME          16KB sequential r/w, high io depth
INFO          read run
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:07:16 UTC 2024
END TIME      Fri Jul 19 15:07:46 UTC 2024
AVG DURATION  30000ms
IOPS          193966 req/sec
THROUGHPUT    2.96GiB/sec
LATENCY       P50      P90      P99      P999     MAX
              319us    383us    735us    1023us   6399us
NAME          8K Network Throughput Test
INFO          Test performed against node: 1
TYPE          network
TEST ID       5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS      0
DURATION      5000ms
IOPS          61612 req/sec
THROUGHPUT    3.76Gib/sec
LATENCY       P50      P90      P99      P999     MAX
              159us    207us    303us    431us    1151us

NAME          8K Network Throughput Test
INFO          Test performed against node: 2
TYPE          network
TEST ID       5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS      0
DURATION      5000ms
IOPS          60306 req/sec
THROUGHPUT    3.68Gib/sec
LATENCY       P50      P90      P99      P999     MAX
              159us    215us    351us    495us    11263us

NAME          Cloud Storage Test
INFO          Put
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  8ms

NAME          Cloud Storage Test
INFO          List
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  1ms

NAME          Cloud Storage Test
INFO          Get
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  1ms

NAME          Cloud Storage Test
INFO          Head
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  0ms

NAME          Cloud Storage Test
INFO          Delete
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  1ms

NAME          Cloud Storage Test
INFO          Plural Delete
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  47ms
```

If self-test returns write results that are unexpectedly and significantly lower than read results, it may be because the Redpanda rpk client hardcodes the DSync option to true. When DSync is enabled, files are opened with the O_DSYNC flag set, and this represents the actual setting that Redpanda uses when it writes to disk.

### Stop self-test

To stop a running self-test, run the `self-test stop` command.

```bash
rpk cluster self-test stop
```

Example stop output:

```
$ rpk cluster self-test stop
All self-test jobs have been stopped
```

For command help, run `rpk cluster self-test stop -h`. For additional command flags, see the rpk cluster self-test stop reference.

For more details about self-test, including command flags, see rpk cluster self-test.

## Next steps

Learn how to resolve common errors.