Cluster Diagnostics

This is documentation for Self-Managed v23.3. To view the latest available version of the docs, see v24.3.

This topic provides guides for using tools and tests to help diagnose and debug a Redpanda cluster.

Disk and network self-test benchmarks

When a cluster behaves anomalously and you want to find out whether faulty hardware (disks, NICs) in the cluster’s machines is the cause, run rpk cluster self-test (self-test) to characterize the hardware’s performance and compare it with the expected, vendor-specified performance.

Self-test runs a set of benchmarks to determine the maximum performance of a machine’s disks and network connections. For disks, it runs throughput and latency tests by performing concurrent sequential operations. For networks, it selects unique pairs of Redpanda nodes as client/server pairs, then runs throughput tests between them. Self-test runs each benchmark for a configurable duration and returns IOPS, throughput, and latency metrics.

Self-test command examples

To begin using self-test, run the self-test start command.

    rpk cluster self-test start

For command help, run rpk cluster self-test start -h. For additional command flags, see the rpk cluster self-test start reference.

Before it starts, self-test start asks you to confirm that it can run its potentially large workload.

Example start output:

    ? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
    Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
    rpk cluster self-test status

The self-test start command returns immediately, and self-test runs its benchmarks asynchronously.

To check on the status of self-test, run the self-test status command.

    rpk cluster self-test status

For command help, run rpk cluster self-test status -h. For additional command flags, see the rpk cluster self-test status reference.

If benchmarks are currently running, self-test status returns a test-in-progress message.

Example status output:

    $ rpk cluster self-test status
    Nodes [0 1 2] are still running jobs

To automate checking the status of self-test, the status command can output its results in JSON format by using the --format=json option:

    rpk cluster self-test status --format=json

If benchmarks have completed, self-test status returns their results, grouped by node ID. Each test returns the following:

NAME: Description of the test.
INFO: Detail about the test run, attached by Redpanda itself.
TYPE: Either disk or network test.
TEST ID: Unique identifier given to the jobs of a run. All IDs in a test should match. If they don’t match, results from a newer or older test run have been included erroneously.
TIMEOUTS: Number of timeouts incurred during the test.
DURATION: Duration of the test.
IOPS: Number of operations per second. For disk, the operations are seastar::dma_read and seastar::dma_write. For network, the operation is rpc.send().
THROUGHPUT: For disk, the throughput rate in bytes per second. For network, the throughput rate in bits per second. (Note: the difference in case between GiB and Gib in the output is intentional. GiB denotes bytes and Gib denotes bits.)
LATENCY: 50th, 90th, etc. percentiles of operation latency, reported in microseconds.
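
Because the THROUGHPUT field mixes bytes per second (disk) and bits per second (network), it can help to normalize both to a single unit before comparing them with vendor specifications. The following is a minimal sketch, not part of rpk: the function name throughput_to_bits_per_sec is hypothetical, and it assumes the values keep the textual form shown in this topic (a number followed by MiB/sec, GiB/sec, Gib/sec, and so on) with binary prefixes read as powers of 1024.

```python
import re

# Binary prefixes (KiB/Kib, MiB/Mib, GiB/Gib) read as powers of 1024.
# Assumption: self-test prints only these prefixes, as in the examples.
_PREFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

# Matches strings such as "795.2MiB/sec", "790MiB/sec" or "3.74Gib/sec".
_PATTERN = re.compile(r"^([0-9]*\.?[0-9]+)(?:([KMG])i)?([Bb])/sec$")


def throughput_to_bits_per_sec(text: str) -> float:
    """Convert a THROUGHPUT string to bits per second.

    An upper-case 'B' means bytes (disk tests), a lower-case 'b'
    means bits (network tests).
    """
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized throughput value: {text!r}")
    value, prefix, unit = match.groups()
    scaled = float(value) * _PREFIX[prefix or ""]
    # Bytes are multiplied by 8; bits are returned unchanged.
    return scaled * 8 if unit == "B" else scaled


# Example: put one disk and one network reading on the same scale.
for reading in ("795.2MiB/sec", "3.74Gib/sec"):
    print(reading, "->", throughput_to_bits_per_sec(reading), "bits/sec")
```

With both readings in bits per second, a disk result and a NIC result can be compared directly against datasheet figures, provided the datasheet’s own prefix convention (decimal or binary) is taken into account.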

Example status output for a completed test:

    $ rpk cluster self-test status
    NODE ID: 1 | STATUS: IDLE
    =========================
    NAME        512K sequential r/w throughput disk test
    INFO        write run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5001ms
    IOPS        1590 req/sec
    THROUGHPUT  795.2MiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                831us   5887us   11263us  24575us  507903us

    NAME        512K sequential r/w throughput disk test
    INFO        read run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5001ms
    IOPS        4504 req/sec
    THROUGHPUT  2.2GiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                703us   1599us   4351us   6399us   10239us

    NAME        4k sequential r/w latency/iops disk test
    INFO        write run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5031ms
    IOPS        289 req/sec
    THROUGHPUT  144.7MiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                543us   34815us  69631us  77823us  77823us

    NAME        4k sequential r/w latency/iops disk test
    INFO        read run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        8275 req/sec
    THROUGHPUT  4.041GiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                191us   447us    831us    2175us   278527us

    NAME        8K Network Throughput Test
    INFO        Test performed against node: 0
    TYPE        network
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        61254 req/sec
    THROUGHPUT  3.74Gib/sec
    LATENCY     P50     P90      P99      P999     MAX
                159us   207us    303us    415us    1087us

    NAME        8K Network Throughput Test
    INFO        Test performed against node: 2
    TYPE        network
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        54814 req/sec
    THROUGHPUT  3.35Gib/sec
    LATENCY     P50     P90      P99      P999     MAX
                167us   255us    367us    511us    25599us

    NODE ID: 0 | STATUS: IDLE
    =========================
    NAME        512K sequential r/w throughput disk test
    INFO        write run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5002ms
    IOPS        1593 req/sec
    THROUGHPUT  796.8MiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                735us   5887us   11263us  69631us  507903us

    NAME        512K sequential r/w throughput disk test
    INFO        read run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        4372 req/sec
    THROUGHPUT  2.135GiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                735us   1599us   4351us   7423us   9215us

    NAME        4k sequential r/w latency/iops disk test
    INFO        write run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5026ms
    IOPS        286 req/sec
    THROUGHPUT  143.1MiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                543us   34815us  69631us  77823us  77823us

    NAME        4k sequential r/w latency/iops disk test
    INFO        read run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        8269 req/sec
    THROUGHPUT  4.038GiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                191us   447us    831us    2175us   278527us

    NAME        8K Network Throughput Test
    INFO        Test performed against node: 1
    TYPE        network
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        61612 req/sec
    THROUGHPUT  3.76Gib/sec
    LATENCY     P50     P90      P99      P999     MAX
                159us   207us    303us    431us    1151us

    NAME        8K Network Throughput Test
    INFO        Test performed against node: 2
    TYPE        network
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        60306 req/sec
    THROUGHPUT  3.68Gib/sec
    LATENCY     P50     P90      P99      P999     MAX
                159us   215us    351us    495us    11263us

    NODE ID: 2 | STATUS: IDLE
    =========================
    NAME        512K sequential r/w throughput disk test
    INFO        write run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5001ms
    IOPS        1580 req/sec
    THROUGHPUT  790MiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                671us   5887us   12287us  47103us  507903us

    NAME        512K sequential r/w throughput disk test
    INFO        read run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        3932 req/sec
    THROUGHPUT  1.92GiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                831us   1791us   4351us   7167us   9215us

    NAME        4k sequential r/w latency/iops disk test
    INFO        write run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5027ms
    IOPS        280 req/sec
    THROUGHPUT  140.1MiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                575us   34815us  73727us  86015us  86015us

    NAME        4k sequential r/w latency/iops disk test
    INFO        read run
    TYPE        disk
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        8699 req/sec
    THROUGHPUT  4.248GiB/sec
    LATENCY     P50     P90      P99      P999     MAX
                183us   367us    831us    2175us   278527us

    NAME        8K Network Throughput Test
    INFO        Test performed against node: 0
    TYPE        network
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        60027 req/sec
    THROUGHPUT  3.66Gib/sec
    LATENCY     P50     P90      P99      P999     MAX
                159us   223us    351us    511us    11775us

    NAME        8K Network Throughput Test
    INFO        Test performed against node: 1
    TYPE        network
    TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
    TIMEOUTS    0
    DURATION    5000ms
    IOPS        63090 req/sec
    THROUGHPUT  3.85Gib/sec
    LATENCY     P50     P90      P99      P999     MAX
                151us   207us    319us    463us    17407us

If self-test returns write results that are unexpectedly and significantly lower than read results, it may be because the Redpanda rpk client hardcodes the DSync option to true. When DSync is enabled, files are opened with the O_DSYNC flag set, and this represents the actual setting that Redpanda uses when it writes to disk.

To stop a running self-test, run the self-test stop command.

    rpk cluster self-test stop

Example stop output:

    $ rpk cluster self-test stop
    All self-test jobs have been stopped

For command help, run rpk cluster self-test stop -h. For additional command flags, see the rpk cluster self-test stop reference.

For more details about self-test, including command flags, see rpk cluster self-test.
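
Because self-test start returns immediately and the benchmarks run asynchronously, a script that wants the results has to poll self-test status until no node reports running jobs. The following is a minimal sketch under stated assumptions: the helper names is_running and wait_for_self_test are hypothetical, rpk is assumed to be on the PATH and already configured for the cluster, and the in-progress message is assumed to keep the wording shown in this topic ("are still running jobs"). Parsing the --format=json output would be more robust, but its schema is not described here, so the sketch matches on the plain-text message instead.

```python
import subprocess
import time

# Assumption: the in-progress message contains this phrase, as in the
# example "Nodes [0 1 2] are still running jobs".
RUNNING_MARKER = "are still running jobs"


def is_running(status_text: str) -> bool:
    """Return True if the status output reports benchmarks in progress."""
    return RUNNING_MARKER in status_text


def wait_for_self_test(poll_seconds: float = 5.0, max_polls: int = 60) -> str:
    """Poll `rpk cluster self-test status` until the benchmarks finish.

    Returns the final status text, or raises TimeoutError if jobs are
    still running after max_polls attempts.
    """
    for _ in range(max_polls):
        result = subprocess.run(
            ["rpk", "cluster", "self-test", "status"],
            capture_output=True,
            text=True,
            check=True,  # raise if rpk exits with a non-zero status
        )
        if not is_running(result.stdout):
            return result.stdout
        time.sleep(poll_seconds)
    raise TimeoutError("self-test did not finish within the polling budget")
```

A caller would run rpk cluster self-test start first, then call wait_for_self_test() and print or parse the returned text. The polling interval and attempt limit are arbitrary defaults and should be sized to the configured benchmark duration.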