Cluster Diagnostics
This topic provides guides for using tools and tests to help diagnose and debug a Redpanda cluster.
Disk and network self-test benchmarks
When anomalous behavior arises in a cluster and you need to determine whether faulty hardware (disks or NICs) on the cluster’s machines is the cause, run rpk cluster self-test (self-test) to characterize the hardware’s performance and compare it with the expected, vendor-specified performance.
Self-test runs a set of benchmarks to determine the maximum performance of a machine’s disks and network connections. For disks, it runs throughput and latency tests by performing concurrent sequential operations. For networks, it selects unique pairs of Redpanda nodes as client/server pairs, then it runs throughput tests between them. Self-test runs each benchmark for a configurable duration, and it returns IOPS, throughput, and latency metrics.
Self-test command examples
To begin using self-test, run the self-test start command.
rpk cluster self-test start
For command help, run rpk cluster self-test start -h. For additional command flags, see the rpk cluster self-test start reference.
Before it starts, self-test start asks for your confirmation to run its potentially large workload.
Example start output
$ rpk cluster self-test start
? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
rpk cluster self-test status
The self-test start command returns immediately, and self-test runs its benchmarks asynchronously.
To check on the status of self-test, run the self-test status command.
rpk cluster self-test status
For command help, run rpk cluster self-test status -h. For additional command flags, see the rpk cluster self-test status reference.
If benchmarks are currently running, self-test status returns a test-in-progress message.
Example status output: in progress
$ rpk cluster self-test status
Nodes [0 1 2] are still running jobs
To automate checking the status of self-test, run the status command with JSON output: rpk cluster self-test status --format=json
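As a rough illustration of automating the status check, the following Python sketch polls the status command until the in-progress message no longer appears. The JSON schema is not shown in this topic, so the sketch matches the plain-text message from the in-progress example instead. The function names, polling interval, and timeout are illustrative assumptions, not part of rpk.

```python
import subprocess
import time

# Text taken from the in-progress example output in this topic.
RUNNING_MARKER = "are still running jobs"


def run_status():
    """Run `rpk cluster self-test status` and return its standard output."""
    result = subprocess.run(
        ["rpk", "cluster", "self-test", "status"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wait_for_self_test(get_status=run_status, interval=5.0, timeout=600.0,
                       sleep=time.sleep):
    """Poll until the in-progress message disappears, then return the output.

    `get_status` and `sleep` are injectable so the loop can be exercised
    without a cluster (hypothetical helper, not an rpk feature).
    """
    waited = 0.0
    while True:
        output = get_status()
        if RUNNING_MARKER not in output:
            return output
        if waited >= timeout:
            raise TimeoutError(
                "self-test still running after %.0f seconds" % timeout
            )
        sleep(interval)
        waited += interval
```

Calling wait_for_self_test() with its defaults on a machine where rpk is configured would block until the benchmarks finish and then return the final status text.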
If benchmarks have completed, self-test status returns their results.
Example status output: test results
Test results are grouped by node ID. Each test returns the following:
- NAME: Description of the test.
- INFO: Details about the test run, attached by Redpanda.
- TYPE: Either disk or network test.
- TEST ID: Unique identifier given to the jobs of a run. All IDs in a test should match. If they don’t match, then results from newer or older tests have been included erroneously.
- TIMEOUTS: Number of timeouts incurred during the test.
- DURATION: Duration of the test.
- IOPS: Number of operations per second. For disk, the operations are seastar::dma_read and seastar::dma_write. For network, the operation is rpc.send().
- THROUGHPUT: For disk, the throughput rate in bytes per second. For network, the throughput rate in bits per second. (Note: GiB for bytes and Gib for bits is the correct notation displayed by the UI.)
- LATENCY: Operation latency at the 50th, 90th, 99th, and 99.9th percentiles, plus the maximum, reported in microseconds.
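Because disk throughput is reported in bytes and network throughput in bits, it can help to normalize both before comparing them with vendor figures. The sketch below converts a THROUGHPUT value to bytes per second. The helper name and the regular expression are assumptions made for illustration, based only on the value formats visible in the example output in this section.

```python
import re

# Binary prefixes as powers of two (Ki, Mi, Gi, Ti).
_PREFIX = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def throughput_to_bytes_per_sec(text):
    """Convert a value such as '795.2MiB/sec' or '3.74Gib/sec' to bytes/sec.

    An uppercase 'B' is read as bytes and a lowercase 'b' as bits.
    """
    match = re.fullmatch(r"\s*([0-9.]+)\s*([KMGT]?)i?([Bb])/sec\s*", text)
    if match is None:
        raise ValueError(f"unrecognized throughput value: {text!r}")
    value, prefix, unit = match.groups()
    scaled = float(value) * _PREFIX[prefix]
    # Bits are divided by 8 to express them as bytes.
    return scaled / 8 if unit == "b" else scaled


# Values copied from node 1 in the example output.
disk = throughput_to_bytes_per_sec("795.2MiB/sec")
network = throughput_to_bytes_per_sec("3.74Gib/sec")
```

If "512K" and "8K" in the test names are read as 512 KiB and 8 KiB requests (an assumption), these two figures agree to within about 1% with IOPS multiplied by request size: 1590 × 512 KiB for the disk write run and 61254 × 8 KiB for the network run.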
$ rpk cluster self-test status
NODE ID: 1 | STATUS: IDLE
=========================
NAME 512K sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5001ms
IOPS 1590 req/sec
THROUGHPUT 795.2MiB/sec
LATENCY P50 P90 P99 P999 MAX
831us 5887us 11263us 24575us 507903us
NAME 512K sequential r/w throughput disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5001ms
IOPS 4504 req/sec
THROUGHPUT 2.2GiB/sec
LATENCY P50 P90 P99 P999 MAX
703us 1599us 4351us 6399us 10239us
NAME 4k sequential r/w latency/iops disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5031ms
IOPS 289 req/sec
THROUGHPUT 144.7MiB/sec
LATENCY P50 P90 P99 P999 MAX
543us 34815us 69631us 77823us 77823us
NAME 4k sequential r/w latency/iops disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 8275 req/sec
THROUGHPUT 4.041GiB/sec
LATENCY P50 P90 P99 P999 MAX
191us 447us 831us 2175us 278527us
NAME 8K Network Throughput Test
INFO Test performed against node: 0
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 61254 req/sec
THROUGHPUT 3.74Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 207us 303us 415us 1087us
NAME 8K Network Throughput Test
INFO Test performed against node: 2
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 54814 req/sec
THROUGHPUT 3.35Gib/sec
LATENCY P50 P90 P99 P999 MAX
167us 255us 367us 511us 25599us
NODE ID: 0 | STATUS: IDLE
=========================
NAME 512K sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5002ms
IOPS 1593 req/sec
THROUGHPUT 796.8MiB/sec
LATENCY P50 P90 P99 P999 MAX
735us 5887us 11263us 69631us 507903us
NAME 512K sequential r/w throughput disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 4372 req/sec
THROUGHPUT 2.135GiB/sec
LATENCY P50 P90 P99 P999 MAX
735us 1599us 4351us 7423us 9215us
NAME 4k sequential r/w latency/iops disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5026ms
IOPS 286 req/sec
THROUGHPUT 143.1MiB/sec
LATENCY P50 P90 P99 P999 MAX
543us 34815us 69631us 77823us 77823us
NAME 4k sequential r/w latency/iops disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 8269 req/sec
THROUGHPUT 4.038GiB/sec
LATENCY P50 P90 P99 P999 MAX
191us 447us 831us 2175us 278527us
NAME 8K Network Throughput Test
INFO Test performed against node: 1
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 61612 req/sec
THROUGHPUT 3.76Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 207us 303us 431us 1151us
NAME 8K Network Throughput Test
INFO Test performed against node: 2
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 60306 req/sec
THROUGHPUT 3.68Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 215us 351us 495us 11263us
NODE ID: 2 | STATUS: IDLE
=========================
NAME 512K sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5001ms
IOPS 1580 req/sec
THROUGHPUT 790MiB/sec
LATENCY P50 P90 P99 P999 MAX
671us 5887us 12287us 47103us 507903us
NAME 512K sequential r/w throughput disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 3932 req/sec
THROUGHPUT 1.92GiB/sec
LATENCY P50 P90 P99 P999 MAX
831us 1791us 4351us 7167us 9215us
NAME 4k sequential r/w latency/iops disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5027ms
IOPS 280 req/sec
THROUGHPUT 140.1MiB/sec
LATENCY P50 P90 P99 P999 MAX
575us 34815us 73727us 86015us 86015us
NAME 4k sequential r/w latency/iops disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 8699 req/sec
THROUGHPUT 4.248GiB/sec
LATENCY P50 P90 P99 P999 MAX
183us 367us 831us 2175us 278527us
NAME 8K Network Throughput Test
INFO Test performed against node: 0
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 60027 req/sec
THROUGHPUT 3.66Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 223us 351us 511us 11775us
NAME 8K Network Throughput Test
INFO Test performed against node: 1
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 63090 req/sec
THROUGHPUT 3.85Gib/sec
LATENCY P50 P90 P99 P999 MAX
151us 207us 319us 463us 17407us
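To check results like these programmatically, for example to confirm that every TEST ID matches, the plain-text output can be parsed into per-node records. The following is a minimal sketch that assumes the layout shown in the example above; the function names are hypothetical, and the JSON output may be a better fit if its structure is known.

```python
FIELDS = ("NAME", "INFO", "TYPE", "TEST ID", "TIMEOUTS",
          "DURATION", "IOPS", "THROUGHPUT")


def parse_status(text):
    """Return {node_id: [test_dict, ...]} parsed from self-test status text."""
    nodes = {}
    tests = None        # tests of the node currently being read
    current = None      # test currently being filled in
    percentiles = None  # latency header names awaiting their value row
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, the echoed shell command, and '====' separator rows.
        if not line or line.startswith("$") or set(line) == {"="}:
            continue
        if line.startswith("NODE ID:"):
            node_id = int(line.split("|")[0].split(":")[1])
            tests = nodes.setdefault(node_id, [])
            current = None
            percentiles = None
            continue
        if tests is None:
            continue  # ignore anything before the first node header
        if percentiles is not None:
            # The row after the LATENCY header holds the values.
            current["LATENCY"] = dict(zip(percentiles, line.split()))
            percentiles = None
            continue
        if line.startswith("LATENCY") and current is not None:
            percentiles = line.split()[1:]
            continue
        for field in FIELDS:
            if line.startswith(field + " "):
                if field == "NAME":  # NAME opens a new test record
                    current = {}
                    tests.append(current)
                if current is not None:
                    current[field] = line[len(field):].strip()
                break
    return nodes


def distinct_test_ids(nodes):
    """Collect every TEST ID seen; more than one means mixed runs."""
    return {t["TEST ID"] for tests in nodes.values()
            for t in tests if "TEST ID" in t}
```

If distinct_test_ids returns more than one identifier, the output contains results from different runs and should be treated with caution.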
If self-test returns write results that are unexpectedly and significantly lower than read results, it may be because the Redpanda rpk client hardcodes the DSync option to true. When DSync is enabled, files are opened with the O_DSYNC flag set, which matches the setting that Redpanda actually uses when it writes to disk.
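To get a feel for what O_DSYNC changes, the following generic POSIX sketch times a series of small writes with and without the flag. It is not Redpanda's I/O path (the benchmarks above use seastar::dma_write), and the function names, block size, and write count are illustrative assumptions. Results depend heavily on the file system and device, so the sketch reports timings without predicting them.

```python
import os
import tempfile
import time


def timed_writes(path, dsync, block_size=4096, count=64):
    """Write `count` blocks to `path` and return the elapsed seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync:
        # With O_DSYNC, each write returns only after the data has been
        # transferred to stable storage. getattr keeps the sketch importable
        # on platforms that do not define the flag.
        flags |= getattr(os, "O_DSYNC", 0)
    block = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory=None):
    """Time buffered and O_DSYNC writes to a temporary file."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "dsync-demo.bin")
        return {
            "buffered": timed_writes(path, dsync=False),
            "dsync": timed_writes(path, dsync=True),
        }
```

Passing a directory on the disk under investigation to compare() gives a quick, informal view of how much synchronous writes cost on that device.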
To stop a running self-test, run the self-test stop command.
rpk cluster self-test stop
Example stop output
$ rpk cluster self-test stop
All self-test jobs have been stopped
For command help, run rpk cluster self-test stop -h. For additional command flags, see the rpk cluster self-test stop reference.
For more details about self-test, including command flags, see rpk cluster self-test.