Version: 23.1

Cluster Diagnostics

This topic provides guides for using tools and tests to help diagnose and debug a Redpanda cluster.

Disk and network self-test benchmarks

When a cluster behaves anomalously and you need to determine whether faulty hardware (disks or NICs) on the cluster's machines is the cause, run rpk cluster self-test (self-test) to measure the hardware's performance and compare it with the expected, vendor-specified performance.

Self-test runs a set of benchmarks to determine the maximum performance of a machine's disks and network connections. For disks, it runs throughput and latency tests by performing concurrent sequential operations. For networks, it selects unique pairs of Redpanda nodes as client/server pairs, then it runs throughput tests between them. Self-test runs each benchmark for a configurable duration, and it returns IOPS, throughput, and latency metrics.

Self-test command examples

To begin using self-test, run the self-test start command.

rpk cluster self-test start

For command help, run rpk cluster self-test start -h. For additional command flags, see the rpk cluster self-test start reference.

Before it starts, self-test start asks for your confirmation to run its potentially large workload.

Example start output
$ rpk cluster self-test start
? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
rpk cluster self-test status

The self-test start command returns immediately, and self-test runs its benchmarks asynchronously.

To check on the status of self-test, run the self-test status command.

rpk cluster self-test status

For command help, run rpk cluster self-test status -h. For additional command flags, see the rpk cluster self-test status reference.

If benchmarks are currently running, self-test status returns a test-in-progress message.

Example status output: in progress
$ rpk cluster self-test status
Nodes [0 1 2] are still running jobs
tip

To automate status checks, have the status command output its results in JSON format with the --format=json option:

rpk cluster self-test status --format=json
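
A polling loop built on this option might look like the following sketch. The JSON shape is an assumption, not something this page specifies: the sketch expects a top-level list of per-node objects, each with a "status" string such as "idle" or "running". Inspect the actual output of your rpk version before relying on it. The helper names all_nodes_idle, fetch_status, and wait_for_self_test are hypothetical.

```python
import json
import subprocess
import time


def all_nodes_idle(nodes):
    """Return True when there is at least one node report and all are idle.

    ASSUMPTION: each report is a dict with a "status" string field.
    """
    return bool(nodes) and all(
        str(n.get("status", "")).lower() == "idle" for n in nodes
    )


def fetch_status():
    """Run the status command from this page and parse its JSON output."""
    out = subprocess.run(
        ["rpk", "cluster", "self-test", "status", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_for_self_test(fetch=fetch_status, interval_s=5.0, max_polls=60):
    """Poll until all nodes are idle.

    Returns the final node reports, or None if max_polls is exhausted.
    """
    for _ in range(max_polls):
        nodes = fetch()
        if all_nodes_idle(nodes):
            return nodes
        time.sleep(interval_s)
    return None
```

Passing fetch as a parameter keeps the loop testable without a running cluster; in normal use, call wait_for_self_test() with its defaults after running self-test start.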

If benchmarks have completed, self-test status returns their results.

Example status output: test results

Test results are grouped by node ID. Each test returns the following:

  • NAME: Description of the test.
  • INFO: Detail about the test run attached by Redpanda itself.
  • TYPE: Either disk or network test.
  • TEST ID: Unique identifier given to jobs of a run. All IDs in a test should match. If they don't match, then newer and/or older test results have been included erroneously.
  • TIMEOUTS: Number of timeouts incurred during the test.
  • DURATION: Duration of the test.
  • IOPS: Number of operations per second. For disk, the operations are seastar::dma_read and seastar::dma_write. For network, the operation is rpc.send().
  • THROUGHPUT: For disk, it's the throughput rate in bytes per second. For network, it's the throughput rate in bits per second. (Note: the difference between GiB and Gib in the output is intentional. GiB denotes gibibytes and Gib denotes gibibits.)
  • LATENCY: 50th, 90th, etc. percentiles of operation latency, reported in microseconds.
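
As a rough sanity check, IOPS and THROUGHPUT should be related by the request size. The sketch below assumes the request size can be read from the test name (512 KiB for the 512K disk test, 8 KiB for the 8K network test). That is an inference from the sample output on this page, not documented self-test behavior, and the function names are illustrative.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def disk_throughput_mib(iops, request_bytes):
    """Disk throughput in MiB/sec (bytes per second, binary units)."""
    return iops * request_bytes / MIB


def network_throughput_gib(iops, request_bytes):
    """Network throughput in Gib/sec (bits per second, binary units)."""
    return iops * request_bytes * 8 / GIB


# Values taken from node 1 in the sample output on this page.
print(disk_throughput_mib(1590, 512 * KIB))               # reported: 795.2MiB/sec
print(round(network_throughput_gib(61254, 8 * KIB), 2))   # reported: 3.74Gib/sec
```

With these inputs the disk figure comes out at 795 MiB/sec and the network figure at about 3.74 Gib/sec, both close to the reported values. The 4k disk rows in the sample do not fit this relation with a 4 KiB request size, so treat the check as applying only to the 512K and 8K rows.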
$ rpk cluster self-test status
NODE ID: 1 | STATUS: IDLE
=========================
NAME 512K sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5001ms
IOPS 1590 req/sec
THROUGHPUT 795.2MiB/sec
LATENCY P50 P90 P99 P999 MAX
831us 5887us 11263us 24575us 507903us

NAME 512K sequential r/w throughput disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5001ms
IOPS 4504 req/sec
THROUGHPUT 2.2GiB/sec
LATENCY P50 P90 P99 P999 MAX
703us 1599us 4351us 6399us 10239us

NAME 4k sequential r/w latency/iops disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5031ms
IOPS 289 req/sec
THROUGHPUT 144.7MiB/sec
LATENCY P50 P90 P99 P999 MAX
543us 34815us 69631us 77823us 77823us

NAME 4k sequential r/w latency/iops disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 8275 req/sec
THROUGHPUT 4.041GiB/sec
LATENCY P50 P90 P99 P999 MAX
191us 447us 831us 2175us 278527us

NAME 8K Network Throughput Test
INFO Test performed against node: 0
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 61254 req/sec
THROUGHPUT 3.74Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 207us 303us 415us 1087us

NAME 8K Network Throughput Test
INFO Test performed against node: 2
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 54814 req/sec
THROUGHPUT 3.35Gib/sec
LATENCY P50 P90 P99 P999 MAX
167us 255us 367us 511us 25599us

NODE ID: 0 | STATUS: IDLE
=========================
NAME 512K sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5002ms
IOPS 1593 req/sec
THROUGHPUT 796.8MiB/sec
LATENCY P50 P90 P99 P999 MAX
735us 5887us 11263us 69631us 507903us

NAME 512K sequential r/w throughput disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 4372 req/sec
THROUGHPUT 2.135GiB/sec
LATENCY P50 P90 P99 P999 MAX
735us 1599us 4351us 7423us 9215us

NAME 4k sequential r/w latency/iops disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5026ms
IOPS 286 req/sec
THROUGHPUT 143.1MiB/sec
LATENCY P50 P90 P99 P999 MAX
543us 34815us 69631us 77823us 77823us

NAME 4k sequential r/w latency/iops disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 8269 req/sec
THROUGHPUT 4.038GiB/sec
LATENCY P50 P90 P99 P999 MAX
191us 447us 831us 2175us 278527us

NAME 8K Network Throughput Test
INFO Test performed against node: 1
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 61612 req/sec
THROUGHPUT 3.76Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 207us 303us 431us 1151us

NAME 8K Network Throughput Test
INFO Test performed against node: 2
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 60306 req/sec
THROUGHPUT 3.68Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 215us 351us 495us 11263us

NODE ID: 2 | STATUS: IDLE
=========================
NAME 512K sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5001ms
IOPS 1580 req/sec
THROUGHPUT 790MiB/sec
LATENCY P50 P90 P99 P999 MAX
671us 5887us 12287us 47103us 507903us

NAME 512K sequential r/w throughput disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 3932 req/sec
THROUGHPUT 1.92GiB/sec
LATENCY P50 P90 P99 P999 MAX
831us 1791us 4351us 7167us 9215us

NAME 4k sequential r/w latency/iops disk test
INFO write run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5027ms
IOPS 280 req/sec
THROUGHPUT 140.1MiB/sec
LATENCY P50 P90 P99 P999 MAX
575us 34815us 73727us 86015us 86015us

NAME 4k sequential r/w latency/iops disk test
INFO read run
TYPE disk
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 8699 req/sec
THROUGHPUT 4.248GiB/sec
LATENCY P50 P90 P99 P999 MAX
183us 367us 831us 2175us 278527us

NAME 8K Network Throughput Test
INFO Test performed against node: 0
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 60027 req/sec
THROUGHPUT 3.66Gib/sec
LATENCY P50 P90 P99 P999 MAX
159us 223us 351us 511us 11775us

NAME 8K Network Throughput Test
INFO Test performed against node: 1
TYPE network
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS 0
DURATION 5000ms
IOPS 63090 req/sec
THROUGHPUT 3.85Gib/sec
LATENCY P50 P90 P99 P999 MAX
151us 207us 319us 463us 17407us
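
Because all TEST ID values in one run should match, a script can flag mixed results by collecting the distinct IDs from the text output. The following is a minimal sketch that assumes each ID appears on a line beginning with "TEST ID", as in the output above; collect_test_ids and results_are_from_one_run are hypothetical helper names.

```python
def collect_test_ids(status_output):
    """Return the set of distinct TEST ID values found in the text output."""
    ids = set()
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("TEST ID"):
            ids.add(line[len("TEST ID"):].strip())
    return ids


def results_are_from_one_run(status_output):
    """True when exactly one distinct TEST ID appears in the output."""
    return len(collect_test_ids(status_output)) == 1


# Trimmed excerpt of the sample output above.
sample = """\
NODE ID: 1 | STATUS: IDLE
NAME 512K sequential r/w throughput disk test
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
NAME 8K Network Throughput Test
TEST ID 5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
"""
print(results_are_from_one_run(sample))
```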

note

If self-test returns write results that are unexpectedly and significantly lower than read results, a likely cause is that the Redpanda rpk client hardcodes the DSync option to true. When DSync is enabled, files are opened with the O_DSYNC flag set, which matches the setting that Redpanda actually uses when it writes to disk.
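
To get a feel for why O_DSYNC lowers write numbers, the sketch below times the same small writes with and without the flag. It is only an illustration of the flag's effect on a Unix-like system: it uses ordinary buffered os.write calls, not the I/O path that Redpanda or self-test uses, and timed_write is a hypothetical helper.

```python
import os
import tempfile
import time


def timed_write(path, dsync, block=b"\0" * 4096, count=64):
    """Write `count` blocks to `path` and return the elapsed seconds.

    With dsync=True the file is opened with O_DSYNC, so each write returns
    only after the data has been transferred to the storage device.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync:
        flags |= os.O_DSYNC  # available on Unix-like systems
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        return time.perf_counter() - start
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    buffered = timed_write(os.path.join(d, "buffered.bin"), dsync=False)
    synced = timed_write(os.path.join(d, "synced.bin"), dsync=True)
    print(f"buffered: {buffered:.4f}s, O_DSYNC: {synced:.4f}s")
```

On most disks the O_DSYNC run takes noticeably longer, although the size of the gap depends on the device and filesystem.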

To stop a running self-test, run the self-test stop command.

rpk cluster self-test stop

Example stop output
$ rpk cluster self-test stop
All self-test jobs have been stopped

For command help, run rpk cluster self-test stop -h. For additional command flags, see the rpk cluster self-test stop reference.

For more details about self-test, including command flags, see rpk cluster self-test.
