Cluster Diagnostics

This topic provides guides for using tools and tests to help diagnose and debug a Redpanda cluster.

Disk and network self-test benchmarks

When anomalous behavior arises in a cluster and you’re trying to figure out whether it’s caused by faulty hardware (disks, NICs) of a cluster’s machines, run rpk cluster self-test (self-test) to characterize their performance and compare it with their expected, vendor-specified performance.

Self-test runs a set of benchmarks to determine the maximum performance of a machine’s disks and network connections. For disks, it runs throughput and latency tests by performing concurrent sequential operations. For networks, it selects unique pairs of Redpanda nodes as client/server pairs, then it runs throughput tests between them. Self-test runs each benchmark for a configurable duration, and it returns IOPS, throughput, and latency metrics.

Self-test command examples

To begin using self-test, run the self-test start command.

rpk cluster self-test start

For command help, run rpk cluster self-test start -h. For additional command flags, see the rpk cluster self-test start reference.

Before it starts, self-test start asks for your confirmation to run its potentially large workload.

Example start output:

? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
rpk cluster self-test status

The self-test start command returns immediately, and self-test runs its benchmarks asynchronously.

To check on the status of self-test, run the self-test status command.

rpk cluster self-test status

For command help, run rpk cluster self-test status -h. For additional command flags, see the rpk cluster self-test status reference.

If benchmarks are currently running, self-test status returns a test-in-progress message.

Example status output:

$ rpk cluster self-test status
Nodes [0 1 2] are still running jobs

To automate checking the status of self-test, the status command can output its results in JSON format by using the --format=json option:

rpk cluster self-test status --format=json

If benchmarks have completed, self-test status returns their results.

Example status output: test results

Test results are grouped by node ID. Each test returns the following:

  • NAME: Description of the test.

  • INFO: Detail about the test run attached by Redpanda itself.

  • TYPE: Either disk or network test.

  • TEST ID: Unique identifier given to jobs of a run. All IDs in a test should match. If they don’t match, then newer and/or older test results have been included erroneously.

  • TIMEOUTS: Number of timeouts incurred during the test.

  • DURATION: Duration of the test.

  • IOPS: Number of operations per second. For disk, it’s seastar::dma_read and seastar::dma_write. For network, it’s rpc.send()

  • THROUGHPUT: For disk, it’s throughput rate in bytes per second. For network, it’s throughput rate in bits per second in. (Note: GiB vs. Gib is the correct notation displayed by the UI.)

  • LATENCY: 50th, 90th, etc. percentiles of operation latency, reported in microseconds.

$ rpk cluster self-test status
NODE ID: 1 | STATUS: IDLE
=========================
NAME        512K sequential r/w throughput disk test
INFO        write run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5001ms
IOPS        1590 req/sec
THROUGHPUT  795.2MiB/sec
LATENCY     P50    P90     P99      P999     MAX
            831us  5887us  11263us  24575us  507903us

NAME        512K sequential r/w throughput disk test
INFO        read run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5001ms
IOPS        4504 req/sec
THROUGHPUT  2.2GiB/sec
LATENCY     P50    P90     P99     P999    MAX
            703us  1599us  4351us  6399us  10239us

NAME        4k sequential r/w latency/iops disk test
INFO        write run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5031ms
IOPS        289 req/sec
THROUGHPUT  144.7MiB/sec
LATENCY     P50    P90      P99      P999     MAX
            543us  34815us  69631us  77823us  77823us

NAME        4k sequential r/w latency/iops disk test
INFO        read run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        8275 req/sec
THROUGHPUT  4.041GiB/sec
LATENCY     P50    P90    P99    P999    MAX
            191us  447us  831us  2175us  278527us

NAME        8K Network Throughput Test
INFO        Test performed against node: 0
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        61254 req/sec
THROUGHPUT  3.74Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            159us  207us  303us  415us  1087us

NAME        8K Network Throughput Test
INFO        Test performed against node: 2
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        54814 req/sec
THROUGHPUT  3.35Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            167us  255us  367us  511us  25599us

NODE ID: 0 | STATUS: IDLE
=========================
NAME        512K sequential r/w throughput disk test
INFO        write run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5002ms
IOPS        1593 req/sec
THROUGHPUT  796.8MiB/sec
LATENCY     P50    P90     P99      P999     MAX
            735us  5887us  11263us  69631us  507903us

NAME        512K sequential r/w throughput disk test
INFO        read run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        4372 req/sec
THROUGHPUT  2.135GiB/sec
LATENCY     P50    P90     P99     P999    MAX
            735us  1599us  4351us  7423us  9215us

NAME        4k sequential r/w latency/iops disk test
INFO        write run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5026ms
IOPS        286 req/sec
THROUGHPUT  143.1MiB/sec
LATENCY     P50    P90      P99      P999     MAX
            543us  34815us  69631us  77823us  77823us

NAME        4k sequential r/w latency/iops disk test
INFO        read run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        8269 req/sec
THROUGHPUT  4.038GiB/sec
LATENCY     P50    P90    P99    P999    MAX
            191us  447us  831us  2175us  278527us

NAME        8K Network Throughput Test
INFO        Test performed against node: 1
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        61612 req/sec
THROUGHPUT  3.76Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            159us  207us  303us  431us  1151us

NAME        8K Network Throughput Test
INFO        Test performed against node: 2
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        60306 req/sec
THROUGHPUT  3.68Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            159us  215us  351us  495us  11263us

NODE ID: 2 | STATUS: IDLE
=========================
NAME        512K sequential r/w throughput disk test
INFO        write run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5001ms
IOPS        1580 req/sec
THROUGHPUT  790MiB/sec
LATENCY     P50    P90     P99      P999     MAX
            671us  5887us  12287us  47103us  507903us

NAME        512K sequential r/w throughput disk test
INFO        read run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        3932 req/sec
THROUGHPUT  1.92GiB/sec
LATENCY     P50    P90     P99     P999    MAX
            831us  1791us  4351us  7167us  9215us

NAME        4k sequential r/w latency/iops disk test
INFO        write run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5027ms
IOPS        280 req/sec
THROUGHPUT  140.1MiB/sec
LATENCY     P50    P90      P99      P999     MAX
            575us  34815us  73727us  86015us  86015us

NAME        4k sequential r/w latency/iops disk test
INFO        read run
TYPE        disk
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        8699 req/sec
THROUGHPUT  4.248GiB/sec
LATENCY     P50    P90    P99    P999    MAX
            183us  367us  831us  2175us  278527us

NAME        8K Network Throughput Test
INFO        Test performed against node: 0
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        60027 req/sec
THROUGHPUT  3.66Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            159us  223us  351us  511us  11775us

NAME        8K Network Throughput Test
INFO        Test performed against node: 1
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        63090 req/sec
THROUGHPUT  3.85Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            151us  207us  319us  463us  17407us
If self-test returns write results that are unexpectedly and significantly lower than read results, it may be because the Redpanda rpk client hardcodes the DSync option to true. When DSync is enabled, files are opened with the O_DSYNC flag set, and this represents the actual setting that Redpanda uses when it writes to disk.

To stop a running self-test, run the self-test stop command.

rpk cluster self-test stop

Example stop output:

$ rpk cluster self-test stop
All self-test jobs have been stopped

For command help, run rpk cluster self-test stop -h. For additional command flags, see the rpk cluster self-test stop reference.

For more details about self-test, including command flags, see rpk cluster self-test.