# Run Cluster Diagnostics in Kubernetes


---
title: Run Cluster Diagnostics in Kubernetes
latest-redpanda-tag: v26.1.9
latest-console-tag: v3.7.3
latest-operator-version: v26.1.4
# EOL = End-of-Life (support lifecycle status)
page-is-nearing-eol: "false"
page-is-past-eol: "false"
page-eol-date: March 31, 2027
latest-connect-version: 4.93.0
docname: cluster-diagnostics/k-diagnose-issues
page-component-name: streaming
page-version: "26.1"
page-component-version: "26.1"
page-component-title: Streaming
page-relative-src-path: cluster-diagnostics/k-diagnose-issues.adoc
page-edit-url: https://github.com/redpanda-data/docs/edit/main/modules/troubleshoot/pages/cluster-diagnostics/k-diagnose-issues.adoc
description: Use this guide to diagnose and troubleshoot issues in a Redpanda cluster running in Kubernetes.
page-git-created-date: "2024-12-03"
page-git-modified-date: "2024-12-03"
support-status: supported
---

<!-- Source: https://docs.redpanda.com/streaming/current/troubleshoot/cluster-diagnostics/k-diagnose-issues.md -->

Use this guide to diagnose and troubleshoot issues in a Redpanda cluster running in Kubernetes.

## [](#prerequisites)Prerequisites

Before troubleshooting Redpanda, ensure that Kubernetes isn’t the cause of the issue. For information about debugging applications in a Kubernetes cluster, see the [Kubernetes documentation](https://kubernetes.io/docs/tasks/debug/).

## [](#collect-all-debugging-data)Collect all debugging data

For a comprehensive diagnostic snapshot, generate a debug bundle that collects detailed data for cluster, broker, or node analysis.

See [Generate a Debug Bundle with `rpk` in Kubernetes](https://docs.redpanda.com/streaming/current/troubleshoot/debug-bundle/generate/kubernetes/) for details on generating a debug bundle.

## [](#view-helm-chart-configuration)View Helm chart configuration

To check the overrides that were applied to your deployment:

```bash
helm get values <chart-name> --namespace <namespace>
```

If you’re using the Redpanda Operator, the chart name matches the name of your Redpanda resource.

To check all the values that were set in the Redpanda Helm chart, including any overrides:

```bash
helm get values <chart-name> --namespace <namespace> --all
```

## [](#view-recent-events)View recent events

To understand the latest events that occurred in your Redpanda cluster’s namespace, you can sort events by their creation timestamp:

```bash
kubectl get events --namespace <namespace> --sort-by='.metadata.creationTimestamp'
```
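
When a namespace is busy, it can help to narrow the list to warnings. The following is a minimal sketch that wraps the command above in a hypothetical helper, `warning_events`, and adds the standard `kubectl` field selector `type=Warning`:

```bash
# List only Warning events in a namespace, oldest first.
# warning_events is a hypothetical helper name, not part of kubectl.
# Usage: warning_events <namespace>
warning_events() {
  kubectl get events --namespace "$1" \
    --field-selector type=Warning \
    --sort-by='.metadata.creationTimestamp'
}
```

For example, `warning_events <namespace>` prints only events such as failed scheduling or failed probes, which are often the quickest pointer to a misbehaving Pod.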

## [](#view-redpanda-logs)View Redpanda logs

Logs are crucial for monitoring and troubleshooting your Redpanda clusters. Redpanda brokers output logs to STDOUT, making them accessible via `kubectl`.

To access logs for a specific Pod:

1.  List all Pods to find the names of those that are running Redpanda brokers:

    ```bash
    kubectl get pods --namespace <namespace>
    ```

2.  View logs for a particular Pod by replacing `<pod-name>` with the name of your Pod:

    ```bash
    kubectl logs <pod-name> --namespace <namespace>
    ```

    > 💡 **TIP**
    >
    > For a comprehensive overview, you can view aggregated logs from all Pods in the StatefulSet:
    >
    > ```bash
    > kubectl logs --namespace <namespace> -l app.kubernetes.io/component=redpanda-statefulset
    > ```
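
Broker logs can be long, so filtering by level is often a useful first pass. The sketch below assumes that each Redpanda log line begins with its level (for example, `WARN` or `ERROR`) and that the broker container is named `redpanda`, as in the `kubectl exec` example later in this guide. The helper names are hypothetical.

```bash
# Keep only WARN and ERROR lines from log output read on stdin.
# Assumption: each log line starts with its level followed by whitespace.
filter_warn_error() {
  grep -E '^(WARN|ERROR)[[:space:]]' || true
}

# Show warnings and errors from the last hour of one broker's logs.
# Usage: broker_problems <namespace> <pod-name>
broker_problems() {
  kubectl logs "$2" --namespace "$1" -c redpanda --since=1h | filter_warn_error
}
```

If a Pod has restarted, adding `--previous` to `kubectl logs` shows the logs of the previous container instance instead of the current one.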


### [](#change-the-default-log-level)Change the default log level

To change the default log level for all Redpanda subsystems, use the `logging.logLevel` configuration. Valid values are `trace`, `debug`, `info`, `warn`, `error`.

Changing the default log level to `debug` can provide more detailed logs for diagnostics. This logging level increases the volume of generated logs.

> 📝 **NOTE**
>
> To set different log levels for individual subsystems, see [Override the default log level for Redpanda subsystems](#override-the-default-log-level-for-redpanda-subsystems).

#### Operator

Set the new log level in your Redpanda resource:

`redpanda-cluster.yaml`

```yaml
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    logging:
      logLevel: debug
```

Then, apply this configuration:

```bash
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
```

#### Helm

Choose between using a custom values file or setting values directly:

##### --values

Specify logging settings in `logging.yaml`, then upgrade:

`logging.yaml`

```yaml
logging:
  logLevel: debug
```

```bash
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --values logging.yaml --reuse-values
```

##### --set

Directly set the log level during upgrade:

```bash
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
  --set logging.logLevel=debug
```

After applying these changes, verify the log level by [checking the initial output of the logs](#view-redpanda-logs) for the Redpanda Pods.

### [](#override-the-default-log-level-for-redpanda-subsystems)Override the default log level for Redpanda subsystems

You can override the log levels for individual subsystems, such as `rpc` and `kafka`, for more detailed logging control. Overrides set this way remain in effect for the lifetime of the running Redpanda process.

> 💡 **TIP**
>
> To temporarily override the log level for individual subsystems, you can use the [`rpk redpanda admin config log-level set`](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-redpanda/rpk-redpanda-admin-config-log-level-set/) command.

1.  List all available subsystem loggers:

    ```bash
    kubectl exec -it --namespace <namespace> <pod-name> -c redpanda -- rpk redpanda start --help-loggers
    ```

2.  Set the log level for one or more subsystems. In this example, the `rpc` and `kafka` subsystem loggers are set to `debug`.

    #### Operator

    Set the subsystem log levels in your Redpanda resource:

    `redpanda-cluster.yaml`

    ```yaml
    apiVersion: cluster.redpanda.com/v1alpha2
    kind: Redpanda
    metadata:
      name: redpanda
    spec:
      chartRef: {}
      clusterSpec:
        statefulset:
          additionalRedpandaCmdFlags:
            - '--logger-log-level=rpc=debug:kafka=debug'
    ```

    Then, apply this configuration:

    ```bash
    kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
    ```


    #### Helm

    Choose between using a custom values file or setting values directly:

    ##### --values

    Specify logging settings in `logging.yaml`, then upgrade:

    `logging.yaml`

    ```yaml
    statefulset:
      additionalRedpandaCmdFlags:
        - '--logger-log-level=rpc=debug:kafka=debug'
    ```

    ```bash
    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
      --values logging.yaml --reuse-values
    ```


    ##### --set

    Directly set the log level during upgrade:

    ```bash
    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
      --set statefulset.additionalRedpandaCmdFlags="{--logger-log-level=rpc=debug:kafka=debug}"
    ```


Overriding the log levels for specific subsystems provides enhanced visibility into Redpanda’s internal operations, facilitating better debugging and monitoring.
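
The flag value in the examples above is a colon-separated list of `<subsystem>=<level>` pairs. As a minimal sketch, the hypothetical helper below builds that value for any set of subsystems at one level, which can reduce typos when the list grows:

```bash
# Build a --logger-log-level flag from a level and a list of subsystem names.
# build_logger_flag is a hypothetical helper, not part of rpk or Helm.
# Usage: build_logger_flag <level> <subsystem>...
build_logger_flag() {
  level=$1
  shift
  pairs=""
  for name in "$@"; do
    # Join entries with ":" as in the rpc=debug:kafka=debug example.
    pairs="${pairs:+$pairs:}$name=$level"
  done
  printf '%s\n' "--logger-log-level=$pairs"
}

build_logger_flag debug rpc kafka
```

The last line prints `--logger-log-level=rpc=debug:kafka=debug`, the same value used in `additionalRedpandaCmdFlags` above.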

## [](#view-redpanda-operator-logs)View Redpanda Operator logs

To learn what’s happening with the Redpanda Operator and the associated Redpanda resources, inspect the Operator’s logs alongside Kubernetes events and the resource manifests. Together, these help you troubleshoot issues that arise during the lifecycle of a Redpanda deployment.

```bash
kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>
```

## [](#inspect-the-redpanda-resource)Inspect the Redpanda resource

In the Redpanda resource, the conditions section reveals the ongoing status of reconciliation. These conditions provide information on the success, failure, or pending status of various operations.

To check the Redpanda resource:

```bash
kubectl get redpandas.cluster.redpanda.com -o yaml --namespace <namespace>
```
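
The full YAML can be long. The following sketch prints only the conditions, one per line. It assumes the conditions are stored under `.status.conditions` with the usual Kubernetes fields (`type`, `status`, `reason`, `message`); `redpanda_conditions` is a hypothetical helper name.

```bash
# Print one line per status condition of a Redpanda resource.
# Assumption: conditions live under .status.conditions with standard fields.
# Usage: redpanda_conditions <namespace> <resource-name>
redpanda_conditions() {
  kubectl get redpandas.cluster.redpanda.com "$2" --namespace "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

A condition whose status is not `True` is usually the place to start reading the `reason` and `message` columns.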

## [](#self-test)Self-test benchmarks

When anomalous behavior arises in a cluster, you can check whether hardware such as disk drives or network interfaces (NICs) is the cause. Run [`rpk cluster self-test`](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-cluster/rpk-cluster-self-test/) to measure hardware performance and compare the results to vendor specifications.

The `rpk cluster self-test` command runs a set of benchmarks to gauge the maximum performance of a machine’s disks and network connections:

-   **Disk tests**: Measures throughput and latency by performing concurrent sequential operations.

-   **Network tests**: Selects unique pairs of Redpanda brokers as client/server pairs and runs throughput tests between them.


Each benchmark runs for a configurable duration and returns IOPS, throughput, and latency metrics. This helps you determine if hardware performance aligns with expected vendor specifications.

### [](#cloud-storage-tests)Cloud storage tests

You can also use the self-test command to confirm your cloud storage is configured correctly for [Tiered Storage](https://docs.redpanda.com/streaming/current/manage/tiered-storage/).

Self-test performs the following checks to validate cloud storage configuration:

1.  Upload an object (a random buffer of 1024 bytes) to the cloud storage bucket/container.

2.  List objects in the bucket/container.

3.  Download the uploaded object from the bucket/container.

4.  Download the uploaded object’s metadata from the bucket/container.

5.  Delete the uploaded object from the bucket/container.

6.  Upload and then delete multiple objects (random buffers) at once from the bucket/container.


For more information on cloud storage test details, see the [`rpk cluster self-test start`](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-cluster/rpk-cluster-self-test-start/) reference.

### [](#start-self-test)Start self-test

To start using self-test, run the `self-test start` command. Only initiate `self-test start` when system resources are available, as this operation can be resource-intensive.

```bash
rpk cluster self-test start
```

For command help, run `rpk cluster self-test start -h`. For additional command flags, see the [rpk cluster self-test start](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-cluster/rpk-cluster-self-test-start/) reference.

Before `self-test start` begins, it requests your confirmation to run its potentially large workload.

Example start output:

```console
? Redpanda self-test will run benchmarks of disk and network hardware that will consume significant system resources. Do not start self-test if large workloads are already running on the system. (Y/n)
Redpanda self-test has started, test identifier: "031be460-246b-46af-98f2-5fc16f03aed3", To check the status run:
rpk cluster self-test status
```

The `self-test start` command returns immediately, and self-test runs its benchmarks asynchronously.
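
In a Kubernetes deployment, one way to run these `rpk` commands is inside a broker Pod, following the same `kubectl exec` pattern this guide uses to list subsystem loggers. The sketch below wraps that pattern in a hypothetical helper, `rpk_in_pod`, and assumes the broker container is named `redpanda`:

```bash
# Run an rpk command inside the redpanda container of a broker Pod.
# rpk_in_pod is a hypothetical helper name.
# Usage: rpk_in_pod <namespace> <pod-name> <rpk arguments>...
rpk_in_pod() {
  ns=$1
  pod=$2
  shift 2
  kubectl exec -it --namespace "$ns" "$pod" -c redpanda -- rpk "$@"
}

# Example (placeholders as elsewhere in this guide):
# rpk_in_pod <namespace> <pod-name> cluster self-test start
```

The `-it` flags keep the session interactive so that you can answer the confirmation prompt.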

### [](#check-self-test-status)Check self-test status

To check the status of self-test, run the `self-test status` command.

```bash
rpk cluster self-test status
```

For command help, run `rpk cluster self-test status -h`. For additional command flags, see the [rpk cluster self-test status](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-cluster/rpk-cluster-self-test-status/) reference.

If benchmarks are currently running, `self-test status` returns a test-in-progress message.

Example status output:

```console
$ rpk cluster self-test status
Nodes [0 1 2] are still running jobs
```

> 💡 **TIP**
>
> The `status` command can output results in JSON format for automated checks or script integration. Use the `--format=json` option:
>
> ```bash
> rpk cluster self-test status --format=json
> ```
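
Because `self-test start` returns immediately, scripts usually need to wait for the run to finish. The following is a minimal sketch that polls the plain-text status and relies on the in-progress message shown in the example above ("still running jobs"). The helper names are hypothetical, and the message text may differ between versions.

```bash
# Succeeds (exit 0) while the status text on stdin reports running jobs.
# Assumption: the in-progress message contains "still running jobs".
self_test_running() {
  grep -q 'still running jobs'
}

# Poll until the run finishes, then print the final status.
# Usage: wait_for_self_test [interval-seconds]
wait_for_self_test() {
  interval=${1:-10}
  while rpk cluster self-test status | self_test_running; do
    sleep "$interval"
  done
  rpk cluster self-test status
}
```

Calling `wait_for_self_test 30` checks every 30 seconds and prints the results once no broker reports running jobs.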

If benchmarks have completed, `self-test status` returns their results.

Test results are grouped by broker ID. Each test returns the following:

-   **Name**: Description of the test.

-   **Info**: Details about the test run attached by Redpanda.

-   **Type**: One of `disk`, `network`, or `cloud`.

-   **Test Id**: Unique identifier given to jobs of a run. All IDs in a test should match. If they don’t match, then newer and/or older test results have been included erroneously.

-   **Timeouts**: Number of timeouts incurred during the test.

-   **Start time**: Time that the test started, in UTC.

-   **End time**: Time that the test ended, in UTC.

-   **Avg Duration**: Duration of the test.

-   **IOPS**: Number of operations per second. For disk, it’s `seastar::dma_read` and `seastar::dma_write`. For network, it’s `rpc.send()`.

-   **Throughput**: For disk, the throughput rate is in bytes per second. For network, the throughput rate is in bits per second. The output distinguishes the two by notation: an uppercase `B` (for example, GiB) means bytes, and a lowercase `b` (for example, Gib) means bits.

-   **Latency**: Operation latency in microseconds (μs), reported at the 50th, 90th, 99th, and 99.9th percentiles and as the maximum. These values are shown as P50, P90, P99, P999, and MAX.
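
IOPS and throughput are related through the request size, which gives a quick sanity check on disk results. As a sketch, the hypothetical helper below estimates throughput as IOPS multiplied by request size, assuming that the "512KB" in a test name means 512 KiB per request:

```bash
# Estimate throughput in MiB/sec from IOPS and the request size in KiB.
# expected_mib_per_sec is a hypothetical helper name.
# Usage: expected_mib_per_sec <iops> <request-size-kib>
expected_mib_per_sec() {
  awk -v iops="$1" -v kib="$2" 'BEGIN { printf "%.1f\n", iops * kib / 1024 }'
}

expected_mib_per_sec 1182 512
```

This prints `591.0`. In the sample results on this page, the 512KB write run reports 1182 req/sec and 591.4MiB/sec, so the two figures agree to within rounding of the reported IOPS.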


If [Tiered Storage](https://docs.redpanda.com/streaming/current/manage/tiered-storage/) is not enabled, then cloud storage tests do not run, and a warning displays: "Cloud storage is not enabled." All results are shown as 0.

Example status output: test results

```console
$ rpk cluster self-test status
NODE ID: 0 | STATUS: IDLE
=========================
NAME          512KB sequential r/w
INFO          write run (iodepth: 4, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:02:45 UTC 2024
END TIME      Fri Jul 19 15:03:15 UTC 2024
AVG DURATION  30002ms
IOPS          1182 req/sec
THROUGHPUT    591.4MiB/sec
LATENCY       P50     P90     P99     P999     MAX
              3199us  3839us  9727us  12799us  21503us

NAME          512KB sequential r/w
INFO          read run
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:03:15 UTC 2024
END TIME      Fri Jul 19 15:03:45 UTC 2024
AVG DURATION  30000ms
IOPS          6652 req/sec
THROUGHPUT    3.248GiB/sec
LATENCY       P50    P90    P99    P999   MAX
              607us  671us  831us  991us  2431us

NAME          4KB sequential r/w, low io depth
INFO          write run (iodepth: 1, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:03:45 UTC 2024
END TIME      Fri Jul 19 15:04:15 UTC 2024
AVG DURATION  30000ms
IOPS          406 req/sec
THROUGHPUT    1.59MiB/sec
LATENCY       P50     P90     P99     P999    MAX
              2431us  2559us  2815us  5887us  9215us

NAME          4KB sequential r/w, low io depth
INFO          read run
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:04:15 UTC 2024
END TIME      Fri Jul 19 15:04:45 UTC 2024
AVG DURATION  30000ms
IOPS          430131 req/sec
THROUGHPUT    1.641GiB/sec
LATENCY       P50   P90   P99   P999  MAX
              1us   2us   12us  28us  511us

NAME          4KB sequential write, medium io depth
INFO          write run (iodepth: 8, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:04:45 UTC 2024
END TIME      Fri Jul 19 15:05:15 UTC 2024
AVG DURATION  30013ms
IOPS          513 req/sec
THROUGHPUT    2.004MiB/sec
LATENCY       P50      P90      P99      P999     MAX
              15871us  16383us  21503us  32767us  40959us

NAME          4KB sequential write, high io depth
INFO          write run (iodepth: 64, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:05:15 UTC 2024
END TIME      Fri Jul 19 15:05:45 UTC 2024
AVG DURATION  30114ms
IOPS          550 req/sec
THROUGHPUT    2.151MiB/sec
LATENCY       P50       P90       P99       P999      MAX
              118783us  118783us  147455us  180223us  180223us

NAME          4KB sequential write, very high io depth
INFO          write run (iodepth: 256, dsync: true)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:05:45 UTC 2024
END TIME      Fri Jul 19 15:06:16 UTC 2024
AVG DURATION  30460ms
IOPS          558 req/sec
THROUGHPUT    2.183MiB/sec
LATENCY       P50       P90       P99       P999      MAX
              475135us  491519us  507903us  524287us  524287us

NAME          4KB sequential write, no dsync
INFO          write run (iodepth: 64, dsync: false)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:06:16 UTC 2024
END TIME      Fri Jul 19 15:06:46 UTC 2024
AVG DURATION  30000ms
IOPS          424997 req/sec
THROUGHPUT    1.621GiB/sec
LATENCY       P50    P90    P99    P999   MAX
              135us  183us  303us  543us  9727us

NAME          16KB sequential r/w, high io depth
INFO          write run (iodepth: 64, dsync: false)
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:06:46 UTC 2024
END TIME      Fri Jul 19 15:07:16 UTC 2024
AVG DURATION  30000ms
IOPS          103047 req/sec
THROUGHPUT    1.572GiB/sec
LATENCY       P50    P90     P99     P999    MAX
              511us  1087us  1343us  1471us  15871us

NAME          16KB sequential r/w, high io depth
INFO          read run
TYPE          disk
TEST ID       21c5a3de-c75b-480c-8a3d-0cbb63228cb1
TIMEOUTS      0
START TIME    Fri Jul 19 15:07:16 UTC 2024
END TIME      Fri Jul 19 15:07:46 UTC 2024
AVG DURATION  30000ms
IOPS          193966 req/sec
THROUGHPUT    2.96GiB/sec
LATENCY       P50    P90    P99    P999    MAX
              319us  383us  735us  1023us  6399us

NAME        8K Network Throughput Test
INFO        Test performed against node: 1
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        61612 req/sec
THROUGHPUT  3.76Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            159us  207us  303us  431us  1151us

NAME        8K Network Throughput Test
INFO        Test performed against node: 2
TYPE        network
TEST ID     5e4052f3-b828-4c0d-8fd0-b34ff0b6c35d
TIMEOUTS    0
DURATION    5000ms
IOPS        60306 req/sec
THROUGHPUT  3.68Gib/sec
LATENCY     P50    P90    P99    P999   MAX
            159us  215us  351us  495us  11263us

NAME          Cloud Storage Test
INFO          Put
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  8ms

NAME          Cloud Storage Test
INFO          List
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  1ms

NAME          Cloud Storage Test
INFO          Get
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  1ms

NAME          Cloud Storage Test
INFO          Head
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  0ms

NAME          Cloud Storage Test
INFO          Delete
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  1ms

NAME          Cloud Storage Test
INFO          Plural Delete
TYPE          cloud
TEST ID       a349685a-ee49-4141-8390-c302975db3a5
TIMEOUTS      0
START TIME    Tue Jul 16 18:06:30 UTC 2024
END TIME      Tue Jul 16 18:06:30 UTC 2024
AVG DURATION  47ms
```
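
For a quick comparison across tests, the plain-text results can be condensed to one line per test. The following sketch assumes the label layout shown in the example above (`NAME`, `INFO`, `TYPE`, `IOPS`, `THROUGHPUT` at the start of a line); `summarize_self_test` is a hypothetical helper. Cloud storage tests report no throughput, so they are skipped.

```bash
# Summarize self-test status text: one line per test with type, name,
# info, IOPS, and throughput. Reads the status output on stdin.
# Assumption: labels appear at the start of each line as in the example.
summarize_self_test() {
  awk '
    # Return the current line without its leading label.
    function rest(   s) { s = $0; sub(/^[A-Z]+ +/, "", s); return s }
    $1 == "NAME"       { name = rest() }
    $1 == "INFO"       { info = rest() }
    $1 == "TYPE"       { type = $2 }
    $1 == "IOPS"       { iops = $2 }
    $1 == "THROUGHPUT" { printf "%s | %s | %s | %s req/sec | %s\n", type, name, info, iops, $2 }
  '
}
```

For example, `rpk cluster self-test status | summarize_self_test` turns the first result above into `disk | 512KB sequential r/w | write run (iodepth: 4, dsync: true) | 1182 req/sec | 591.4MiB/sec`.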

> 📝 **NOTE**
>
> If self-test returns write results that are unexpectedly and significantly lower than read results, it may be because the Redpanda `rpk` client hardcodes the `DSync` option to `true`. When `DSync` is enabled, files are opened with the `O_DSYNC` flag set, and this represents the actual setting that Redpanda uses when it writes to disk.

### [](#stop-self-test)Stop self-test

To stop a running self-test, run the `self-test stop` command.

```bash
rpk cluster self-test stop
```

Example stop output:

```console
$ rpk cluster self-test stop
All self-test jobs have been stopped
```

For command help, run `rpk cluster self-test stop -h`. For additional command flags, see the [rpk cluster self-test stop](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-cluster/rpk-cluster-self-test-stop/) reference.

For more details about self-test, including command flags, see [rpk cluster self-test](https://docs.redpanda.com/streaming/current/reference/rpk/rpk-cluster/rpk-cluster-self-test/).

## [](#next-steps)Next steps

Learn how to [resolve common errors](https://docs.redpanda.com/streaming/current/troubleshoot/errors-solutions/k-resolve-errors/).