# Monitor Redpanda in Kubernetes

---
title: Monitor Redpanda in Kubernetes
latest-redpanda-tag: v25.1.1
latest-console-tag: v3.7.3
latest-operator-version: v26.1.4
# EOL = End-of-Life (support lifecycle status)
page-is-nearing-eol: "false"
page-is-past-eol: "true"
page-eol-date: April 7, 2026
latest-connect-version: 4.93.0
docname: kubernetes/monitoring/k-monitor-redpanda
page-component-name: streaming
page-version: "25.1"
page-component-version: "25.1"
page-component-title: Streaming
page-relative-src-path: kubernetes/monitoring/k-monitor-redpanda.adoc
page-edit-url: https://github.com/redpanda-data/docs/edit/v/25.1/modules/manage/pages/kubernetes/monitoring/k-monitor-redpanda.adoc
description: Monitor the health of your system to predict issues and optimize performance.
page-git-created-date: "2024-01-04"
page-git-modified-date: "2024-03-28"
support-status: past end-of-life
---

<!-- Source: https://docs.redpanda.com/streaming/25.1/manage/kubernetes/monitoring/k-monitor-redpanda.md -->

Redpanda exports metrics through two endpoints on the Admin API port (default: 9644) so that you can monitor system health and optimize system performance.

> 💡 **TIP**
>
> Use [`/public_metrics`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/) as the source for your primary system-health dashboards. These metrics have low cardinality and are designed for customer consumption, with aggregated labels for better performance. **Public metrics use the `redpanda_` prefix.**
>
> Use [/metrics](https://docs.redpanda.com/streaming/25.1/reference/internal-metrics-reference/) for detailed analysis and debugging. These metrics can have high cardinality with thousands of series, providing granular operational insights. **Internal metrics use the `vectorized_` prefix.**

The [`/metrics`](https://docs.redpanda.com/streaming/25.1/reference/internal-metrics-reference/) endpoint is a legacy endpoint that includes many internal metrics that a typical Redpanda user does not need to monitor. It is also referred to as the 'internal metrics' endpoint, and Redpanda recommends that you use it for development, testing, and analysis. In contrast, the [`/public_metrics`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/) endpoint provides a smaller set of important metrics that can be queried and ingested more quickly and inexpensively.
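
To get a rough feel for the difference in cardinality between the two endpoints, you can count the series that each one returns. The following Python sketch is illustrative only: the function name `count_series_by_prefix` and the sample payload are assumptions, and it expects text in the Prometheus exposition format that you have already fetched from a broker.

```python
def count_series_by_prefix(exposition_text, prefixes=("redpanda_", "vectorized_")):
    """Count sample lines per metric-name prefix in Prometheus text output."""
    counts = {prefix: 0 for prefix in prefixes}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        for prefix in prefixes:
            if line.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


# Hypothetical sample, shaped like a small /public_metrics response.
SAMPLE = """\
# HELP redpanda_uptime_seconds_total (help text omitted in this sample)
redpanda_uptime_seconds_total 1234.5
redpanda_cpu_busy_seconds_total{shard="0"} 10.0
redpanda_cpu_busy_seconds_total{shard="1"} 12.5
"""

print(count_series_by_prefix(SAMPLE))
```

To try it against a real broker, pass in the response body from `<broker-address>:9644/public_metrics` or `<broker-address>:9644/metrics` and compare the two totals.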

> 📝 **NOTE**
>
> To maximize monitoring performance by minimizing the cardinality of data, some metrics are exported only when their underlying features are in use. For example, a metric for consumer groups, [`redpanda_kafka_consumer_group_committed_offset`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_consumer_group_committed_offset), is not exported when no groups are registered.
>
> When monitoring internal metrics, consider enabling [`aggregate_metrics`](https://docs.redpanda.com/streaming/25.1/reference/properties/cluster-properties/#aggregate_metrics) to reduce the cardinality of data to monitor.

This topic covers the following aspects of monitoring Redpanda metrics:

-   [Configure Prometheus to monitor Redpanda metrics](#configure-prometheus)

-   [Generate Grafana dashboard](#generate-grafana-dashboard)

-   [Learn from examples in the Redpanda monitoring examples repository](#use-redpanda-monitoring-examples)

-   [Metrics and queries to monitor for system performance and health](#monitor-for-performance-and-health)

-   [References of public and internal metrics](#references)


## [](#configure-prometheus)Configure Prometheus

[Prometheus](https://prometheus.io/) is a system monitoring and alerting tool. It collects and stores metrics as time-series data identified by a metric name and key/value pairs.

To configure Prometheus to monitor Redpanda metrics in Kubernetes, you can use the [Prometheus Operator](https://prometheus-operator.dev/):

1.  Follow the steps to [deploy the Prometheus Operator](https://prometheus-operator.dev/docs/getting-started/installation/).

    Make sure to configure the Prometheus resource to target your Redpanda cluster:

    `prometheus.yaml`

    ```yaml
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
    spec:
      serviceAccountName: prometheus
      serviceMonitorNamespaceSelector:
        matchLabels:
          name: <namespace>
      serviceMonitorSelector:
        matchLabels:
          app.kubernetes.io/name: redpanda
      resources:
        requests:
          memory: 400Mi
      enableAdminAPI: false
    ```

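    -   `serviceMonitorNamespaceSelector.matchLabels.name`: The value of the `name` label on the namespace in which you will deploy Redpanda. Because this field is a label selector, the namespace must carry a matching `name` label. The Prometheus Operator looks for ServiceMonitor resources in the namespaces that match.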

    -   `serviceMonitorSelector.matchLabels.app.kubernetes.io/name`: The value of `fullnameOverride` in your Redpanda Helm chart. The default is `redpanda`. The Redpanda Helm chart creates the ServiceMonitor resource with this label.


2.  Deploy Redpanda with monitoring enabled so that the ServiceMonitor resource is created:

    ### Operator

    `redpanda-cluster.yaml`

    ```yaml
    apiVersion: cluster.redpanda.com/v1alpha2
    kind: Redpanda
    metadata:
      name: redpanda
    spec:
      chartRef: {}
      clusterSpec:
        monitoring:
          enabled: true
          scrapeInterval: 30s
    ```

    ```bash
    kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
    ```


    ### Helm


    #### --values

    `prometheus-monitoring.yaml`

    ```yaml
    monitoring:
      enabled: true
      scrapeInterval: 30s
    ```

    ```bash
    helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
    --values prometheus-monitoring.yaml --reuse-values
    ```


    #### --set

    ```bash
    helm upgrade --install redpanda redpanda/redpanda \
      --namespace <namespace> \
      --create-namespace \
      --set monitoring.enabled=true \
      --set monitoring.scrapeInterval="30s"
    ```

3.  Wait until all Pods are running:

    ```bash
    kubectl -n <namespace> rollout status statefulset redpanda --watch
    ```

4.  Ensure that the ServiceMonitor was deployed:

    ```bash
    kubectl get servicemonitor --namespace <namespace>
    ```

5.  Ensure that you’ve [exposed the Prometheus Service](https://prometheus-operator.dev/docs/user-guides/getting-started/#exposing-the-prometheus-service).

6.  Expose the Prometheus server to your localhost:

    ```bash
    kubectl port-forward svc/prometheus 9090
    ```

7.  [Open Prometheus](http://localhost:9090/graph), and verify that Prometheus is scraping metrics from your Redpanda endpoints.
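
As an alternative to checking the UI, you could confirm the scrape status through the Prometheus HTTP API, which lists active targets at `/api/v1/targets`. The following Python sketch assumes that Prometheus is reachable at `http://localhost:9090` through the port-forward from the previous steps and that the `job` label of the Redpanda targets contains `redpanda`. Both are assumptions about your setup, and the helper names and sample payload are hypothetical.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Return the decoded JSON body of Prometheus's /api/v1/targets endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


def summarize_targets(payload, job_substring="redpanda"):
    """Map scrape URL -> health for active targets whose job label matches."""
    summary = {}
    for target in payload.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job", "")
        if job_substring in job:
            url = target.get("scrapeUrl", "<unknown>")
            summary[url] = target.get("health", "unknown")
    return summary


# Hypothetical payload, trimmed to the fields that this sketch reads.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "redpanda"},
                "scrapeUrl": "http://10.0.0.1:9644/public_metrics",
                "health": "up",
            },
            {
                "labels": {"job": "kubelet"},
                "scrapeUrl": "http://10.0.0.2:10250/metrics",
                "health": "down",
            },
        ]
    },
}

print(summarize_targets(SAMPLE_PAYLOAD))
```

With the port-forward running, `summarize_targets(fetch_targets())` should return one entry per Redpanda endpoint, and any value other than `up` points at a target to investigate.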


## [](#generate-grafana-dashboard)Generate Grafana dashboard

[Grafana](https://grafana.com/oss/grafana/) is a tool to query, visualize, and generate alerts for metrics.

Redpanda supports generating Grafana dashboards from its metrics endpoints with `rpk generate grafana-dashboard`.

To generate a comprehensive Grafana dashboard, run the following command and pipe the output to a file that can be imported into Grafana:

```bash
rpk generate grafana-dashboard --datasource <name> --metrics-endpoint <url> > <output-file>
```

-   `<name>` is the name of the Prometheus data source configured in your Grafana instance.

-   `<url>` is the address of a Redpanda broker’s metrics endpoint (public or internal).

-   For example, to generate a dashboard from `/public_metrics`, run the following command:

    ```bash
    rpk generate grafana-dashboard \
      --datasource prometheus \
      --metrics-endpoint <broker-address>:9644/public_metrics > redpanda-dashboard.json
    ```

-   For example, to generate a dashboard from `/metrics`, run the following command:

    ```bash
    rpk generate grafana-dashboard \
      --datasource prometheus \
      --metrics-endpoint <broker-address>:9644/metrics > redpanda-dashboard.json
    ```


For details about the command, see [`rpk generate grafana-dashboard`](https://docs.redpanda.com/streaming/25.1/reference/rpk/rpk-generate/rpk-generate-grafana-dashboard/).

In Grafana, import the generated JSON file to create a dashboard. Out of the box, the dashboard includes panels that track latency at the 50th, 95th, and 99th percentiles (based on the maximum latency set), throughput, and errors segmented by type.

To use the imported dashboard to create new panels:

1.  Click **+** in the left pane, and select **Add a new panel**.

2.  On the **Query** tab, select the **Prometheus** data source.

3.  Decide which metric you want to monitor, click **Metrics browser**, and type `redpanda` to show available public metrics (or `vectorized` for internal metrics) from the Redpanda cluster.


## [](#use-redpanda-monitoring-examples)Use Redpanda monitoring examples

For hands-on learning, Redpanda provides a repository with examples of monitoring Redpanda with Prometheus and Grafana: [redpanda-data/observability](https://github.com/redpanda-data/observability).

![Example Redpanda Ops Dashboard](https://github.com/redpanda-data/observability/blob/main/docs/images/Ops%20Dashboard.png?raw=true)

The repository includes [example Grafana dashboards](https://github.com/redpanda-data/observability#grafana-dashboards) and a [sandbox environment](https://github.com/redpanda-data/observability#sandbox-environment) in which you launch a Dockerized Redpanda cluster and create a custom workload to monitor with dashboards.

## [](#monitor-for-performance-and-health)Monitor for performance and health

This section provides guidelines and example queries using Redpanda’s public metrics to optimize your system’s performance and monitor its health.

To help detect and mitigate anomalous system behaviors, capture baseline metrics of your healthy system at different stages (at start-up, under high load, in steady state) so you can set thresholds and alerts according to those baselines.

> 💡 **TIP**
>
> For counter-type metrics, a broker restart resets the count to zero, which appears as a sudden drop in tools like Prometheus and Grafana. Redpanda recommends wrapping counter metrics in a rate query to account for broker restarts, for example:
>
> ```promql
> rate(redpanda_kafka_records_produced_total[5m])
> ```
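
To see why a rate query tolerates restarts, consider how a counter's increase can be computed when any drop in value is treated as a restart from zero. The following Python sketch is a simplified illustration, not Prometheus's implementation (the real `rate()` function also extrapolates toward the edges of the time window). The function names and sample values are hypothetical.

```python
def increase_with_resets(values):
    """Total increase of a counter, treating any drop as a restart from zero."""
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current < previous:
            # The counter went backwards, so assume it restarted at zero
            # and count everything accumulated since the restart.
            total += current
        else:
            total += current - previous
    return total


def per_second_rate(values, window_seconds):
    """Average per-second increase over the window."""
    return increase_with_resets(values) / window_seconds


# Hypothetical samples scraped every 60 seconds; the broker restarts
# between the third and fourth sample.
samples = [100, 150, 200, 20, 70]

naive_increase = samples[-1] - samples[0]
print(naive_increase, increase_with_resets(samples), per_second_rate(samples, 240))
```

Subtracting the first sample from the last gives a negative increase across the restart, while the reset-aware calculation counts the records produced both before and after it.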

### [](#redpanda-architecture)Redpanda architecture

Understanding the unique aspects of Redpanda’s architecture and data path can improve your performance, debugging, and tuning skills:

-   Redpanda replicates partitions across brokers in a cluster using [Raft](https://raft.github.io/), where each partition is a Raft consensus group. A message written from the Kafka API flows down to the Raft implementation layer that eventually directs it to a broker to be stored. Metrics about the Raft layer can reveal the health of partitions and data flowing within Redpanda.

-   Redpanda is designed with a [thread-per-core](https://docs.redpanda.com/streaming/25.1/reference/glossary/#thread-per-core) model that it implements with the [Seastar](https://seastar.io/) library. Because each application thread is pinned to a CPU core, when you observe or analyze the behavior of a specific application, monitor the relevant metrics with the label for the specific [shard](https://docs.redpanda.com/streaming/25.1/reference/glossary/#shard), if available.


### [](#infrastructure-resources)Infrastructure resources

The underlying infrastructure of your system should have sufficient margins to handle peaks in processing, storage, and I/O loads. Monitor infrastructure health with the following queries.

#### [](#cpu-usage)CPU usage

For the total CPU uptime, monitor [`redpanda_uptime_seconds_total`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_uptime_seconds_total). Monitoring its rate of change with the following query can help detect unexpected dips in uptime:

```promql
rate(redpanda_uptime_seconds_total[5m])
```

For the total CPU busy (non-idle) time, monitor [`redpanda_cpu_busy_seconds_total`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_cpu_busy_seconds_total).

To detect unexpected idling, query the rate of change, which gives the fraction of time (between 0 and 1) that each shard is busy. Multiply the result by 100 to express it as a percentage:

```promql
rate(redpanda_cpu_busy_seconds_total[5m])
```

> 💡 **TIP**
>
> While CPU utilization at the host-level might appear high (for example, 99-100% utilization) when I/O events like message arrival occur, the actual Redpanda process utilization is likely low. System-level metrics such as those provided by the `top` command can be misleading.
>
> This high host-level CPU utilization happens because Redpanda uses Seastar, which runs event loops on every core (also referred to as a _reactor_), constantly polling for the next task. This process never blocks and will increment clock ticks. It doesn’t necessarily mean that Redpanda is busy.
>
> Use [`redpanda_cpu_busy_seconds_total`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_cpu_busy_seconds_total) to monitor the actual Redpanda CPU utilization. When it indicates close to 100% utilization over a given period of time, make sure to also monitor produce and consume [latency](#latency) as they may then start to increase as a result of resources becoming overburdened.

#### [](#memory-allocated)Memory allocated

To monitor the percentage of memory allocated, use a formula with [`redpanda_memory_allocated_memory`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_memory_allocated_memory) and [`redpanda_memory_free_memory`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_memory_free_memory):

```promql
sum(redpanda_memory_allocated_memory) / (sum(redpanda_memory_free_memory) + sum(redpanda_memory_allocated_memory))
```
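
The query returns a ratio between 0 and 1, so multiply it by 100 for a percentage. As a worked example of the same arithmetic, the following Python sketch sums hypothetical per-series values before dividing. The function name and the byte counts are assumptions for illustration.

```python
GIB = 1024 ** 3


def memory_allocated_percent(allocated_bytes, free_bytes):
    """Percentage of memory allocated, summed across all reported series."""
    total_allocated = sum(allocated_bytes)
    total_free = sum(free_bytes)
    return 100.0 * total_allocated / (total_allocated + total_free)


# Hypothetical values for two series: one mostly allocated, one mostly free.
allocated = [3 * GIB, 1 * GIB]
free = [1 * GIB, 3 * GIB]

print(memory_allocated_percent(allocated, free))
```

Summing before dividing weights each series by its size, which matches the `sum(...) / (sum(...) + sum(...))` structure of the query.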

#### [](#disk-used)Disk used

To monitor the percentage of disk consumed, use a formula with [`redpanda_storage_disk_free_bytes`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_storage_disk_free_bytes) and [`redpanda_storage_disk_total_bytes`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_storage_disk_total_bytes):

```promql
1 - (sum(redpanda_storage_disk_free_bytes) / sum(redpanda_storage_disk_total_bytes))
```
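
Because this query sums across all series, it reports a cluster-wide ratio, and a single nearly full broker can be hidden behind brokers with plenty of free space. The following Python sketch shows the effect with hypothetical numbers. The broker names, sizes, and function names are assumptions for illustration.

```python
GB = 10 ** 9


def disk_used_fraction(free_bytes, total_bytes):
    """Fraction of disk consumed, mirroring 1 - (free / total)."""
    return 1.0 - free_bytes / total_bytes


def cluster_and_per_broker_usage(disks):
    """disks maps broker name -> (free_bytes, total_bytes)."""
    total_free = sum(free for free, _ in disks.values())
    total_size = sum(total for _, total in disks.values())
    per_broker = {
        name: disk_used_fraction(free, total)
        for name, (free, total) in disks.items()
    }
    return disk_used_fraction(total_free, total_size), per_broker


# Hypothetical cluster: two brokers are mostly empty, one is almost full.
SAMPLE_DISKS = {
    "broker-0": (900 * GB, 1000 * GB),
    "broker-1": (900 * GB, 1000 * GB),
    "broker-2": (50 * GB, 1000 * GB),
}

cluster_usage, per_broker_usage = cluster_and_per_broker_usage(SAMPLE_DISKS)
print(round(cluster_usage, 3), {k: round(v, 3) for k, v in per_broker_usage.items()})
```

In this example the cluster-wide figure is roughly 38% while one broker is at 95%, so it can be worth also charting the same ratio per broker, for example by grouping on a label such as `instance` if your series carry it.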

Also monitor [`redpanda_storage_disk_free_space_alert`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_storage_disk_free_space_alert) for an alert when available disk space is low or degraded.

#### [](#iops)IOPS

For read and write I/O operations per second (IOPS), monitor the [`redpanda_io_queue_total_read_ops`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_io_queue_total_read_ops) and [`redpanda_io_queue_total_write_ops`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_io_queue_total_write_ops) counters:

```promql
# Read IOPS
rate(redpanda_io_queue_total_read_ops[5m])

# Write IOPS
rate(redpanda_io_queue_total_write_ops[5m])
```

### [](#throughput)Throughput

While maximizing the rate of messages moving from producers to brokers then to consumers depends on tuning each of those components, the total throughput of all topics provides a system-level metric to monitor. When you observe abnormal, unhealthy spikes or dips in producer or consumer throughput, look for correlation with changes in the number of active connections ([`redpanda_rpc_active_connections`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_rpc_active_connections)) and logged errors to drill down to the root cause.

The total throughput of a cluster can be measured by the producer and consumer rates across all topics.

To observe the total producer and consumer rates of a cluster, monitor [`redpanda_kafka_request_bytes_total`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_request_bytes_total) with the `redpanda_request` label set to `produce` and `consume`, respectively.

#### [](#producer-throughput)Producer throughput

To get the total produce rate across all topics, use the following query:

```promql
sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="produce"}[5m])) by (redpanda_request)
```

#### [](#consumer-throughput)Consumer throughput

To get the total consume rate across all topics, use the following query:

```promql
sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="consume"}[5m])) by (redpanda_request)
```

### [](#latency)Latency

Latency should be consistent between produce and fetch sides. It should also be consistent over time. Take periodic snapshots of produce and fetch latencies, including at upper percentiles (95%, 99%), and watch out for significant changes over a short duration.

In Redpanda, the latency of produce and fetch requests includes the latency of the inter-broker RPCs that result from Redpanda’s internal Raft implementation.

#### [](#kafka-consumer-latency)Kafka consumer latency

To monitor Kafka consumer request latency, use the [`redpanda_kafka_request_latency_seconds`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_request_latency_seconds) histogram with the label `redpanda_request="consume"`. For example, create a query for the 99th percentile:

```promql
histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="consume"}[5m])) by (le, provider, region, instance, namespace, pod))
```
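
A percentile computed from a histogram is an estimate: the query can only see how many requests fell under each bucket boundary, so it interpolates within the bucket that contains the requested rank. The following Python sketch illustrates that interpolation in simplified form. It is not Prometheus's implementation, and the function name and bucket values are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q < 1:
        raise ValueError("q must be between 0 and 1 (exclusive)")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for index, (upper_bound, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest
                # finite upper bound instead of extrapolating.
                return buckets[index - 1][0]
            in_bucket = cumulative - count_below
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - count_below) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, count_below = upper_bound, cumulative
    return math.nan


# Hypothetical cumulative counts: 50 requests took <= 0.1 s, 90 took <= 0.5 s,
# and all 100 took <= 1.0 s.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

print(estimate_quantile(0.99, BUCKETS))
```

Here the 99th percentile lands in the 0.5 s to 1.0 s bucket and is reported as about 0.95 s, even though the true value could be anywhere in that range. Treat changes in the estimate over time as the signal rather than the absolute number.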

You can monitor the rate of Kafka consumer requests using `redpanda_kafka_request_latency_seconds_count` with the `redpanda_request="consume"` label:

`rate(redpanda_kafka_request_latency_seconds_count{redpanda_request="consume"}[5m])`

#### [](#kafka-producer-latency)Kafka producer latency

To monitor Kafka producer request latency, use the [`redpanda_kafka_request_latency_seconds`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_request_latency_seconds) histogram with the `redpanda_request="produce"` label. For example, create a query for the 99th percentile:

```promql
histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="produce"}[5m])) by (le, provider, region, instance, namespace, pod))
```

You can monitor the rate of Kafka producer requests using `redpanda_kafka_request_latency_seconds_count` with the `redpanda_request="produce"` label:

```promql
rate(redpanda_kafka_request_latency_seconds_count{redpanda_request="produce"}[5m])
```

#### [](#internal-rpc-latency)Internal RPC latency

To monitor Redpanda internal RPC latency, use the [`redpanda_rpc_request_latency_seconds`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_rpc_request_latency_seconds) histogram with the `redpanda_server="internal"` label. For example, create a query for the 99th percentile latency:

```promql
histogram_quantile(0.99, (sum(rate(redpanda_rpc_request_latency_seconds_bucket{redpanda_server="internal"}[5m])) by (le, provider, region, instance, namespace, pod)))
```

You can monitor the rate of internal RPC requests with the count of the [`redpanda_rpc_request_latency_seconds`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_rpc_request_latency_seconds) histogram:

```promql
rate(redpanda_rpc_request_latency_seconds_count[5m])
```

### [](#partition-health)Partition health

The health of Kafka partitions often reflects the health of the brokers that host them. Thus, when alerts occur for conditions such as under-replicated partitions or more frequent leadership transfers, check for unresponsive or unavailable brokers.

With Redpanda’s internal implementation of the Raft consensus protocol, the health of partitions is also reflected in any errors in the internal RPCs exchanged between Raft peers.

#### [](#leadership-changes)Leadership changes

Stable clusters have a consistent balance of leaders across all brokers, with few to no leadership transfers between brokers.

To observe changes in leadership, monitor the [`redpanda_raft_leadership_changes`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_raft_leadership_changes) counter. For example, use a query to get the total rate of increase of leadership changes for a cluster:

```promql
sum(rate(redpanda_raft_leadership_changes[5m]))
```

#### [](#under-replicated-partitions)Under-replicated partitions

A healthy cluster has partition data fully replicated across its brokers.

An under-replicated partition is at higher risk of data loss. It also adds latency because messages must be replicated before being committed. To know when a partition isn’t fully replicated, create an alert for the [`redpanda_kafka_under_replicated_replicas`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_under_replicated_replicas) gauge when it is greater than zero:

```promql
redpanda_kafka_under_replicated_replicas > 0
```

Under-replication can be caused by unresponsive brokers. When an alert on `redpanda_kafka_under_replicated_replicas` is triggered, identify the problem brokers and examine their logs.

#### [](#leaderless-partitions)Leaderless partitions

A healthy cluster has a leader for every partition.

A partition without a leader cannot exchange messages with producers or consumers. To identify when a partition doesn’t have a leader, create an alert for the [`redpanda_cluster_unavailable_partitions`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_cluster_unavailable_partitions) gauge when it is greater than zero:

```promql
redpanda_cluster_unavailable_partitions > 0
```

Leaderless partitions can be caused by unresponsive brokers. When an alert on `redpanda_cluster_unavailable_partitions` is triggered, identify the problem brokers and examine their logs.

#### [](#raft-rpcs)Raft RPCs

Redpanda’s Raft implementation exchanges periodic status RPCs between a broker and its peers. The [`redpanda_node_status_rpcs_timed_out`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_node_status_rpcs_timed_out) gauge increases when a status RPC to a peer times out. A timeout indicates that the peer may be unresponsive, which can lead to problems with the partition replication that Raft manages. Monitor for non-zero values of this gauge, and correlate them with any logged errors or changes in partition replication.

### [](#consumers)Consumer group lag

Consumer group lag is an important performance indicator that measures the difference between the broker’s latest (max) offset and the consumer group’s last committed offset. The lag indicates how current the consumed data is relative to real-time production. A high or increasing lag means that consumers are processing messages slower than producers are generating them. A decreasing or stable lag implies that consumers are keeping pace with producers, ensuring real-time or near-real-time data consumption.

By monitoring consumer lag, you can identify performance bottlenecks and make informed decisions about scaling consumers, tuning configurations, and improving processing efficiency.

A high maximum lag may indicate that a consumer is experiencing connectivity problems or cannot keep up with the incoming workload.

A high or increasing total lag (lag sum) suggests that the consumer group lacks sufficient resources to process messages at the rate they are produced. In such cases, scaling the number of consumers within the group can help, but only up to the number of partitions available in the topic. If lag persists despite increasing consumers, repartitioning the topic may be necessary to distribute the workload more effectively and improve processing efficiency.

Redpanda provides the following methods for monitoring consumer group lag:

-   [Dedicated gauges](#dedicated-gauges): Redpanda brokers can internally calculate consumer group lag and expose two dedicated gauges. This method is recommended for environments where your observability platform does not support complex queries required to calculate the lag from offset metrics.

    Enabling these gauges may add a small amount of additional processing overhead to the brokers.

-   [Offset-based calculation](#offset-based-calculation): You can use your observability platform to calculate consumer group lag from offset metrics. Use this method if your observability platform supports functions, such as `max()`, and you prefer to avoid additional processing overhead on the broker.


#### [](#dedicated-gauges)Dedicated gauges

Redpanda can internally calculate consumer group lag and expose it as two dedicated gauges.

-   [`redpanda_kafka_consumer_group_lag_max`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_consumer_group_lag_max): Reports the maximum lag observed among all partitions for a consumer group. This metric helps pinpoint the partition with the greatest delay, indicating potential performance or configuration issues.

-   [`redpanda_kafka_consumer_group_lag_sum`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_consumer_group_lag_sum): Aggregates the lag across all partitions, providing an overall view of data consumption delay for the consumer group.


To enable these dedicated gauges, you must enable consumer group metrics in your cluster properties. Add the following to your Redpanda configuration:

-   [`enable_consumer_group_metrics`](https://docs.redpanda.com/streaming/25.1/reference/properties/cluster-properties/#enable_consumer_group_metrics): A list of properties to enable for consumer group metrics. You must add the `consumer_lag` property to enable consumer group lag metrics.

-   [`consumer_group_lag_collection_interval_sec`](https://docs.redpanda.com/streaming/25.1/reference/properties/cluster-properties/#consumer_group_lag_collection_interval_sec) (optional): The interval in seconds for collecting consumer group lag metrics. The default is 60 seconds.

    Set this value equal to the scrape interval of your metrics collection system. Aligning these intervals ensures synchronized data collection, reducing the likelihood of missing or misaligned lag measurements.


For example:

##### Helm + Operator

`redpanda-cluster.yaml`

```yaml
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    config:
      cluster:
        enable_consumer_group_metrics:
          - group
          - partition
          - consumer_lag
```

```bash
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
```

##### Helm

###### --values

`enable-consumer-metrics.yaml`

```yaml
config:
  cluster:
    enable_consumer_group_metrics:
      - group
      - partition
      - consumer_lag
```

```bash
helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
--values enable-consumer-metrics.yaml --reuse-values
```

###### --set

```bash
helm upgrade --install redpanda redpanda/redpanda \
  --namespace <namespace> \
  --create-namespace \
  --set "config.cluster.enable_consumer_group_metrics[0]=group" \
  --set "config.cluster.enable_consumer_group_metrics[1]=partition" \
  --set "config.cluster.enable_consumer_group_metrics[2]=consumer_lag"
```

When these properties are enabled, Redpanda computes the `redpanda_kafka_consumer_group_lag_max` and `redpanda_kafka_consumer_group_lag_sum` gauges and exposes them on the `/public_metrics` endpoint.

#### [](#offset-based-calculation)Offset-based calculation

If your environment is sensitive to the performance overhead of the [dedicated gauges](#dedicated-gauges), use the offset-based calculation method to calculate consumer group lag. This method requires your observability platform to support functions like `max()`.

Redpanda provides two metrics to calculate consumer group lag:

-   [`redpanda_kafka_max_offset`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_max_offset): The broker’s latest offset for a partition.

-   [`redpanda_kafka_consumer_group_committed_offset`](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/#redpanda_kafka_consumer_group_committed_offset): The last committed offset for a consumer group on that partition.


For example, here’s a typical query to compute consumer lag:

```promql
max by(redpanda_namespace, redpanda_topic, redpanda_partition)(
  redpanda_kafka_max_offset{redpanda_namespace="kafka"}
)
- on(redpanda_topic, redpanda_partition) group_right
max by(redpanda_group, redpanda_topic, redpanda_partition)(
  redpanda_kafka_consumer_group_committed_offset
)
```
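
The arithmetic behind this query is a per-partition subtraction: for each consumer group, the lag on a partition is the broker's latest offset minus the group's committed offset. The following Python sketch reproduces that calculation on hypothetical data and then derives a maximum and a sum per group, the same two views that the dedicated gauges report. The function name, topic, groups, and offsets are assumptions for illustration.

```python
def consumer_group_lag(max_offsets, committed_offsets):
    """Return {group: {"max": ..., "sum": ...}} from per-partition offsets.

    max_offsets:       {(topic, partition): latest offset on the broker}
    committed_offsets: {(group, topic, partition): last committed offset}
    """
    lags = {}
    for (group, topic, partition), committed in committed_offsets.items():
        # Lag for one partition: latest offset minus committed offset.
        lag = max_offsets[(topic, partition)] - committed
        lags.setdefault(group, []).append(lag)
    return {
        group: {"max": max(values), "sum": sum(values)}
        for group, values in lags.items()
    }


# Hypothetical offsets for one topic with two partitions.
MAX_OFFSETS = {("orders", 0): 1000, ("orders", 1): 1200}
COMMITTED = {
    ("billing", "orders", 0): 990,
    ("billing", "orders", 1): 700,
    ("audit", "orders", 0): 1000,
}

print(consumer_group_lag(MAX_OFFSETS, COMMITTED))
```

In this example the `billing` group has a maximum lag of 500 on one partition and a total lag of 510, which suggests a single slow partition rather than a group that is behind everywhere.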

### [](#services)Services

Monitor the health of specific Redpanda services with the following metrics.

#### [](#schema-registry)Schema Registry

Schema Registry request latency:

```promql
histogram_quantile(0.99, (sum(rate(redpanda_schema_registry_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod)))
```

Schema Registry request rate:

```promql
rate(redpanda_schema_registry_request_latency_seconds_count[5m]) + sum without(redpanda_status)(rate(redpanda_schema_registry_request_errors_total[5m]))
```

Schema Registry request error rate:

```promql
rate(redpanda_schema_registry_request_errors_total[5m])
```

#### [](#rest-proxy)REST proxy

REST proxy request latency:

```promql
histogram_quantile(0.99, (sum(rate(redpanda_rest_proxy_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod)))
```

REST proxy request rate:

```promql
rate(redpanda_rest_proxy_request_latency_seconds_count[5m]) + sum without(redpanda_status)(rate(redpanda_rest_proxy_request_errors_total[5m]))
```

REST proxy request error rate:

```promql
rate(redpanda_rest_proxy_request_errors_total[5m])
```

### [](#data-transforms)Data transforms

See [Monitor Data Transforms](https://docs.redpanda.com/streaming/25.1/develop/data-transforms/monitor/).

## [](#references)References

-   [Public Metrics Reference](https://docs.redpanda.com/streaming/25.1/reference/public-metrics-reference/)

-   [Internal Metrics Reference](https://docs.redpanda.com/streaming/25.1/reference/internal-metrics-reference/)

-   [Redpanda monitoring examples repository](https://github.com/redpanda-data/observability)


## [](#suggested-reading)Suggested reading

-   [Monitoring Redpanda in Kubernetes (Day 2 Ops)](https://killercoda.com/redpanda/scenario/redpanda-k8s-day2)


## Suggested labs

-   [Owl Shop Example Application in Docker](https://docs.redpanda.com/labs/docker-compose/owl-shop/)
-   [Start a Single Redpanda Broker with Redpanda Console in Docker](https://docs.redpanda.com/labs/docker-compose/single-broker/)
-   [Start a Cluster of Redpanda Brokers with Redpanda Console in Docker](https://docs.redpanda.com/labs/docker-compose/three-brokers/)

[Search all labs](https://docs.redpanda.com/labs)