Monitoring

You can monitor the health of your system to optimize performance. The most useful metrics are exposed on each node at <node ip>:9644/public_metrics (available since v22.2). This section describes how to use Prometheus queries to monitor your system (for example, to check consumer group lag or Kafka producer request latency).

For information about the original metrics available on :9644/metrics, see Internal Metrics.

The /public_metrics endpoint exports different metrics than the /metrics endpoint.

Use /public_metrics for your primary system health dashboards.

Use /metrics for detailed analysis and debugging.

Configure Prometheus

Prometheus is a systems monitoring and alerting tool. It collects and stores metrics as time-series data identified by metric name and key/value pairs. Redpanda exports Prometheus metrics on :9644/public_metrics. To generate the scrape configuration for an existing Prometheus instance, run:

rpk generate prometheus-config

The output is a YAML object that you can add to the scrape_configs list.

- job_name: redpanda
  static_configs:
  - targets:
    - 172.31.18.239:9644
    - 172.31.18.238:9644
    - 172.31.18.237:9644
  metrics_path: /public_metrics

When you run the command on a node where Redpanda is running, the output lists the other available nodes. If Redpanda is not running, set the --seed-addr flag to specify a remote Redpanda node from which to discover the additional nodes, or set --node-addrs with a comma-separated list of all known cluster node addresses. For example:

rpk generate prometheus-config --job-name redpanda-metrics-test --node-addrs 172.31.18.239:9644,172.31.18.239:9643,172.31.18.239:9642

Edit the prometheus.yml file in the Prometheus root folder to add the Redpanda configuration under scrape_configs:

…
scrape_configs:
…
- job_name: redpanda-metrics-test
  static_configs:
    - targets:
        - 172.31.18.239:9644
        - 172.31.18.238:9644
        - 172.31.18.237:9644
  metrics_path: /public_metrics
…

The number of targets varies with the number of running nodes in your cluster.
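
Putting it together, a minimal complete prometheus.yml with only the Redpanda job might look like the following sketch. The global scrape_interval shown here is an example value, not a Redpanda requirement.

```
global:
  scrape_interval: 10s   # example value; tune to your environment

scrape_configs:
  - job_name: redpanda-metrics-test
    static_configs:
      - targets:
          - 172.31.18.239:9644
          - 172.31.18.238:9644
          - 172.31.18.237:9644
    metrics_path: /public_metrics
```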

Save the configuration file, and restart Prometheus to apply changes.

After Prometheus is scraping the metrics, you can use a tool like Grafana to query, visualize, and generate alerts. For example, to display the total number of partitions in your cluster, run this query in Grafana:

redpanda_cluster_partitions

To show the number of partitions for a specific topic, run:

count(count by (redpanda_topic, redpanda_partition) (redpanda_kafka_max_offset{redpanda_topic="<topic_name>"}))

Generate Grafana dashboard

Grafana is a tool to query, visualize, and generate alerts for metrics.

Redpanda supports generating Grafana dashboards from its metrics endpoints with rpk generate grafana-dashboard.

To generate a comprehensive Grafana dashboard, run the following command and redirect the output to a file that you can import into Grafana:

rpk generate grafana-dashboard --datasource <name> --metrics-endpoint <url> > <output-file>

  • <name> is the name of the Prometheus data source configured in your Grafana instance.

  • <url> is the address of a Redpanda node’s metrics endpoint (public or internal).

For example, for /public_metrics, run the following command:

```
rpk generate grafana-dashboard \
  --datasource prometheus \
  --metrics-endpoint <node-addr>:9644/public_metrics > redpanda-dashboard.json
```

For details about the command, refer to the rpk generate grafana-dashboard reference.

In Grafana, import the generated JSON file to create a dashboard. The generated dashboard includes panels that track latency (at the 50th, 95th, and 99th percentiles, based on the maximum latency configured), throughput, and errors segmented by type.
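
If you manage Grafana through configuration files rather than the UI, you can also load the generated JSON with Grafana’s dashboard provisioning mechanism. The following is a minimal sketch; the provider name, folder, and path are assumptions about your environment, not values required by Redpanda.

```
# e.g. /etc/grafana/provisioning/dashboards/redpanda.yaml (assumed location)
apiVersion: 1
providers:
  - name: redpanda          # illustrative provider name
    folder: Redpanda        # dashboards appear under this folder in Grafana
    type: file
    options:
      # Directory where you placed redpanda-dashboard.json (assumed path)
      path: /var/lib/grafana/dashboards
```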

You can use the imported dashboard to create new panels.

  1. Click + in the left pane, and select Add a new panel.

  2. On the Query tab, select the Prometheus data source.

  3. Decide which metric you want to monitor, click Metrics browser, and type redpanda to show available public metrics from the Redpanda cluster.

Enable statistics

Redpanda ships with an additional systemd service that runs periodically and reports resource usage and configuration data to Redpanda’s metrics API. This is enabled by default, and the data is anonymous.

To identify your cluster’s data so that Redpanda can monitor it and alert you to possible issues, set the organization (your company domain) and cluster_id (usually your team or project name) configuration fields. For example:

rpk config set organization 'vectorized.io'
rpk config set cluster_id 'us-west-2'

To opt out of all metrics reporting, set rpk.enable_usage_stats to false:

rpk config set rpk.enable_usage_stats false

Monitor metrics

Labels associate a metric with a Redpanda object. For example: redpanda_kafka_request_bytes_total{redpanda_topic="topic1", redpanda_request="consume"} 100 indicates that 100 total bytes were consumed for topic1.

Cluster-level metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_cluster_brokers | Number of configured brokers in the cluster (that is, the size of the cluster). | none | gauge |
| redpanda_cluster_partitions | Number of partitions managed by the cluster. This includes partitions of the controller topic, but not replicas. | none | gauge |
| redpanda_cluster_topics | Number of topics in the cluster. | none | gauge |
| redpanda_cluster_unavailable_partitions | Number of unavailable partitions in the cluster (that is, partitions that lack quorum among replicas). | none | gauge |

Cluster-level queries

Brokers in a cluster

redpanda_cluster_brokers

Topics in a cluster

redpanda_cluster_topics

Partitions in a cluster

redpanda_cluster_partitions

Unavailable partitions in a cluster

redpanda_cluster_unavailable_partitions

Kafka consumer rate

sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="consume"} [5m] )) by (redpanda_request)

Kafka producer rate

sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="produce"} [5m] )) by (redpanda_request)
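
To be notified when partitions lose quorum instead of only charting the condition, you can wrap the unavailable-partitions query in a Prometheus alerting rule. The following is a minimal sketch; the group name, alert name, hold duration, and severity label are illustrative, not Redpanda defaults.

```
groups:
  - name: redpanda-cluster                     # illustrative group name
    rules:
      - alert: RedpandaUnavailablePartitions   # illustrative alert name
        # Fires when any partition has lacked quorum for more than five minutes.
        expr: redpanda_cluster_unavailable_partitions > 0
        for: 5m
        labels:
          severity: critical                   # example severity; adjust to your conventions
        annotations:
          summary: "Redpanda cluster has unavailable partitions"
```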

Infrastructure-level metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_io_queue_total_read_ops | Total read operations passed in the queue. | class ("default" \| "compaction" \| "raft"), ioshard, mountpoint, shard | counter |
| redpanda_io_queue_total_write_ops | Total write operations passed in the queue. | class ("default" \| "compaction" \| "raft"), ioshard, mountpoint, shard | counter |
| redpanda_storage_disk_free_bytes | Free disk storage, in bytes. | none | gauge |
| redpanda_storage_disk_total_bytes | Total size of attached storage, in bytes. | none | gauge |
| redpanda_storage_disk_free_space_alert | Status of the low storage space alert: 0 = OK, 1 = low space, 2 = degraded. | none | gauge |
| redpanda_memory_allocated_memory | Allocated memory size, in bytes. | shard | gauge |
| redpanda_memory_free_memory | Free memory size, in bytes. | shard | gauge |
| redpanda_rpc_request_errors_total | Number of RPC errors. | redpanda_server ("kafka" \| "internal") | counter |
| redpanda_rpc_request_latency_seconds | RPC latency. | redpanda_server ("kafka" \| "internal") | histogram |

Infrastructure-level queries

Redpanda CPU

rate(redpanda_runtime_seconds_total[5m])
rate(redpanda_cpu_busy_seconds_total[5m])

Redpanda memory

redpanda_memory_allocated_memory / (redpanda_memory_free_memory + redpanda_memory_allocated_memory)

Redpanda disk usage

1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes)

Redpanda I/O queue

rate(redpanda_io_queue_total_read_ops[5m]),
rate(redpanda_io_queue_total_write_ops[5m])

Redpanda RPC latency

histogram_quantile(0.99, (sum(rate(redpanda_rpc_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod, redpanda_server)))

Redpanda RPC request rate

rate(redpanda_rpc_request_latency_seconds_count[5m])

Redpanda RPC error rate

rate(redpanda_rpc_request_errors_total[5m])

Cloud storage errors

rate(redpanda_cloud_storage_errors_total[5m])
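
Because redpanda_storage_disk_free_space_alert already encodes a low-space condition, a disk alert can key off it directly, and the disk usage ratio above can back a second threshold. The following is a minimal sketch; the rule names, thresholds, and severities are assumptions, not Redpanda defaults.

```
groups:
  - name: redpanda-infrastructure      # illustrative group name
    rules:
      - alert: RedpandaLowDiskSpace    # illustrative alert name
        # 0 = OK, 1 = low space, 2 = degraded (see the table above).
        expr: redpanda_storage_disk_free_space_alert > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redpanda is reporting low or degraded disk space"
      - alert: RedpandaDiskAlmostFull  # illustrative threshold of 85% used
        expr: 1 - (redpanda_storage_disk_free_bytes / redpanda_storage_disk_total_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redpanda data disk is more than 85% full"
```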

Service-level metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_pandaproxy_request_latency_seconds | Latency of the request indicated by the label in HTTP Proxy. The measurement includes the time spent waiting for resources to become available, processing the request, and dispatching the response. | shard, operation | histogram |
| redpanda_schema_registry_request_errors_total | Total number of Schema Registry server errors. | redpanda_status ("5xx" \| "4xx" \| "3xx") | counter |
| redpanda_schema_registry_request_latency_seconds | Latency of the request in the Schema Registry. The measurement includes the time spent waiting for resources to become available, processing the request, and dispatching the response. | none | histogram |

Service-level queries

Schema Registry request latency

histogram_quantile(0.99, (sum(rate(redpanda_schema_registry_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod)))

Schema Registry request rate

rate(redpanda_schema_registry_request_latency_seconds_count[5m]) + sum without(redpanda_status)(rate(redpanda_schema_registry_request_errors_total[5m]))

Schema Registry request error rate

rate(redpanda_schema_registry_request_errors_total[5m])

Partition-level metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_kafka_max_offset | Latest committed offset for the partition (that is, the offset of the last message safely persisted on most replicas). | redpanda_namespace, redpanda_topic, redpanda_partition | gauge |
| redpanda_kafka_under_replicated_replicas | Number of under-replicated replicas for the partition (that is, replicas that are live but not at the latest offset). | redpanda_namespace, redpanda_topic, redpanda_partition | gauge |

Partition-level queries

Max offset

redpanda_kafka_max_offset

Under-replicated replicas

redpanda_kafka_under_replicated_replicas > 0

Topic-level metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_kafka_replicas | Number of configured replicas per topic. | redpanda_namespace, redpanda_topic | gauge |
| redpanda_kafka_request_bytes_total | Total number of bytes produced or consumed per topic. | redpanda_namespace, redpanda_topic, redpanda_request ("produce" \| "consume") | counter |

Topic-level queries

Kafka consumer rate

sum by(redpanda_namespace,redpanda_topic)(rate(redpanda_kafka_request_bytes_total{redpanda_request="consume"}[5m]))

Kafka producer rate

sum by(redpanda_namespace,redpanda_topic)(rate(redpanda_kafka_request_bytes_total{redpanda_request="produce"}[5m]))

Partition count

sum(group(redpanda_kafka_max_offset) by (redpanda_namespace, redpanda_topic, redpanda_partition)) by (redpanda_namespace, redpanda_topic)

Broker-level metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_kafka_request_latency_seconds | Latency of produce/consume requests per broker. This measures from the moment a request is initiated on the partition to the moment the response is fulfilled. | redpanda_request ("produce" \| "consume") | histogram |

Broker-level queries

Kafka consumer request latency

histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="consume"}[5m])) by (le, provider, region, instance, namespace, pod))

Kafka producer request latency

histogram_quantile(0.99, sum(rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="produce"}[5m])) by (le, provider, region, instance, namespace, pod))

Rate of Kafka consumer requests

rate(redpanda_kafka_request_latency_seconds_count{redpanda_request="consume"}[5m])

Rate of Kafka producer requests

rate(redpanda_kafka_request_latency_seconds_count{redpanda_request="produce"}[5m])

Rate of Kafka consumers

sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="consume"} [5m])) by (provider, region, instance, namespace, pod)

Rate of Kafka producers

sum(rate(redpanda_kafka_request_bytes_total{redpanda_request="produce"}[5m])) by (instance, provider, region, namespace, pod)
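
To alert when produce latency degrades, you can reuse the p99 latency query above in an alerting rule. The following is a minimal sketch; the alert name, the 100 ms threshold, and the reduced grouping labels are illustrative, not Redpanda recommendations.

```
groups:
  - name: redpanda-broker-latency           # illustrative group name
    rules:
      - alert: RedpandaProduceLatencyHigh   # illustrative alert name
        # p99 produce latency over the last five minutes, per instance.
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (
              rate(redpanda_kafka_request_latency_seconds_bucket{redpanda_request="produce"}[5m])
            )
          ) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 Kafka produce latency is above 100 ms"
```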

Consumer group metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_kafka_consumer_group_committed_offset | Consumer group committed offset. | redpanda_group, redpanda_topic, redpanda_partition | gauge |
| redpanda_kafka_consumer_group_consumers | Number of consumers in a group. | redpanda_group | gauge |
| redpanda_kafka_consumer_group_topics | Number of topics in a group. | redpanda_group | gauge |

Consumer group queries

Consumers per consumer group

sum by(redpanda_group)(redpanda_kafka_consumer_group_consumers)

Topics per consumer group

sum by(redpanda_group)(redpanda_kafka_consumer_group_topics)

Consumer group lag

max by(redpanda_namespace, redpanda_topic, redpanda_partition)(redpanda_kafka_max_offset{redpanda_namespace="kafka"}) - on(redpanda_topic, redpanda_partition) group_right max by(redpanda_group, redpanda_topic, redpanda_partition)(redpanda_kafka_consumer_group_committed_offset)
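
Because the lag expression is long, it can be convenient to wrap it in a Prometheus recording rule and then graph or alert on the precomputed series. The following is a minimal sketch; the group and rule names are assumptions, not Redpanda conventions.

```
groups:
  - name: redpanda-consumer-lag               # illustrative group name
    rules:
      - record: redpanda:consumer_group_lag   # illustrative recorded series name
        # Latest committed offset per partition minus the offset committed by each group.
        expr: |
          max by (redpanda_namespace, redpanda_topic, redpanda_partition) (
            redpanda_kafka_max_offset{redpanda_namespace="kafka"}
          )
          - on (redpanda_topic, redpanda_partition) group_right
          max by (redpanda_group, redpanda_topic, redpanda_partition) (
            redpanda_kafka_consumer_group_committed_offset
          )
```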

REST proxy metrics

| Metric | Definition | Labels | Type |
| --- | --- | --- | --- |
| redpanda_rest_proxy_request_errors_total | Total number of REST proxy server errors. | redpanda_status ("5xx" \| "4xx" \| "3xx") | counter |
| redpanda_rest_proxy_request_latency_seconds | Internal latency of requests to the REST proxy. | none | histogram |

REST proxy queries

REST proxy request latency

histogram_quantile(0.99, (sum(rate(redpanda_rest_proxy_request_latency_seconds_bucket[5m])) by (le, provider, region, instance, namespace, pod)))

REST proxy request rate

rate(redpanda_rest_proxy_request_latency_seconds_count[5m]) + sum without(redpanda_status)(rate(redpanda_rest_proxy_request_errors_total[5m]))

REST proxy request error rate

rate(redpanda_rest_proxy_request_errors_total[5m])