Skip to main content
Version: 22.2

Internal Metrics

This page describes the original metrics available on <node ip>:9644/metrics. These are provided for internal support, troubleshooting, and backward compatibility.

For information about the latest metrics available on :9644/public_metrics, see Monitoring.

notes

The /public_metrics endpoint exports different metrics than the /metrics endpoint.

Use /public_metrics for your primary dashboards for system health.

Use /metrics for detailed analysis and debugging.

Configure Prometheus: internal metrics

Prometheus is a systems monitoring and alerting tool. It collects and stores metrics as time-series data identified by metric name and key/value pairs.

Redpanda exports Prometheus metrics on :9644/public_metrics, which has been available since v22.2. To generate the Redpanda configuration on an existing Prometheus instance for both the /public_metrics endpoint as well as the /metrics endpoint, run:

rpk generate prometheus-config --job-name redpanda-metrics-test --node-addrs 'localhost:9644' --internal-metrics

The output is a YAML object that you can add to the scrape_configs list.

- job_name: redpanda-node
static_configs:
- targets:
- 172.31.18.239:9644
- 172.31.18.238:9643
- 172.31.18.237:9642

Edit the prometheus.yml file in the Prometheus root folder to add the Redpanda configuration under the scrape_configs:

scrape_configs:

- job_name: redpanda
static_configs:
- targets:
- 172.31.18.239:9644
- 172.31.18.238:9643
- 172.31.18.237:9642

The number of targets may change depending on the total number of running nodes.

Save the configuration file, and restart Prometheus to apply changes.

note

Configure Grafana: internal metrics

Now that you have the metrics, you need a tool to query, visualize, and generate alerts. Grafana talks to Prometheus and lets you create dashboards with multiple graphic components.

To generate a comprehensive Grafana dashboard, run:

rpk generate grafana-dashboard --datasource <name> --metrics-endpoint <url>

where:

  • <name> is the name of the Prometheus data source configured in your Grafana instance.
  • <url> is the address to a Redpanda node's metrics endpoint. The default is <node ip>:9644/metrics.

See details in the rpk command reference for Grafana.

Out of the box, Grafana generates panels tracking latency for 50%, 95%, and 99% (based on the maximum latency set), throughput, and error segmentation by type.

Pipe the command's output to a file, and import it into Grafana.

rpk generate grafana-dashboard \
--datasource prometheus \
--metrics-endpoint 172.31.18.237:9642/metrics > redpanda-dashboard.json

In Grafana, import this generated JSON file. The default Grafana dashboard shows information about available Redpanda nodes, partitions, latency, and throughput graphics.

You can use the imported dashboard to create new panels.

  1. Click + in the left pane, and select Add a new panel.
  2. On the Query tab, select Prometheus data source.
  3. Decide which metric you want to monitor, click Metrics browser, and start typing vectorized to show available metrics from the Redpanda cluster.

Another way to see available metrics with their description is to access the cluster using a web browser on port 9644. (The port can change depending on your configuration.) You can dynamically set queries to calculate the values in the format you want.

For example, to display the total number of partitions in your cluster, run this query on Grafana:

count(count by (topic,partition) 
(vectorized_storage_log_partition_size{namespace="kafka"}))

To show the number of partitions for a specific topic, run:

count(count by (topic,partition) 
(vectorized_storage_log_partition_size{topic="<topic_name>"}))

Enable statistics

Redpanda ships with an additional systemd service that runs periodically and reports resource usage and configuration data to Redpanda's metrics API. This is enabled by default, and the data is anonymous.

To identify your cluster's data, so Redpanda can monitor it and alert you to possible issues, set the organization (your company domain) and cluster_id (usually your team or project name) configuration fields. For example:

rpk config set organization 'vectorized.io'
rpk config set cluster_id 'us-west-2'

To opt out of all metrics reporting, set rpk.enable_usage_stats to false:

rpk config set rpk.enable_usage_stats false

Monitor internal metrics

Most metrics are useful for debugging, but the following metrics can be useful to measure system health:

MetricDefinitionDiagnostics
vectorized_application_uptimeRedpanda uptime in milliseconds
vectorized_cluster_partition_last_stable_offsetLast stable offsetIf this is the last record received by the cluster, then the cluster is up-to-date and ready for maintenance
vectorized_io_queue_delayTotal delay time in the queueCan indicate latency caused by disk operations in seconds
vectorized_io_queue_queue_lengthNumber of requests in the queueCan indicate latency caused by disk operations
vectorized_kafka_rpc_active_connectionskafka_rpc: Currently active connectionsShows the number of clients actively connected
vectorized_kafka_rpc_connectskafka_rpc: Number of accepted connectionsCompare to the value at a previous time to derive the rate of accepted connections
vectorized_kafka_rpc_received_byteskafka_rpc: Number of bytes received from the clients in valid requestsCompare to the value at a previous time to derive the throughput in Kafka layer in bytes/sec received
vectorized_kafka_rpc_requests_completedkafka_rpc: Number of successful requestsCompare to the value at a previous time to derive the messages per second per shard
vectorized_kafka_rpc_requests_pendingkafka_rpc: Number of requests being processed by server
vectorized_kafka_rpc_sent_byteskafka_rpc: Number of bytes sent to clients
vectorized_kafka_rpc_service_errorskafka_rpc: Number of service errors
vectorized_raft_leadership_changesNumber of leadership changesHigh value can indicate nodes failing and causing leadership changes
vectorized_reactor_utilizationCPU utilizationShows the true utilization of the CPU by Redpanda process
vectorized_storage_log_compacted_segmentNumber of compacted segments
vectorized_storage_log_log_segments_createdNumber of created log segments
vectorized_storage_log_partition_sizeCurrent size of partition in bytes
vectorized_storage_log_read_bytesTotal number of bytes read
vectorized_storage_log_written_bytesTotal number of bytes written

What do you like about this doc?




Optional: Share your email address if we can contact you about your feedback.

Let us know what we do well: