Internal Metrics

This section provides reference descriptions about the internal metrics exported from Redpanda’s /metrics endpoint.

Use /public_metrics for your primary dashboards for system health.

Use /metrics for detailed analysis and debugging.

In a live system, Redpanda metrics are exported only for features that are in use. For example, a metric for consumer groups is not exported when no groups are registered.

To see the available internal metrics in your system, query the /metrics endpoint:

curl http://<node-addr>:9644/metrics | grep "[HELP|TYPE]"

Internal metrics

Most internal metrics are useful for debugging. The following subset of internal metrics can be useful to monitor system health.


vectorized_application_uptime

Redpanda uptime in milliseconds.


vectorized_cloud_storage_read_bytes

Number of bytes Redpanda has read from cloud storage. This is tracked on a per-topic and per-partition basis, and increments each time a cloud storage read operation successfully completes.


vectorized_cluster_partition_last_stable_offset

Last stable offset.

If this is the last record received by the cluster, then the cluster is up-to-date and ready for maintenance.


vectorized_cluster_features_enterprise_license_expiry_sec

Number of seconds remaining until the Enterprise Edition license expires.


vectorized_cluster_partition_schema_id_validation_records_failed

Number of records that failed schema ID validation.


vectorized_cluster_partition_start_offset

Raft snapshot start offset.


vectorized_io_queue_delay

Total delay time in the queue.

Can indicate latency caused by disk operations in seconds.


vectorized_io_queue_queue_length

Number of requests in the queue.

Can indicate latency caused by disk operations.


vectorized_kafka_quotas_balancer_runs

Number of times the throughput quota balancer has executed.

Type: counter


vectorized_kafka_quotas_quota_effective

Current effective quota for the quota balancer, in bytes per second.

Type: counter


vectorized_kafka_quotas_client_quota_throttle_time

Client quota throttling delay, in seconds, per rule and quota type based on client throughput limits.

Type: histogram

Labels:

  • quota_rule=("not_applicable" | "kafka_client_default" | "cluster_client_default" | "kafka_client_prefix" | "cluster_client_prefix" | "kafka_client_id")

  • quota_type=("produce_quota" | "fetch_quota" | "partition_mutation_quota")


vectorized_kafka_quotas_client_quota_throughput

Client quota throughput per rule and quota type.

Type: histogram

Labels:

  • quota_rule=("not_applicable" | "kafka_client_default" | "cluster_client_default" | "kafka_client_prefix" | "cluster_client_prefix" | "kafka_client_id")

  • quota_type=("produce_quota" | "fetch_quota" | "partition_mutation_quota")


vectorized_kafka_quotas_throttle_time

Histogram of throttle times, in seconds, based on broker-wide throughput limits.

Type: histogram


vectorized_kafka_quotas_traffic_intake

Total amount of Kafka traffic (in bytes) taken in from clients for processing that was considered by the throttler.

Type: counter


vectorized_kafka_quotas_traffic_egress

Total amount of Kafka traffic (in bytes) published to clients that was considered by the throttler.

Type: counter


vectorized_kafka_rpc_active_connections

Number of currently active Kafka RPC connections, or clients.


vectorized_kafka_rpc_connects

Number of accepted Kafka RPC connections.

Compare to the value at a previous time to derive the rate of accepted connections.


vectorized_kafka_rpc_produce_bad_create_time

An incrementing counter for the number of times a producer created a message with a timestamp skewed from the broker’s date and time. This metric is related to the following properties:

  • log_message_timestamp_alert_before_ms: Increment this gauge when the create_timestamp on a message is too far in the past as compared to the broker’s time.

  • log_message_timestamp_alert_after_ms: Increment this gauge when the create_timestamp on a message is too far in the future as compared to the broker’s time.


vectorized_kafka_rpc_received_bytes

Number of bytes received from Kafka RPC clients in valid requests.

Compare to the value at a previous time to derive the throughput in Kafka layer in bytes/sec received.


vectorized_kafka_rpc_requests_completed

Number of successful Kafka RPC requests.

Compare to the value at a previous time to derive the messages per second per shard.


vectorized_kafka_rpc_requests_pending

Number of Kafka RPC requests being processed by a server.


vectorized_kafka_rpc_sent_bytes

Number of bytes sent to Kafka RPC clients.


vectorized_kafka_rpc_service_errors

Number of Kafka RPC service errors.


vectorized_ntp_archiver_compacted_replaced_bytes

Number of bytes removed from cloud storage by compaction operations. This is tracked on a per-topic and per-partition basis.

This metric resets every time partition leadership changes. It tracks whether or not compaction is performing operations on cloud storage.

The namespace label supports the following options for this metric:

  • kafka - User topics

  • kafka_internal - Internal Kafka topic, such as consumer groups

  • redpanda - Redpanda-only internal data

Labels:

  • namespace=("kafka" | "kafka_internal" | "redpanda")

  • topic

  • partition


vectorized_ntp_archiver_pending

The difference between the last committed offset and the last offset uploaded to Tiered Storage for each partition. A value of zero for this metric indicates that all data for a partition is uploaded to Tiered Storage.

This metric is impacted by the cloud_storage_segment_max_upload_interval_sec tunable property. If this interval is set to 5 minutes, the archiver will upload committed segments to Tiered Storage every 5 minutes or less. If this metric continues growing for longer than the configured interval, it can indicate a potential network issue with the upload path for that partition.

The namespace label supports the following options for this metric:

  • kafka - User topics

  • kafka_internal - Internal Kafka topic, such as consumer groups

  • redpanda - Redpanda-only internal data

Labels:

  • namespace=("kafka" | "kafka_internal" | "redpanda")

  • topic

  • partition


vectorized_raft_leadership_changes

Number of leadership changes.

High value can indicate nodes failing and causing leadership changes.


vectorized_reactor_utilization

Redpanda process utilization.

Shows the true utilization of the CPU by a Redpanda process. This metric has per-broker and per-shard granularity. If a shard (CPU core) is at 100% utilization for a continuous period of real-time processing, for example more than a few seconds, you will likely observe high latency for partitions assigned to that shard. Use topic-aware intra-broker partition balancing to balance partition assignments and alleviate load on individual shards.


vectorized_storage_log_compacted_segment

Number of compacted segments.


vectorized_storage_log_compaction_removed_bytes

Number of bytes removed from local storage by compaction operations. This is tracked on a per-topic and per-partition basis. It tracks whether compaction is performing operations on local storage.

The namespace label supports the following options for this metric:

  • kafka - User topics

  • kafka_internal - Internal Kafka topic, such as consumer groups

  • redpanda - Redpanda-only internal data

Labels:

  • namespace=("kafka" | "kafka_internal" | "redpanda")

  • topic

  • partition


vectorized_storage_log_log_segments_created

Number of created log segments.


vectorized_storage_log_partition_size

Current size of partition in bytes.


vectorized_storage_log_read_bytes

Total number of bytes read.


vectorized_storage_log_written_bytes

Total number of bytes written.