Public Metrics Reference
This section provides reference descriptions about the public metrics exported from Redpanda’s /public_metrics
endpoint.
Use /public_metrics for your primary dashboards for system health. Use /metrics for detailed analysis and debugging. |
In a live system, Redpanda metrics are exported only for features that are in use. For example, a metric for consumer groups is not exported when no groups are registered. To see the available public metrics in your system, query the
bash |
Cluster metrics
redpanda_cluster_brokers
Number of configured, fully commissioned brokers in a cluster.
Type: gauge
How to monitor: create an alert for when this gauge dips below a steady-state threshold, as a node(s) has become unresponsive.
redpanda_cluster_controller_log_limit_requests_available_rps
Limit on the available requests per second (RPS) for a cluster controller log.
Type: gauge
Labels:
-
redpanda_cmd_group=("move_operations" | "topic_operations" | "configuration_operations" | "node_management_operations" | "acls_and_users_operations")
redpanda_cluster_controller_log_limit_requests_dropped
Number of requests dropped by a cluster controller log due to exceeding redpanda_cluster_controller_log_limit_requests_available_rps.
Type: counter
Labels:
-
redpanda_cmd_group=("move_operations" | "topic_operations" | "configuration_operations" | "node_management_operations" | "acls_and_users_operations")
Usage: when this counter increases, it indicates that the controller is dropping requests.
redpanda_cluster_partition_moving_from_node
Number of partition replicas in the cluster that are currently being removed from a node.
Type: gauge
Usage: when this gauge is non-zero, determine whether there is an expected or unexpected reassignment of partitions that is causing movement of partition replicas.
redpanda_cluster_partition_moving_to_node
Number of partition replicas in the cluster that are currently being added or moved to a node.
Type: gauge
Usage: when this gauge is non-zero, determine whether there is an expected or unexpected reassignment of partitions that is causing movement of partition replicas.
redpanda_cluster_partition_node_cancelling_movements
During a partition movement cancellation operation, the number of partition replicas that were being moved that now need to be cancelled.
Type: gauge
How to monitor: when this gauge is non-zero, determine whether there is an expected or unexpected reassignment of partitions that is causing movement of partition replicas.
redpanda_cluster_partitions
Number of partitions managed by a cluster. Includes partitions of the controller topic, but not replicas.
Type: gauge
redpanda_cluster_unavailable_partitions
Number of unavailable partitions (the partitions that lack quorum among their replica group) in the cluster.
Type: gauge
Usage: when this gauge is non-zero, it indicates that a partition(s) doesn’t have quorum and thus doesn’t have an active leader. To mitigate, consider increasing the number of brokers and the partition replication factor.
Infrastructure metrics
redpanda_io_queue_total_read_ops
Total read operations passed in the I/O queue.
Type: counter
Labels:
-
class=("default" | "compaction" | "raft")
-
ioshard
-
mountpoint
-
shard
redpanda_io_queue_total_write_ops
Total write operations passed in the I/O queue.
Type: counter
Labels:
-
class=("default" | "compaction" | "raft")
-
ioshard
-
mountpoint
-
shard
redpanda_memory_available_memory
Total memory potentially available (free plus reclaimable memory) to a CPU shard (core), in bytes.
Type: gauge
Labels:
-
shard
redpanda_memory_available_memory_low_water_mark
The low watermark for available memory at process start.
Type: gauge
Labels:
-
shard
redpanda_rpc_active_connections
The total number of currently-active clients the RPC server on a given shard has connections to.
Type: gauge
Labels:
-
redpanda_server=("kafka" | "internal")
redpanda_rpc_request_errors_total
Number of RPC errors.
Type: counter
Labels:
-
redpanda_server=("kafka" | "internal")
Usage: when this counter increases, analyze the logged errors.
redpanda_rpc_request_latency_seconds
RPC latency in seconds.
Type: histogram
Labels:
-
redpanda_server=("kafka" | "internal")
redpanda_scheduler_runtime_seconds_total
Accumulated runtime of the task queue associated with a scheduling group.
Type: counter
Labels:
-
redpanda_scheduling_group=("admin" | "archival_upload" | "cache_background_reclaim" | "cluster" | "coproc" | "kafka" | "log_compaction" | "main" | "node_status" | "raft" | "raft_learner_recovery")
-
shard
Service metrics
redpanda_pandaproxy_request_latency_seconds
Latency in seconds of the request indicated by the label in HTTP Proxy. The measurement includes the time spent waiting for resources to become available, processing the request, and dispatching the response.
Type: histogram
Partition metrics
redpanda_kafka_max_offset
The high watermark offset of a partition. Suitable for calculating consumer group lag.
Type: gauge
Labels:
-
redpanda_namespace
-
redpanda_partition
-
redpanda_topic
Related topics:
redpanda_kafka_under_replicated_replicas
Number of replicas in the partition that are live but not at the latest offset, redpanda_kafka_max_offset.
Type: gauge
Labels:
-
redpanda_namespace
-
redpanda_partition
-
redpanda_topic
Topic metrics
redpanda_cluster_partition_schema_id_validation_records_failed
Number of records that failed schema ID validation.
Type: counter
redpanda_kafka_partitions
Configured number of partitions for a topic.
Type: gauge
Labels:
-redpanda_namespace
-redpanda_topic
redpanda_kafka_replicas
The number of configured replicas per topic.
Type: gauge
Labels:
-
redpanda_namespace
-
redpanda_topic
Consumer group metrics
Cloud metrics
redpanda_cloud_client_backoff
Total number of requests that backed off.
Type: counter
Labels:
-
S3
-
redpanda_endpoint
-
redpanda_region
-
-
Azure Blob Storage (ABS)
-
redpanda_endpoint
-
redpanda_storage_account
-
redpanda_cloud_client_download_backoff
Total number of download requests that backed off.
Type: counter
Labels:
-
S3
-
redpanda_endpoint
-
redpanda_region
-
-
Azure Blob Storage (ABS)
-
redpanda_endpoint
-
redpanda_storage_account
-
redpanda_cloud_client_downloads
Total number of requests that downloaded an object from cloud storage.
Type: counter
Labels:
-
S3
-
redpanda_endpoint
-
redpanda_region
-
-
Azure Blob Storage (ABS)
-
redpanda_endpoint
-
redpanda_storage_account
-
redpanda_cloud_client_not_found
Total number of requests for which the object was not found.
Type: counter
Labels:
-
S3
-
redpanda_endpoint
-
redpanda_region
-
-
Azure Blob Storage (ABS)
-
redpanda_endpoint
-
redpanda_storage_account
-
TLS metrics
redpanda_tls_truststore_expires_at_timestamp_seconds
The expiration time, represented as a Unix-style timestamp, of the shortest-lived certificate authority (CA) in the installed CA chain. This lists all resources that have at least one CA configured.
Type: gauge
Labels:
-
area
-
detail
Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area
and detail
labels provide a hierarchical way of identifying a single resource you need information on.
redpanda_tls_certificate_expires_at_timestamp_seconds
The expiration time, represented as a Unix-style timestamp, for the shortest-lived installed certificate. While more than one certificate may be installed for a resource, this provides only the timestamp of the next certificate to expire. This lists all resources that have at least one certificate installed.
Type: gauge
Labels:
-
area
-
detail
Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area
and detail
labels provide a hierarchical way of identifying a single resource you need information on.
redpanda_tls_certificate_serial
The serial number of the shortest-lived installed certificate. These serial numbers are technically unbounded. As such, this will return the least significant 4B digits of the integer representation certificate’s serial number. While more than one certificate may be installed for a resource, this provides only the serial number of the next certificate to expire. This lists all resources that have at least one certificate installed.
Type: gauge
Labels:
-
area
-
detail
Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area
and detail
labels provide a hierarchical way of identifying a single resource you need information on.
redpanda_tls_loaded_at_timestamp_seconds
The last time, represented as a Unix-style timestamp, that at least one TLS certificate was loaded for a resource. This lists all resources that have at least one certificate installed.
Type: gauge
Labels:
-
area
-
detail
Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area
and detail
labels provide a hierarchical way of identifying a single resource you need information on.
redpanda_tls_certificate_valid
Indicates whether a given resource has at least one valid installed TLS certificate. Querying this will provide a check on all resources with at least one certificate installed. The value of this gauge is either 1
to indicate presence of a valid certificate or 0
to indicate the absence of a valid certificate.
Type: gauge
Labels:
-
area
-
detail
Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area
and detail
labels provide a hierarchical way of identifying a single resource you need information on.
Data transforms metrics
redpanda_transform_execution_latency_sec
A histogram of the latency of transforming a single record, in seconds.
Type: histogram
Labels:
-
function_name
redpanda_transform_execution_errors
Data transform invocation errors.
Type: counter
Labels:
-
function_name
redpanda_wasm_engine_cpu_seconds_total
Total CPU time spent inside a WebAssembly function, in seconds.
Type: counter
Labels:
-
function_name
redpanda_wasm_engine_memory_usage
Amount of memory usage for a WebAssembly function.
Type: gauge
Labels:
-
function_name
redpanda_wasm_engine_max_memory
Maximum amount of memory for a WebAssembly function.
Type: gauge
Labels:
-
function_name
redpanda_wasm_binary_executable_memory_usage
The amount of executable memory usage for WebAssembly binaries.
Type: gauge
redpanda_transform_read_bytes
A counter for all the bytes that has been input to a transform.
Type: counter
Labels:
-
function_name
redpanda_transform_write_bytes
A counter for all the bytes that has been output from a transform.
Type: counter
Labels:
-
function_name
redpanda_transform_processor_lag
The amount of pending records on the input topic that have yet to be processed by the transform.
Type: gauge
Labels:
-
function_name