Docs Self-Managed Reference Monitoring Metrics Public Metrics This is documentation for Self-Managed v24.1. To view the latest available version of the docs, see v24.2. Public Metrics This section provides reference descriptions about the public metrics exported from Redpanda’s /public_metrics endpoint. In a live system, Redpanda metrics are exported only for features that are in use. For example, Redpanda does not export metrics for consumer groups if no groups are registered. To see the available public metrics in your system, query the /public_metrics endpoint: curl http://<node-addr>:9644/public_metrics | grep "[HELP|TYPE]" Use /public_metrics for your primary dashboards for system health. Use /metrics for detailed analysis and debugging. Cluster metrics redpanda_cluster_brokers Number of configured, fully commissioned brokers in a cluster. Type: gauge How to monitor: Create an alert for when this gauge dips below a steady-state threshold, as a node(s) has become unresponsive. redpanda_cluster_controller_log_limit_requests_available_rps Limit on the available requests per second (RPS) for a cluster controller log. Type: gauge Labels: redpanda_cmd_group=("move_operations" | "topic_operations" | "configuration_operations" | "node_management_operations" | "acls_and_users_operations") redpanda_cluster_controller_log_limit_requests_dropped Number of requests dropped by a cluster controller log due to exceeding redpanda_cluster_controller_log_limit_requests_available_rps. Type: counter Labels: redpanda_cmd_group=("move_operations" | "topic_operations" | "configuration_operations" | "node_management_operations" | "acls_and_users_operations") Usage: When this counter increases, it indicates that the controller is dropping requests. redpanda_cluster_partition_moving_from_node Number of partition replicas in the cluster that are currently being removed from a node. Type: gauge Usage: When this gauge is non-zero, determine whether there is an expected or unexpected reassignment of partitions that is causing movement of partition replicas. redpanda_cluster_partition_moving_to_node Number of partition replicas in the cluster that are currently being added or moved to a node. Type: gauge Usage: When this gauge is non-zero, determine whether there is an expected or unexpected reassignment of partitions that is causing movement of partition replicas. redpanda_cluster_partition_node_cancelling_movements During a partition movement cancellation operation, the number of partition replicas that were being moved that now need to be cancelled. Type: gauge How to monitor: When this gauge is non-zero, determine whether there is an expected or unexpected reassignment of partitions that is causing movement of partition replicas. redpanda_cluster_partitions Number of partitions managed by a cluster. Includes partitions of the controller topic, but not replicas. Type: gauge redpanda_cluster_topics Number of topics in a cluster. Type: gauge redpanda_cluster_unavailable_partitions Number of unavailable partitions (the partitions that lack quorum among their replica group) in the cluster. Type: gauge Usage: When this gauge is non-zero, it indicates that a partition(s) doesn’t have quorum and thus doesn’t have an active leader. To mitigate, consider increasing the number of brokers and the partition replication factor. redpanda_cluster_latest_cluster_metadata_manifest_age The amount of time in seconds since the last time Redpanda uploaded metadata files to Tiered Storage for your cluster. A value of 0 indicates metadata has not yet been uploaded. When performing a whole cluster restore operation, the metadata for the new cluster will not have any changes made to a source cluster that is newer than this age. Type: gauge Usage: On a healthy system, this should not exceed the value set for cloud_storage_cluster_metadata_upload_interval_ms. You may consider setting an alert if this remains 0 for longer than 1.5 * cloud_storage_cluster_metadata_upload_interval_ms as that may indicate a configuration issue. Related topics: Whole Cluster Restore for Disaster Recovery Infrastructure metrics redpanda_cpu_busy_seconds_total Total CPU busy time in seconds. Type: counter redpanda_io_queue_total_read_ops Total read operations passed in the I/O queue. Type: counter Labels: class=("default" | "compaction" | "raft") ioshard mountpoint shard redpanda_io_queue_total_write_ops Total write operations passed in the I/O queue. Type: counter Labels: class=("default" | "compaction" | "raft") ioshard mountpoint shard redpanda_memory_allocated_memory Total allocated memory in bytes. Type: gauge Labels: shard redpanda_memory_available_memory Total memory potentially available (free plus reclaimable memory) to a CPU shard (core), in bytes. Type: gauge Labels: shard redpanda_memory_available_memory_low_water_mark The low watermark for available memory at process start. Type: gauge Labels: shard redpanda_memory_free_memory Available memory in bytes. Type: gauge Labels: shard redpanda_rpc_active_connections The total number of currently-active clients the RPC server on a given shard has connections to. Type: gauge Labels: redpanda_server=("kafka" | "internal") redpanda_rpc_request_errors_total Number of RPC errors. Type: counter Labels: redpanda_server=("kafka" | "internal") Usage: When this counter increases, analyze the logged errors. redpanda_rpc_request_latency_seconds RPC latency in seconds. Type: histogram Labels: redpanda_server=("kafka" | "internal") redpanda_scheduler_runtime_seconds_total Accumulated runtime of the task queue associated with a scheduling group. Type: counter Labels: redpanda_scheduling_group=("admin" | "archival_upload" | "cache_background_reclaim" | "cluster" | "coproc" | "kafka" | "log_compaction" | "main" | "node_status" | "raft" | "raft_learner_recovery") shard redpanda_storage_disk_free_bytes Available disk storage in bytes. Type: gauge redpanda_storage_disk_free_space_alert Alert for low disk storage: 0-OK, 1-low space, 2-degraded. Type: gauge redpanda_storage_disk_total_bytes Total size in bytes of attached storage. Type: gauge redpanda_uptime_seconds_total Total CPU runtime (uptime) in seconds. Type: gauge Raft metrics redpanda_node_status_rpcs_received Number of node status RPCs received by a node. Type: gauge redpanda_node_status_rpcs_sent Number of node status RPCs sent by a node. Type: gauge redpanda_node_status_rpcs_timed_out Number of timed out node status RPCs from a node. Type: gauge Service metrics redpanda_pandaproxy_request_latency_seconds Latency in seconds of the request indicated by the label in HTTP Proxy. The measurement includes the time spent waiting for resources to become available, processing the request, and dispatching the response. Type: histogram redpanda_schema_registry_request_errors_total Total number of Schema Registry errors. Type: counter Labels: redpanda_status=("5xx" | "4xx" | "3xx") redpanda_schema_registry_request_latency_seconds Latency of the request indicated by the label in the Schema Registry. The measurement includes the time spent waiting for resources to become available, processing the request, and dispatching the response. Type: histogram Partition metrics redpanda_kafka_max_offset The high watermark offset of a partition. Suitable for calculating consumer group lag. Type: gauge Labels: redpanda_namespace redpanda_partition redpanda_topic Related topics: Consumer group lag redpanda_kafka_under_replicated_replicas Number of replicas in the partition that are live but not at the latest offset, redpanda_kafka_max_offset. Type: gauge Labels: redpanda_namespace redpanda_partition redpanda_topic redpanda_raft_recovery_partition_movement_available_bandwidth Bandwidth available for partition movement, in bytes per sec. Type: gauge Labels: shard Topic metrics redpanda_cluster_partition_schema_id_validation_records_failed Number of records that failed schema ID validation. Type: counter redpanda_kafka_partitions Configured number of partitions for a topic. Type: gauge Labels: -redpanda_namespace -redpanda_topic redpanda_kafka_records_fetched_total Total number of records fetched. Type: counter redpanda_kafka_records_produced_total Total number of records produced. Type: counter redpanda_kafka_replicas The number of configured replicas per topic. Type: gauge Labels: redpanda_namespace redpanda_topic redpanda_kafka_request_bytes_total Total number of bytes produced or consumed per topic. Type: counter Labels: redpanda_namespace redpanda_topic redpanda_request=("produce" | "consume") redpanda_raft_leadership_changes Number of leadership changes across all partitions of a given topic. Type: counter Labels: redpanda_namespace redpanda_topic Broker metrics redpanda_kafka_request_latency_seconds Latency of produce/consume requests per broker. This duration measures from when a request is initiated on the partition to when the response is fulfilled. Type: histogram Labels: redpanda_request=("produce" | "consume") Consumer group metrics redpanda_kafka_consumer_group_committed_offset Committed offset of a consumer group. Type: gauge Labels: group topic partition redpanda_kafka_consumer_group_consumers Number of consumers in a consumer group. Type: gauge Labels: group redpanda_kafka_consumer_group_topics Number of topics in a consumer group. Type: gauge Labels: group REST proxy metrics redpanda_rest_proxy_request_errors_total Total number of REST proxy server errors. Type: counter Labels: redpanda_status("5xx" | "4xx" | "3xx") redpanda_rest_proxy_request_latency_seconds_bucket Internal latency of REST proxy requests. Type: histogram Application metrics redpanda_application_build Redpanda build information. Type: gauge Labels: redpanda_revision=<redpanda-revision-ID> redpanda_version=<redpanda-version-number> redpanda_application_uptime_seconds_total Redpanda application uptime in seconds. Type gauge Cloud metrics redpanda_cloud_client_backoff Total number of requests that backed off. Type: counter Labels: S3 redpanda_endpoint redpanda_region Azure Blob Storage (ABS) redpanda_endpoint redpanda_storage_account redpanda_cloud_client_download_backoff Total number of download requests that backed off. Type: counter Labels: S3 redpanda_endpoint redpanda_region Azure Blob Storage (ABS) redpanda_endpoint redpanda_storage_account redpanda_cloud_client_downloads Total number of requests that downloaded an object from cloud storage. Type: counter Labels: S3 redpanda_endpoint redpanda_region Azure Blob Storage (ABS) redpanda_endpoint redpanda_storage_account redpanda_cloud_client_not_found Total number of requests for which the object was not found. Type: counter Labels: S3 redpanda_endpoint redpanda_region Azure Blob Storage (ABS) redpanda_endpoint redpanda_storage_account redpanda_cloud_client_upload_backoff Total number of upload requests that backed off. Type: counter Labels: S3 redpanda_endpoint redpanda_region Azure Blob Storage (ABS) redpanda_endpoint redpanda_storage_account redpanda_cloud_client_uploads Total number of requests that uploaded an object to cloud storage. Type: counter Labels: S3 redpanda_endpoint redpanda_region Azure Blob Storage (ABS) redpanda_endpoint redpanda_storage_account repdanda_cloud_storage_cloud_log_size The amount of data, in bytes, that exists in Tiered Storage and is accessible by Kafka. Increases every time Redpanda offloads a segment to Tiered Storage, and decreases every time compaction or retention deletes data. The redpanda_namespace label supports the following options for this metric: kafka - User topics kafka_internal - Internal Kafka topic, such as consumer groups redpanda - Redpanda-only internal data Type: gauge Usage: This metric reports the log size for each topic and partition in your cluster Labels: redpanda_namespace=("kafka" | "kafka_internal" | "redpanda") redpanda_topic redpanda_partition TLS metrics redpanda_tls_truststore_expires_at_timestamp_seconds The expiration time, represented as a Unix-style timestamp, of the shortest-lived certificate authority (CA) in the installed CA chain. This lists all resources that have at least one CA configured. Type: gauge Labels: area detail Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area and detail labels provide a hierarchical way of identifying a single resource you need information on. redpanda_tls_certificate_expires_at_timestamp_seconds The expiration time, represented as a Unix-style timestamp, for the shortest-lived installed certificate. While more than one certificate may be installed for a resource, this provides only the timestamp of the next certificate to expire. This lists all resources that have at least one certificate installed. Type: gauge Labels: area detail Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area and detail labels provide a hierarchical way of identifying a single resource you need information on. redpanda_tls_certificate_serial The serial number of the shortest-lived installed certificate. These serial numbers are technically unbounded. As such, this will return the least significant 4B digits of the integer representation certificate’s serial number. While more than one certificate may be installed for a resource, this provides only the serial number of the next certificate to expire. This lists all resources that have at least one certificate installed. Type: gauge Labels: area detail Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area and detail labels provide a hierarchical way of identifying a single resource you need information on. redpanda_tls_loaded_at_timestamp_seconds The last time, represented as a Unix-style timestamp, that at least one TLS certificate was loaded for a resource. This lists all resources that have at least one certificate installed. Type: gauge Labels: area detail Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area and detail labels provide a hierarchical way of identifying a single resource you need information on. redpanda_tls_certificate_valid Indicates whether a given resource has at least one valid installed TLS certificate. Querying this will provide a check on all resources with at least one certificate installed. The value of this gauge is either 1 to indicate presence of a valid certificate or 0 to indicate the absence of a valid certificate. Type: gauge Labels: area detail Usage: This metric is recorded for all resources that may have a TLS certificate installed. The area and detail labels provide a hierarchical way of identifying a single resource you need information on. Data transforms metrics redpanda_transform_execution_latency_sec A histogram of the latency of transforming a single record, in seconds. Type: histogram Labels: function_name redpanda_transform_execution_errors Data transform invocation errors. Type: counter Labels: function_name redpanda_wasm_engine_cpu_seconds_total Total CPU time spent inside a WebAssembly function, in seconds. Type: counter Labels: function_name redpanda_wasm_engine_memory_usage Amount of memory usage for a WebAssembly function. Type: gauge Labels: function_name redpanda_wasm_engine_max_memory Maximum amount of memory for a WebAssembly function. Type: gauge Labels: function_name redpanda_wasm_binary_executable_memory_usage The amount of executable memory usage for WebAssembly binaries. Type: gauge redpanda_transform_read_bytes A counter for all the bytes that has been input to a transform. Type: counter Labels: function_name redpanda_transform_write_bytes A counter for all the bytes that has been output from a transform. Type: counter Labels: function_name redpanda_transform_processor_lag The amount of pending records on the input topic that have yet to be processed by the transform. Type: gauge Labels: function_name redpanda_transform_failures A counter for each time that a processor encounters a failure. Type: counter Labels: function_name redpanda_transform_state The number of transform processors in a specific state (running, inactive, errored). Type: gauge Labels: function_name state=("running" | "inactive" | "errored") Cloud storage metrics In a live system, Redpanda metrics are exported only for features that are in use. For example, Redpanda does not export metrics for consumer groups if no groups are registered. To see the available public metrics in your system, query the /public_metrics endpoint: curl http://<node-addr>:9644/public_metrics | grep "[HELP|TYPE]" NOTE: Cloud storage metrics are only available if you have: Tiered Storage enabled The cluster property cloud_storage_enabled set to true redpanda_cloud_storage_cache_space_size_bytes Sum of size of cached objects. redpanda_cloud_storage_housekeeping_drains Number of times upload housekeeping queue was drained. redpanda_cloud_storage_spillover_manifests_materialized_bytes Bytes of memory used for spilled manifests currently cached in memory. redpanda_cloud_storage_cache_op_put Number of objects written into cache. redpanda_cloud_storage_segments Total number of accounted topic segments in the cloud. redpanda_cloud_storage_jobs_local_segment_reuploads Number of segment reuploads from local data directory. redpanda_cloud_storage_cache_trim_failed_trims Number of times Redpanda could not free the expected amount of space, indicating possible bug or configuration issue. redpanda_cloud_storage_cache_trim_exhaustive_trims Number of times a fast trim could not free enough space and had to fall back to a slower exhaustive trim. redpanda_cloud_storage_deleted_segments Count of deleted remote segments. redpanda_cloud_storage_segment_uploads_total Successful data segment uploads. redpanda_cloud_storage_active_segments Number of remote log segments currently hydrated for read. redpanda_cloud_storage_cache_trim_fast_trims Number of times Redpanda trimmed the cache using the normal (fast) mode. redpanda_cloud_storage_housekeeping_jobs_failed Number of failed housekeeping jobs. redpanda_cloud_storage_partition_readers_delayed How many partition readers were delayed due to hitting reader limit. This indicates cluster is saturated with Tiered Storage reads. redpanda_cloud_storage_segments_pending_deletion Total number of topic segments pending deletion from the cloud. redpanda_cloud_storage_housekeeping_rounds Number of upload housekeeping rounds. redpanda_cloud_storage_segment_readers_delayed Number of segment readers delayed due to hitting reader limit. This indicates cluster is saturated with Tiered Storage reads. redpanda_cloud_storage_cache_space_hwm_size_bytes High watermark of sum of size of cached objects. redpanda_cloud_storage_cache_space_hwm_files High watermark of number of objects in cache. redpanda_cloud_storage_cache_op_in_progress_files Number of files that are being added to cache. redpanda_cloud_storage_jobs_cloud_segment_reuploads Number of segment reuploads from cloud storage sources (cloud storage cache or direct download from cloud storage). redpanda_cloud_storage_jobs_manifest_reuploads Number of manifest reuploads performed by all housekeeping jobs. redpanda_cloud_storage_housekeeping_pauses Number of times upload housekeeping was paused. redpanda_cloud_storage_segment_index_uploads_total Successful segment index uploads. redpanda_cloud_storage_cache_op_miss Number of failed get requests because of missing object in the cache. redpanda_cloud_storage_errors_total Number of transmit errors. redpanda_cloud_storage_spillover_manifest_uploads_total Successful spillover manifest uploads. redpanda_cloud_storage_housekeeping_requests_throttled_average_rate Average rate per shard of requests from the read and write path that were throttled by Tiered Storage. redpanda_cloud_storage_jobs_segment_deletions Number of segments deleted by all housekeeping jobs. redpanda_cloud_storage_segment_materializations_delayed Number of segment materializations delayed due to hitting reader limit. This indicates cluster is saturated with Tiered Storage reads. redpanda_cloud_storage_jobs_metadata_syncs Number of archival configuration updates performed by all housekeeping jobs. redpanda_cloud_storage_housekeeping_jobs_completed Number of successful housekeeping jobs. redpanda_cloud_storage_readers Number of segment read cursors for hydrated remote log segments. redpanda_cloud_storage_partition_manifest_uploads_total Successful partition manifest uploads. redpanda_cloud_storage_limits_downloads_throttled_sum Total amount of throttling applied to cloud storage downloads. redpanda_cloud_storage_housekeeping_resumes Number of times upload housekeeping was resumed. redpanda_cloud_storage_cache_op_hit Number of get requests for objects that are already in cache. redpanda_cloud_storage_spillover_manifests_materialized_count Number of spilled manifests currently cached in memory. redpanda_cloud_storage_uploaded_bytes Total number of uploaded bytes for the topic. redpanda_cloud_storage_cache_space_files Number of objects in cache. redpanda_cloud_storage_housekeeping_jobs_skipped Number of skipped housekeeping jobs. redpanda_cloud_storage_partition_readers Number of partition reader instances, based on the number of current fetch/timequery requests reading from Tiered Storage. Related topics Learn how to monitor Redpanda Internal metrics reference Back to top × Simple online edits For simple changes, such as fixing a typo, you can edit the content directly on GitHub. Edit on GitHub Or, open an issue to let us know about something that you want us to change. Open an issue Contribution guide For extensive content updates, or if you prefer to work locally, read our contribution guide . Was this helpful? thumb_up thumb_down group Ask in the community mail Share your feedback group_add Make a contribution Monitoring Metrics Internal Metrics