Production Readiness Checklist

Before running a production workload on Redpanda in Kubernetes, follow this readiness checklist.

By completing this checklist, you will be able to:

  • Validate a Kubernetes-deployed Redpanda cluster against production readiness standards

For Linux deployments, see the Production Readiness Checklist for Linux.

Critical requirements

The Critical requirements checklist helps ensure that:

  • You have specified all required defaults and configuration items.

  • You have the optimal hardware setup.

  • You have enabled security.

  • You are set up to run in production.

Redpanda license

If you use Enterprise features, verify that the cluster has a valid Enterprise license:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster license info -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
LICENSE INFORMATION
===================
Organization:      Your Company Name
Type:              enterprise
Expires:           Dec 31 2026

Production deployments that use Enterprise features (such as Tiered Storage, Schema Registry, or Continuous Data Balancing) must have a valid Enterprise license with an expiration date that covers your planned production timeline.
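
If you need to load or update a license, one approach (a sketch, assuming the license file is available on your local machine; check rpk cluster license set --help for the exact flags in your rpk version) is to copy the file into a broker Pod and apply it with rpk:

Input
# Copy the license file into a broker Pod, then load it with rpk.
kubectl cp <path-to-license-file> <namespace>/<pod-name>:/tmp/redpanda.license -c redpanda
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster license set --path /tmp/redpanda.license -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>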

See also: Redpanda Licensing

SASL authentication flags

The rpk commands throughout this checklist include SASL authentication flags (-X user, -X pass, -X sasl.mechanism). If your cluster does not use SASL authentication, you can omit these flags from all commands. For example:

Input
# With SASL authentication
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster health -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>

# Without SASL authentication
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster health

Common SASL mechanisms include SCRAM-SHA-256 and SCRAM-SHA-512. Update these values as needed for your deployment.
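
To avoid repeating these flags, rpk can also read most -X options from environment variables. A sketch (the variable names RPK_USER, RPK_PASS, and RPK_SASL_MECHANISM are the documented mappings, but confirm them with rpk -X help for your rpk version):

Input
# Set credentials once for the exec'd command instead of repeating -X flags.
kubectl exec -n <namespace> <pod-name> -c redpanda -- env \
  RPK_USER=<sasl-username> RPK_PASS=<sasl-password> RPK_SASL_MECHANISM=SCRAM-SHA-256 \
  rpk cluster health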

Cluster health

Check that all brokers are connected and healthy by running rpk cluster health. No brokers should be down, and there should be no leaderless or under-replicated partitions.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster health -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          true
Unhealthy reasons:                []
Controller ID:                    0
All nodes:                        [0 1 2]
Nodes down:                       []
Leaderless partitions (0):        []
Under-replicated partitions (0):  []

Minimum broker count

You must have at least three brokers running to ensure production-level fault tolerance.

Production clusters should have an odd number of brokers (3, 5, 7, etc.) for optimal consensus behavior.

Verify the running broker count:

Input
kubectl get pods -n <namespace> -l app.kubernetes.io/component=redpanda-statefulset
Output
NAME         READY   STATUS    RESTARTS   AGE
redpanda-0   2/2     Running   0          10d
redpanda-1   2/2     Running   0          10d
redpanda-2   2/2     Running   0          10d

Verify the configured replica count in your deployment:

  • Helm

  • Operator

Input
helm get values redpanda -n <namespace> | grep -A 1 "statefulset:"
Output
statefulset:
  replicas: 3
Input
kubectl get redpanda redpanda -n <namespace> -o jsonpath='{.spec.clusterSpec.statefulset.replicas}'
Output
3

Active broker membership

Verify that all brokers are in active state and not being decommissioned.

Decommissioning is used to permanently remove a broker from the cluster, such as during node pool migrations or cluster downsizing. Brokers in a decommissioned state should not be present in production clusters unless actively performing a planned migration.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk redpanda admin brokers list -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        4          active             true      v24.2.4
1        4          active             true      v24.2.4
2        4          active             true      v24.2.4

All brokers must show active status. If any broker shows the status draining or decommissioned, investigate immediately.

No brokers in maintenance mode

Check that no brokers are in maintenance mode during normal operations.

Maintenance mode is used when modifying brokers that will remain as members of the cluster, such as during rolling upgrades or hardware maintenance. While necessary during planned maintenance windows, brokers should not remain in maintenance mode during normal operations.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster maintenance status -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
NODE-ID  ENABLED  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
0        false    -         -       -           -         -             -
1        false    -         -       -           -         -             -
2        false    -         -       -           -         -             -

All brokers should show ENABLED: false. If any broker shows ENABLED: true outside of a planned maintenance window, investigate immediately.
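
If a broker was unexpectedly left in maintenance mode, you can take it out with rpk (a sketch; <node-id> is the broker's node ID from the status output):

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster maintenance disable <node-id> -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>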

See also: Maintenance Mode

Consistent Redpanda version

Check that Redpanda is running the latest point release for the major version you’re on and that all brokers run the same version.

Verify Redpanda broker versions:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk redpanda admin brokers list -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        4          active             true      v25.2.4
1        4          active             true      v25.2.4
2        4          active             true      v25.2.4

All brokers must show the same BROKER-VERSION. Version mismatches between brokers can cause compatibility issues and must be resolved before advancing to production.

Verify Helm Chart or Operator version compatibility:

For Kubernetes deployments, you must also verify that your deployment tool (Helm Chart or Operator) version is compatible with your Redpanda version. The Helm Chart or Operator version must be within one minor version of the Redpanda version.

For example, if running Redpanda v25.2.x, the Helm Chart or Operator version must be v25.1.x, v25.2.x, or v25.3.x.

  • Helm

  • Operator

Input
helm list -n <namespace>
Output
NAME     NAMESPACE  REVISION  UPDATED                               STATUS    CHART            APP VERSION
redpanda redpanda   1         2024-01-15 10:30:00.123456 -0800 PST deployed  redpanda-5.2.4   v25.2.4

The CHART column shows the Helm Chart version (for example, redpanda-5.2.4), which should be compatible with the APP VERSION (Redpanda version).

Input
kubectl get deployment redpanda-controller-manager -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'
Output
docker.redpanda.com/redpandadata/redpanda-operator:v25.2.4

The Operator version is shown in the image tag (for example, v25.2.4), which should be compatible with your Redpanda broker version.

You can also check the Operator version using:

Input
kubectl get redpanda redpanda -n <namespace> -o jsonpath='{.metadata.annotations.redpanda\.com/operator-version}'

Version compatibility requirements:

  • All Redpanda brokers must run the same version

  • The Helm Chart or Operator version must be within ±1 minor version of Redpanda version

  • Example: Redpanda v25.2.x requires Helm/Operator v25.1.x, v25.2.x, or v25.3.x

  • Running incompatible versions can lead to deployment failures or cluster instability.

Version pinning

Verify that versions are explicitly pinned in your deployment configuration:

  • Helm

  • Operator

image:
  tag: v24.2.4  # Pin specific Redpanda version

console:
  enabled: true
  image:
    tag: v2.4.5  # Pin specific Console version

connectors:
  enabled: true
  image:
    tag: v1.0.15  # Pin specific Connectors version

Verify pinned versions:

Input
helm get values redpanda -n <namespace>
Output
image:
  tag: v24.2.4
console:
  image:
    tag: v2.4.5
connectors:
  image:
    tag: v1.0.15
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  clusterSpec:
    image:
      tag: v24.2.4  # Pin specific Redpanda version

  console:
    enabled: true
    image:
      tag: v2.4.5  # Pin specific Console version

  connectors:
    enabled: true
    image:
      tag: v1.0.15  # Pin specific Connectors version

Verify pinned versions:

Input
kubectl get redpanda redpanda -n <namespace> -o yaml | grep -A 1 "tag:"

Pin specific versions for Redpanda and all related components (Console, Connectors). This ensures all environments (dev/staging/prod) run the same tested versions, allows controlled upgrade testing before production rollout, and provides rollback capability to known-good versions.

Avoid using the latest tag, version ranges (for example, v24.2.x), or unspecified tags, as these can result in unexpected upgrades that introduce breaking changes or cause downtime.
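
When you apply pinned versions with Helm, reference the chart by an explicit version as well. A sketch (assuming the Redpanda chart repository is already added as redpanda):

Input
# Upgrade (or install) with pinned image tags from values.yaml and a pinned chart version.
helm upgrade --install redpanda redpanda/redpanda -n <namespace> --values values.yaml --version <chart-version>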

Default topic replication factor

Check that the default replication factor (≥3) is set appropriately for production.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get default_topic_replications -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
3

Setting default_topic_replications to 3 or greater ensures new topics are created with adequate fault tolerance.
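
If the value is lower than 3, raise it with rpk cluster config set (this affects only topics created after the change):

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config set default_topic_replications 3 -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>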

Existing topics replication factor

Check that all existing topics have adequate replication (default is 3).

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic list -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
NAME              PARTITIONS  REPLICAS
_schemas          1           3
orders            12          3
payments          8           3
user-events       16          3

All production topics should have REPLICAS of three or greater. Topics with a single replica are at risk of data loss if a broker fails.
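
When creating new production topics, set the replication factor explicitly rather than relying on defaults. A sketch (topic name and partition count are placeholders):

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic create <topic-name> -p 12 -r 3 -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>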

Persistent storage configuration

Verify that you have configured persistent storage (not hostPath or emptyDir) for data persistence.

Input
kubectl get pvc -n <namespace>
Output
NAME                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datadir-redpanda-0      Bound    pvc-a1b2c3d4-e5f6-7890-abcd-ef1234567890   100Gi      RWO            fast-ssd       10d
datadir-redpanda-1      Bound    pvc-b2c3d4e5-f6g7-8901-bcde-fg2345678901   100Gi      RWO            fast-ssd       10d
datadir-redpanda-2      Bound    pvc-c3d4e5f6-g7h8-9012-cdef-gh3456789012   100Gi      RWO            fast-ssd       10d

Verify the StatefulSet uses PersistentVolumeClaims:

Input
kubectl describe statefulset -n <namespace> redpanda | grep -A 5 "Volume Claims"
Output
Volume Claims:
  Name:          datadir
  StorageClass:  fast-ssd
  Labels:        <none>
  Annotations:   <none>
  Capacity:      100Gi

HostPath and emptyDir storage are not suitable for production as they lack durability guarantees.
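
In a Helm-based deployment, persistent storage is typically configured through the chart's storage values. A minimal sketch (key names reflect recent chart versions; verify them against your chart's values.yaml):

storage:
  persistentVolume:
    enabled: true
    storageClass: fast-ssd  # Use a production-grade StorageClass
    size: 100Gi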

See also: Persistent Storage

RAID/LVM stripe configuration (multiple disks only)

If using multiple physical disks, verify they are configured to stripe data across the disks as RAID-0 or LVM stripe (not linear/concat). Striping distributes data across multiple disks in parallel for improved I/O performance.

Input
# Check block device configuration on nodes
kubectl debug node/<node-name> -it -- chroot /host /bin/bash
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT,FSTYPE
lvs -o lv_name,stripes,stripe_size
mdadm --detail /dev/md*  # if using software RAID
Output
# lsblk output
NAME          TYPE  SIZE   MOUNTPOINT        FSTYPE
nvme0n1       disk  1.8T
nvme1n1       disk  1.8T
vg0-data      lvm   3.6T   /var/lib/redpanda xfs

# lvs output - note stripes > 1 indicates striping
LV    #Stripes StripeSize
data  2        256.00k
Output
# mdadm output
/dev/md0:
    Raid Level : raid0
    Array Size : 3515625472 (3.27 TiB)
  Raid Devices : 2

    Number   Major   Minor   RaidDevice State
       0     259        0        0      active sync   /dev/nvme0n1
       1     259        1        1      active sync   /dev/nvme1n1

Using LVM linear/concat or JBOD instead of stripe/RAID-0 across multiple disks will severely degrade performance because data writes are serialized rather than parallelized. For optimal I/O throughput, configure multiple disks in a striped array that writes data across all disks simultaneously. Single disk configurations do not require striping.

See also: Storage

Storage performance requirements

Ensure storage classes provide adequate IOPS and throughput for your workload by using the following specifications when selecting a storage class:

Performance specifications:

  • Use NVMe-based storage classes for production deployments

  • Specify a minimum 16,000 IOPS (Input/Output Operations Per Second)

  • Consider provisioned IOPS where available to meet or exceed the minimum

  • Enable write caching to help Redpanda perform better in environments with disks that don’t meet the recommended IOPS

  • NFS (Network File System) is not supported

  • Test storage performance under load

Avoid cloud instance types that use multi-tenant or shared disks, as these can lead to unpredictable performance due to noisy neighbor effects. Examples of instances with shared/multi-tenant storage include AWS is4gen.xlarge and similar instance types across cloud providers. Instead, use instances with dedicated local NVMe storage or provisioned IOPS volumes that guarantee consistent performance.

Multi-tenant disks can experience:

  • Unpredictable latency spikes from other tenants' workloads

  • Inconsistent throughput that varies based on neighbor activity

  • IOPS throttling that impacts Redpanda’s performance

  • Difficulty troubleshooting performance issues due to external factors
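
For cloud block storage, one way to guarantee consistent performance is a StorageClass with provisioned IOPS. A sketch for the AWS EBS CSI driver (provisioner and parameter values are AWS-specific assumptions; adapt them for your provider):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"      # Meets the minimum IOPS recommendation
  throughput: "500"  # MiB/s
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true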


CPU and memory resource limits

Verify Pods have resource requests and limits configured.

Input
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[?(@.name=="redpanda")].resources}' | jq
Output
{
  "limits": {
    "cpu": "4",
    "memory": "8Gi"
  },
  "requests": {
    "cpu": "4",
    "memory": "8Gi"
  }
}

All Redpanda Pods must have:

  • Identical CPU requests and limits (requests.cpu == limits.cpu)

  • Identical memory requests and limits (requests.memory == limits.memory)

Setting requests equal to limits ensures the Pod receives the Guaranteed QoS class, which prevents CPU throttling and reduces the risk of Pod eviction.
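
You can confirm the resulting QoS class directly from the Pod status:

Input
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'
Output
Guaranteed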

CPU to memory ratio

Ensure adequate memory allocation relative to CPU for optimal performance.

Production deployments should provision at least 2 GiB of memory per CPU core, which corresponds to a CPU-to-memory ratio of at least 1:2.

Verify the CPU to memory ratio in your configuration:

  • Helm

  • Operator

Input
helm get values redpanda -n <namespace> | grep -A 2 "resources:"
Output
resources:
  cpu:
    cores: 4
  memory:
    container:
      min: 8Gi
      max: 8Gi
Input
kubectl get redpanda redpanda -n <namespace> -o jsonpath='{.spec.clusterSpec.resources}' | jq
Output
{
  "cpu": {
    "cores": 4
  },
  "memory": {
    "container": {
      "min": "8Gi",
      "max": "8Gi"
    }
  }
}

In the preceding examples, 4 CPU cores with 8 GiB memory provides a 1:2 ratio (2 GiB per core).

See also: Memory

No fractional CPU requests

Ensure CPU requests use whole numbers for consistent performance.

Fractional CPUs can lead to performance variability in production. Use whole integer values (4, 8, or 16 are acceptable, while 3.5 or 7.5 are not).

Verify CPU configuration:

Input
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[?(@.name=="redpanda")].resources.requests.cpu}'
Output
4

Authorization enabled

Verify Kafka authorization is enabled for access control.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get kafka_enable_authorization -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
true

Without authorization enabled, any client that can connect to the Kafka API can perform any operation on the cluster.
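
If authorization is disabled, enable it with rpk cluster config set. Make sure a superuser and ACLs are in place first so that existing clients are not locked out:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config set kafka_enable_authorization true -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>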

See also: Authorization

Production mode enabled

Verify that developer mode and overprovisioned mode are disabled for production stability.

Check developer mode:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- grep developer_mode /etc/redpanda/redpanda.yaml
Output
developer_mode: false

Developer mode should never be enabled in production environments. Developer mode disables fsync and bypasses safety checks designed for production workloads.

Check overprovisioned mode:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- grep overprovisioned /etc/redpanda/redpanda.yaml
Output
overprovisioned: false

Overprovisioned mode bypasses critical resource checks and should never be enabled in production. This mode is intended only for development environments with constrained resources.

Verify in Helm values that resources.cpu.overprovisioned is not explicitly set to true (it’s automatically calculated based on CPU allocation).

TLS enabled

Configure TLS encryption for all client and inter-broker communication. TLS prevents eavesdropping and man-in-the-middle attacks on network traffic.

Verify TLS is enabled on all listeners:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- grep -A 10 "kafka_api" /etc/redpanda/redpanda.yaml
Output
  kafka_api:
    - address: 0.0.0.0
      port: 9093
      name: internal
      authentication_method: sasl
  kafka_api_tls:
    - name: internal
      enabled: true
      cert_file: /etc/tls/certs/tls.crt
      key_file: /etc/tls/certs/tls.key

Required TLS listeners include:

  • kafka_api - Client connections to Kafka API

  • admin_api - Administrative REST API access

  • rpc_server - Inter-broker communication

  • schema_registry - Schema Registry API (if used)

Verify certificates are properly mounted:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- ls -la /etc/tls/certs/
Output
total 16
-rw-r--r-- 1 redpanda redpanda 1234 Dec 15 10:00 ca.crt
-rw-r--r-- 1 redpanda redpanda 1675 Dec 15 10:00 tls.crt
-rw------- 1 redpanda redpanda 1704 Dec 15 10:00 tls.key
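
In a Helm-based deployment, TLS is controlled through the chart's tls values. A minimal sketch (recent chart versions can issue self-signed certificates through cert-manager by default; verify the exact keys and certificate issuers for your chart version):

tls:
  enabled: true  # Enables TLS on the configured listeners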

See also: TLS Encryption

Authentication enabled

Configure appropriate authentication mechanisms to control access to Redpanda resources.

Verify SASL users are configured:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk acl user list -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
USERNAME
admin
app-producer
app-consumer
monitoring

Be sure to adhere to the following authentication requirements:

  • Set up SASL authentication for client connections

  • Configure TLS certificates for encryption (see preceding TLS configuration guidance)

  • Implement proper user management with principle of least privilege

  • Configure ACLs (Access Control Lists) for resource authorization

Verify ACLs are configured:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk acl list -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
PRINCIPAL          HOST  RESOURCE-TYPE  RESOURCE-NAME     OPERATION  PERMISSION
User:app-producer  *     TOPIC          orders.*          WRITE      ALLOW
User:app-consumer  *     TOPIC          orders.*          READ       ALLOW
User:app-consumer  *     GROUP          consumer-group-1  READ       ALLOW
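
To add a user and grant it least-privilege access, a sketch using rpk (user name, password, and topic are placeholders):

Input
# Create a SASL user.
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk acl user create app-producer -p '<password>' -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>

# Allow that user to write to a specific topic.
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk acl create --allow-principal User:app-producer --operation write --topic orders -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>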


Network security

Secure network access to the cluster using Kubernetes-native controls.

Verify NetworkPolicies are configured:

Input
kubectl get networkpolicy -n <namespace>
Output
NAME                          POD-SELECTOR                        AGE
redpanda-allow-internal       app.kubernetes.io/name=redpanda    10d
redpanda-allow-clients        app.kubernetes.io/name=redpanda    10d
redpanda-deny-all-ingress     app.kubernetes.io/name=redpanda    10d

Check NetworkPolicy rules:

Input
kubectl describe networkpolicy -n <namespace>

Be sure to satisfy the following network security requirements:

  • Configure NetworkPolicies to restrict pod-to-pod communication

  • Use TLS for all client connections (see TLS configuration)

  • Secure admin API endpoints with authentication and authorization

  • Limit ingress traffic to only necessary ports and sources

  • Use Kubernetes Services to control external access

Verify services and exposed ports:

Input
kubectl get svc -n <namespace>
Output
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)
redpanda           ClusterIP      None            <none>        9093/TCP,9644/TCP,8082/TCP
redpanda-external  LoadBalancer   10.100.200.50   <pending>     9093:30001/TCP
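
A minimal NetworkPolicy sketch that admits client traffic to the Kafka listener only from a designated namespace (the labels, namespace, and port 9093 match the examples above and are assumptions to adapt):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redpanda-allow-clients
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redpanda
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: <client-namespace>
      ports:
        - protocol: TCP
          port: 9093  # Kafka API listener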

Pod Disruption Budget

Set up PDBs to control voluntary disruptions during maintenance.

Input
kubectl get pdb -n <namespace>
Output
NAME       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
redpanda   N/A             1                 1                     10d

Production deployments must have a PodDisruptionBudget with maxUnavailable: 1 to prevent simultaneous broker disruptions during voluntary operations like node drains, upgrades, or autoscaler actions.
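
In Helm-based deployments, the chart manages the PodDisruptionBudget. A sketch of the relevant values (the budget key path reflects recent chart versions; confirm it against your chart's values.yaml):

statefulset:
  budget:
    maxUnavailable: 1  # Allow at most one broker to be disrupted at a time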

Rack awareness and topology spread

Configure topology spread constraints to distribute brokers across availability zones. For configuration instructions, see Multi-AZ deployment.

Production deployments require each Redpanda broker to run in a different availability zone to ensure that a single zone failure does not cause loss of quorum. For a three-broker cluster, brokers must be distributed across three separate zones.

To verify zone distribution, check your cluster configuration:

  • Verify topologySpreadConstraints are configured in your Helm values or Redpanda CR

  • Confirm nodes have zone labels (typically topology.kubernetes.io/zone)

  • Check that brokers are scheduled on nodes in different zones
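
A quick way to run these checks (assuming nodes carry the standard topology.kubernetes.io/zone label):

Input
# Show each node's zone label.
kubectl get nodes -L topology.kubernetes.io/zone

# Show which node (and therefore zone) each broker Pod landed on.
kubectl get pods -n <namespace> -l app.kubernetes.io/component=redpanda-statefulset -o wide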

See also: Rack Awareness

Operator CRDs (Operator deployments only)

If your deployment uses the Redpanda Operator, all required Custom Resource Definitions (CRDs) must be installed with compatible versions. Without correct CRDs, the Operator cannot manage the cluster, leading to configuration drift, failed updates, and potential data loss.

The required CRDs are:

  • clusters.cluster.redpanda.com - Manages Redpanda cluster configuration

  • topics.cluster.redpanda.com - Manages topic lifecycle

  • users.cluster.redpanda.com - Manages SASL users

  • schemas.cluster.redpanda.com - Manages Schema Registry schemas

If any CRDs are missing or incompatible with your Operator version, the Operator will fail to reconcile resources.

Verify all required CRDs are installed:

Input
kubectl get crd | grep redpanda.com
Output
clusters.cluster.redpanda.com
topics.cluster.redpanda.com
users.cluster.redpanda.com
schemas.cluster.redpanda.com
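
To check which API versions a CRD serves (useful when confirming compatibility with your Operator version), inspect the CRD spec; the versions listed depend on your Operator release:

Input
kubectl get crd clusters.cluster.redpanda.com -o jsonpath='{.spec.versions[*].name}'
Output
v1alpha1 v1alpha2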

Run Redpanda tuners

Check that you have configured tuners for optimal performance. Tuners can significantly impact latency and throughput. In Kubernetes, tuners are configured through the Helm chart or may need to be run on worker nodes themselves. For details, see Tune Kubernetes Worker Nodes for Production.
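
In Helm-based deployments, tuners are exposed under the chart's tuning values. A minimal sketch (assuming your chart version exposes tuning.tune_aio_events; check your chart's values.yaml for the full list of tuner keys):

tuning:
  tune_aio_events: true  # Raise AIO event limits on the worker node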

Recommended requirements

The Recommended requirements checklist ensures that you can monitor and support your environment on a sustained basis. It includes the following checks:

  • You have adhered to day-2 operations best practices.

  • You can diagnose and recover from backup issues or failures.

  • You have configured monitoring, backup, and security scanning.

Deployment method

Verify that the deployment method (Helm or Operator) is correctly identified for your cluster. Understanding your deployment method is important for troubleshooting, upgrades, and configuration management.

  • Helm

  • Operator

Input
helm list -n <namespace>
Output
NAME     NAMESPACE  REVISION  UPDATED                               STATUS    CHART            APP VERSION
redpanda redpanda   1         2024-01-15 10:30:00.123456 -0800 PST deployed  redpanda-5.0.0   v24.1.1

The presence of a Helm release (CHART displays redpanda-5.0.0) indicates a Helm-managed deployment.

Input
kubectl get redpanda -n <namespace>
Output
NAME       READY   STATUS
redpanda   True    Redpanda reconciliation succeeded

The presence of a Redpanda custom resource indicates an Operator-managed deployment.

Knowing your deployment method helps determine which configuration approach to use (Helm values vs. Redpanda CR), how to perform upgrades and rollbacks, where to find deployment logs and troubleshooting information, and which documentation sections apply to your environment. See Production Deployment Workflow for the complete deployment process.

XFS filesystem

Verify that data directories use XFS filesystem for optimal performance.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- df -khT /var/lib/redpanda/data
Output
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1   xfs   1.8T   14G  1.8T   1% /var/lib/redpanda/data

XFS provides better performance characteristics for Redpanda workloads compared to ext4. While ext4 is supported, XFS is strongly recommended for production deployments.

Pod anti-affinity

Configure Pod anti-affinity to spread brokers across nodes.

Input
kubectl get statefulset redpanda -n <namespace> -o jsonpath='{.spec.template.spec.affinity}' | jq
Output
{
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchLabels": {
            "app.kubernetes.io/name": "redpanda"
          }
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}

This prevents single node failures from affecting multiple brokers by ensuring each Redpanda Pod runs on a different node.

See also: Pod Anti-Affinity

Node isolation

Configure taints/tolerations or nodeSelector for workload isolation.

Input
kubectl get statefulset redpanda -n <namespace> -o jsonpath='{.spec.template.spec.nodeSelector}' | jq
Output
{
  "workload-type": "redpanda"
}

Isolating Redpanda workloads on dedicated nodes improves performance predictability by preventing resource contention with other applications.
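
One way to set this up is to taint and label dedicated nodes, then point the chart (or Redpanda CR) at them. A sketch (the taint key, label, and values are illustrative):

Input
# Reserve nodes for Redpanda.
kubectl taint nodes <node-name> dedicated=redpanda:NoSchedule
kubectl label nodes <node-name> workload-type=redpanda

With matching values in your Helm configuration:

nodeSelector:
  workload-type: redpanda
tolerations:
  - key: dedicated
    operator: Equal
    value: redpanda
    effect: NoSchedule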

Partition balancing

Configure automatic partition balancing across brokers and CPU cores.

Continuous Data Balancing

Continuous Data Balancing helps you manage production deployments by automatically rebalancing partition replicas across brokers in response to disk usage and node changes, reducing the need for manual intervention and helping prevent performance degradation.

You should enable Continuous Data Balancing for all licensed production clusters.

Verify that Continuous Data Balancing is configured:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get partition_autobalancing_mode -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
continuous

The continuous setting enables automatic partition rebalancing based on:

  • Node additions or removals

  • High disk usage conditions

  • Broker availability changes

Without Continuous Data Balancing, partition distribution becomes skewed over time, leading to hotspots and manual rebalancing operations.
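
If the mode is not set to continuous, enable it with rpk cluster config set (this requires an Enterprise license):

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config set partition_autobalancing_mode continuous -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>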

Core Balancing

Intra-broker partition balancing distributes partition replicas across CPU cores within individual brokers.

Check core balancing for CPU core partition distribution:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get core_balancing_on_core_count_change -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
true

When enabled, Redpanda continuously rebalances partitions between CPU cores on a broker for optimal resource utilization, which is especially beneficial after broker restarts or configuration changes.

System requirements

Run system checks to get more details regarding your system configuration.

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk redpanda check -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
CONDITION                         REQUIRED       CURRENT   SEVERITY  PASSED
Data directory is writable        true           true      Fatal     true
Free memory per CPU [MB]          >= 2048        8192      Warning   true
NTP Synced                        true           true      Warning   true
Swappiness                        1              1         Warning   true

Review any failed checks and remediate before proceeding to production. See rpk redpanda check for details on each validation.

Debug bundle

Verify that you can successfully generate and collect a debug bundle from your cluster. This proactive check ensures that if an issue occurs and you need to contact Redpanda support, you won’t face permission issues or silent collection failures that could delay troubleshooting.

Generate a debug bundle:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk debug bundle -o /tmp/bundle.zip

For additional options and arguments, see rpk debug bundle.

Output
Creating bundle file...
Collecting cluster info...
Collecting logs...
Collecting configuration...
Debug bundle saved to '/tmp/bundle.zip'

Debug bundles collect critical diagnostic information including cluster configuration and metadata, Redpanda logs from all brokers, system resource usage and performance metrics, and Kubernetes resource definitions.

When testing bundle generation, watch for permission errors preventing log collection, insufficient disk space for bundle creation, network policies blocking bundle transfer, or RBAC restrictions on accessing Pod logs or exec. Testing bundle generation early ensures this critical troubleshooting tool works when you need it most. Debug bundles are often required by Redpanda support to diagnose production issues efficiently.
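
After generating the bundle, copy it out of the Pod so it can be attached to a support ticket:

Input
kubectl cp <namespace>/<pod-name>:/tmp/bundle.zip ./bundle.zip -c redpanda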

Tiered Storage

Configure Tiered Storage for extended data retention using object storage. Tiered Storage automatically offloads older data to cloud storage (S3, GCS, Azure Blob), enabling extended retention without expanding local disk capacity.

Verify Tiered Storage configuration:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get cloud_storage_enabled -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
true

Benefits of Tiered Storage

  • Reduced local storage costs from offloading cold data to cheaper object storage

  • Longer data retention periods without provisioning additional disk

  • Required for advanced features like Remote Read Replicas and Iceberg integration

  • Disaster recovery capabilities through cloud-backed data

To verify your Tiered Storage configuration:

Input
# Check bucket configuration
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get cloud_storage_bucket -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>

# Check region/endpoint
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get cloud_storage_region -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
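
Tiered Storage can also be confirmed or enabled per topic through the redpanda.remote.write and redpanda.remote.read topic properties. A sketch (the topic name is a placeholder):

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic alter-config <topic-name> --set redpanda.remote.write=true --set redpanda.remote.read=true -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>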

See also: Tiered Storage

Security scanning

Regularly scan container images and configurations for vulnerabilities to maintain security.

Container image scanning

Verify that container images are scanned before deployment:

Input
# Check current image in use
kubectl get statefulset redpanda -n <namespace> -o jsonpath='{.spec.template.spec.containers[?(@.name=="redpanda")].image}'
Output
docker.redpanda.com/redpandadata/redpanda:v24.2.4
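
For example, with Trivy installed locally, you can scan the exact image in use (substitute the image reported by the previous command):

Input
trivy image docker.redpanda.com/redpandadata/redpanda:v24.2.4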

Security scanning best practices

Security scanning best practices include:

  • Scan images using tools like Trivy, Snyk, or cloud-native scanners before deployment

  • Set up automated scanning in CI/CD pipelines

  • Monitor for CVE announcements and security advisories

  • Keep Redpanda and related components up-to-date with security patches (see Rolling Upgrades)

  • Review Kubernetes RBAC policies and ServiceAccount permissions (see Role Controller)

Configuration scanning

Input
# Scan Kubernetes manifests
kubectl get redpanda,statefulset,deployment -n <namespace> -o yaml > cluster-config.yaml
# Use kubesec, kube-bench, or similar tools to scan cluster-config.yaml

Establish a regular cadence for security scanning (for example, weekly or with each deployment).

Backup and recovery

Implement and test backup and recovery processes to ensure business continuity.

Backup strategy with Tiered Storage

Tiered Storage provides built-in backup capabilities by storing data in object storage. Verify Tiered Storage is configured:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get cloud_storage_enabled -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>

Recovery testing

Regularly test recovery procedures to validate RTO/RPO targets:

Input
# Test topic restoration from Tiered Storage
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic describe <topic-name> -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>

For mission-critical workloads requiring active disaster recovery, consider implementing Shadowing to asynchronously replicate data to a standby cluster. Shadowing provides offset-preserving replication that maintains consumer positions, enabling faster recovery with lower RTO compared to restoration from backups. This Enterprise feature (available in Redpanda v25.3 or later) supports cross-region or cross-cloud disaster recovery with automatic failover capabilities.

Configure and validate Tiered Storage for automatic data backup to object storage. Document and regularly test recovery procedures for different failure scenarios in non-production environments. Establish clear Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets, and maintain runbooks for disaster recovery scenarios. For Shadowing deployments, use the Shadowing Failover Runbook as a starting point. Verify that IAM roles and permissions for object storage access are correctly configured and tested.


Audit logging

Enable and configure audit logging for compliance and security monitoring requirements.

Verify your audit log configuration:

Input
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config get audit_enabled -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism>
Output
true

Check to ensure you know where audit logs are being written:

Input
# Check audit log topic
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk topic list -X user=<sasl-username> -X pass=<sasl-password> -X sasl.mechanism=<sasl-mechanism> | grep audit
Output
_redpanda.audit_log    1    3

The output values of 1 and 3 indicate the number of partitions and replicas, respectively, for the audit log topic.

For production environments with compliance requirements (SOC 2, HIPAA, PCI DSS, GDPR), forward audit logs to your SIEM system and configure retention policies according to your regulatory obligations. Ensure the audit log topic has adequate replication and retention settings.

See also: Audit Logging

Monitoring

Check that monitoring is configured with Prometheus and Grafana to scrape metrics from all Redpanda brokers.

Verify ServiceMonitor is configured:

Input
kubectl get servicemonitor -n <namespace>
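
If no ServiceMonitor exists, Helm-based deployments can create one through the chart's monitoring values. A sketch (this assumes the Prometheus Operator CRDs are installed in the cluster):

monitoring:
  enabled: true  # Creates a ServiceMonitor that Prometheus Operator can discover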

System log retention

Check that Redpanda logs are being captured and stored for an appropriate period of time (at least seven days). Configure log forwarding using tools like Fluentd or your cloud provider’s logging solution to send logs to a central location for troubleshooting and compliance purposes.

Environment configuration

Check that you have a development or test environment configured to evaluate upgrades and configuration changes before applying them to production.

Upgrade policy

Check that you have an upgrade policy defined and implemented. Redpanda supports rolling upgrades, so upgrades do not require downtime. However, make sure that upgrades are scheduled on a regular basis, ideally using automation with Helm or GitOps workflows.

Advanced requirements

The Advanced requirements checklist ensures full enterprise readiness: your system operates at the highest level of availability and can prevent or recover from the most serious incidents. It confirms the following:

  • You are proactively monitoring mission-critical workloads.

  • You have business continuity solutions in place.

  • You have integrated into enterprise security and operational systems.

  • Your enterprise is ready to run mission-critical workloads.

Configure alerts

A standard set of alerts for Grafana or Prometheus is provided in the GitHub Redpanda observability repo. Customize these alerts for your specific needs.

See also: Monitoring Metrics

Deployment automation

Review your deployment automation. Ensure that cluster configuration is managed using Helm or GitOps workflows, and that all configuration is saved in source control.

Monitor security settings

Regularly review your cluster’s security settings using the /v1/security/report Admin API endpoint. Investigate and address any issues identified in the alerts section.

Input
curl 'http://localhost:9644/v1/security/report'
View output
{
  "interfaces": {
    "kafka": [
      {
        "name": "test_kafka_listener",
        "host": "0.0.0.0",
        "port": 9092,
        "advertised_host": "0.0.0.0",
        "advertised_port": 9092,
        "tls_enabled": false,
        "mutual_tls_enabled": false,
        "authentication_method": "None",
        "authorization_enabled": false
      }
    ],
    "rpc": {
      "host": "0.0.0.0",
      "port": 33145,
      "advertised_host": "127.0.0.1",
      "advertised_port": 33145,
      "tls_enabled": false,
      "mutual_tls_enabled": false
    },
    "admin": [
      {
        "name": "test_admin_listener",
        "host": "0.0.0.0",
        "port": 9644,
        "tls_enabled": false,
        "mutual_tls_enabled": false,
        "authentication_methods": [],
        "authorization_enabled": false
      }
    ]
  },
  "alerts": [
    {
      "affected_interface": "kafka",
      "listener_name": "test_kafka_listener",
      "issue": "NO_TLS",
      "description": "\"kafka\" interface \"test_kafka_listener\" is not using TLS. This is insecure and not recommended."
    },
    {
      "affected_interface": "kafka",
      "listener_name": "test_kafka_listener",
      "issue": "NO_AUTHN",
      "description": "\"kafka\" interface \"test_kafka_listener\" is not using authentication. This is insecure and not recommended."
    },
    {
      "affected_interface": "kafka",
      "listener_name": "test_kafka_listener",
      "issue": "NO_AUTHZ",
      "description": "\"kafka\" interface \"test_kafka_listener\" is not using authorization. This is insecure and not recommended."
    },
    {
      "affected_interface": "rpc",
      "issue": "NO_TLS",
      "description": "\"rpc\" interface is not using TLS. This is insecure and not recommended."
    },
    {
      "affected_interface": "admin",
      "listener_name": "test_admin_listener",
      "issue": "NO_TLS",
      "description": "\"admin\" interface \"test_admin_listener\" is not using TLS. This is insecure and not recommended."
    },
    {
      "affected_interface": "admin",
      "listener_name": "test_admin_listener",
      "issue": "NO_AUTHZ",
      "description": "\"admin\" interface \"test_admin_listener\" is not using authorization. This is insecure and not recommended."
    },
    {
      "affected_interface": "admin",
      "listener_name": "test_admin_listener",
      "issue": "NO_AUTHN",
      "description": "\"admin\" interface \"test_admin_listener\" is not using authentication. This is insecure and not recommended."
    }
  ]
}