# Production Readiness Checklist

Before running a production workload on Redpanda, follow this readiness checklist to ensure that you're set up for success. Redpanda Data recommends using the automated deployment instructions with Ansible. If you cannot deploy with Ansible, use the manual deployment instructions.

## Level 1 production readiness

The Level 1 readiness checklist helps you to confirm that:

- All required defaults and configuration items are specified.
- You have the optimal hardware setup.
- Security is enabled.
- You are set up to run in production.

### Redpanda license

Check that the Redpanda license has been loaded into the cluster configuration. This is required to enable Enterprise features.

Input:

```bash
rpk cluster license info
```

Output:

```
LICENSE INFORMATION
===================
Organization:  Redpanda Owlshop LLC
Type:          enterprise
Expires:       Mar 25 2025
```

### Cluster health

Check that all brokers are connected and running. Run `rpk cluster info` to check the health of the cluster. No nodes should be down, and there should be zero leaderless or under-replicated partitions. Then run `rpk cluster health`. The cluster should be listed as healthy.

Input:

```bash
rpk cluster info
```

Output:

```
CLUSTER
=======
redpanda.be267958-279d-49cd-ae86-98fc7ed2de48

BROKERS
=======
ID    HOST           PORT  RACK
0*    54.70.51.189   9092  us-west-2a
1     35.93.178.18   9092  us-west-2b
2     35.91.121.126  9092  us-west-2c
```

Input:

```bash
rpk cluster health
```

Output:

```
CLUSTER HEALTH OVERVIEW
=======================
Healthy:                          true
Unhealthy reasons:                []
Controller ID:                    0
All nodes:                        [0 1 2]
Nodes down:                       []
Leaderless partitions (0):        []
Under-replicated partitions (0):  []
```

### Production mode enabled

Check that Redpanda is running in production mode. To check the status of a Redpanda broker, inspect its broker configuration in `/etc/redpanda/redpanda.yaml`. Both `developer_mode` and `overprovisioned` should be `false` or absent from the file. If either setting is `true` on any broker, the cluster is not in full production mode and must be corrected.

Input:

```bash
grep -E 'developer_mode|overprovisioned' /etc/redpanda/redpanda.yaml
```

Output:

```
developer_mode: false
overprovisioned: false
```
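If a broker is still in development mode, `rpk` can switch it back. The following is a minimal sketch, run on each affected broker; verify the subcommands against your installed `rpk` version, and note that the broker must be restarted for the change to take effect:

```bash
# Set this broker's local configuration to production mode
# (clears developer_mode and overprovisioned in redpanda.yaml).
sudo rpk redpanda mode production

# Re-apply the production tuners, then restart the service.
sudo rpk redpanda tune all
sudo systemctl restart redpanda
```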
### System meets Redpanda requirements

Run `sudo rpk redpanda check` to ensure that your system meets Redpanda's requirements. This command requires `sudo` because it reads from `/proc` and `/sys`, which may be read-restricted.

Input:

```bash
sudo rpk redpanda check
```

Output:

```
System check results
CONDITION                                           REQUIRED      CURRENT          SEVERITY  PASSED
Ballast file present                                true          true             Warning   true
Clock Source                                        tsc           tsc              Warning   true
Config file valid                                   true          true             Fatal     true
Connections listen backlog size                     >= 4096       4096             Warning   true
Data directory filesystem type                      xfs           xfs              Warning   true
Data directory is writable                          true          true             Fatal     true
Data partition free space [GB]                      >= 10         1755.29          Warning   true
Dir '/var/lib/redpanda/data' IRQs affinity set      true          true             Warning   true
Dir '/var/lib/redpanda/data' IRQs affinity static   true          true             Warning   true
Dir '/var/lib/redpanda/data' nomerges tuned         true          true             Warning   true
Dir '/var/lib/redpanda/data' scheduler tuned        true          true             Warning   true
Free memory per CPU [MB]                            2048 per CPU  7659             Warning   true
Fstrim systemd service and timer active             true          true             Warning   true
I/O config file present                             true          true             Warning   true
Kernel Version                                      3.19          5.15.0-1056-aws  Warning   true
Max AIO Events                                      >= 1048576    1048576          Warning   true
Max syn backlog size                                >= 4096       4096             Warning   true
NIC IRQs affinity static                            true          true             Warning   true
NTP Synced                                          true          true             Warning   true
RFS Table entries                                   >= 32768      32768            Warning   true
Swap enabled                                        true          true             Warning   true
Swappiness                                          1             1                Warning   true
Transparent huge pages active                       true          true             Warning   true
```

### Latest Redpanda version

Check that Redpanda is running the latest point release on every node for the major version you're on.

Input:

```bash
/usr/bin/redpanda --version
```

Output:

```
24.1.1 - b5ade3f40
```

### Correct CPUs and memory configured

Check that you have the correct number of CPUs and sufficient memory to run Redpanda.

Input:

```bash
journalctl -u redpanda | grep "System resources"
```

Output:

```
Mar 25 12:16:18 ip-172-31-10-199 rpk[3957]: INFO  2024-03-25 12:16:18,105 [shard 0:main] main - application.cc:350 - System resources: { cpus: 8, available memory: 55.578GiB, reserved memory: 3.890GiB}
```

### Disks correctly mounted

Check that the correct disks are mounted and, if multiple devices are used, that they are configured as RAID-0. Other RAID configurations can have significantly worse latencies. The file system should be XFS. If XFS is unavailable, ext4 is an appropriate alternative.

Input:

```bash
grep data_directory /etc/redpanda/redpanda.yaml
```

Output:

```
data_directory: /var/lib/redpanda/data
```

Input:

```bash
df -khT /var/lib/redpanda/data
```

Output for NVMe with XFS:

```
Filesystem    Type  Size  Used  Avail  Use%  Mounted on
/dev/nvme0n1  xfs   1.8T  14G   1.8T   1%    /mnt/vectorized
```

Output for an mdadm RAID mount point:

```
Filesystem  Type  Size  Used  Avail  Use%  Mounted on
/dev/md0    xfs   14T   99G   14T    1%    /mnt/vectorized
```

Example of how to get more details about the RAID array:

Input:

```bash
mdadm --detail /dev/md0
```

Output:

```
/dev/md0:
           Version : 1.2
     Creation Time : Thu Apr 18 11:03:46 2024
        Raid Level : raid0
        Array Size : 14648172544 (13969.59 GiB 14999.73 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Thu Apr 18 11:03:46 2024
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : ip-172-31-24-82:0  (local to host ip-172-31-24-82)
              UUID : e9574118:10d562bf:ed3ca2d9:68ccc3a6
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme2n1
       1     259        0        1      active sync   /dev/nvme1n1
```

Use these results to verify that the expected disks are present and the expected RAID level is set. (Typically, this is `raid0` in a production system, because data resilience is provided by Raft replication across Redpanda brokers rather than by RAID.)
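If a multi-device array still needs to be built, the following is a minimal sketch using `mdadm` and XFS. The device names (`/dev/nvme1n1`, `/dev/nvme2n1`) are examples only; substitute your own devices, and note that these commands destroy any existing data on them:

```bash
# Create a RAID-0 array across two NVMe devices (example names).
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 \
    /dev/nvme1n1 /dev/nvme2n1

# Format the array with XFS and mount it at Redpanda's data directory.
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /var/lib/redpanda/data
```

Add a matching entry to `/etc/fstab` so the mount persists across reboots.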
### Authentication enabled

Check that authentication is set up (or that other mitigations are in place). Without SASL authentication enabled, anybody can potentially connect to the Redpanda brokers.

Input:

```bash
rpk cluster config get kafka_enable_authorization
```

Output:

```
true
```

### Superusers configured

Check that the Admin API is secured and that any users defined in the `superusers` configuration are appropriately protected with strong credentials.
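As a quick check, you can read the `superusers` property directly from the cluster configuration. The username and password below are placeholders for illustration only; confirm the exact `rpk security` syntax against your rpk version:

```bash
# List which users are currently configured as superusers.
rpk cluster config get superusers

# Example only: create a SASL/SCRAM user and grant it superuser rights.
rpk security user create admin -p 'use-a-strong-password'
rpk cluster config set superusers '["admin"]'
```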
See also: Create superusers

### TLS enabled

Check that all public interfaces have TLS enabled.

Input:

```bash
journalctl -u redpanda.service | grep tls
```

Output:

```
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,513 [shard 0:main] main - application.cc:772 - redpanda.cloud_storage_disable_tls:0 - Disable TLS for all S3 connections
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,514 [shard 0:main] main - application.cc:772 - redpanda.kafka_mtls_principal_mapping_rules:{nullopt} - Principal Mapping Rules for mTLS Authentication on the Kafka API
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,514 [shard 0:main] main - application.cc:772 - redpanda.admin_api_tls:{{name: , tls_config: { enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 }}} - TLS configuration for admin HTTP server
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - redpanda.kafka_api_tls:{{name: , tls_config: { enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 }}} - TLS configuration for Kafka API endpoint
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - redpanda.rpc_server_tls:{ enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for RPC server
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - pandaproxy.pandaproxy_api_tls:{} - TLS configuration for Pandaproxy api
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - pandaproxy_client.broker_tls:{ enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for the brokers
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - schema_registry.schema_registry_api_tls:{{name: , tls_config: { enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 }}} - TLS configuration for Schema Registry API
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - schema_registry_client.broker_tls:{ enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for the brokers
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO  2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - audit_log_client.broker_tls:{ enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for the brokers
```

Using the logs on each broker, verify that the following interfaces have TLS enabled (each entry should show `enabled: 1`):

- Kafka API
- Admin REST API
- Internal RPC server
- Schema Registry
- HTTP Proxy (Pandaproxy)
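For reference, TLS for each listener is configured per API in `/etc/redpanda/redpanda.yaml`. This is a minimal sketch for the Kafka API, reusing the certificate paths that appear in the log output above; the listener name and whether you require client certificates depend on your environment:

```yaml
redpanda:
  kafka_api_tls:
    - name: default               # must match the listener name
      enabled: true
      key_file: /etc/redpanda/certs/node.key
      cert_file: /etc/redpanda/certs/node.crt
      truststore_file: /etc/redpanda/certs/truststore.pem
      require_client_auth: false  # set to true to enforce mTLS
```

The Admin API, RPC server, Schema Registry, and HTTP Proxy have analogous settings (`admin_api_tls`, `rpc_server_tls`, and so on), as shown in the log output above.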
See also: Multiple listeners

### Run Redpanda tuners

Check that you have run the tuners on all cluster hosts. This can have a significant impact on latency and throughput. Redpanda tuners ensure that the operating system is configured for optimal performance. In Kubernetes, you may need to run the tuners on the hosts themselves, rather than in containers.

Input:

```bash
systemctl status redpanda-tuner
```

Output:

```
redpanda-tuner.service - Redpanda Tuner
     Loaded: loaded (/lib/systemd/system/redpanda-tuner.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2024-03-25 12:03:51 UTC; 48min ago
    Process: 3795 ExecStart=/usr/bin/rpk redpanda tune all $CPUSET (code=exited, status=0/SUCCESS)
   Main PID: 3795 (code=exited, status=0/SUCCESS)

Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: cpu                    true   true   true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_irq               true   true   true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_nomerges          true   true   true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_scheduler         true   true   true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_write_cache       false  true   false  Disk write cache tuner is only supported in GCP
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: fstrim                 false  false  true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: net                    true   true   true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: swappiness             true   true   true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: transparent_hugepages  false  false  true
Mar 25 12:03:51 ip-172-31-10-199 systemd[1]: Finished Redpanda Tuner.
```

Check that `rpk iotune` has been run on all hosts. Ensure that the mountpoint listed in the resulting configuration file matches the mountpoint for Redpanda's data directory, usually `/var/lib/redpanda`. See Generate optimal I/O configuration settings.

See also: Tune the Linux kernel for production, Tune Kubernetes Worker Nodes for Production

Input:

```bash
cat /etc/redpanda/io-config.yaml
```

Output:

```yaml
disks:
  - mountpoint: /mnt/vectorized
    read_iops: 413115
    read_bandwidth: 1882494592
    write_iops: 182408
    write_bandwidth: 788050688
```

### Check disk performance

Run `rpk cluster self-test status` to ensure that disk performance is within an acceptable range.

See also: Cluster Diagnostics

Input:

```bash
rpk cluster self-test status
```

Output:

```
NODE ID: 1 | STATUS: IDLE
=========================
NAME        512KB sequential r/w throughput disk test
INFO        write run
TYPE        disk
TEST ID     e13b2c93-2417-458b-87be-fac409089513
TIMEOUTS    0
DURATION    30000ms
IOPS        984 req/sec
THROUGHPUT  492.1MiB/sec
LATENCY     P50     P90     P99     P999    MAX
            4095us  4095us  4351us  4607us  5119us
```

### Advertised hostnames use correct interfaces

Check that the advertised hostnames are operating on the correct network interfaces. For clusters with multiple interfaces (for example, a public and a private IP address), set `advertised_kafka_api` to the public interface and `advertised_rpc_api` to the private interface. These should be hostnames, not IP addresses.

Example:

```bash
grep -A2 advertised /etc/redpanda/redpanda.yaml
```

```yaml
advertised_kafka_api:
  - address: myhostname.customdomain.com
    port: '9092'
advertised_rpc_api:
  address: myinternalhostname.customdomain.com
  port: '33145'
```

### Confirm Continuous Data Balancing configuration

Run `rpk cluster config get partition_autobalancing_mode` to ensure that Continuous Data Balancing is configured and enabled.

Input:

```bash
rpk cluster config get partition_autobalancing_mode
```

Output:

```
continuous
```

### Generate debug bundle

Check that you can generate a debug bundle from each host and upload it to Redpanda support. This is how you collect diagnostic data and export it to Redpanda support.

Input:

```bash
sudo rpk debug bundle
```

Output:

```
Creating bundle file...
Debug bundle saved to '1711372017-bundle.zip'
```

See also: rpk debug bundle, Diagnostics Bundles in Kubernetes

### Topic replication factor

Check that all topics have a replication factor greater than one.

Input:

```bash
rpk topic list
```

Output:

```
NAME  PARTITIONS  REPLICAS
bad   1           1
good  1           3
```

Redpanda Data recommends that you set `minimum_topic_replications` and `default_topic_replications` to at least 3:

```bash
rpk cluster config set minimum_topic_replications 3
rpk cluster config set default_topic_replications 3
```

See also: Change topic replication factor

### No brokers in maintenance mode

Check that no brokers are in maintenance mode.

Input:

```bash
rpk cluster maintenance status
```

Output:

```
NODE-ID  ENABLED  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
1        false    -         -       -           -         -             -
2        false    -         -       -           -         -             -
3        false    -         -       -           -         -             -
```

See also: Remove a broker from maintenance mode

### No brokers in decommissioned state

Check that no brokers are in a decommissioned state.

Input:

```bash
rpk redpanda admin brokers list
```

Output:

```
NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
0        1          active             true      v24.1.6 - 5e880f6fd1a610d0991b00e32c012a03b14888ca
1        1          active             true      v24.1.6 - 5e880f6fd1a610d0991b00e32c012a03b14888ca
2        1          active             true      v24.1.6 - 5e880f6fd1a610d0991b00e32c012a03b14888ca
```

See also: Decommission Brokers

## Level 2 production readiness

The Level 2 readiness checklist confirms that you can monitor and support your environment on a sustained basis. It includes the following checks:

- You have adhered to day-2 operations best practices.
- You can diagnose and recover from issues or failures.

### Environment configuration

Check that you have a development or test environment configured to evaluate upgrades and new versions before rolling them out to production.

### Monitoring

Check that monitoring is configured with Prometheus, Grafana, or Datadog to scrape metrics from all Redpanda brokers at a regular interval.
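Redpanda exposes metrics through its Admin API port (9644 by default) at the `/public_metrics` endpoint. A minimal Prometheus scrape job might look like the following sketch; the broker hostnames are placeholders:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: redpanda
    metrics_path: /public_metrics
    static_configs:
      - targets:
          - broker-0.example.com:9644
          - broker-1.example.com:9644
          - broker-2.example.com:9644
```

You can also use `rpk generate prometheus-config` to emit a scrape configuration for your cluster.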
### System log retention

Check that system logs are being captured and stored for an appropriate period of time (minimally, 7 days). On bare metal, this may be journald. On Kubernetes, you may need fluentd or an equivalent configured, with logs sent to a central location.

See also: rpk debug bundle

### Upgrade policy

Check that you have an upgrade policy defined and implemented. Redpanda Enterprise Edition supports rolling upgrades, so upgrades do not require downtime. However, make sure that upgrades are scheduled on a regular basis, ideally using automation such as Ansible or Helm.

### High availability

If you have high availability requirements, check that the cluster is configured across multiple availability zones or fault domains.

Input:

```bash
rpk cluster info
```

Output:

```
CLUSTER
=======
redpanda.be267958-279d-49cd-ae86-98fc7ed2de48

BROKERS
=======
ID    HOST           PORT  RACK
0*    54.70.51.189   9092  us-west-2a
1     35.93.178.18   9092  us-west-2b
2     35.91.121.126  9092  us-west-2c
```

Check that rack awareness is configured correctly.

Input:

```bash
rpk cluster config get enable_rack_awareness
```

Output:

```
true
```

See also: Multi-AZ deployments, Configure rack awareness in Kubernetes

## Level 3 production readiness

The Level 3 readiness checklist ensures full enterprise readiness. This indicates that your system is operating at the highest level of availability and can prevent or recover from the most serious incidents. Level 3 readiness confirms the following:

- You have proactive monitoring of mission-critical workloads, business continuity solutions, and integration with enterprise security systems.
- Your enterprise is ready to run mission-critical workloads.

### Configure alerts

A standard set of alerts for Grafana or Prometheus is provided in the Redpanda observability repo on GitHub. However, you should customize these alerts for your specific needs.

See also: Monitoring, Metrics

### Backup and disaster recovery (DR) solution

Check that you have a backup and disaster recovery (DR) solution in place. You can configure backup and restore using Tiered Storage Whole Cluster Recovery. Be sure to confirm that the backup and DR solution has been tested.

For disaster recovery, confirm that a standby cluster is configured and running with replication (such as MirrorMaker 2). Also verify that your monitoring confirms that MirrorMaker 2 is running and tracks replication traffic.

See High-availability deployment of Redpanda: Patterns and considerations for more details about HA and DR options.

### Deployment automation

Review your deployment automation. Specifically, if you need to reprovision a cluster, ensure that cluster installation is managed using automation such as Terraform, Ansible, or Helm, and that the configuration is saved in source control.

### Audit logs

Check that your audit logs are forwarded to an enterprise security information and event management (SIEM) system.

## Suggested reading

- Deploy for Production: Manual
- Deploy for Production: Automated