Production Readiness Checklist
Before running a production workload on Redpanda, follow this readiness checklist to ensure that you’re set up for success. Redpanda Data recommends using the automated deployment instructions with Ansible. If you cannot deploy with Ansible, use the manual deployment instructions.
Level 1 production readiness
The Level 1 readiness checklist helps you to confirm that:
-
All required defaults and configuration items are specified.
-
You have the optimal hardware setup.
-
Security is enabled.
-
You are set up to run in production.
Redpanda license
Check that the Redpanda License has been loaded into the cluster configuration. This is required to enable Enterprise features.
rpk cluster license info
LICENSE INFORMATION
===================
Organization: Redpanda Owlshop LLC
Type: enterprise
Expires: Mar 25 2025
Cluster health
Check that all brokers are connected and running. Run rpk cluster info
to check the health of the cluster. No nodes should be down, and there should be zero leaderless or under-replicated partitions. Then run rpk cluster health
. The cluster should be listed as healthy.
rpk cluster info
CLUSTER
=======
redpanda.be267958-279d-49cd-ae86-98fc7ed2de48
BROKERS
=======
ID HOST PORT RACK
0* 54.70.51.189 9092 us-west-2a
1 35.93.178.18 9092 us-west-2b
2 35.91.121.126 9092 us-west-2c
rpk cluster health
CLUSTER HEALTH OVERVIEW
=======================
Healthy: true
Unhealthy reasons: []
Controller ID: 0
All nodes: [0 1 2]
Nodes down: []
Leaderless partitions (0): []
Under-replicated partitions (0): []
Production mode enabled
Check that Redpanda is running in production mode. To check the status of a Redpanda broker, check its broker configuration in /etc/redpanda/redpanda.yaml
. Both developer_mode
and overprovisioned
should be false
or should not be present in the file. If either configuration is set to true
on any broker, then the cluster is not in full production mode and must be corrected.
grep -E 'developer_mode|overprovisioned' /etc/redpanda/redpanda.yaml
developer_mode: false
overprovisioned: false
System meets Redpanda requirements
Run sudo rpk redpanda check
to ensure that your system meets Redpanda’s requirements.
This command requires sudo because it’s looking in /proc or /sys , which may be read restricted.
|
sudo rpk redpanda check
System check results
CONDITION REQUIRED CURRENT SEVERITY PASSED
Ballast file present true true Warning true
Clock Source tsc tsc Warning true
Config file valid true true Fatal true
Connections listen backlog size >= 4096 4096 Warning true
Data directory filesystem type xfs xfs Warning true
Data directory is writable true true Fatal true
Data partition free space [GB] >= 10 1755.29 Warning true
Dir '/var/lib/redpanda/data' IRQs affinity set true true Warning true
Dir '/var/lib/redpanda/data' IRQs affinity static true true Warning true
Dir '/var/lib/redpanda/data' nomerges tuned true true Warning true
Dir '/var/lib/redpanda/data' scheduler tuned true true Warning true
Free memory per CPU [MB] 2048 per CPU 7659 Warning true
Fstrim systemd service and timer active true true Warning true
I/O config file present true true Warning true
Kernel Version 3.19 5.15.0-1056-aws Warning true
Max AIO Events >= 1048576 1048576 Warning true
Max syn backlog size >= 4096 4096 Warning true
NIC IRQs affinity static true true Warning true
NTP Synced true true Warning true
RFS Table entries >= 32768 32768 Warning true
Swap enabled true true Warning true
Swappiness 1 1 Warning true
Transparent huge pages active true true Warning true
Latest Redpanda version
Check that Redpanda is running the latest point release on every node for the major version you’re on.
/usr/bin/redpanda --version
24.3.8 - 1deb708
Correct CPUs and memory configured
Check that you have the correct number of CPUs and sufficient memory to run Redpanda.
journalctl -u redpanda | grep "System resources"
Mar 25 12:16:18 ip-172-31-10-199 rpk[3957]: INFO 2024-03-25 12:16:18,105 [shard 0:main] main - application.cc:350 - System resources: { cpus: 8, available memory: 55.578GiB, reserved memory: 3.890GiB}
Disks correctly mounted
Check that the correct disks are mounted, and if multiple devices are used, they are configured as RAID-0. Other RAID configurations can have significantly worse latencies. The file system should be type XFS. If XFS is unavailable, ext4 is an appropriate alternative.
grep data_directory /etc/redpanda/redpanda.yaml
data_directory: /var/lib/redpanda/data
df -khT /var/lib/redpanda/data
Filesystem Type Size Used Avail Use% Mounted on
/dev/nvme0n1 xfs 1.8T 14G 1.8T 1% /mnt/vectorized
Filesystem Type Size Used Avail Use% Mounted on
/dev/md0 xfs 14T 99G 14T 1% /mnt/vectorized
Example for how to get more details about the RAID array:
mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Apr 18 11:03:46 2024
Raid Level : raid0
Array Size : 14648172544 (13969.59 GiB 14999.73 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Apr 18 11:03:46 2024
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Layout : -unknown-
Chunk Size : 512K
Consistency Policy : none
Name : ip-172-31-24-82:0 (local to host ip-172-31-24-82)
UUID : e9574118:10d562bf:ed3ca2d9:68ccc3a6
Events : 0
Number Major Minor RaidDevice State
0 259 2 0 active sync /dev/nvme2n1
1 259 0 1 active sync /dev/nvme1n1
Use these results to verify that the expected disks are present and the expected RAID level is set. (Typically, this would be raid0
in a production system, as data resilience is provided by Raft across Redpanda brokers, rather than by RAID.)
Authentication enabled
Check that authentication is set up (or other mitigations are in place). Without SASL authentication enabled, anybody can potentially connect to the Redpanda brokers.
rpk cluster config get kafka_enable_authorization
true
Superusers configured
Check that the Admin API is secured, and any users defined in the superusers configuration are appropriately protected with strong credentials.
See also: Create superusers
TLS enabled
Check that all public interfaces have TLS enabled.
journalctl -u redpanda.service | grep tls
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,513 [shard 0:main] main - application.cc:772 - redpanda.cloud_storage_disable_tls:0 - Disable TLS for all S3 connections
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,514 [shard 0:main] main - application.cc:772 - redpanda.kafka_mtls_principal_mapping_rules:{nullopt} - Principal Mapping Rules for mTLS Authentication on the Kafka API
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,514 [shard 0:main] main - application.cc:772 - **redpanda.admin_api_tls:{{name: , tls_config: { enabled: 1** key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 }}} - TLS configuration for admin HTTP server
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - **redpanda.kafka_api_tls:{{name: , tls_config: { enabled: 1** key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 }}} - TLS configuration for Kafka API endpoint
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - **redpanda.rpc_server_tls:{ enabled: 1** key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for RPC server
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - pandaproxy.pandaproxy_api_tls:{} - TLS configuration for Pandaproxy api
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - **pandaproxy_client.broker_tls:{ enabled: 1** key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for the brokers
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - **schema_registry.schema_registry_api_tls:{{name: , tls_config: { enabled: 1** key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 }}} - TLS configuration for Schema Registry API
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - **schema_registry_client.broker_tls:{ enabled: 1** key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for the brokers
Jun 06 12:41:35 ip-172-31-31-199 rpk[9673]: INFO 2024-06-06 12:41:35,515 [shard 0:main] main - application.cc:772 - audit_log_client.broker_tls:{ enabled: 1 key/cert files: {{ key_file: /etc/redpanda/certs/node.key cert_file: /etc/redpanda/certs/node.crt }} ca file: {/etc/redpanda/certs/truststore.pem} client_auth_required: 0 } - TLS configuration for the brokers
Using the logs on each broker, check to verify that the following interfaces have TLS enabled:
-
Kafka API
-
Admin REST API
-
Internal RPC Server
-
Schema Registry
-
HTTP Proxy (Pandaproxy)
In the logs, verify enabled: 1
.
See also: Multiple listeners
Run Redpanda tuners
Check that you have run tuners on all cluster hosts. This can have a significant impact on latency and throughput. Redpanda tuners ensure that the operating system is configured for optimal performance. In Kubernetes, you may need to run the tuners on the hosts themselves, rather than in containers.
systemctl status redpanda-tuner
redpanda-tuner.service - Redpanda Tuner
Loaded: loaded (/lib/systemd/system/redpanda-tuner.service; enabled; vendor preset: enabled)
Active: active (exited) since Mon 2024-03-25 12:03:51 UTC; 48min ago
Process: 3795 ExecStart=/usr/bin/rpk redpanda tune all $CPUSET (code=exited, status=0/SUCCESS)
Main PID: 3795 (code=exited, status=0/SUCCESS)
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: cpu true true true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_irq true true true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_nomerges true true true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_scheduler true true true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: disk_write_cache false true false Disk write cache tuner is only supported in GCP
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: fstrim false false true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: net true true true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: swappiness true true true
Mar 25 12:03:51 ip-172-31-10-199 rpk[3795]: transparent_hugepages false false true
Mar 25 12:03:51 ip-172-31-10-199 systemd[1]: Finished Redpanda Tuner.
Check that rpk iotune
has been run on all hosts. Ensure that the mountpoint listed in this configuration file matches the mountpoint for Redpanda’s data directory, usually /var/lib/redpanda
. See Generate optimal I/O configuration settings.
See also:
cat /etc/redpanda/io-config.yaml
disks:
- mountpoint: /mnt/vectorized
read_iops: 413115
read_bandwidth: 1882494592
write_iops: 182408
write_bandwidth: 788050688
Check disk performance
Run rpk cluster self-test status
to ensure that disk performance is within an acceptable range.
See also: Cluster Diagnostics
rpk cluster self-test status
NODE ID: 1 | STATUS: IDLE
=========================
NAME 512KB sequential r/w throughput disk test
INFO write run
TYPE disk
TEST ID e13b2c93-2417-458b-87be-fac409089513
TIMEOUTS 0
DURATION 30000ms
IOPS 984 req/sec
THROUGHPUT 492.1MiB/sec
LATENCY P50 P90 P99 P999 MAX
4095us 4095us 4351us 4607us 5119us
Advertised hostnames use correct interfaces
Check that the advertised hostnames are operating on the correct network interfaces. For clusters with multiple interfaces (for example, a public and private IP address), set advertised_kafka_api
to the public interface and set advertised_rpc_api
to the private interface. These should be hostnames, not IP addresses.
grep -A2 advertised /etc/redpanda/redpanda.yaml
advertised_kafka_api:
- address: myhostname.customdomain.com
port: '9092'
advertised_rpc_api:
address: myinternalhostname.customdomain.com
port: '33145'
Confirm Continuous Data Balancing configuration
Run rpk cluster config get partition_autobalancing_mode
to ensure that Continuous Data Balancing is configured and enabled.
rpk cluster config get partition_autobalancing_mode
continuous
Generate debug bundle
Check that you can generate a debug bundle from each host and upload it to Redpanda support. This is how you can collect data and export it to Redpanda support.
sudo rpk debug bundle
Creating bundle file...
Debug bundle saved to '1711372017-bundle.zip'
See also:
Topic replication factor
Check that all topics have a replication factor greater than one.
rpk topic list
NAME PARTITIONS REPLICAS
bad 1 1
good 1 3
Redpanda Data recommends that you set minimum_topic_replications
and default_topic_replications
to at least 3.
rpk cluster config set minimum_topic_replications=3
rpk cluster config set default_topic_replications=3
See also: Change topic replication factor
No brokers in maintenance mode
Check that no brokers are in maintenance mode.
rpk cluster maintenance status
NODE-ID ENABLED FINISHED ERRORS PARTITIONS ELIGIBLE TRANSFERRING FAILED
1 false - - - - - -
2 false - - - - - -
3 false - - - - - -
See also: Remove a broker from maintenance mode
No brokers in decommissioned state
Check that no brokers are in a decommissioned state.
rpk redpanda admin brokers list
NODE-ID NUM-CORES MEMBERSHIP-STATUS IS-ALIVE BROKER-VERSION
0 1 active true v24.1.6 - 5e880f6fd1a610d0991b00e32c012a03b14888ca
1 1 active true v24.1.6 - 5e880f6fd1a610d0991b00e32c012a03b14888ca
2 1 active true v24.1.6 - 5e880f6fd1a610d0991b00e32c012a03b14888ca
See also: Decommission Brokers
Level 2 production readiness
The Level 2 readiness checklist confirms that you can monitor and support your environment on a sustained basis. It includes the following checks:
-
You have adhered to 2-day operations best practices.
-
You can diagnose and recover from issues or failures.
Environment configuration
Check that you have a development environment or test environment configured to evaluate upgrades and new versions before rolling them straight to production.
Monitoring
Check that monitoring is configured with Prometheus, Grafana, or Datadog to scrape metrics from all Redpanda brokers at a regular interval.
System log retention
Check that system logs are being captured and stored for an appropriate period of time (minimally, 7 days). On bare metal, this may be journald. On Kubernetes you may need to have fluentd or an equivalent configured, with logs sent to a central location.
See also: rpk debug bundle
Upgrade policy
Check that you have an upgrade policy defined and implemented. Redpanda Enterprise Edition supports rolling upgrades, so upgrades do not require downtime. However, make sure that upgrades are scheduled on a regular basis, ideally using automation such as Ansible or Helm.
High availability
If you have high availability requirements, check that the cluster is configured across multiple availability zones or fault domains.
rpk cluster info
CLUSTER
=======
redpanda.be267958-279d-49cd-ae86-98fc7ed2de48
BROKERS
=======
ID HOST PORT RACK
0* 54.70.51.189 9092 us-west-2a
1 35.93.178.18 9092 us-west-2b
2 35.91.121.126 9092 us-west-2c
Check that rack awareness is configured correctly.
rpk cluster config get enable_rack_awareness
true
See also:
Level 3 production readiness
The Level 3 readiness checklist ensures full enterprise readiness. This indicates that your system is operating at the highest level of availability and can prevent or recover from the most serious incidents. The Level 3 readiness confirms the following:
-
You are proactively monitoring mission-critical workloads, business continuity solutions, and integration into enterprise security systems.
-
Your enterprise is ready to run mission-critical workloads.
Configure alerts
A standard set of alerts for Grafana or Prometheus is provided in the GitHub Redpanda observability repo. However, you should customize these alerts for your specific needs.
See also: Monitoring Metrics
Backup and disaster recovery (DR) solution
Check that you have a backup and disaster recovery (DR) solution in place. You can configure backup and restore using Tiered Storage Whole Cluster Recovery.
Be sure to confirm that the backup and DR solution has been tested.
For disaster recovery, confirm that a standby cluster is configured and running with replication (such as MirrorMaker2). Also verify that your monitoring ensures that MirrorMaker2 is running and checks replication traffic. See High-availability deployment of Redpanda: Patterns and considerations for more details about HA and DR options.
Audit logs
Check that your audit logs are forwarded to an enterprise security information and event management (SIEM) system.