Kubernetes Cluster Requirements and Recommendations

This topic provides the requirements and recommendations for provisioning Kubernetes clusters and worker nodes for running Redpanda in production.

Operating system

  • RHEL/CentOS: minimum required version 8. Recommended: 9 or later.

  • Ubuntu: minimum required version 20.04 LTS. Recommended: 22.04 or later.

Enabling SELinux (Security-Enhanced Linux) can cause latency issues. To avoid them, run Redpanda on nodes where SELinux is disabled.

Recommendation: Linux kernel 4.19 or later for better performance.
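
As a minimal sketch of how the kernel recommendation could be checked on a node, the following compares a kernel release string against 4.19. The function names are illustrative and not part of any Redpanda tooling. The parser assumes the release string starts with major.minor, as in the output of uname -r on common distributions.

```python
import re

# Recommended minimum kernel version from this section.
MIN_KERNEL = (4, 19)


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_meets_recommendation(release: str) -> bool:
    """Return True if the release is 4.19 or later."""
    return parse_kernel_release(release) >= MIN_KERNEL
```

On a Linux node, the release string can be obtained with platform.release() from the Python standard library and passed to kernel_meets_recommendation().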

Kubernetes version

Minimum required Kubernetes version: 1.25.0-0

Make sure to do the following:

Helm version

Minimum required Helm version: 3.10.0
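
The Kubernetes and Helm minimums above can be verified with a simple version comparison. The sketch below is an illustration only: the helper names are hypothetical, and it deliberately ignores pre-release and build suffixes (for example, the "-0" in 1.25.0-0 or a provider suffix such as "-gke.100"), which is a simplification rather than full semantic-version handling.

```python
import re

# Minimum versions stated in this section.
MIN_KUBERNETES = (1, 25, 0)
MIN_HELM = (3, 10, 0)


def parse_version(text: str) -> tuple:
    """Parse strings like 'v1.27.3' or '3.10.0' into (major, minor, patch).

    Anything after the patch number (pre-release or build metadata) is ignored.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Return True if the version is at least the given minimum."""
    return parse_version(version) >= minimum
```

For example, meets_minimum("v1.24.17", MIN_KUBERNETES) is False, while meets_minimum("v3.10.0", MIN_HELM) is True.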

Number of nodes

Provision one physical node or virtual machine (VM) for each Redpanda broker that you plan to deploy in your Redpanda cluster. Each Redpanda broker requires its own dedicated node for the following reasons:

  • Resource isolation: Redpanda brokers are designed to make full use of available system resources, including CPU and memory. By dedicating a node to each broker, you ensure that these resources aren’t shared with other applications or processes, avoiding potential performance bottlenecks or contention.

  • External networking: External clients must connect directly to the broker that leads the partition they want to read from or write to, so each broker must be individually addressable. Assigning each broker to its own dedicated node makes this direct addressing feasible, because each node has a unique address. See External networking.

  • Fault tolerance: Ensuring each broker operates on a separate node enhances fault tolerance. If one node experiences issues, it won’t directly impact the other brokers.

The Redpanda Helm chart configures podAntiAffinity rules to make sure that each Redpanda broker runs on its own node.
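
To confirm that a running cluster actually satisfies the one-broker-per-node rule, the placement of broker Pods can be inspected. The following sketch assumes you have already collected a mapping from Pod name to node name (for example, from the output of kubectl get pods -o wide). The function name and the sample Pod and node names are hypothetical.

```python
from collections import defaultdict


def nodes_with_multiple_brokers(pod_to_node: dict) -> dict:
    """Return {node: [pods]} for every node that hosts more than one broker Pod."""
    pods_by_node = defaultdict(list)
    for pod, node in pod_to_node.items():
        pods_by_node[node].append(pod)
    return {
        node: sorted(pods)
        for node, pods in pods_by_node.items()
        if len(pods) > 1
    }


# Hypothetical placement: two brokers ended up on node-a.
placement = {
    "redpanda-0": "node-a",
    "redpanda-1": "node-b",
    "redpanda-2": "node-a",
}
print(nodes_with_multiple_brokers(placement))
```

An empty result means every broker runs on its own node, which is the outcome the podAntiAffinity rules are meant to guarantee.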

Node maintenance and operating system upgrades

Manage node and operating system (OS) upgrades manually when running Redpanda in production. Manual control avoids unplanned reboots or node replacements, which can disrupt Redpanda brokers and cause service downtime, data loss, or quorum instability.

Limitations of automatic updates

Redpanda is stateful. Redpanda brokers manage partition data and leadership, making them sensitive to disruptions. Proper handling during maintenance is required to:

  • Avoid data loss, especially for nodes with ephemeral or local storage.

  • Ensure smooth leadership transitions by decommissioning brokers before removing a node.

  • Minimize service downtime by upgrading nodes one at a time during planned maintenance windows.

However, automatic update mechanisms provided by cloud platforms may not meet Redpanda’s stateful requirements. Common issues include:

  • Hard timeouts for graceful shutdowns that may not allow Redpanda brokers enough time to complete decommissioning or leadership transitions.

  • Replacements or reboots without ensuring data has been safely migrated or replicated, risking data loss.

  • Parallel upgrades across multiple nodes, which can disrupt quorum or reduce cluster availability.
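
The risk of parallel upgrades follows from majority-quorum arithmetic: a replicated group stays available only while more than half of its replicas are reachable. The sketch below illustrates this with hypothetical helper names. The replica counts in the examples are assumptions for illustration, not values taken from this topic.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def has_quorum(replicas: int, unavailable: int) -> bool:
    """Return True if the remaining replicas still form a majority."""
    return replicas - unavailable >= majority(replicas)


# With three replicas, one node can be down at a time, but not two.
print(has_quorum(3, 1))
print(has_quorum(3, 2))
```

Under these assumptions, upgrading nodes one at a time keeps a three-replica group available, whereas taking two nodes down in parallel does not.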

Recommendations:

CPU and memory

Requirements:

  • Two physical, not virtual, cores for each node.

  • x86_64 (Westmere or newer) and AWS Graviton family processors are supported.

  • 2 GB or more of memory per core.

  • 4 MB of memory for each topic partition replica. You can enforce this requirement in the tunable topic_memory_per_partition property.
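
The memory requirements above reduce to simple arithmetic. The following sketch uses hypothetical helper names to compute the minimum memory for a given core count and the memory implied by a given number of topic partition replicas. It takes the 2 GB and 4 MB figures directly from this section and does not account for any other memory consumers on the node.

```python
MIN_CORES = 2                        # minimum cores for each node
MEMORY_PER_CORE_GB = 2               # minimum memory for each core
MEMORY_PER_PARTITION_REPLICA_MB = 4  # memory for each topic partition replica


def minimum_memory_gb(cores: int) -> int:
    """Minimum node memory, in GB, for the given number of cores."""
    return cores * MEMORY_PER_CORE_GB


def partition_memory_mb(partition_replicas: int) -> int:
    """Memory, in MB, implied by the given number of partition replicas."""
    return partition_replicas * MEMORY_PER_PARTITION_REPLICA_MB


def check_node(cores: int, memory_gb: float) -> list:
    """Return a list of unmet requirements (empty if the node qualifies)."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"needs at least {MIN_CORES} cores, has {cores}")
    if memory_gb < minimum_memory_gb(cores):
        problems.append(
            f"needs at least {minimum_memory_gb(cores)} GB for {cores} cores, "
            f"has {memory_gb} GB"
        )
    return problems
```

For example, a node with 8 cores needs at least 16 GB of memory, and 1,000 partition replicas on a broker imply 4,000 MB of memory for partitions.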

Recommendations:

Storage

Requirements:

  • An XFS or ext4 file system.

    The Redpanda data directory (/var/lib/redpanda/data) and the Tiered Storage cache must be mounted on an XFS or ext4 file system.

    For information about supported volume types for different data in Redpanda, see Supported Volume Types for Data in Redpanda.

    The Network File System (NFS) is not supported for the Redpanda data directory or for the Tiered Storage cache.

  • A default StorageClass that can provision PersistentVolumes with at least 20Gi of storage.
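
One way to check the file system requirement on a Linux node is to look up the mount that backs the data directory in /proc/mounts. The sketch below is illustrative: the function names and the sample mount table are hypothetical, and the parser assumes the standard /proc/mounts layout of device, mount point, and file system type in the first three fields.

```python
import posixpath

SUPPORTED_FILE_SYSTEMS = {"xfs", "ext4"}

# Hypothetical mount table in /proc/mounts format, used for illustration.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/nvme1n1 /var/lib/redpanda xfs rw,noatime 0 0
server:/export /mnt/nfs nfs4 rw 0 0
"""


def filesystem_for_path(path: str, mounts_text: str) -> str:
    """Return the file system type of the deepest mount point containing path."""
    path = posixpath.normpath(path)
    best_point, best_type = "", ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Later entries win on ties because a later mount shadows an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    if not best_type:
        raise ValueError(f"no mount found for {path!r}")
    return best_type


def is_supported(fs_type: str) -> bool:
    """Return True for the file systems this section lists as supported."""
    return fs_type in SUPPORTED_FILE_SYSTEMS


print(filesystem_for_path("/var/lib/redpanda/data", SAMPLE_MOUNTS))
```

On a real node, pass the contents of /proc/mounts instead of the sample text. Inside a Pod, the path to check depends on where the PersistentVolume is mounted.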

Recommendations:

  • Use an XFS file system for its enhanced performance with Redpanda workloads.

  • For setups with multiple disks, use a RAID-0 (striped) array. Striping improves throughput but provides no redundancy: if any disk in the array fails, the data on the array is lost.

  • Use local PersistentVolumes backed by NVMe disks.
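
The RAID-0 trade-off mentioned above can be made concrete with a small calculation. Because a striped array fails if any one of its disks fails, the chance of losing the array grows with the number of disks. The sketch below assumes independent disk failures with the same probability over some period. The function name and the probability used in the example are hypothetical.

```python
def raid0_loss_probability(per_disk_failure_probability: float, disks: int) -> float:
    """Probability that at least one disk in a striped array fails.

    Assumes each disk fails independently with the same probability.
    """
    if not 0.0 <= per_disk_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if disks < 1:
        raise ValueError("an array needs at least one disk")
    return 1.0 - (1.0 - per_disk_failure_probability) ** disks


# Illustrative comparison: the same disks in arrays of two and four.
print(raid0_loss_probability(0.02, 2))
print(raid0_loss_probability(0.02, 4))
```

This is one reason to rely on replication across brokers, rather than on the local array, for data durability.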

Security

Recommendations:

  • If you’re using a cloud platform, use IAM roles to restrict access to resources in your cluster.

  • Secure your Redpanda cluster with TLS encryption and SASL authentication.

External networking

  • For external access, each node in your cluster must have a static, externally accessible IP address.

  • A network connection of at least 10 GigE (10 Gigabit Ethernet) to ensure:

    • High data throughput

    • Reduced data transfer latency

    • Scalability for increased network traffic

Tuning

Before deploying Redpanda to production, each node that runs Redpanda must be tuned to optimize the Linux kernel for Redpanda processes.

Object storage providers for Tiered Storage

Redpanda supports the following storage providers for Tiered Storage:

  • Amazon Simple Storage Service (S3)

  • Google Cloud Storage (GCS), using the Google Cloud Platform S3 API

  • Azure Blob Storage (ABS)

Cloud instance types

Recommendations:

  • Use a cloud instance type that supports locally attached NVMe devices with an XFS file system. NVMe devices offer high I/O operations per second (IOPS) and minimal latency, while XFS offers enhanced performance with Redpanda workloads.

Amazon

EKS defaults to the ext4 file system. Use XFS instead where possible.

  • General purpose: General-purpose instances provide a balance of compute, memory, and networking resources, and they can be used for a variety of diverse workloads.

  • Memory optimized: Memory-optimized instances are designed to deliver fast performance for workloads that process large data sets in memory.

  • Storage optimized: Storage-optimized instances are designed for workloads that require high, sequential read and write access to very large data sets on local storage. They are optimized to deliver tens of thousands of low-latency, random IOPS to applications.

  • Compute optimized: Compute-optimized instances deliver cost-effective high performance at a low price per compute ratio for running advanced compute-intensive workloads.

Azure

AKS often defaults to the ext4 file system. Use XFS instead where possible.

Google

GKE often defaults to the ext4 file system. Use XFS instead where possible.

  • General purpose: The general-purpose machine family has the best price-performance with the most flexible vCPU to memory ratios, and provides features that target most standard and cloud-native workloads.

  • Memory optimized: The memory-optimized machine family provides the most compute and memory resources of any Compute Engine machine family offering. It is ideal for workloads that require higher memory-to-vCPU ratios than the high-memory machine types in the general-purpose N1 machine series.

  • Compute optimized: Compute-optimized VM instances are ideal for compute-intensive and high-performance computing (HPC) workloads.