How Redpanda Works

At its core, Redpanda is a fault-tolerant transaction log for storing event streams. Producers and consumers interact with Redpanda using the Kafka API. To achieve high scalability, producers and consumers are fully decoupled. Redpanda provides strong guarantees to producers that events are stored durably within the system, and consumers can subscribe to Redpanda and read the events asynchronously.

Redpanda achieves this decoupling by organizing events into topics. Topics represent a logical grouping of events that are written to the same log. A topic can have multiple producers writing events to it and multiple consumers reading events from it.
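For example, with the Python confluent_kafka client (the broker address, topic, and group names here are assumptions, not values from this page), a producer writes events through the Kafka API and moves on, while a consumer subscribes independently and reads them at its own pace:

    from confluent_kafka import Producer, Consumer

    BROKER = "localhost:9092"   # assumed Redpanda broker address
    TOPIC = "sensor-events"     # hypothetical topic

    # Producer: writes events durably and never talks to consumers.
    producer = Producer({"bootstrap.servers": BROKER})
    producer.produce(TOPIC, key="sensor-42", value="temp=21.3")
    producer.flush()  # block until the broker acknowledges the event

    # Consumer: subscribes on its own schedule and reads asynchronously.
    consumer = Consumer({
        "bootstrap.servers": BROKER,
        "group.id": "analytics",          # hypothetical consumer group
        "auto.offset.reset": "earliest",  # start from the oldest event
    })
    consumer.subscribe([TOPIC])
    msg = consumer.poll(timeout=5.0)
    if msg is not None and msg.error() is None:
        print(msg.key(), msg.value())
    consumer.close()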

This page provides details about how Redpanda works. For a high-level overview, see Introduction to Redpanda.

Tiered Storage

Redpanda Tiered Storage is a multi-tiered storage solution that offloads log segments to object storage in near real time. Tiered Storage can be combined with local storage to provide long-term data retention and disaster recovery on a per-topic basis.

Consumers that read from more recent offsets continue to read from local storage, and consumers that read from historical offsets read from object storage, all with the same API. Consumers can read and reread events from any point within the maximum retention period, whether the events reside on local or object storage.
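As an illustration, a consumer can be assigned an old offset explicitly; whether that offset still resides on local disk or has been offloaded makes no difference to the client code. A minimal sketch with the Python confluent_kafka client (broker address, topic, and offset are assumptions):

    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "group.id": "replay-job",               # hypothetical group
    })

    # Read partition 0 of the topic from offset 0 (its oldest event).
    # If that range has been offloaded, Redpanda fetches it from object
    # storage transparently; the client code does not change.
    consumer.assign([TopicPartition("sensor-events", 0, 0)])

    msg = consumer.poll(timeout=10.0)
    if msg is not None and msg.error() is None:
        print(f"offset={msg.offset()} value={msg.value()}")
    consumer.close()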

As data in object storage grows, so does its metadata. To support efficient long-term data retention, Redpanda splits this metadata: it keeps metadata for only recently updated segments in memory or on local disk, archives the rest in object storage, and caches the archived portion locally on disk. Archived metadata is loaded only when historical data is accessed. This allows Tiered Storage to handle partitions of virtually any size or retention length.
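A toy model of this split, in Python (the names and structures are purely illustrative; Redpanda's actual metadata format differs):

    # Illustrative two-tier metadata store; not Redpanda's implementation.
    archive = {}      # stand-in for metadata archived in object storage
    local_cache = {}  # archived metadata cached on local disk after use
    hot = {}          # in-memory metadata for recently updated segments

    def add_segment(segment: str, meta: dict, max_hot: int = 1000) -> None:
        """Track a new segment; spill the oldest hot entries to the archive."""
        hot[segment] = meta
        while len(hot) > max_hot:
            oldest = next(iter(hot))
            archive[oldest] = hot.pop(oldest)

    def lookup(segment: str) -> dict:
        """Resolve metadata: memory first, then local cache, then archive."""
        if segment in hot:
            return hot[segment]
        if segment not in local_cache:               # only historical reads
            local_cache[segment] = archive[segment]  # pay the fetch cost
        return local_cache[segment]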

Partitions

To scale topics, Redpanda shards them into one or more partitions that are distributed across the nodes in a cluster. This allows for concurrent writing and reading from multiple nodes. When producers write to a topic, they route events to one of the topic’s partitions. Events with the same key (like a stock ticker) are always routed to the same partition, and Redpanda guarantees the order of events at the partition level. Consumers read events from a partition in the order that they were written. If a key is not specified, then events are distributed across the topic’s partitions in a round-robin fashion.
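For example, with the Python confluent_kafka client (topic and key names are made up), a delivery callback reports the partition each event landed on; events sharing a key consistently report the same partition:

    from confluent_kafka import Producer

    def on_delivery(err, msg):
        if err is None:
            print(f"key={msg.key()} -> partition {msg.partition()}")

    producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed

    # All events keyed "ACME" hash to the same partition, preserving their
    # relative order; unkeyed events are spread across partitions instead.
    for price in ("101.5", "101.7", "101.2"):
        producer.produce("trades", key="ACME", value=price,
                         on_delivery=on_delivery)
    producer.flush()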

Raft consensus algorithm

Redpanda provides strong guarantees for data safety and fault tolerance. Events written to a topic-partition are appended to a log file on disk. They can be replicated to other nodes in the cluster and appended to their copies of the log file on disk to prevent data loss in the event of failure. The Raft consensus algorithm is used for data replication.

Every topic-partition forms a Raft group consisting of a single elected leader and zero or more followers (as specified by the topic’s replication factor). A Raft group of 2f+1 nodes can tolerate f failures. For example, in a cluster with five nodes and a topic with a replication factor of five, the topic remains fully operational if two nodes fail.
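The arithmetic behind this is simple, as the following sketch shows: an n-member group needs a majority of n // 2 + 1 acknowledgements to commit, so it can lose the remaining (n - 1) // 2 members and keep operating:

    def majority(n: int) -> int:
        """Replicas that must acknowledge a write in an n-member group."""
        return n // 2 + 1

    def tolerated_failures(n: int) -> int:
        """Replicas that can fail while the group stays available."""
        return (n - 1) // 2

    for n in (1, 3, 5):
        print(f"replicas={n} majority={majority(n)} "
              f"tolerated={tolerated_failures(n)}")
    # replicas=1 majority=1 tolerated=0
    # replicas=3 majority=2 tolerated=1
    # replicas=5 majority=3 tolerated=2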

Raft is a majority vote algorithm. For a leader to acknowledge that an event has been committed to a partition, a majority of its replicas must have written that event to their copy of the log. When a majority (quorum) of responses have been received, the leader can make the event available to consumers and acknowledge receipt of the event when acks=all (-1). Producer acknowledgement settings define how producers and leaders communicate their status while transferring data.
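On the client side, this corresponds to the acks producer setting. A sketch with the Python confluent_kafka client (broker address, topic, and key are assumptions): with acks=all, the delivery callback fires only after a quorum of replicas has written the event:

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "acks": "all",  # leader waits for a quorum before acknowledging
    })

    def on_delivery(err, msg):
        # With acks=all, success means a majority of replicas have the
        # event durably written, not just the leader.
        print("failed" if err else f"committed at offset {msg.offset()}")

    producer.produce("payments", key="order-17", value="captured",
                     on_delivery=on_delivery)
    producer.flush()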

As long as the leader and a majority of the replicas are stable, Redpanda can tolerate disturbances in a minority of the replicas. If gray failures cause a minority of replicas to respond more slowly than normal, the leader does not have to wait for their responses to make progress, and the extra latency is not passed on to clients. As a result, Redpanda is less sensitive to faults and delivers predictable performance.
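The latency benefit falls out of waiting for only the fastest majority, as in this toy asyncio model (the replica latencies are invented): one replica suffering a gray failure never delays the commit:

    import asyncio

    async def replicate(replica: str, latency: float) -> str:
        await asyncio.sleep(latency)  # simulated append + ack round trip
        return replica

    async def commit(event: str) -> None:
        # Five replicas; r5 is suffering a gray failure (10x slower).
        lat = {"r1": 0.01, "r2": 0.02, "r3": 0.015, "r4": 0.012, "r5": 0.2}
        tasks = [asyncio.create_task(replicate(r, s)) for r, s in lat.items()]
        quorum = len(tasks) // 2 + 1

        acked = []
        for fut in asyncio.as_completed(tasks):
            acked.append(await fut)
            if len(acked) >= quorum:  # quorum reached: acknowledge now,
                break                 # without waiting for the slow minority
        print(f"{event} committed after acks from {acked}")

    asyncio.run(commit("order-17"))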

Partition leadership elections

Raft uses a heartbeat mechanism to maintain leader authority and to trigger leader elections. The partition leader sends a periodic heartbeat (default = 150 milliseconds) to all followers to assert its leadership in the current term. A term is a period of arbitrary length that starts when a leader election is triggered. If a follower does not receive a heartbeat within the election timeout (default = 1.5 seconds), it triggers an election to choose a new partition leader. The follower increments its term and votes for itself to be the leader for that term. It then sends a vote request to the other nodes and waits for one of the following scenarios:

  • It receives a majority of votes and becomes the leader. Raft guarantees that at most one candidate can be elected the leader for a given term.

  • Another node establishes itself as the leader. While waiting for votes, the candidate may receive communication from another node in the group claiming to be the leader. The candidate accepts the claim only if the claimed leader’s term is greater than or equal to the candidate’s own term; otherwise, it rejects the communication and continues to wait for votes.

  • No leader is elected over a period of time. If multiple followers time out and become candidates at the same time, it’s possible that no candidate gets a majority of votes. When this happens, each candidate increments its term and triggers a new election round. Raft uses a randomized timeout between 150 and 300 milliseconds to ensure that split votes are rare and resolved quickly.

As long as the heartbeat interval is much shorter than the election timeout, which in turn is much shorter than the mean time between node failures (MTBF), Raft can elect and maintain a steady leader and make progress. With the default settings, a leader sends ten heartbeats to each follower within every 1.5-second election timeout window; as long as a follower receives at least one of them, the leader keeps its position. Otherwise, a new leader is elected.
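The timing rules above can be condensed into a small Python state machine (the intervals follow the defaults quoted in this section; everything else is illustrative):

    import random

    HEARTBEAT_INTERVAL = 0.150  # leader heartbeat period (seconds)
    ELECTION_TIMEOUT = 1.5      # follower patience before campaigning

    class Follower:
        def __init__(self) -> None:
            self.term = 1
            self.last_heartbeat = 0.0

        def on_heartbeat(self, now: float, leader_term: int) -> None:
            if leader_term >= self.term:  # accept current-or-newer terms only
                self.term = leader_term
                self.last_heartbeat = now

        def tick(self, now: float) -> None:
            if now - self.last_heartbeat > ELECTION_TIMEOUT:
                self.term += 1  # become a candidate: new term, vote for self
                print(f"t={now:.2f}s: election triggered for term {self.term}")
                # Randomized spacing keeps simultaneous candidacies rare.
                self.last_heartbeat = now + random.uniform(0.150, 0.300)

    follower = Follower()
    # Ten heartbeats fit in one election timeout, so losing a few of them
    # is harmless; a leader that goes silent triggers an election.
    for step in range(30):
        now = step * HEARTBEAT_INTERVAL
        if step < 10 and step % 3 == 0:  # lossy but live leader ...
            follower.on_heartbeat(now, leader_term=1)
        follower.tick(now)               # ... then silence: election fires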

If a follower triggers an election and the incumbent leader subsequently comes back to life and starts sending data again, it’s too late: as part of the election process, the follower (now a candidate) has incremented its term and rejects requests from the previous term, effectively forcing a leadership change. If a cluster experiences wider network infrastructure problems that push latencies above the heartbeat timeout, back-to-back election rounds can be triggered. During this period, unstable Raft groups may not be able to form a quorum. Affected partitions reject writes, but data previously written to disk is not lost. Redpanda has a Raft-priority implementation that allows the system to settle quickly after network outages.

Controller partition and snapshots

Redpanda stores metadata update commands (such as creating and deleting topics or users) in a system partition called the controller partition. A new snapshot is created after each controller command is added, or, with rapid updates, after a set period of time (default is 60 seconds). Controller snapshots save the current cluster metadata state to disk, so startup is fast. For example, with a partition that has moved several times, a snapshot can restore the latest state without replaying every move command.
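A toy model makes the benefit concrete: restoring state means loading the latest snapshot and replaying only the commands appended after it, not the whole command log (the structures below are illustrative, not Redpanda’s snapshot format):

    # Toy controller state machine: state is a map of topic -> config,
    # and the log is a list of metadata update commands.
    log = [("create_topic", "orders"), ("create_topic", "users"),
           ("delete_topic", "users"), ("create_topic", "events")]

    def apply(state: dict, cmd: tuple) -> None:
        op, topic = cmd
        if op == "create_topic":
            state[topic] = {}
        elif op == "delete_topic":
            state.pop(topic, None)

    # Snapshot taken after the first three commands: state + log position.
    snapshot = ({"orders": {}}, 3)

    def restore() -> dict:
        state, applied = snapshot
        state = dict(state)
        for cmd in log[applied:]:  # replay only the tail, not the full log
            apply(state, cmd)
        return state

    print(restore())  # {'orders': {}, 'events': {}}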

Each broker has a snapshot file stored in the controller log directory, such as /var/lib/redpanda/data/redpanda/controller/0_0/snapshot. The controller partition is replicated by a Raft group that includes all cluster brokers, and the controller snapshot is the Raft snapshot for this group. Snapshots are hydrated when a broker joins the cluster or restarts. Snapshots are enabled by default for all clusters, both new and upgraded.

Optimized platform performance

Redpanda is designed to exploit advances in modern hardware, from the network down to the disks. Network bandwidth has increased considerably, especially to and from object storage, and spinning disks have been replaced by SSDs that deliver far better I/O performance. CPUs are faster too, largely because of higher core counts rather than higher single-core speeds. Redpanda ships with tuners that detect your hardware configuration and automatically optimize the system.

Examples of platform and kernel features that Redpanda uses to optimize its performance:

  • Direct Memory Access (DMA) for disk I/O

  • Sparse file system support with XFS

  • Distribution of interrupt request (IRQ) processing between CPU cores

  • Isolated processes with control groups (cgroups)

  • Disabled CPU power-saving modes

  • Upfront memory allocation, partitioned and pinned to CPU cores

Thread-per-core model

Redpanda implements a thread-per-core programming model through its use of the Seastar library, which allows it to pin each application thread to a CPU core. It combines this with structured message passing (SMP) to communicate asynchronously between the pinned threads. Together, these techniques avoid the overhead of context switching and expensive locking operations, improving processing performance and efficiency.
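Redpanda itself is C++ on Seastar, but the shape of the model can be approximated in miniature. In this Python sketch (Linux-only, because it relies on os.sched_setaffinity; the sharding scheme is invented for illustration), each worker is pinned to one core, owns its state outright, and exchanges messages instead of sharing locks:

    import os
    import queue
    import threading

    def shard(core: int, inbox: queue.Queue) -> None:
        # Pin this worker thread to one core; it then owns its state
        # exclusively and never takes a lock to touch it.
        os.sched_setaffinity(0, {core})
        owned_state = {}
        while True:
            msg = inbox.get()  # messages in, messages out: no shared
            if msg is None:    # mutable state between shards
                return
            key, value = msg
            owned_state[key] = value

    inboxes = [queue.Queue() for _ in range(2)]
    threads = [threading.Thread(target=shard, args=(core, inbox))
               for core, inbox in enumerate(inboxes)]
    for t in threads:
        t.start()

    # Route each message to the shard that owns its key; cross-shard
    # communication happens only by passing messages.
    for key, value in [("a", 1), ("b", 2), ("c", 3)]:
        inboxes[hash(key) % len(inboxes)].put((key, value))
    for inbox in inboxes:
        inbox.put(None)  # shutdown signal
    for t in threads:
        t.join()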

From a sizing perspective, Redpanda’s ability to efficiently use all available hardware enables it to scale up to get the most out of your infrastructure, before you’re forced to scale out to meet the demands of your workload. Redpanda delivers better performance with a smaller footprint, resulting in reduced operational costs and complexity.