Sizing Use Cases

The following scenarios provide estimates and advice for sizing Redpanda clusters for different throughput and retention use cases in your data center and in object storage. For details about sizing considerations, see Sizing Guidelines.

These use cases assume a happy path with known metrics and expected outputs, but many other factors can influence performance, such as batch size and other sources of network traffic.

Low throughput

| Metric | Value |
|---|---|
| Producer throughput | 75 MB/sec (600 Mbps) |
| Producer rate | 300 messages per second |
| Consumer throughput | 75 MB/sec (600 Mbps) |
| Consumer rate | 300 messages per second |
| Data retention | 3 days |
| Average message size | 250 KB |
| Failure tolerance | 1 node |

In this use case, despite the relatively low throughput of 150 MB/sec (producer plus consumer), it’s important to calculate the expected bandwidth utilization and to use a network testing tool like iPerf to verify that the bandwidth is available and sustainable. With a single topic with a replication factor of three, producing 75 MB/sec generates an additional 150 MB/sec of data transmitted over the network for replication, and it generates a further 75 MB/sec for the consumers.

The 150 MB/sec of bandwidth for replication is full duplex (each byte sent by a broker is received by some other broker). The 75 MB/sec producer and consumer flows, however, are half duplex, because the client endpoint in each case is outside of the cluster. Therefore, the total bandwidth across the cluster is 225 MB/sec for incoming flows and 225 MB/sec for outgoing flows:

  • 150 MB/sec of intra-cluster full duplex bandwidth

  • 75 MB/sec of ingress from producers

  • 75 MB/sec of egress to consumers

Three nodes satisfy Redpanda’s minimum deployment requirement (so Raft can form quorums) and also the single node failure tolerance. Divide the bandwidth total by the node count (3) to get the per-node bandwidth requirements. The throughput is not high enough to warrant any more than two cores and a single NVMe SSD disk. Be mindful of predicted growth of CPU and disk usage, and estimate when the cluster might need to scale up or scale out.
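The bandwidth arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of Redpanda: the function name `cluster_bandwidth` and its parameters are hypothetical, and the sketch assumes that load is spread evenly across the nodes.

```python
def cluster_bandwidth(producer_mb_s, consumer_mb_s, replication_factor, nodes):
    """Estimate cluster-wide and per-node network flows in MB/sec.

    Hypothetical helper for illustration; assumes evenly distributed load.
    """
    # Each produced byte is copied to (replication_factor - 1) followers.
    replication = producer_mb_s * (replication_factor - 1)
    # Replication traffic is both sent and received inside the cluster,
    # so it counts toward the incoming and the outgoing totals.
    incoming = producer_mb_s + replication  # producer ingress + replica receives
    outgoing = consumer_mb_s + replication  # consumer egress + replica sends
    return {
        "replication": replication,
        "incoming": incoming,
        "outgoing": outgoing,
        "incoming_per_node": incoming / nodes,
        "outgoing_per_node": outgoing / nodes,
    }


low = cluster_bandwidth(producer_mb_s=75, consumer_mb_s=75,
                        replication_factor=3, nodes=3)
print(low)
```

For the low throughput scenario, this gives 225 MB/sec in each direction across the cluster, or 75 MB/sec (600 Mbps) per node in each direction.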

With an average producer throughput of 75 MB/sec and a replication factor of three, each of the three nodes stores a replica of every partition, so each node writes 270 GB of data each hour and roughly 6.5 TB of data each day. For three days of data retention, each node needs at least 20 TB of storage.
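As a rough check of the storage estimate, the following sketch derives the per-node write volume from the producer throughput. The helper `storage_estimate` is hypothetical, it uses decimal units (1 GB = 1,000 MB), and it assumes replicas are spread evenly across nodes with no allowance for filesystem overhead or headroom.

```python
def storage_estimate(producer_mb_s, replication_factor, nodes, retention_days):
    """Return (GB per hour, TB per day, TB for the retention period) per node.

    Hypothetical helper; decimal units, evenly distributed replicas.
    """
    # Every produced byte is written replication_factor times across the cluster.
    node_write_mb_s = producer_mb_s * replication_factor / nodes
    gb_per_hour = node_write_mb_s * 3600 / 1000
    tb_per_day = gb_per_hour * 24 / 1000
    return gb_per_hour, tb_per_day, tb_per_day * retention_days


gb_hour, tb_day, tb_retained = storage_estimate(
    producer_mb_s=75, replication_factor=3, nodes=3, retention_days=3)
print(f"{gb_hour:.0f} GB/hour, {tb_day:.2f} TB/day, {tb_retained:.2f} TB retained")
```

The three-day total comes to just under 20 TB per node, which is why the estimate rounds up to 20 TB.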

This assumes that each node could be a leader or a follower, and there are a sufficient number of partitions for good distribution. A typical node is the leader for 1/Nth of the partitions in a cluster with N nodes and a follower for 2/Nths of the partitions. However, the per node bandwidth could vary if distribution is uneven. You may have an inexact distribution of load during Redpanda partition balancing or when the client library doesn’t write to each partition evenly.

The following machine specifications provide a minimum for a bare metal cluster or its cloud-based equivalent.

| | Bare Metal | AWS | GCP | Azure |
|---|---|---|---|---|
| Instance Type | - | m5.large | n2-standard-2 | F2s_v2 |
| Nodes | 3 | 3 | 3 | 3 |
| Cores | 2 | 2 | 2 | 2 |
| Memory | 4 GB | 8 GB | 8 GB | 4 GB |
| Instance Storage | 20 TB (NVMe) | - | - | 16 GB (SSD) |
| Persistent Storage | - | 20 TB (gp3) | 20 TB (Zonal SSD PD) | 20 TB (Standard SSD) |
| Network | 4 Gbps | Up to 10 Gbps | 10 Gbps | 5 Gbps |
| Tiered Storage | False | False | False | False |

Medium throughput

| Metric | Value |
|---|---|
| Producer throughput | ~500 MB/sec (~4,000 Mbps) |
| Producer rate | 2,000 messages per second |
| Consumer throughput | ~1,000 MB/sec (~8,000 Mbps) |
| Consumer rate | 4,000 messages per second |
| Data retention | 24 hours |
| Average message size | 250 KB |
| Failure tolerance | 1 node |

Producing an average of 500 MB/sec and consuming an average of 1,000 MB/sec equates to 2,500 MB/sec (20 Gbps) of network bandwidth once replication traffic is included (an additional 1,000 MB/sec with a replication factor of three). This is attainable but expensive with cloud providers, and these speeds are not as prevalent within a typical data center.
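The 2,500 MB/sec total can be reproduced with a short calculation. The function `total_network` below is a hypothetical helper, and it assumes a replication factor of three, so that each produced byte is sent to two followers.

```python
def total_network(producer_mb_s, consumer_mb_s, replication_factor=3):
    """Return total network traffic as (MB/sec, Gbps).

    Hypothetical helper; assumes every topic uses the same replication factor.
    """
    replication_mb_s = producer_mb_s * (replication_factor - 1)
    total_mb_s = producer_mb_s + replication_mb_s + consumer_mb_s
    # 1 MB/sec = 8 Mbps, and 1 Gbps = 1,000 Mbps.
    return total_mb_s, total_mb_s * 8 / 1000


medium_mb_s, medium_gbps = total_network(producer_mb_s=500, consumer_mb_s=1000)
print(medium_mb_s, "MB/sec =", medium_gbps, "Gbps")
```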

With at least one partition for each core, the 500 MB/sec of data from producers is evenly distributed between the nodes. For example, with three nodes, each node receives approximately 167 MB/sec. However, that bandwidth value increases with data replication.

| Producer (MB/sec) | Consumer (MB/sec) | Avg. Replication Factor | Nodes | Writes per node (MB/sec) | Reads per node (MB/sec) |
|---|---|---|---|---|---|
| 500 | 1,500 | 3 | 3 | 500/3 * 3 = 500 | 1500/3 = 500 |
| 500 | 1,500 | 3 | 5 | 500/5 * 3 = 300 | 1500/5 = 300 |
| 500 | 1,500 | 3 | 7 | 500/7 * 3 = 215 | 1500/7 = 215 |
| 500 | 1,500 | 5 | 7 | 500/7 * 5 = 358 | 1500/7 = 215 |
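The per-node figures in this table follow one formula for writes and one for reads. The sketch below reproduces them; `per_node_load` is a hypothetical name, and rounding up to the next whole MB/sec is an assumption chosen to match the table's values.

```python
import math


def per_node_load(producer_mb_s, consumer_mb_s, replication_factor, nodes):
    """Return (writes, reads) per node in MB/sec, rounded up.

    Hypothetical helper; assumes partitions and leaders are evenly balanced.
    """
    # Every produced byte is written once per replica, spread over all nodes.
    writes = math.ceil(producer_mb_s * replication_factor / nodes)
    # Reads are served once, spread over all nodes.
    reads = math.ceil(consumer_mb_s / nodes)
    return writes, reads


rows = [(3, 3), (3, 5), (3, 7), (5, 7)]  # (replication factor, nodes)
results = [per_node_load(500, 1500, rf, n) for rf, n in rows]
for (rf, n), (writes, reads) in zip(rows, results):
    print(f"RF={rf} nodes={n}: writes={writes} MB/sec, reads={reads} MB/sec")
```

Adding nodes lowers the per-node load, while raising the replication factor raises the write load without changing the read load.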

The consumer figure of 1,500 MB/sec in the table is 500 MB/sec higher than the 1,000 MB/sec of client consumer throughput. The additional 500 MB/sec is for Tiered Storage: the bandwidth required to archive log segments to object storage. When Tiered Storage is enabled on a topic, it essentially adds another consumer’s worth of bandwidth on the network.

To balance the available local disk, consider exactly how many reads can be serviced from local storage. Different instance types or locally attached NVMe SSD disks provide different amounts of local storage, and therefore different amounts of available data without going back to object storage.
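One way to estimate how long data stays on local disk is to divide the local storage capacity by the per-node write rate. The sketch below is an approximation under stated assumptions: `local_retention_hours` is a hypothetical helper, it uses decimal units (1 TB = 1,000,000 MB), it ignores filesystem overhead and free-space headroom, and it takes a per-node write rate of 500 MB/sec (three nodes, replication factor of three).

```python
def local_retention_hours(local_storage_tb, node_write_mb_s):
    """Hours of data a node can keep locally before its disk fills.

    Hypothetical helper; decimal units, no overhead or headroom.
    """
    seconds_to_fill = local_storage_tb * 1_000_000 / node_write_mb_s
    return seconds_to_fill / 3600


for capacity_tb in (30, 15, 9):
    hours = local_retention_hours(capacity_tb, node_write_mb_s=500)
    print(f"{capacity_tb} TB of local storage: about {round(hours)} hours")
```

Reads for offsets older than this window are served from object storage rather than from local disk.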

A topic with Tiered Storage enabled can write data to faster local storage managed by local retention settings, and at the same time, it can write data to object storage managed by different retention settings, or left to grow for a longer period. Consumers that generally keep up with producers stream from local storage, but at this velocity that window of opportunity is narrower. The object store enables a consumer to read from an older offset when necessary.

| | Bare Metal | AWS | GCP | Azure |
|---|---|---|---|---|
| Instance Type | - | i3en.6xlarge | n2-standard-32 | F48s_v2 |
| Nodes | 3 | 3 | 3 | 3 |
| Cores | 24 | 24 | 32 | 48 |
| Memory | 192 GB | 192 GB | 128 GB | 96 GB |
| Instance Storage | 30 TB (NVMe) | 15 TB (NVMe) | 9 TB (SSD) | 384 TB (SSD) |
| Persistent Storage | - | - | - | 20 TB (Standard SSD) |
| Available Local Retention | 17 hrs | 8 hrs | 5 hrs | 9 days |
| Network | 25 Gbps | 25 Gbps | 32 Gbps | 21 Gbps |
| Tiered Storage | True | True | True | True |

High throughput

| Metric | Value |
|---|---|
| Producer throughput | 1,000 MB/sec (8,000 Mbps) |
| Producer rate | 4,000 messages per second |
| Consumer throughput | 2,000 MB/sec (16,000 Mbps) |
| Consumer rate | 8,000 messages per second |
| Data retention | 24 hours |
| Average message size | 250 KB |
| Failure tolerance | 2 nodes |

This use case has many topics, hundreds of partitions, and a high throughput. The combined producer and replication data equates to 8 Gbps of network traffic, plus 16 Gbps for the consumers and 8 Gbps for Tiered Storage. In total, that’s at least 32 Gbps of network bandwidth required to sustain this level of throughput. Writing at 1,000 MB/sec is near the upper limit of what a single NVMe disk can sustain.
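The 32 Gbps total is the sum of the three flows named above, converted from MB/sec to Gbps. The sketch below uses the figures as stated in the text; the dictionary keys and the helper `mb_s_to_gbps` are hypothetical names, and the Tiered Storage flow is assumed to equal the producer throughput (one extra consumer's worth of bandwidth).

```python
def mb_s_to_gbps(mb_s):
    """Convert MB/sec to Gbps (1 MB/sec = 8 Mbps, 1 Gbps = 1,000 Mbps)."""
    return mb_s * 8 / 1000


# Flows in MB/sec, taken from the high throughput scenario.
flows_mb_s = {
    "producer": 1000,
    "consumer": 2000,
    "tiered_storage": 1000,  # assumed equal to producer throughput
}

flows_gbps = {name: mb_s_to_gbps(rate) for name, rate in flows_mb_s.items()}
total_gbps = sum(flows_gbps.values())
print(flows_gbps, "total:", total_gbps, "Gbps")
```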

At this scale, you get significant performance gains by distributing the writes over many cores and disks to better leverage Redpanda’s thread-per-core model. For example, given five nodes with 24 cores each, start with at least one partition for each core (120 partitions in total) and scale up. Redpanda generates over 3 TB of writes each hour and over 80 TB each day. Local storage is going to fill up quickly, and the window of opportunity for consumers to read from local storage is going to be shorter than in the other scenarios. In this use case, Tiered Storage is essential.
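The starting partition count and the write volume quoted above can be checked with a short calculation. Both helpers are hypothetical, the write volume counts producer data only (before replication), and decimal units are assumed (1 TB = 1,000,000 MB).

```python
def min_partitions(nodes, cores_per_node):
    """Starting point of one partition per core across the cluster."""
    return nodes * cores_per_node


def write_volume_tb(producer_mb_s):
    """Return (TB per hour, TB per day) of producer data, before replication."""
    tb_per_hour = producer_mb_s * 3600 / 1_000_000
    return tb_per_hour, tb_per_hour * 24


partitions = min_partitions(nodes=5, cores_per_node=24)
tb_hour, tb_day = write_volume_tb(producer_mb_s=1000)
print(f"{partitions} partitions, {tb_hour:.1f} TB/hour, {tb_day:.1f} TB/day")
```

Treat the partition count as a floor rather than a target; the right number depends on the topics and consumer groups in use.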

| | Bare Metal | AWS | GCP | Azure |
|---|---|---|---|---|
| Instance Type | - | i3en.12xlarge | n2-standard-48 | F48s_v2 |
| Nodes | 5 | 5 | 5 | 5 |
| Cores | 24 | 48 | 48 | 48 |
| Memory | 192 GB | 384 GB | 192 GB | 96 GB |
| Instance Storage | 30 TB (NVMe) | 30 TB (NVMe) | 9 TB (SSD) | 384 TB (SSD) |
| Persistent Storage | - | - | - | 30 TB (Ultra SSD) |
| Available Local Retention | 14 hrs | 7 hrs | 4 hrs | 7 days |
| Network | 25 Gbps | 25 Gbps | 32 Gbps | 21 Gbps |
| Tiered Storage | True | True | True | True |