Perform a Rolling Restart

A rolling restart involves restarting one broker at a time while the remaining brokers in your cluster continue running. This minimizes downtime during a full cluster restart. Perform a rolling restart during operations such as configuration updates that require a restart, version upgrades, or cluster maintenance.

A rolling restart involves putting a broker into maintenance mode, restarting it, taking it out of maintenance mode, and then repeating the process on the next broker in the cluster. Placing brokers into maintenance mode ensures a smooth restart of your cluster while reducing the risk of interruption or degradation in service.

When a broker is placed into maintenance mode, it reassigns its partition leadership to other brokers for all topics that have a replication factor greater than one. Reassigning partition leadership involves draining leadership from the broker and transferring that leadership to another broker.

Check whether your topics have a replication factor greater than one. If you have topics with replication.factor=1, and if you have sufficient disk space, Redpanda Data recommends temporarily increasing the replication factor. This can help limit outages for these topics during the rolling restart. Do this before the restart to make sure there's time for the data to replicate to other brokers. For more information, see Change topic replication factor.
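As a rough sketch of the replication-factor check described above, the helper below filters a topic listing for topics with a single replica. It assumes that `rpk topic list` prints a header row followed by NAME, PARTITIONS, and REPLICAS columns; verify that layout against your rpk version. The function name `topics_with_rf1` is hypothetical and not part of rpk.

```shell
# Hypothetical helper: print the names of topics whose REPLICAS column is 1.
# Reads a topic listing on stdin. Assumes a header row followed by
# NAME  PARTITIONS  REPLICAS columns (an assumption about rpk's output layout).
topics_with_rf1() {
  awk 'NR > 1 && $3 == 1 { print $1 }'
}

# Usage (requires a reachable cluster):
#   rpk topic list | topics_with_rf1
```

Each topic this prints is a candidate for the temporary replication factor increase recommended above.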
Ensure that all brokers are active before restarting:

    rpk redpanda admin brokers list

All brokers should show active for MEMBERSHIP-STATUS and true for IS-ALIVE.

Example output:

    NODE-ID  NUM-CORES  MEMBERSHIP-STATUS  IS-ALIVE  BROKER-VERSION
    0        1          active             true      v22.3.11
    1        1          active             true      v22.3.11
    2        1          active             true      v22.3.11

Perform a rolling restart

Enable maintenance mode

1. Check that all brokers are healthy:

       rpk cluster health

   Example output:

       CLUSTER HEALTH OVERVIEW
       =======================
       Healthy:                     true (1)
       Controller ID:               0
       All nodes:                   [0 1 2] (2)
       Nodes down:                  [] (3)
       Leaderless partitions:       [] (3)
       Under-replicated partitions: [] (3)

   (1) The cluster is either healthy (true) or unhealthy (false).
   (2) The node IDs of all brokers in the cluster.
   (3) If the cluster is unhealthy, these fields will contain data.

2. Select a broker and place it into maintenance mode:

       rpk cluster maintenance enable <node-id> --wait

   The --wait option tells the command to wait until the given broker, 0 in this example, finishes draining all partitions it originally served. The command completes after the partition draining completes.

   Expected output:

       Successfully enabled maintenance mode for node 0
       Waiting for node to drain...

3. Verify that the broker is in maintenance mode:

       rpk cluster maintenance status

   Expected output:

       NODE-ID  DRAINING  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
       0        true      true      false   3           0         2             0
       1        false     false     false   0           0         0             0
       2        false     false     false   0           0         0             0

   The FINISHED column should read true for the broker that you put into maintenance mode.

4. Validate the health of the cluster again:

       rpk cluster health --watch --exit-when-healthy

   The combination of the --watch and --exit-when-healthy flags tells rpk to monitor the cluster health and exit only when the cluster is back in a healthy state. You can also evaluate metrics to determine cluster health.
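The maintenance status check from the steps above could be scripted along these lines. This is a minimal sketch, assuming the column order shown in the expected output (NODE-ID first, FINISHED third); the function name `node_drained` is hypothetical and not part of rpk.

```shell
# Hypothetical helper: succeed only if the given node ID shows FINISHED=true.
# Reads the output of `rpk cluster maintenance status` on stdin and assumes
# the column order NODE-ID DRAINING FINISHED ... from the example output.
node_drained() {
  awk -v id="$1" '
    NR > 1 && $1 == id { found = ($3 == "true") }  # skip header, match node row
    END { exit !found }                            # exit 0 only when drained
  '
}

# Usage (requires a reachable cluster):
#   rpk cluster maintenance status | node_drained 0 && echo "node 0 drained"
```

The helper returns a non-zero exit status both when the node has not finished draining and when the node ID does not appear in the output, so a script can treat either case as "do not restart yet".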
If the cluster has any issues, take the broker out of maintenance mode by running the following command before proceeding with other operations, such as decommissioning or retrying the rolling restart:

    rpk cluster maintenance disable <node-id>

Check metrics

Before continuing with the restart, check these important metrics to make sure the cluster is healthy and working as expected.

Metric name: redpanda_kafka_under_replicated_replicas
Description: Measures the number of under-replicated Kafka replicas. A non-zero value means replication is lagging. Zero means all replicas are replicated.
Recommendations: Pause the restart if the value is non-zero.

Metric name: redpanda_cluster_unavailable_partitions
Description: Represents the number of partitions that are currently unavailable. A value of zero indicates that all partitions are available. A non-zero value is the count of unavailable partitions.
Recommendations: Ensure the metric shows zero unavailable partitions before the restart.

Metric name: redpanda_kafka_request_bytes_total
Description: Total bytes processed for Kafka requests.
Recommendations: Ensure the produce and consume rate for each broker recovers to its pre-restart value.

Metric name: redpanda_kafka_request_latency_seconds
Description: Latency for processing Kafka requests. Indicates the delay between a Kafka request being initiated and completed.
Recommendations: Ensure the p99 histogram value recovers to its pre-restart level.

Metric name: redpanda_rpc_request_latency_seconds
Description: Latency for processing RPC requests. Shows the delay between an RPC request being initiated and completed.
Recommendations: Ensure the p99 histogram value recovers to its pre-restart level.

Metric name: redpanda_cpu_busy_seconds_total
Description: CPU utilization for a given second. The value is a decimal between 0.0 and 1.0. A value of 1.0 means that the CPU was busy for the entire second, operating at 100% capacity. A value of 0.5 means the CPU was busy for half the time (500 milliseconds) in the given second. A value of 0.0 means the CPU was idle for the entire second.
Recommendations: If you're seeing high values consistently, investigate the reasons. It could be due to high traffic or other system bottlenecks.

Restart the broker

Restart the broker's Redpanda service with rpk redpanda stop, then rpk redpanda start.

Disable maintenance mode

1. Take the broker out of maintenance mode:

       rpk cluster maintenance disable <node-id>

   Expected output:

       Successfully disabled maintenance mode for node 0

2. Ensure that the broker is no longer in maintenance mode:

       rpk cluster maintenance status

   Expected output:

       NODE-ID  DRAINING  FINISHED  ERRORS  PARTITIONS  ELIGIBLE  TRANSFERRING  FAILED
       0        false     false     false   0           0         0             0
       1        false     false     false   0           0         0             0
       2        false     false     false   0           0         0             0

Post-restart tasks

To verify that the cluster is running properly, run:

    rpk cluster health

To view additional information about your brokers, run:

    rpk redpanda admin brokers list

Impact of broker restarts

When brokers restart, clients may experience higher latency, nodes may experience CPU spikes when the broker becomes available again, and you may receive alerts about under-replicated partitions. Topics that weren't using replication (that is, topics that had replication.factor=1) will be unavailable.

Temporary increase in latency on clients (producers and consumers)

When you restart one or more brokers in a cluster, clients (consumers and producers) may experience higher latency due to partition leadership reassignment. Because clients must communicate with the leader of a partition, they may send a request to a broker whose leadership has been transferred, and receive NOT_LEADER_FOR_PARTITION. In this case, clients must request metadata from the cluster to find out the address of the new leader. Clients refresh their metadata periodically, or when they receive a retryable error that indicates the metadata may be stale. For example:

1. Broker A shuts down.
2. Client sends a request to broker A, and receives NOT_LEADER_FOR_PARTITION.
3. Client requests metadata, and learns that the new leader is broker B.
4. Client sends the request to broker B.
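The example flow above can be imitated with a toy simulation. This is only an illustration of the retry logic, not a real Kafka client: every name in it (`CLUSTER_LEADER`, `cached_leader`, `send_request`, `refresh_metadata`, `produce`) is hypothetical, and the "cluster" is a single shell variable.

```shell
# Toy simulation of the metadata refresh flow; no real Kafka client is involved.
# All names below are hypothetical and exist only for illustration.
CLUSTER_LEADER="B"   # after broker A shuts down, leadership has moved to B
cached_leader="A"    # the client's stale view of the partition leader

# A broker accepts a request only if it currently leads the partition.
send_request() {
  if [ "$1" = "$CLUSTER_LEADER" ]; then
    echo "OK"
  else
    echo "NOT_LEADER_FOR_PARTITION"
  fi
}

# Ask the cluster for fresh metadata (here: just read the current leader).
refresh_metadata() {
  cached_leader="$CLUSTER_LEADER"
}

# Send to the cached leader; on NOT_LEADER_FOR_PARTITION, refresh and retry once.
produce() {
  result=$(send_request "$cached_leader")
  if [ "$result" = "NOT_LEADER_FOR_PARTITION" ]; then
    refresh_metadata
    result=$(send_request "$cached_leader")
  fi
  echo "$result via broker $cached_leader"
}
```

Calling `produce` prints `OK via broker B`. The failed first attempt plus the metadata refresh is the extra round trip that clients observe as a temporary latency increase.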
CPU spikes upon broker restart

When a restarted broker becomes available again, you may see your nodes' CPU usage increase temporarily. This temporary increase in CPU usage is due to the cluster rebalancing the partition replicas.

Under-replicated partitions

When a broker is in maintenance mode, Redpanda continues to replicate updates to that broker. When a broker is taken offline during a restart, partitions with replicas on the broker could become out of sync until it is brought back online. Once the broker is available again, data is copied to its under-replicated replicas until all affected partitions are in sync with the partition leader.

Suggested reading

Monitor Redpanda