Node Maintenance Mode
Node maintenance mode enables you to take a Redpanda node offline temporarily while minimizing disruption to client operations. When a node is in maintenance mode, you can safely perform operations that require the Redpanda process to be temporarily stopped; for example, performing hardware or system maintenance, or performing a rolling upgrade.
A node cannot be decommissioned while it is in maintenance mode. Take the node out of maintenance mode first by running
When a node is placed in maintenance mode, if the replication factor is greater than one, it reassigns partition leadership to other nodes in the cluster. The node is not eligible for partition leadership again until it is taken out of maintenance mode.
The amount of time it takes to drain a node and reassign leadership depends on the number of partitions and how healthy the cluster is. For healthy clusters, draining leadership should take less than a minute. If the cluster is unhealthy (for example, a follower is not in sync with the leader), then draining the node can take even longer. Note that the draining process won’t start until the cluster is healthy.
When a node is in maintenance mode, Redpanda continues to replicate updates to that node. When the node is taken offline, partitions with replicas on the node could become out of sync until the node is brought back online. Once the node is available again, data is copied to under-replicated replicas on the node until all affected partitions are in sync with the leader.
Before placing a node in maintenance mode, you may want to temporarily disable or ignore alerts related to under-replicated partitions. To prevent under-replicated partitions altogether, you can move all partitions to other nodes.
To place a node into maintenance mode:
rpk cluster maintenance enable <node-id> --wait
--wait option ensures that
rpk waits until leadership is drained from the node before responding.
To remove a node from maintenance mode (and thus enable the node to start taking leadership of partitions):
rpk cluster maintenance disable <node-id>
To see the maintenance status of nodes in the cluster:
rpk cluster maintenance status
The output of this command identifies which nodes in the cluster are in the process of draining leadership, which are finished with that process, and whether any nodes had resulting errors.
One of the primary uses for maintenance mode is to perform a rolling upgrade of the cluster. This process involves putting a node into maintenance mode, upgrading the node, waiting to see if there are any issues, removing the node from maintenance mode, and then repeating the process on the next node in the cluster. Placing nodes into maintenance mode ensures a smooth upgrade of your cluster with reduced risk of interruption or degradation in service.
To perform a rolling upgrade:
Check to ensure all nodes are healthy by running
rpk cluster health.
Select a non-upgraded node and place it into maintenance mode using
rpk cluster maintenance enable <node-id> --wait. This may take some time to complete.
Confirm the node is in maintenance mode by running
rpk cluster maintenance statusand confirming that
Validate the health of the cluster by running
rpk cluster healthand using external metrics, such as consumer lag. If you encounter any issues, take the node out of maintenance mode by running
rpk cluster maintenance disable <node-id>.
If there are no issues, perform a node upgrade by following the version upgrade process specific to your deployment model. The basic steps include node shutdown, upgrade, and restart.
Take the node out of maintenance mode by running
rpk cluster maintenance disable <node-id>.
rpk cluster maintenance statusto ensure that the node is ready.
Repeat this process on another node in the cluster.
|These steps are for a deployment that is not controlled by the Kubernetes Operator. With the Kubernetes Operator, maintenance mode is automatically employed during a Redpanda upgrade as described here.|
How do I use maintenance mode for rolling upgrades with a Kubernetes cluster?
If you are using the Redpanda Kubernetes Operator to trigger an upgrade, then node maintenance mode is automatically leveraged as part of the upgrade process.
What happens if a node is put into maintenance mode when it hosts a partition with a replica count of 1?
Maintenance mode only transfers leadership; it does not move any partitions to other nodes in the cluster. In the case where there is a partition with the sole replica lying on that node, you should either manually reassign the partition to another node, or increase the replication count to ensure that the partition is available for Apache Kafka clients.
How long does it take to drain a node?
The amount of time it takes to drain a node depends on the number of partitions and how healthy the cluster is. For healthy clusters (
acks = all) and small partition sizes, draining leadership should take less than a minute. If partition sizes are large, it may take a few minutes. If the cluster is unhealthy (for example, a follower is lagging behind), then draining the node can take even longer. Note that the draining process won’t start until the cluster is healthy.