Node maintenance mode enables you to bring a Redpanda node into a state where you can safely perform operations like an upgrade with reduced risk of data loss or downtime.
When a node is placed in maintenance mode, it relinquishes leadership of any partitions, drains that leadership to other nodes in the cluster, and ensures that the node cannot gain partition leadership until maintenance mode is disabled.
Note that maintenance mode doesn’t move any partitions out of the node, and Redpanda will continue to replicate follower partition data to nodes in maintenance mode. If the node becomes unreachable, then the cluster could accumulate under-replicated partitions until the node becomes available or it is decommissioned.
Using maintenance mode
Before placing a node in maintenance mode, ensure that you have replicas of partitions on other nodes in the cluster. Alternatively, you can manually reassign partitions to other nodes. You may also want to temporarily disable or ignore alerts related to under-replicated partitions.
To place a node into maintenance mode:
rpk cluster maintenance enable <node-id> --wait
--wait option ensures that
rpk waits until leadership draining is complete before responding.
To remove a node from maintenance mode (and thus enable the node to start taking leadership of partitions):
rpk cluster maintenance disable <node-id>
To see the maintenance status of nodes in the cluster:
rpk cluster maintenance status
The output of this command identifies which nodes in the cluster are in the process of draining leadership, which are finished with that process, and whether any nodes had resulting errors.
Using maintenance mode to perform a rolling upgrade
One of the primary uses for maintenance mode is to perform a rolling upgrade of the cluster. This process involves putting a node into maintenance mode, upgrading the node, waiting to see if there are any issues, removing the node from maintenance mode, and then repeating the process on the next node in the cluster. Placing nodes into maintenance mode ensures a smooth upgrade of your cluster with reduced risk of interruption or degradation in service.
To perform a rolling upgrade:
- Check to ensure all nodes are healthy by running
rpk cluster health.
- Select a non-upgraded node and place it into maintenance mode using
rpk cluster maintenance enable <node-id> --wait. This may take some time to complete.
- Confirm the node is in maintenance mode by running
rpk cluster maintenance statusand confirming that
- Validate the health of the cluster by running
rpk cluster healthand using external metrics, such as consumer lag. If you encounter any issues, take the node out of maintenance mode by running
rpk maintenance mode disable <node-id>.
- If there are no issues, perform a node upgrade by following the version upgrade process specific to your deployment model. The basic steps include node shutdown, upgrade, and restart.
- Take the node out of maintenance mode by running
rpk cluster maintenance disable <node-id>.
rpk cluster maintenance statusto ensure that the node is ready.
- Repeat this process on another node in the cluster.
These steps are for a deployment that is not controlled by the Kubernetes Operator. With the Kubernetes Operator, maintenance mode is automatically employed during a Redpanda upgrade as described here.
Frequently Asked Questions
How do I use maintenance mode for rolling upgrades with a Kubernetes cluster?
If you are using the Redpanda Kubernetes Operator to trigger an upgrade, then node maintenance mode is automatically leveraged as part of the upgrade process.
What happens if a node is put into maintenance mode when it hosts a partition with a replica count of 1?
Maintenance mode only transfers leadership; it does not move any partitions to other nodes in the cluster. In the case where there is a partition with the sole replica lying on that node, you should either manually reassign the partition to another node, or increase the replication count to ensure that the partition is available for Apache Kafka clients.
How long does it take to drain a node?
The amount of time it takes to drain a node depends on the number of partitions and how healthy the cluster is. For healthy clusters (
acks = all) and small partition sizes, draining leadership should take less than a minute. If partition sizes are large, it may take a few minutes. If the cluster is unhealthy (for example, a follower is lagging behind), then draining the node can take even longer. Note that the draining process won’t start until the cluster is healthy.