# Node-wise Partition Recovery

Multi-broker failures, entire availability zone (AZ) failures (especially in cloud environments), and some forms of human error can result in "stuck" partitions that have fewer replicas than required to form a quorum. In such failure scenarios, some data loss may be unavoidable. Node-wise partition recovery provides a way to unsafely recover at least a portion of your data from the remaining replicas, which are moved off the target brokers and allocated to healthy ones. In one step, this process repairs partitions while draining the target brokers of all partition replicas. This topic helps admins understand what they can and cannot recover using node-wise partition recovery.

> **Caution:** Only use this operation as a last resort, when all other recovery options have failed. In some cases, there may be no remaining replicas for the partitions on the dead brokers. This recovery method is intended for scenarios where you have already experienced data loss; its goal is to stop the loss of additional data.

## Perform the recovery operation

To start node-wise partition recovery, run `rpk cluster partitions unsafe-recover`. For example:

```bash
rpk cluster partitions unsafe-recover --from-nodes 1,3,5
```

Because this is a destructive operation, the command prompts you to confirm the generated recovery plan.

When you run node-wise partition recovery, the partitions on the affected brokers are rebuilt on a best-effort basis. When zero partition replicas survive, such as for a topic with a replication factor of 1 (RF=1), the recovery rebuilds empty partitions with no data (although you may be able to recover the partition from Tiered Storage). This allows producers to continue writing to the partition, even though no local data can be recovered in such situations.

The `--from-nodes` flag accepts a comma-separated list of the node IDs of the brokers you wish to recover data from. This example performs recovery operations on nodes 1, 3, and 5. Redpanda assesses these brokers to identify which partitions lack a majority, creates a plan to recover the impacted partitions, and prompts you for confirmation. You must respond `yes` to continue with recovery.

The `--dry` flag performs a dry run, which lets you view the recovery plan with no risk to your cluster.

When running node-wise partition recovery, more recent data (a higher offset) may be available in Tiered Storage if:

- Raft replication was stuck or slow before the node failure.
- Zero live replicas remain in the cluster (because the partition had a replication factor of one, RF=1).

For topics configured to use Tiered Storage, Redpanda also attempts to recover partition data from object storage, and it recovers the latest offset available for a partition in either storage tier (local or object storage). This allows the maximum amount of data to be recovered in all cases, even for topics with a replication factor of 1, where no replicas remain in local storage.

The recovery operation can take some time to complete, especially for large amounts of data. To monitor the status of the recovery operation in real time, run:

```bash
rpk cluster partitions balancer-status
```
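For a safer workflow, you can combine the `--dry` flag described above with periodic status checks. The following is a minimal sketch, assuming the node IDs from the earlier example and that the standard `watch` utility is available on the machine where you run `rpk`:

```bash
# Preview the recovery plan without changing the cluster (dry run).
rpk cluster partitions unsafe-recover --from-nodes 1,3,5 --dry

# Run the actual (destructive) recovery; you are prompted to confirm the plan.
rpk cluster partitions unsafe-recover --from-nodes 1,3,5

# Poll recovery progress every 10 seconds until no partitions are pending recovery.
watch -n 10 rpk cluster partitions balancer-status
```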
## Example recovery operations

The following example shows the node-wise partition recovery process in action:

```bash
$ rpk cluster partitions unsafe-recover --from-nodes 1
NAMESPACE  TOPIC  PARTITION  REPLICA-CORE  DEAD-NODES
kafka      bar    0          [1-1]         [1]
? Confirm recovery from these nodes? Yes
Executing recovery plan...
Successfully queued the recovery plan, you may check the status by running 'rpk cluster partitions balancer-status'

$ rpk cluster partitions balancer-status
Status:                           ready
Seconds Since Last Tick:          26
Current Reassignment Count:       0
Partitions Pending Recovery (1):  [kafka/bar/0]
```

The following example shows the status of moved partitions:

```bash
$ rpk cluster partitions move-status
PARTITION MOVEMENTS
===================
NAMESPACE-TOPIC   PARTITION  MOVING-FROM  MOVING-TO  COMPLETION-%  PARTITION-SIZE  BYTES-MOVED  BYTES-REMAINING
kafka/prod_tests  4          [045]        [045]      0             56204032205     0            56204032205
kafka/prod_tests  7          [045]        [045]      0             64607340009     0            64607340009
kafka/prod_tests  12         [014]        [014]      0             29074311639     0            29074311639
kafka/prod_tests  20         [014]        [014]      0             29673620476     0            29673620476
kafka/prod_tests  22         [045]        [045]      0             28471089141     0            28471089141
kafka/prod_tests  23         [045]        [045]      0             29692435312     0            29692435312
kafka/prod_tests  31         [014]        [014]      0             66982232299     0            66982232299
kafka/prod_tests  33         [014]        [014]      0             46329276747     0            46329276747
```
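Once `rpk cluster partitions balancer-status` reports no partitions pending recovery, you can optionally confirm that a recovered partition accepts traffic again. The following is a minimal sketch, assuming the example topic `bar` from above; verify flag names against your `rpk` version:

```bash
# Inspect the recovered topic's metadata and configuration.
rpk topic describe bar

# Write a test record to confirm producers can write to the recovered partition.
echo "post-recovery check" | rpk topic produce bar

# Read one record back (the -n/--num flag limits the number of records consumed).
rpk topic consume bar -n 1
```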