Recovery Mode

Recovery mode allows you to repair and restore a failed cluster that cannot start normally due to issues such as system crashes or out-of-memory (OOM) errors. In recovery mode, Redpanda limits functionality to cluster configuration changes and other manual administrative actions so that you can repair the cluster.

Enabled functionality

In recovery mode, Redpanda enables the following functionality so that you can repair the cluster:

  • Kafka API

    • Modify topic properties

    • Delete topics

    • Add and remove access control lists (ACLs)

    • Edit consumer group metadata

  • Admin API

    • Edit cluster configuration properties

    • Add and remove users

    • Add new brokers to the cluster

    • Delete WASM transforms

Disabled functionality

In recovery mode, Redpanda disables the following functionality to provide a more stable environment for troubleshooting issues and restoring the cluster to a usable state.

  • The following APIs are disabled because some connections, especially malicious ones, can disrupt availability for all users, including admin users:

    • Kafka API (fetch and produce requests)

    • HTTP Proxy

    • Schema Registry

  • The following node-wide and cluster-wide processes are disabled as they may disrupt recovery operations:

    • Partition and leader balancers

    • Tiered Storage housekeeping

    • Tiered Storage cache management

    • Compaction

  • Redpanda does not load user-managed partitions from disk. This avoids triggering the partition leadership elections and replication that can occur on startup.

Prerequisites

You must have the following:

  • rpk installed.

  • Local access to machines running Redpanda.

Start Redpanda in recovery mode

A broker can enter recovery mode only at startup, not while it is already running. First set the broker configuration property that enables recovery mode, and then restart the broker.

  1. Run the rpk redpanda mode recovery command to set the recovery_mode_enabled broker configuration property to true.

    rpk redpanda mode recovery

    Enable recovery mode for all brokers. Although you can start a mixed-mode cluster, where some brokers are in recovery mode and others are not, this is not recommended.

  2. Restart the brokers.

  3. Check whether the cluster has entered recovery mode:

    rpk cluster health

    You should see a list of brokers that are in recovery mode. For example:

    CLUSTER HEALTH OVERVIEW
    =======================
    Healthy:                          true
    Unhealthy reasons:                []
    Controller ID:                    0
    All nodes:                        [0 1 2]
    Nodes down:                       []
    Nodes in recovery mode:           [0 1 2]
    Leaderless partitions (0):        []
    Under-replicated partitions (0):  []
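
The check in step 3 can also be scripted. The following is a minimal sketch that reads `rpk cluster health` text output on standard input and succeeds only when every broker listed under "All nodes" also appears under "Nodes in recovery mode". The function names are hypothetical, and the parsing assumes the field labels shown in the sample output above, which may differ between rpk versions.

```shell
# Sketch only: health_field, sort_ids, and all_in_recovery_mode are
# hypothetical helper names, not part of rpk.

# Print the bracketed list that follows a label, for example "0 1 2".
health_field() {
  # $1 = field label, stdin = `rpk cluster health` output
  sed -n "s/^[[:space:]]*$1:[[:space:]]*\[\(.*\)\][[:space:]]*$/\1/p"
}

# Put space-separated broker IDs into a canonical (sorted) order.
sort_ids() {
  tr -s ' ' '\n' | grep -v '^$' | sort -n | tr '\n' ' '
}

# Succeed only if the cluster has brokers and all of them are in recovery mode.
all_in_recovery_mode() {
  local out all rec
  out=$(cat)
  all=$(printf '%s\n' "$out" | health_field 'All nodes' | sort_ids)
  rec=$(printf '%s\n' "$out" | health_field 'Nodes in recovery mode' | sort_ids)
  [ -n "$all" ] && [ "$all" = "$rec" ]
}

# Usage (needs a running cluster):
#   rpk cluster health | all_in_recovery_mode && echo "all brokers in recovery mode"
```

A failing check indicates a mixed-mode cluster or a broker that has not restarted yet, assuming the output format matches the sample.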

In recovery mode, all private Redpanda topics such as __consumer_offsets are accessible. Data in user-created topics is not available, but you can still manage the metadata for these topics.

Exit recovery mode

Exit recovery mode by running one of the following commands:

  • To exit into developer mode:

    rpk redpanda mode developer

  • To exit into production mode:

    rpk redpanda mode production

Disable partitions

Problems that prevent normal cluster startup may be isolated to certain partitions or topics. You can use rpk or the Admin API to disable these partitions at the topic level or at the individual partition level. A disabled partition or topic returns a Replica Not Available error code for Kafka API requests.

To disable a partition, you need a healthy controller in the cluster, so you must start the cluster in recovery mode if a problematic partition is affecting cluster startup. If you disable a partition while in recovery mode, the partition stays disabled when you start Redpanda again in non-recovery mode. You must explicitly re-enable the partition.

You can also disable a partition outside of recovery mode, if the issue is localized to the partition and does not interfere with cluster startup.

The following examples show you how to use rpk or the Admin API to enable or disable partitions. The Admin API examples assume that the Admin API port is 9644.

Use kafka as the partition-namespace when making API calls to manage partitions in user topics.
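
The Admin API calls in this section share one URL layout, so a small wrapper can reduce typing errors. The following is a minimal sketch, not an official tool: `ADMIN_URL`, `partition_endpoint`, and `set_partition_disabled` are hypothetical names, and the base URL assumes a broker reachable at localhost on port 9644.

```shell
# Sketch only: these names are hypothetical, not part of rpk or Redpanda.
# Assumes the Admin API listens on localhost:9644 unless ADMIN_URL is set.
ADMIN_URL="${ADMIN_URL:-http://localhost:9644}"

# Print the partitions endpoint for a topic, or for one partition of it.
partition_endpoint() {
  # $1 = topic, $2 = optional partition ID, $3 = optional namespace
  local topic="$1"
  local partition="${2:-}"
  local ns="${3:-kafka}"   # kafka is the namespace for user topics
  local url="$ADMIN_URL/v1/cluster/partitions/$ns/$topic"
  if [ -n "$partition" ]; then
    url="$url/$partition"
  fi
  printf '%s\n' "$url"
}

# POST {"disabled": true|false} to a topic or to a single partition.
set_partition_disabled() {
  # $1 = true|false, $2 = topic, $3 = optional partition ID
  case "$1" in
    true|false) ;;
    *) echo "first argument must be true or false" >&2; return 2 ;;
  esac
  curl -sS -X POST -d "{\"disabled\": $1}" "$(partition_endpoint "$2" "${3:-}")"
}

# Usage (needs a reachable Admin API):
#   set_partition_disabled true orders 3    # disable partition 3 of topic orders
#   set_partition_disabled false orders     # re-enable every partition of orders
```

Validating the first argument before calling curl keeps a typo from sending a malformed request body.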

Disable a specific partition of a topic

  • rpk

    rpk cluster partitions disable <topic-name> --partitions <comma-delimited-partition-id>

  • curl

    curl -X POST -d '{"disabled": true}' http://localhost:9644/v1/cluster/partitions/<partition-namespace>/<topic-name>/<partition-id>
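
The rpk command accepts a comma-delimited list of partition IDs, while the Admin API path shown above names a single partition, so disabling several partitions with curl takes one request per partition. The following sketch expands such a list into one URL per partition. The function name `partition_urls` is hypothetical, and the URL assumes localhost, port 9644, and the kafka namespace.

```shell
# Sketch only: partition_urls is a hypothetical helper, not part of rpk.
# Assumes the Admin API at localhost:9644 and the kafka namespace.

# Print one Admin API URL per partition ID in a comma-delimited list.
partition_urls() {
  # $1 = topic, $2 = comma-delimited partition IDs, for example "0,1,2"
  local topic="$1" id
  printf '%s\n' "$2" | tr ',' '\n' | while read -r id; do
    if [ -n "$id" ]; then
      printf 'http://localhost:9644/v1/cluster/partitions/kafka/%s/%s\n' "$topic" "$id"
    fi
  done
}

# Usage (needs a reachable Admin API):
#   partition_urls orders 0,1,2 | while read -r url; do
#     curl -sS -X POST -d '{"disabled": true}' "$url"
#   done
```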

Enable a specific partition of a topic

  • rpk

    rpk cluster partitions enable <topic-name> --partitions <comma-delimited-partition-id>

  • curl

    curl -X POST -d '{"disabled": false}' http://localhost:9644/v1/cluster/partitions/<partition-namespace>/<topic-name>/<partition-id>

Disable all partitions of a specific topic

  • rpk

    rpk cluster partitions disable <topic-name> --all

  • curl

    curl -X POST -d '{"disabled": true}' http://localhost:9644/v1/cluster/partitions/<partition-namespace>/<topic-name>

Enable all partitions of a specific topic

  • rpk

    rpk cluster partitions enable <topic-name> --all

  • curl

    curl -X POST -d '{"disabled": false}' http://localhost:9644/v1/cluster/partitions/<partition-namespace>/<topic-name>

List all disabled partitions

  • rpk

    rpk cluster partitions list --all --disabled-only

  • curl

    curl 'http://localhost:9644/v1/cluster/partitions?disabled=true'

List all disabled partitions of a specific topic

  • rpk

    rpk cluster partitions list <topic-names> --disabled-only

  • curl

    curl 'http://localhost:9644/v1/cluster/partitions/<partition-namespace>/<topic-name>?disabled=true'