# Failover Runbook


---
title: Failover Runbook
latest-operator-version: v26.1.4
latest-console-tag: v3.7.3
latest-connect-version: 4.93.0
latest-redpanda-tag: v26.1.9
docname: disaster-recovery/shadowing/failover-runbook
page-component-name: cloud-data-platform
page-version: master
page-component-version: master
page-component-title: Cloud
page-relative-src-path: disaster-recovery/shadowing/failover-runbook.adoc
page-edit-url: https://github.com/redpanda-data/cloud-docs/edit/main/modules/manage/pages/disaster-recovery/shadowing/failover-runbook.adoc
description: Step-by-step runbook for failover procedures in disaster recovery.
page-git-created-date: "2025-12-12"
page-git-modified-date: "2026-05-26"
---

<!-- Source: https://docs.redpanda.com/cloud-data-platform/manage/disaster-recovery/shadowing/failover-runbook.md -->

This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.

> ❗ **IMPORTANT**
>
> This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see [Configure Failover](https://docs.redpanda.com/cloud-data-platform/manage/disaster-recovery/shadowing/failover/). Ensure you have completed the [disaster readiness checklist](https://docs.redpanda.com/cloud-data-platform/manage/disaster-recovery/shadowing/overview/#disaster-readiness-checklist) before an emergency occurs.

> 📝 **NOTE**
>
> Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.

## [](#emergency-failover-procedure)Emergency failover procedure

Follow these steps during an active disaster:

1.  [Assess the situation](#assess-situation)

2.  [Verify shadow cluster status](#verify-shadow-status)

3.  [Document current state](#document-state)

4.  [Initiate failover](#initiate-failover)

5.  [Monitor failover progress](#monitor-progress)

6.  [Update application configuration](#update-applications)

7.  [Verify application functionality](#verify-functionality)

8.  [Clean up and stabilize](#cleanup-stabilize)


### [](#assess-situation)Assess the situation

Confirm that failover is necessary:

```bash
# Check if the primary cluster is responding
rpk cluster info --brokers prod-cluster-1.example.com:9092,prod-cluster-2.example.com:9092

# If primary cluster is down, check shadow cluster health
rpk cluster info --brokers shadow-cluster-1.example.com:9092,shadow-cluster-2.example.com:9092
```

**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.

**Examples that require full failover:**

-   Primary cluster is completely unreachable (network partition, regional outage)

-   Multiple broker failures preventing writes to critical topics

-   Data center failure affecting majority of brokers

-   Persistent authentication or authorization failures across the cluster


**Examples that may NOT require failover:**

-   Single broker failure with sufficient replicas remaining

-   Temporary network connectivity issues affecting some clients

-   High latency or performance degradation (but cluster still functional)

-   Non-critical topic or partition unavailability
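
A single failed probe can be a transient network blip rather than an outage. The following is a minimal sketch, not an official procedure, of retrying a health check a few times before treating the primary cluster as unreachable. The function name `probe_with_retries`, the attempt count, and the delay are illustrative assumptions. The commented example reuses the `rpk cluster info` command shown above.

```bash
# Hypothetical helper: run a health-check command up to <attempts> times,
# sleeping <delay> seconds between tries. Returns 0 on the first success
# and 1 if every attempt fails.
#   probe_with_retries <attempts> <delay> <command...>
probe_with_retries() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final attempt
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (assumed broker addresses, as in the commands above):
# if ! probe_with_retries 3 10 rpk cluster info \
#     --brokers prod-cluster-1.example.com:9092,prod-cluster-2.example.com:9092; then
#   echo "Primary cluster unreachable after 3 attempts"
# fi
```

A repeated failure is only one input to the decision. Compare it against the examples above before you commit to a full failover.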


### [](#verify-shadow-status)Verify shadow cluster status

Check the health of your shadow links:

#### Cloud UI

1.  From the **Shadow Link** page, select the shadow link you want to view.

2.  The **Overview** tab shows the state of the shadow link and its topics.

#### rpk

```bash
# List all shadow links
rpk shadow list

# Check the configuration of your shadow link
rpk shadow describe <shadow-link-name>

# Check the status of your disaster recovery link
rpk shadow status <shadow-link-name>
```

For detailed command options, see [`rpk shadow list`](https://docs.redpanda.com/cloud-data-platform/reference/rpk/rpk-shadow/rpk-shadow-list/), [`rpk shadow describe`](https://docs.redpanda.com/cloud-data-platform/reference/rpk/rpk-shadow/rpk-shadow-describe/), and [`rpk shadow status`](https://docs.redpanda.com/cloud-data-platform/reference/rpk/rpk-shadow/rpk-shadow-status/).

#### Cloud API

```bash
# List all shadow links
curl "https://api.redpanda.com/v1/shadow-links" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"

# Check the configuration of your shadow link
curl "https://api.redpanda.com/v1/shadow-links/<shadow-link-id>" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"

# Get the Data Plane API URL of the shadow cluster
# (the returned value is expected to include the https:// scheme)
export DATAPLANE_API_URL=$(curl -s "https://api.redpanda.com/v1/clusters/<shadow-cluster-id>" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}" | jq -r .cluster.dataplane_api.url)

# Check the status of your disaster recovery link
curl "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"
```

Before proceeding with failover, verify the following:

-   The shadow link state is `ACTIVE`.

-   Topics are in the `ACTIVE` state, not `FAULTED`.

-   Replication lag is within your recovery point objective (RPO).
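
The state checks above can be scripted so that they are not skipped under pressure. This is a rough sketch that assumes the text layout of `rpk shadow status` matches the example output in this runbook (a `STATE` line in the overview and `Name: ..., State: ...` topic lines). The function name `shadow_link_ready` is hypothetical, and the lag check still needs your own judgment.

```bash
# Hypothetical helper: read `rpk shadow status` output on stdin and exit 0 only
# if the link STATE is ACTIVE and every "Name: ..., State: ..." topic line is
# ACTIVE. Exits 1 if no topic lines are found, because nothing can be verified.
shadow_link_ready() {
  awk '
    $1 == "STATE" { link_state = $2 }
    /^Name: .*, State: / {
      topics++
      if ($NF != "ACTIVE") { bad++; print "not ready: " $0 }
    }
    END {
      if (link_state != "ACTIVE") { print "link state: " link_state; exit 1 }
      if (bad > 0 || topics == 0) exit 1
    }
  '
}
```

For example, `rpk shadow status <shadow-link-name> | shadow_link_ready` returns a non-zero exit code and prints the offending lines if any topic is `FAULTED`.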


#### [](#understanding-replication-lag)Understanding replication lag

Use [`rpk shadow status`](https://docs.redpanda.com/cloud-data-platform/reference/rpk/rpk-shadow/rpk-shadow-status/) or the [Data Plane API](https://docs.redpanda.com/api/doc/cloud-dataplane/operation/operation-shadowlinkservice_listshadowlinktopics) to check lag, which shows the message count difference between source and shadow partitions:

-   **Acceptable lag examples**: 0-1000 messages for low-throughput topics, 0-10000 messages for high-throughput topics

-   **Concerning lag examples**: Growing lag over 50,000 messages, or lag that continuously increases without recovering

-   **Critical lag examples**: Lag exceeding your data loss tolerance (for example, if you can only afford to lose 1 minute of data, lag should represent less than 1 minute of typical message volume)
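
To compare a message-count lag against a time-based tolerance, divide the lag by the topic's typical throughput. The sketch below shows that arithmetic. The function name `lag_seconds` and the sample numbers are illustrative assumptions, not measurements.

```bash
# Hypothetical helper: convert a message-count lag into an approximate
# number of seconds of data at risk, given typical throughput.
#   lag_seconds <lag-messages> <messages-per-second>
lag_seconds() {
  awk -v lag="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "rate must be positive"; exit 1 }
    printf "%.1f\n", lag / rate
  }'
}

# Worked example: 12,000 messages of lag at a typical 500 messages/second
lag_seconds 12000 500   # prints 24.0
```

In this example, 24 seconds of data is at risk, which fits within a one-minute tolerance. The same lag at 100 messages per second would represent 120 seconds and exceed it.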


### [](#document-state)Document current state

Record the current lag and status before proceeding:

#### Cloud UI

Capture the status from the **Shadow Link** page.

#### rpk

```bash
# Capture current status for post-mortem analysis
rpk shadow status <shadow-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
```

Example output showing healthy replication before failover:

```
shadow link: <shadow-link-name>

Overview:
NAME                 <shadow-link-name>
UID                  <uid>
STATE                ACTIVE

Tasks:
Name                 Broker_ID  State   Reason
<task-name>          1          ACTIVE
<task-name>          2          ACTIVE

Topics:
Name: <topic-name>, State: ACTIVE

 Partition  SRC_LSO  SRC_HWM  DST_HWM  Lag
 0          1234     1468     1456     12
 1          2345     2579     2568     11
```

#### Cloud API

```bash
# Capture current status for post-mortem analysis
curl "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>/topic" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}" > failover-status-$(date +%Y%m%d-%H%M%S).log
```

The partition information shows the following:

| Field | Description |
| --- | --- |
| `source_last_stable_offset` | Last stable offset of the source partition (`SRC_LSO` in `rpk` output) |
| `source_high_watermark` | High watermark of the source partition (`SRC_HWM` in `rpk` output) |
| `high_watermark` | High watermark of the shadow (destination) partition (`DST_HWM` in `rpk` output) |
| Lag | Message count difference between the source and shadow partitions |

> ❗ **IMPORTANT**
>
> Note the replication lag to estimate potential data loss during failover. The `Tasks` section shows the health of shadow link replication tasks. For details about what each task does, see [Shadow link tasks](https://docs.redpanda.com/cloud-data-platform/manage/disaster-recovery/shadowing/overview/#shadow-link-tasks).
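
To turn the captured log into a data-loss estimate, you can sum the `Lag` column across partitions. The following sketch assumes the partition table layout shown in the example `rpk` output (five whitespace-separated columns ending in `Lag`). The function name `summarize_lag` is hypothetical.

```bash
# Hypothetical helper: summarize the partition table in a captured
# `rpk shadow status` log. Assumes rows of five numeric columns:
# Partition, SRC_LSO, SRC_HWM, DST_HWM, Lag (as in the example output).
#   summarize_lag <log-file>
summarize_lag() {
  awk '
    NF == 5 && $1 ~ /^[0-9]+$/ && $5 ~ /^-?[0-9]+$/ {
      total += $5
      if ($5 + 0 > max + 0) max = $5
      rows++
    }
    END { printf "partitions=%d total_lag=%d max_lag=%d\n", rows, total, max }
  ' "$1"
}
```

Run against the example output above, this prints `partitions=2 total_lag=23 max_lag=12`. The total is a rough upper bound on the messages that may be missing from the shadow cluster at the moment of capture.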

### [](#initiate-failover)Initiate failover

A complete cluster failover is appropriate if the source cluster is no longer reachable:

#### Cloud UI

1.  On your **Shadow Link** page, click **Failover All Topics**.

2.  Click to confirm the failover action. The failover process promotes all topics to writable status.

#### rpk

```bash
# Fail over all topics in the shadow link
rpk shadow failover <shadow-link-name> --all
```

For detailed command options, see [`rpk shadow failover`](https://docs.redpanda.com/cloud-data-platform/reference/rpk/rpk-shadow/rpk-shadow-failover/).

#### Cloud API

```bash
# Fail over all topics in the shadow link
curl -X POST "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>/failover" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"
```

For selective topic failover (when only specific services are affected):

#### Cloud UI

1.  On your **Shadow Link** page, click the **Failover** button for each topic you want to fail over.

2.  Click to confirm the failover action. The failover process promotes the selected topics to writable status.

#### rpk

```bash
# Fail over individual topics
rpk shadow failover <shadow-link-name> --topic <topic-name-1>
rpk shadow failover <shadow-link-name> --topic <topic-name-2>
```

#### Cloud API

```bash
# Fail over individual topics
curl -X POST "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>/failover" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}" \
  -d '{
    "shadowTopicName": "<topic-name-1>"
  }'

curl -X POST "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>/failover" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}" \
  -d '{
    "shadowTopicName": "<topic-name-2>"
  }'
```

### [](#monitor-progress)Monitor failover progress

Track the failover process:

#### Cloud UI

1.  From the **Shadow Link** page, select the shadow link you want to view.

2.  Click the **Tasks** tab to view all tasks and their status.

#### rpk

```bash
# Monitor status until all topics show FAILED_OVER
watch -n 5 "rpk shadow status <shadow-link-name>"

# Check detailed topic status and lag during emergency
rpk shadow status <shadow-link-name> --print-topic
```

Example output during successful failover:

```
shadow link: <shadow-link-name>

Overview:
NAME                 <shadow-link-name>
UID                  <uid>
STATE                ACTIVE

Tasks:
Name                 Broker_ID  State   Reason
<task-name>          1          ACTIVE
<task-name>          2          ACTIVE

Topics:
Name: <topic-name>, State: FAILED_OVER
Name: <topic-name>, State: FAILED_OVER
Name: <topic-name>, State: FAILING_OVER
```

#### Cloud API

```bash
# Monitor status
watch -n 5 'curl -s "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}" | jq .'

# Check detailed topic status and lag during emergency
curl "$DATAPLANE_API_URL/v1/shadowlinks/<shadow-link-name>/topic" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"
```

**Wait for**: All critical topics to reach `FAILED_OVER` state before proceeding.
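
Instead of watching the output by eye, you can poll until every topic reports `FAILED_OVER` or a timeout expires. This is a sketch for a full failover of all topics. It assumes the `Name: ..., State: ...` topic lines shown in the example output, and the names `all_failed_over` and `wait_for_failover` are hypothetical. For a selective failover, filter the status output down to your critical topics first, because topics you did not fail over never reach `FAILED_OVER`.

```bash
# Hypothetical helper: read status text on stdin; exit 0 only when at least one
# topic line is present and every topic line reports State: FAILED_OVER.
all_failed_over() {
  awk '
    /^Name: .*, State: / { topics++; if ($NF != "FAILED_OVER") pending++ }
    END { if (topics > 0 && pending == 0) exit 0; exit 1 }
  '
}

# Hypothetical polling loop: re-run a status command every <interval> seconds
# until all topics are failed over or <timeout> seconds have elapsed.
#   wait_for_failover <timeout> <interval> <status command...>
wait_for_failover() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while :; do
    if "$@" 2>/dev/null | all_failed_over; then
      echo "all topics FAILED_OVER"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "timed out after ${elapsed}s; some topics are not FAILED_OVER"
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Example: wait_for_failover 600 5 rpk shadow status <shadow-link-name>
```

A non-zero exit after the timeout is a cue to check the troubleshooting section for topics stuck in `FAILING_OVER`.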

### [](#update-applications)Update application configuration

Redirect your applications to the shadow cluster:

1.  Update the connection strings in your applications to point to the shadow cluster brokers.

2.  If you use DNS-based service discovery, update the DNS records accordingly.

3.  Restart applications so that they pick up the new connection settings.

4.  Verify connectivity from the application hosts to the shadow cluster.
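
As one illustration of updating a connection string, the sketch below rewrites the standard Kafka client `bootstrap.servers` property in a Java-style properties file and keeps a backup for rollback. It assumes your clients read their brokers from such a file. The function name `switch_bootstrap_servers` and the file path in the example are hypothetical, and your deployment may use environment variables or a configuration service instead.

```bash
# Hypothetical helper: point a Kafka client properties file at new brokers.
# Keeps a .bak copy of the original so the change can be reverted.
#   switch_bootstrap_servers <properties-file> <comma-separated brokers>
switch_bootstrap_servers() {
  file=$1
  brokers=$2
  cp "$file" "$file.bak" || return 1
  # Replace only the bootstrap.servers line; leave all other settings untouched
  sed "s|^bootstrap\.servers=.*|bootstrap.servers=${brokers}|" "$file.bak" > "$file"
}

# Example (assumed path and shadow broker addresses):
# switch_bootstrap_servers /etc/myapp/client.properties \
#   shadow-cluster-1.example.com:9092,shadow-cluster-2.example.com:9092
```

Keeping the `.bak` file makes it straightforward to point clients back at the original cluster if you later restore it.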

### [](#verify-functionality)Verify application functionality

Test critical application workflows:

```bash
# Verify applications can produce messages
rpk topic produce <topic-name> --brokers <shadow-cluster-address>:9092

# Verify applications can consume messages
rpk topic consume <topic-name> --brokers <shadow-cluster-address>:9092 --num 1
```

Test message production and consumption, consumer group functionality, and critical business workflows to ensure everything is working properly.

### [](#cleanup-stabilize)Clean up and stabilize

After all applications are running normally:

#### Cloud UI

1.  On your **Shadow Link** page, click **Delete**.

2.  Type "delete" to confirm the action.

#### rpk

```bash
# Optional: Delete the shadow link (no longer needed)
rpk shadow delete <shadow-link-name>
```

For detailed command options, see [`rpk shadow delete`](https://docs.redpanda.com/cloud-data-platform/reference/rpk/rpk-shadow/rpk-shadow-delete/).

#### Cloud API

```bash
# Optional: Delete the shadow link (no longer needed)
curl -X DELETE "https://api.redpanda.com/v1/shadow-links/<shadow-link-id>" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"
```

For the full API reference, see [Control Plane API reference](https://docs.redpanda.com/api/doc/cloud-controlplane/operation/operation-shadowlinkservice_deleteshadowlink).

> 📝 **NOTE**
>
> This operation [force deletes](#force-delete-warning) the shadow link.

After cleanup, document the following for the post-incident review:

-   Time of failover initiation and completion

-   Applications affected and their recovery times

-   Data loss estimates based on replication lag

-   Issues encountered during failover

## [](#troubleshoot-common-issues)Troubleshoot common issues

### [](#topics-stuck-in-failing_over-state)Topics stuck in FAILING_OVER state

**Problem**: Topics remain in `FAILING_OVER` state for extended periods

**Solution**: Check shadow cluster logs for specific error messages and ensure sufficient cluster resources (CPU, memory, disk space) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes and confirm that all shadow topic partitions have elected leaders and the controller partition is properly replicated with an active leader.

If topics remain stuck after you address these cluster health issues and you need immediate failover, you can force delete the shadow link to fail over all topics:

#### Cloud UI

All failover actions in the Cloud UI include force delete functionality by default. When you fail over a shadow link, all topics are immediately promoted to writable status.

#### rpk

```bash
# Force delete the shadow link to fail over all topics
rpk shadow delete <shadow-link-name>
```

`rpk shadow delete` force deletes the shadow link by default in Redpanda Cloud.

#### Cloud API

```bash
curl -X DELETE "https://api.redpanda.com/v1/shadow-links/<shadow-link-id>" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${RP_CLOUD_TOKEN}"
```

The `DELETE /shadow-links/<shadow-link-id>` endpoint of the Control Plane API force deletes the shadow link by default in Redpanda Cloud.

> ⚠️ **WARNING**
>
> Force deleting a shadow link immediately fails over all topics in the link. This action is irreversible and should only be used when topics are stuck and you need immediate access to all replicated data.

### [](#topics-in-faulted-state)Topics in FAULTED state

**Problem**: Topics show `FAULTED` state and are not replicating

**Solution**: Check for authentication issues, network connectivity problems, or source cluster unavailability. Verify that the shadow link service account still has the required permissions on the source cluster. Review shadow cluster logs for specific error messages about the faulted topics.

### [](#application-connection-failures)Application connection failures

**Problem**: Applications cannot connect to shadow cluster after failover

**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.
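
To test basic network reachability from an application host, a TCP connection attempt to each broker is often enough to separate firewall problems from credential problems. The sketch below uses bash's `/dev/tcp` redirection and the coreutils `timeout` command. The function name `check_port` and the 3-second limit are illustrative assumptions. It checks only that the port accepts connections, not TLS or authentication.

```bash
# Hypothetical helper: return 0 if a TCP connection to <host>:<port> opens
# within 3 seconds, non-zero otherwise. Requires bash and coreutils `timeout`.
#   check_port <host> <port>
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example (assumed shadow broker addresses):
# for host in shadow-cluster-1.example.com shadow-cluster-2.example.com; do
#   if check_port "$host" 9092; then
#     echo "$host:9092 reachable"
#   else
#     echo "$host:9092 NOT reachable"
#   fi
# done
```

If the port is reachable but clients still fail, the cause is more likely credentials, TLS settings, or ACLs than security groups or firewall rules.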

### [](#consumer-group-offset-issues)Consumer group offset issues

**Problem**: Consumers start from the beginning or from the wrong position

**Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions. See [How to manage consumer group offsets in Redpanda](https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda) for detailed reset procedures.

## [](#next-steps)Next steps

After successful failover, focus on recovery planning and process improvement. Begin by assessing the source cluster failure and determining whether to restore the original cluster or permanently promote the shadow cluster as your new primary.

**Immediate recovery planning:**

1.  **Assess the source cluster**: Determine the root cause of the outage

2.  **Plan recovery**: Decide whether to restore the source cluster or promote the shadow cluster permanently

3.  **Data synchronization**: Plan how to synchronize any data produced during failover

4.  **Fail forward**: Create a new shadow link with the failed-over shadow cluster as the source to maintain a DR cluster


**Process improvement:**

1.  **Document the incident**: Record timeline, impact, and lessons learned

2.  **Update runbooks**: Improve procedures based on what you learned

3.  **Test regularly**: Schedule regular disaster recovery drills

4.  **Review monitoring**: Ensure monitoring caught the issue appropriately