# Kubernetes Failover Runbook


---
title: Kubernetes Failover Runbook
latest-redpanda-tag: v25.3.11
latest-console-tag: v3.7.3
latest-operator-version: v26.1.4
# EOL = End-of-Life (support lifecycle status)
page-is-nearing-eol: "false"
page-is-past-eol: "false"
page-eol-date: November 19, 2026
latest-connect-version: 4.93.0
docname: kubernetes/shadowing/k-failover-runbook
page-component-name: streaming
page-version: "25.3"
page-component-version: "25.3"
page-component-title: Streaming
page-relative-src-path: kubernetes/shadowing/k-failover-runbook.adoc
page-edit-url: https://github.com/redpanda-data/docs/edit/v/25.3/modules/manage/pages/kubernetes/shadowing/k-failover-runbook.adoc
description: Step-by-step emergency guide for failing over Redpanda shadow links in Kubernetes during disasters.
page-git-created-date: "2025-12-16"
page-git-modified-date: "2025-12-16"
support-status: supported
---

<!-- Source: https://docs.redpanda.com/streaming/25.3/manage/kubernetes/shadowing/k-failover-runbook.md -->

> 📝 **NOTE**
>
> This feature requires an [enterprise license](https://docs.redpanda.com/streaming/25.3/get-started/licensing/). To get a trial license key or extend your trial period, [generate a new trial license key](https://redpanda.com/try-enterprise). To purchase a license, contact [Redpanda Sales](https://redpanda.com/upgrade).
>
> If Redpanda has enterprise features enabled and it cannot find a valid license, [restrictions](https://docs.redpanda.com/streaming/25.3/get-started/licensing/#self-managed) apply.

This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.

> ❗ **IMPORTANT**
>
> This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see [Failover](https://docs.redpanda.com/streaming/25.3/manage/disaster-recovery/shadowing/failover/). Ensure you have completed the [disaster readiness checklist](https://docs.redpanda.com/streaming/25.3/manage/disaster-recovery/shadowing/overview/#disaster-readiness-checklist) before an emergency occurs.

## [](#emergency-failover-procedure)Emergency failover procedure

Follow these steps during an active disaster:

1.  [Assess the situation](#assess-situation)

2.  [Verify shadow cluster status](#verify-shadow-status)

3.  [Document current state](#document-state)

4.  [Initiate failover](#initiate-failover)

5.  [Monitor failover progress](#monitor-progress)

6.  [Update application configuration](#update-applications)

7.  [Verify application functionality](#verify-functionality)

8.  [Clean up and stabilize](#cleanup-stabilize)


### [](#assess-situation)Assess the situation

Confirm that failover is necessary:

#### Operator

```bash
# Check if source cluster is responding
kubectl exec --namespace <source-namespace> <source-pod-name> --container redpanda -- \
  rpk cluster info

# If source cluster is down, check shadow cluster health
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk cluster info
```

#### Helm

```bash
# Check if source cluster is responding
kubectl exec --namespace <source-namespace> <source-pod-name> --container redpanda -- \
  rpk cluster info

# If source cluster is down, check shadow cluster health
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk cluster info
```

**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.

**Examples that require full failover:**

-   Primary cluster is completely unreachable (network partition, regional outage)

-   Multiple broker failures preventing writes to critical topics

-   Data center failure affecting majority of brokers

-   Persistent authentication or authorization failures across the cluster


**Examples that may NOT require failover:**

-   Single broker failure with sufficient replicas remaining

-   Temporary network connectivity issues affecting some clients

-   High latency or performance degradation (but cluster still functional)

-   Non-critical topic or partition unavailability
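
A single failed `rpk cluster info` call can reflect a transient blip rather than an outage, so it may help to retry the probe a few times before treating the source cluster as unreachable. The following is a minimal sketch, not an official tool: the `probe_cluster` function name, the attempt count, and the delay are assumptions chosen for illustration.

```bash
# Hypothetical helper (not part of rpk or kubectl): retry the health probe
# a few times so one transient failure is not mistaken for an outage.
probe_cluster() {
  local namespace="$1" pod="$2" attempts="${3:-3}" delay="${4:-5}"
  local attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    if kubectl exec --namespace "$namespace" "$pod" --container redpanda -- \
        rpk cluster info >/dev/null 2>&1; then
      return 0   # cluster answered
    fi
    # Pause between attempts, but not after the last one
    if [ "$attempt" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1       # every attempt failed
}
```

For example, `probe_cluster <source-namespace> <source-pod-name> || echo "source cluster unreachable"`. A missing pod also makes `kubectl exec` fail, so this sketch treats it the same as an unresponsive cluster.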


### [](#verify-shadow-status)Verify shadow cluster status

Check the health of your shadow links:

#### Operator

```bash
# List all shadow links
kubectl get shadowlink --namespace <shadow-namespace>

# Check the ShadowLink resource details
kubectl describe shadowlink --namespace <shadow-namespace> <shadowlink-name>
```

Verify that the following conditions exist before proceeding with failover:

-   ShadowLink resource shows `Synced: True` in conditions

-   Shadow topic statuses show `state: active` (not `faulted`)

-   Task statuses show `state: active`
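
The `Synced` check can also be scripted rather than read from the `kubectl describe` output. This sketch assumes the condition is published under `.status.conditions` with type `Synced`, which follows the usual Kubernetes convention but is not confirmed here. The `shadowlink_synced` name is hypothetical.

```bash
# Hypothetical helper: succeed only when the ShadowLink reports Synced=True.
# Assumes the condition is published under .status.conditions.
shadowlink_synced() {
  local namespace="$1" name="$2" status
  status="$(kubectl get shadowlink "$name" --namespace "$namespace" \
    --output jsonpath='{.status.conditions[?(@.type=="Synced")].status}')" || return 1
  [ "$status" = "True" ]
}
```

For example, `shadowlink_synced <shadow-namespace> <shadowlink-name> && echo "synced"`.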

#### Helm

```bash
# List all shadow links
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow list

# Check the configuration of your shadow link
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow describe <shadow-link-name>

# Check the status of your disaster recovery link
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow status <shadow-link-name>
```

For detailed command options, see [`rpk shadow list`](https://docs.redpanda.com/streaming/25.3/reference/rpk/rpk-shadow/rpk-shadow-list/), [`rpk shadow describe`](https://docs.redpanda.com/streaming/25.3/reference/rpk/rpk-shadow/rpk-shadow-describe/), and [`rpk shadow status`](https://docs.redpanda.com/streaming/25.3/reference/rpk/rpk-shadow/rpk-shadow-status/).

Verify that the following conditions exist before proceeding with failover:

-   Shadow link state should be `ACTIVE`

-   Topics should be in `ACTIVE` state (not `FAULTED`)

-   Replication lag should be reasonable for your RPO requirements

#### [](#understanding-replication-lag)Understanding replication lag

Use the status commands to check lag, which is the difference in message count between source and shadow partitions:

-   **Acceptable lag examples**: 0-1,000 messages for low-throughput topics, 0-10,000 messages for high-throughput topics

-   **Concerning lag examples**: Growing lag over 50,000 messages, or lag that continuously increases without recovering

-   **Critical lag examples**: Lag exceeding your data loss tolerance (for example, if you can only afford to lose 1 minute of data, lag should represent less than 1 minute of typical message volume)
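
To relate a message-count lag to a time-based data loss tolerance, divide the lag by the topic's typical produce rate. The sketch below shows that arithmetic. The function names and the idea of passing the rate as an argument are assumptions, and the rate itself must come from your own monitoring.

```bash
# Hypothetical helper: convert a lag (in messages) into seconds of data,
# rounding up, given the topic's typical produce rate in messages per second.
lag_seconds() {
  local lag="$1" rate="$2"
  if [ "$rate" -le 0 ]; then
    echo "rate must be a positive integer" >&2
    return 2
  fi
  echo $(( (lag + rate - 1) / rate ))
}

# Succeeds when the estimated loss fits within the RPO (in seconds).
lag_within_rpo() {
  local lag="$1" rate="$2" rpo="$3" seconds
  seconds="$(lag_seconds "$lag" "$rate")" || return 2
  [ "$seconds" -le "$rpo" ]
}
```

For example, `lag_seconds 6000 100` prints `60`: a lag of 6,000 messages on a topic that typically receives 100 messages per second represents about one minute of data, so `lag_within_rpo 6000 100 60` succeeds while `lag_within_rpo 6001 100 60` does not.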


### [](#document-state)Document current state

Record the current lag and status before proceeding:

#### Operator

```bash
# Capture current status for post-mortem analysis
kubectl describe shadowlink --namespace <shadow-namespace> <shadowlink-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
```

#### Helm

```bash
# Capture current status for post-mortem analysis
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow status <shadow-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
```

> ❗ **IMPORTANT**
>
> Note the replication lag to estimate potential data loss during failover. For details about shadow link replication tasks, see [Shadow link tasks](https://docs.redpanda.com/streaming/25.3/manage/disaster-recovery/shadowing/overview/#shadow-link-tasks).
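
For post-mortem analysis it can help to record the capture time inside the file as well as in its name, and to use UTC so that timestamps from different hosts line up. The following sketch wraps the Helm-style capture command. The `capture_status` name, the header line format, and the optional output directory argument are assumptions.

```bash
# Hypothetical helper: save the shadow link status with a UTC timestamp
# in both the file name and the first line of the file.
capture_status() {
  local namespace="$1" pod="$2" link="$3" dir="${4:-.}"
  local stamp file
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  file="$dir/failover-status-$stamp.log"
  {
    echo "# captured_at_utc=$stamp shadow_link=$link"
    kubectl exec --namespace "$namespace" "$pod" --container redpanda -- \
      rpk shadow status "$link"
  } > "$file" 2>&1   # stderr is captured too, so a failed call is also recorded
  echo "$file"       # print the path so it can be attached to the incident record
}
```

For example, `capture_status <shadow-namespace> <shadow-pod-name> <shadow-link-name>` writes the snapshot to the current directory and prints the file path.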

### [](#initiate-failover)Initiate failover

A complete cluster failover is appropriate when the source cluster is no longer reachable:

#### Operator

Delete the `ShadowLink` resource to fail over all topics:

```bash
kubectl delete shadowlink --namespace <shadow-namespace> <shadowlink-name>
```

**Expected output**:

    shadowlink.cluster.redpanda.com "<shadowlink-name>" deleted

This immediately converts all shadow topics to regular writable topics and stops replication.

> 📝 **NOTE**
>
> The Redpanda Operator does not support selective topic failover. For selective failover, use the `rpk` commands shown in the Helm tab.

#### Helm

For complete cluster failover (all topics):

```bash
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow failover <shadow-link-name> --all
```

**Expected output**:

    Successfully initiated the Fail Over for Shadow Link "<shadow-link-name>". To check the status, run:
      rpk shadow status <shadow-link-name>

For selective topic failover (when only specific services are affected):

```bash
# Fail over individual topics
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow failover <shadow-link-name> --topic <topic-name>
```

For detailed command options, see [`rpk shadow failover`](https://docs.redpanda.com/streaming/25.3/reference/rpk/rpk-shadow/rpk-shadow-failover/).
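
Because failover stops replication and makes shadow topics writable, some teams prefer an explicit confirmation step in front of the command, whichever deployment method they use. The sketch below is one possible guard and is not part of `rpk` or the Redpanda Operator. The `confirm_failover` name and the retype-the-name convention are assumptions.

```bash
# Hypothetical guard: require the operator to retype the shadow link name
# on standard input before a destructive command is allowed to run.
confirm_failover() {
  local link="$1" answer
  printf 'Type the shadow link name (%s) to confirm failover: ' "$link" >&2
  read -r answer || return 1
  [ "$answer" = "$link" ]
}
```

For example, run `confirm_failover <shadow-link-name>` and chain the failover command after it with `&&`, so that the failover only starts when the typed name matches.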

### [](#monitor-progress)Monitor failover progress

Track the failover process:

#### Operator

After deleting the `ShadowLink` resource, verify topics are now writable:

```bash
# Check that shadow link is gone
kubectl get shadowlink --namespace <shadow-namespace>

# List topics on shadow cluster
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk topic list

# Test write to a previously shadow topic
echo "test message" | kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -i -- \
  rpk topic produce <topic-name>
```

**Expected output for `kubectl get`**:

    No resources found in <shadow-namespace> namespace.

**Expected output for `rpk topic produce`**:

    Produced to partition 0 at offset 123 with timestamp 1734567890123.

If the shadow link is deleted and you can successfully produce to topics, failover is complete.

#### Helm

Monitor status until all topics show `FAILED_OVER`:

```bash
# Monitor status during failover
watch -n 5 "kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- rpk shadow status <shadow-link-name>"

# Check detailed topic status
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow status <shadow-link-name> --print-topic
```

**Expected output during failover**:

    OVERVIEW
    ===
    NAME   disaster-recovery-link
    STATE  ACTIVE

    TOPICS
    ===
    Name: orders, State: FAILED_OVER
    Name: inventory, State: FAILED_OVER
    Name: transactions, State: FAILING_OVER

Wait for all critical topics to reach `FAILED_OVER` state before proceeding.
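
Waiting for `FAILED_OVER` can be scripted instead of watched by eye. The sketch below assumes topic lines in the status output look like the sample in this runbook (`Name: orders, State: FAILED_OVER`); if your `rpk` version prints a different layout, the pattern needs adjusting. The function names, timeout, and polling interval are hypothetical.

```bash
# Reads `rpk shadow status --print-topic` text on stdin. Succeeds only when
# at least one topic line is present and every topic line is FAILED_OVER.
# Assumes topic lines look like "Name: orders, State: FAILED_OVER".
all_failed_over() {
  awk '
    /^Name: .*State: / {
      total++
      if ($0 !~ /State: FAILED_OVER[ \t\r]*$/) pending++
    }
    END { if (total > 0 && pending + 0 == 0) exit 0; exit 1 }
  '
}

# Hypothetical helper: poll the status until all topics have failed over
# or the timeout (in seconds) is exceeded.
wait_for_failover() {
  local namespace="$1" pod="$2" link="$3" timeout="${4:-600}" interval="${5:-5}"
  local waited=0
  while [ "$waited" -le "$timeout" ]; do
    if kubectl exec --namespace "$namespace" "$pod" --container redpanda -- \
        rpk shadow status "$link" --print-topic | all_failed_over; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1   # timed out with topics still pending
}
```

For example, `wait_for_failover <shadow-namespace> <shadow-pod-name> <shadow-link-name> 900 10` polls every 10 seconds for up to 15 minutes. This sketch waits for every listed topic, which is stricter than waiting only for critical topics.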

### [](#update-applications)Update application configuration

Redirect your applications to the shadow cluster:

1.  Update connection strings in your applications to point to the shadow cluster brokers.

2.  If you use DNS-based service discovery, update DNS records accordingly.

3.  Restart applications to pick up the new connection settings.

4.  Verify connectivity from application hosts to the shadow cluster.
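
How you update connection strings depends on how each application is configured. As one illustration, the sketch below assumes a client application that runs as a Kubernetes Deployment and reads its broker list from an environment variable. The `redirect_app` name and the `KAFKA_BOOTSTRAP_SERVERS` variable are hypothetical, so substitute whatever your applications actually use.

```bash
# Hypothetical helper: point a Deployment at the shadow cluster brokers.
# Changing the pod template with `kubectl set env` triggers a rolling
# restart, and `kubectl rollout status` waits for it to finish.
redirect_app() {
  local namespace="$1" deployment="$2" brokers="$3"
  kubectl set env "deployment/$deployment" --namespace "$namespace" \
    "KAFKA_BOOTSTRAP_SERVERS=$brokers" &&
  kubectl rollout status "deployment/$deployment" --namespace "$namespace" \
    --timeout=120s
}
```

For example, `redirect_app <app-namespace> <deployment-name> <shadow-broker-addresses>`. Applications configured through ConfigMaps, Secrets, or DNS need a different mechanism.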

### [](#verify-functionality)Verify application functionality

Test critical application workflows:

```bash
# Verify applications can produce messages
echo "failover test" | kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -i -- \
  rpk topic produce <topic-name>

# Verify applications can consume messages
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk topic consume <topic-name> --num 1
```

**Expected output for `rpk topic produce`**:

    Produced to partition 0 at offset 456 with timestamp 1734567890456.

**Expected output for `rpk topic consume`**:

    {
      "topic": "<topic-name>",
      "value": "failover test",
      "timestamp": 1734567890456,
      "partition": 0,
      "offset": 456
    }

Test message production and consumption, consumer group functionality, and critical business workflows to ensure everything is working properly.

### [](#cleanup-stabilize)Clean up and stabilize

After all applications are running normally:

#### Operator

The `ShadowLink` resource has already been deleted during failover. No additional cleanup is needed.

#### Helm

Optionally delete the shadow link (no longer needed):

```bash
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk shadow delete <shadow-link-name>
```

For detailed command options, see [`rpk shadow delete`](https://docs.redpanda.com/streaming/25.3/reference/rpk/rpk-shadow/rpk-shadow-delete/).

Document the following:

-   Time of failover initiation and completion

-   Applications affected and their recovery times

-   Data loss estimates based on replication lag

-   Issues encountered during failover

## [](#troubleshoot)Troubleshoot

### [](#application-connection-failures-after-failover)Application connection failures after failover

Applications may not be able to connect to the shadow cluster after failover.

Verify shadow cluster Kubernetes Service endpoints:

```bash
kubectl get service --namespace <shadow-namespace>
```

Check NetworkPolicy if using network policies:

```bash
kubectl get networkpolicy --namespace <shadow-namespace>
```

Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.
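
A Service can exist while having no ready pods behind it, which looks like a connection failure from the client side. The sketch below checks whether a Service has any endpoint addresses. It reads the core `Endpoints` resource, which newer Kubernetes versions supplement with `EndpointSlice`, and the `service_has_endpoints` name is hypothetical.

```bash
# Hypothetical helper: succeed only when the Service has at least one
# ready endpoint address behind it.
service_has_endpoints() {
  local namespace="$1" service="$2" ips
  ips="$(kubectl get endpoints "$service" --namespace "$namespace" \
    --output jsonpath='{.subsets[*].addresses[*].ip}')" || return 1
  [ -n "$ips" ]
}
```

For example, `service_has_endpoints <shadow-namespace> <service-name> || echo "no ready endpoints"`.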

### [](#consumer-group-offset-issues-after-failover)Consumer group offset issues after failover

After failover, consumers may start from the beginning of a topic or from an incorrect position.

Verify consumer group offsets were replicated (check your shadow link filters):

```bash
kubectl exec --namespace <shadow-namespace> <shadow-pod-name> --container redpanda -- \
  rpk group describe <consumer-group-name>
```

If necessary, manually reset offsets to appropriate positions. See [How to manage consumer group offsets in Redpanda](https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda) for detailed reset procedures.
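
As a rough sketch of a manual reset, `rpk group seek` can move a group's committed offsets to the start, the end, or a timestamp. The wrapper below is an illustration only: the `seek_group` name and its input validation are assumptions, and a seek typically requires the group's consumers to be stopped first. Check the `rpk group seek` reference for your version before relying on it.

```bash
# Hypothetical helper: validate the target position, then seek the group.
# Accepts "start", "end", or a numeric Unix timestamp.
seek_group() {
  local namespace="$1" pod="$2" group="$3" position="$4"
  case "$position" in
    start|end) ;;
    *[!0-9]*|'')
      echo "position must be start, end, or a Unix timestamp" >&2
      return 2
      ;;
  esac
  kubectl exec --namespace "$namespace" "$pod" --container redpanda -- \
    rpk group seek "$group" --to "$position"
}
```

For example, `seek_group <shadow-namespace> <shadow-pod-name> <consumer-group-name> start` rewinds the group to the earliest available offsets, which means messages may be reprocessed.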

## [](#next-steps)Next steps

After successful failover, focus on recovery planning and process improvement. Begin by assessing the source cluster failure and determining whether to restore the original cluster or permanently promote the shadow cluster as your new primary.

**Immediate recovery planning:**

1.  **Assess source cluster**: Determine root cause of the outage

2.  **Plan recovery**: Decide whether to restore source cluster or promote shadow cluster permanently

3.  **Data synchronization**: Plan how to synchronize any data produced during failover

4.  **Fail forward**: Create a new shadow link with the failed over shadow cluster as source to maintain a DR cluster


**Process improvement:**

1.  **Document the incident**: Record timeline, impact, and lessons learned

2.  **Update runbooks**: Improve procedures based on what you learned

3.  **Test regularly**: Schedule regular disaster recovery drills

4.  **Review monitoring**: Ensure monitoring caught the issue appropriately


For general failover concepts and procedures, see [Failover](https://docs.redpanda.com/streaming/25.3/manage/disaster-recovery/shadowing/failover/).