Whole Cluster Restore for Disaster Recovery in Kubernetes

With Tiered Storage enabled, you can use whole cluster restore to restore data from a failed cluster (source cluster), including its metadata, onto a new cluster (target cluster). This is a simpler and cheaper alternative to active-active replication, for example with MirrorMaker 2. Use this recovery method to restore your application to the latest functional state as quickly as possible.

You cannot use whole cluster restore if the target cluster is in recovery mode.

Whole cluster restore is not a fully functional disaster recovery solution. It does not provide snapshot-style consistency: some partitions in some topics will be more up to date than others, and committed transactions are not guaranteed to be atomic.

If you need to restore only a subset of topic data, consider using topic recovery instead of a whole cluster restore.

The following metadata is included in a whole cluster restore:

  • Topic definitions. If you have enabled Tiered Storage only for specific topics, topics without Tiered Storage enabled will be restored empty.

  • Users and access control lists (ACLs).

  • Schemas. To ensure that your schemas are archived and restored, you must also enable Tiered Storage for the _schemas topic.

  • The consumer offsets topic. Some restored committed consumer offsets may be truncated to a lower value than in the original cluster, to keep offsets at or below the highest restored offset in the partition.

  • Transaction metadata, up to the highest committed transaction. In-flight transactions are treated as aborted and will not be included in the restore.

  • Cluster configurations, including your Redpanda license key, with the exception of the following properties:

    • cloud_storage_cache_size

    • cluster_id

    • cloud_storage_access_key

    • cloud_storage_secret_key

    • cloud_storage_region

    • cloud_storage_bucket

    • cloud_storage_api_endpoint

    • cloud_storage_credentials_source

    • cloud_storage_trust_file

    • cloud_storage_backend

    • cloud_storage_credentials_host

    • cloud_storage_azure_storage_account

    • cloud_storage_azure_container

    • cloud_storage_azure_shared_key

    • cloud_storage_azure_adls_endpoint

    • cloud_storage_azure_adls_port
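The consumer offset truncation mentioned in the metadata list above can be illustrated with a small worked example. This is only a sketch of the stated rule (a restored committed offset stays at or below the highest restored offset in the partition), not Redpanda's implementation. The function name and the sample numbers are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: models the rule "committed offsets are kept at or below
# the highest restored offset in the partition". Not Redpanda code.
# Usage: restored_committed_offset COMMITTED HIGHEST_RESTORED
restored_committed_offset() {
  local committed="$1" highest_restored="$2"
  if [ "$committed" -gt "$highest_restored" ]; then
    # The group had committed past the data that was restored: truncate.
    echo "$highest_restored"
  else
    # The committed offset is still valid in the restored partition.
    echo "$committed"
  fi
}

restored_committed_offset 1500 1200   # prints 1200 (truncated)
restored_committed_offset 900 1200    # prints 900 (unchanged)
```

In the first case, a consumer that resumes on the target cluster may reprocess or skip nothing beyond what was restored, but it can no longer rely on the original committed value of 1500.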

Manage source metadata uploads

By default, Redpanda uploads cluster metadata to object storage periodically. You can manage metadata uploads for your source cluster, or disable them entirely, with the following cluster configuration properties:

You can monitor the redpanda_cluster_latest_cluster_metadata_manifest_age metric to track the age of the most recent metadata upload.
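As a rough sketch of how this metric might be checked from a script, the following helpers read Prometheus text-format metrics on standard input and compare the reported age against a threshold. The function names are hypothetical, and the sketch assumes that the metric value is expressed in seconds and that you already have a way to fetch the broker's metrics output (for example, from its public metrics endpoint).

```shell
#!/usr/bin/env bash
# Print the value of the metadata manifest age metric found on stdin.
# Assumes Prometheus text format: "name{labels} value" or "name value".
manifest_age() {
  awk '
    /^#/ { next }   # skip HELP/TYPE comment lines
    $1 ~ /^redpanda_cluster_latest_cluster_metadata_manifest_age([{]|$)/ {
      print $NF     # the sample value is the last field
      exit
    }
  '
}

# Return 0 if the age is at or below MAX_AGE (assumed to be seconds),
# 1 if it is older, and 2 if the metric was not found on stdin.
# Usage: <metrics output> | manifest_is_fresh MAX_AGE
manifest_is_fresh() {
  local max_age="$1" age
  age="$(manifest_age)"
  [ -n "$age" ] || return 2
  awk -v a="$age" -v m="$max_age" 'BEGIN { exit !(a + 0 <= m + 0) }'
}
```

A non-zero exit status from manifest_is_fresh could then be wired into an alert, so that a stalled metadata upload on the source cluster is noticed before a restore is needed.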

Restore data from a source cluster

To restore data from a source cluster:

  1. Start a target cluster (new cluster) with cluster restore enabled.

  2. Verify that the cluster restore is complete.

Prerequisites

You must have the following:

  • Tiered Storage enabled on the source cluster.

  • Physical or virtual machines on which to deploy the target cluster.

Limitations

  • Whole cluster restore supports only one source cluster. It is not possible to consolidate multiple clusters onto the target cluster.

  • If a duplicate cluster configuration is found in the target cluster, it will be overwritten by the restore.

  • The target cluster should not contain user-managed or application-managed topic data, schemas, users, ACLs, or ongoing transactions.

Start a target cluster

Deploy the target Redpanda cluster.

  • Helm + Operator

    redpanda-cluster.yaml

    apiVersion: cluster.redpanda.com/v1alpha1
    kind: Redpanda
    metadata:
      name: redpanda
    spec:
      chartRef: {}
      clusterSpec:
        storage:
          tiered:
            <tiered-storage-settings>
        config:
          cluster:
            cloud_storage_attempt_cluster_restore_on_bootstrap: true

    kubectl apply -f redpanda-cluster.yaml --namespace <namespace>

  • Helm

    • --values

      cluster-restore.yaml

      storage:
        tiered:
          <tiered-storage-settings>
      config:
        cluster:
          cloud_storage_attempt_cluster_restore_on_bootstrap: true

      helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
        --values cluster-restore.yaml

    • --set

      helm upgrade --install redpanda redpanda/redpanda --namespace <namespace> --create-namespace \
        --set storage.tiered.<tiered-storage-settings> \
        --set config.cluster.cloud_storage_attempt_cluster_restore_on_bootstrap=true

  • storage.tiered: Make sure to configure the target cluster with the same Tiered Storage settings as the failed source cluster.

  • config.cluster.cloud_storage_attempt_cluster_restore_on_bootstrap: Automates cluster restore in Kubernetes. Setting this property to true is recommended when you deploy with an automated method. When bootstrapping a cluster with a given bucket, make sure that any previous cluster using the bucket is fully destroyed. Otherwise, the Tiered Storage subsystems of the two clusters may interfere with each other.

Verify that the cluster restore is complete

  1. Run the following command until it returns inactive:

    rpk cluster storage restore status

  2. Check if a rolling restart is required:

    rpk cluster config status

    Example output when a restart is required:

    NODE  CONFIG-VERSION  NEEDS-RESTART  INVALID  UNKNOWN
    1     4               true           []       []

  3. If a restart is required, perform a rolling restart.
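The verification steps above could be scripted along the following lines. This is a minimal sketch, not an official procedure: the function names and the POLL_INTERVAL variable are hypothetical, it assumes that the restore status output contains the word "inactive" once the restore is finished, and it assumes that the config status output has the column layout shown in the example above.

```shell
#!/usr/bin/env bash
# Run the given status command repeatedly until its output contains
# "inactive". POLL_INTERVAL (seconds, default 10) is a hypothetical knob.
# Usage: wait_for_restore COMMAND [ARGS...]
wait_for_restore() {
  local interval="${POLL_INTERVAL:-10}"
  until "$@" 2>/dev/null | grep -qi 'inactive'; do
    sleep "$interval"
  done
}

# Read `rpk cluster config status` output on stdin and print the ID of
# every node whose NEEDS-RESTART column (third column) is "true".
brokers_needing_restart() {
  awk 'NR > 1 && $3 == "true" { print $1 }'
}

# Example usage (run rpk wherever it can reach the target cluster):
#   wait_for_restore rpk cluster storage restore status
#   rpk cluster config status | brokers_needing_restart
```

If brokers_needing_restart prints any node IDs, perform the rolling restart described in step 3 before redirecting traffic.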

When the cluster restore has completed successfully, you can redirect your application workload to the new cluster. Make sure to update your application code to use the new addresses of your brokers.