Data Archiving

This feature requires an Enterprise license. To upgrade, contact Redpanda sales.

When producers send events to Redpanda, Redpanda stores those events on a local storage volume. Redpanda then replicates the data to other nodes, according to the requirements defined by the producer. Redpanda’s persistence and replication processes protect your data in the event of a node or cluster failure. After configuring data archiving, your data is uploaded to cloud storage. In the event of a data center failure, data corruption, or cluster migration, you can recover your archived data from the cloud back to your cluster.

Redpanda supports data archiving to Amazon S3 and Google Cloud Storage.

Configure data archiving

To configure data archiving:

  1. Create the cloud storage bucket and prepare it to be archived:

    • Amazon AWS S3

      • (Optional) Specify expiration rules for the files that are based on the rp-type file tags.

      • Use the IAM service to create a user to access S3.

        • Grant this user permission to read and create objects.

        • Copy the access key and secret key to the Redpanda cluster configuration properties cloud_storage_access_key and cloud_storage_secret_key.

    • Google Cloud Storage

    • MinIO (for local testing)

      • Use the MINIO_ROOT_PASSWORD and MINIO_ROOT_USER environment variables to specify the access key (cloud_storage_access_key) and secret key (cloud_storage_secret_key), respectively.

      • Use the MINIO_REGION_NAME environment variable to specify the region name (cloud_storage_region).

      • Set the MINIO_DOMAIN environment variable. Redpanda uses virtual-hosted style endpoints but MinIO only supports them if the custom domain name is provided.

      • Set the custom API endpoint (cloud_storage_api_endpoint) and port (cloud_storage_api_endpoint_port) and either:

        • Require TLS - Set cloud_storage_disable_tls to disabled to disable TLS.

    The secret and access keys are stored in plain text in configuration files.
  2. Configure the following properties using cluster configuration rpk commands:

    Parameter name Type Description



    Enables archival storage feature



    Access key



    Secret key



    Cloud region



    Bucket name



    Cloud storage API endpoint (Default: S3)



    Cloud storage API endpoint port number (Default: 443)



    Alternative location of the CA certificate (Default: /etc/pki/tls/cert.pem)



    Disable TLS for cloud storage connections

  3. (Optional) You can tune the data transfer with these properties:

    Parameter name Type Description



    Reconciliation period (Default: 10s)



    Number of simultaneous uploads per shard (Default: 20)

Cancel and delete archiving jobs

The data archiving process runs on each node. By default, every 10 seconds the archiving process identifies log segments that are ready for archiving. Those log segments are uploaded when they are closed for new batches, because they contain only batches with offsets that are less than the committed offset of the partition.

Every upload job handles a single partition for which the node is a leader. When the leadership for the partition changes, the reconciliation process cancels all upload jobs that are no longer running on a leader and starts new upload jobs for new partitions.

In cases where you want to delete older archived data, you can use the following tags to safely delete both partition manifests and log segments from S3 buckets when they are no longer needed.

It’s recommended that you keep topic manifests to recover the corresponding topic.

Redpanda uploads the following objects to an S3 bucket:

  • Partition manifests

  • Topic manifests

  • Log segments

Partition and topic manifests use the .json extension, and log segments use the .log extension.

  • Log paths match the paths in the Redpanda data folder in the format <random-prefix>/<namespace>/<topic>/<partition-id>_<revision-id>/<logfile-name>.log. For example, a log segment path looks like /06160a28/kafka/redpanda-test/0_2/39459-1-v1.log. The random characters improve performance with S3 data transfers.

  • Partition and topic manifests use prefixes with all symbols set to 0 except the first one. For example, a manifest path looks like /00000000/meta/kafka/redpanda-test/topic_manifest.json or /00000000/meta/kafka/redpanda-test/6_2/manifest.json.

Redpanda adds the rp-type tag to all objects in S3:

  • Partition manifests - rp-type=partition-manifest

  • Log segments - rp-type=segment

  • Topic manifests - rp-type=topic-manifest