Redpanda Iceberg Docker Compose Example

This lab provides a Docker Compose environment to help you quickly get started with Redpanda and its integration with Apache Iceberg. It showcases how Redpanda, using an object store such as MinIO as its Tiered Storage backend, can write topic data in the Iceberg format, enabling seamless analytics workflows. The lab also includes a Spark environment configured for querying the Iceberg tables with SQL from a Jupyter Notebook interface.

In this setup, you will:

  • Produce data to Redpanda topics that are Iceberg-enabled.

  • Observe how Redpanda writes this data in Iceberg format to MinIO as the Tiered Storage backend.

  • Use Spark to query the Iceberg tables, demonstrating a complete pipeline from data production to querying.

This environment is ideal for experimenting with Redpanda’s Iceberg and Tiered Storage capabilities, enabling you to test end-to-end workflows for analytics and data lake architectures.

Prerequisites

You must have the following installed on your machine:

  • Docker and Docker Compose

  • rpk, the Redpanda command-line client, which you use to create topics, produce data, and register schemas

Run the lab

  1. Clone this repository:

    git clone https://github.com/redpanda-data/redpanda-labs.git
  2. Change into the docker-compose/iceberg/ directory:

    cd redpanda-labs/docker-compose/iceberg
  3. Set the REDPANDA_VERSION environment variable to at least version 24.3.1. For all available versions, see the GitHub releases.

    For example:

    export REDPANDA_VERSION=24.3.1
  4. Set the REDPANDA_CONSOLE_VERSION environment variable to the version of Redpanda Console that you want to run. For all available versions, see the GitHub releases.

    For example:

    export REDPANDA_CONSOLE_VERSION=2.8.0
  5. Start the Docker Compose environment, which includes Redpanda, MinIO, Spark, and Jupyter Notebook:

    docker compose build && docker compose up
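
    The first run builds the local images, so it can take a few minutes. To run the containers in the background instead, you can append the -d (detached) flag:

    docker compose build && docker compose up -d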
  6. Create and switch to a new rpk profile that connects to your Redpanda broker:

    rpk profile create docker-compose-iceberg --set=admin_api.addresses=localhost:19644 --set=brokers=localhost:19092 --set=schema_registry.addresses=localhost:18081
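
    To confirm that the profile can reach your broker, you can run a quick check against the cluster:

    rpk cluster info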
  7. Create two topics with Iceberg enabled:

    rpk topic create key_value --topic-config=redpanda.iceberg.mode=key_value
    rpk topic create value_schema_id_prefix --topic-config=redpanda.iceberg.mode=value_schema_id_prefix
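
    In these commands, redpanda.iceberg.mode controls how Redpanda writes records to the Iceberg table: key_value writes each record without interpreting a schema, while value_schema_id_prefix decodes the value using the schema referenced by the Schema Registry ID prefix in each record. If you already have a topic, you can enable a mode on it with alter-config; the topic name below is a placeholder:

    # "my_existing_topic" stands in for one of your own topics
    rpk topic alter-config my_existing_topic --set redpanda.iceberg.mode=key_value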
  8. Produce data to the key_value topic:

    echo "hello world" | rpk topic produce key_value --format='%k %v\n'
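
    To verify that the record arrived, you can consume it back:

    rpk topic consume key_value --num 1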
  9. Open Redpanda Console at http://localhost:8081/topics to see that the topics exist in Redpanda.

  10. Open the Jupyter Notebook server at http://localhost:8888. The notebook guides you through querying Iceberg tables created from Redpanda topics.

  11. Create a schema in the Schema Registry:

    rpk registry schema create value_schema_id_prefix-value --schema schema.avsc
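
    To confirm that the schema is registered, you can fetch it from the Schema Registry:

    rpk registry schema get value_schema_id_prefix-value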
  12. Produce data to the value_schema_id_prefix topic:

    echo -e '{"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}\n{"user_id":3333,"event_type":"SCROLL","ts":"2024-11-25T20:24:14.774Z"}\n{"user_id":7272,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:24:34.552Z"}' | rpk topic produce value_schema_id_prefix --format='%v\n' --schema-id=topic

When the data is committed, it becomes available in Iceberg format, and you can query the table lab.redpanda.value_schema_id_prefix in the Jupyter Notebook.

Alternative query interfaces

While the notebook server is running, you can query Iceberg tables directly using Spark's CLI tools instead of Jupyter Notebook:

  • Spark Shell:

    docker exec -it spark-iceberg spark-shell

  • Spark SQL:

    docker exec -it spark-iceberg spark-sql

  • PySpark:

    docker exec -it spark-iceberg pyspark
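
For a quick non-interactive query, you can also pass SQL directly to spark-sql with its -e flag. This is a minimal sketch that assumes the Avro field names from the produced records map one-to-one onto Iceberg column names:

    docker exec -it spark-iceberg spark-sql -e "SELECT user_id, event_type, ts FROM lab.redpanda.value_schema_id_prefix LIMIT 10;"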

Clean up

To shut down and delete the containers along with all your cluster data:

docker compose down -v
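
If you also want to delete the rpk profile that you created for this lab:

rpk profile delete docker-compose-iceberg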