Query Iceberg Topics

Beta

This feature requires an enterprise license. To get a trial license key or extend your trial period, generate a new trial license key. To purchase a license, contact Redpanda Sales.

If Redpanda has enterprise features enabled and it cannot find a valid license, restrictions apply.

When you access Iceberg topics from a data lakehouse or other Iceberg-compatible tools, how you consume the data depends on the topic's Iceberg mode and whether you've registered a schema for the topic in the Redpanda Schema Registry. In either mode, you do not need to rely on complex ETL jobs or pipelines to access real-time data from Redpanda.

Access Iceberg tables

Depending on the processing engine and your Iceberg catalog implementation, you may also need to create a table to point your data lakehouse to the table location in the catalog. For an example, see Query Iceberg Topics using Snowflake and Open Catalog.

If your engine needs the full JSON metadata path, use the following:

redpanda-iceberg-catalog/metadata/redpanda/<topic-name>/v<version-number>.metadata.json

This provides read access to all snapshots written as of the specified table version (denoted by <version-number>).
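
Depending on your engine and catalog setup, you can use this metadata path to register or load the table. For example, Spark with Iceberg's SQL extensions provides a register_table procedure that creates a catalog entry from an existing metadata file. The following is a minimal sketch only; the <catalog-name>, <bucket>, <topic-name>, and <version-number> values are placeholders for your environment:

-- Sketch: register the Iceberg table for a topic from its metadata file.
-- Assumes Spark with Iceberg SQL extensions enabled and an S3 bucket prefix;
-- <catalog-name>, <bucket>, <topic-name>, and <version-number> are placeholders.
CALL `<catalog-name>`.system.register_table(
  table => 'redpanda.<topic-name>',
  metadata_file => 's3://<bucket>/redpanda-iceberg-catalog/metadata/redpanda/<topic-name>/v<version-number>.metadata.json'
);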

Redpanda automatically removes expired snapshots on a periodic basis. Snapshot expiry helps maintain a smaller metadata size and reduces the window available for time travel.

Query examples

To follow along with the examples on this page, suppose you produce the same stream of events to two topics: ClickEvent, which uses a schema, and ClickEvent_key_value, which uses key_value mode. The topics have Tiered Storage configured to an AWS S3 bucket. A sample record contains the following data:

{"user_id": 2324, "event_type": "BUTTON_CLICK", "ts": "2024-11-25T20:23:59.380Z"}

Topic with schema (value_schema_id_prefix mode)

In this example, it is assumed you have created the ClickEvent topic, set redpanda.iceberg.mode to value_schema_id_prefix, and are connecting to a REST-based Iceberg catalog. The following is an Avro schema for ClickEvent:

schema.avsc
{
    "type": "record",
    "namespace": "com.redpanda.examples.avro",
    "name": "ClickEvent",
    "fields": [
        { "name": "user_id", "type": "int" },
        { "name": "event_type", "type": "string" },
        { "name": "ts", "type": "string" }
    ]
}
  1. Register the schema under the ClickEvent-value subject:

    rpk registry schema create ClickEvent-value --schema path/to/schema.avsc --type avro
  2. Produce to the ClickEvent topic using the following format:

    echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent --format='%k %v\n' --schema-id=topic

    The value_schema_id_prefix mode requires that you produce to the topic using the Schema Registry wire format, which prefixes the message payload with the magic byte and schema ID. This allows Redpanda to identify the correct schema version in the Schema Registry for each record. See the Understanding Apache Kafka Schema Registry blog post to learn more about the wire format.

  3. The following Spark SQL query returns values from columns in the ClickEvent table. The table structure is derived from the schema, and the column names match the schema fields. If you’ve integrated a catalog, query engines such as Spark SQL provide Iceberg integrations that make it easy to discover and query existing Iceberg tables in object storage. A further example of filtering on the schema-derived columns follows this list.

    SELECT *
    FROM `<catalog-name>`.redpanda.ClickEvent;
    +------------------------------------------------------------------------------------------+---------+--------------+--------------------------+
    | redpanda                                                                                  | user_id | event_type   | ts                       |
    +------------------------------------------------------------------------------------------+---------+--------------+--------------------------+
    | {"partition":0,"offset":0,"timestamp":2025-03-05 15:09:20.436,"headers":null,"key":null} | 2324    | BUTTON_CLICK | 2024-11-25T20:23:59.380Z |
    +------------------------------------------------------------------------------------------+---------+--------------+--------------------------+
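
Because the table columns are derived from the schema, you can filter and aggregate on them like any other Iceberg table. For example, the following Spark SQL query (an illustration using the columns from the schema above) counts BUTTON_CLICK events per user:

SELECT user_id, count(*) AS clicks
FROM `<catalog-name>`.redpanda.ClickEvent
WHERE event_type = 'BUTTON_CLICK'
GROUP BY user_id;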

Topic in key-value mode

In key_value mode, you do not associate the topic with a schema in the Schema Registry, which means the data in Iceberg is semi-structured. Because the record keys and values can have an arbitrary structure, Redpanda stores them in binary format in Iceberg.

In this example, it is assumed you have created the ClickEvent_key_value topic, and set redpanda.iceberg.mode to key_value.

  1. Produce to the ClickEvent_key_value topic using the following format:

    echo 'key1 {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent_key_value --format='%k %v\n'
  2. The following Spark SQL query returns the semi-structured data in the ClickEvent_key_value table. The table consists of two columns: one named redpanda, containing the record key and other metadata, and another binary column named value for the record’s value:

    SELECT *
    FROM `<catalog-name>`.redpanda.ClickEvent_key_value;
    +------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+
    | redpanda                                                                                  | value                                                                        |
    +------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+
    | {"partition":0,"offset":0,"timestamp":2025-03-05 15:14:30.931,"headers":null,"key":key1} | {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"} |
    +------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+

Depending on your query engine, you might need to use a SQL helper function to decode the binary key and value before you can display them. For example, see the decode and unhex Spark SQL functions, or the HEX_DECODE_STRING Snowflake function. Some engines may also automatically decode the binary value for you.
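
For example, in Spark SQL you might decode the binary columns as UTF-8 strings. This is a sketch that assumes the producer wrote UTF-8 text, as in the sample record above, and uses the column and struct field names shown in the previous query result:

-- Sketch: decode the binary key and value as UTF-8 strings (assumes UTF-8 payloads).
SELECT
  decode(redpanda.key, 'UTF-8') AS key_str,
  decode(value, 'UTF-8')        AS value_str
FROM `<catalog-name>`.redpanda.ClickEvent_key_value;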

Time travel queries

Some query engines, such as Spark, support time travel with Iceberg, allowing you to query the table as it existed at a specific timestamp or version.

Redpanda automatically removes expired snapshots on a periodic basis, which also reduces the window available for time travel queries.
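
To check which snapshots are still available before running a time travel query, you can query Iceberg's snapshots metadata table. This is a sketch assuming Spark and a catalog that exposes Iceberg metadata tables:

-- List the snapshots that remain available for time travel.
SELECT committed_at, snapshot_id, operation
FROM `<catalog-name>`.redpanda.ClickEvent.snapshots
ORDER BY committed_at;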

The following is an example of querying a ClickEvent table at a specific timestamp in Spark:

SELECT * FROM `<catalog-name>`.redpanda.ClickEvent TIMESTAMP AS OF '2025-03-02 10:00:00';
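
Spark also supports querying by snapshot ID using VERSION AS OF. The snapshot ID below is a placeholder; use a value returned by the snapshots metadata table, as shown earlier:

SELECT * FROM `<catalog-name>`.redpanda.ClickEvent VERSION AS OF <snapshot-id>;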