Query Iceberg Topics using AWS Glue
This feature requires an enterprise license. To get a trial license key or extend your trial period, generate a new trial license key. To purchase a license, contact Redpanda Sales. If Redpanda has enterprise features enabled and it cannot find a valid license, restrictions apply.
This guide walks you through querying Redpanda topics as Iceberg tables stored in AWS S3, using a catalog integration with AWS Glue. For general information about Iceberg catalog integrations in Redpanda, see Use Iceberg Catalogs.
Prerequisites
-
An AWS account with access to AWS Glue Data Catalog.
-
Redpanda version 25.1.7 or later.
-
rpk installed or updated to the latest version.
-
Object storage configured for your cluster and Tiered Storage enabled for the topics for which you want to generate Iceberg tables.
You also use the S3 bucket URI to set the base location for AWS Glue Data Catalog.
-
Admin permissions to create IAM policies and roles in AWS.
Limitations
Lowercase field names required
Use only lowercase field names. AWS Glue converts all table column names to lowercase, and Redpanda requires exact column name matches to manage schemas. Using uppercase letters prevents Redpanda from finding matching columns, which breaks schema management.
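One way to catch this before producing data is to scan a schema for field names that contain uppercase characters. The following is a minimal sketch, not part of Redpanda or AWS Glue: the nested-dictionary schema representation, the function name find_non_lowercase_fields, and the example field names are all assumptions made for illustration.

```python
def find_non_lowercase_fields(schema: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of field names that contain uppercase characters.

    `schema` is a hypothetical representation: a dict that maps each field
    name to either a type name (str) or a nested dict of child fields.
    """
    offenders = []
    for name, child in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if name != name.lower():
            offenders.append(path)
        if isinstance(child, dict):
            # Recurse into nested records so inner field names are checked too.
            offenders.extend(find_non_lowercase_fields(child, path))
    return offenders


# Hypothetical schema with two offending field names.
example_schema = {
    "user_id": "long",
    "EventType": "string",
    "address": {"city": "string", "zipCode": "string"},
}
print(find_non_lowercase_fields(example_schema))
# ['EventType', 'address.zipCode']
```

An empty result means every field name is already lowercase and should match the column names that AWS Glue stores.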
Nested partition spec support
AWS Glue does not support partitioning on nested fields. If Redpanda detects that
the default partitioning (hour(redpanda.timestamp)) based on the record metadata is in use, it will instead apply an empty partition spec (), which means the table will not be partitioned.
To use partitioning, you must implement custom partitioning using your own partition columns (that is, columns that are not nested).
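As a rough illustration of the nested-field restriction, the sketch below flags dotted field references in a partition spec string. It is a simple text check, not Redpanda's own parser. The helper name nested_field_refs and the second example spec (with the columns event_time and region) are hypothetical.

```python
import re

# Matches dotted references such as "redpanda.timestamp".
_NESTED_FIELD = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+")


def nested_field_refs(partition_spec: str) -> list[str]:
    """Return every dotted (nested) field reference found in the spec text."""
    return _NESTED_FIELD.findall(partition_spec)


# The default spec partitions on a nested field.
print(nested_field_refs("(hour(redpanda.timestamp))"))  # ['redpanda.timestamp']

# A hypothetical custom spec that uses only top-level columns.
print(nested_field_refs("(day(event_time), region)"))  # []
```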
|
In Redpanda versions 25.2.1 and earlier, an empty partition spec |
Manual deletion of Iceberg tables
The AWS Glue catalog integration does not support automatic deletion of Iceberg tables from Redpanda. To manually delete Iceberg tables in AWS Glue, you must either:
-
Set the cluster property iceberg_delete to false when you configure the catalog integration.
-
Override the cluster property iceberg_delete by setting the topic property redpanda.iceberg.delete to false for the topic you want to delete.
When iceberg_delete or the topic override redpanda.iceberg.delete is set to false, you can delete the Redpanda topic, and then delete the table in AWS Glue and the Iceberg data and metadata files in the S3 bucket. If you plan to re-create the topic after deleting it, you must delete the table data entirely before re-creating the topic.
Authorize access to AWS Glue
You must allow Redpanda access to AWS Glue services in your AWS account. You can use the same access credentials that you configured for S3 (IAM role, access keys, and KMS key), as long as you have also added read and write access to AWS Glue Data Catalog.
For example, you could create a separate IAM policy that manages access to AWS Glue and attach it to the IAM role that Redpanda also uses to access S3. Add all AWS Glue API actions in the policy ("glue:*") on the following resources:
-
Root catalog (catalog)
-
All databases (database/*)
-
All tables (table/*/*)
Your IAM policy should include a statement similar to the following:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:*"
],
"Resource": [
"arn:aws:glue:<aws-region>:<aws-account-id>:catalog",
"arn:aws:glue:<aws-region>:<aws-account-id>:database/*",
"arn:aws:glue:<aws-region>:<aws-account-id>:table/*/*"
]
}
]
}
For more information on configuring IAM permissions, see the AWS Glue documentation.
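If you manage policies from scripts, the statement above can be generated from the region and account ID. The following is a minimal sketch using only the Python standard library. The function name build_glue_policy and the example region and account ID are placeholders for illustration, and the output is the same JSON document shown above with the placeholders filled in.

```python
import json


def build_glue_policy(region: str, account_id: str) -> dict:
    """Build the IAM policy document granting glue:* on catalog, databases, tables."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:*"],
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/*",
                    f"{base}:table/*/*",
                ],
            }
        ],
    }


# Placeholder values; substitute your own region and account ID.
policy = build_glue_policy("us-east-1", "123456789012")
print(json.dumps(policy, indent=2))
```

You can paste the printed JSON into the IAM console or pass it to your usual provisioning tooling.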
Configure authentication and credentials
You must configure credentials for the AWS Glue Data Catalog integration in either of the following ways:
-
Allow Redpanda to use the same cloud_storage_* credential properties configured for S3. This is the recommended approach.
-
If you want to configure authentication to AWS Glue separately from authentication to S3, there are equivalent credential configuration properties named iceberg_rest_catalog_aws_* that override the object storage credentials. These properties only apply to REST catalog authentication, and never to S3 authentication.
Update cluster configuration
To configure your Redpanda cluster to enable Iceberg on a topic and integrate with the AWS Glue Data Catalog:
-
Edit your cluster configuration to set the iceberg_enabled property to true, and set the catalog integration properties listed in the example below.
By default, Redpanda creates Iceberg tables in a namespace called redpanda. Because AWS Glue provides a single catalog per account, each Redpanda cluster that writes to the same Glue catalog must use a distinct namespace to avoid table name collisions. To set a unique namespace, also set iceberg_default_catalog_namespace when you set iceberg_enabled. This property cannot be changed after Iceberg is enabled.
Run rpk cluster config edit to update these properties:
iceberg_enabled: true
# Set a custom namespace instead of the default "redpanda"
iceberg_default_catalog_namespace: ["<custom-namespace>"]
# Glue requires Redpanda Iceberg tables to be manually deleted
iceberg_delete: false
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://glue.<glue-region>.amazonaws.com/iceberg
iceberg_rest_catalog_authentication_mode: aws_sigv4
# Because Redpanda does not support the use of distinct buckets for Iceberg,
# always place iceberg_rest_catalog_base_location in the same S3 bucket as cloud_storage_bucket
iceberg_rest_catalog_base_location: s3://<bucket-name>/<warehouse-path>
# Use the iceberg_rest_catalog_aws_* properties if you want to
# use separate AWS credentials for the catalog, or omit these lines to reuse S3
# (cloud_storage_*) credentials.
# For access using access keys only, use iceberg_rest_catalog_aws_access_key
# and iceberg_rest_catalog_aws_secret_key. For access with an IAM role, use
# iceberg_rest_catalog_credentials_source only.
# iceberg_rest_catalog_aws_region:
# iceberg_rest_catalog_aws_access_key:
# iceberg_rest_catalog_aws_secret_key:
# iceberg_rest_catalog_credentials_source:
Use your own values for the following placeholders:
-
<custom-namespace>: A unique namespace for this cluster’s Iceberg tables. Each Redpanda cluster that writes to the same Glue catalog must use a distinct namespace to avoid table name collisions. If omitted, the default namespace redpanda is used.
-
<glue-region>: The AWS region where your Data Catalog is located. The region in the AWS Glue endpoint must match the region specified in either your cloud_storage_region or iceberg_rest_catalog_aws_region property.
-
<bucket-name> and <warehouse-path>: AWS Glue requires you to specify the base location where Redpanda stores Iceberg data and metadata files. You must use an S3 URI; for example, s3://<bucket-name>/iceberg. This must be the same bucket used for object storage (your cloud_storage_bucket). You cannot specify a different bucket for Iceberg data. <warehouse-path> is a name you choose (such as iceberg) as the logical name for the warehouse represented by all Redpanda Iceberg topic data in the cluster.
As a security best practice, do not use the bucket root for the base location. Always specify a subfolder to avoid interfering with your cluster’s data in object storage.
Successfully updated configuration. New configuration version is 2.
-
If you change the configuration for a running cluster, you must restart that cluster now.
-
Enable the integration for a topic by configuring the topic property redpanda.iceberg.mode. The following examples show how to use rpk to either create a new topic or alter the configuration for an existing topic and set the Iceberg mode to key_value. The key_value mode creates a two-column Iceberg table for the topic, with one column for the record metadata including the key, and another binary column for the record’s value. See Specify Iceberg Schema for more details on Iceberg modes.
Create a new topic and set redpanda.iceberg.mode:
rpk topic create <topic-name> --topic-config=redpanda.iceberg.mode=key_value
Set redpanda.iceberg.mode for an existing topic:
rpk topic alter-config <topic-name> --set redpanda.iceberg.mode=key_value
-
Produce to the topic. For example,
echo "hello world\nfoo bar\nbaz qux" | rpk topic produce <topic-name> --format='%k %v\n'
You should see the topic as a table with data in AWS Glue Data Catalog. The data may take some time to become visible, depending on your iceberg_target_lag_ms setting.
-
In AWS Glue Studio, go to Databases.
-
Select the redpanda database. The redpanda database and the table within are automatically added for you. The table name is the same as the topic name.
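If the table does not appear, it can help to re-check the constraints on the catalog properties described in the cluster configuration step: the region in the Glue endpoint, the bucket in the base location, and the use of a subfolder rather than the bucket root. The following is a minimal sketch of such a check, assuming the properties have been copied into a plain Python dictionary. The function name check_glue_catalog_config and the example values are hypothetical, and the check does not contact Redpanda or AWS.

```python
from urllib.parse import urlparse


def check_glue_catalog_config(config: dict) -> list[str]:
    """Return a list of problems found in the Glue-related cluster properties."""
    problems = []

    # Expected endpoint host shape: glue.<glue-region>.amazonaws.com
    host = urlparse(config["iceberg_rest_catalog_endpoint"]).hostname or ""
    parts = host.split(".")
    endpoint_region = parts[1] if len(parts) >= 4 and parts[0] == "glue" else None

    # The endpoint region must match one of the configured region properties.
    region_keys = ("cloud_storage_region", "iceberg_rest_catalog_aws_region")
    allowed = {config[k] for k in region_keys if config.get(k)}
    if endpoint_region is None:
        problems.append("endpoint host is not of the form glue.<region>.amazonaws.com")
    elif endpoint_region not in allowed:
        problems.append(f"endpoint region {endpoint_region!r} matches no configured region")

    base = urlparse(config["iceberg_rest_catalog_base_location"])
    if base.scheme != "s3":
        problems.append("base location must be an S3 URI")
    if base.netloc != config["cloud_storage_bucket"]:
        problems.append("base location must be in the cloud_storage_bucket bucket")
    if not base.path.strip("/"):
        problems.append("base location should be a subfolder, not the bucket root")
    return problems


# Hypothetical example values.
good = {
    "iceberg_rest_catalog_endpoint": "https://glue.us-east-1.amazonaws.com/iceberg",
    "cloud_storage_region": "us-east-1",
    "cloud_storage_bucket": "my-bucket",
    "iceberg_rest_catalog_base_location": "s3://my-bucket/iceberg",
}
print(check_glue_catalog_config(good))  # []

bad = dict(good, iceberg_rest_catalog_base_location="s3://other-bucket")
print(check_glue_catalog_config(bad))
```

The second call reports two problems: the base location is in a different bucket, and it points at the bucket root.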
Query Iceberg table
You can query the Iceberg table using different engines, such as Amazon Athena, PyIceberg, or Apache Spark. To query the table or view the table data in AWS Glue, ensure that your account has the necessary permissions to access the catalog, database, and table.
To query the table in Amazon Athena:
-
On the list of tables in AWS Glue Studio, click "Table data" under the View data column.
-
Click "Proceed" to be redirected to the Athena query editor.
-
In the query editor, select AwsDataCatalog as the data source, and select the redpanda database. If you set a custom namespace for your cluster, select that database instead of redpanda.
-
The SQL query editor should be pre-populated with a query that selects 10 rows from the Iceberg table. Run the query to see a preview of the table data.
SELECT * FROM "AwsDataCatalog"."redpanda"."<table-name>" LIMIT 10;
Your query results should look like the following:
+-----------------------------------------------------+----------------+
| redpanda                                            | value          |
+-----------------------------------------------------+----------------+
| {partition=0, offset=0, timestamp=2025-07-21        | 77 6f 72 6c 64 |
| 18:11:25.070000, headers=null, key=[B@1900af31}     |                |
+-----------------------------------------------------+----------------+
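In key_value mode the value column is binary, and the result above shows it as space-separated hexadecimal bytes. The following minimal sketch decodes such a string back to text, assuming the record value was UTF-8 encoded. The helper name decode_value is made up for illustration.

```python
def decode_value(hex_bytes: str) -> str:
    """Decode space-separated hex bytes into a UTF-8 string."""
    # bytes.fromhex skips the spaces between byte pairs.
    return bytes.fromhex(hex_bytes).decode("utf-8")


print(decode_value("77 6f 72 6c 64"))  # world
```

The decoded text is consistent with the first record of the produce example, assuming the '%k %v\n' format splits hello world into the key hello and the value world.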