parquet_encode

Encodes Parquet files from a batch of structured messages.

Common

processors:
  label: ""
  parquet_encode:
    schema: [] # No default (optional)
    schema_metadata: ""
    default_compression: uncompressed

Advanced

processors:
  label: ""
  parquet_encode:
    schema: [] # No default (optional)
    schema_metadata: ""
    default_compression: uncompressed
    default_encoding: DELTA_LENGTH_BYTE_ARRAY
    default_timestamp_unit: NANOSECOND

Fields

`default_compression`

The default compression type to use for fields.

Type: string

Default: uncompressed

Options: uncompressed, snappy, gzip, brotli, zstd, lz4raw

`default_encoding`

The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support DELTA_LENGTH_BYTE_ARRAY.

Type: string

Default: DELTA_LENGTH_BYTE_ARRAY

Options: DELTA_LENGTH_BYTE_ARRAY, PLAIN
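As a minimal sketch of the case described above, the fragment below switches the default encoding to PLAIN for a reader that cannot handle DELTA_LENGTH_BYTE_ARRAY. The column names are hypothetical and chosen only for illustration.

```yaml
parquet_encode:
  schema:
    - name: id          # hypothetical column
      type: INT64
    - name: content     # hypothetical column
      type: BYTE_ARRAY
  # Fall back to PLAIN when the consuming library does not
  # support DELTA_LENGTH_BYTE_ARRAY.
  default_encoding: PLAIN
```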

`default_timestamp_unit`

The precision used when encoding TIMESTAMP logical types. The default NANOSECOND matches historical behaviour, but TIMESTAMP(NANOS) is not readable by Apache Spark (Databricks), AWS Athena or DuckDB; set this to MICROSECOND (or MILLISECOND) when writing Parquet files intended for consumption by those engines.

Type: string

Default: NANOSECOND

Options: NANOSECOND, MICROSECOND, MILLISECOND
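For files that will be read by Spark, Athena, or DuckDB, a configuration along these lines may be a reasonable starting point. It is a sketch only: the column names are hypothetical, and MILLISECOND could be used in place of MICROSECOND.

```yaml
parquet_encode:
  schema:
    - name: id            # hypothetical column
      type: INT64
    - name: created_at    # hypothetical column
      type: TIMESTAMP
  # Avoid TIMESTAMP(NANOS), which the engines listed above cannot read.
  default_timestamp_unit: MICROSECOND
```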

`schema[]`

Parquet schema.

Type: object

`schema[].fields[]`

A list of child fields.

Type: array

# Examples:
fields:
  - name: foo
    type: INT64
  - name: bar
    type: BYTE_ARRAY

`schema[].name`

The name of the column.

Type: string

`schema[].optional`

Whether the field is optional.

Type: bool

Default: false

`schema[].repeated`

Whether the field is repeated.

Type: bool

Default: false

`schema[].type`

The type of the column. Only applicable to leaf columns, that is, columns with no child fields. Some logical types, such as UTF8, can also be specified here.

Type: string

Options: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY, UTF8, TIMESTAMP, BSON, ENUM, JSON, UUID
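The schema fields described above can be combined in a single definition. The following sketch, with hypothetical column names, shows an optional column, a repeated column, and a group column that declares child fields instead of a type.

```yaml
parquet_encode:
  schema:
    - name: id            # required leaf column
      type: INT64
    - name: nickname      # leaf column that may be absent
      type: UTF8
      optional: true
    - name: tags          # leaf column holding multiple values
      type: UTF8
      repeated: true
    - name: location      # group column: child fields, no type
      fields:
        - name: lat
          type: DOUBLE
        - name: lon
          type: DOUBLE
```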

`schema_metadata`

Optionally specify a metadata field containing a schema definition to use for encoding instead of a statically defined schema. For batches of messages, the first message’s schema will be applied to all subsequent messages of the batch.

Type: string

Default: ""
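A minimal sketch of a metadata-driven configuration follows. The metadata key name `parquet_schema` is hypothetical, and it is assumed that an earlier step in the pipeline has already stored a schema definition under that key; the format of that definition is not described in this section.

```yaml
parquet_encode:
  # No static `schema` is given: the schema is read from the
  # (hypothetical) metadata key below. Within a batch, the first
  # message's schema is applied to all subsequent messages.
  schema_metadata: parquet_schema
  default_compression: snappy
```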

Examples

Writing Parquet Files to AWS S3

In this example, the batching mechanism of an aws_s3 output collects a batch of messages in memory. The parquet_encode processor then converts the batch into a single Parquet file, which the output uploads.

output:
  aws_s3:
    bucket: TODO
    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: weight
                type: DOUBLE
              - name: content
                type: BYTE_ARRAY
            default_compression: zstd
