parquet_encode
Encodes Parquet files from a batch of structured messages.
Common
processors:
  label: ""
  parquet_encode:
    schema: [] # No default (optional)
    schema_metadata: ""
    default_compression: uncompressed
Advanced
processors:
  label: ""
  parquet_encode:
    schema: [] # No default (optional)
    schema_metadata: ""
    default_compression: uncompressed
    default_encoding: DELTA_LENGTH_BYTE_ARRAY
    default_timestamp_unit: NANOSECOND
Fields
default_compression
The default compression type to use for fields.
Type: string
Default: uncompressed
Options: uncompressed, snappy, gzip, brotli, zstd, lz4raw
default_encoding
The default encoding type to use for fields. A custom default encoding is only necessary when consuming data with libraries that do not support DELTA_LENGTH_BYTE_ARRAY.
Type: string
Default: DELTA_LENGTH_BYTE_ARRAY
Options: DELTA_LENGTH_BYTE_ARRAY, PLAIN
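As a minimal sketch of the fallback described above, the fragment below switches the default encoding to PLAIN for a downstream reader that cannot decode DELTA_LENGTH_BYTE_ARRAY. The surrounding pipeline block and the single content field are illustrative assumptions, not part of the reference config.

```yaml
pipeline:
  processors:
    - parquet_encode:
        schema:
          # Hypothetical field, used only to make the fragment complete.
          - name: content
            type: BYTE_ARRAY
        # Use PLAIN only when the consuming library lacks
        # DELTA_LENGTH_BYTE_ARRAY support.
        default_encoding: PLAIN
```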
default_timestamp_unit
The precision used when encoding TIMESTAMP logical types. The default NANOSECOND matches historical behaviour, but TIMESTAMP(NANOS) is not readable by Apache Spark (Databricks), AWS Athena or DuckDB; set this to MICROSECOND (or MILLISECOND) when writing Parquet files intended for consumption by those engines.
Type: string
Default: NANOSECOND
Options: NANOSECOND, MICROSECOND, MILLISECOND
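For files that will be read by Spark, Athena or DuckDB, a configuration along these lines may be a reasonable starting point. The field names are hypothetical, and the TIMESTAMP type name for the created_at field is an assumption based on the description above; check the schema type options available in your version.

```yaml
pipeline:
  processors:
    - parquet_encode:
        schema:
          - name: id
            type: INT64
          # Assumed type name for a timestamp column; verify before use.
          - name: created_at
            type: TIMESTAMP
        # Avoid TIMESTAMP(NANOS), which the engines listed above cannot read.
        default_timestamp_unit: MICROSECOND
```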
schema[].fields[]
A list of child fields.
Type: array
# Examples:
fields:
  - name: foo
    type: INT64
  - name: bar
    type: BYTE_ARRAY
Examples
Writing Parquet Files to AWS S3
In this example we use the batching mechanism of an aws_s3 output to collect a batch of messages in memory. The parquet_encode processor then converts the batch into a single Parquet file, which the output uploads.
output:
  aws_s3:
    bucket: TODO
    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: weight
                type: DOUBLE
              - name: content
                type: BYTE_ARRAY
            default_compression: zstd