parquet_encode

Encodes Parquet files from a batch of structured messages.

# Common config fields, showing default values
label: ""
parquet_encode:
  schema: [] # No default (required)
  default_compression: uncompressed

# All config fields, showing default values
label: ""
parquet_encode:
  schema: [] # No default (required)
  default_compression: uncompressed
  default_encoding: DELTA_LENGTH_BYTE_ARRAY

This processor uses https://github.com/parquet-go/parquet-go, which is itself experimental. The behavior of this processor may therefore change outside of major version releases.

Examples

  • Writing Parquet Files to AWS S3

In this example we use the batching mechanism of an aws_s3 output to collect a batch of messages in memory. The parquet_encode processor then converts the batch into a Parquet file, which the output uploads.

output:
  aws_s3:
    bucket: TODO
    path: 'stuff/${! timestamp_unix() }-${! uuid_v4() }.parquet'
    batching:
      count: 1000
      period: 10s
      processors:
        - parquet_encode:
            schema:
              - name: id
                type: INT64
              - name: weight
                type: DOUBLE
              - name: content
                type: BYTE_ARRAY
            default_compression: zstd

Fields

schema

Parquet schema.

Type: array

schema[].name

The name of the column.

Type: string

schema[].type

The type of the column. This is only applicable for leaf columns with no child fields. Some logical types, such as UTF8, can also be specified here.

Type: string

Options: BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY, UTF8.
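
As a minimal sketch of the logical type option, a column holding text could be declared as UTF8 while a column holding raw bytes stays BYTE_ARRAY. The column names below are hypothetical and chosen only for illustration.

```yaml
parquet_encode:
  schema:
    # Hypothetical columns, not taken from any real dataset.
    - name: title
      type: UTF8        # logical type for text values
    - name: payload
      type: BYTE_ARRAY  # physical type for raw bytes
```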

schema[].repeated

Whether the field is repeated.

Type: bool

Default: false

schema[].optional

Whether the field is optional.

Type: bool

Default: false
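
The repeated and optional flags are set per column. The sketch below uses hypothetical column names and assumes the usual Parquet meaning of the two terms: an optional column may have no value in a given row, and a repeated column holds a list of values.

```yaml
parquet_encode:
  schema:
    - name: id
      type: INT64
    # Hypothetical column that may be missing from some messages.
    - name: nickname
      type: UTF8
      optional: true
    # Hypothetical column that carries a list of values per message.
    - name: scores
      type: DOUBLE
      repeated: true
```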

schema[].fields

A list of child fields.

Type: array

# Examples

fields:
  - name: foo
    type: INT64
  - name: bar
    type: BYTE_ARRAY
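
Placed in a full schema, child fields sit under a parent column that declares no type of its own, since type applies only to leaf columns. A sketch, where the parent name nested_stuff is hypothetical:

```yaml
parquet_encode:
  schema:
    - name: id
      type: INT64
    # Hypothetical parent column: it has child fields, so no type is set.
    - name: nested_stuff
      fields:
        - name: foo
          type: INT64
        - name: bar
          type: BYTE_ARRAY
```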

default_compression

The default compression type to use for fields.

Type: string

Default: "uncompressed"

Options: uncompressed, snappy, gzip, brotli, zstd, lz4raw.

default_encoding

The default encoding type to use for fields. A custom default encoding is only necessary when the data will be consumed by libraries that do not support DELTA_LENGTH_BYTE_ARRAY, so this field is best left unset where possible.

Type: string

Default: "DELTA_LENGTH_BYTE_ARRAY"

Options: DELTA_LENGTH_BYTE_ARRAY, PLAIN.
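
If a downstream reader is known not to support DELTA_LENGTH_BYTE_ARRAY, the default can be overridden. A minimal sketch, reusing column names from the S3 example and picking snappy compression purely for illustration:

```yaml
parquet_encode:
  schema:
    - name: id
      type: INT64
    - name: content
      type: BYTE_ARRAY
  default_compression: snappy
  # Only needed when the consuming library cannot read the default encoding.
  default_encoding: PLAIN
```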