parquet

Reads and decodes Parquet files into a stream of structured messages.

Introduced in version 4.8.0.

  • Common

  • Advanced

# Common config fields, showing default values
input:
  label: ""
  parquet:
    paths: [] # No default (required)
    auto_replay_nacks: true
# All config fields, showing default values
input:
  label: ""
  parquet:
    paths: [] # No default (required)
    batch_count: 1
    auto_replay_nacks: true

This input uses https://github.com/parquet-go/parquet-go, which is itself experimental. Therefore changes could be made into how this processor functions outside of major version releases.

By default any BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY value will be extracted as a byte slice ([]byte) unless the logical type is UTF8, in which case they are extracted as a string (string).

When a value extracted as a byte slice exists within a document which is later JSON serialized by default it will be base 64 encoded into strings, which is the default for arbitrary data fields. It is possible to convert these binary values to strings (or other data types) using Bloblang transformations such as root.foo = this.foo.string() or root.foo = this.foo.encode("hex"), etc.

Fields

paths

A list of file paths to read from. Each file will be read sequentially until the list is exhausted, at which point the input will close. Glob patterns are supported, including super globs (double star).

Type: array

# Examples

paths: /tmp/foo.parquet

paths: /tmp/bar/*.parquet

paths: /tmp/data/**/*.parquet

batch_count

Optionally process records in batches. This can help to speed up the consumption of exceptionally large files. When the end of the file is reached the remaining records are processed as a (potentially smaller) batch.

Type: int

Default: 1

auto_replay_nacks

Whether messages that are rejected (nacked) at the output level should be automatically replayed indefinitely, eventually resulting in back pressure if the cause of the rejections is persistent. If set to false these messages will instead be deleted. Disabling auto replays can greatly improve memory efficiency of high throughput streams as the original shape of the data can be discarded immediately upon consumption and mutation.

Type: bool

Default: true