parquet

Type: Input

Available in: Self-Managed

Reads and decodes Parquet files into a stream of structured messages.

Introduced in version 4.8.0.

# Common config fields, showing default values
input:
  label: ""
  parquet:
    paths: [] # No default (required)
    auto_replay_nacks: true

# All config fields, showing default values
input:
  label: ""
  parquet:
    paths: [] # No default (required)
    batch_count: 1
    auto_replay_nacks: true

This input uses https://github.com/parquet-go/parquet-go, which is itself experimental. Changes to how this input functions could therefore be made outside of major version releases.

By default, any BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY value is extracted as a byte slice ([]byte) unless its logical type is UTF8, in which case it is extracted as a string (string).

When a value extracted as a byte slice is part of a document that is later serialized as JSON, it is base64-encoded into a string by default, which is the standard treatment for arbitrary binary fields. You can convert these binary values to strings (or other data types) with Bloblang transformations such as root.foo = this.foo.string() or root.foo = this.foo.encode("hex").
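As a rough illustration of the conversion described above, the following sketch reads a Parquet file and rewrites a binary field in a mapping processor. The field name foo comes from the text above; the file path, the hex-encoded foo_hex field, and the stdout output are assumptions chosen only to make the example complete.

```yaml
input:
  parquet:
    paths:
      - /tmp/foo.parquet # hypothetical file path

pipeline:
  processors:
    - mapping: |
        # Keep the document as-is, then override the binary field.
        root = this
        # Interpret the raw bytes as a string instead of base64 output.
        root.foo = this.foo.string()
        # Alternatively, keep a hex representation (hypothetical extra field).
        root.foo_hex = this.foo.encode("hex")

output:
  stdout: {}
```

Converting with string() only makes sense when the bytes are known to hold text; for opaque binary data, a hex or base64 representation is usually the safer choice.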

Fields

`auto_replay_nacks`

Whether messages that are rejected (nacked) at the output level should be automatically replayed indefinitely, eventually resulting in back pressure if the cause of the rejections is persistent. If set to false, these messages are deleted instead. Disabling auto replays can greatly improve the memory efficiency of high-throughput streams, as the original shape of the data can be discarded immediately upon consumption and mutation.

Type: bool

Default: true
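A minimal sketch of disabling replays, assuming a one-off import where dropping rejected messages is acceptable. The file path is hypothetical.

```yaml
input:
  parquet:
    paths:
      - /tmp/foo.parquet # hypothetical file path
    # Rejected (nacked) messages are deleted rather than replayed.
    auto_replay_nacks: false
```

With this setting, a persistent output failure results in lost messages instead of back pressure, so it is best suited to data that can be re-read from the source files if needed.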

`batch_count`

Optionally process records in batches. This can help to speed up the consumption of exceptionally large files. When the end of the file is reached, the remaining records are processed as a (potentially smaller) batch.

Type: int

Default: 1
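The following sketch shows batching applied to a set of large files. The batch size of 500 and the glob path are illustrative assumptions, not recommended values; a suitable size depends on record width and available memory.

```yaml
input:
  parquet:
    paths:
      - /tmp/data/**/*.parquet # hypothetical location of large files
    # Emit up to 500 records per batch; the final batch of each file
    # may contain fewer records.
    batch_count: 500
```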

`paths[]`

A list of file paths to read from. Each file will be read sequentially until the list is exhausted, at which point the input will close. Glob patterns are supported, including super globs (double star).

Type: array

# Examples:
paths:
  - /tmp/foo.parquet
  - /tmp/bar/*.parquet
  - /tmp/data/**/*.parquet
