text_chunker

Available in: Cloud, Self-Managed

Breaks down text-based message content into manageable chunks using a configurable strategy. This processor is useful for preparing large text documents for vector embedding.

# Common configuration fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  include_code_blocks: false
  keep_reference_links: false

# All configuration fields, showing default values
label: ""
text_chunker:
  strategy: "" # No default (required)
  chunk_size: 512
  chunk_overlap: 100
  separators:
    - "\n\n"
    - "\n"
    - " "
    - ""
  length_measure: runes
  token_encoding: cl100k_base # No default (optional)
  allowed_special: []
  disallowed_special:
    - all
  include_code_blocks: false
  keep_reference_links: false
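
As a rough sketch of how these fields might sit inside a pipeline, the following configuration splits each message with the `recursive_character` strategy and restates two defaults for clarity. The surrounding `pipeline.processors` wrapper and the label value are illustrative assumptions; only the `text_chunker` fields themselves come from this reference.

```yaml
# Illustrative sketch: strategy is the only required field.
pipeline:
  processors:
    - label: chunk_documents # hypothetical label
      text_chunker:
        strategy: recursive_character
        chunk_size: 512 # default, restated for clarity
        chunk_overlap: 100 # default, restated for clarity
```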

Fields

`allowed_special[]`

A list of special tokens to include in the output from this processor.

Type: array

Default: []

`chunk_overlap`

The number of characters duplicated in adjacent chunks of text.

Type: int

Default: 100

`chunk_size`

The maximum size of each chunk, measured using the selected `length_measure`.

Type: int

Default: 512

`disallowed_special[]`

A list of special tokens to exclude from the output of this processor.

Type: array

Default:

- all

`include_code_blocks`

When set to true, this processor includes code blocks in the output.

Type: bool

Default: false

`keep_reference_links`

When set to true, this processor includes reference links in the output.

Type: bool

Default: false

`length_measure`

Choose a method to measure the length of a string.

Type: string

Default: runes

Options:

- `graphemes`: Use Unicode graphemes to determine the length of a string.
- `runes`: Use the number of code points to determine the length of a string.
- `token`: Use the number of tokens (using the `token_encoding` tokenizer) to determine the length of a string.
- `utf8`: Use the number of UTF-8 bytes to determine the length of a string.
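
To show how `length_measure` changes the meaning of `chunk_size`, here is a minimal sketch that measures chunks in tokens rather than code points. The specific values are arbitrary, and pairing `length_measure: token` with an explicit `token_encoding` is an assumption based on the option description above.

```yaml
# Illustrative sketch: chunk_size is measured in tokens here, not code points.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 256
        length_measure: token
        token_encoding: cl100k_base # one of the example encodings listed in this reference
```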

`separators[]`

A list of strings to use as separators between chunks when `strategy` is set to `recursive_character`.

By default, the following separators are tried in turn until one is successful:

- Double newlines (`"\n\n"`)
- Single newlines (`"\n"`)
- Spaces (`" "`)
- The empty string (`""`)

Type: array

Default:

- "\n\n"
- "\n"
- " "
- ""
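
A custom separator list could look like the sketch below, which tries sentence boundaries before falling back to spaces. The `". "` entry is a hypothetical addition for illustration; the other entries come from the default list.

```yaml
# Illustrative sketch: separators are listed from coarsest to finest.
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        separators:
          - "\n\n" # paragraph breaks first
          - ". "   # then sentence boundaries (hypothetical addition)
          - " "    # then spaces
          - ""     # finally the empty string, as in the default list
```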

`strategy`

Choose a strategy for breaking content down into chunks.

Type: string

Options:

- `markdown`: Split text by markdown headers.
- `recursive_character`: Split text recursively by characters (defined in `separators`).
- `token`: Split text by tokens.
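
For Markdown sources, a configuration along the following lines may be a reasonable starting point. It assumes that `include_code_blocks` and `keep_reference_links` are most relevant when chunking Markdown, which this reference does not state explicitly; the `chunk_size` value is arbitrary.

```yaml
# Illustrative sketch: split Markdown by headers, keeping code blocks and reference links.
pipeline:
  processors:
    - text_chunker:
        strategy: markdown
        chunk_size: 1000
        include_code_blocks: true
        keep_reference_links: true
```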

`token_encoding`

The type of encoding to use for tokenization.

Type: string

# Examples:
token_encoding: cl100k_base
token_encoding: r50k_base
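
The sketch below combines the `token` strategy with an explicit encoding and one permitted special token. The token name `<|endoftext|>` and the chunk sizes are used purely as examples, and how `allowed_special` interacts with the default `disallowed_special` value of `all` is an assumption, not something this reference confirms.

```yaml
# Illustrative sketch: token-based chunking with an explicit encoding.
pipeline:
  processors:
    - text_chunker:
        strategy: token
        chunk_size: 256
        chunk_overlap: 32
        token_encoding: cl100k_base
        # Assumption: "<|endoftext|>" serves only as an example of a special token name.
        allowed_special:
          - "<|endoftext|>"
```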
