text_chunker
Breaks down text-based message content into manageable chunks using a configurable strategy. This processor is ideal for creating vector embeddings of large text documents.
Common

    processors:
      label: ""
      text_chunker:
        strategy: "" # No default (required)
        chunk_size: 512
        chunk_overlap: 100
        separators:
          - "\n\n"
          - "\n"
          - " "
          - ""
        length_measure: runes
        include_code_blocks: false
        keep_reference_links: false

Advanced

    processors:
      label: ""
      text_chunker:
        strategy: "" # No default (required)
        chunk_size: 512
        chunk_overlap: 100
        separators:
          - "\n\n"
          - "\n"
          - " "
          - ""
        length_measure: runes
        token_encoding: "" # No default (optional)
        allowed_special: []
        disallowed_special:
          - "all"
        include_code_blocks: false
        keep_reference_links: false
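As a minimal sketch of how these fields might be filled in, the config below selects the `recursive_character` strategy (the only strategy value named on this page) and spells out the default size and overlap. The enclosing `pipeline` section is an assumption about where the processor is placed; adjust it to match your own config layout.

```yaml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 512        # maximum chunk length, measured with length_measure
        chunk_overlap: 100     # text repeated between adjacent chunks
        length_measure: runes  # the default: count codepoints
```

Only `strategy` has no default, so the other three lines can be dropped without changing the behavior this page describes.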
Fields
allowed_special[]
A list of special tokens to include in the output from this processor.
Type: array
Default: []
chunk_overlap
The number of characters duplicated in adjacent chunks of text.
Type: int
Default: 100
chunk_size
The maximum size of each chunk, using the selected length_measure.
Type: int
Default: 512
disallowed_special[]
A list of special tokens to exclude from the output of this processor.
Type: array
Default:
- "all"
include_code_blocks
When set to true, this processor includes code blocks in the output.
Type: bool
Default: false
keep_reference_links
When set to true, this processor includes reference links in the output.
Type: bool
Default: false
length_measure
Choose a method to measure the length of a string.
Type: string
Default: runes
| Option | Summary |
|---|---|
| `graphemes` | Use Unicode graphemes to determine the length of a string. |
| `runes` | Use the number of codepoints to determine the length of a string. |
| `token` | Use the number of tokens (using the `token_encoding` tokenizer) to determine the length of a string. |
| `utf8` | Determine the length of text using the number of UTF-8 bytes. |
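When chunks feed a model with a token limit, measuring length in tokens may be a closer fit than counting codepoints. The sketch below assumes that the token-based option is named `token` and that `cl100k_base` is an accepted `token_encoding` value. Neither is confirmed by this page, so check both against the version you are running.

```yaml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 256              # counted in tokens under this length_measure
        chunk_overlap: 32
        length_measure: token        # assumed option name
        token_encoding: cl100k_base  # assumed encoding name
```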
separators[]
A list of strings to use as separators between chunks when the recursive_character strategy option is specified.
By default, the following separators are tried in turn until one is successful:
- Double newlines (`"\n\n"`)
- Single newlines (`"\n"`)
- Spaces (`" "`)
- The empty string (`""`)
Type: array
Default:
- "\n\n"
- "\n"
- " "
- ""
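A custom list can be supplied in the same priority order. The following sketch is illustrative only: the `". "` entry is a hypothetical addition for prose that should break at sentence ends before falling back to spaces, and the enclosing `pipeline` section is an assumption about where the processor is placed.

```yaml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 512
        chunk_overlap: 100
        separators:
          - "\n\n"  # paragraph breaks first
          - "\n"    # then line breaks
          - ". "    # then sentence ends (hypothetical addition)
          - " "     # then spaces
          - ""      # finally the empty string, as in the defaults
```

Setting `separators` replaces the default list rather than extending it, so any default entries that should still apply need to be repeated, as they are here.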