# text_chunker

---
title: text_chunker
latest-connect-version: 4.93.0
latest-operator-version: v26.1.4
latest-console-tag: v3.7.3
latest-redpanda-tag: v26.1.9
docname: processors/text_chunker
page-component-name: connect
page-version: master
page-component-version: master
page-component-title: Connect
page-relative-src-path: processors/text_chunker.adoc
page-edit-url: https://github.com/redpanda-data/rp-connect-docs/edit/main/modules/components/pages/processors/text_chunker.adoc
page-git-created-date: "2025-05-02"
page-git-modified-date: "2026-05-26"
---

<!-- Source: https://docs.redpanda.com/connect/components/processors/text_chunker.md -->

**Available in:** [Cloud](https://docs.redpanda.com/cloud-data-platform/develop/connect/components/processors/text_chunker/ "View the Cloud version of this component"), Self-Managed

Breaks text-based message content into smaller chunks using a configurable strategy. This processor is well suited to preparing large text documents for vector embedding.

Introduced in version 4.51.0.

#### Common

```yml
processors:
  label: ""
  text_chunker:
    strategy: "" # No default (required)
    chunk_size: 512
    chunk_overlap: 100
    separators:
      - "\n\n"
      - "\n"
      - " "
      - ""
    length_measure: runes
    include_code_blocks: false
    keep_reference_links: false
```

#### Advanced

```yml
processors:
  label: ""
  text_chunker:
    strategy: "" # No default (required)
    chunk_size: 512
    chunk_overlap: 100
    separators:
      - "\n\n"
      - "\n"
      - " "
      - ""
    length_measure: runes
    token_encoding: "" # No default (optional)
    allowed_special: []
    disallowed_special:
      - "all"
    include_code_blocks: false
    keep_reference_links: false
```

## [](#fields)Fields

### [](#allowed_special)`allowed_special[]`

A list of special tokens to include in the output from this processor.

**Type**: `array`

**Default**: `[]`

### [](#chunk_overlap)`chunk_overlap`

The number of characters duplicated in adjacent chunks of text.

**Type**: `int`

**Default**: `100`

### [](#chunk_size)`chunk_size`

The maximum size of each chunk, using the selected [`length_measure`](#length_measure).

**Type**: `int`

**Default**: `512`
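
As a rough illustration of how `chunk_size` and [`chunk_overlap`](#chunk_overlap) fit together, the following sketch configures smaller chunks than the defaults. The `pipeline.processors` placement and the values `256` and `32` are assumptions chosen for illustration, not recommendations.

```yml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 256       # Upper bound per chunk, measured with length_measure
        chunk_overlap: 32     # Amount of text repeated between adjacent chunks
        length_measure: runes # Default measure: count of code points
```

A larger overlap carries more context across chunk boundaries, at the cost of more duplicated text in the output.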

### [](#disallowed_special)`disallowed_special[]`

A list of special tokens to exclude from the output of this processor.

**Type**: `array`

**Default**:

```yaml
- "all"
```
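
The following sketch shows one way [`allowed_special`](#allowed_special) and `disallowed_special` might be combined with the `token` strategy. It assumes that the input can contain the literal text `<|endoftext|>` and that the chosen encoding defines it as a special token; both are assumptions for illustration.

```yml
pipeline:
  processors:
    - text_chunker:
        strategy: token
        token_encoding: cl100k_base
        allowed_special:
          - "<|endoftext|>" # Assumed special token present in the input
        disallowed_special:
          - "all"           # Default value, shown here for clarity
```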

### [](#include_code_blocks)`include_code_blocks`

When set to `true`, this processor includes code blocks in the output.

**Type**: `bool`

**Default**: `false`

### [](#keep_reference_links)`keep_reference_links`

When set to `true`, this processor includes reference links in the output.

**Type**: `bool`

**Default**: `false`

### [](#length_measure)`length_measure`

Choose a method to measure the length of a string.

**Type**: `string`

**Default**: `runes`

| Option | Summary |
| --- | --- |
| graphemes | Use the number of Unicode grapheme clusters to determine the length of a string. |
| runes | Use the number of Unicode code points to determine the length of a string. |
| token | Use the number of tokens, as produced by the [`token_encoding`](#token_encoding) tokenizer, to determine the length of a string. |
| utf8 | Use the number of UTF-8 bytes to determine the length of a string. |
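
When chunks feed a model with a token limit, measuring length in tokens may be more useful than counting code points. A minimal sketch, assuming the `cl100k_base` encoding matches the downstream model (the numeric values are illustrative):

```yml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        length_measure: token
        token_encoding: cl100k_base # Tokenizer used to count tokens
        chunk_size: 256             # Measured in tokens because of length_measure
        chunk_overlap: 32
```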

### [](#separators)`separators[]`

A list of strings to use as separators between chunks when the [`recursive_character` strategy option](#strategy) is specified.

By default, the following separators are tried in turn until one is successful:

-   Double newlines (`\n\n`)
-   Single newlines (`\n`)
-   Spaces (`" "`)

**Type**: `array`

**Default**:

```yaml
- "\n\n"
- "\n"
- " "
- ""
```
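
If the input has its own structure, the default list can be replaced. The following sketch tries sentence boundaries before falling back to spaces. The `". "` separator is an assumption about the input text, not a documented default, and the other values simply repeat the defaults.

```yml
pipeline:
  processors:
    - text_chunker:
        strategy: recursive_character
        chunk_size: 512
        chunk_overlap: 100
        separators:
          - "\n\n" # Paragraph breaks first
          - "\n"   # Then line breaks
          - ". "   # Then sentence boundaries (assumed input convention)
          - " "    # Then spaces
          - ""     # Empty string, as in the default list
```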

### [](#strategy)`strategy`

Choose a strategy for breaking content down into chunks. This field is required and has no default.

**Type**: `string`

| Option | Summary |
| --- | --- |
| markdown | Split text by Markdown headers. |
| recursive_character | Split text recursively by characters (defined in [`separators`](#separators)). |
| token | Split text by tokens. |
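
A minimal sketch of the `markdown` strategy follows. It assumes that [`include_code_blocks`](#include_code_blocks) and [`keep_reference_links`](#keep_reference_links) are the options most relevant to Markdown input; the values shown are illustrative.

```yml
pipeline:
  processors:
    - text_chunker:
        strategy: markdown
        chunk_size: 512
        chunk_overlap: 100
        include_code_blocks: true  # Include code blocks in the output
        keep_reference_links: true # Include reference links in the output
```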

### [](#token_encoding)`token_encoding`

The type of encoding to use for tokenization.

**Type**: `string`

```yaml
# Examples:
token_encoding: cl100k_base

# ---

token_encoding: r50k_base
```