Filter Messages into a New Topic using a Regex

This example shows how to filter messages from one topic into another using regular expressions (regex) and Redpanda data transforms. If a record in the source topic has a key or value that matches the regex, that record is produced to the sink topic.

Regexes are implemented using Go’s regexp library, which uses the same syntax as RE2. See the RE2 wiki for help with syntax.
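Because RE2 syntax leaves out some constructs found in other regex flavors, such as lookaheads and backreferences, it can help to check that a pattern compiles before deploying it. The following is a minimal sketch using only Go's standard regexp package. The helper name validatePattern and the sample patterns are illustrative assumptions and are not part of the lab.

```go
package main

import (
	"fmt"
	"regexp"
)

// validatePattern (hypothetical helper) reports whether a pattern
// compiles under Go's regexp package, which follows RE2 syntax.
func validatePattern(pattern string) error {
	_, err := regexp.Compile(pattern)
	return err
}

func main() {
	patterns := []string{
		`[a-z]+@[a-z]+\.com`, // character classes and escapes: accepted
		`(?=.*@)`,            // lookahead: not part of RE2 syntax
		`(a)\1`,              // backreference: not part of RE2 syntax
	}
	for _, p := range patterns {
		if err := validatePattern(p); err != nil {
			fmt.Printf("%q rejected: %v\n", p, err)
			continue
		}
		fmt.Printf("%q accepted\n", p)
	}
}
```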

The regex used in this example matches the typical email address pattern.
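To get a feel for what such a pattern accepts, here is a small sketch that applies an email-style regex to three sample lines. The pattern shown is an assumption chosen for illustration. The pattern the lab actually uses is defined in deploy-transform.sh and may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailPattern is an illustrative assumption, not necessarily the
// pattern used by the lab: local part, "@", domain, and a TLD of at
// least two letters.
var emailPattern = regexp.MustCompile(`[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`)

// containsEmail reports whether s contains a substring matching emailPattern.
func containsEmail(s string) bool {
	return emailPattern.MatchString(s)
}

func main() {
	lines := []string{
		"Hello, please contact us at help@example.com.",
		"Hello, please contact us at support.example.com.",
		"Hello, please contact us at help@example.edu.",
	}
	for _, l := range lines {
		fmt.Printf("%-5t %s\n", containsEmail(l), l)
	}
}
```

MatchString looks for a match anywhere in the input, so surrounding text and the trailing period do not prevent a match.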

Prerequisites

You must have the following tools, which the steps below invoke:

  • Git
  • Docker with Docker Compose
  • rpk

Run the lab

  1. Clone this repository:

    git clone https://github.com/redpanda-data/redpanda-labs.git
  2. Change into the data-transforms/regex/ directory:

    cd redpanda-labs/data-transforms/regex
  3. Set the REDPANDA_VERSION environment variable to version 23.3.1 or later. Data transforms were introduced in version 23.3.1. For all available versions, see the GitHub releases.

    For example:

    export REDPANDA_VERSION=24.1.1
  4. Set the REDPANDA_CONSOLE_VERSION environment variable to the version of Redpanda Console that you want to run. For all available versions, see the GitHub releases.

    For example:

    export REDPANDA_CONSOLE_VERSION=2.5.2
  5. Start Redpanda in Docker by running the following command:

    docker compose up -d --wait
  6. Set up your rpk profile:

    rpk profile create regex --from-profile profile.yml
  7. Create the required topics:

    rpk topic create src sink
  8. Build the transforms function:

    rpk transform build
  9. Deploy the transforms function:

    ./deploy-transform.sh

    See the file deploy-transform.sh for the regex used in the transform. Only records that match the regular expression are written to the sink topic.

    This example accepts the following environment variables:

    • PATTERN (required): The regex to match against records. Here, the regex finds messages containing email addresses.

    • MATCH_VALUE (optional): By default, the regex is matched against record keys. If set to true, the regex is matched against record values instead.

  10. Run rpk topic produce:

    rpk topic produce src
  11. Paste the following lines into the prompt, then press Ctrl+C to exit:

    Hello, please contact us at help@example.com.
    Hello, please contact us at support.example.com.
    Hello, please contact us at help@example.edu.
  12. Consume the sink topic to see that input lines containing email addresses were extracted and produced to the sink topic:

    rpk topic consume sink --num 2

    Example output:

    {
      "topic": "sink",
      "value": "Hello, please contact us at help@example.com.",
      "timestamp": 1714525578013,
      "partition": 0,
      "offset": 0
    }
    {
      "topic": "sink",
      "value": "Hello, please contact us at help@example.edu.",
      "timestamp": 1714525579192,
      "partition": 0,
      "offset": 1
    }
The second input line, Hello, please contact us at support.example.com., is not in the sink topic because it did not match the regex that identifies valid email addresses.
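The filtering behavior described above can be sketched in plain Go. This is not the lab's transform source and does not use the Redpanda transform SDK. The Record type and the loadConfig and newFilter helpers are hypothetical stand-ins, and treating an unset or unparsable MATCH_VALUE as false is an assumption. Only the PATTERN and MATCH_VALUE variable names come from the lab.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Record is a hypothetical stand-in for a topic record.
type Record struct {
	Key   []byte
	Value []byte
}

// loadConfig reads PATTERN (required) and MATCH_VALUE (optional) through
// the supplied lookup function, for example os.Getenv.
func loadConfig(getenv func(string) string) (string, bool, error) {
	pattern := getenv("PATTERN")
	if pattern == "" {
		return "", false, errors.New("PATTERN is required")
	}
	// Assumption: an unset or unparsable MATCH_VALUE means "match keys".
	matchValue, _ := strconv.ParseBool(getenv("MATCH_VALUE"))
	return pattern, matchValue, nil
}

// newFilter returns a predicate that tests the record value when
// matchValue is true and the record key otherwise.
func newFilter(pattern string, matchValue bool) (func(Record) bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return func(r Record) bool {
		if matchValue {
			return re.Match(r.Value)
		}
		return re.Match(r.Key)
	}, nil
}

func main() {
	pattern, matchValue, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	keep, err := newFilter(pattern, matchValue)
	if err != nil {
		fmt.Println("invalid pattern:", err)
		return
	}
	records := []Record{
		{Value: []byte("Hello, please contact us at help@example.com.")},
		{Value: []byte("Hello, please contact us at support.example.com.")},
		{Value: []byte("Hello, please contact us at help@example.edu.")},
	}
	for _, r := range records {
		if keep(r) {
			// In the lab, a matching record would be written to the sink topic.
			fmt.Printf("sink <- %s\n", r.Value)
		}
	}
}
```

Records produced with rpk topic produce as in the steps above carry no key, so under this sketch they would only pass the filter when MATCH_VALUE is set to true.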

You can also see the sink topic contents in Redpanda Console.

Switch to the src topic to see all of the events, including the one that does not match the regex and is not in the sink topic.

Clean up

To shut down and delete the containers along with all your cluster data:

docker compose down -v