Stream Text Embeddings with Redpanda, OpenAI, and MongoDB

In this lab, you’ll build a retrieval-augmented generation (RAG) pipeline to enhance natural language understanding and response generation using Redpanda, OpenAI, MongoDB Atlas, and LangChain.

This RAG pipeline consists of two phases:

- Acquisition and persistence of new information: In this initial phase, LangChain is used to acquire new information and prepare it for ingestion into Redpanda. The Redpanda Platform adds OpenAI text embeddings to messages as they stream through Redpanda on their way to a MongoDB Atlas vector database. Redpanda handles real-time data ingestion and storage, while Redpanda Connect calls OpenAI’s embeddings API and inserts the enriched messages into MongoDB Atlas. The acquired information, such as documents and webpages, is split into smaller text chunks and stored in MongoDB Atlas along with their vector embeddings. These embeddings encode the semantic meaning of text in a multidimensional space, so MongoDB Atlas can answer a query by retrieving the texts whose embeddings are closest to the query’s embedding.

- Retrieval of relevant contextual information: In this phase, contextual information relevant to the user’s question (prompt) is retrieved from MongoDB Atlas through semantic search. This contextual information is then passed alongside the user’s question to OpenAI’s large language model, which uses the additional context to produce more accurate and contextually relevant answers.

Prerequisites

You must have the following:

- Redpanda Cloud account
- OpenAI developer platform account. Make sure your account has available credits.
- MongoDB Atlas account
- Python 3
- rpk

Set up a local environment

Clone this repository:

    git clone https://github.com/redpanda-data/redpanda-labs.git

Change into the redpanda-labs/connect-plugins/processor/embeddings/openai/ directory:

    cd redpanda-labs/connect-plugins/processor/embeddings/openai

Set up Redpanda Serverless

Log in to your Redpanda Cloud account and create a new Serverless cluster. Make a note of the bootstrap server URL.

Create a topic called documents with the default settings.

Create a new user with permissions (ACLs) to access a topic named documents and a consumer group named connect.

Add the cluster connection information to a local .env file:

    cat > .env << EOF
    REDPANDA_SERVERS="<bootstrap-server-url>"
    REDPANDA_USER="<username>"
    REDPANDA_PASS="<password>"
    REDPANDA_TOPICS="documents"
    EOF

Set up OpenAI API

Log in to your OpenAI developer platform account and create a new Project API key.

Add the secret key to the local .env file:

    cat >> .env << EOF
    OPENAI_API_KEY="<secret_key>"
    OPENAI_EMBEDDING_MODEL="text-embedding-3-small"
    OPENAI_MODEL="gpt-4o"
    EOF

Set up MongoDB Atlas

Log in to your MongoDB Atlas account and deploy a new free cluster for development purposes.
Create a new database named VectorStore, a new collection in that database named Embeddings, and an Atlas Vector Search index with the following JSON configuration:

    {
      "fields": [
        {
          "numDimensions": 1536,
          "path": "embedding",
          "similarity": "euclidean",
          "type": "vector"
        }
      ]
    }

Add the Atlas connection information to the local .env file. The ATLAS_INDEX value must match the name of the Atlas Vector Search index you created:

    cat >> .env << EOF
    # Connection string for MongoDB Driver for Go:
    ATLAS_CONNECTION_STRING="<connection-string>"
    ATLAS_DB="VectorStore"
    ATLAS_COLLECTION="Embeddings"
    ATLAS_INDEX="vector_index"
    EOF

Set the environment variables

Your .env file should now look like this:

    REDPANDA_SERVERS="<bootstrap-server-url>"
    REDPANDA_USER="<username>"
    REDPANDA_PASS="<password>"
    REDPANDA_TOPICS="documents"
    OPENAI_API_KEY="<secret_key>"
    OPENAI_EMBEDDING_MODEL="text-embedding-3-small"
    OPENAI_MODEL="gpt-4o"
    ATLAS_CONNECTION_STRING="<connection-string>"
    ATLAS_DB="VectorStore"
    ATLAS_COLLECTION="Embeddings"
    ATLAS_INDEX="vector_index"

To check your .env file:

    cat .env

Create a Python virtual environment

Create the Python virtual environment in the current directory, install the dependencies, and then leave the environment:

    python3 -m venv env
    source env/bin/activate
    pip install -r requirements.txt
    deactivate

Run the lab

This lab has three parts:

1. Use LangChain’s WebBaseLoader and RecursiveCharacterTextSplitter to generate chunks of text from the BBC Sport website and send each chunk to a Redpanda topic named documents.

2. Use Redpanda Connect to consume the messages from the documents topic and pass each message through a processor that calls OpenAI’s embeddings API to retrieve the vector embeddings for the text. The enriched messages are then inserted into a MongoDB Atlas database collection that has a vector search index.

3. Complete the RAG pipeline by using LangChain to retrieve similar texts from the MongoDB Atlas database and add that context alongside a user question to a prompt that is sent to OpenAI’s gpt-4o model.
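The three parts above can be pictured with a small in-memory sketch. This is not the lab's code: chunk_text is a simplified fixed-window stand-in for RecursiveCharacterTextSplitter, the message field names (text, metadata.source) are assumptions, and the three-dimensional vectors are made-up placeholders for the 1536-dimension embeddings that OpenAI would return. The nearest function ranks stored texts by Euclidean distance, which is the similarity configured for the Atlas index.

```python
import json
import math


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for a text splitter; it ignores separators such
    as paragraphs and sentences.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def to_message(chunk: str, source: str) -> str:
    """Serialize one chunk as a JSON message value (field names are assumed)."""
    return json.dumps({"text": chunk, "metadata": {"source": source}})


def nearest(query: list[float], docs: list[dict], k: int = 2) -> list[str]:
    """Return the k stored texts whose embeddings are closest to the query."""
    ranked = sorted(docs, key=lambda d: math.dist(query, d["embedding"]))
    return [d["text"] for d in ranked[:k]]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Place the retrieved texts ahead of the user question."""
    context_block = "\n".join(contexts)
    return f"Answer using this context:\n{context_block}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Part 1: chunk a document and wrap each chunk as a message.
    for chunk in chunk_text("England named a provisional squad.", size=16, overlap=4):
        print(to_message(chunk, "https://example.com/article"))

    # Parts 2 and 3: placeholder embeddings, then retrieval and prompt assembly.
    stored = [
        {"text": "squad announced", "embedding": [0.9, 0.1, 0.0]},
        {"text": "weather report", "embedding": [0.0, 0.2, 0.9]},
        {"text": "transfer news", "embedding": [0.7, 0.3, 0.1]},
    ]
    print(build_prompt("Who made the squad?", nearest([1.0, 0.0, 0.0], stored)))
```

In the lab itself, Redpanda Connect and MongoDB Atlas perform the embedding and the nearest-neighbor search; the sketch only shows the shape of the data at each stage.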
Start Redpanda Connect

Start Redpanda Connect with the custom OpenAI processor:

    rpk connect run --env-file .env --log.level debug atlas_demo.yaml

You should see the following in the output:

    INFO Running main config from specified file @service=redpanda-connect redpanda_connect_version=v4.33.0 path=atlas_demo.yaml
    INFO Listening for HTTP requests at: http://0.0.0.0:4195 @service=redpanda-connect
    DEBU url: https://api.openai.com/v1/embeddings, model: text-embedding-3-small @service=redpanda-connect label="" path=root.pipeline.processors.0
    INFO Launching a Redpanda Connect instance, use CTRL+C to close @service=redpanda-connect
    INFO Input type kafka is now active @service=redpanda-connect label="" path=root.input
    DEBU Starting consumer group @service=redpanda-connect label="" path=root.input
    INFO Output type mongodb is now active @service=redpanda-connect label="" path=root.output

Generate new text documents

In another terminal window, generate new text documents and send them to Atlas through Redpanda Connect for embeddings:

    source env/bin/activate
    # Single webpage:
    python produce_documents.py -u "https://www.bbc.co.uk/sport/football/articles/c3gglr8mpzdo"
    # Entire sitemap:
    python produce_documents.py -s "https://www.bbc.com/sport/sitemap.xml"

You can view the text and embeddings in the Atlas console.

Run the retrieval and generation chain

Run the retrieval chain and ask OpenAI a question:

    source env/bin/activate
    python retrieve_generate.py -q """
    Which football players made the provisional England national squad for the Euro 2024 tournament, and on what date was this announced?
    """

It takes a few seconds for the following response to appear in the output:

    Question: Which football players made the provisional England national squad for the Euro 2024 tournament, and on what date was this announced?

    Initial answer: As of my knowledge cutoff date in October 2023, the provisional England national squad for the Euro 2024 tournament has not been announced.
    The selection of national teams for major tournaments like the UEFA European Championship typically happens closer to the event, often just a few weeks before the tournament starts. For the most current information, I recommend checking the latest updates from the Football Association (FA) or other reliable sports news sources.

    Augmented answer: The provisional England national squad for the Euro 2024 tournament includes the following players:

    Goalkeepers:
    Dean Henderson (Crystal Palace)
    Jordan Pickford (Everton)
    Aaron Ramsdale (Arsenal)
    James Trafford (Burnley)

    Defenders:
    Jarrad Branthwaite (Everton)
    Lewis Dunk (Brighton)
    Joe Gomez (Liverpool)
    Marc Guehi (Crystal Palace)
    Ezri Konsa (Aston Villa)
    Harry Maguire (Manchester United)
    Jarell Quansah (Liverpool)
    Luke Shaw (Manchester United)
    John Stones (Manchester City)
    Kieran Trippier (Newcastle)
    Kyle Walker (Manchester City)

    Midfielders:
    Trent Alexander-Arnold (Liverpool)
    Conor Gallagher (Chelsea)
    Curtis Jones (Liverpool)
    Kobbie Mainoo (Manchester United)
    Declan Rice (Arsenal)
    Adam Wharton (Crystal Palace)

    Forwards:
    Jude Bellingham (Real Madrid)
    Jarrod Bowen (West Ham)
    Eberechi Eze (Crystal Palace)
    Phil Foden (Manchester City)
    Jack Grealish (Manchester City)
    Anthony Gordon (Newcastle)
    Harry Kane (Bayern Munich)
    James Maddison (Tottenham)
    Cole Palmer (Chelsea)
    Bukayo Saka (Arsenal)
    Ivan Toney (Brentford)
    Ollie Watkins (Aston Villa)

    This announcement was made on May 21, 2024.

Next steps

Learn more about Redpanda Connect and explore the other available connectors.