Week 6 Summary - RAG
We are back in the office after two weeks of vacation. We have now completed five weeks of the project, with three weeks remaining. When we left for vacation two weeks ago, we had accomplished the following:
- Deployed the backend server to Microsoft Azure App Service in the cloud.
- Configured the API and set up connections to an Azure Storage account for temporary storage of the audio files. We also made sure that each device plays its own corresponding audio files, so the app can be used on multiple devices simultaneously.
- Switched the OpenAI API keys from "our own" to Azure's resources, accessed through Azure Key Vault.
- Successfully built the frontend client locally as an APK file, so that Android devices can download and install the app without going through Google Play. The app then communicates with the online server through the Azure API. We will need an Apple Developer Account to do the same for iPhone.
- Identified and fixed as many bugs as possible.
Coming into week 6, with the Translation feature and the Community Guide feature completed, we wanted to focus on the last remaining task: the RAG functionality for storing and analyzing the conversations. The goal of this feature is for the admin of the app (for instance the local manager of a Red Cross branch) to be able to keep statistics and analyze the conversational data in order to identify expressed needs and structural barriers, aiding project evaluation and mapping to improve operations. This will be done using the RAG (Retrieval-Augmented Generation) technique. In summary, RAG works by storing conversations as vectors, querying a vector database to retrieve the conversations most relevant to a question, and then letting a language model generate an answer to that question based on the retrieved context. GDPR compliance, acceptance of terms, and how AI can be applied to identify and anonymize private data in the conversations are topics we will return to and focus on later.
At first, we wanted to implement a variant of RAG called GraphRAG. GraphRAG enhances the retrieval step by leveraging the structure of a graph to improve the relevance and accuracy of the retrieved information: it considers the relationships between different nodes (documents or passages) in the graph and uses this information to better match the query with the most relevant documents. We think this technique matches our specific use case best, since we want to gather data about a specific topic across a wide range of vectors. However, GraphRAG comes from very recent research, so there is little information and few guides on how to implement it. It also requires Python, while our tech stack runs on Node.js and JavaScript.
During the week, we explored different approaches to implementing the RAG functionality: running the Python script in an Azure Function App behind a separate API, implementing GraphRAG in Python, building RAG with the LangChain framework in both Python and JavaScript, and trying different vector databases such as ChromaDB, Cosmos DB, and MongoDB. At the end of the week, we managed to implement a fully functional baseline RAG (i.e. "normal" RAG) ourselves, without any RAG framework like LangChain. We felt this approach was the best way for us to learn, to fully understand the RAG functionality, and to easily control and manage the RAG pipeline in our code.
How does our RAG pipeline work?
- Store conversations as vectors: Conversations are transformed into numerical vectors using a technique called embedding, which captures the semantic meaning of the text. Each conversation is concatenated into a single string and sent to OpenAI's embedding model, which returns a 1536-dimensional vector that preserves the semantics of the string. This vector is then uploaded, together with the original string and some additional metadata, to our Azure resource Cosmos DB for MongoDB vCore, an open-source, document-oriented NoSQL database that stores data in a flexible, JSON-like format. Its natively integrated vector database lets us efficiently store, index, and query high-dimensional vector data directly in Azure Cosmos DB for MongoDB vCore, alongside the original data the vectors were created from (see the first sketch after this list).
- Index the vector database: We have added a vector search index to our collection directly in the MongoDB shell. We use the IVF (Inverted File) algorithm, starting with a single cluster to group the vector data, which effectively makes the search brute-force, and cosine similarity as the distance metric (see the index sketch after this list).
- Query the vector database: When a new query is received, it is also embedded and compared to the stored vectors in the database. This part of the RAG pipeline starts in the new Admin Search Screen of the app, which has a Q&A chat interface where the admin can ask questions about the data and get LLM-generated responses based on it. When the query is sent, it is vectorized by OpenAI's embedding model, and we then perform a vector search between the query vector and the stored vectors (see the last sketch after this list).
- Retrieve relevant conversations: The most similar vectors (conversations) are retrieved from the database based on their similarity to the query. The number of retrieved vectors is currently set to 10, but this is a number we will experiment with.
- Generate a response: The retrieved conversations are used to generate a response with a language model, which can understand and generate human-like text. From the retrieved vectors we extract the original text, which is sent to OpenAI's GPT-4 model together with the question and an instruction for the LLM to follow: "Based on the following information: … Answer this question: …".
- Deliver the response: The generated response is then delivered to the user.
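To make the storage step concrete, here is a minimal sketch of how it can look in Node.js, assuming the `openai` and `mongodb` npm packages. The database, collection, and field names, as well as the `text-embedding-ada-002` model name, are illustrative assumptions rather than our exact code:

```javascript
// A minimal sketch of the "store conversations as vectors" step (illustrative names).
import OpenAI from "openai";
import { MongoClient } from "mongodb";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const mongo = new MongoClient(process.env.COSMOS_MONGODB_CONNECTION_STRING);

async function storeConversation(conversationText, metadata = {}) {
  // Embed the concatenated conversation string; the model returns a 1536-dimensional vector.
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-ada-002", // assumed embedding model
    input: conversationText,
  });
  const vector = embeddingResponse.data[0].embedding;

  // Store the vector together with the original text and some metadata.
  await mongo.connect();
  const conversations = mongo.db("appdb").collection("conversations");
  await conversations.insertOne({
    text: conversationText,
    contentVector: vector,
    createdAt: new Date(),
    ...metadata,
  });
}
```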
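The vector search index is created once in the MongoDB shell (mongosh). Below is a sketch of the command, using the same illustrative collection and field names as above; `numLists: 1` corresponds to the single IVF cluster mentioned in the list, and `similarity: "COS"` to cosine similarity:

```javascript
// Run in mongosh against the Cosmos DB for MongoDB vCore cluster (illustrative names).
db.runCommand({
  createIndexes: "conversations",
  indexes: [
    {
      name: "vectorSearchIndex",
      key: { contentVector: "cosmosSearch" }, // index the embedding field for vector search
      cosmosSearchOptions: {
        kind: "vector-ivf",  // IVF (Inverted File) algorithm
        numLists: 1,         // one cluster => effectively brute-force search
        similarity: "COS",   // cosine similarity
        dimensions: 1536,    // matches the embedding model's output size
      },
    },
  ],
});
```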
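Finally, a sketch of the query, retrieval, and generation steps, reusing the `openai` and `mongo` clients from the first sketch. The `$search`/`cosmosSearch` aggregation stage is the vector search in Cosmos DB for MongoDB vCore, while the helper name and exact prompt wording are illustrative:

```javascript
// A minimal sketch of query, retrieval, and generation (reuses `openai` and `mongo` above).
async function answerAdminQuestion(question) {
  // 1. Embed the admin's question with the same embedding model as the conversations.
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });
  const queryVector = embeddingResponse.data[0].embedding;

  // 2. Vector search: retrieve the 10 most similar conversations from the database.
  const conversations = mongo.db("appdb").collection("conversations");
  const topMatches = await conversations
    .aggregate([
      {
        $search: {
          cosmosSearch: { vector: queryVector, path: "contentVector", k: 10 },
          returnStoredSource: true,
        },
      },
      { $project: { _id: 0, text: 1 } },
    ])
    .toArray();

  // 3. Generate an answer from the original texts of the retrieved conversations.
  const context = topMatches.map((doc) => doc.text).join("\n---\n");
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content: `Based on the following information:\n${context}\n\nAnswer this question: ${question}`,
      },
    ],
  });
  return completion.choices[0].message.content;
}
```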