In our previous blog post of this series, we introduced vector database as an essential tool to build a reliable and accurate chatbot with large language model (LLM) capability. Companies worldwide are moving fast to integrate LLMs into existing products and even creating entirely new products using LLMs. However, one of the two most common and significant challenges with these LLMs are: hallucination and outdated information. Yet, we can overcome these issues by augmenting our LLM with the right components.
In this next blog post, we are going to illustrate the true power of vector databases by building a step-by-step guide to create a SageMaker notebook to answer questions related to the AWS Well-Architected Framework.
Our workflow is as follows:
- We submit a query (i.e. question) via notebook;
- The query is processed by our LangChain orchestrator;
- We generate an embedding (i.e. vector) from the query using an embedding model;
- That embedded query is processed again by our orchestrator and send to our private open-source vector database via AWS API Gateway to retrieve relevant documents for context;
- Our query combined with our set of relevant documents are submitted to our large language model;
- A response is generated and displayed to the user.
In terms of the requirements for our experiment, we used the following items:
- SageMaker:
- Image: Data Science 3.0
- Kernel: Python 3.10
- Instance Type: ml.m5.4xlarge 16vCPU + 64GB
- Embedding model: huggingface-textembedding-all-MiniLM-L6-v2
- LLM model: meta-textgeneration-llama-2-7b-f
- Vector database: Chroma DB
Deploying Models in SageMaker
There are two methods for deploying LLM as endpoint on AWS SageMaker:
- HuggingFaceModel which is used to deploy HuggingFace models as SageMaker endpoints.
- JumpStartModel which is used for all the models available in AWS stable of jumpstart models.
We will use the latter to deploy both our embedding model and our large language model.
Large Language Model (LLM)
In order to deploy our Llama2 language model, we first need to define a handler class to transform inputs and outputs from Llama 2 to a format that SageMaker endpoint expects.
You might have noticed that we used 3 parameters to calibrate our payload for the inputs:
- Max new tokens: parameter allows you to set an upper limit on the number of tokens generated in addition to the input tokens.
- Top_p: It helps in controlling the diversity of generated output.
- Temperature: It controls the randomness.
Now we will deploy our SageMaker endpoint for our LLM.
Now that we have everything set up, let’s test our LLM! The query submitted is “What are the pillars of the AWS Well-Architected Framework?”
As we can see, the output from our LLM is partially correct as it only mentioned 5 pillars missing the 6th one which is Sustainability (see The pillars of the framework).
If we provide some contexts to our LLM with our query, we see the results are accurate:
The LLM is following our instructions and we’ve also demonstrated how contexts can help our LLM answer questions accurately. However, we’re unlikely to be inserting a context directly into a prompt like this unless we already know the answer — and if we already know the answer why would we be asking the question at all?
We need a way of extracting relevant contexts from our custom pdf knowledge base. For that we need a private vector database.
Embedding Model
We need to create a SageMaker endpoint for our embedding model following the same steps we used to create our LLM endpoint. The goal is to convert our documents to embeddings that we will store in our private vector database. This will allow us to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to LLM.
Similar to Llama2 deployment, we create our SageMaker endpoint:
Fetching and Processing the Sample Data
Next, we are going to fetch the sample data from the AWS well-architected repositories which consist of pdf documents. The pdf documents are stored locally on our SageMaker notebook.
Note that we have collected the metadata for each file as this may help with information retrieval from the vector database.
Next, we clean the downloaded pdf documents. What we do is to break them down into pieces so you can provide the most relevant sections to the LLM as part of our workflow. Here we will iterate over all the documents and break them into 512-character chunks with an overlap of 100 characters.
The result shows that we have processed 3073 pages and created 20082 chunks of text which will be later converted into embeddings and stored in our private vector database.
Connecting to Vector Database
With our embedding references ready, the next step is to actually process those document chunks into vectors and store them into our vector database. Our Chroma vector database endpoint has been made secure by putting it behind an API Gateway using an API key.
First, we create a client-server connection to our private Chroma database using HTTPS requests. We also created a collection named: AWS_WAR_2023_09_19 which represents the name of the group of embeddings related to a specific task. It is worth mentioning that we can have multiple collections within a vector database. Then we load our embeddings into private Chroma database:
Running Vector Queries
Now that we have populated our private vector database, we can run queries against it to return relevant document chunks.
The results that came back from the similarity_search_with_score API are sorted by score from lowest to highest. The score value is represented by the L2-norm similarity score of each result. The distance can be any value between zero and infinity. If the distance is zero, the vectors are identical. The larger the distance, the farther apart the vectors are.
Let’s get all the results together!
We have gotten results from our private vector database, but currently they are just chunks of the original documents and some of them might not contain the information we want to provide as an answer to our original query.
To generate the appropriate response, we leverage a prompt template that takes the original question asked along with relevant context chunks from our private Chroma database to generate a new response from our LLM.
LangChain provides functionality to allow for easier creation and population of prompt templates. The template below has specific placeholder values for {context} and {question}, which we will provide to fill out the template.
With LLM model endpoint deployed within our own environment, we are now ready to build our chain.
We use LangChain: RetrievalQA chain as follows:
- Take a query as input
- Generate query embedding
- Query the private vector database for relevant document chunks based on the query embedding
- Inject the context and the original query into the prompt template
- Invoke the LLM with the completed prompt
- Return the result
And we get the following results:
As we can see, the result to the question “What are the pillars of the AWS Well-Architected Framework?” is a lot more accurate.
We asked another question and the answer seems very accurate and reliable.
Summary
That’s it for our blog post on understanding vector databases from an implementation viewpoint. We’ve explored how to deploy SageMaker’s Jumpstart models with Llama 2, and embedding models with MiniLM.
We have implemented a complete end-to-end RAG pipeline using our open-access models and a private Chroma vector database. Using this, we minimise hallucinations, keep our LLM knowledge up to date, and ultimately enhance the user experience and trust in our systems.