A recap of my conference talk
I recently gave a talk at the DataEngBytes Sydney conference on a subject that’s transforming the way we harness data’s potential: Vector Databases. In a world driven by the exponential growth of data, these innovative solutions are reshaping how we manage and extract insights from information. Imagine a world where your digital experiences are finely tailored – where movie recommendations match your mood flawlessly, and engaging conversations with your internal documents become possible. This captivating narrative is powered by vector databases, a technological leap beyond traditional data management.
This post serves as a comprehensive encapsulation of my recent conference talk, aptly titled “Vector Databases: The What, the How, and the Why.” For further expansion on some of the details mentioned in this article, please refer to our Cevo blog post titled “Exploring the Power of Vector Databases”.
Large Language Models (LLMs)
Powered by the self-attention mechanism, Large Language Models (LLMs) have taken the world by storm since the release of ChatGPT late last year. However, these models face a few challenges when it comes to responding with factual information:
- Hallucination: The AI model imagines or fabricates information that does not directly correspond to the provided input.
- Lack of specific knowledge: Unable to find answers about your organisation that are backed by institutional knowledge.
- Lack of explainability: Unable to verify the enriched responses from your LLM using the relationships in your enterprise knowledge database.
We often hear of methods to mitigate the challenges mentioned above, notably fine-tuning and vector databases.
Fine-tuning enhances an LLM’s performance by tailoring pretrained models to specific tasks or domains, leading to improved accuracy through learning task-specific patterns, domain adaptability by incorporating domain-specific terminology, and leveraging transfer learning for faster task acquisition and better generalisation. However, fine-tuning does not eliminate the risk of hallucinations, carries the high cost of hosting large models, and may require retraining whenever the underlying knowledge base changes.
The What: Understanding vector databases
Vector databases are designed to store and retrieve multidimensional vector embeddings. Vector embedding takes complex objects (text documents, images, audio and video), translates them into numbers, arranges them sensibly in a space, and lets us perform meaningful operations on them.
They excel in similarity-based querying, enabling applications like recommendation systems and image search.
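To make this concrete, here is a minimal sketch of a similarity query over a handful of made-up, three-dimensional embeddings; real embedding models produce hundreds or thousands of dimensions, but the operation is the same.

```python
import numpy as np

# Toy 3-dimensional embeddings; the labels and values are invented for
# illustration. Real models produce much higher-dimensional vectors.
labels = ["action movie", "thriller", "cooking documentary"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

query = np.array([0.85, 0.15, 0.05])  # embedding of the user's request

def cosine_similarity(a, b):
    # Higher values mean the vectors point in a more similar direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine_similarity(query, e) for e in embeddings]
best = int(np.argmax(scores))
print(f"Most similar item: {labels[best]} (score {scores[best]:.3f})")
```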
In terms of their characteristics, we have:
- Data Representation: In traditional NoSQL databases, data is typically stored as key-value pairs, documents, or wide-column structures. In contrast, vector databases store and operate on vectors, which are mathematical representations of data points.
- Querying Approach: Traditional NoSQL databases often rely on exact matching queries based on keys or specific attribute values. Vector databases, on the other hand, use similarity-based queries, where the goal is to find vectors that are most similar to a given query vector.
- Optimisation Techniques: Vector databases employ specialised algorithms for Approximate Nearest Neighbor (ANN) search, which optimise the search process. These algorithms may involve techniques such as hashing, quantisation, or graph-based search. Traditional NoSQL databases typically focus on different optimisation methods depending on their data models.
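As an illustration of the trade-off between exact and approximate search, here is a sketch using the open-source FAISS library (installable as faiss-cpu); the index types, dataset sizes and parameters are choices made purely for this example, not a recommendation.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000                          # embedding dimensionality and corpus size
rng = np.random.default_rng(0)
xb = rng.random((n, d), dtype=np.float32)    # stored vectors
xq = rng.random((5, d), dtype=np.float32)    # query vectors

# Exact (brute-force) search: every query is compared against all n vectors.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
exact_dist, exact_ids = flat.search(xq, 10)

# Approximate search: vectors are clustered into 256 cells at training time,
# and only nprobe cells are scanned per query, trading a little recall for speed.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 256)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8
approx_dist, approx_ids = ivf.search(xq, 10)
```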
The How: Understanding the mechanics behind vector DBs
Vector databases are specifically designed for unstructured data and yet provide some of the functionality you’d expect from a traditional relational database. They can execute CRUD operations (create, read, update, and delete) on the vectors they store, provide data persistence, and filter queries by metadata. When you combine vector search with database operations, you get a powerful tool with many applications.
The optimisation of a vector database depends on the scope of its implementation since each is customised for datasets that cater to various use-cases.
Here are six crucial elements that vector databases need to include:
- Single level store: Vector databases adopt the principle of a single level store, which implies that the on-disk representation of data is mirrored as closely as possible in vector format. This approach enables efficient querying and fast processing of large datasets while minimising the need for data movement. In addition, the data store retains metadata and metrics that gauge the relevance and similarity among vector embeddings.
- Indexing: Vector databases use structures like nearest neighbour indexes to assess how similar objects are to each other. Exact nearest neighbour search becomes a problem for large indexes, as it requires comparing the search query against every indexed vector, which takes time. To address this problem, vector databases implement Approximate Nearest Neighbor (ANN) algorithms, which trade a small amount of precision for much faster performance. Popular methods for building ANN indexes include graph-based structures, locality-sensitive hashing and product quantisation, usually paired with a similarity metric such as Euclidean distance or cosine similarity. Each technique improves performance in different ways, such as reducing memory resources or improving accuracy.
- Filtering: Filtering is a powerful search feature that leverages metadata to deliver a streamlined subset of results that precisely match your criteria. This not only boosts result relevance, but also reduces unnecessary query processing, making for a faster, more efficient search experience. Furthermore, metadata filters can be applied alongside the approximate nearest neighbour search itself to produce highly relevant recommendations (see the sketch after this list).
- Sharding and GPU Support: Vector databases frequently utilise GPUs or cloud-native tools, such as Kubernetes, to partition individual work units and assign dedicated resources to each one. These shards then operate concurrently to scan vectors and integrate their findings into a consolidated end result. Sharding can help large queries run quickly across vast amounts of data. For example, if you have 20 million vectors to search through, you can use 20 shards to get results in the time it takes to search 1 million vectors with only one shard.
- Replication: Vector databases use redundancy like having backup “workers” to handle multiple requests simultaneously. This redundancy ensures that even if some “workers” fail, the system remains highly available due to backup resources taking over. This approach is commonly used in cloud platforms to guarantee efficient resource allocation and maintain high availability.
- Language access: Vector queries can be posed via various programming languages and frameworks – from widely used languages such as SQL and Python to libraries like TensorFlow – or through intuitive, visual drag-and-drop interfaces found in business intelligence tools like Power BI and Tableau.
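To show how CRUD operations, similarity search and metadata filtering come together in practice, here is a minimal sketch using the open-source Chroma client; the collection name, documents and metadata fields are all invented for illustration, and other vector databases expose similar operations through their own APIs.

```python
import chromadb  # pip install chromadb

client = chromadb.Client()  # in-memory instance, fine for a demo
collection = client.create_collection(name="internal_docs")

# Create: store documents with metadata. Chroma embeds the text with its
# default embedding function unless you supply precomputed vectors.
collection.add(
    ids=["doc-1", "doc-2", "doc-3"],
    documents=[
        "Onboarding checklist for new engineers",
        "Incident response runbook for the payments service",
        "Leave policy and public holiday calendar",
    ],
    metadatas=[
        {"department": "engineering"},
        {"department": "engineering"},
        {"department": "hr"},
    ],
)

# Read: a similarity query restricted by metadata, so only engineering
# documents are considered when ranking by vector similarity.
results = collection.query(
    query_texts=["what do I do when payments go down?"],
    n_results=2,
    where={"department": "engineering"},
)
print(results["ids"], results["distances"])
```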
The Why: Advantages of Vector Databases
Some of the benefits of vector databases are:
- Efficient Handling of Large-Scale Datasets: Vector databases offer a robust solution for efficiently managing and processing large-scale datasets. Traditional databases might struggle to handle the sheer volume of data, leading to slower query times and reduced overall system performance. Vector databases, however, utilise specialised indexing techniques that are optimised for vectorised data. This allows them to handle massive datasets with ease, ensuring that queries are executed swiftly and resources are utilised efficiently. As a result, businesses can work with extensive data collections without compromising on speed or scalability.
- Leveraging Semantic Similarities for Accurate Resolution: Vector databases take advantage of the concept of semantic similarity, wherein entities that are conceptually or contextually similar are represented closer to each other in the vector space. This property enhances the accuracy of resolution and retrieval tasks. For example, in a search scenario, a vector database can identify items that are semantically similar to the query, even if the exact terms don’t match. This capacity to understand context and semantics can greatly improve the relevance and quality of search results, leading to more meaningful interactions for users and better decision-making for businesses.
- Accounting for Complex Relationships Between Entities: Entities in real-world scenarios often have intricate relationships that traditional databases might struggle to model and query effectively. Vector databases excel in capturing complex relationships between entities by representing them as vectors in a multi-dimensional space. This spatial representation enables the database to calculate distances, angles, and similarities between vectors, accurately reflecting the relationships between entities. This is particularly valuable in applications like recommendation systems, social network analysis, and graph databases, where understanding intricate connections is vital for generating meaningful insights.
Choosing The Right Vector Database
Choosing the right vector database for your use case is not a straightforward exercise, as new products and services keep arriving on the market.
However, we have a list of questions that can be used as guidance for the selection process:
- What is the nature of my data? Determine whether your data can be represented as vectors. Vectors are numerical representations of objects and sequences of information, for example time-series, so it’s important to understand whether your data can be effectively transformed into vector form. As noted earlier, vector databases are designed for unstructured data while still providing much of the functionality you’d expect from a traditional relational database, such as CRUD operations, data persistence and metadata filtering.
- What is the dimensionality of my data? Consider the number of dimensions in your vectors. Some vector databases perform better with lower-dimensional data, while others can handle high-dimensional vectors efficiently.
- What is the expected data size and growth rate? Understand the scale of your data and how it is expected to grow over time. This information will help you choose a vector database that can handle your data volume effectively.
- What are the indexing and search capabilities? Inquire about the indexing techniques and search algorithms used by the vector database. Different databases employ various indexing structures (e.g., k-d trees, product quantisation) and search algorithms (e.g., nearest neighbour search, range search), each with its own trade-offs in terms of efficiency and accuracy.
- What are the retrieval performance and latency? Determine the desired query performance and latency for your application. Depending on your use case, you may need a vector database that can provide real-time or near-real-time results.
- What is the scalability and throughput of the vector DB? Consider the scalability of the database to handle growing data volumes and increasing query loads. Ask about the ability to distribute the database across multiple nodes or clusters to achieve higher throughput.
- What integration options and APIs are available? Check the compatibility and integration options with your existing systems and programming languages. Look for APIs and libraries that are supported by the vector database to ensure seamless integration into your application stack.
- What is the support for updates and deletions? Determine whether your use case requires frequent updates or deletions of vectors. Some vector databases offer efficient mechanisms for modifying or removing vectors without compromising the overall performance.
- What are the deployment options and infrastructure requirements? Consider the deployment options available for the vector database, such as on-premises, cloud-based, or managed services. Assess the infrastructure requirements in terms of hardware, storage, and network to ensure compatibility with your environment.
- What is the cost and licensing model? Understand the pricing structure and licensing model of the vector database. Consider factors such as upfront costs, ongoing maintenance fees, and any additional charges for scaling or support services.
Questions and Answers
How does the ‘encoding’ of an input to a vector work using an LLM? Specifically using an LLM as an encoder rather than other vector encoding techniques.
A pre-trained language model encodes input text into fixed-size vectors by first tokenising the text and converting tokens into embeddings with positional information. It then processes these embeddings through a multi-layer bidirectional Transformer, producing contextualised representations that capture complex linguistic dependencies. The final hidden states are used as fixed-size vectors that can be stored in a vector database.
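As a sketch of that pipeline, the snippet below uses the Hugging Face transformers library with a small pretrained encoder (the specific checkpoint is only an example) and mean-pools the final hidden states into one fixed-size vector per text.

```python
import torch
from transformers import AutoTokenizer, AutoModel  # pip install transformers torch

# Any pretrained bidirectional encoder works; this checkpoint is just an example.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["Vector databases store embeddings.", "How do I reset my password?"]

# Tokenise the texts and run the bidirectional Transformer encoder.
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the contextualised token representations, ignoring padding tokens,
# to obtain one fixed-size vector per input text for storage in a vector database.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. (2, 384) for this model
```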
Do you have to manually separate the embeddings to avoid overlap of unrelated categories? Or is this done magically? Can it be observed and tested?
Embeddings inherently capture contextual information from the entire text, so unrelated categories generally do not need to be separated manually; semantically dissimilar content naturally occupies different regions of the vector space, and this can be observed with techniques such as clustering or similarity checks. Metadata can be associated with embeddings by linking them to documents, entities, or topics, enhancing their relevance for specific tasks. Evaluation of metadata-enhanced embeddings can be performed by assessing the pretrained model’s performance on downstream tasks, such as text classification or sentiment analysis.
Does metadata contribute to the similarity relative to another vector? Or is it creating a vector from the data, with metadata for filtering after retrieval?
Metadata can influence the similarity of vectors in two ways: by directly incorporating metadata into vector representations or by using metadata to filter results after retrieval. When metadata is integrated into vectors, it can impact similarity calculations by adjusting weights based on specific criteria. Alternatively, keeping metadata separate allows for post-retrieval filtering, enabling more precise control over how metadata affects result refinement. The choice depends on the application’s requirements and the desired balance between simplicity and flexibility.
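The two approaches can be sketched in a few lines of toy code; the vectors, categories and weighting below are invented purely to contrast baking metadata into the vector against filtering on it after retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.random((5, 4))                      # toy content embeddings
doc_metadata = ["blog", "blog", "policy", "blog", "policy"]
query_vector = rng.random(4)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Option 1: incorporate metadata into the vector by appending a weighted
# one-hot feature, so it directly influences the similarity score.
categories, weight = ["blog", "policy"], 0.5
def with_metadata(vec, category):
    one_hot = np.array([weight if c == category else 0.0 for c in categories])
    return np.concatenate([vec, one_hot])

option1 = [cosine(with_metadata(query_vector, "policy"), with_metadata(v, m))
           for v, m in zip(doc_vectors, doc_metadata)]

# Option 2: keep metadata separate, retrieve by content similarity only,
# then filter the candidates by metadata afterwards.
candidates = [(i, cosine(query_vector, v)) for i, v in enumerate(doc_vectors)]
option2 = max((c for c in candidates if doc_metadata[c[0]] == "policy"),
              key=lambda pair: pair[1])

print("Option 1 scores:", np.round(option1, 3))
print("Option 2 best policy document:", option2)
```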
Conclusion
Vector databases are the future of data management, offering efficient solutions for large datasets, semantic understanding, and complex relationships. Choosing the right vector database is crucial for unleashing the full potential of vector-driven insights. As we embrace these innovations, we embark on a data-driven journey with limitless possibilities.