In today’s blog, we discuss how Generative AI can unlock productivity in the enterprise. We have seen plenty of content built with Large Language Models (LLMs) in the content generation domain – think image and video generation, or article and marketing copy generation. Today, we will be leveraging LLMs to democratise SQL database understanding.
Democratising database understanding makes the insights contained in these databases accessible to a wider range of people within the enterprise, rather than just the professionals well versed in SQL.
These days, we hear enterprises promoting data-driven cultures as one of the strategies that enable enterprise-wide transformation.
Can improved data literacy really help in achieving this? Read on to understand how this simple Generative AI functionality can help with democratising data access and understanding beyond the typical business intelligence group and unlock productivity in the enterprise.
Example use case: e-commerce business
Sonia is the CFO of Candle Glow, an online business that sells all types of scented candles. She always wants to know how her business is doing and what she can do to continuously improve.
However, John, the business intelligence guy, isn’t always available. All the data Sonia needs to know about Candle Glow is contained in the database. If only she knew SQL, she could get her data insights herself, anytime she wants.
Natural Language to SQL
We all know how Generative AI has pervaded our lives, from image and video generation to article and content generation. Here, we will discuss how LLMs can also be used to understand databases and generate SQL queries.
Numerous libraries have been developed to help with these types of tasks, but for this blog we will be using LlamaIndex, a library that applies LLMs to many types of applications, including understanding databases and generating SQL queries from natural language.
Instead of always having to wait for John to be available, Sonia can now simply write her question in natural language (in English, or any other supported language); the question is translated into an SQL query, run against the database, and the results are translated back into natural language. Sonia will be able to get all the insights she needs to run and improve her business, without knowing a single line of SQL.
This is great news for Sonia when she needs to do some quick sales analysis – for example, she might want to know the effect of a recent marketing campaign on online sales. She can simply type something like the following:
- “How many new customers did we acquire from the campaign?”
- “How much revenue did the campaign generate?”
- “Which marketing channels performed best?”
Using LlamaIndex and Bedrock to understand your SQL Database
I’m sure by now you will have heard of Retrieval Augmented Generation (RAG), a Generative AI technique that lets you talk to your data, such as your documents, in natural language. Text2SQL is similar in spirit, but the method is quite different. Unlike RAG, we will not be using a vector database for this Text2SQL example. We will still use an indexing technique; however, it will be an in-memory index built over our table schema objects.
The Dataset
In this blog, we continue with Sonia’s e-commerce business, so we are using Candle Glow’s SQL transaction database, which contains six tables:
- Categories: the categories of scented candles
- Products: the products sold
- Users: the users who have purchased products
- Orders: the orders placed by users
- Order Items: the individual line items within each order
- Reviews: the reviews given by users for the products they purchased
The data is stored in a PostgreSQL database, chosen for its popularity and widespread use in many enterprises.
Many companies use a database much like Postgres to hold their transaction data. For production use, we don’t typically connect analytics tooling to the main database directly; common techniques include read replicas or database clones, which keep the workload isolated from the main database.
Setting up local Postgres DB
For this blog, I have set up the Postgres database to run in a local Docker container using the postgres:alpine image from Docker Hub.
The entity relationship diagram below shows the six tables and their relationships to each other.
LlamaIndex Orchestration
The following image shows how LlamaIndex orchestrates the different components to convert natural language questions into SQL queries:
1. The user sends a natural language question to LlamaIndex
2. LlamaIndex sends the natural language request to the LLM to extract the SQL query
3. LlamaIndex receives the SQL query from the LLM
4. LlamaIndex sends the SQL query to the database
5. The database executes the SQL query and returns the result to LlamaIndex
6. LlamaIndex sends the result to the LLM for response synthesis to natural language
7. The LLM sends the natural language response to LlamaIndex
8. LlamaIndex sends the natural language response to the user
The Method
This blog comes with an accompanying Jupyter notebook. Please feel free to clone or download it and follow at your own pace.
A word of warning for the less technical reader: the next several steps contain some Python. Feel free to just read each step’s description, or skip this section entirely.
1. Install the required libraries
We start off by installing all the required libraries. Note that apart from the LlamaIndex modules, we also need psycopg, for interfacing with Postgres databases, and SQLAlchemy, a Python SQL toolkit and ORM.
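As a rough guide, the install cell might look like the following. The exact package names are assumptions based on the modular LlamaIndex distribution; check the LlamaIndex documentation for the versions you are using.

```python
# Notebook cell: install LlamaIndex plus the Bedrock integrations,
# the psycopg Postgres driver and SQLAlchemy.
# Package names are assumptions for the modular LlamaIndex distribution.
%pip install llama-index llama-index-llms-bedrock llama-index-embeddings-bedrock
%pip install "psycopg[binary]" sqlalchemy python-dotenv
```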
2. Connection to database
After loading the environment variables, we instantiate the SQL engine using SQLAlchemy. This engine will be used later by the LlamaIndex calls; note that internally it uses the psycopg Postgres adapter.
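A minimal sketch of this step, assuming the connection details live in environment variables (the variable names and database name here are illustrative):

```python
import os

from sqlalchemy import create_engine

# Build the Postgres connection URL from environment variables.
# Variable names and the database name are illustrative assumptions.
db_url = (
    f"postgresql+psycopg://{os.environ['PGUSER']}:{os.environ['PGPASSWORD']}"
    f"@{os.environ.get('PGHOST', 'localhost')}:5432"
    f"/{os.environ.get('PGDATABASE', 'candle_glow')}"
)

# The 'postgresql+psycopg' dialect tells SQLAlchemy to use the psycopg 3 adapter
engine = create_engine(db_url)
```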
3. Testing the connection
To test that our database connection works, we run a couple of SELECT queries against the tables.
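Something like the following works as a sanity check (the lowercase table names are assumed from the schema above):

```python
from sqlalchemy import text

# Fetch a few rows from two of the tables to confirm the connection works
with engine.connect() as conn:
    for table in ("products", "orders"):
        rows = conn.execute(text(f"SELECT * FROM {table} LIMIT 3")).fetchall()
        print(table, rows)
```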
4. Setting up the LLM Bedrock configuration
In LlamaIndex, the Settings object is a global configuration interface for setting up LlamaIndex with your LLM, embedding model and other miscellaneous configurations. Here we are setting up both our LLM and the embedding model with the AWS environment variables we have loaded earlier.
Note that we are using Claude 3 Sonnet as the LLM, which handles SQL generation well, and the Cohere multilingual embedding model, which lets us ask questions in languages other than English.
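A sketch of this configuration is below. The model identifiers and region are assumptions – check the Bedrock console for the exact model IDs enabled in your account – and constructor arguments can vary between LlamaIndex versions.

```python
from llama_index.core import Settings
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding

# Model IDs and region are assumptions; use the IDs enabled in your account
Settings.llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="ap-southeast-2",
)
Settings.embed_model = BedrockEmbedding(
    model_name="cohere.embed-multilingual-v3",
    region_name="ap-southeast-2",
)
```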
5. Setting up the Database Schema index
The SQLDatabase object is a wrapper around SQLAlchemy, and this is what LlamaIndex uses to actually interact with the database. Here we are setting it up with the tables we want to include in the system – in our case, all six Candle Glow tables.
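Setting it up looks roughly like this (table names are assumed lowercase, matching the schema above):

```python
from llama_index.core import SQLDatabase

# Wrap the SQLAlchemy engine; include_tables limits what the query engine can see
tables = ["categories", "products", "users", "orders", "order_items", "reviews"]
sql_database = SQLDatabase(engine, include_tables=tables)
```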
6. Building the table schema index and setting up the retriever
Here we are creating an internal index of the SQL tables using ObjectIndex. This is critical because during query-time, LlamaIndex needs to know what tables it needs to query to satisfy the user’s question.
In this blog we have a small database of only six tables, but in a production setting it is not uncommon to find systems with dozens or even hundreds of tables. So instead of ingesting all the table schemas into the prompt, we build an index of table schemas, using the table names and descriptions we prepared earlier.
Finally, an SQLTableRetrieverQueryEngine is instantiated using the SQLDatabase and the table schema index that we just created.
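A sketch of this step, following the standard LlamaIndex pattern (the table descriptions here are illustrative):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.indices.struct_store.sql_query import (
    SQLTableRetrieverQueryEngine,
)
from llama_index.core.objects import (
    ObjectIndex,
    SQLTableNodeMapping,
    SQLTableSchema,
)

# One SQLTableSchema per table; context_str carries the description that
# the retriever matches against (descriptions are illustrative)
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name=name, context_str=desc)
    for name, desc in [
        ("categories", "Categories of scented candles"),
        ("products", "Products sold by Candle Glow"),
        ("users", "Users who have purchased products"),
        ("orders", "Orders placed by users"),
        ("order_items", "Line items belonging to each order"),
        ("reviews", "Product reviews written by users"),
    ]
]

# Embed the table schemas so only the relevant ones are retrieved at query time
obj_index = ObjectIndex.from_objects(
    table_schema_objs, table_node_mapping, VectorStoreIndex
)
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=3)
)
```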
7. Formatting the results output
What’s the use of our results if we cannot read them easily? This step adds a small convenience function for formatting the results. We use the Markdown format because it is plain text and widely supported, including in this notebook.
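One simple way to do this in a notebook:

```python
from IPython.display import Markdown, display

def show_markdown(text: str) -> None:
    """Render a string as Markdown in the notebook for easier reading."""
    display(Markdown(text))
```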
8. We are ready to question the database in natural language
After going through all that, we are now ready to talk to our SQL database in natural language. Here we create the method text_to_sql, which receives a natural language question, fires it off to the query engine we set up earlier, and formats the response for us.
In addition to returning the natural language response, we also display the SQL statement that LlamaIndex translated the question into and used to query the database.
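A minimal sketch of such a method, assuming the query_engine and show_markdown defined above:

```python
def text_to_sql(question: str) -> None:
    # Ask the database a question in natural language (minimal sketch)
    response = query_engine.query(question)
    show_markdown(f"**Answer:** {response.response}")
    # The generated SQL is exposed in the response metadata
    print(response.metadata["sql_query"])

text_to_sql("How much revenue did the campaign generate?")
```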
Sample Invocation Output
Here are a few sample questions and answers that show the capability of this Text2SQL system. LlamaIndex, together with a very capable LLM in Claude 3, can easily connect to your SQL database and enable you to get insights using nothing but natural language.
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
The system can even handle questions asked in another language. Here, let us try Filipino, asking the equivalent of the following question:
“Calculate the average order value for each user (firstname and lastname). Just give me the top 10 users.”
The LLM can answer correctly without issues.
Conclusion
In this blog, we have used Generative AI to democratise database understanding and open the dataset to a wider range of people within the enterprise. What was once the domain of data scientists and database administrators can now be accessed by anyone using natural language.
This is but a simple example of how Generative AI can be used to simplify complex tasks and make your data more accessible to a wider audience, promoting a data-driven culture in your organisation.
Cevo Australia has extensive expertise in building systems, including Generative AI projects. We can also develop a more enterprise-ready Generative AI assistant with Text2SQL capability using Amazon Q Business – a solution that offers a more robust and scalable option for you.