Business Analytics with LangChain and LLMs


GENERATIVE AI

Image by the author (generated via Midjourney)

Many businesses store a great deal of proprietary data in their databases. A virtual agent that understands human language and can query those databases opens up big opportunities for them. Customer service chatbots are a common example: they take a customer's request, query the database for the relevant information, and give the customer what they need.

The benefit of such agents is not limited to external customer interactions. Many business owners and employees, even in tech companies, may not know SQL or similar query languages, yet they still need to pull information from a database. That's where frameworks like LangChain come in. They make it easy to build these helpful agents/applications: agents that can talk to humans and, at the same time, talk to databases, APIs, and more.

LLM-backed Applications

LangChain is an open-source framework for building interactive applications using Large Language Models (LLMs). It's a tool that helps LLMs connect with other sources of information and lets them talk to the world around them. One important concept in such frameworks is the Chain. Let's take a look at this concept.

What are Chains?

Chains combine LLMs with other tools to perform more complicated tasks. Specifically, a chain is an interface that uses a sequence of LLM calls together with other tools, such as SQL databases, API calls, bash operators, or math calculators, to complete a complex job. For example, our application might receive input from a user and pass it to an LLM; the LLM then calls an API, the API responds, and the LLM uses that response to perform the next task, and so on. It is a chain of inputs and outputs in which LLMs handle many of the steps.
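
To make the idea concrete, here is a minimal sketch of a chain (assuming your OpenAI API key is already available as an environment variable). It simply wires a prompt template to an LLM; the prompt text and input are placeholders, not part of the application we build below.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A toy chain: format the user's text into a prompt, send it to the LLM, return the output
prompt = PromptTemplate.from_template("Rephrase this request as a single question: {request}")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
simple_chain = LLMChain(llm=llm, prompt=prompt)

print(simple_chain.run("average film length in 2006"))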

Getting Our Hands Dirty

Now it's time to get our hands dirty and start coding a simple LLM-backed application. For this application, we are going to make a simple Q&A agent that takes our question and queries a SQL database to find the answer for us.

Set Up A PostgreSQL Sample Database

We use the DVD Rental Sample Database from postgresqltutorial.com (License Information). To install the database on your local system, you need to have PostgreSQL installed. Simply type psql in your terminal to see if it runs.

Then you can follow the instructions here; below, I walk through the database installation quickly.

In your terminal launch PostgreSQL via:

psql -U <username>

Substitute <username> with your actual username. The system will then ask for your password; enter it and you are in your Postgres shell.

First, we should create a database to load all the tables.

postgres=# CREATE DATABASE dvdrental;

After creating the database, you can check it by typing the \list command; you should see your dvdrental database in the returned list. Then simply exit Postgres (via \q).

Now we need to download the tables and data. You can download everything as a tar file here: Download DVD Rental Sample Database

Go to the folder where you downloaded the tar file; using pg_restore, we can load the tables into our Postgres database.

pg_restore -U postgres -d dvdrental dvdrental.tar

Before going further, let's check that the tables are loaded. Launch Postgres via psql -U <username> and enter your password.

Then type the following commands to list tables inside the dvdrental database.

postgres=# \c dvdrental
dvdrental=# \dt

You should see a list of tables like this:

             List of relations
 Schema |     Name      | Type  |  Owner
--------+---------------+-------+----------
 public | actor         | table | postgres
 public | address       | table | postgres
 public | category      | table | postgres
 public | city          | table | postgres
 public | country       | table | postgres
 public | customer      | table | postgres
 public | film          | table | postgres
 public | film_actor    | table | postgres
 public | film_category | table | postgres
 public | inventory     | table | postgres
 public | language      | table | postgres
 public | payment       | table | postgres
 public | rental        | table | postgres
 public | staff         | table | postgres
 public | store         | table | postgres

Set Up .env File

After building the sample database, we need to create a .env file. We use this file to store our secrets and keys as well as environment variables. Of course, we could put all that information directly in our code, but we should follow good engineering practice even for small projects: an accidental git push can expose our API keys and secrets to the public. Here is our .env file:

OPENAI_API_KEY = 'YOUR-OPENAI-API-KEY'
PSQL_HOST = 'localhost'
PSQL_PORT = 'YOUR-POSTGRES-PORT'
PSQL_USERNAME = 'YOUR-POSTGRES-USERNAME'
PSQL_PASSWORD = 'YOUR-POSTGRES-PASSWORD'
PSQL_DB = 'dvdrental'

As you can see, you need your own OpenAI API key for this test. You can follow the instructions here to obtain an API key from OpenAI.

Creating a LangChain Application

After saving your .env file in your project folder, we can start the actual code. First, we install the required packages.

pip install langchain langchain-experimental openai psycopg2 python-dotenv

For this tutorial, I recommend using a Jupyter Notebook to test it step by step.

We need to import the required libraries at the beginning.

from os import environ
from dotenv import load_dotenv

from langchain.chat_models import ChatOpenAI
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

load_dotenv()

Using load_dotenv(), we load the environment variables that we defined in the .env file. Now we can safely access them in the code.

For our chain, we need a language model with chat capabilities. For simplicity, we chose gpt-3.5-turbo from OpenAI, which is accessible through the API. You can use any other public or private model.

# Instantiating Our LLM

OPENAI_API_KEY = environ.get("OPENAI_API_KEY")
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, openai_api_key=OPENAI_API_KEY)

In addition to a model, we need a SQL database connection. That connection enables our chain to query the database and get the results back. As mentioned before, we use a PostgreSQL database, but according to LangChain, any SQL dialect supported by SQLAlchemy should work just as easily (e.g., MySQL, Presto, Databricks).

# Database Setup

PSQL_HOST = environ.get("PSQL_HOST")
PSQL_PORT = environ.get("PSQL_PORT")
PSQL_USERNAME = environ.get("PSQL_USERNAME")
PSQL_PASSWORD = environ.get("PSQL_PASSWORD")
PSQL_DB = environ.get("PSQL_DB")
psql_uri = f"postgresql+psycopg2://{PSQL_USERNAME}:{PSQL_PASSWORD}@{PSQL_HOST}:{PSQL_PORT}/{PSQL_DB}"

db = SQLDatabase.from_uri(psql_uri, sample_rows_in_table_info=3)
db.get_usable_table_names()

In case you are wondering about the sample_rows_in_table_info parameter that we used in our SQL database connection: Rajkumar et al. showed in their paper (https://arxiv.org/abs/2204.00498) that including a few sample rows from each table improves the model's performance at writing queries against that data. In LangChain, you can simply set sample_rows_in_table_info to determine the number of sample rows from each table that will be appended to each table description.
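
If you are curious what this looks like in practice, you can print the generated description for a single table; with sample_rows_in_table_info=3 it contains the CREATE TABLE statement followed by three example rows (shown here for the film table as an illustration).

# Inspect the table description (schema + 3 sample rows) that will be appended to the prompt
print(db.get_table_info(table_names=["film"]))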

To test the SQL database connection, I printed the list of available tables using db.get_usable_table_names(). It should return the following list of tables for you too.

['actor',
 'address',
 'category',
 'city',
 'country',
 'customer',
 'film',
 'film_actor',
 'film_category',
 'inventory',
 'language',
 'payment',
 'rental',
 'staff',
 'store']

The next step is the most important one. With our LLM and our SQL database connection, we can now instantiate the chain.

We use LangChain's SQL chain to take a human question, convert it to a SQL query, run it on the database, and retrieve the results.

# Defining Our SQL Database Chain

db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)

Finally, we use the run() method to pass our input/question to the chain and get the final response back.

# Define a Function To Run the Chain with Our Query

def query(input: str) -> str:
    response = db_chain.run(input)
    return response

Now it is time to test our code. We start with a simple query that requires querying a single table.

# Example 1:

query("What is the average length of films released in 2006")

The chain takes our human-style question and first converts it into a SQL query. Then, using the SQL database connection, it runs that query on the database. The returned result provides context for the LLM and, together with the original question, produces the final response. In this case, you can see the output:

> Entering new SQLDatabaseChain chain...
What is the average length of films released in 2006
SQLQuery:SELECT AVG(length) FROM film WHERE release_year = 2006
SQLResult: [(Decimal('115.2720000000000000'),)]
Answer:The average length of films released in 2006 is 115.272 minutes.
> Finished chain.
'The average length of films released in 2006 is 115.272 minutes.'

First, since verbose=True (set where we defined db_chain), we get extra information about the constructed SQL query and the returned SQL result. If you turn verbose off, you only see the final answer (the quoted string on the last line). The final answer correctly states that the average length of films released in 2006 is 115.272 minutes (you can verify it with your own SQL query on Postgres).
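
If you want to double-check this yourself, one quick way (a small sketch that reuses the db connection we already created) is to run the generated SQL directly:

# Sanity check: run the generated SQL query directly against the database
print(db.run("SELECT AVG(length) FROM film WHERE release_year = 2006;"))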

The second question is a little bit more complicated and requires joining three tables.

# Example 2:

query("Which actor played in movies with total longer duration? And how much was the duration?")

The question asks for the actor whose films have the greatest combined running time. Answering it requires joining three tables: actor, film_actor, and film. Here is the chain's response:

> Entering new SQLDatabaseChain chain...
Which actor played in movies with total longer duration? And how much was the duration?
SQLQuery:SELECT a.first_name, a.last_name, SUM(f.length) AS total_duration
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
JOIN film f ON fa.film_id = f.film_id
GROUP BY a.actor_id
ORDER BY total_duration DESC
LIMIT 1;
SQLResult: [('Mary', 'Keitel', 4962)]
Answer:Mary Keitel played in movies with a total duration of 4962 minutes.
> Finished chain.
'Mary Keitel played in movies with a total duration of 4962 minutes.'

As you can see, our super-simple LangChain application did a good job of understanding the relationships between these tables and constructing the SQL query.

Summary

In this article, we introduced a powerful open-source tool called LangChain that enables us to build LLM-based applications.

Then we used SQLDatabaseChain to build an application that queries a SQL database based on user questions and returns the results.

This simple Q&A application can be expanded into a more complex business analytics assistant that anyone inside a business can use daily to get the latest insights from the proprietary data.

Tags: Artificial Intelligence ChatGPT Data Science Machine Learning Technology
