Maximizing AI Efficiency in Production with Caching: A Cost-Efficient Performance Booster



Introduction

Despite the transformative potential of AI applications, approximately 70% never make it to production. The obstacles? Cost, performance, security, flexibility, and maintainability. In this article, we address two of the most critical challenges, escalating costs and the need for high performance, and show how a caching strategy for AI applications tackles both.

Photo by Possessed Photography on Unsplash

The Cost Challenge: When Scale Meets Expense

Running AI models, especially at scale, can be prohibitively expensive. Take, for example, the GPT-4 model, which costs $30 for processing 1M input tokens and $60 for 1M output tokens. These figures can quickly add up, making widespread adoption a financial challenge for many projects.

To put this into perspective, consider a customer service chatbot that processes an average of 50,000 user queries daily, with each query and its response averaging about 50 tokens apiece. That amounts to 2,500,000 input tokens and 2,500,000 output tokens per day, or roughly 75 million of each per month. At GPT-4's pricing, the chatbot's owner would face about $2,250 in input token costs and $4,500 in output token costs monthly, totaling $6,750 just for processing user queries; the short calculation below reproduces these figures. And what if your application is a huge success and you receive 500,000 or even 5 million user queries per day?
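As a quick sanity check, here is a back-of-the-envelope reproduction of those numbers. The query volume, token counts, and prices are the assumptions stated above, not measured values:

# Back-of-the-envelope cost estimate for the example chatbot (assumed figures)
QUERIES_PER_DAY = 50_000
INPUT_TOKENS_PER_QUERY = 50     # assumed average input tokens per query
OUTPUT_TOKENS_PER_QUERY = 50    # assumed average output tokens per response
DAYS_PER_MONTH = 30

PRICE_PER_1M_INPUT = 30.0       # GPT-4, USD per 1M input tokens
PRICE_PER_1M_OUTPUT = 60.0      # GPT-4, USD per 1M output tokens

monthly_input_tokens = QUERIES_PER_DAY * INPUT_TOKENS_PER_QUERY * DAYS_PER_MONTH    # 75,000,000
monthly_output_tokens = QUERIES_PER_DAY * OUTPUT_TOKENS_PER_QUERY * DAYS_PER_MONTH  # 75,000,000

input_cost = monthly_input_tokens / 1_000_000 * PRICE_PER_1M_INPUT     # $2,250
output_cost = monthly_output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT  # $4,500
print(f"Estimated monthly cost: ${input_cost + output_cost:,.0f}")     # $6,750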

The Performance Paradigm: Real-Time Responses

Today's users expect immediate gratification – a demand that traditional Machine Learning and deep learning approaches struggle to meet. The arrival of Generative AI promises near-real-time responses, transforming user interactions into seamless experiences. But sometimes generative AI may not be fast enough.

Consider the same AI-driven chatbot service for customer support, designed to provide instant responses to customer inquiries. Without caching, each query is processed in real time, leading to delays of several seconds to a few minutes during peak hours or when complex queries require deeper AI analysis. Impatient users will simply close the application and look for alternatives.

The Magic of Caching

Caching, in the context of AI applications, involves saving prompts and responses of large language models (LLMs) for future reuse. This strategy not only reduces response times but also significantly cuts down on costs associated with repetitive model invocations.

Why you should use caching in your AI application

Now, imagine implementing caching in the same chatbot: common queries about product recommendations, store hours, return policies, or shipping costs are cached after their Retrieval Augmented Generation (RAG) process. When a new or returning customer asks a previously answered question, the chatbot instantly retrieves the answer from the cache. This avoids the cost of reprocessing the query through the LLM with the retrieved context and cuts the response time from potentially several seconds to near-instant.

Exploring Advanced Caching Techniques for AI Applications with LangChain

After examining the challenges and introducing caching as a compelling solution for AI efficiency, let's dive into the different caching methodologies that can further optimize performance and cost-efficiency.

However, caching is not a one-size-fits-all solution. Its effectiveness depends on the method applied, particularly in the context of AI and natural language processing. Here, we analyze two primary caching techniques:

Standard Caching: The Basics

Standard caching involves storing and checking for an exact match between a user query and the information in the cache store. It's a straightforward approach but can sometimes fall short in handling the nuances of natural language.

Example: If one user asks, "What are the ingredients to make a pizza?" and another asks, "What are the ingredients to make pizzas?", standard caching may treat these as distinct requests despite the slight variation in phrasing, missing an opportunity for efficiency. A minimal sketch of this exact-match behavior follows.
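To make the limitation concrete, here is a minimal, illustrative sketch of an exact-match cache. The dictionary store and the call_llm callable are hypothetical stand-ins, not part of LangChain:

# Minimal exact-match (standard) cache -- illustrative sketch only
standard_cache = {}

def cached_answer(prompt, call_llm):
    # A standard cache keys on the exact prompt string
    if prompt in standard_cache:
        return standard_cache[prompt]   # cache hit: no LLM call, no token cost
    response = call_llm(prompt)         # cache miss: pay for the LLM call
    standard_cache[prompt] = response
    return response

# "...make a pizza?" and "...make pizzas?" are different keys, so the second
# query misses the cache even though the intent is the same.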

Semantic Caching: Beyond Exact Matches

Semantic caching elevates the caching mechanism by identifying and retrieving information based on semantically similar sentences to the user's query. This approach is particularly suited to natural language contexts, where variations in phrasing are common.

Example: In the pizza ingredient queries mentioned above, semantic caching recognizes the similarity in intent between "What are the ingredients to make a pizza?" and "What are the ingredients to make pizzas?", providing the same cached response to both and thus enhancing efficiency. A sketch of this similarity-based lookup follows.
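The sketch below illustrates the idea behind a semantic cache: queries are embedded as vectors, and a cached entry is reused when its cosine similarity to the new query exceeds a threshold. The embed and call_llm callables and the 0.9 threshold are illustrative assumptions:

# Minimal semantic cache -- illustrative sketch only
import numpy as np

semantic_cache = []  # list of (embedding_vector, response) pairs

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantically_cached_answer(prompt, embed, call_llm, threshold=0.9):
    query_vec = embed(prompt)                       # embed the incoming query
    for cached_vec, cached_response in semantic_cache:
        if cosine(query_vec, cached_vec) >= threshold:
            return cached_response                  # semantically similar: reuse the answer
    response = call_llm(prompt)                     # no similar entry: call the LLM
    semantic_cache.append((query_vec, response))
    return response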

Comparison of Standard Cache and Semantic Cache

Caching Technologies in the LangChain Framework

The LangChain framework supports several caching technologies, each suited to different application needs and scales.

In-Memory Caching

A temporary storage solution that uses the computer's RAM. It's quick to implement and access but is limited by available memory and is not ideal for production environments due to volatility and scalability concerns.

Database as Cache

Databases can serve as robust caching solutions, offering persistence and scalability. Here are a few options:

  • MongoDB as Standard Cache and Semantic Cache: MongoDB merges structured, unstructured, and AI data for easy, advanced AI app development. Its architecture simplifies data handling, speeding up project delivery. With rich data types (text, numeric, geospatial, time series, and more), integrated AI analysis, broad AI tool compatibility, and multi-cloud support, MongoDB ensures flexibility and rapid innovation.
  • SQLite as Standard Cache: A lightweight, disk-based database, perfect for smaller-scale applications where simplicity and minimal setup are key (a short setup sketch follows this list).
  • Cassandra as Standard and Semantic Cache: Known for its scalability and performance, Cassandra is ideal for applications requiring high availability and large-scale distribution.
  • Astra DB as Standard and Semantic Cache: A cloud-native Cassandra-as-a-Service, Astra DB offers scalability and simplicity in managing Cassandra databases, including caching functionalities.
  • Azure Cosmos DB as Semantic Cache: A globally distributed, multi-model database service available only on Azure; its turnkey global distribution makes it suitable for complex caching requirements, including semantic caching.
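As an example of how lightweight a database-backed standard cache can be, here is a minimal sketch that registers SQLite as LangChain's LLM cache; the database path is an arbitrary placeholder:

from langchain_community.cache import SQLiteCache
from langchain_core.globals import set_llm_cache

# Persist exact-match prompt/response pairs in a local SQLite file
set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))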

External Caching Systems

For applications demanding high performance and scalability, external caching systems offer specialized solutions. Be aware, though, that they are not your average setup: they add extra components to deploy and operate alongside your application's stack:

  • Redis as Standard and Semantic Cache: An in-memory data structure store used as a database, cache, and message broker, Redis is known for its speed and supports a variety of data structures (see the configuration sketch after this list).
  • GPTCache as Standard and Semantic Cache: Tailored specifically for caching responses from Generative Pre-trained Transformers, optimizing cost and response times in AI-driven applications.
  • Momento as Standard Cache: A fully managed caching service that simplifies caching infrastructure, offering scalability and ease of use without extensive caching expertise.
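To illustrate how one of these systems plugs into LangChain, the snippet below sketches how Redis can be registered as either a standard or a semantic cache through LangChain's community integrations. The Redis URL is a placeholder, and exact import paths may vary between LangChain versions:

import redis
from langchain_community.cache import RedisCache, RedisSemanticCache
from langchain_core.globals import set_llm_cache
from langchain_openai import OpenAIEmbeddings

REDIS_URL = "redis://localhost:6379"  # placeholder connection string

# Option 1: standard (exact-match) cache backed by Redis
set_llm_cache(RedisCache(redis_=redis.Redis.from_url(REDIS_URL)))

# Option 2: semantic cache backed by Redis, matching on embedding similarity
set_llm_cache(RedisSemanticCache(
    redis_url=REDIS_URL,
    embedding=OpenAIEmbeddings(),
    score_threshold=0.2,  # similarity threshold for treating queries as equivalent
))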

Incorporating these caching techniques and technologies within the LangChain framework can dramatically enhance the efficiency, speed, and cost-effectiveness of AI applications. As we progress, the focus will shift towards leveraging In-Memory caching and MongoDB for caching, showing the importance of selecting the right tool for your application's specific needs and scale.

Implementing In-Memory Caching with LangChain

As explained above, in-memory caching is one of the easiest caching techniques. The following code snippet demonstrates a practical implementation of in-memory caching using the LangChain library with OpenAI's GPT-3.5-turbo model.

Setting Up the Environment

The setup involves importing necessary modules from LangChain, including support for the OpenAI model and the in-memory caching mechanism. By initializing the InMemoryCache, we prepare our environment to store and retrieve responses from the cache, reducing the need for repeated calls to the API for identical queries.

import langchain
import time
import os
from langchain.llms import OpenAI
from langchain.cache import InMemoryCache
from langchain.callbacks import get_openai_callback

openai_api_key = os.environ["OPENAI_API_KEY"]  # API key read from the environment

# Register an in-memory cache globally: identical prompts are answered from RAM
langchain.llm_cache = InMemoryCache()

Query Processing and Caching

The core of this implementation lies in processing queries through the LLM with caching enabled. The process is straightforward:

  1. A question is defined and passed to the model for processing.
  2. The get_openai_callback() context manager reports token usage and cost, while time.time() measures execution time, showcasing the efficiency of caching.
  3. The first call processes the query through GPT-3.5-turbo, fetching the response and storing it in the cache.
  4. Subsequent calls with identical or similar queries retrieve responses directly from the cache, significantly reducing response times and API costs.

llm = OpenAI(model="gpt-3.5-turbo-instruct")
question = "What are the ingredients to cook a pizza?"

with get_openai_callback() as cb:
    start = time.time()
    result = llm(question)
    end = time.time()
    print(result)
    print("--- cb")
    print(str(cb) + f"({end - start:.2f} seconds)")

with get_openai_callback() as cb2:
    start = time.time()
    result2 = llm("What are the ingredients to cook pizzas?")
    end = time.time()
    print(result2)
    print("--- cb2")
    print(str(cb2) + f"({end - start:.2f} seconds)")

with get_openai_callback() as cb3:
    start = time.time()
    result3 = llm(question)
    end = time.time()
    print(result3)
    print("--- cb2")
    print(str(cb3) + f"({end - start:.2f} seconds)")

Observations and Results

The results demonstrate the effectiveness of in-memory caching:

  • The first execution takes approximately 1.24 seconds, with a detailed breakdown of tokens used and the cost associated with processing the query. This initial call incurs a cost due to API usage.
  • The second execution processes a slightly different but very similar query, yet it still goes through the API, taking about 1.16 seconds; because standard caching requires an exact match, it does not treat this as the same query.
  • The third execution revisits the original question, and due to the caching mechanism, it retrieves the answer instantly with negligible processing time (0.00 seconds), showcasing the caching efficiency.

This implementation clearly illustrates the significant effect of in-memory caching on performance and cost efficiency. Note, however, that standard caching is employed here: for questions that are similar but not identical, we still make API calls, even though such queries would likely yield similar responses.

Integrating MongoDB for Advanced Caching in AI Applications

Why MongoDB for AI Applications?

MongoDB stands out as a robust platform for AI applications, offering a wide range of features that cater to the complex needs of data integration, management, and retrieval.

Its capability to streamline the integration of vector and operational data makes it invaluable for machine learning and NLP tasks.

MongoDB Atlas, a fully managed developer platform, alleviates the burdens of database management with automated scaling, backups, and monitoring, enabling developers to concentrate on advancing AI functionalities.

The document-oriented schema of MongoDB provides the flexibility required to accommodate the dynamic nature of AI data, simplifying adaptation to changing data needs without the need for complex schema alterations.

MongoDB's efficient search capabilities and dedicated vector search nodes ensure swift data retrieval, crucial for effective caching and scalable AI search functions.

With support for multi-cloud across major platforms like AWS, Azure, and GCP, MongoDB ensures scalability and resilience.

Moreover, MongoDB Atlas is acclaimed for its comprehensive security features, offering robust data protection for AI applications.

The following section outlines the implementation steps and the inherent benefits of incorporating MongoDB Cache with the LangChain framework to boost AI performance.

MongoDB Standard Caching Implementation

The provided code snippet demonstrates a seamless integration of MongoDB with the LangChain framework for Standard caching purposes. Here's a breakdown of the essential steps involved in setting up MongoDB for caching AI queries and responses:

  1. Environment Setup:

The MongoDB Atlas Cluster URI is securely stored and accessed via environment variables, ensuring sensitive information remains protected.

import os
import time
from pymongo import MongoClient
from langchain.llms import OpenAI
from langchain_mongodb.cache import MongoDBCache
from langchain_core.globals import set_llm_cache
from langchain.callbacks import get_openai_callback

openai_api_key = os.environ["OPENAI_API_KEY"]  # API key read from the environment
llm = OpenAI(model="gpt-3.5-turbo-instruct")

2. MongoDB Client Initialization:

A MongoDB client is instantiated using the MongoDB Atlas Cluster URI, establishing a connection to the database server.

Cache Configuration: The MongoDBCache class is configured with the MongoDB connection details, including the database name (langchain_db) and collection name (langchain_cache). This setup specifies where the cached data will be stored and retrieved within the MongoDB database.

# Initialize the MongoDB Python client from the Atlas cluster URI
MONGODB_ATLAS_CLUSTER_URI = os.getenv('MONGO_URI')
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

COLLECTION_NAME = "langchain_cache"
DATABASE_NAME = "langchain_db"
question = "What are the ingredients to cook a pizza?"

3. Caching Logic:

The set_llm_cache function from the langchain_core.globals module integrates the MongoDBCache as the Standard Caching mechanism for the LangChain framework. This integration allows for the automatic caching of queries and responses generated by the AI model, in this case, OpenAI's GPT-3.5-turbo-instruct model.

set_llm_cache(MongoDBCache(
    connection_string=MONGODB_ATLAS_CLUSTER_URI,
    collection_name=COLLECTION_NAME,
    database_name=DATABASE_NAME,
))

4. Query Processing:

When a query is processed through the llm function (representing the AI model), the system first checks the MongoDB cache for an existing response. If a cached response is found, it's returned immediately, bypassing the need to invoke the AI model again, which saves processing time and API call costs.

5. Performance Measurement:

The get_openai_callback wrapper reports the token usage and cost of each query, while simple timing around each call illustrates the efficiency gains from caching. Subsequent identical queries show a notable reduction in response time because their responses are retrieved from the cache.
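The query-and-measurement loop itself is not reproduced above, so here is a minimal sketch mirroring the in-memory example, reusing the llm, question, and get_openai_callback objects already defined; the similar query string is the same assumption as before:

# Same measurement pattern as the in-memory example, now backed by MongoDBCache
with get_openai_callback() as cb:
    start = time.time()
    result = llm(question)  # first call: cache miss, goes to the API
    end = time.time()
    print(result)
    print(str(cb) + f" ({end - start:.2f} seconds)")

with get_openai_callback() as cb2:
    start = time.time()
    result2 = llm("What are the ingredients to cook pizzas?")  # similar query: still an API call
    end = time.time()
    print(result2)
    print(str(cb2) + f" ({end - start:.2f} seconds)")

with get_openai_callback() as cb3:
    start = time.time()
    result3 = llm(question)  # exact repeat: served from the MongoDB cache
    end = time.time()
    print(result3)
    print(str(cb3) + f" ({end - start:.2f} seconds)")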

Observations and Results

As shown in the following figure:

The first run incurs costs and token usage, highlighting the expense of initial API calls.

The second attempt, despite a slightly altered query, still triggers an API call because standard caching requires an exact match, so the potential caching benefit is missed.

The third run, however, resends the initial question and leverages the standard cache, delivering the answer almost instantaneously (0.00 seconds), thus illustrating the efficiency of the caching mechanism.

Result of Standard Caching in MongoDB

MongoDB Semantic Caching Implementation

The implementation of MongoDB Atlas Semantic Cache in AI-driven applications represents a significant leap toward optimizing response accuracy and speed. This approach leverages the power of vector embeddings and MongoDB's robust database management capabilities to create a semantic caching layer.

The provided code snippet and its outcomes serve as a practical illustration of this advanced caching mechanism.

  1. Initialization of OpenAI and MongoDB Client

The process begins with the initialization of the OpenAI model and the MongoDB client, connecting to the MongoDB Atlas Cluster URI. This setup forms the backbone for integrating the AI model's responses with the MongoDB caching system.

2. Semantic Caching with Vector Embeddings

By utilizing OpenAIEmbeddings, the implementation generates vector embeddings for textual inputs, which are crucial for semantic analysis. These embeddings capture the nuanced meanings of queries, allowing for a more sophisticated comparison than traditional text matching.

import os
import time
from pymongo import MongoClient
from langchain_community.llms import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_core.globals import set_llm_cache
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain.callbacks import get_openai_callback

openai_api_key = os.environ["OPENAI_API_KEY"]  # API key read from the environment
llm = OpenAI(model="gpt-3.5-turbo-instruct")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)

# Initialize the MongoDB Python client from the Atlas cluster URI
MONGODB_ATLAS_CLUSTER_URI = os.getenv('MONGO_URI')
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)

COLLECTION_NAME = "langchain_Semantic_cache"
DATABASE_NAME = "langchain_db"
INDEX_NAME = "vector_index"

3. MongoDB Atlas Semantic Cache Configuration

The MongoDBAtlasSemanticCache class is configured with essential parameters, including the connection string, collection name, database name, and specifically, the index name that corresponds to the vector embeddings index as shown below.

This configuration is vital for enabling semantic caching, where the similarity between query embeddings dictates cache hits.

set_llm_cache(MongoDBAtlasSemanticCache(
    embedding=embeddings,
    connection_string=MONGODB_ATLAS_CLUSTER_URI,
    collection_name=COLLECTION_NAME,
    database_name=DATABASE_NAME,
    index_name=INDEX_NAME,
    wait_until_ready=True # Optional, waits until the cache is ready to be used
))

4. MongoDB Atlas Index Definition for Semantic Matching

The index configuration, detailing fields like numDimensions and path, specifies how vector embeddings are stored and searched within MongoDB. The similarity setting as "cosine" ensures that the search mechanism understands and utilizes the semantic proximity between vectors to find the best match.

{
  "fields": [
    {
      "numDimensions": 1024,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "llm_string",
      "type": "filter"
    }
  ]
}

5. Query Processing and Caching

Queries such as "How to make a pizza?" are processed through the AI model, and their embeddings are generated. The MongoDB Atlas Semantic Cache checks for existing similar queries such as "What are the ingredients to cook pizzas" based on these embeddings. If a match is found, the cached response is retrieved, significantly speeding up the response time while using zero additional tokens from the AI model.

question="How to make a pizza ?"
similar_question="What are the ingredients to cook pizzas"

with get_openai_callback() as cb:
    start = time.time()
    result = llm(question)
    end = time.time()
    print(result)
    print("--- cb")
    print(str(cb) + f"({end - start:.2f} seconds)")
time.sleep(5)

with get_openai_callback() as cb2:
     start = time.time()
     result2 = llm(similar_question)
     end = time.time()
     print(result2)
     print("--- cb2")
     print(str(cb2) + f"({end - start:.2f} seconds)")
time.sleep(5)

with get_openai_callback() as cb3:
     start = time.time()
     result3 = llm(question)
     end = time.time()
     print(result3)
     print("--- cb3")
     print(str(cb3) + f"({end - start:.2f} seconds)")

6. Demonstrated Efficiency

The execution times, as shown in the callback outputs, illustrate the efficiency gains from using semantic caching.

  • The initial query, which generates and caches the embedding and response, takes slightly longer.
  • The second query, similar but not identical, is served almost instantaneously, demonstrating the cache's effectiveness: it uses no tokens and has a total cost of $0.00.
  • The third query is also served instantly from the cache, since it is an exact match of the first.
Result of Semantic Caching

Business Case Comparison: Caching Strategies for AI Applications

Imagine a business that operates a customer service chatbot handling a vast array of customer queries, ranging from product information and troubleshooting to order processing. The chatbot, powered by a sophisticated AI model like OpenAI's GPT-4, processes an average of 100,000 queries daily, with each query and its response averaging around 50 tokens apiece. A rough cost calculation for the three scenarios below follows the comparison.

Without Caching

  • Cost: High, with $13,500 monthly on API calls.
  • Performance: Slower, with variable response times.
  • Scalability: Financially challenging as query volume increases.

With Standard Caching

  • Cost Reduction: Noticeable, dropping to about $9,450 monthly due to serving 30% of repeat queries from cache.
  • Performance: Improved for repeated queries.
  • Limitations: Ineffective for queries with slight variations.

With Semantic Caching

  • Cost Efficiency: Maximized, potentially reducing monthly costs to $5,400 by caching 60% of queries including variations.
  • Performance: Near-instant responses for a broader range of queries.
  • Scalability: Enhanced, with reduced direct AI model invocations.
Simulated Business Case Results
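
For transparency, here is a rough reproduction of those figures under the stated assumptions: 100,000 queries per day, 50 input and 50 output tokens each, GPT-4 pricing, and assumed cache-hit rates of 30% for standard caching and 60% for semantic caching:

# Rough reproduction of the simulated business case (assumed figures)
QUERIES_PER_DAY = 100_000
TOKENS_IN, TOKENS_OUT = 50, 50
DAYS_PER_MONTH = 30
PRICE_IN, PRICE_OUT = 30.0, 60.0  # USD per 1M tokens (GPT-4)

def monthly_cost(cache_hit_rate):
    # Only cache misses reach the model and incur token costs
    miss_rate = 1.0 - cache_hit_rate
    tokens_in = QUERIES_PER_DAY * TOKENS_IN * DAYS_PER_MONTH * miss_rate
    tokens_out = QUERIES_PER_DAY * TOKENS_OUT * DAYS_PER_MONTH * miss_rate
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

print(monthly_cost(0.0))   # no caching:        $13,500
print(monthly_cost(0.3))   # standard caching:   $9,450
print(monthly_cost(0.6))   # semantic caching:   $5,400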

Conclusion

Transitioning from no caching to semantic caching significantly improves operational efficiency and cost-effectiveness in AI applications. Semantic caching not only offers superior user experiences through faster responses but also ensures scalability and financial viability by efficiently handling query variations.

Acknowledgement

Currently working as a Solution Architect at MongoDB, I want to extend my special thanks to my colleagues, Prakul Agarwal, AI/ML Product Manager, for his valuable product insights and to Christophe Locoge for sharing insights on enterprise and corporate challenges, derived from discussions with our clients.
