Build an Inference Cache to Save Costs in High-Traffic LLM Apps

Solega Team
October 27, 2025


In this article, you will learn how to add both exact-match and semantic inference caching to large language model applications to reduce latency and API costs at scale.

Topics we will cover include:

  • Why repeated queries in high-traffic apps waste time and money.
  • How to build a minimal exact-match cache and measure the impact.
  • How to implement a semantic cache with embeddings and cosine similarity.

Alright, let’s get to it.

Introduction

Large language models (LLMs) are widely used in applications like chatbots, customer support, code assistants, and more. These applications often serve millions of queries per day, and in high-traffic apps it's very common for many users to ask the same or similar questions. Now think about it: is it really smart to call the LLM every single time, when each call costs money and adds latency to the response? Logically, no.

Take a customer service bot as an example. Thousands of users might ask questions every day, and many of those questions are repeated:

  • “What’s your refund policy?”
  • “How do I reset my password?”
  • “What’s the delivery time?”

If every single query is sent to the LLM, you're just burning through your API budget unnecessarily. Each repeated request costs the same, even though the model has already generated that answer before.

That's where inference caching comes in. You can think of it as memory where you store the most common questions and reuse the results. In this article, I'll walk you through a high-level overview with code. We'll start with a single LLM call, simulate what high-traffic apps look like, build a simple cache, and then take a look at a more advanced version you'd want in production. Let's get started.

Setup

Install dependencies. I am using Google Colab for this demo. We’ll use the OpenAI Python client:
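pip install openai numpy

(If you're running this in a Colab cell, prefix the command with an exclamation mark. NumPy is only needed later, for the semantic caching step.)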

Set your OpenAI API key:

import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-your_api_key_here"
client = OpenAI()

Step 1: A Simple LLM Call

This function sends a prompt to the model and prints how long it takes:

import time

def ask_llm(prompt):
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    end = time.time()
    print(f"Time: {end - start:.2f}s")
    return response.choices[0].message.content

print(ask_llm("What is your refund policy?"))

Output:

Time: 2.81s
As an AI language model, I don't have a refund policy since I don't...

This works fine for one call. But what if the same question is asked over and over?

Step 2: Simulating Repeated Questions

Let’s create a small list of user queries. Some are repeated, some are new:

queries = [
    "What is your refund policy?",
    "How do I reset my password?",
    "What is your refund policy?",   # repeated
    "What's the delivery time?",
    "How do I reset my password?",   # repeated
]

Let’s see what happens if we call the LLM for each:

start = time.time()
for q in queries:
    print(f"Q: {q}")
    ans = ask_llm(q)
    print("A:", ans)
    print("-" * 50)
end = time.time()

print(f"Total Time (no cache): {end - start:.2f}s")

Output:

Q: What is your refund policy?
Time: 2.02s
A: I don't handle transactions or have a refund policy...
--------------------------------------------------
Q: How do I reset my password?
Time: 10.22s
A: To reset your password, you typically need to follow...
--------------------------------------------------
Q: What is your refund policy?
Time: 4.66s
A: I don't handle transactions or refunds directly...
--------------------------------------------------
Q: What's the delivery time?
Time: 5.40s
A: The delivery time can vary significantly based on several factors...
--------------------------------------------------
Q: How do I reset my password?
Time: 6.34s
A: To reset your password, the process typically varies...
--------------------------------------------------
Total Time (no cache): 28.64s

Every time, the LLM is called again. Even though two queries are identical, we’re paying for both. With thousands of users, these costs can skyrocket.
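To put rough numbers on that, here is a quick back-of-envelope sketch. Every figure below (daily traffic, duplicate rate, tokens per call, and price per token) is an illustrative assumption rather than a measurement or published pricing:

# Back-of-envelope savings estimate; all figures are illustrative assumptions
queries_per_day = 100_000        # assumed daily traffic
duplicate_rate = 0.40            # assumed share of queries a cache could absorb
tokens_per_call = 800            # assumed prompt + completion tokens per request
price_per_1m_tokens = 0.60       # assumed blended USD price per million tokens

cost_no_cache = queries_per_day * tokens_per_call / 1_000_000 * price_per_1m_tokens
cost_with_cache = cost_no_cache * (1 - duplicate_rate)
print(f"No cache:   ${cost_no_cache:,.2f}/day")
print(f"With cache: ${cost_with_cache:,.2f}/day")

Even with conservative numbers, every duplicated call is pure waste.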

Step 3: Adding an Inference Cache (Exact Match)

We can fix this with a dictionary-based cache as a naive solution:

cache = {}

def ask_llm_cached(prompt):
    if prompt in cache:
        print("(from cache, ~0.00s)")
        return cache[prompt]

    ans = ask_llm(prompt)
    cache[prompt] = ans
    return ans

start = time.time()
for q in queries:
    print(f"Q: {q}")
    print("A:", ask_llm_cached(q))
    print("-" * 50)
end = time.time()

print(f"Total Time (exact cache): {end - start:.2f}s")

Output:

Q: What is your refund policy?
Time: 2.35s
A: I don't have a refund policy since...
--------------------------------------------------
Q: How do I reset my password?
Time: 6.42s
A: Resetting your password typically depends on...
--------------------------------------------------
Q: What is your refund policy?
(from cache, ~0.00s)
A: I don't have a refund policy since...
--------------------------------------------------
Q: What's the delivery time?
Time: 3.22s
A: Delivery times can vary depending on several factors...
--------------------------------------------------
Q: How do I reset my password?
(from cache, ~0.00s)
A: Resetting your password typically depends...
--------------------------------------------------
Total Time (exact cache): 12.00s

Now:

  • The first time “What is your refund policy?” is asked, it calls the LLM.
  • The second time, it instantly retrieves from cache.

This saves cost and reduces latency dramatically.
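One caveat before moving on: in a long-running service this dictionary grows without bound. Here is a minimal sketch of capping it with least-recently-used (LRU) eviction, reusing ask_llm from Step 1; the helper name, the store, and the 1,000-entry limit are arbitrary choices for illustration:

from collections import OrderedDict

MAX_ENTRIES = 1000                  # arbitrary cap, purely for illustration
lru_store = OrderedDict()

def ask_llm_cached_lru(prompt):
    if prompt in lru_store:
        lru_store.move_to_end(prompt)        # mark the entry as recently used
        print("(from cache, ~0.00s)")
        return lru_store[prompt]
    ans = ask_llm(prompt)
    lru_store[prompt] = ans
    if len(lru_store) > MAX_ENTRIES:
        lru_store.popitem(last=False)        # evict the least recently used entry
    return ans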

Step 4: The Problem with Exact Matching

Exact matching works only when the query text is identical. Let’s see an example:

q1 = "What is your refund policy?"
q2 = "Can you explain the refund policy?"

print("First:", ask_llm_cached(q1))
print("Second:", ask_llm_cached(q2))  # Not cached, even though it means the same!

Output:

(from cache, ~0.00s)
First: I don't have a refund policy since...

Time: 7.93s
Second: Refund policies can vary widely depending on the company...

Both queries ask about refunds, but since the text is slightly different, our cache misses. That means we still pay for the LLM. This is a big problem in the real world because users phrase questions differently.

Step 5: Semantic Caching with Embeddings

To fix this, we can use semantic caching. Instead of checking if text is identical, we check if queries are similar in meaning. We can use embeddings for this:

import numpy as np

semantic_cache = {}

def embed(text):
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(emb.data[0].embedding)

def ask_llm_semantic(prompt, threshold=0.85):
    prompt_emb = embed(prompt)

    for cached_q, (cached_emb, cached_ans) in semantic_cache.items():
        sim = np.dot(prompt_emb, cached_emb) / (
            np.linalg.norm(prompt_emb) * np.linalg.norm(cached_emb)
        )
        if sim > threshold:
            print(f"(from semantic cache, matched with '{cached_q}', ~0.00s)")
            return cached_ans

    start = time.time()
    ans = ask_llm(prompt)
    end = time.time()
    semantic_cache[prompt] = (prompt_emb, ans)
    print(f"Time (new LLM call): {end - start:.2f}s")
    return ans

print("First:", ask_llm_semantic("What is your refund policy?"))
print("Second:", ask_llm_semantic("Can you explain the refund policy?"))  # Should hit semantic cache

Output:

Time: 4.54s
Time (new LLM call): 4.54s
First: As an AI, I don't have a refund policy since I don't sell...

(from semantic cache, matched with 'What is your refund policy?', ~0.00s)
Second: As an AI, I don't have a refund policy since I don't sell...

Even though the second query is worded differently, the semantic cache recognizes its similarity and reuses the answer.
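To compare against the earlier runs, you can push the Step 2 queries list (plus a reworded variant) through ask_llm_semantic in the same loop. Exact timings will differ from run to run, but the repeated and paraphrased questions should come back from the cache:

test_queries = queries + ["Can you explain the refund policy?"]  # reuse the list from Step 2

start = time.time()
for q in test_queries:
    print(f"Q: {q}")
    print("A:", ask_llm_semantic(q))
    print("-" * 50)
end = time.time()

print(f"Total Time (semantic cache): {end - start:.2f}s")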

Conclusion

If you’re building customer support bots, AI agents, or any high-traffic LLM app, caching should be one of the first optimizations you put in place.

  • Exact cache saves cost for identical queries.
  • Semantic cache saves cost for meaningfully similar queries.
  • Together, they can massively reduce API calls in high-traffic apps.

In real-world production apps, you’d store embeddings in a vector database like FAISS, Pinecone, or Weaviate for fast similarity search. But even this small demo shows how much cost and time you can save.
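As a rough sketch of what that swap looks like with FAISS (assuming faiss-cpu is installed via pip and reusing embed and ask_llm from the steps above; text-embedding-3-small produces 1536-dimensional vectors, and ask_llm_faiss is just a name for this sketch):

import faiss

DIM = 1536                          # embedding size of text-embedding-3-small
index = faiss.IndexFlatIP(DIM)      # inner product over normalized vectors = cosine similarity
cached_answers = []                 # answer i corresponds to row i of the index

def ask_llm_faiss(prompt, threshold=0.85):
    vec = embed(prompt).astype("float32").reshape(1, -1)
    faiss.normalize_L2(vec)                   # normalize in place so inner product equals cosine
    if index.ntotal > 0:
        scores, ids = index.search(vec, 1)    # nearest cached query
        if scores[0][0] > threshold:
            return cached_answers[ids[0][0]]
    ans = ask_llm(prompt)                     # cache miss: call the model and store the result
    index.add(vec)
    cached_answers.append(ans)
    return ans

The lookup logic is the same as in Step 5; only the similarity search moves from a Python loop into an index built for it.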


