Build an Inference Cache to Save Costs in High-Traffic LLM Apps

Solega Team
October 27, 2025


In this article, you will learn how to add both exact-match and semantic inference caching to large language model applications to reduce latency and API costs at scale.

Topics we will cover include:

  • Why repeated queries in high-traffic apps waste time and money.
  • How to build a minimal exact-match cache and measure the impact.
  • How to implement a semantic cache with embeddings and cosine similarity.

Alright, let’s get to it.

Introduction

Large language models (LLMs) are widely used in applications like chatbots, customer support, code assistants, and more. These applications often serve millions of queries per day, and in high-traffic apps it's very common for many users to ask the same or similar questions. Now think about it: is it really smart to call the LLM every single time, when each call costs money and adds latency to the response? Logically, no.

Take a customer service bot as an example. Thousands of users might ask questions every day, and many of those questions are repeated:

  • “What’s your refund policy?”
  • “How do I reset my password?”
  • “What’s the delivery time?”

If every single query is sent to the LLM, you're just burning through your API budget unnecessarily. Each repeated request costs the same, even though the model has already generated that answer before.

That's where inference caching comes in. You can think of it as memory where you store the most common questions and reuse the results. In this article, I'll walk you through a high-level overview with code. We'll start with a single LLM call, simulate what high-traffic apps look like, build a simple cache, and then take a look at a more advanced version you'd want in production. Let's get started.

Setup

Install dependencies. I am using Google Colab for this demo. We’ll use the OpenAI Python client:
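pip install openai numpy

(If you're running this in a Colab cell, prefix the command with an exclamation mark. NumPy is only needed later, for the semantic caching step.)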

Set your OpenAI API key:

import os
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "sk-your_api_key_here"
client = OpenAI()

Step 1: A Simple LLM Call

This function sends a prompt to the model and prints how long it takes:

import time

def ask_llm(prompt):
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    end = time.time()
    print(f"Time: {end - start:.2f}s")
    return response.choices[0].message.content

print(ask_llm("What is your refund policy?"))

Output:

Time: 2.81s
As an AI language model, I don't have a refund policy since I don't...

This works fine for one call. But what if the same question is asked over and over?

Step 2: Simulating Repeated Questions

Let’s create a small list of user queries. Some are repeated, some are new:

queries = [
    "What is your refund policy?",
    "How do I reset my password?",
    "What is your refund policy?",   # repeated
    "What's the delivery time?",
    "How do I reset my password?",   # repeated
]

Let’s see what happens if we call the LLM for each:

start = time.time()
for q in queries:
    print(f"Q: {q}")
    ans = ask_llm(q)
    print("A:", ans)
    print("-" * 50)
end = time.time()

print(f"Total Time (no cache): {end - start:.2f}s")

Output:

Q: What is your refund policy?
Time: 2.02s
A: I don't handle transactions or have a refund policy...
--------------------------------------------------
Q: How do I reset my password?
Time: 10.22s
A: To reset your password, you typically need to follow...
--------------------------------------------------
Q: What is your refund policy?
Time: 4.66s
A: I don't handle transactions or refunds directly...
--------------------------------------------------
Q: What's the delivery time?
Time: 5.40s
A: The delivery time can vary significantly based on several factors...
--------------------------------------------------
Q: How do I reset my password?
Time: 6.34s
A: To reset your password, the process typically varies...
--------------------------------------------------
Total Time (no cache): 28.64s

Every time, the LLM is called again. Even though two queries are identical, we’re paying for both. With thousands of users, these costs can skyrocket.
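To put rough numbers on that, here is a quick back-of-envelope sketch. Every figure below (daily traffic, duplicate rate, tokens per call, and price per token) is an illustrative assumption rather than a measurement or published pricing:

# Back-of-envelope savings estimate; all figures are illustrative assumptions
queries_per_day = 100_000        # assumed daily traffic
duplicate_rate = 0.40            # assumed share of queries a cache could absorb
tokens_per_call = 800            # assumed prompt + completion tokens per request
price_per_1m_tokens = 0.60       # assumed blended USD price per million tokens

cost_no_cache = queries_per_day * tokens_per_call / 1_000_000 * price_per_1m_tokens
cost_with_cache = cost_no_cache * (1 - duplicate_rate)
print(f"No cache:   ${cost_no_cache:,.2f}/day")
print(f"With cache: ${cost_with_cache:,.2f}/day")

Even with conservative numbers, every duplicated call is pure waste.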

Step 3: Adding an Inference Cache (Exact Match)

We can fix this with a dictionary-based cache as a naive solution:

cache = {}

def ask_llm_cached(prompt):
    if prompt in cache:
        print("(from cache, ~0.00s)")
        return cache[prompt]

    ans = ask_llm(prompt)
    cache[prompt] = ans
    return ans

start = time.time()
for q in queries:
    print(f"Q: {q}")
    print("A:", ask_llm_cached(q))
    print("-" * 50)
end = time.time()

print(f"Total Time (exact cache): {end - start:.2f}s")

Output:

Q: What is your refund policy?
Time: 2.35s
A: I don't have a refund policy since...
--------------------------------------------------
Q: How do I reset my password?
Time: 6.42s
A: Resetting your password typically depends on...
--------------------------------------------------
Q: What is your refund policy?
(from cache, ~0.00s)
A: I don't have a refund policy since...
--------------------------------------------------
Q: What's the delivery time?
Time: 3.22s
A: Delivery times can vary depending on several factors...
--------------------------------------------------
Q: How do I reset my password?
(from cache, ~0.00s)
A: Resetting your password typically depends...
--------------------------------------------------
Total Time (exact cache): 12.00s

Now:

  • The first time “What is your refund policy?” is asked, it calls the LLM.
  • The second time, it instantly retrieves from cache.

This saves cost and reduces latency dramatically.
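One caveat before moving on: in a long-running service this dictionary grows without bound. Here is a minimal sketch of capping it with least-recently-used (LRU) eviction, reusing ask_llm from Step 1; the helper name, the store, and the 1,000-entry limit are arbitrary choices for illustration:

from collections import OrderedDict

MAX_ENTRIES = 1000                  # arbitrary cap, purely for illustration
lru_store = OrderedDict()

def ask_llm_cached_lru(prompt):
    if prompt in lru_store:
        lru_store.move_to_end(prompt)        # mark the entry as recently used
        print("(from cache, ~0.00s)")
        return lru_store[prompt]
    ans = ask_llm(prompt)
    lru_store[prompt] = ans
    if len(lru_store) > MAX_ENTRIES:
        lru_store.popitem(last=False)        # evict the least recently used entry
    return ans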

Step 4: The Problem with Exact Matching

Exact matching works only when the query text is identical. Let’s see an example:

q1 = "What is your refund policy?"
q2 = "Can you explain the refund policy?"

print("First:", ask_llm_cached(q1))
print("Second:", ask_llm_cached(q2))  # Not cached, even though it means the same!

Output:

(from cache, ~0.00s)
First: I don't have a refund policy since...

Time: 7.93s
Second: Refund policies can vary widely depending on the company...

Both queries ask about refunds, but since the text is slightly different, our cache misses. That means we still pay for the LLM. This is a big problem in the real world because users phrase questions differently.

Step 5: Semantic Caching with Embeddings

To fix this, we can use semantic caching. Instead of checking if text is identical, we check if queries are similar in meaning. We can use embeddings for this:

import numpy as np

semantic_cache = {}

def embed(text):
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(emb.data[0].embedding)

def ask_llm_semantic(prompt, threshold=0.85):
    prompt_emb = embed(prompt)

    for cached_q, (cached_emb, cached_ans) in semantic_cache.items():
        sim = np.dot(prompt_emb, cached_emb) / (
            np.linalg.norm(prompt_emb) * np.linalg.norm(cached_emb)
        )
        if sim > threshold:
            print(f"(from semantic cache, matched with '{cached_q}', ~0.00s)")
            return cached_ans

    start = time.time()
    ans = ask_llm(prompt)
    end = time.time()
    semantic_cache[prompt] = (prompt_emb, ans)
    print(f"Time (new LLM call): {end - start:.2f}s")
    return ans

print("First:", ask_llm_semantic("What is your refund policy?"))
print("Second:", ask_llm_semantic("Can you explain the refund policy?"))  # Should hit semantic cache

Output:

Time: 4.54s
Time (new LLM call): 4.54s
First: As an AI, I don't have a refund policy since I don't sell...

(from semantic cache, matched with 'What is your refund policy?', ~0.00s)
Second: As an AI, I don't have a refund policy since I don't sell...

Even though the second query is worded differently, the semantic cache recognizes its similarity and reuses the answer.
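To compare against the earlier runs, you can push the Step 2 queries list (plus a reworded variant) through ask_llm_semantic in the same loop. Exact timings will differ from run to run, but the repeated and paraphrased questions should come back from the cache:

test_queries = queries + ["Can you explain the refund policy?"]  # reuse the list from Step 2

start = time.time()
for q in test_queries:
    print(f"Q: {q}")
    print("A:", ask_llm_semantic(q))
    print("-" * 50)
end = time.time()

print(f"Total Time (semantic cache): {end - start:.2f}s")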

Conclusion

If you’re building customer support bots, AI agents, or any high-traffic LLM app, caching should be one of the first optimizations you put in place.

  • Exact cache saves cost for identical queries.
  • Semantic cache saves cost for meaningfully similar queries.
  • Together, they can massively reduce API calls in high-traffic apps.

In real-world production apps, you’d store embeddings in a vector database like FAISS, Pinecone, or Weaviate for fast similarity search. But even this small demo shows how much cost and time you can save.
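As a rough sketch of what that swap looks like with FAISS (assuming faiss-cpu is installed via pip and reusing embed and ask_llm from the steps above; text-embedding-3-small produces 1536-dimensional vectors, and ask_llm_faiss is just a name for this sketch):

import faiss

DIM = 1536                          # embedding size of text-embedding-3-small
index = faiss.IndexFlatIP(DIM)      # inner product over normalized vectors = cosine similarity
cached_answers = []                 # answer i corresponds to row i of the index

def ask_llm_faiss(prompt, threshold=0.85):
    vec = embed(prompt).astype("float32").reshape(1, -1)
    faiss.normalize_L2(vec)                   # normalize in place so inner product equals cosine
    if index.ntotal > 0:
        scores, ids = index.search(vec, 1)    # nearest cached query
        if scores[0][0] > threshold:
            return cached_answers[ids[0][0]]
    ans = ask_llm(prompt)                     # cache miss: call the model and store the result
    index.add(vec)
    cached_answers.append(ans)
    return ans

The lookup logic is the same as in Step 5; only the similarity search moves from a Python loop into an index built for it.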


