Beyond Prefix Caching: How Non-Prefix Caching Achieves 25x Better Hit Rates for AI Agents

Tensormesh is an AI inference optimization company that never charges you twice for cached tokens, making AI applications faster and dramatically cheaper to run anywhere.

RepoAgent's prompt caching yields only a 3.4% token hit rate, even though 85.9% of prompt tokens are reused (just not as a prefix). Our technique, CacheBlend, enables non-prefix prompt caching and improves the cache hit rate by 25x (from 3.4% to 85.9%).

Motivation

If you could pick only one metric to optimize in production-scale agents, what would it be? According to Manus, the single most important metric is the prompt cache hit rate.

With recent agents, we have found that the prompt cache hit rate is surprisingly low, even though these agents repeatedly process the same information in their LLM prompts.

To understand why prompt cache hits are so rare for these agents, we dive into one example: RepoAgent, an agent that generates documentation for your GitHub repository.

Workflow of RepoAgent

RepoAgent processes code objects in topological order based on their dependencies. This ensures that when documenting a function, the documentation for all functions it calls is already available.

The solid arrows represent parent-child relationships (containment), while the dashed arrows represent reference relationships (calls/dependencies).

RepoAgent uses these relationships to:

  1. Build a dependency graph
  2. Compute topological order
  3. Generate documentation bottom-up (leaf nodes first)
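The dependency-ordered traversal above can be sketched with Python's standard-library topological sorter (the graph below is a toy example matching the call relationships discussed in this post, not RepoAgent's actual implementation):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each object maps to the objects it calls.
# Sub_Func_1 calls nothing, so it is a leaf and gets documented first.
deps = {
    "Func_1": {"Sub_Func_1"},
    "Sub_Func_2": {"Sub_Func_1"},
    "Sub_Func_1": set(),
}

# static_order() yields nodes with all dependencies before their dependents,
# i.e. bottom-up documentation order.
order = list(TopologicalSorter(deps).static_order())
print(order)  # Sub_Func_1 comes before Func_1 and Sub_Func_2
```

Because Sub_Func_1 is documented first, its finished documentation (and source) can be embedded into the prompts for Func_1 and Sub_Func_2 later on.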

How RepoAgent Prompts Are Generated

Prompts are generated from the following template:

If you are interested in the concrete details of what variables like code_type contain:
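As a rough sketch of how such a template is filled in, consider the following simplified stand-in. The placeholder names ({project_structure}, {code_type}, {file_path}, {reference_letter}, {referencer_content}) follow the variables discussed in this post; the surrounding wording and the helper values are illustrative, not RepoAgent's exact template:

```python
# Simplified stand-in for RepoAgent's documentation prompt template.
DOC_PROMPT_TEMPLATE = """\
You are documenting a repository with the following structure:
{project_structure}

You are currently documenting the {code_type} named {code_name}
in the file {file_path}. Its source code is:
{code_content}

Objects that this {code_type} calls (their code/docs):
{reference_letter}

Objects that call this {code_type}:
{referencer_content}

Please generate documentation for it.
"""

prompt = DOC_PROMPT_TEMPLATE.format(
    project_structure="repo/\n  main.py",
    code_type="function",
    code_name="Sub_Func_1",
    file_path="repo/main.py",
    code_content="def sub_func_1(): ...",
    reference_letter="(none)",
    referencer_content="Func_1, Sub_Func_2",
)
```

Note that dynamic values ({project_structure}, {file_path}) appear near the top of the rendered prompt, which matters for the cache analysis below.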

Prompt Cache Hit Rate Analysis

The prompt template above directly causes the low prompt cache hit rate. The key reason is that prefix caching requires reused text to appear at the very start of the prompt, but in RepoAgent the same text shifts position from prompt to prompt, making it non-prefix.

Say both Func_1 and Sub_Func_2 call Sub_Func_1 (see dashed arrows in topo.svg). When generating documentation for each, Sub_Func_1's code appears in the {reference_letter} section, but at a different position in each prompt:
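A toy illustration of why this defeats prefix caching (hypothetical prompts, character-level matching for simplicity): both prompts contain the identical shared block, yet a prefix cache can only reuse the common leading characters, which end at the first dynamic variable (the file path).

```python
import os

shared_block = "def sub_func_1():\n    return 42\n"  # Sub_Func_1's source

# Two prompts that both embed Sub_Func_1's code, at different offsets.
prompt_a = "File: repo/a.py\nReferences:\n" + shared_block + "Document Func_1."
prompt_b = "File: repo/utils/b.py\nReferences:\n" + shared_block + "Document Sub_Func_2."

common_prefix = os.path.commonprefix([prompt_a, prompt_b])

print(len(common_prefix))            # 11 -- just "File: repo/"
print(prompt_a.index(shared_block))  # the shared block sits at one offset here...
print(prompt_b.index(shared_block))  # ...and at a different offset here
```

Everything after the first differing character, including the entire shared block, is recomputed from scratch by a prefix-only cache.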

We summarize the root causes of these prefix-cache misses:

  1. Early Dynamic Variables: {project_structure} and {file_path} appear within the first 200 characters, immediately breaking prefix alignment
  2. Unique Code Content: Every request documents a different function/class with unique source code
  3. Variable Reference Context: The {reference_letter} and {referencer_content} sections vary based on each object's call relationships
  4. Topological Processing: Objects are processed by dependency order, not file proximity, so consecutive requests often document unrelated code

The Non-Prefix Advantage

Unlike prefix caching, non-prefix prompt caching can match and reuse any contiguous block of tokens, regardless of the position. This fundamentally changes what can be cached:

Cacheable Components in RepoAgent

Empirical Results

LMCache benchmarks demonstrate the dramatic improvement:


CacheBlend makes non-prefix prompt caching possible, and we have built it into Tensormesh using our technique, CacheBlend.

Please contact us if you are interested in trying CacheBlend on your workload!
