Agent Skills Caching with CacheBlend: Achieving 85% Cache Hit Rates for LLM Agents

Tensormesh is an AI inference optimization company that never charges you twice for cached tokens, making AI applications faster and dramatically cheaper to run anywhere.

For Agent Skills, traditional prompt caching yields a cache hit rate close to 0%. We have been working on a non-prefix caching technique called CacheBlend, with which we achieve 63.6-85.0% cache hit rates on skill-related content, potentially making any agent cheaper to run and faster to answer.

Overview

We propose a strategy for caching skill files (such as SKILL.md and the files it references) to enable efficient LLM prompt caching through context editing. By pre-processing and storing skill documentation in the correct format, we can achieve cache hit rates of up to 85% on skill-related content.

A skill relies on progressive disclosure: the first several lines of SKILL.md provide just enough information for Claude to know when the skill should be used, without loading all of it into context.

The actual body of the file is the second and third level of detail. If Claude thinks the skill is relevant to the current task, it loads the full instructions in SKILL.md and any related resources into context.
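
The split between the always-visible header and the on-demand body can be sketched as follows. This is a minimal illustration, assuming the first several lines of SKILL.md form a header delimited by `---` lines with `name` and `description` fields; the function name `split_skill` and the sample text are hypothetical.

```python
def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (header fields, body).

    Assumes the header is a block of `key: value` lines between two `---` lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no header: treat everything as body
    header: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return header, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat everything as body


# Hypothetical skill file used only for illustration.
SAMPLE = """---
name: pdf
description: Extract text and tables from PDF files.
---
# PDF processing

Full instructions go here.
"""

header, body = split_skill(SAMPLE)
# First level: only this short summary is always in context.
always_loaded = f"{header['name']}: {header['description']}"
# Second and third levels: loaded only when the skill looks relevant.
loaded_on_demand = body
```

Only `always_loaded` needs to be present in every prompt; `loaded_on_demand` is the part that enters the context later and is the candidate for caching.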

Problem Statement

When LLM agents use skill files (e.g., PDF processing skills), the skill documentation is inserted into the user context. Without proper caching:

  • Each request re-sends the full skill documentation
  • No prefix matching occurs with previous requests
  • Token costs and latency grow linearly with the number of requests
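
To see why prefix matching fails here, consider two requests that load the same skill after different user text. The sketch below is a toy model that compares whitespace-separated words rather than real tokens; the requests and the skill text are invented for illustration.

```python
def common_prefix_words(a: list[str], b: list[str]) -> int:
    """Count how many leading words two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical skill documentation, identical in both requests.
skill_doc = "Use the pdf tool to extract text then summarize each page".split()

# The dynamic user request comes first, so the prompts diverge at word 0.
request_1 = "Please summarize report.pdf".split() + skill_doc
request_2 = "Extract the tables from invoice.pdf".split() + skill_doc

reused = common_prefix_words(request_1, request_2)
prefix_hit_rate = reused / len(request_2)
shared_skill_words = len(skill_doc)
```

Here `reused` is 0 even though `shared_skill_words` words are identical in both prompts: the shared content sits behind a differing prefix, so a prefix cache cannot use it.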

Solution: Pre-cached Skill Pool

Skill Files to Cache

For each skill, cache the main skill file and all referenced files:
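
One way to enumerate those files is sketched below. It assumes referenced files are relative Markdown links inside SKILL.md, which is an assumption for illustration; `files_to_cache` and the directory layout are hypothetical names, not part of any Tensormesh API.

```python
import re
import tempfile
from pathlib import Path

# Matches the target of a Markdown link such as [forms](forms.md).
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+)\)")


def files_to_cache(skill_dir: Path) -> list[Path]:
    """Return SKILL.md plus every local file it links to, in order."""
    main = skill_dir / "SKILL.md"
    found = [main]
    for target in LINK.findall(main.read_text(encoding="utf-8")):
        if "://" in target:
            continue  # skip external URLs
        path = skill_dir / target
        if path.is_file() and path not in found:
            found.append(path)
    return found


# Usage on a throwaway skill directory.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "pdf"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "See [forms](forms.md) and [the spec](https://example.com/spec).\n",
        encoding="utf-8",
    )
    (skill_dir / "forms.md").write_text("Form filling notes.\n", encoding="utf-8")
    cached_names = [p.name for p in files_to_cache(skill_dir)]
```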

Context Editing for Cache Hits

Traditional prefix caching requires skill content to appear at the exact same position in every prompt. However, with CacheBlend, we can concatenate pre-cached skill KV states with the existing context at any position.

Key Advantage: With CacheBlend, the skill content doesn't need to be at the prompt prefix. It can appear after dynamic content (user request, conversation history) and still achieve cache hits by concatenating pre-computed KV states.
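
The position-independent lookup can be pictured with a toy model. This is a sketch under stated assumptions, not CacheBlend's implementation: KV states are stood in for by placeholder strings, chunks are keyed by a content hash, and the sketch does not model how reused states are made consistent with the surrounding context. `SkillPool` and `plan_prompt` are hypothetical names.

```python
import hashlib
from typing import Optional


def chunk_key(text: str) -> str:
    """Key a chunk by its content, not by its position in the prompt."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SkillPool:
    """Toy pool mapping skill text to placeholder 'KV states'."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def precompute(self, text: str) -> None:
        # Placeholder for running prefill once and saving the KV states.
        self._store[chunk_key(text)] = f"kv-states({len(text.split())} words)"

    def lookup(self, text: str) -> Optional[str]:
        return self._store.get(chunk_key(text))


def plan_prompt(segments: list[str], pool: SkillPool) -> list[tuple[str, str]]:
    """Mark each prompt segment as reusable from the pool or needing compute."""
    plan = []
    for seg in segments:
        action = "reuse" if pool.lookup(seg) is not None else "compute"
        plan.append((action, seg))
    return plan


pool = SkillPool()
skill_doc = "Use the pdf tool to extract text then summarize each page"
pool.precompute(skill_doc)

# The skill text comes after the dynamic user request and is still found.
prompt = ["Extract the tables from invoice.pdf", skill_doc]
actions = [action for action, _ in plan_prompt(prompt, pool)]
```

Because the key depends only on the chunk's content, the skill segment is marked for reuse even though it follows text the pool has never seen.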

Cache Hit Analysis

With proper caching, we observed these results on skill-related requests:

Note: "New Content Appended to Prefix" refers to the total number of words in the new content that was appended to the cached prefix. With CacheBlend, this new content can be placed after dynamic content (like user requests) while still achieving cache hits by concatenating pre-computed KV states. "Matched" represents the number of words from skill files that were found in the cache pool and successfully matched, enabling cache hits.
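
As a worked example of how such a hit rate could be computed, the sketch below assumes the rate is the matched word count divided by the total word count of the new content. That formula is an assumption, and the numbers are invented to illustrate it; they are not measurements from this post.

```python
def cache_hit_rate(matched_words: int, new_content_words: int) -> float:
    """Return the cache hit rate as a percentage (assumed: matched / total)."""
    if new_content_words <= 0:
        raise ValueError("new_content_words must be positive")
    if not 0 <= matched_words <= new_content_words:
        raise ValueError("matched_words must be between 0 and new_content_words")
    return 100.0 * matched_words / new_content_words


# Illustrative numbers only: 850 of 1000 appended words found in the pool.
example_rate = cache_hit_rate(matched_words=850, new_content_words=1000)
```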

Benefits

  1. Reduced Token Costs: up to 85% of skill content hits the cache → as little as 15% is processed as new tokens
  2. Lower Latency: Reusing cached KV states shortens prompt processing, so responses start sooner
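
The cost side of this arithmetic can be sketched as below. It assumes the reported word-level hit rates carry over to tokens unchanged, which is a simplification, and the 10,000-token skill size is a made-up figure for illustration.

```python
def tokens_to_process(skill_tokens: int, hit_rate_percent: float) -> int:
    """Tokens left to process after cache hits are subtracted."""
    cached = round(skill_tokens * hit_rate_percent / 100)
    return skill_tokens - cached


# Hypothetical 10,000 tokens of skill content at both ends of the reported range.
at_low_end = tokens_to_process(10_000, 63.6)
at_high_end = tokens_to_process(10_000, 85.0)
```

Under these assumptions, between 1,500 and 3,640 of the 10,000 skill tokens would still need processing, depending on where in the 63.6-85.0% range a request falls.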

Conclusion

By pre-caching skill files and using CacheBlend to reuse their pre-computed KV states at any position in the prompt, we can achieve significant cache hit rates (63.6-85.0%) on skill-related content. This reduces token costs and latency while maintaining consistent skill documentation across requests.
