How We Optimized Redis for LLM KV Cache: 0.3 GB/s to 10 GB/s

Software Engineer

Marketing, Enterprise and Partnerships

Tensormesh is an AI inference optimization company that never charges you twice for cached tokens, making AI applications faster and dramatically cheaper to run anywhere.

KV caches accelerate LLM inference, but their aggregate size quickly exceeds GPU memory, requiring a scalable, cluster-level storage backend. Redis is a natural fit, but its standard Python SDK delivers only 0.3 GB/s of retrieval throughput, far below what's needed to outpace GPU recomputation. In partnership with Redis, we rebuilt the client down to C++, achieving a 30x throughput improvement (0.3 → 10 GB/s) and a 40% reduction in end-to-end round time on production workloads.

‍Who We Are

We were founded by the creators of LMCache, the open-source KV caching engine we built and continue to maintain, integrated widely across the inference ecosystems (including all 3 major serving engines: vLLM, SGLang, TensorRT-LLM) and various orchestration stacks (NVIDIA Dynamo, llm-d, vLLM Production Stack, etc.)

This post walks through a specific piece of that work, making Redis a high-performance external storage backend for LMCache. We took Redis client throughput from 0.3 GB/s to 10 GB/s, and here's what that means for real-world inference.

‍‍

Why KV Cache Offloading Matters for LLM Inference

Large language model inference is expensive. When a user sends a prompt that shares a prefix with a previous request, the GPU recomputes attention over tokens it's already seen. KV cache offloading solves this through storing the computed key-value tensors externally and retrieving them instead of recomputing.

LMCache manages a three-tier memory hierarchy for this purpose: L0 (VRAM), L1 (Host RAM), and L2 (external storage). Redis is a natural fit for external storage as it runs everywhere (AWS, GCP, Azure, on-prem), teams already know how to operate it, and it's battle-tested at scale. In July 2025, we integrated Redis into LMCache's southbound storage tree. Then we set about making it as fast as possible.

‍

Why KV Cache Is Not a Typical Redis Workload

Redis typically handles small metadata payloads in the 1–10 KB range. KV cache chunks range from 500 KB to 40 MB. A single 10,000-token request on Llama 3.1 8B generates roughly 1.2 GB of KV cache data.

The optimization target is different too, as traditional Redis use cases optimize for latency and ops/s. For KV cache offloading, what matters is throughput in GB/s and tail latency — every chunk must arrive before inference can continue. We set a 2 GB/s retrieval SLO, the threshold where loading from Redis outpaces GPU prefill on modern datacenter hardware.

Why 2 GB/s? Attention during prefill is compute-bound at O(n²) in sequence length, while retrieval is bandwidth-bound at O(n). As context length grows, prefill cost grows faster than retrieval cost, widening the window where caching wins. The exact crossover depends on GPU generation and model size — a topic we plan to cover in a future post.

Starting with the standard redis-py SDK, we measured 0.278 GB/s. That's roughly 7x below our target.

‍

Validating Feasibility with memtier_benchmark

Before writing optimization code, we validated that the 2 GB/s target was achievable. Using memtier_benchmark (Redis's C-based benchmarking tool) with 4 MB payloads against a local Redis 8.2 instance:

5.85 GB/s on GET. The SLO was reachable, however, the client was the bottleneck. Our optimization work fell into three phases.

‍

Phase 1: Pure-Python Redis Client — 0.3 to 5 GB/s

Zero-copy protocol handling (0.3 → 2 GB/s)

We wrote a custom RESP protocol parser that enforces zero-copy on both send and receive paths. On sends, we use scatter-gather writes via sendmsg. On reads, recv_exactly_into writes directly into the caller's buffer but no intermediate copies or allocations.

This alone delivered a 7x improvement in retrieval throughput.

‍

Exploiting fixed-size KV cache chunks (2 → 4–5 GB/s)

LMCache chunks KV cache along the token dimension. Normally each chunk carries separate metadata recording its actual length, requiring two Redis operations per chunk. We made two changes:

Drop the last partial chunk, making all chunks a uniform fixed size. This eliminates metadata entirely, halving the number of Redis round trips. The dropped tokens are recomputed during the next prefill pass, which is negligible relative to the retrieval savings on the remaining chunks.
Fixed-size parsing: since the parser knows the payload size in advance, it skips RESP delimiter scanning across megabytes of data. Read the size header, assert it matches, read exactly that many bytes into the receive buffer.

At this point our pure-Python client matched memtier_benchmark's C-based throughput and we hadn't left GIL territory yet.

‍

Phase 2: Building the Python-C++ Boundary — 5 to 10 GB/s

vLLM and LMCache are Python applications, so the top-level interface must remain Python. The challenge was building a C++ client core while designing a synchronization boundary that minimizes Python-side scheduling overhead.

Why the GIL matters for I/O-bound LLM workloads

Profiling revealed that even with 8+ threads doing socket I/O, the GIL caused implicit serialization. Two I/O completions arriving simultaneously would contend, and no new jobs could dispatch while one was scheduling. Our pybind11 layer releases the GIL on the first line of every call:

Batched, tiled KV cache operations

KV cache retrieval is all-or-nothing as all chunks for a request must arrive before inference continues. The client exposes only batched operations (BATCH_TILE_GET, BATCH_TILE_SET, BATCH_TILE_EXISTS). Within a batch, work is tiled across threads with each thread receiving a contiguous slice of the key range:

Non-polling completion via Linux eventfd

Getting results back to Python without polling or blocking was the trickiest design problem. When the last tile completes, the C++ worker writes to a Linux eventfd. Python's event loop has a reader registered on that fd, so the kernel wakes Python avoiding waiting. Multiple tile completions coalesce into a single event; Python sees one wakeup per batch regardless of thread count.

Standalone result: 9–10 GB/s on a 128-vCPU bare-metal node with ConnectX-7 NICs — resulting in a 30x improvement over the original redis-py client. With full LMCache integration, end-to-end throughput reached 7 GB/s (the remaining gap is partly GPU ↔ DRAM transfer). On faster GPUs (H100, B200), the NIC demand increases to keep retrieval ahead of prefill — but longer context lengths also increase the compute saved, so the tradeoff remains favorable.

All optimizations are available in LMCache PR #2541.

‍

Phase 3: Cloud Tuning for Production Redis Deployments

Local throughput is meaningless if it collapses over the network. Moving to GCP introduced new bottlenecks:

Same AZ + VPC is non-negotiable — cross-AZ drops bandwidth 5–10x.
TCP stream count: bandwidth converges around 64 parallel streams at ~40 Gbps on GCP VMs. Matching client worker count to this achieves >90% of raw iperf3 throughput.
TCP buffer sizes: GCP defaults (~212 KB) are too small for multi-MB KV cache transfers.

After tuning, self-hosted Redis on GCP (same AZ/VPC) reaches 6 GB/s standalone. Redis Cloud currently reaches ~2.6 GB/s, partly because managed offerings don't yet expose the highest-tier networking options on GCP and AWS. We expect this gap to narrow as we co-tune with the Redis team across both platforms.

‍‍

End-to-End LLM Inference Benchmark Results

This led us to the question, does retrieving KV cache from Redis over the network beat recomputing on the GPU?

We ran a long-document QA workload (benchmark source) across two GCP VMs in the same VPC — an 8× A100 GPU node and a c2d-standard-112 Redis node — with 40 documents at 40,000 input tokens each, 100 output tokens, and a 66.7% cache hit rate:

Cache hit rate? This models a 3-round conversation. In multi-turn settings with N homogeneous rounds, KV cache reuse is (N−1)/N — so 67% is actually conservative for users chatting beyond 4 rounds. Real-world hit rates depend on prefix overlap patterns, but conversational and agentic workloads tend to be favorable for KV cache reuse.

Wondering whether KV cache offloading pays for itself on your workload? Try the Tensormesh Storage ROI Calculator.

‍

Lessons Learned Optimizing Redis for LLM Inference

Isolate phases. The standalone client, the Python-C++ integration, and cloud networking are three separate problems with different bottlenecks; solve them separately.

Define the right metric. Tail latency, not average latency. Throughput in GB/s, not ops/s.

Exploit workload invariants. KV cache chunks have a known, fixed size per batch. This one property unlocked aggressive parsing, eliminated metadata, and simplified the protocol.

The GIL matters for I/O. Profiling reveals it, don't assume otherwise.

Don't assume monotonicity. Throughput vs. chunk size is non-monotonic and machine-specific, it’s important to always sweep on target hardware.

General systems toolkit: batching, zero-copy, non-blocking, non-polling, tiling. We're generalizing these optimizations into a native southbound protocol in LMCache: PR #2642.

‍

What's Next for Tensormesh and LMCache

We’ve published results on multi-node P2P KV cache sharing with Tencent, and are continuing to expand cross-node caching capabilities.

On the infrastructure side, we're expanding cloud tuning to AWS, further optimizing Redis Cloud performance with the Redis team, and generalizing the high-performance client patterns across LMCache's storage backends.

Want to try LMCache with Redis? Check out the LMCache GitHub repo or reach out to the Tensormesh team.

‍

Recent Blog Posts

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.Lorem ipsum dolor sit amet.

Name

Position

July 1, 2026

Designing AI Infrastructure Products for Developers

Marketing, Enterprise and Partnerships

Read article

June 24, 2026

Persistent KV Cache: Own Your Context Caching Lifecycle

Sandro Mazziotta

Head of Product Management

Marketing, Enterprise and Partnerships

Read article

June 17, 2026

Fighting the Amnesia Tax: The Hidden Cost of Open-Weight LLM Serving

Sandro Mazziotta

Head of Product Management

Marketing, Enterprise and Partnerships

Read article

June 10, 2026

Run Open-Weight LLMs in Claude Code via Tensormesh Serverless Inference

Sandro Mazziotta

Head of Product Management

Marketing, Enterprise and Partnerships

Read article

June 2, 2026

Run Open-Weight LLMs in Your AI Agent with Codex CLI & Tensormesh Serverless Inference

Sandro Mazziotta

Head of Product Management

Marketing, Enterprise and Partnerships

Read article

May 28, 2026

Fixing AI's Most Expensive Problem — Junchen Jiang, Tensormesh CEO

CEO, Co-Founder

Read article

May 27, 2026

Tensormesh Raises $20M from Investors Including AMD Ventures, CoreWeave, NVentures, Launches Tensormesh Inference to Fix AI’s Most Expensive Problem

CEO, Co-Founder

Read article

May 20, 2026

KV Cache isn't just Cache, it's Memory: A Guide for LLM & Agent Devs

Software Engineer

Marketing, Enterprise and Partnerships

Read article

May 13, 2026

The AI Agent Metrics That Actually Matter: Beyond Tokens and Latency

Marketing, Enterprise and Partnerships

Read article

May 6, 2026

Tensormesh Inference: Cheaper LLM Inference for AI Agents

Sandro Mazziotta

Head of Product Management

Marketing, Enterprise and Partnerships

Read article

April 29, 2026

Agentic AI Inference Cost: How LLM Agent Loops Break Caching and Drain Your Budget

Marketing, Enterprise and Partnerships

Read article

April 28, 2026

Inside Tensormesh: Meet our CTO and Chief Scientist

Chief Scientist, Co-Founder

CTO, Co-Founder

Read article

April 22, 2026

Enterprise AI Vendor Lock-In: What It Costs When Your Provider Pulls Access

Marketing, Enterprise and Partnerships

Read article

April 15, 2026

Introducing Tensormesh Beta 2.2: Serverless Inference & $0 Cached Input Tokens

Marketing, Enterprise and Partnerships

Read article

February 25, 2026

Introducing Tensormesh Beta 2: One-Click LLM Deployment, New UI & Real-Time Cost Savings

Marketing, Enterprise and Partnerships

Read article

February 18, 2026

Agent Skills Caching with CacheBlend: Achieving 85% Cache Hit Rates for LLM Agents

Chief Scientist, Co-Founder

Read article

February 11, 2026

Beyond Prefix Caching: How Non-Prefix Caching Achieves 25x Better Hit Rates for AI Agents

Chief Scientist, Co-Founder

Read article

February 4, 2026

The Open Source Revolution: Why Open-Weight AI Models Are Redefining the Future

Marketing, Enterprise and Partnerships

Read article

January 28, 2026

LMCache's Production-Ready P2P Architecture: Powers Tensormesh's 5-10x Cost Reduction

Marketing, Enterprise and Partnerships

Read article

January 21, 2026

The Document Reprocessing Problem: How LLMs Waste 93% of Your GPU Budget

Marketing, Enterprise and Partnerships

Read article

January 15, 2026

Building Tensormesh: A conversation with the CEO (Junchen Jiang)

CEO, Co-Founder

Read article

January 7, 2026

The Hidden Metric That's Destroying Your AI Agent's Performance & Budget

Marketing, Enterprise and Partnerships

Read article

December 17, 2025

LMCache Storage ROI Calculator: When KV Cache Storage Reduces AI Inference Costs

Sandro Mazziotta

Head of Product Management

Read article

December 10, 2025

AI Inference Costs in 2025: The $255B Market's Energy Crisis and Path to Sustainable Scaling

Marketing, Enterprise and Partnerships

Read article

December 3, 2025

New Hugging Face Integration: Access 300,000+ AI Models with Real-Time Performance Monitoring

Marketing, Enterprise and Partnerships

Read article

November 26, 2025

The AI Inference Throughput Challenge: Scaling LLM Applications Efficiently

Marketing, Enterprise and Partnerships

Read article

November 19, 2025

Solving AI Inference Latency: How Slow Response Times Cost You Millions in Revenue

Marketing, Enterprise and Partnerships

Read article

November 13, 2025

GPU Cost Crisis: How Model Memory Caching Cuts AI Inference Costs Up to 10×

Marketing, Enterprise and Partnerships

Read article

October 23, 2025

Tensormesh Emerges From Stealth to Slash AI Inference Costs and Latency by up to 10x

CEO, Co-Founder

Read article

October 21, 2025

Comparing LLM Serving Stacks: Introduction to Tensormesh Benchmark

Software Engineer

Read article