Fighting the Amnesia Tax: The Hidden Cost of Open-Weight LLM Serving

Tensormesh is an AI inference optimization company that never charges you twice for cached tokens, making AI applications faster and dramatically cheaper to run anywhere.

A recent piece making the rounds, Sumit Pandey on Claude Opus 4.8 versus DeepSeek V4-Pro, landed on a conclusion most of the industry has been circling for six months. The open-weight tier has caught up. DeepSeek V4-Pro posts 80.6% on SWE-bench Verified, ties Gemini 3.1 Pro, and ships under an MIT license. Kimi K2.6 clears 80%. Qwen3.6 Plus is at 78.8%. The decision against the closed frontier is no longer about whether open weights are good enough. For most workloads, they are.

The decision is about whether you can actually capture the savings.

The headline cost gap Pandey draws, DeepSeek V4-Pro at $0.435 / $0.87 per million tokens versus Claude Opus 4.8 at $5 / $25, is real on the API price card. It is not what you will pay in production. The gap between published per-token cost and effective per-request cost is where most open-source AI economics quietly fall apart. Closing it is the entire reason we built Tensormesh.

What the per-token price hides

Every production AI workload has a structural problem the leaderboards never measure. The same context shows up on every request.

A coding agent re-sends the system prompt, the tool definitions, and a chunk of the relevant codebase on each turn. A RAG application re-sends the retrieved documents, often near-identical to the last query, every time the user asks for a follow-up. A document-processing pipeline re-sends the same 200-page contract for every clause it analyzes. A multi-step agent re-feeds its own conversation history on every step of the loop.

Each of those requests, the model processes the same tokens it processed thirty seconds ago. It does the same attention computation. It loads the same intermediate activations. It fills the same KV cache from scratch. On a busy multi-tenant system, that cache is under constant eviction pressure. New user contexts displace older ones, GPU KV memory fills, replicas cycle, so by the time a related request arrives, the entry it would have reused is usually gone.

We call this the Amnesia Tax. It is the single largest line item in most production LLM deployments, and it is invisible in the per-token price.

For high-context workloads (agents, RAG, long-running pipelines) the Amnesia Tax routinely accounts for 60 to 90 percent of compute. The model is doing the same work, on the same tokens, repeatedly. Even if every token costs $0.435 per million, the bill compounds when you are paying to process those tokens forty times in an hour.

DeepSeek themselves price cached input at $0.0036 per million tokens, about 120x cheaper than their own cache-miss rate. That number tells you everything. The vendor knows the marginal cost of reusing a token is near zero. The question is whether you can reuse them in your own infrastructure, on the model you actually want to run.

Why open-source serving makes this worse, not better

The closed APIs have started solving this for you. Anthropic's prompt caching, OpenAI's automatic cache, DeepSeek's cached-input pricing are all attempts to give back some of the Amnesia Tax to customers who would otherwise notice. They work because the vendor controls the entire stack.

When you run open weights yourself, you inherit the problem and lose the solution. You spin up vLLM on a couple of H100s, you point your agent at it, and you watch your GPU utilization graph. What you see is not encouraging. The cards are busy. They are also doing the same work, over and over, for every active session.

This is the gap that quietly kills open-source serving projects. The team migrates from Claude to DeepSeek to capture the 28x output-token savings. Six weeks later, they realize their effective cost per agent-task only fell by 3x, because their workload is context-heavy and their infrastructure has no memory between requests. The price comparison that drove the migration was real on paper and partially fictional in production.

You can fix this, but fixing it is not trivial. Distributed KV-cache sharing across vLLM nodes, eviction policies that survive multi-tenancy, persistence across container restarts, fast cache transfer over the network are all infrastructure problems that have nothing to do with the model. They are the reason a working open-source serving stack takes a team of inference engineers six months to build, and another six to make production-grade.

How Tensormesh removes the Amnesia Tax

Tensormesh Inference is open-weight serving with the Amnesia Tax removed.

You point your OpenAI-compatible client at our endpoint, pick a model from the supported lineup, and ship. The first time a chunk of context shows up (your system prompt, your tool definitions, your retrieved document, your conversation history) we process it and cache the KV state. Every subsequent request that includes that same context reuses the cache. The repeated tokens become free.

Not cheaper. Free. Cached input tokens are $0 on Tensormesh.

Open-weight models you can run today

Serverless, pay-per-token, no setup. Every model below ships with $0 cached-input pricing and works behind an OpenAI-compatible endpoint, including drop-in compatibility with Codex CLI and Claude Code, which is the practical answer to Pandey's "GPT-5.5 still owns the terminal" point. You can run an open-weight coding agent through the same harness, with the cache savings on top.A few notes on the lineup that matter for the open-vs-closed framing.

โ€

DeepSeek-V4-Flash is the smaller, faster sibling to the V4-Pro that drove Pandey's headline numbers. For most agent workloads, where you want the V4 architecture's reasoning behavior without paying for V4-Pro's full compute envelope, Flash is the practical pick, and at $0.14 / $0.28 it undercuts both V4-Pro's published rate and every closed-frontier API by an order of magnitude.

Qwen3-Coder-30B-A3B-Instruct is the open-weight answer to the "GPT-5.5 owns terminal-bench" argument. Pair it with Codex CLI through our endpoint, cache the system prompt and tool definitions, and you get a coding agent that runs at $0.15 input / $0.60 output with cached context free. The reliability ceiling is lower than GPT-5.5. The cost ceiling is roughly 30x lower.

OpenAI's own gpt-oss release matters here too. gpt-oss-120b at $0.15 / $0.60 is the strongest signal yet that the closed labs are conceding the open-weight tier on cost. If you wanted the same family of training behavior at a fraction of the price, this is now a serverless option.

The engine underneath is LMCache, the open-source KV-cache layer we built and open-sourced. It handles distributed cache sharing across serving nodes, persistence across model instances, intelligent eviction, and fast cache transfer over GDS. Customers integrating LMCache with NVIDIA GPUDirect Storage have reported up to 41x reductions in time-to-first-token on long-context workloads. The same technology powers our hosted serverless platform and, soon, the on-premise Tensormesh Operator for teams that need to run models inside their own perimeter.

Here is what this means for the kind of workload Pandey was writing about.

If you are running a coding agent against an open model, the system prompt and tool definitions hit cache after the first request. Each subsequent turn pays only for the new user input and the new generated output. Effective input cost drops by roughly 60 to 85% on typical agent traces. Our own benchmark reaches 85% cache hit rate on agent skill workloads, and independent evaluations of long-horizon agentic tasks land in the 41 to 80% range across configurations. If you are running RAG on a corpus that gets queried repeatedly, the document chunks hit cache the second time they appear in a prompt. Effective input cost on hot documents trends toward zero.

If you are running long-horizon agents, the exact use case where Pandey argues Opus 4.8's reliability premium is worth paying, you no longer have to choose. You can run an open model with appropriate scaffolding, capture the headline cost savings, and also capture the structural savings from not re-processing your agent's growing context window on every step.

Open weights vs closed frontier

The decision framework Pandey lands on, open weights for cost-sensitive work, closed frontier for long-horizon agents, is correct under one assumption. That serving open weights costs what the published per-token price implies. Once you account for the Amnesia Tax, the math shifts.

A representative agent workload we benchmark internally (coding agent, ~8K-token system context, 20-turn average trace, mid-size open model) looks like this on a naive vLLM deployment versus Tensormesh.

That last row is the one closed-frontier vendors actually earn their premium on, and we are not going to pretend caching solves it. If you need Opus 4.8's reliability behavior in a multi-hour autonomous loop, you pay for Opus 4.8. The case Pandey makes there is correct.

But for the 80% of workloads that are not multi-hour autonomous loops, the agents that need to be reliable for a turn, not for a day; the RAG pipelines; the document processors; the high-volume batch generation, the open-weight tier wins on capability and on cost, if your serving layer captures the savings the per-token price advertises.

The decision, restated

The frontier did not pull away in 2026, it got narrow and specialized. Open weights win for most production workloads. The remaining question is not which model, that has been answered for the majority of use cases. It is whether your serving infrastructure converts the published price advantage into a real one.

Building that infrastructure in-house is six engineers and twelve months. Running on a stack that already delivers this is an API endpoint change.

If you are evaluating open-source serving against a closed-API baseline and the numbers do not quite work, the gap is almost certainly your cache hit rate. Run your workload on Tensormesh for a week. Measure tokens per task and dollars per task, not per-token price. The math gets a lot more honest.

Try Tensormesh and claim your free credits.

Recent Blog Posts

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.Lorem ipsum dolor sit amet.

Name

Position
June 10, 2026

Run Open-Weight LLMs in Claude Code via Tensormesh Serverless Inference

Read article

June 2, 2026

Run Open-Weight LLMs in Your AI Agent with Codex CLI & Tensormesh Serverless Inference

Read article

May 28, 2026

Fixing AI's Most Expensive Problem โ€” Junchen Jiang, Tensormesh CEO

Read article

May 27, 2026

Tensormesh Raises $20M from Investors Including AMD Ventures, CoreWeave, NVentures, Launches Tensormesh Inference to Fix AIโ€™s Most Expensive Problem

Read article

May 20, 2026

KV Cache isn't just Cache, it's Memory: A Guide for LLM & Agent Devs

Read article

May 13, 2026

The AI Agent Metrics That Actually Matter: Beyond Tokens and Latency

Read article

May 6, 2026

Tensormesh Inference: Cheaper LLM Inference for AI Agents

Read article

April 29, 2026

Agentic AI Inference Cost: How LLM Agent Loops Break Caching and Drain Your Budget

Read article

April 28, 2026

Inside Tensormesh: Meet our CTO and Chief Scientist

Read article

April 22, 2026

Enterprise AI Vendor Lock-In: What It Costs When Your Provider Pulls Access

Read article

April 15, 2026

Introducing Tensormesh Beta 2.2: Serverless Inference & $0 Cached Input Tokens

Read article

April 8, 2026

How We Optimized Redis for LLM KV Cache: 0.3 GB/s to 10 GB/s

Read article

February 25, 2026

Introducing Tensormesh Beta 2: One-Click LLM Deployment, New UI & Real-Time Cost Savings

Read article

February 18, 2026

Agent Skills Caching with CacheBlend: Achieving 85% Cache Hit Rates for LLM Agents

Read article

February 11, 2026

Beyond Prefix Caching: How Non-Prefix Caching Achieves 25x Better Hit Rates for AI Agents

Read article

February 4, 2026

The Open Source Revolution: Why Open-Weight AI Models Are Redefining the Future

Read article

January 28, 2026

LMCache's Production-Ready P2P Architecture: Powers Tensormesh's 5-10x Cost Reduction

Read article

January 21, 2026

The Document Reprocessing Problem: How LLMs Waste 93% of Your GPU Budget

Read article

January 15, 2026

Building Tensormesh: A conversation with the CEO (Junchen Jiang)

Read article

January 7, 2026

The Hidden Metric That's Destroying Your AI Agent's Performance & Budget

Read article

December 17, 2025

LMCache Storage ROI Calculator: When KV Cache Storage Reduces AI Inference Costs

Read article

December 10, 2025

AI Inference Costs in 2025: The $255B Market's Energy Crisis and Path to Sustainable Scaling

Read article

December 3, 2025

New Hugging Face Integration: Access 300,000+ AI Models with Real-Time Performance Monitoring

Read article

November 26, 2025

The AI Inference Throughput Challenge: Scaling LLM Applications Efficiently

Read article

November 19, 2025

Solving AI Inference Latency: How Slow Response Times Cost You Millions in Revenue

Read article

November 13, 2025

GPU Cost Crisis: How Model Memory Caching Cuts AI Inference Costs Up to 10ร—

Read article

October 23, 2025

Tensormesh Emerges From Stealth to Slash AI Inference Costs and Latency by up to 10x

Read article

October 21, 2025

Comparing LLM Serving Stacks: Introduction to Tensormesh Benchmark

Read article