Watch the full video here: https://www.youtube.com/watch?v=zHW4Zzd7pjI
“We are building the best infrastructure for the future Big Data of AI... We believe KV cache is the new breed of Big Data for the next era.”
1. What’s something the public still overlooks about the LLM inference infrastructure challenge?
There's a general misconception about AI cost and AI performance: it seems like all that matters is bigger GPUs and bigger models. People often overlook the power of storage and memory.
In particular, in the AI era, it is the AI model's internal memory that really matters. The technical term for that internal memory is the KV cache.
Now the immediate benefit of having the KV cache around is that it allows AI systems to trade storage for faster and cheaper inference by skipping a massive amount of GPU compute. This is what our company Tensormesh and our open-source project LMCache are about. We're here to build the foundational infrastructure that manages, distributes, stores, and optimizes KV caches for AI models at scale.
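To make that concrete, here is a minimal, purely conceptual sketch, not LMCache's actual API, of what a KV-cache store does: it keys the attention keys and values the model has already computed by the token prefix that produced them, so a later request with the same prefix can skip that prefill work on the GPU.

```python
import hashlib
from typing import Dict, Optional, Tuple

import torch

# Purely illustrative in-process store. A real system such as LMCache adds
# tiering across GPU/CPU memory and disk, eviction, compression, and
# distribution across nodes, all omitted here.
class KVCacheStore:
    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[torch.Tensor, ...]] = {}

    @staticmethod
    def _key(prefix_token_ids: Tuple[int, ...]) -> str:
        # Hash the token-id prefix so identical prefixes map to the same entry.
        return hashlib.sha256(repr(prefix_token_ids).encode()).hexdigest()

    def put(self, prefix_token_ids: Tuple[int, ...],
            kv_tensors: Tuple[torch.Tensor, ...]) -> None:
        # Trade storage for compute: keep this prefix's keys/values around.
        self._entries[self._key(prefix_token_ids)] = kv_tensors

    def get(self, prefix_token_ids: Tuple[int, ...]) -> Optional[Tuple[torch.Tensor, ...]]:
        # On a hit, the caller can skip the GPU prefill for this prefix and
        # only compute attention for the new tokens that follow it.
        return self._entries.get(self._key(prefix_token_ids))
```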
We're doing this because we believe the KV cache is the new breed of big data for the next era. Just as the 2010s were the decade of traditional big data for businesses, essentially human-readable data, the next decade will be dominated by this new breed of big data, the KV cache, consumed by AI models, and Tensormesh is in the best position to realize that vision.
2. When did it click? How did you first realize that KV caching was a foundational shift in how LLM inference should work?
For more than a decade, our research team had been working on how to build large-scale network systems for AI and multimedia data. So when ChatGPT became popular, we immediately realized that AI systems should be built around the KV cache, this new kind of AI-native data, and that this would be key to making LLM inference faster, cheaper, and smarter. This prediction was based on two unique properties of the KV cache.
First, KV caches are directly reusable by any Transformer-based large language model. Reusing them lets the model skip a massive amount of GPU compute, and you cannot get this benefit by caching any other kind of data, because the KV cache is the AI model's internal memory. Second, and more interesting than saving GPU cycles, the KV cache also embeds the model's internal understanding of new knowledge.
This makes the KV cache a perfect window into how a model understands the relationships between the tokens in its input and output, and it even makes it possible to shape that output.
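As a small, hedged illustration of the first property, the sketch below uses the Hugging Face Transformers API rather than LMCache itself: the KV cache produced while reading a document comes back as `past_key_values`, and feeding it into a later forward pass answers a follow-up question without reprocessing the document.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal Transformer works; gpt2 is used here only because it is small.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

document = "Q3 revenue grew 12% year over year while operating costs fell 4%."
doc_ids = tok(document, return_tensors="pt").input_ids

# Prefill the document once and keep its KV cache (the model's internal memory).
with torch.no_grad():
    doc_out = model(doc_ids, use_cache=True)
kv_cache = doc_out.past_key_values

# A follow-up question reuses the cached keys/values instead of re-reading the document.
question_ids = tok(" What happened to operating costs?", return_tensors="pt").input_ids
with torch.no_grad():
    q_out = model(question_ids, past_key_values=kv_cache, use_cache=True)

next_token_logits = q_out.logits[:, -1, :]  # conditioned on document + question
```

In a serving engine the same idea applies per request; the difference is where the cache lives and how it is moved between machines, which is exactly the layer LMCache targets.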
Now, we came to this realization gradually, over time. But because we started much earlier than everyone else, we have better technologies in reserve. For example, about a year and a half ago we open-sourced LMCache, a project that lets people store and manage KV caches. At the beginning the reaction was lukewarm because people didn't see the point, but over the last few months adoption has exploded as people started to see what we have been seeing all along.
3. Tensormesh sits at the intersection of academia, open source, and industry. What’s unique about building in that space, and why do you consider it a winning combination?
Building a successful open-source company is actually more challenging than people expect. The key is to have both a deep, long-term technical vision and a talented team that believes in that vision and builds a product to realize it. You need that long-term, deep technical vision because there has to be enough headroom for both the open-source project and the product to succeed.
We are very fortunate to have both at Tensormesh, a company at the intersection of open source, academia, and industry. We have a grand vision: building the best infrastructure for the future big data of AI, based on the AI model's native memory, the KV cache.
More importantly, we also have a team of brilliant engineers and researchers who believe in this vision. Before starting the company, we already had ten years of research in AI systems and almost three years in KV-cache systems. This unique combination puts us in a strong position to win.
4. Many companies scaling AI are struggling with huge increases in inference costs, sometimes reaching millions of dollars per year. How do LMCache and Tensormesh change that equation, and what economic relief do they bring to the industry?
Most enterprise use cases of LLMs these days are about processing long-context data, such as a long document, a long code repository, or a long video. This happens across many industries: finance, legal, retail, and of course tech.
But if you break down the GPU cost of these long-context queries, most of the GPU cycles, more than 90% of them, are spent processing the long context itself. For example, imagine you're a financial analyst who needs to ask multiple questions about a long financial document. If you look at the LLM's GPU cost, more than 90% of it goes to just processing that long document.
Only a fraction of the GPU cost goes to answering your question. It is of course wasteful to let the model forget what it has learned from the document and reprocess it every time, but that's what a lot of inference providers have to do today.
Now if you store the KV cache of this long document and reuse it every time, you can skip reprocessing the document altogether and save that 90% of the GPU cost right there. This cost saving is not unique to financial documents, of course; it applies equally to long code repositories, long chat histories, and long video inputs.
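As a rough back-of-envelope illustration, with made-up numbers and token counts as a crude proxy for prefill compute, reusing the document's KV cache across repeated questions lands in the range that 90% figure suggests:

```python
# Hypothetical numbers, for illustration only.
doc_tokens = 100_000       # one long financial report
question_tokens = 300      # one analyst question
num_questions = 20         # questions asked against the same report

# Without KV-cache reuse: the document is prefilled again for every question.
prefill_without_cache = num_questions * (doc_tokens + question_tokens)

# With KV-cache reuse: the document is prefilled once, then only questions are processed.
prefill_with_cache = doc_tokens + num_questions * question_tokens

saved = 1 - prefill_with_cache / prefill_without_cache
print(f"Prefill tokens without cache: {prefill_without_cache:,}")  # 2,006,000
print(f"Prefill tokens with cache:    {prefill_with_cache:,}")     # 106,000
print(f"Prefill compute saved:        {saved:.1%}")                # ~94.7%
```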
5. What did LMCache joining the PyTorch Foundation represent?
We specialize in building one specific component of the overall AI infrastructure. That means our system has to integrate with many other components, such as inference engines, various storage vendors, and of course cloud providers.
There are two ways to fit into that ecosystem. One is to build a closed-source system and integrate it with everything ourselves, which of course would be very costly for us.
Instead, we open-sourced the LMCache library because we believe in the power of a vibrant open-source community. Today, we have over 100 industry contributors to this open-source project. This ensures that our system always has smooth, first-class integration with the various vendors, and it allows us to focus on what we're best at: how to store, distribute, and optimize KV caches smartly.
That's why LMCache joining the PyTorch Foundation makes perfect sense for us. We are among a number of companies that have donated their projects to the open-source community, and we believe this will be a big part of the AI of tomorrow.



