Comparing LLM Serving Stacks: Introduction to Tensormesh Benchmark

How do I try the tool?

The benchmarking tool is part of our command-line utility, tmesh.
Right now, it includes a single entry point—tmesh-cli benchmark—but it will soon expand into a full CLI and SDK toolkit for using and managing Tensormesh clusters.

To install it:

# pip install tmesh

Using the benchmarking tool

The tool is invoked using the tmesh-cli command as follows:

tmesh-cli benchmark \
--endpoint "<YOUR_OPENAI_API_ENDPOINT>" \
--api-key "<OPTIONAL_API_KEY>"

Example:

# tmesh-cli benchmark --endpoint "http://192.168.111.29:30080/" --api-key "vllm_sk_555a1b7ff3e0f617b1240300000000018075f66c9"

The model served at the endpoint is discovered automatically, and the benchmark runs until you stop it.

The first message you will receive is the result of the discovery:

endpoint: http://192.168.111.29:30080/v1/chat/completions
api_key: vllm_sk_33fb64fbde9c281dc3d5a0000088403d942c7fc
normalized endpoint: http://192.168.111.29:30080/v1/
found model: openai/gpt-oss-20b
offload_size: 100

This is followed by a reminder that you will need to stop the process yourself (with Ctrl-C, for example):

NOTE: tmesh-cli benchmark will run forever until you interrupt the process.

Then comes the configuration of the synthetic workload that will be generated:

Workload Specifications
Model: openai/gpt-oss-20b
Number of Contexts: 61
Number of Questions per Context: 61
Max Inflight Requests (Load-Balancing): 20
Input Length: 32000
Output Length: 100

In this case, the tool will continually send 61 long contexts (Number of Contexts), each with one of 61 randomly generated questions (Number of Questions per Context) appended to it over time.

The workload is designed to stress-test the KV cache offloading buffer by:

  • Creating multiple long contexts (32k tokens each)
  • Reusing contexts with different questions
  • Managing concurrent requests with load balancing
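The shape of that workload can be sketched as follows. This is purely illustrative: the names, the pseudo-word generator, and the structure are assumptions, not the tool's actual code.

```python
import random
import string

# Workload parameters taken from the sample run above.
NUM_CONTEXTS = 61
QUESTIONS_PER_CONTEXT = 61
INPUT_LENGTH = 32_000  # prompt length in pseudo-words, standing in for tokens

random.seed(0)
# Small pool of pseudo-words sampled repeatedly, as a cheap stand-in for tokens.
POOL = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(1000)]

def random_words(n: int) -> str:
    return " ".join(random.choices(POOL, k=n))

# One long shared prefix per context; only the short question at the end varies,
# so repeated requests can reuse the cached KV entries of the prefix.
contexts = [random_words(INPUT_LENGTH) for _ in range(NUM_CONTEXTS)]
questions = [f"Q{i}: {random_words(8)}?" for i in range(QUESTIONS_PER_CONTEXT)]

def make_prompt(ctx_id: int, q_id: int) -> str:
    return contexts[ctx_id] + "\n\n" + questions[q_id]
```

The key property is that each 32k-token context is reused verbatim across many requests, while the trailing question changes, which is exactly the access pattern that rewards KV-cache reuse.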

Two settings are currently hardcoded: the context length in tokens (Input Length) and the output length in tokens (Output Length).

Interpreting the results

The tool sends requests continuously using a tiling pattern:

  • Cycles through all contexts sequentially
  • Appends random questions to each context
  • Maintains max inflight requests for load balancing
  • Maximizes cache evictions to test offloading
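A minimal sketch of that tiling loop, assuming an asyncio semaphore for the inflight cap. Here send_request is a placeholder for the real HTTP call to the chat-completions endpoint, and the structure is an illustration, not the tool's implementation.

```python
import asyncio
import itertools
import random

MAX_INFLIGHT = 20  # matches "Max Inflight Requests (Load-Balancing)" above

async def send_request(ctx_id: int, q_id: int) -> None:
    """Stand-in for the real POST to the /v1/chat/completions endpoint."""
    await asyncio.sleep(0.001)  # simulated network + inference latency

async def tiling_loop(num_contexts: int, num_questions: int, total: int) -> int:
    sem = asyncio.Semaphore(MAX_INFLIGHT)  # caps concurrent in-flight requests
    sent = 0

    async def worker(ctx_id: int, q_id: int) -> None:
        async with sem:
            await send_request(ctx_id, q_id)

    tasks = []
    # Cycle through all contexts sequentially; rotating over every context
    # before repeating one is what maximizes KV-cache evictions.
    for ctx_id in itertools.islice(itertools.cycle(range(num_contexts)), total):
        q_id = random.randrange(num_questions)  # append a random question
        tasks.append(asyncio.create_task(worker(ctx_id, q_id)))
        sent += 1
    await asyncio.gather(*tasks)
    return sent

sent = asyncio.run(tiling_loop(num_contexts=61, num_questions=61, total=200))
```

The semaphore keeps exactly up to 20 requests in flight at any moment, while the itertools.cycle rotation guarantees every context is touched before any context repeats.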

The tool prints the following report every 5 seconds until you stop it:

Elapsed Time: 5.007764101028442
Total Number of Requests Processed: 24
QPS: 4.792558019071052
Global Average TTFT: 1.9451230665047963
Global Average ITL: 0.0025131286119239667
Global Average Prefill Throughput: 46750.16469702823
Global Average Decode Throughput: 2702.9887946346635
Requests Processed in Last 5 second Interval: 24
Interval Average TTFT: 1.9451230665047963
Interval Average ITL: 0.0025131286119239667
Interval Average Prefill Throughput: 46750.16469702823
Interval Average Decode Throughput: 2702.9887946346635

Elapsed Time: 10.008518934249878
Total Number of Requests Processed: 74
QPS: 7.393701354429838
Global Average TTFT: 1.1941296603228595
Global Average ITL: 0.0034814370141559065
Global Average Prefill Throughput: 81783.8627181991
Global Average Decode Throughput: 1513.3255635090504
Requests Processed in Last 5 second Interval: 50
Interval Average TTFT: 0.8336528253555298
Interval Average ITL: 0.003946225047227238
Interval Average Prefill Throughput: 98600.03776836112
Interval Average Decode Throughput: 942.2872125687559

Where:

  • Elapsed Time is the number of seconds since the tool was started
  • Total Number of Requests Processed is the total number of requests completed by the LLM
  • QPS is the number of queries per second served so far, equal to Total Number of Requests Processed / Elapsed Time. It should grow progressively over time if caching is effective
  • Global Average TTFT is the average time (in seconds) to receive the first token. This number should decrease over time if caching is properly configured
  • Global Average ITL is the average inter-token latency, i.e. the pause between two consecutive tokens. It should stabilize after about 20 seconds
  • Global Average Prefill Throughput is the average number of bytes processed per second during prefill
  • Global Average Decode Throughput is the average number of bytes processed per second during decode
  • The Interval metrics (starting with Requests Processed in Last 5 second Interval) report the same statistics, computed over the last 5-second window only
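The global figures are internally consistent and easy to cross-check. For instance, from the two sample reports above:

```python
# Figures copied from the second sample report (t = 10s).
elapsed = 10.008518934249878
total_requests = 74

# QPS is simply total requests divided by elapsed time.
qps = total_requests / elapsed
assert abs(qps - 7.393701354429838) < 1e-9  # matches the reported QPS

# The interval count is the difference between consecutive totals: 74 - 24.
interval_requests = 74 - 24
assert interval_requests == 50  # matches "Requests Processed in Last 5 second Interval"
```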

If you want to compare two deployments, ideally run the benchmark against two identical models deployed on the same number and type of GPUs (ideally from the same GPU provider). Let the benchmark run for the same amount of time (at least a few minutes) on each instance. The two most relevant numbers are:

  • QPS (the higher the better)
  • Global Average TTFT (the lower the better)
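With two such runs in hand, the comparison reduces to two ratios. The numbers below are made up for illustration:

```python
# Hypothetical results from two equal-length benchmark runs (invented numbers).
baseline = {"qps": 7.39, "ttft_s": 1.19}   # e.g. a vanilla deployment
candidate = {"qps": 11.2, "ttft_s": 0.45}  # e.g. with KV-cache offloading enabled

qps_gain = candidate["qps"] / baseline["qps"]            # higher is better
ttft_cut = 1 - candidate["ttft_s"] / baseline["ttft_s"]  # lower TTFT is better
print(f"QPS improvement: {qps_gain:.2f}x, TTFT reduction: {ttft_cut:.0%}")
```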

Best Practices

  1. Warm-up period: Ignore first 30-60 seconds of metrics (cold start effects)
  2. Steady state: Look at metrics after several full context rotations
  3. Compare intervals: Watch for performance degradation over time
  4. Resource monitoring: Monitor CPU, memory, GPU usage alongside tmesh metrics
  5. Network stability: Run from stable network connection for accurate latency measurements

The full documentation for the tool can be found at docs.tmesh.ai. Let us know what you think about this tool and how we could improve it!
