The benchmarking tool is part of our command-line utility, tmesh.
Right now, it includes a single entry point—tmesh-cli benchmark—but it will soon expand into a full CLI and SDK toolkit for using and managing Tensormesh clusters.
To install it, simply:
# pip install tmesh
The tool is invoked using the tmesh-cli command as follows:
tmesh-cli benchmark \
--endpoint "<YOUR_OPEN_AI_API_ENDPOINT>" \
--api-key "<OPTIONAL_API_KEY>"
Example: # tmesh-cli benchmark --endpoint "http://192.168.111.29:30080/" --api-key "vllm_sk_555a1b7ff3e0f617b1240300000000018075f66c9"
The model at the endpoint will automatically be discovered and the benchmark will run until you stop it.
The first message you will receive is the result of the discovery:
endpoint: http://192.168.111.29:30080/v1/chat/completions
api_key: vllm_sk_33fb64fbde9c281dc3d5a0000088403d942c7fc
normalized endpoint: http://192.168.111.29:30080/v1/
found model: openai/gpt-oss-20b
offload_size: 100
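The "normalized endpoint" line above suggests the tool strips any path from the URL you pass in and rebuilds an OpenAI-style base URL before discovering the model. The exact logic is not shown in the output, so the helper below is a hypothetical sketch of that normalization step:

```python
from urllib.parse import urlparse

def normalize_endpoint(endpoint: str) -> str:
    """Strip any path from the endpoint and append the OpenAI-style /v1/ prefix.

    Hypothetical helper: the actual normalization logic inside
    tmesh-cli is not shown in its output.
    """
    parsed = urlparse(endpoint)
    return f"{parsed.scheme}://{parsed.netloc}/v1/"

# Model discovery would then be a GET against the OpenAI-compatible
# /v1/models route, e.g.:
#   requests.get(normalize_endpoint(endpoint) + "models",
#                headers={"Authorization": f"Bearer {api_key}"})
```

With this, both "http://192.168.111.29:30080/" and "http://192.168.111.29:30080/v1/chat/completions" normalize to the same "http://192.168.111.29:30080/v1/" base.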
This is followed by a reminder that you will need to stop the process yourself (with Ctrl-C, for example):
NOTE: tmesh-cli benchmark will run forever until you interrupt the process.
And then a definition of the configuration for the synthetic workload that will be generated:
Workload Specifications
Model: openai/gpt-oss-20b
Number of Contexts: 61
Number of Questions per Context: 61
Max Inflight Requests (Load-Balancing): 20
Input Length: 32000
Output Length: 100
Which in this case is going to continually send 61 long contexts (Number of Contexts) that will have 61 randomly generated questions (Number of Questions per Context) appended to them over time.
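The contexts-times-questions structure described above can be sketched as follows; make_workload and the lorem-filler prompts are illustrative stand-ins, not the tool's actual generator:

```python
import random

def make_workload(n_contexts: int, n_questions: int, input_len: int):
    """Build (context + question) prompts for the synthetic workload.

    Hypothetical sketch: each of the n_contexts long contexts is paired
    with n_questions randomly generated questions appended to it.
    Filler words stand in for the roughly input_len-token contexts.
    """
    rng = random.Random(0)
    contexts = [f"context-{i} " + "lorem " * input_len for i in range(n_contexts)]
    prompts = []
    for ctx in contexts:
        for _ in range(n_questions):
            question = f"question-{rng.randrange(10**6)}"
            prompts.append(ctx + "\n" + question)
    return prompts

workload = make_workload(n_contexts=61, n_questions=61, input_len=8)
# 61 contexts x 61 questions = 3721 distinct requests to cycle through.
```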
The workload is designed to stress-test the KV cache offloading buffer. The token context length (Input Length) and the token output length (Output Length) are hardcoded.

The tool sends requests continuously in a tiling pattern, cycling through the contexts and their appended questions.
The tool will display the following every 5 seconds until you stop it:
Elapsed Time: 5.007764101028442
Total Number of Requests Processed: 24
QPS: 4.792558019071052
Global Average TTFT: 1.9451230665047963
Global Average ITL: 0.0025131286119239667
Global Average Prefill Throughput: 46750.16469702823
Global Average Decode Throughput: 2702.9887946346635
Requests Processed in Last 5 second Interval: 24
Interval Average TTFT: 1.9451230665047963
Interval Average ITL: 0.0025131286119239667
Interval Average Prefill Throughput: 46750.16469702823
Interval Average Decode Throughput: 2702.9887946346635
Elapsed Time: 10.008518934249878
Total Number of Requests Processed: 74
QPS: 7.393701354429838
Global Average TTFT: 1.1941296603228595
Global Average ITL: 0.0034814370141559065
Global Average Prefill Throughput: 81783.8627181991
Global Average Decode Throughput: 1513.3255635090504
Requests Processed in Last 5 second Interval: 50
Interval Average TTFT: 0.8336528253555298
Interval Average ITL: 0.003946225047227238
Interval Average Prefill Throughput: 98600.03776836112
Interval Average Decode Throughput: 942.2872125687559
Where:
- Elapsed Time is the wall-clock time since the benchmark started, in seconds.
- QPS is the number of requests processed per second.
- TTFT is the time to first token, in seconds.
- ITL is the inter-token latency (the average gap between consecutive output tokens), in seconds.
- Prefill Throughput and Decode Throughput are measured in tokens per second.
- Global averages cover the whole run; Interval averages cover only the last 5-second window.
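These metrics follow the usual LLM-serving definitions. The sketch below shows one way to compute them from per-token timestamps; tmesh's exact formulas are not published in this output, so treat these as assumed definitions:

```python
def request_metrics(send_ts, token_ts, n_input_tokens):
    """Compute TTFT, ITL, and prefill/decode throughput for one request.

    Assumed definitions (the tool's exact formulas are not shown):
      TTFT = first-token timestamp minus send timestamp (seconds)
      ITL  = mean gap between consecutive output tokens (seconds)
      prefill throughput = input tokens / TTFT (tokens/s)
      decode throughput  = remaining output tokens / decode window (tokens/s)
    """
    ttft = token_ts[0] - send_ts
    gaps = [b - a for a, b in zip(token_ts, token_ts[1:])]
    itl = sum(gaps) / len(gaps)
    prefill_tput = n_input_tokens / ttft
    decode_tput = (len(token_ts) - 1) / (token_ts[-1] - token_ts[0])
    return ttft, itl, prefill_tput, decode_tput
```

For example, a request sent at t=0 whose 4 output tokens arrive at t=2.0, 2.1, 2.2, 2.3 with a 32000-token input gives a TTFT of 2.0 s and a prefill throughput of 16000 tokens/s.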
If you want to compare two deployments, ideally you would run the benchmark on two identical models deployed on the same number and type of GPU (ideally from the same GPU provider). Let the benchmark run for the same amount of time (at least a few minutes) on each instance, then compare the global average metrics between the two runs.
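To make the comparison concrete, a small helper can report the percent change between the two runs' global averages. The numbers below are illustrative placeholders, not real benchmark results:

```python
def percent_diff(baseline: float, candidate: float) -> float:
    """Percent change from baseline to candidate (negative means lower)."""
    return 100.0 * (candidate - baseline) / baseline

# Illustrative global averages from two hypothetical runs:
run_a = {"Global Average TTFT": 1.19, "Global Average Decode Throughput": 1513.3}
run_b = {"Global Average TTFT": 0.83, "Global Average Decode Throughput": 942.3}
for key in run_a:
    print(f"{key}: {percent_diff(run_a[key], run_b[key]):+.1f}%")
```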
The full documentation for the tool can be found at docs.tmesh.ai. Let us know what you think about this tool and how we could improve it!