Briefing

Benchmarking MTP on vLLM and llama.cpp for Gemma 4 and Qwen 3.6

ai-dev
by /u/FantasticNature7590 · Llama

Benchmark MTP on vLLM and llama.cpp to find the optimal speculative token count per model and measure speedups.

What to do now

Benchmark your own models with MTP on vLLM and llama.cpp to determine the optimal speculative token count and measure speedups.

Summary

Benchmarking Multi‑Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B showed that vLLM is the faster engine for Gemma 4, reaching 132.52 tokens per second (tok/s) with a speculative token count of 5, while llama.cpp is the faster engine for Qwen 3.6, peaking at 117.70 tok/s on the Q8 quantization with n_max = 3. Each session comprised 10 runs of 1,500 generated tokens. The test machine was an AMD Ryzen 9 9950X with 92 GB of system RAM and an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM), running CUDA 13.1 on Ubuntu 24.04. The benchmark disabled prefix caching, ran vLLM in sequential mode, and kept the prompt constant across all runs. Relative to decoding without MTP, the results show a 3.34× speedup for Gemma 4 and a 2.59× speedup for Qwen 3.6, consistent with the post's claim that dense models benefit most from speculative decoding.
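
The measurement procedure described above (a fixed prompt, 10 sequential runs, 1,500 tokens per run) can be sketched as a small timing harness. This is a minimal illustration and not the author's script: `benchmark_decode` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that would be replaced by a real call to a vLLM or llama.cpp server.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_decode(
    generate: Callable[[str, int], int],
    prompt: str,
    runs: int = 10,
    max_tokens: int = 1500,
) -> Dict[str, float]:
    """Run `generate` sequentially `runs` times and report tokens per second.

    `generate(prompt, max_tokens)` must return the number of tokens it produced.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    return {
        "runs": runs,
        "mean_tok_s": statistics.mean(rates),
        # stdev needs at least two samples
        "stdev_tok_s": statistics.stdev(rates) if len(rates) > 1 else 0.0,
    }


def fake_generate(prompt: str, max_tokens: int) -> int:
    """Hypothetical stand-in for an inference call; sleeps instead of decoding."""
    time.sleep(0.001)
    return max_tokens


if __name__ == "__main__":
    # Same prompt for every run, as in the benchmark described above.
    print(benchmark_decode(fake_generate, "constant prompt", runs=10, max_tokens=1500))
```

Keeping the prompt constant and running sequentially means the differences between sessions come from the engine and the speculative token count, not from batching or cache effects.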

The optimal speculative token count depends on the model and engine pair: for vLLM + Gemma 4 the sweet spot is n = 5, whereas for llama.cpp + Qwen 3.6 it is n = 3. Higher counts can actually reduce throughput, because draft tokens that get rejected are wasted work. The decode phase is memory‑bandwidth bound, so MTP amortizes the cost of each pass over the weights by verifying multiple draft tokens at once, letting a single pass emit several tokens instead of one. Because inference speed translates directly to compute cost, a 3× speedup can mean either triple the user capacity or a third of the operating expense. These findings suggest that production deployments should benchmark their own hardware and model combinations to identify the optimal speculative token count and realize cost savings.
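
Why a sweet spot exists at all can be illustrated with a toy model. The sketch below uses the standard speculative decoding estimate for the expected number of tokens emitted per verification pass, (1 - α^(k+1)) / (1 - α), which assumes each draft token is accepted independently with probability α. The linear per-draft cost in `relative_throughput` and the example values of `alpha` and `cost_per_draft` are assumptions chosen for illustration; none of them are measurements from the benchmark.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    With k = 0 this is 1.0, i.e. ordinary one-token-per-pass decoding.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def relative_throughput(alpha: float, k: int, cost_per_draft: float) -> float:
    """Throughput relative to plain decoding under an assumed linear cost model.

    One pass is assumed to cost 1 unit plus `cost_per_draft` per speculative token.
    """
    return expected_tokens_per_pass(alpha, k) / (1.0 + cost_per_draft * k)


def best_k(alpha: float, cost_per_draft: float, k_max: int = 8) -> int:
    """Speculative token count in [0, k_max] that maximizes relative throughput."""
    return max(
        range(k_max + 1),
        key=lambda k: relative_throughput(alpha, k, cost_per_draft),
    )


if __name__ == "__main__":
    alpha, cost = 0.8, 0.1  # hypothetical values, not taken from the benchmark
    for k in range(9):
        print(k, round(relative_throughput(alpha, k, cost), 3))
    print("best k:", best_k(alpha, cost))
```

In this model the gain from each extra draft token shrinks geometrically while the cost keeps growing, so throughput rises, peaks, and then falls. Real acceptance rates and per-token costs differ by model and engine, which is why the optimum has to be measured.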

Key changes

  • vLLM on Gemma 4 reaches 132.52 tok/s with n=5, 3.34× the baseline of 39.69 tok/s
  • llama.cpp on Qwen 3.6 Q8 reaches 117.70 tok/s with n_max=3, outperforming vLLM on Qwen
  • Optimal speculative token count is specific to the model and engine pair: n=5 for vLLM + Gemma 4, n=3 for llama.cpp + Qwen 3.6
  • Dense models gain most: Gemma 4 3.34× speedup, Qwen 3.6 2.59× speedup
  • Decode phase is memory‑bandwidth bound; MTP amortizes cost by verifying multiple tokens per pass
  • A 3× speedup means roughly triple the user capacity or a third of the compute cost
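
The arithmetic behind these figures is simple enough to check directly. The sketch below recomputes the Gemma 4 speedup from the two reported throughputs and converts a speedup into a relative cost per token. The helper names are hypothetical, and the Qwen 3.6 baseline it derives is implied by the reported 2.59× figure, not a number stated in the source.

```python
def speedup(mtp_tok_s: float, baseline_tok_s: float) -> float:
    """Throughput with MTP divided by throughput without it."""
    return mtp_tok_s / baseline_tok_s


def implied_baseline(mtp_tok_s: float, speedup_factor: float) -> float:
    """Baseline throughput implied by an MTP throughput and a speedup factor."""
    return mtp_tok_s / speedup_factor


def relative_cost_per_token(speedup_factor: float) -> float:
    """Cost per token relative to baseline, assuming cost scales with GPU time."""
    return 1.0 / speedup_factor


if __name__ == "__main__":
    gemma = speedup(132.52, 39.69)  # reported figures for vLLM + Gemma 4
    print(f"Gemma 4 speedup: {gemma:.2f}x")
    print(f"Gemma 4 relative cost per token: {relative_cost_per_token(gemma):.2f}")
    # Derived value, not reported in the source:
    print(f"Implied Qwen 3.6 baseline: {implied_baseline(117.70, 2.59):.2f} tok/s")
```

The cost conversion assumes that serving cost is proportional to GPU time per generated token, which ignores fixed overheads such as prefill and idle capacity.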

Affects

internal
