Gemma 4 mtp, vllm
Dan Billings — 2026-05-25
Yesterday's Gemma 4 mtp post ended with a list of three things llama.cpp needed to learn before Gemma 4's "assistant" drafter could load: register the gemma4_assistant architecture, honor requires_target_arch, or teach the existing loader to source embeddings from the target. None of those landed overnight.
But Google's announcement was explicit that vLLM was already on the supported list. So today, vLLM. The short version: it works, it gives 1.13× on this 4090, and the more interesting story is why the speedup is so modest compared to last week's Qwen 3.6 result (1.85×).
The 4090-shaped quantization problem
Gemma 4 26B-A4B-it has 26B total parameters, so at bf16 it is ~52 GB. Even with vLLM's MoE-aware loader, every expert has to be resident in VRAM because MoE routing is per-token; the "A4B" (4B active) saves per-token compute, not memory capacity. So the first decision is which quantization fits a 24 GB card with the drafter alongside.
| Repo | Format | On-disk | Fits 24 GB? |
|---|---|---|---|
| google/gemma-4-26B-A4B-it | bf16 | ~52 GB | No |
| RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic | FP8 | ~26 GB | No (just barely, and Red Hat's card targets B200) |
| RedHatAI/gemma-4-26B-A4B-it-NVFP4 | NVFP4 | ~13 GB | Fits, but NVFP4 needs Blackwell for native acceleration; on Ada it falls back to FP8 emulation |
| cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit | AWQ-int4 | ~16 GB | Yes |
The community AWQ build is the obvious choice. vLLM auto-detects AWQ from the repo config; no --quantization flag needed.
The drafter, google/gemma-4-26B-A4B-it-assistant, is 0.4B params and fits trivially at bf16 (~0.8 GB). No quantized drafter is needed.
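The sizes in the table follow from parameter count times bits per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 1e9 bytes; the helper name `weight_gb` is made up for illustration and is not part of vLLM:

```python
# Back-of-envelope weight sizes: parameters x bits per parameter.
# Real checkpoints can be larger than this estimate, for example when they
# carry quantization scales or keep some layers unquantized.

def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for label, params_b, bits in [
    ("target bf16", 26, 16),
    ("target FP8", 26, 8),
    ("target 4-bit", 26, 4),
    ("drafter bf16", 0.4, 16),
]:
    print(f"{label:>12}: ~{weight_gb(params_b, bits):.1f} GB")
```

The raw 4-bit estimate lands at ~13 GB, matching the NVFP4 row. The AWQ repo is ~16 GB on disk; the gap is presumably per-group scales and layers left unquantized, though I have not itemized it.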
The working recipe
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
--host 0.0.0.0 --port 8000 \
--served-model-name cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
--max-num-batched-tokens 4096 \
--max-model-len 8192 \
--gpu-memory-utilization 0.93 \
--enable-prefix-caching --enable-chunked-prefill \
--dtype auto \
--speculative-config '{"method":"draft_model","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":2}'
Four flags are non-default. Each one fixes a specific startup failure, listed here in the order I hit them:
1. --max-num-batched-tokens 4096
Gemma 4 is multimodal: vLLM logs Resolved architecture: Gemma4ForConditionalGeneration. The encoder side uses bidirectional attention, which makes vLLM auto-disable chunked MM input. With chunking off, max_num_batched_tokens must be at least max_tokens_per_mm_item (2496 for Gemma 4). The default of 2048 fails before the engine even initializes: ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048).
2. --max-model-len 8192
With target + drafter loaded, vLLM reports 17.41 GiB of weights. At --gpu-memory-utilization 0.92 (~22 GB allocatable) that leaves ~4.5 GiB for KV cache + cudagraph + working memory. At 32k context the KV demand overflows that budget: ValueError: No available memory for the cache blocks. Cutting context to 8k brings KV demand down ~4× and leaves clear headroom.
3. --gpu-memory-utilization 0.93
You'd think the answer to the previous problem is "bump utilization to 0.95". It isn't. On this WSL2 host, the NVIDIA driver reports ~22.45 GiB free at startup (it holds ~1.5 GiB itself). Requesting 0.95 × 23.99 = 22.79 GiB fails at allocation: Free memory on device cuda:0 (22.45/23.99 GiB) on startup is less than desired GPU memory utilization (0.95, 22.79 GiB). 0.93 requests 22.31 GiB, which fits under the 22.45 GiB that is free and leaves ~4.9 GiB after weights for everything else.
4. --speculative-config '{"method":"draft_model",...}'
This is the actual MTP knob. vLLM accepts the classic two-model draft_model method here even though Gemma 4's drafter is technically an MTP head — vLLM's loader sniffs the drafter's architecture and auto-upgrades the method to mtp at engine init. The journal shows speculative_config=SpeculativeConfig(method='mtp', ...) even though the JSON said draft_model. The auto-upgrade is what does the embedding-sharing trick documented in the next section.
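The memory arithmetic behind flags 2 and 3 can be checked in a few lines. This is a sketch using only the figures quoted from the vLLM logs above; the function name `budget` is hypothetical, and the model assumes vLLM requests utilization × total device memory, which is what the error message implies:

```python
TOTAL_GIB = 23.99    # device total, from the vLLM error message
FREE_GIB = 22.45     # free at startup on this WSL2 host
WEIGHTS_GIB = 17.41  # target + drafter, from "Model loading took ..."


def budget(utilization: float) -> tuple[float, bool, float]:
    """Return (requested GiB, fits in free memory?, GiB left after weights)."""
    requested = utilization * TOTAL_GIB
    fits = requested <= FREE_GIB
    leftover = requested - WEIGHTS_GIB
    return requested, fits, leftover


for u in (0.92, 0.93, 0.95):
    requested, fits, leftover = budget(u)
    status = "ok" if fits else "exceeds free memory"
    print(f"util={u:.2f}: request {requested:.2f} GiB ({status}), "
          f"{leftover:.2f} GiB left for KV cache, cudagraphs, workspace")
```

Anything left after weights has to cover KV cache, cudagraph capture, and working memory, which is why the context length and the utilization fraction have to be tuned together.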
What vLLM does that llama.cpp couldn't
The whole point of yesterday's post was that llama.cpp's loader has no idea what to do with Gemma 4's drafter. It can't load gemma4_assistant, and a retagged-as-gemma4 drafter has two tensors missing — the input embeddings, which Google's MTP design shares with the target. The post listed three upstream changes that would fix it. None have happened.
vLLM, on the same drafter file, logs this on startup:
Detected MTP model. Sharing target model embedding weights with the draft model.
Gemma4 MTP: keeping draft model's own lm_head (draft_dim != backbone_dim).
Gemma4 MTP: draft layer 0 (sliding_attention) -> language_model.model.layers.28.self_attn.attn
Gemma4 MTP: draft layer 1 (sliding_attention) -> language_model.model.layers.28.self_attn.attn
Gemma4 MTP: draft layer 2 (sliding_attention) -> language_model.model.layers.28.self_attn.attn
Gemma4 MTP: draft layer 3 (full_attention) -> language_model.model.layers.29.self_attn.attn
Model loading took 17.41 GiB memory and 34.19 seconds
Each line answers one of the llama.cpp post's open questions:
- "Register `gemma4_assistant` as a known model architecture." → `Detected MTP model.` vLLM has a dedicated `Gemma4MTPModel` loader.
- "Honor the `requires_target_arch` metadata key." → `Sharing target model embedding weights with the draft model.` vLLM uses the target's embedding table; the drafter's missing tensors are expected, not an error.
- "Teach the loader to accept a tensor-incomplete file." → The four `draft layer N -> language_model.model.layers.28/29.self_attn.attn` lines: the drafter is wired into specific layers of the target's attention stack at runtime. No standalone draft model is being constructed at all.
The whole thing finishes loading in 34 seconds. The same drafter file that llama.cpp considers malformed loads, attaches to the target, and serves traffic.
The num_speculative_tokens sweep
Methodology: 5 iterations × 200 tokens, fixed prompt, temperature=0, seed=42, prefix caching on. tok/s is usage.completion_tokens divided by wall-clock time. Acceptance is the change in vllm:spec_decode_num_accepted_tokens_total divided by the change in vllm:spec_decode_num_draft_tokens_total across the 5 iterations. The harness is scripts/sweep-vllm-mtp.sh in this repo; it hot-swaps the systemd unit through baseline / n=1 / n=2 / n=3 and aggregates the CSV.
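For readers who want to reproduce the two derived numbers without the harness, here is a minimal sketch of the calculation. It is not the harness itself: the function names are hypothetical, the sample scrapes and the `model_name` label are invented, and it assumes the metrics text has no trailing timestamps, so the sample value is the last field on each line.

```python
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"


def counter_total(metrics_text: str, name: str) -> float:
    """Sum one counter across all label sets in Prometheus text format."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series == name or series.startswith(name + "{"):
            total += float(value)
    return total


def acceptance_rate(before: str, after: str) -> float:
    """Accepted / drafted tokens between two scrapes of the metrics page."""
    accepted = counter_total(after, ACCEPTED) - counter_total(before, ACCEPTED)
    drafted = counter_total(after, DRAFTED) - counter_total(before, DRAFTED)
    return accepted / drafted if drafted else float("nan")


def tokens_per_second(completion_tokens: int, wall_seconds: float) -> float:
    return completion_tokens / wall_seconds


# Invented sample scrapes, taken before and after a run.
before = (
    '# TYPE vllm:spec_decode_num_accepted_tokens_total counter\n'
    'vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 100.0\n'
    'vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 200.0\n'
)
after = (
    'vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 713.0\n'
    'vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 1200.0\n'
)
print(f"acceptance: {acceptance_rate(before, after):.1%}")
print(f"throughput: {tokens_per_second(200, 1.25):.2f} tok/s")
```

Taking the delta rather than the raw counters matters because the counters are cumulative for the life of the server process.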
| Variant | tok/s | Acceptance | Speedup |
|---|---|---|---|
| baseline | 165.28 | — | 1.00× |
| n=1 | 164.09 | 69.7% | 0.99× |
| n=2 | 185.95 | 61.3% | 1.13× |
| n=3 | 184.40 | 50.0% | 1.12× |
[Chart: sweep results as num_speculative_tokens rises.]

Three observations:
- n=1 is slower than baseline (0.99×). Even with 69.7% acceptance, the overhead of running the drafter and verifying one drafted token doesn't pay back when you get at most one extra token per round.
- n=2 is the 4090 sweet spot (1.13×, 185.95 tok/s). Best balance of draft count and per-step overhead.
- n=3 (1.12×) is barely behind n=2 and acceptance has already dropped to 50%. vLLM warned at startup: `Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer, which may result in lower acceptance rate.` The warning is real; n=3 is right at the inflection.
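One way to read the sweep table is to back out how much a speculative step costs relative to a plain decode step. This is a rough model, not something vLLM reports: it assumes every step drafts exactly n tokens, that acceptance is accepted/drafted as in the methodology, and that each step emits the accepted tokens plus one token from the target. The function names are hypothetical.

```python
BASELINE_TOK_S = 165.28

# n -> (tok/s, acceptance), copied from the sweep table.
SWEEP = {1: (164.09, 0.697), 2: (185.95, 0.613), 3: (184.40, 0.500)}


def tokens_per_step(n: int, acceptance: float) -> float:
    """Mean tokens emitted per verification step under the stated model."""
    return 1.0 + n * acceptance


def relative_step_cost(n: int, tok_s: float, acceptance: float,
                       baseline: float = BASELINE_TOK_S) -> float:
    """Implied step time as a multiple of one baseline decode step."""
    speedup = tok_s / baseline
    return tokens_per_step(n, acceptance) / speedup


for n, (tok_s, acc) in SWEEP.items():
    print(f"n={n}: {tokens_per_step(n, acc):.2f} tokens/step, "
          f"step cost ~{relative_step_cost(n, tok_s, acc):.2f}x baseline")
```

Under these assumptions a step at n=1 yields about 1.7 tokens but also costs about 1.7× a baseline step, which is why it only breaks even. At n=2 and n=3 the implied step cost is roughly 2.0× and 2.2×, so the extra tokens only just outrun the extra work.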
Why is the speedup so modest?
The Qwen 3.6 post on this same 4090 got 1.85× with the unsloth MTP build at n=3. Gemma 4 here gets 1.13×. Both are state-of-the-art MTP. What's different?
Speculative decoding wins by amortizing the memory-bandwidth cost of one VRAM→SM weight load across multiple verified output tokens. The bigger the per-token weight load, the bigger the prize.
- Qwen 3.6 dense (27B): every output token loads all 27B params from VRAM. At ~16 GB at Q4_K_XL, that's the dominant cost. Spec decoding amortizes it across N tokens, which is a big win.
- Gemma 4 26B-A4B MoE: each token routes through only ~4B active params plus the shared backbone. The active footprint at 4-bit is more like 3.5–5 GB per token. The baseline is already operating at a much lower memory-bandwidth ceiling. Spec decoding still helps, but the prize is smaller.
The math, roughly: at 165 tok/s on AWQ-int4 we're moving ~3.5 GB × 165 ≈ 580 GB/s of weights out of VRAM. The 4090 peaks at ~1 TB/s, so the baseline is already at roughly 58% of theoretical bandwidth without any speculation at all. That leaves much less headroom for spec decoding to recover.
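The same estimate as a small calculation, alongside the bandwidth-bound ceiling for each model. A sketch under the post's own rough figures: ~1 TB/s peak, ~3.5 GB of active weights per token for Gemma 4 at 4-bit (the low end of the range above), and ~16 GB per token for dense Qwen 3.6. It ignores KV cache and activation traffic, and the helper names are made up.

```python
PEAK_GB_S = 1000.0  # the ~1 TB/s figure used above for the 4090


def weight_traffic_gb_s(gb_per_token: float, tok_per_s: float) -> float:
    """Weight bytes streamed per second at a given decode rate."""
    return gb_per_token * tok_per_s


def bandwidth_ceiling_tok_s(gb_per_token: float,
                            peak_gb_s: float = PEAK_GB_S) -> float:
    """Decode rate at which weight streaming alone saturates the memory bus."""
    return peak_gb_s / gb_per_token


gemma_traffic = weight_traffic_gb_s(3.5, 165)
print(f"Gemma 4 MoE baseline: {gemma_traffic:.1f} GB/s, "
      f"{gemma_traffic / PEAK_GB_S:.0%} of peak")
print(f"Gemma 4 MoE ceiling: {bandwidth_ceiling_tok_s(3.5):.0f} tok/s")
print(f"Qwen 3.6 dense ceiling: {bandwidth_ceiling_tok_s(16):.1f} tok/s")
```

The dense model's weight-streaming ceiling is several times lower than the MoE's, so each extra verified token per weight load is worth proportionally more there.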
Next: a dense Gemma 4
If the MoE-blunts-MTP hypothesis is right, a dense Gemma 4 of similar parameter count should show speedup closer to Qwen 3.6's 1.85×. That's the next post. Candidates to verify on disk and inside the 24 GB envelope:
- A dense 31B-class Gemma 4 with the assistant drafter, AWQ-quantized.
- The smaller `gemma-4-E4B-it` ("effective 4B" dense): fits more easily, but the smaller model means less wall time per token, so the absolute speedup may be small.
If the dense post comes back with ~1.8× speedup, the hypothesis holds: MoE already cuts the per-token weight traffic that MTP wants to amortize, so the two effects don't stack. If it comes back close to 1.13×, then something else is going on. The most likely culprit would be vLLM's MTP path on Ada rather than memory bandwidth, and the next investigation is into kernel-side overhead in Gemma4MTPModel.
Reproducibility
- Harness: `Rtx4090GemmaVllmBench.scala`, a support object plus four sweep IOApps (`GemmaVllmBaseline`, `GemmaVllmDraftModel1`, `GemmaVllmDraftModel2`, `GemmaVllmDraftModel3`) plus `GemmaVllmOff` to restore the box to its llama.cpp state.
- Bench scripts: `scripts/bench-vllm.sh`, `scripts/sweep-vllm-mtp.sh`.
- Chart rendering: a small Free-monad SVG DSL at `ansible.charts` plus `MtpSweepChart.scala`. The chart above is the output of `sbt --client "runMain ansible.examples.MtpSweepChart"` spliced into this post's source.
- One command, on a 4090 with Gemma terms accepted and `huggingface-cli login` already run: `./scripts/sweep-vllm-mtp.sh` produces the CSV behind the table above. ~30 minutes end-to-end on a cold cache.
Previous: Gemma 4 mtp (llama.cpp can't load the drafter). Next post: dense Gemma 4 — does removing the MoE recover Qwen 3.6's 1.85×?