2026-05-25 — Dan Billings

Gemma 4 mtp, vllm

Dan Billings — 2026-05-25

Yesterday's Gemma 4 mtp post ended with a list of three things llama.cpp needed to learn before Gemma 4's "assistant" drafter could load: register the gemma4_assistant architecture, honor requires_target_arch, or teach the existing loader to source embeddings from the target. None of those landed overnight.

But Google's announcement was explicit that vLLM was already on the supported list. So today, vLLM. The short version: it works, it gives 1.13× on this 4090, and the more interesting story is why the speedup is so modest compared to last week's Qwen 3.6 result (1.85×).

The 4090-shaped quantization problem

A 26B-total Gemma 4 26B-A4B-it at bf16 is ~52 GB. Even with vLLM's MoE-aware loader you still need the full parameter count in HBM because MoE routing is per-token; the "A4B" (4B active) only saves compute, not memory. So the first decision is which quantization fits a 24 GB card with the drafter alongside.

Repo Format On-disk Fits 24 GB?
google/gemma-4-26B-A4B-it bf16 ~52 GB No
RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic FP8 ~26 GB No (just barely, and Red Hat's card targets B200)
RedHatAI/gemma-4-26B-A4B-it-NVFP4 NVFP4 ~13 GB Fits, but NVFP4 needs Blackwell for native acceleration; on Ada it falls back to FP8 emulation
cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit AWQ-int4 ~16 GB Yes

The community AWQ build is the obvious choice. vLLM auto-detects AWQ from the repo config; no --quantization flag needed.

The drafter, google/gemma-4-26B-A4B-it-assistant, is 0.4B params and fits trivially at bf16 (~0.8 GB). No quantized drafter is needed.

The working recipe

vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --host 0.0.0.0 --port 8000 \
  --served-model-name cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --max-num-batched-tokens 4096 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.93 \
  --enable-prefix-caching --enable-chunked-prefill \
  --dtype auto \
  --speculative-config '{"method":"draft_model","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":2}'

Four flags are non-default and each one fixes a specific startup failure mode I hit in the order they appear:

1. --max-num-batched-tokens 4096 Gemma 4 is multimodal — vLLM logs Resolved architecture: Gemma4ForConditionalGeneration. The encoder side uses bidirectional attention, which makes vLLM auto-disable chunked MM input. With chunking off, max_num_batched_tokens must exceed max_tokens_per_mm_item (=2496 for Gemma 4). The default 2048 fails before the engine even initializes: ValueError: Chunked MM input disabled but max_tokens_per_mm_item (2496) is larger than max_num_batched_tokens (2048).

2. --max-model-len 8192 With target + drafter loaded, vLLM reports 17.41 GiB of weights. At --gpu-memory-utilization 0.92 (~22 GB allocatable) that leaves ~4.5 GiB for KV cache + cudagraph + working memory. At 32k context the KV demand overflows that budget: ValueError: No available memory for the cache blocks. Cutting context to 8k brings KV demand down ~4× and leaves clear headroom.

3. --gpu-memory-utilization 0.93 You'd think the answer to the previous problem is "bump utilization to 0.95". It isn't. On this WSL2 host, the NVIDIA driver reports ~22.45 GiB free at startup (it holds ~1.5 GiB itself). Requesting 0.95 × 24 = 22.79 GiB fails at allocation: Free memory on device cuda:0 (22.45/23.99 GiB) on startup is less than desired GPU memory utilization (0.95, 22.79 GiB). 0.93 wants 22.31 GiB, fits, leaves ~4.9 GiB for everything else.

4. --speculative-config '{"method":"draft_model",...}' This is the actual MTP knob. vLLM accepts the classic two-model draft_model method here even though Gemma 4's drafter is technically an MTP head — vLLM's loader sniffs the drafter's architecture and auto-upgrades the method to mtp at engine init. The journal shows speculative_config=SpeculativeConfig(method='mtp', ...) even though the JSON said draft_model. The auto-upgrade is what does the embedding-sharing trick documented in the next section.

What vLLM does that llama.cpp couldn't

The whole point of yesterday's post was that llama.cpp's loader has no idea what to do with Gemma 4's drafter. It can't load gemma4_assistant, and a retagged-as-gemma4 drafter has two tensors missing — the input embeddings, which Google's MTP design shares with the target. The post listed three upstream changes that would fix it. None have happened.

vLLM, on the same drafter file, logs this on startup:

Detected MTP model. Sharing target model embedding weights with the draft model.
Gemma4 MTP: keeping draft model's own lm_head (draft_dim != backbone_dim).
Gemma4 MTP: draft layer 0 (sliding_attention) -> language_model.model.layers.28.self_attn.attn
Gemma4 MTP: draft layer 1 (sliding_attention) -> language_model.model.layers.28.self_attn.attn
Gemma4 MTP: draft layer 2 (sliding_attention) -> language_model.model.layers.28.self_attn.attn
Gemma4 MTP: draft layer 3 (full_attention)    -> language_model.model.layers.29.self_attn.attn
Model loading took 17.41 GiB memory and 34.19 seconds

Each line answers one of the llama.cpp post's open questions:

The whole thing finishes loading in 34 seconds. The same drafter file that llama.cpp considers malformed loads, attaches to the target, and serves traffic.

The num_speculative_tokens sweep

Methodology: 5 iterations × 200 tokens, fixed prompt, temperature=0, seed=42, prefix caching on. Wall-clock time divided into usage.completion_tokens for tok/s. Acceptance from the delta in vllm:spec_decode_num_accepted_tokens_total / vllm:spec_decode_num_draft_tokens_total across the 5 iterations. The harness is scripts/sweep-vllm-mtp.sh in this repo; it hot-swaps the systemd unit through baseline / n=1 / n=2 / n=3 and aggregates the CSV.

Variant tok/s Acceptance Speedup
baseline 165.28 1.00×
n=1 164.09 69.7% 0.99×
n=2 185.95 61.3% 1.13×
n=3 184.40 50.0% 1.12×
Gemma 4 26B-A4B AWQ on RTX 4090 — vLLM tok/s vs num_speculative_tokens 0 50 100 150 200 tok/s 1.00× baseline 0.99× n=1 (69.7%) 1.13× n=2 (61.3%) 1.12× n=3 (50.0%)
Throughput across the sweep. n=2 peaks at 1.13× over baseline; acceptance drops monotonically as num_speculative_tokens rises.

Three observations:

Why is the speedup so modest?

The Qwen 3.6 post on this same 4090 got 1.85× with the unsloth MTP build at n=3. Gemma 4 here gets 1.13×. Both are state-of-the-art MTP. What's different?

Speculative decoding wins by amortizing the memory-bandwidth cost of one HBM→SM weight load across multiple verified output tokens. The bigger the per-token weight load, the bigger the prize.

The math, roughly: at 165 tok/s on AWQ-int4 we're moving ~3.5 GB × 165 ≈ 580 GB/s of weights over the bus. The 4090 has ~1 TB/s. We're at 60% of theoretical peak without any speculation at all. There's just not as much room for spec decoding to recover.

Next: a dense Gemma 4

If the MoE-blunts-MTP hypothesis is right, a dense Gemma 4 of similar parameter count should show speedup closer to Qwen 3.6's 1.85×. That's the next post. Candidates to verify on disk and inside the 24 GB envelope:

If the dense post comes back with ~1.8× speedup, the hypothesis holds: MoE amortizes the same memory bus that MTP wants to amortize, and the two effects don't stack. If it comes back close to 1.13×, then something else is going on — most likely vLLM's MTP path on Ada is the bottleneck rather than memory bandwidth, and the next investigation is into kernel-side overhead in Gemma4MTPModel.

Reproducibility


Previous: Gemma 4 mtp (llama.cpp can't load the drafter). Next post: dense Gemma 4 — does removing the MoE recover Qwen 3.6's 1.85×?