2026-05-24 — Dan Billings

Gemma 4 MTP on llama.cpp: the drafter architecture llama.cpp doesn't recognize

Follow-up (2026-05-25): vLLM works where llama.cpp didn't. See "Gemma 4 mtp, vllm" for the working recipe, the num_speculative_tokens sweep, and the MoE-blunts-MTP punchline.

A few days after the Qwen 3.6 MTP post, Google announced multi-token prediction for Gemma 4. The headline numbers are gaudy — "up to a 3x speedup without any degradation", "~2.2× on Apple Silicon for the 26B MoE." The frameworks they explicitly support are transformers, MLX, vLLM, SGLang, and Ollama. llama.cpp is not on the list.

I wanted to know how much "not on the list" actually mattered. The Qwen 3.6 MTP recipe (--spec-type draft-mtp --spec-draft-n-max 3) gave 1.85× on this same 4090. The drafter for Gemma 4 is a separate GGUF file (the so-called "assistant"), so the more direct llama.cpp analogue would be the classic external-drafter mode: --spec-type draft-simple --model-draft <assistant.gguf>. Both files were already on disk on my 4090 box. The check should have taken half an hour.

It did not work. That's the whole point of this post.

What Google ships

A pair of HuggingFace repos per Gemma 4 size:

Target: google/gemma-4-26B-A4B-it (MoE, ~16 GB at Q4_K_M).
Drafter: google/gemma-4-26B-A4B-it-assistant (~310 MB at Q4_K_M).

The drafter is much smaller than a typical small draft model because Google's MTP architecture has the drafter share the input embedding table with the target. The drafter file therefore doesn't carry its own copy of those embeddings; they're loaded once from the target. Per Google's docs: "shares the input embedding table with the target model … verifying drafted tokens can require loading additional expert weights from memory."

What llama.cpp gives you to work with

At HEAD 5d246a792 (master, 2026-05-24), llama-server --help advertises:

--spec-type {none, draft-simple, draft-eagle3, draft-mtp,
             ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod, ngram-cache}
--model-draft / -md FNAME
--spec-draft-n-max N        (--draft-max is REMOVED at this HEAD)
--spec-draft-n-min N        (--draft-min is REMOVED at this HEAD)
--spec-draft-p-min P
--n-gpu-layers-draft N
--spec-draft-hf / -hfd <user>/<model>[:quant]

draft-simple is the classic two-model speculative decoding mode. draft-mtp is the new one that powers Qwen 3.6 (drafter baked into the same GGUF). draft-eagle3 is for EAGLE-style heads.

So on paper there's a path: --spec-type draft-simple --model-draft /path/to/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf --spec-draft-n-max 2. Let's try it.

Attempt 1: `--spec-type draft-simple` with the original drafter

... -ngl 99 \
    --model /home/dan/models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
    --model-draft /home/dan/models/gemma4-drafter/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
    --spec-type draft-simple --spec-draft-n-max 2

Refuses to start:

E llama_model_load: error loading model: unknown model architecture: 'gemma4_assistant'
E srv    load_model: [spec] failed to measure draft model memory: failed to load model

llama.cpp uses the general.architecture GGUF metadata key to pick which model loader to invoke. The drafter GGUF that Google ships sets general.architecture = gemma4_assistant — a fresh architecture name, presumably a signal to a future GGUF-aware framework that this is a drafter rather than a standalone model. llama.cpp doesn't have a gemma4_assistant loader registered, so the file never gets past the dispatch step.
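To make the dispatch step concrete, here is a minimal Python sketch that reads general.architecture straight out of a GGUF header and then refuses unknown names. It is not llama.cpp's code: the header layout follows the published GGUF format (version 2 and later, little-endian), while KNOWN_ARCHITECTURES and pick_loader are hypothetical stand-ins for llama.cpp's real architecture table.

```python
import struct

# GGUF scalar value-type ids mapped to struct formats (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    return _read(f, _SCALARS[vtype])


def read_architecture(path):
    """Return the general.architecture string from a GGUF file's metadata."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _read(f, "<I")             # format version (unused here)
        _read(f, "<Q")             # tensor count (unused here)
        kv_count = _read(f, "<Q")  # number of metadata key/value pairs
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key == "general.architecture":
                return value
    raise KeyError("general.architecture")


# Hypothetical registry: illustrative only, not llama.cpp's actual list.
KNOWN_ARCHITECTURES = {"gemma4"}


def pick_loader(arch):
    """Mimic the dispatch: unknown names fail before any tensor is read."""
    if arch not in KNOWN_ARCHITECTURES:
        raise ValueError(f"unknown model architecture: '{arch}'")
    return arch
```

Calling pick_loader(read_architecture(path)) on the original drafter file would, under these assumptions, raise with the same wording as the log above.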

Attempt 2: retag the drafter's architecture to `gemma4`

If the only thing wrong is the architecture string, surely rewriting it to gemma4 would let the file load? Easy enough with the gguf-py library that ships in llama-build/gguf-py/. The retag rewrites just the general.architecture value (and the per-arch key prefixes, e.g. gemma4_assistant.block_count → gemma4.block_count) — tensor data is untouched.
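The renaming rule itself is small enough to sketch. The function below works on a plain dict of metadata rather than a real GGUF file, so it shows the key and value rewrite only; it is not the gguf-py script used for this post, and the values in the usage example are placeholders.

```python
def retag(metadata, old_arch, new_arch):
    """Return a copy of metadata with the architecture name swapped.

    Rewrites the general.architecture value and every key that starts
    with "<old_arch>."; all other keys and values pass through unchanged.
    """
    prefix = old_arch + "."
    out = {}
    for key, value in metadata.items():
        if key == "general.architecture" and value == old_arch:
            value = new_arch
        elif key.startswith(prefix):
            key = new_arch + "." + key[len(prefix):]
        out[key] = value
    return out


# Placeholder values: the real block_count is not stated in this post.
example = {
    "general.architecture": "gemma4_assistant",
    "gemma4_assistant.block_count": 0,
    "requires_target_arch": "gemma4",
}
retagged = retag(example, "gemma4_assistant", "gemma4")
```

Because the prefix match includes the trailing dot, a key such as requires_target_arch is left alone even though its value mentions gemma4.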

A diff between the two files' metadata via GGUFReader:

key                                     original              retagged
─────────────────────────────────────────────────────────────────────
general.architecture                    gemma4_assistant      gemma4
gemma4_assistant.block_count            <int>                 (renamed to gemma4.block_count)
gemma4_assistant.embedding_length       <int>                 (renamed)
... (all gemma4_assistant.* keys)       ...                   gemma4.*
requires_target_arch                    gemma4                gemma4   (unchanged)

Note the requires_target_arch = gemma4 key. That's Google's intended solution: ship the drafter as a distinct architecture (gemma4_assistant) that declares which target it can pair with. If llama.cpp's loader honored requires_target_arch, no retag would be needed. But the loader at HEAD doesn't read that key yet.
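As a rough illustration of what honoring that key could look like, here is a hypothetical pairing check in Python. The function name and the choice to treat a missing key as "no constraint" are assumptions of this sketch, not anything llama.cpp or Google specifies.

```python
def can_pair(drafter_meta, target_meta):
    """Decide whether a drafter may be paired with a target model.

    Assumption: a drafter without requires_target_arch declares no
    constraint, so it is allowed to pair with any target.
    """
    required = drafter_meta.get("requires_target_arch")
    if required is None:
        return True
    return target_meta.get("general.architecture") == required


drafter = {"general.architecture": "gemma4_assistant",
           "requires_target_arch": "gemma4"}
target = {"general.architecture": "gemma4"}
ok = can_pair(drafter, target)  # True: the target matches the declared arch
```

The point of the sketch is that the check needs only metadata from both files, so it could run before any tensors are loaded.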

Anyway — retagged file in hand. Same invocation, drafter file swapped:

... --model-draft /home/dan/models/gemma4-drafter/gemma-4-26B-A4B-it-assistant.Q4_K_M.retagged.gguf

The loader now recognizes the architecture and gets further. But:

E llama_model_load: error loading model: done_getting_tensors:
    wrong number of tensors; expected 49, got 47

Two tensors short. Those two tensors are exactly the input embeddings that, by Google's MTP design, the drafter doesn't carry — they're meant to be loaded once from the target. The retagged file claims to be a gemma4 model, so llama.cpp expects it to be tensor-complete by gemma4 standards.
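The check that fires here is a completeness check, and it can be sketched in a few lines of Python. The log above reports only counts; this version also names the gap, which is an addition for illustration. All tensor names below are placeholders, not the names in the real files.

```python
def check_tensor_count(expected_names, present_names):
    """Raise if the file does not contain every tensor the arch expects."""
    expected, present = set(expected_names), set(present_names)
    if len(present) != len(expected):
        missing = sorted(expected - present)
        raise ValueError(
            f"wrong number of tensors; expected {len(expected)}, "
            f"got {len(present)} (missing: {', '.join(missing)})"
        )


# Placeholder names: 47 tensors the drafter carries plus 2 it shares.
carried = [f"placeholder.{i}" for i in range(47)]
shared = ["placeholder.shared_a", "placeholder.shared_b"]

try:
    check_tensor_count(carried + shared, carried)
except ValueError as err:
    message = str(err)  # starts with the same wording as the log line
```

Seen this way, the retagged file fails because it claims an architecture whose expected tensor list it was never meant to satisfy on its own.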

We've traded one error for another. The retag isn't a workaround; it's the same problem one layer deeper.

Attempt 3: `--spec-type draft-mtp` (the Qwen flavor)

Maybe the MTP-aware loader knows about the shared-embedding case. Both drafter files, both attempted with --spec-type draft-mtp:

draft-mtp + original drafter:
E llama_model_load: error loading model: unknown model architecture: 'gemma4_assistant'

draft-mtp + retagged drafter:
E llama_model_load: error loading model: done_getting_tensors:
    wrong number of tensors; expected 49, got 47

Identical errors. The --spec-type value doesn't enter the picture: the model loader rejects the file first. The same is true for --spec-type draft-eagle3. Six permutations (two drafter files × three spec types), six failures inside the model loader, before any speculative-decoding code runs.
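For anyone re-running the matrix, the six combinations can be generated rather than typed. This Python sketch only builds argument lists from the flags and paths shown above; the elided leading part of the command is deliberately left out, and build_args is a hypothetical helper, not part of the repo.

```python
from itertools import product

TARGET = "/home/dan/models/gemma-4-26B-A4B-it-Q4_K_M.gguf"
DRAFTERS = [
    "/home/dan/models/gemma4-drafter/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf",
    "/home/dan/models/gemma4-drafter/gemma-4-26B-A4B-it-assistant.Q4_K_M.retagged.gguf",
]
SPEC_TYPES = ["draft-simple", "draft-mtp", "draft-eagle3"]


def build_args(spec_type, drafter):
    """Argument tail for one attempt (the elided command prefix is omitted)."""
    return [
        "-ngl", "99",
        "--model", TARGET,
        "--model-draft", drafter,
        "--spec-type", spec_type,
        "--spec-draft-n-max", "2",
    ]


# Two drafter files times three spec types gives six attempts.
matrix = [build_args(spec, drafter)
          for drafter, spec in product(DRAFTERS, SPEC_TYPES)]
```

Each entry in matrix can be appended to whatever launches the server in your setup.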

What this means

Gemma 4 MTP doesn't work on llama.cpp at HEAD 5d246a792. To make it work, one of the following needs to land upstream:

1. Register gemma4_assistant as a known model architecture with a loader that knows it's a drafter (no separate input-embedding tensors) and pairs with a gemma4 target at runtime.
2. Honor the requires_target_arch metadata key when loading any draft-side model, and use that to dispatch to a drafter-specific loader.
3. Teach the existing gemma4 loader to accept a tensor-incomplete file when used as --model-draft, sourcing the missing embeddings from the target.

(2) is probably the cleanest from a llama.cpp design standpoint; it's also the one that matches Google's intent — they put the metadata key there for a reason.
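For comparison, the third option (accepting a tensor-incomplete draft file) boils down to filling the gaps from the target. The Python sketch below models tensors as dict entries; the function and the tensor names are hypothetical, and a real implementation would have to deal with devices, quantization formats, and lifetimes that this ignores.

```python
def resolve_draft_tensors(draft_tensors, target_tensors, shared_names):
    """Return the drafter's tensor table with shared entries borrowed.

    Borrowed entries reference the target's objects directly (no copy),
    which is the whole point of sharing the embedding table.
    """
    resolved = dict(draft_tensors)
    for name in shared_names:
        if name not in resolved:
            if name not in target_tensors:
                raise KeyError(f"target does not provide shared tensor: {name}")
            resolved[name] = target_tensors[name]
    return resolved


# Placeholder names and payloads, for illustration only.
target = {"placeholder.shared_embd": [0.1, 0.2], "placeholder.target_only": [0.3]}
draft = {"placeholder.draft_only": [0.4]}
resolved = resolve_draft_tensors(draft, target, ["placeholder.shared_embd"])
```

The caller has to know which names are shared, which is information the drafter's metadata would need to supply or the loader would need to hard-code.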

In the meantime: if you're on llama.cpp and you want MTP, your option today is the one already documented in the Qwen 3.6 MTP post. unsloth's Qwen 3.6 MTP GGUFs work with --spec-type draft-mtp because the drafter is baked into the main GGUF. A drafter that ships as a second file and borrows its embeddings from the target is more than llama.cpp's loader is ready for.

Tracking the fix

A small Scala IOApp in this repo, ansible.examples.GemmaDraftSimple2, is the executable form of this post. It hot-swaps the gemma4-26b systemd unit to attempt --spec-type draft-simple with the retagged drafter (i.e. attempt 2 above). Anyone bumping past this HEAD can re-run it and watch the error change. The day it doesn't error any more, this post is no longer the latest word and someone (maybe me) gets to write the follow-up post with actual numbers.

Sibling IOApps: GemmaMtpOff (control, A4B target alone — works fine), GemmaDraftMtp2 (attempt 3 with retagged drafter), GemmaDraftMtpOriginal (attempt 3 with the original gemma4_assistant-tagged drafter).
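When watching for the error to change, it helps to bucket the server's stderr instead of eyeballing it. The Python sketch below does that for the two messages quoted in this post; it is separate from the Scala IOApps, and the bucket names are made up for illustration.

```python
import re

UNKNOWN_ARCH = re.compile(r"unknown model architecture: '([^']+)'")
TENSOR_COUNT = re.compile(r"wrong number of tensors; expected (\d+), got (\d+)")


def classify(stderr_text):
    """Map llama-server stderr to a coarse failure bucket."""
    if UNKNOWN_ARCH.search(stderr_text):
        return "unknown-architecture"
    if TENSOR_COUNT.search(stderr_text):
        return "tensor-count-mismatch"
    if "error loading model" in stderr_text:
        return "other-load-error"
    return "no-load-error"


log = ("E llama_model_load: error loading model: done_getting_tensors:\n"
       "    wrong number of tensors; expected 49, got 47\n")
bucket = classify(log)  # "tensor-count-mismatch"
```

A result of "other-load-error" or "no-load-error" after a HEAD bump is the signal that something upstream has moved.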

What didn't work, in one paragraph

The original drafter GGUF declares an architecture name (gemma4_assistant) that llama.cpp doesn't yet register. Rewriting the architecture name to gemma4 (the "retagged" file in /home/dan/models/gemma4-drafter/) lets llama.cpp's loader try, but it then fails because the drafter is shipped without its input embedding tensors (which Google's MTP architecture shares with the target). For a given drafter file, --spec-type draft-simple, draft-mtp, and draft-eagle3 all fail in the loader with the same error, so the speculative-decoding strategy never gets a chance to run. Upstream needs either a gemma4_assistant loader, support for the requires_target_arch metadata key, or a tensor-incomplete loader path for draft models. None of these exist at 5d246a792.

Next post: probably either the eval harness or — if upstream moves fast — a "Gemma 4 mtp, revisited" follow-up.
