2026-05-24 — Dan Billings

Qwen 3.6 MTP on a 4090: 84 tok/s at n=3, 6 tok/s at n=4

This is the first post in a series about running open-source models on a single consumer GPU. The goal of the series is reproducibility: anyone with the same hardware should be able to copy-paste their way to the same running system. No "left as an exercise for the reader."

What we're building

A llama.cpp llama-server running Qwen3.6-27B with multi-token prediction (MTP) enabled, on a single RTX 4090 (24 GB). The MTP head lets the model draft several tokens ahead, then verify them in parallel in a single forward pass of the main model, so each accepted draft token saves a full decode step.

Headline measured on this 4090 (single slot, 65 K context, 256-token completion):

No MTP: 45 tok/s.
MTP, --spec-draft-n-max 3: 84 tok/s. 1.85× speedup, 60.7% acceptance.
MTP, --spec-draft-n-max 4 (and above): falls off a cliff to ~6 tok/s — seven times slower than no MTP at all.

So if you're going to remember one thing from this post: --spec-draft-n-max 3 on a 4090, never above 3. Unsloth's published guide recommends 2 (their reference hardware is an RTX 6000); on the 4090 the optimum is one notch higher and the gain is bigger than they reported, but you have to stop exactly there.

The recipe is:

llama.cpp built from a pinned commit with CUDA 12.8.
Model: unsloth/Qwen3.6-27B-MTP-GGUF at the UD-Q4_K_XL (unsloth dynamic 4-bit) quantization.
A systemd unit that pins the server to GPU 0 with flash-attn, continuous batching, 4 parallel slots, 256 K context, and the new MTP flags.

The whole thing is also expressed as Scala in src/main/scala/ansible/examples/Rtx4090Setup.scala in this repo. For an explanation of the typed DSL architecture, see Type-Safe Home Cluster. You can either run that playbook or follow the steps below by hand — they produce the same result.

Hardware and OS

RTX 4090, 24 GB VRAM.
WSL2 on Windows 11, Ubuntu 24.04.
CUDA Toolkit 12.8 (the apt cuda-toolkit-12-8 package; nvcc lands in /usr/local/cuda-12.8/bin/).

A note about WSL2: /usr/lib/wsl/lib must be on LD_LIBRARY_PATH for the GPU runtime, and the cuda-toolkit-12-8 apt package does not put nvcc on the interactive-shell PATH. We handle both in the systemd unit's Environment= lines and in ~/.bashrc for manual builds. If cmake later refuses to find a CUDA compiler, this is why.
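For manual builds, the two WSL2 fixes might look like this in ~/.bashrc. This is a minimal sketch: the paths are the ones used elsewhere in this post, while the CUDA_HOME variable and the duplicate-entry guards are my own additions for illustration, not something the toolkit requires.

```shell
# Sketch of ~/.bashrc additions for WSL2 + CUDA 12.8 (paths as used in this post).
# CUDA_HOME is just a convenience variable for this snippet.
CUDA_HOME=/usr/local/cuda-12.8

# Put nvcc on PATH, but only once, so re-sourcing ~/.bashrc does not grow PATH.
case ":$PATH:" in
  *":$CUDA_HOME/bin:"*) ;;
  *) PATH="$CUDA_HOME/bin:$PATH" ;;
esac

# The WSL GPU runtime lives in /usr/lib/wsl/lib; CUDA's own libraries are in lib64.
case ":${LD_LIBRARY_PATH:-}:" in
  *":/usr/lib/wsl/lib:"*) ;;
  *) LD_LIBRARY_PATH="/usr/lib/wsl/lib:$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac

export CUDA_HOME PATH LD_LIBRARY_PATH
```

After re-sourcing ~/.bashrc, command -v nvcc should resolve to the 12.8 toolkit, assuming the apt package installed it in the location above.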

Pinned llama.cpp commit

5d246a792 (master HEAD, 2026-05-24). This includes the cluster of MTP fixes that landed in late May 2026:

#23485 — draft model margin in server
#23563 — NVFP4 MTP scale tensors
#23461 — free draft/MTP resources on sleep (fixes a VRAM leak you definitely want)
#23433 — skip logit computation in MTP path
#23287 — backend sampling for the MTP draft path

Verify the build advertises --spec-type draft-mtp before going further:

git clone https://github.com/ggml-org/llama.cpp /home/dan/llama-build
git -C /home/dan/llama-build checkout 5d246a792
# after building...
/usr/local/bin/llama-server --help | grep -A1 -- '--spec-type'
# expected: none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache

The flag was renamed from --spec-type mtp → --spec-type draft-mtp on May 13 2026; if you copy a recipe from before then, swap accordingly.
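If you script the install, the same check can act as a guard that stops before the systemd unit is written. A small sketch; advertises_draft_mtp is a hypothetical helper name, and it only looks for the string draft-mtp in whatever help text it is given.

```shell
# Succeeds only if the help text on stdin mentions the draft-mtp spec type.
advertises_draft_mtp() {
  grep -q -e 'draft-mtp'
}

# Example (not run here): stop early if the installed build predates the rename.
# /usr/local/bin/llama-server --help | advertises_draft_mtp \
#   || { echo "llama-server does not advertise draft-mtp; rebuild from the pinned commit" >&2; exit 1; }
```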

Build

Dependencies first:

sudo apt install -y cmake build-essential ninja-build ccache cuda-toolkit-12-8

Then configure and build with CUDA on. The -DCUDAToolkit_ROOT and -DCMAKE_CUDA_COMPILER flags are belt-and-suspenders for WSL2 where the PATH may not have been propagated to the cmake subprocess:

cmake -B /home/dan/llama-build/build -S /home/dan/llama-build \
  -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc

cmake --build /home/dan/llama-build/build --config Release -j$(nproc)

This takes 10–20 minutes on the first run; subsequent rebuilds with a warm ccache take seconds.
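If the configure step misbehaves, one quick thing to look at is which CUDA compiler ended up in the build directory's CMakeCache.txt. A sketch, assuming the usual KEY:TYPE=VALUE cache line format; cuda_compiler_in_cache is a made-up helper name.

```shell
# Prints the CUDA compiler recorded in a CMake build directory.
# Returns non-zero if the entry is missing or marked NOTFOUND.
cuda_compiler_in_cache() {
  cache="$1/CMakeCache.txt"
  # The type label after the colon varies (FILEPATH, STRING, ...), so it is skipped.
  compiler=$(sed -n 's/^CMAKE_CUDA_COMPILER:[A-Z]*=//p' "$cache" 2>/dev/null | head -n 1)
  case "$compiler" in
    ''|*NOTFOUND) return 1 ;;
  esac
  printf '%s\n' "$compiler"
}

# Example (not run here):
# cuda_compiler_in_cache /home/dan/llama-build/build
```

With the flags above, the value recorded should be the nvcc path that was passed on the command line.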

Install the binaries:

sudo install -m 0755 /home/dan/llama-build/build/bin/llama-server /usr/local/bin/llama-server
sudo install -m 0755 /home/dan/llama-build/build/bin/llama-bench  /usr/local/bin/llama-bench

Model files

Pull the unsloth MTP build of Qwen3.6-27B at UD-Q4_K_XL (~16.7 GB on disk). Unsloth bakes the MTP head into a single GGUF — there is no separate drafter file to manage. The filename in the repo doesn't carry the MTP tag, only the repo does:

mkdir -p /home/dan/models/mtp
curl -L -o /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  "https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/resolve/main/Qwen3.6-27B-UD-Q4_K_XL.gguf?download=true"

VRAM accounting at 256 K context with the default FP KV cache, on a 4090: the model weights are ~16 GB at Q4_K_XL, the KV cache at full context for 4 slots is the other main consumer, and MTP itself adds about 1 GB on top of the same model without MTP. We have headroom. If you push context to 1 M with YaRN, something has to give to make it fit; which trade-offs to make is a topic for a future post.

The llama-server invocation

The full command, with every flag explained inline. This is what the systemd unit ends up running:

/usr/local/bin/llama-server \
  --model /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  --ctx-size 262144 \
  --parallel 4 \
  -fa on \
  -b 2048 \
  -ub 512 \
  --cont-batching \
  --jinja \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --metrics \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"enable_thinking":true}'

Flag by flag:

--model <path> — the GGUF on disk.
--host 0.0.0.0 --port 8080 — listen on the LAN. If you only ever curl from the same box, use 127.0.0.1; we want other machines in the house to hit it.
-ngl 99 — offload all transformer layers to the GPU. The number is "more layers than the model has," which llama.cpp clamps to "all."
--ctx-size 262144 — 256 K tokens, Qwen3.6's native context. Each of the 4 parallel slots gets ctx-size / 4 = 64 K. To reach 1 M you'd add YaRN rope scaling; that's a follow-up post.
--parallel 4 — four concurrent slots. The server multiplexes requests across them.
-fa on — flash attention. Required for the MTP recipe; also a free improvement to both prompt-processing (PP) and token-generation (TG) speed.
-b 2048 -ub 512 — logical batch size (-b) and physical micro-batch size (-ub). -b 2048 keeps prompt processing fast; -ub 512 is conservative so 4-slot concurrent prefill doesn't OOM. We tried -ub 2048 on the earlier Gemma 4 recipe and it OOMed under sustained load, so we keep the smaller physical batch here.
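--cont-batching — continuous batching, which is what makes --parallel > 1 worthwhile: new requests slot in mid-decode rather than waiting for a slot to fully drain. Recent llama.cpp builds enable it by default; it is spelled out here so the unit file documents itself.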
--jinja — enable jinja chat templating so OpenAI-style tool-use calls round-trip correctly.
--spec-type draft-mtp — this is the MTP flag. Tells llama.cpp to use the MTP head baked into the GGUF as the drafter. The full menu at HEAD is none, draft-simple, draft-eagle3, draft-mtp, ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod, ngram-cache — draft-mtp is the one that uses the head unsloth ships in this GGUF. The flag was --spec-type mtp before May 13 2026, then renamed; if your build is older you'll get an error and a confused 15 minutes.
--spec-draft-n-max 3 — number of draft tokens per step. This is the most important knob in the recipe. Unsloth recommends 2 based on RTX 6000 measurements; on a 4090 the peak is 3 (1.85× speedup, 60.7% acceptance), and 4 is a cliff (~6 tok/s — 7× worse than no MTP at all). Numbers in the Benchmarks section. Do not go above 3 on this hardware.
--metrics — exposes Prometheus metrics at GET /metrics. Cheap to leave on.
--reasoning-format deepseek — pulls the <think>...</think> block out into message.reasoning_content in the OpenAI-shaped response, so clients can render it separately.
--chat-template-kwargs '{"enable_thinking":true}' — turns on Qwen3.6's thinking mode by default. The single quotes are not optional: systemd's ExecStart= parser eats unquoted double quotes inside JSON if you let it.
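The per-slot arithmetic behind --ctx-size and --parallel is easy to sanity-check in the shell before editing the unit. A minimal sketch, assuming the even split described in the --ctx-size entry; per_slot_ctx is a helper name made up for this example.

```shell
# Per-slot context = total --ctx-size divided evenly across --parallel slots.
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

per_slot_ctx 262144 4   # deployed config: prints 65536, i.e. 64 K per slot
per_slot_ctx 262144 1   # a single slot gets the whole window: prints 262144
```

Raising --parallel without raising --ctx-size therefore shrinks what each request can see, which is worth keeping in mind before adding slots.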

Not used — and worth saying out loud:

No --cache-type-k / --cache-type-v. The Gemma 4 recipe in this same repo runs q4_0 KV; for the first Qwen3.6 post I'm leaving KV at FP defaults so the MTP measurement isn't tangled up with KV-quant effects. A future post benchmarks quantized KV on Qwen3.6.
No --model-draft. That's for the external draft-model flavor of speculative decoding. Qwen3.6-MTP's drafter is internal to the GGUF — using --model-draft here is the wrong knob.

Systemd unit

The Scala builder in LlamaServerService renders this:

[Unit]
Description=llama.cpp inference server — qwen36-27b-mtp
After=network.target

[Service]
Type=simple
User=dan
Environment=LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/local/cuda-12.8/lib64:/usr/lib/x86_64-linux-gnu
Environment=CUDA_VISIBLE_DEVICES=0
ExecStart=/usr/local/bin/llama-server --model /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 --ctx-size 262144 -b 2048 -ub 512 --cont-batching --jinja -fa on --parallel 4 --host 0.0.0.0 --port 8080 --spec-type draft-mtp --spec-draft-n-max 3 --metrics --reasoning-format deepseek --chat-template-kwargs '{"enable_thinking":true}'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Install, reload, and start:

sudo cp qwen36-27b-mtp.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now qwen36-27b-mtp
journalctl -u qwen36-27b-mtp -f

CUDA_VISIBLE_DEVICES=0 pins the server to the 4090 in case any other GPU shows up (e.g. an iGPU under WSL).
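Because the draft length is the one knob that can wreck throughput on this card, a deploy script can refuse to install a unit that sets it too high. A sketch, assuming the flag sits on the ExecStart= line as in the unit above; both function names are hypothetical and not part of the repo's Scala playbook.

```shell
# Prints the --spec-draft-n-max value found on a unit file's ExecStart line.
draft_n_max_in_unit() {
  sed -n 's/^ExecStart=.*--spec-draft-n-max[ =]\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Returns 0 if the draft length is at or below the ceiling,
# 1 if it is above, 2 if the flag is not present at all.
check_draft_ceiling() {
  unit_file=$1
  ceiling=$2
  n=$(draft_n_max_in_unit "$unit_file")
  [ -n "$n" ] || { echo "no --spec-draft-n-max in $unit_file" >&2; return 2; }
  [ "$n" -le "$ceiling" ] || { echo "--spec-draft-n-max $n exceeds $ceiling" >&2; return 1; }
}

# Example (not run here):
# check_draft_ceiling qwen36-27b-mtp.service 3 && sudo cp qwen36-27b-mtp.service /etc/systemd/system/
```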

Sanity check

curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen36-27b-mtp",
    "messages": [{"role": "user", "content": "Reply with the single word: ready."}],
    "max_tokens": 8,
    "temperature": 0
  }' | jq -r '.choices[0].message.content'

If you see ready. you're good. If you get connection refused instead, check journalctl -u qwen36-27b-mtp. The most common first-run failure is a CUDA library path issue, fixed by the WSL LD_LIBRARY_PATH entry above.
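Right after systemctl enable --now the server may simply still be loading the model, so a scripted sanity check is better off polling than firing once. Below is a small retry helper as a sketch; wait_until is a hypothetical name, and the example relies on llama-server's GET /health endpoint, which returns an error status until the model is ready.

```shell
# Retries a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (not run here): poll for up to about a minute before sending the real request.
# wait_until 30 2 curl -sf http://localhost:8080/health || journalctl -u qwen36-27b-mtp -n 50
```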

Benchmarks

Test conditions: single slot (--parallel 1), 65 K context, -fa on, default FP KV cache, fixed ~65-token prompt → 256-token completion with temperature: 0. One warmup at 32 tokens before each measurement, server restarted between variants so KV state can't leak through. The deployed config is --parallel 4 --ctx-size 262144; multi-slot numbers are a future post.
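For anyone reproducing the client-side column, one way to get a tok/s figure is to divide the completion length by curl's own wall-clock timer. This is a sketch of that idea and not necessarily the exact harness behind the numbers below; request.json is a hypothetical file holding the fixed prompt, and the division assumes the completion really ran to 256 tokens.

```shell
# Tokens per second from a completion-token count and an elapsed time in seconds.
tok_per_sec() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n / s }'
}

tok_per_sec 256 4.0   # prints 64.0

# Example (not run here): time one request end to end with curl's time_total.
# secs=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:8080/v1/chat/completions \
#          -H 'Content-Type: application/json' -d @request.json)
# tok_per_sec 256 "$secs"
```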

--spec-draft-n-max        server tok/s  curl tok/s  acceptance  accepted/generated drafts  drafter time (s)
off (--spec-type none)    45.5          44.6        n/a         n/a                        n/a
1                         65.9          63.7        78.9%       112/142                    0.31
2                         80.2          77.1        70.3%       149/212                    0.44
3 (peak)                  84.4          80.7        60.7%       164/270                    0.56
4 (cliff)                 6.3           6.2         48.3%       167/346                    1.29
6                         4.9           4.9         36.9%       174/472                    1.70

What the columns mean: server tok/s is what llama-server's own print_timing reports for the 256-token generation; curl tok/s is end-to-end wall clock from the client side. They agree within 5%, which is reassuring — nothing weird is happening between client and server. acceptance is what fraction of drafted tokens the target model kept. drafter time is the total wall time the drafter spent across the whole 256-token completion.
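The derived columns can be recomputed from the raw counts, which is a cheap way to catch a transcription slip. A minimal sketch using awk for the floating-point math; the two function names are made up for this example.

```shell
# Acceptance rate (%) from accepted and generated draft counts.
acceptance_pct() {
  awk -v acc="$1" -v gen="$2" 'BEGIN { printf "%.1f\n", 100 * acc / gen }'
}

# Gap (%) between the server-side and client-side throughput figures.
gap_pct() {
  awk -v server="$1" -v client="$2" 'BEGIN { printf "%.1f\n", 100 * (server - client) / server }'
}

acceptance_pct 112 142   # n=1: prints 78.9
acceptance_pct 164 270   # n=3: prints 60.7
acceptance_pct 167 346   # n=4: prints 48.3
gap_pct 84.4 80.7        # n=3 row: prints 4.4, inside the 5% band
```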

What the curve says. Acceptance falls roughly linearly as you raise the draft length: 79% → 70% → 61% → 48% for n = 1 through 4, then 37% at n = 6. Each extra draft slot is harder to predict, no surprise. So far so consistent with unsloth.

But throughput is not monotonic. It rises smoothly from 45 → 66 → 80 → 84 tok/s and then collapses between draft=3 and draft=4. The drop is not gradual; one click of the knob takes you from 1.85× faster than baseline to 7× slower than baseline.
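The multipliers quoted in this post all come straight from the server tok/s column. As a quick check, with ratio being a throwaway helper for this example:

```shell
# Ratio of two throughput figures, to two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 84.4 45.5   # draft=3 vs no MTP: prints 1.85
ratio 45.5 6.3    # no MTP vs draft=4: prints 7.22
ratio 84.4 6.3    # draft=3 vs draft=4: prints 13.40
```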

Look at the drafter-time column. The drafter takes 0.56 s at draft=3 and 1.29 s at draft=4: 2.3× more wall time for one extra draft slot, even though the work should scale roughly linearly. That jump is a symptom rather than the whole story, though. At 6.3 tok/s the 256-token completion takes about 41 s of wall time, and the drafter accounts for only 1.29 s of it, so most of the lost time is spent outside the drafter, presumably in verification. Where exactly is not something these numbers show. Best guess: at n_max ≥ 4, the kernel batch size pushes CUDA off a fast path it had at n ≤ 3 on a 4090's compute profile. The acceptance drop (61% → 48%) makes it worse but isn't, by itself, enough to cause a 13× regression.
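It is worth putting the drafter time next to the total generation time implied by the throughput figure. The sketch below does that arithmetic from the table's own numbers, assuming server tok/s covers exactly the 256 generated tokens; drafter_share is a hypothetical helper name.

```shell
# Total generation time implied by a throughput, and the drafter's share of it.
# Arguments: tokens generated, server tok/s, drafter time in seconds.
drafter_share() {
  awk -v tokens="$1" -v tps="$2" -v drafter="$3" 'BEGIN {
    total = tokens / tps
    printf "total %.1f s, drafter %.1f%%\n", total, 100 * drafter / total
  }'
}

drafter_share 256 84.4 0.56   # draft=3: prints "total 3.0 s, drafter 18.5%"
drafter_share 256 6.3 1.29    # draft=4: prints "total 40.6 s, drafter 3.2%"
```

On these figures the drafter's slice of the wall clock shrinks at draft=4 even as its absolute time doubles, so a profile of the rest of the step would be the next thing to look at.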

The takeaway. On a 4090, with this model + this build, the curve has a hard corner at --spec-draft-n-max 3. Everything ≤ 3 is great. Everything ≥ 4 is worse than turning MTP off. On an RTX 6000 (which is what unsloth measured) the curve apparently rolls off more gently and the recommended setting is 2 — different hardware profile, different sweet spot.

A future post will sweep --parallel {1,2,4,8} and --ctx-size to see whether the corner moves under multi-slot load. For single-slot interactive use, draft=3 is the answer.

What didn't work

--spec-draft-n-max 4 and 6. Most prominent dead end of this whole exercise — and the one most likely to bite someone who follows unsloth's guide and then "tunes up." Quantified above in Benchmarks. Don't.
Prometheus /metrics for MTP counters. I expected llamacpp:n_draft and llamacpp:n_accept to expose the drafted/accepted totals so a sidecar could scrape them over time. At this commit they don't appear there. The acceptance numbers are on stderr — every print_timing block ends with draft acceptance = 0.XX (N accepted / M generated) and there's a draft-mtp: #calls #gen drafts #acc drafts #gen tokens #acc tokens summary alongside it. For now, scrape stderr or journalctl -u qwen36-27b-mtp -o cat | grep 'draft acceptance'.
Reading the wrong print_timing block. Each request emits its own eval time = ... line, and the warmup is a request too. If you grep blindly and take the first match you'll get the warmup (32 tokens), not the measurement (256 tokens). Take the last match per server run, not the first. Cost me a confused half hour and an embarrassing earlier draft of this post.
Pre–May 13 builds. The flag was --spec-type mtp; trying to pass draft-mtp errors out. Either pin the commit (5d246a792 is what I used) or s/draft-mtp/mtp/.
Quantized KV (--cache-type-k q4_0 --cache-type-v q4_0). Not tested in this post — the Gemma 4 recipe in the same repo runs q4_0 KV but I deliberately kept KV at FP defaults here so the MTP measurement isn't tangled with KV-quant effects. A future post benchmarks it.
-ub 2048. Inherited caution from the Gemma 4 recipe: under 4-slot concurrent prefill on the earlier setup it OOMed. The deployed config uses -ub 512. Not re-tested here under single slot.
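Two of the items above, scraping acceptance from the log and picking the right timing line, come down to small text filters. Here is a sketch of both, assuming the log line shapes described in this list; the function names are made up, the patterns are deliberately loose about spacing, and the exclusion of prompt eval time lines is my assumption about what else appears in the timing block.

```shell
# Sums accepted/generated draft counts across all matching log lines on stdin.
# Assumed line shape: "... draft acceptance = 0.61 (164 accepted / 270 generated)"
sum_draft_acceptance() {
  sed -n 's/.*draft acceptance[^(]*( *\([0-9][0-9]*\) accepted *\/ *\([0-9][0-9]*\) generated.*/\1 \2/p' |
    awk '{ acc += $1; gen += $2 }
         END {
           if (gen > 0) printf "%d/%d = %.1f%%\n", acc, gen, 100 * acc / gen
           else print "no draft lines"
         }'
}

# Keeps only the last generation timing line, i.e. the measurement, not the warmup.
last_eval_time() {
  grep 'eval time' | grep -v 'prompt eval time' | tail -n 1
}

# Examples (not run here):
# journalctl -u qwen36-27b-mtp -o cat | sum_draft_acceptance
# journalctl -u qwen36-27b-mtp -o cat | last_eval_time
```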

Next post: same recipe on the MoE variant (Qwen3.6-35B-A3B-MTP-GGUF) — does MTP help an MoE as much as it helps a dense, and does the draft-n-max corner move?
