2026-05-24 — Dan Billings

Qwen3.6 MTP

This is the first post in a series about running open-source models on a single consumer GPU. The goal of the series is reproducibility: anyone with the same hardware should be able to copy-paste their way to the same running system. No "left as an exercise for the reader."

What we're building

A llama.cpp llama-server running Qwen3.6-27B with multi-token prediction (MTP) enabled, on a single RTX 4090 (24 GB). The MTP head lets the model draft several tokens per forward pass and then verify them in parallel.

Headline measured on this 4090 (single slot, 65 K context, 256-token completion):

So if you're going to remember one thing from this post: --spec-draft-n-max 3 on a 4090, never above 3. Unsloth's published guide recommends 2 (their reference hardware is an RTX 6000); on the 4090 the optimum is one notch higher and the gain is bigger than they reported, but you have to stop exactly there.

The recipe is:

The whole thing is also expressed as Scala in src/main/scala/ansible/examples/Rtx4090Setup.scala in this repo. You can either run that playbook or follow the steps below by hand — they produce the same result.

Hardware and OS

A note about WSL2: /usr/lib/wsl/lib must be on LD_LIBRARY_PATH for the GPU runtime, and the cuda-toolkit-12-8 apt package does not put nvcc on the interactive-shell PATH. We handle both in the systemd unit's Environment= lines and in ~/.bashrc for manual builds. If cmake later refuses to find a CUDA compiler, this is why.
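For manual builds, the ~/.bashrc side of that fix might look like the sketch below. It assumes the toolkit lives under /usr/local/cuda-12.8 (the prefix the cmake flags in this post use) and reuses the library directories from the systemd unit; the CUDA_HOME variable name is only a convenience chosen for this sketch.

```shell
# Sketch of ~/.bashrc additions for manual builds under WSL2.
# Assumption: the toolkit is installed under /usr/local/cuda-12.8.
CUDA_HOME=/usr/local/cuda-12.8

# Put nvcc on PATH for interactive shells.
export PATH="$CUDA_HOME/bin:$PATH"

# WSL2 GPU runtime libraries first, then the toolkit's own libraries.
# The ${VAR:+...} form avoids a trailing colon when LD_LIBRARY_PATH was empty.
export LD_LIBRARY_PATH="/usr/lib/wsl/lib:$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

After opening a new shell, command -v nvcc should print a path under /usr/local/cuda-12.8/bin if the toolkit is where this sketch assumes.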

Pinned llama.cpp commit

5d246a792 (master HEAD, 2026-05-24). This includes the cluster of MTP fixes that landed in late May 2026:

Verify the build advertises --spec-type draft-mtp before going further:

git clone https://github.com/ggml-org/llama.cpp /home/dan/llama-build
git -C /home/dan/llama-build checkout 5d246a792
# after building...
/usr/local/bin/llama-server --help | grep -A1 -e '--spec-type'
# expected: none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache

The flag was renamed from --spec-type mtp to --spec-type draft-mtp on May 13 2026; if you copy a recipe from before then, swap accordingly.
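If a script has to work against builds from either side of the rename, one option is to pick the value from the help text. This is a minimal sketch: detect_mtp_spec_type is a hypothetical helper name, and it only assumes that the supported values appear somewhere in the --help output, as in the expected line above.

```shell
# Read llama-server --help output on stdin and print the MTP spec-type
# value this build understands. Returns non-zero if neither name appears.
detect_mtp_spec_type() {
  help_text=$(cat)
  case "$help_text" in
    *draft-mtp*) echo "draft-mtp" ;;   # name after the rename
    *mtp*)       echo "mtp" ;;         # name before the rename
    *)           return 1 ;;           # build has no MTP support at all
  esac
}

# Usage (not run here):
#   spec_type=$(/usr/local/bin/llama-server --help | detect_mtp_spec_type)
```

The draft-mtp pattern is checked first because the older name is a substring of the newer one.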

Build

Dependencies first:

sudo apt install -y cmake build-essential ninja-build ccache cuda-toolkit-12-8

Then configure and build with CUDA on. The -DCUDAToolkit_ROOT and -DCMAKE_CUDA_COMPILER flags are belt-and-suspenders for WSL2 where the PATH may not have been propagated to the cmake subprocess:

cmake -B /home/dan/llama-build/build -S /home/dan/llama-build \
  -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release \
  -DCUDAToolkit_ROOT=/usr/local/cuda-12.8 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc

cmake --build /home/dan/llama-build/build --config Release -j$(nproc)

This takes 10–20 minutes on first run; subsequent rebuilds with a warm ccache take seconds.

Install the binaries:

sudo install -m 0755 /home/dan/llama-build/build/bin/llama-server /usr/local/bin/llama-server
sudo install -m 0755 /home/dan/llama-build/build/bin/llama-bench  /usr/local/bin/llama-bench

Model files

Pull the unsloth MTP build of Qwen3.6-27B at UD-Q4_K_XL (~16.7 GB on disk). Unsloth bakes the MTP head into a single GGUF — there is no separate drafter file to manage. The filename in the repo doesn't carry the MTP tag, only the repo does:

mkdir -p /home/dan/models/mtp
curl -L -o /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  "https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/resolve/main/Qwen3.6-27B-UD-Q4_K_XL.gguf?download=true"

VRAM accounting at 256 K context with the default FP KV cache, on a 4090: the model weights take ~16 GB at Q4_K_XL, the KV cache at full context for 4 slots is the other main consumer, and MTP adds about 1 GB on top of the same model without MTP. We have headroom at this setting; pushing context to 1 M with YaRN means giving something up to fit, which a future post will cover.
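Before pointing the server at the download, it can be worth confirming the file really is a GGUF. The curl command above does not pass --fail, so an HTTP error would be saved to disk as an HTML page under the .gguf name. The sketch below checks the four-byte GGUF magic at the start of the file; is_gguf is a hypothetical helper name.

```shell
# Succeed only if the file starts with the ASCII magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage (not run here):
#   is_gguf /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf || echo "download is not a GGUF file"
```

This only catches a wrong file type; it says nothing about truncation, so comparing the size on disk against the ~16.7 GB figure is still a useful second check.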

The llama-server invocation

The full command, with every flag explained after the block. This is what the systemd unit ends up running:

/usr/local/bin/llama-server \
  --model /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  --ctx-size 262144 \
  --parallel 4 \
  -fa on \
  -b 2048 \
  -ub 512 \
  --cont-batching \
  --jinja \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --metrics \
  --reasoning-format deepseek \
  --chat-template-kwargs '{"enable_thinking":true}'

Flag by flag:

Not used — and worth saying out loud:

Systemd unit

The Scala builder in LlamaServerService renders this:

[Unit]
Description=llama.cpp inference server — qwen36-27b-mtp
After=network.target

[Service]
Type=simple
User=dan
Environment=LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/local/cuda-12.8/lib64:/usr/lib/x86_64-linux-gnu
Environment=CUDA_VISIBLE_DEVICES=0
ExecStart=/usr/local/bin/llama-server --model /home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 --ctx-size 262144 -b 2048 -ub 512 --cont-batching --jinja -fa on --parallel 4 --host 0.0.0.0 --port 8080 --spec-type draft-mtp --spec-draft-n-max 3 --metrics --reasoning-format deepseek --chat-template-kwargs '{"enable_thinking":true}'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Install, reload, and start:

sudo cp qwen36-27b-mtp.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now qwen36-27b-mtp
journalctl -u qwen36-27b-mtp -f

CUDA_VISIBLE_DEVICES=0 pins to the 4090 in case any other GPU shows up (e.g. an iGPU under WSL).

Sanity check

curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen36-27b-mtp",
    "messages": [{"role": "user", "content": "Reply with the single word: ready."}],
    "max_tokens": 8,
    "temperature": 0
  }' | jq -r '.choices[0].message.content'

If you see ready. you're good. If you get "connection refused", check journalctl -u qwen36-27b-mtp; the most common first-run failure is a CUDA library path issue, fixed by the WSL LD_LIBRARY_PATH entry above.
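Loading a model of this size takes a while, so a script that runs the sanity check right after systemctl start may fire before the server is ready. A minimal sketch of a readiness wait is below. It relies on llama-server's /health endpoint, which answers 200 once the model is loaded and 503 while it is still loading; the function name, argument order, and default timeout are this sketch's own choices.

```shell
# Poll /health until it returns HTTP 200 or the timeout expires.
# $1 = base URL, $2 = timeout in seconds (default 120).
wait_for_health() {
  base_url=$1
  remaining=${2:-120}
  while [ "$remaining" -gt 0 ]; do
    # -f makes curl fail on HTTP errors such as 503 "loading model".
    if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 1
}

# Usage (not run here):
#   wait_for_health http://localhost:8080 300 && echo "server ready"
```

If the wait times out, the journalctl output is still the place to look.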

Benchmarks

Test conditions: single slot (--parallel 1), 65 K context, -fa on, default FP KV cache, fixed ~65-token prompt → 256-token completion with temperature: 0. One warmup at 32 tokens before each measurement, server restarted between variants so KV state can't leak through. The deployed config is --parallel 4 --ctx-size 262144; multi-slot numbers are a future post.
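The procedure described above could be scripted roughly as follows. This is a sketch, not the harness that produced the numbers in this post: the prompt text is hypothetical (the real one is not given here), --ctx-size 65536 is an assumed reading of "65 K", and the throughput it prints is the client-side figure, wall clock included.

```shell
MODEL=/home/dan/models/mtp/Qwen3.6-27B-UD-Q4_K_XL.gguf
URL=http://127.0.0.1:8080
PROMPT="Explain speculative decoding in two paragraphs."  # hypothetical prompt

# Translate a sweep value into speculative-decoding flags ("off" disables MTP).
spec_args() {
  if [ "$1" = "off" ]; then
    echo "--spec-type none"
  else
    echo "--spec-type draft-mtp --spec-draft-n-max $1"
  fi
}

# Request a completion of at most $1 tokens and print usage.completion_tokens.
complete() {
  jq -cn --arg p "$PROMPT" --argjson n "$1" \
    '{messages: [{role: "user", content: $p}], max_tokens: $n, temperature: 0}' |
    curl -s "$URL/v1/chat/completions" -H 'Content-Type: application/json' -d @- |
    jq -r '.usage.completion_tokens'
}

# Fresh server per variant, one 32-token warmup, one timed 256-token run.
run_variant() {
  # Word splitting of the spec_args output is intentional.
  /usr/local/bin/llama-server --model "$MODEL" -ngl 99 -fa on \
    --ctx-size 65536 --parallel 1 --port 8080 $(spec_args "$1") >/dev/null 2>&1 &
  pid=$!
  until curl -sf "$URL/health" >/dev/null 2>&1; do
    kill -0 "$pid" 2>/dev/null || return 1   # server died during startup
    sleep 1
  done
  complete 32 >/dev/null                     # warmup, result discarded
  start=$(date +%s.%N)                       # %N requires GNU date
  tokens=$(complete 256)
  end=$(date +%s.%N)
  kill "$pid"; wait "$pid" 2>/dev/null || true
  awk -v n="$1" -v t="$tokens" -v s="$start" -v e="$end" \
    'BEGIN { printf "%s\t%.1f tok/s\n", n, t / (e - s) }'
}

# Run every variant in order; nothing starts until sweep is called.
sweep() {
  for n in off 1 2 3 4 6; do run_variant "$n" || echo "$n failed"; done
}
```

Killing and restarting the server inside run_variant is what keeps KV state from carrying over between variants.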

--spec-draft-n-max        server tok/s   curl tok/s   acceptance   acc/gen drafts   drafter time
off (--spec-type none)    45.5           44.6         -            -                -
1                         65.9           63.7         78.9%        112/142          0.31 s
2                         80.2           77.1         70.3%        149/212          0.44 s
3 (peak)                  84.4           80.7         60.7%        164/270          0.56 s
4 (cliff)                 6.3            6.2          48.3%        167/346          1.29 s
6                         4.9            4.9          36.9%        174/472          1.70 s

What the columns mean: server tok/s is what llama-server's own print_timing reports for the 256-token generation; curl tok/s is end-to-end wall clock from the client side. They agree within 5%, which is reassuring: nothing weird is happening between client and server. acceptance is the fraction of drafted tokens the target model kept, and acc/gen drafts gives the raw counts behind it (accepted / generated draft tokens). drafter time is the total wall time the drafter spent across the whole 256-token completion.
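The derived columns can be recomputed from the raw ones as a consistency check. The sketch below uses the table's own figures; the helper names acceptance_pct and gap_pct are made up for this example.

```shell
# Rows from the benchmark table: n_max, server tok/s, curl tok/s, accepted, generated.
rows='1 65.9 63.7 112 142
2 80.2 77.1 149 212
3 84.4 80.7 164 270
4 6.3 6.2 167 346
6 4.9 4.9 174 472'

# Acceptance = accepted / generated draft tokens, as a percentage.
acceptance_pct() {
  awk -v acc="$1" -v gen="$2" 'BEGIN { printf "%.1f\n", 100 * acc / gen }'
}

# Relative gap between server-side and client-side throughput, as a percentage.
gap_pct() {
  awk -v srv="$1" -v cli="$2" 'BEGIN { printf "%.1f\n", 100 * (srv - cli) / srv }'
}

echo "$rows" | while read -r n srv cli acc gen; do
  echo "n_max=$n acceptance=$(acceptance_pct "$acc" "$gen")% gap=$(gap_pct "$srv" "$cli")%"
done
```

For the draft=3 row this gives an acceptance of 60.7% and a server-to-client gap of 4.4%, matching the table and the "within 5%" statement.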

What the curve says. Acceptance falls roughly linearly as you raise the draft length: 79% → 70% → 61% → 48% → 37%. Each extra draft slot is harder to predict, no surprise. So far so consistent with unsloth.

But throughput is not monotonic. It rises smoothly from 45 → 66 → 80 → 84 tok/s and then collapses between draft=3 and draft=4. The drop is not gradual; one click of the knob takes you from 1.85× faster than baseline to 7× slower than baseline.

Look at the drafter-time column. The drafter takes 0.56 s at draft=3 and 1.29 s at draft=4: 2.3× more wall time for one extra draft slot, even though the work should scale roughly linearly. Best guess: at n_max ≥ 4, the kernel batch size pushes CUDA off a fast path it had at n ≤ 3 on a 4090's compute profile. The acceptance drop (61% → 48%) makes it worse but isn't, by itself, enough to cause a 13× regression. One caveat on attribution: 256 tokens at 6.3 tok/s is roughly 41 s of wall time, and the drafter column accounts for only 1.29 s of it, so the drafter slowdown is the symptom this table makes visible rather than the full explanation. Where the rest of the time goes is not broken out by these measurements.
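The ratios quoted in this section follow directly from the table. The sketch below recomputes them and adds the drafter cost per generated draft token, which is drafter time divided by the generated-drafts count; ratio and ms_per_draft are helper names invented for this example.

```shell
# Quotient of two table values, two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Drafter wall time per generated draft token, in milliseconds.
ms_per_draft() {
  awk -v secs="$1" -v drafts="$2" 'BEGIN { printf "%.2f\n", 1000 * secs / drafts }'
}

ratio 84.4 45.5        # draft=3 vs MTP off      -> 1.85
ratio 45.5 6.3         # MTP off vs draft=4      -> 7.22
ratio 84.4 6.3         # draft=3 vs draft=4      -> 13.40
ratio 1.29 0.56        # drafter time, 4 vs 3    -> 2.30

ms_per_draft 0.56 270  # draft=3                 -> 2.07
ms_per_draft 1.29 346  # draft=4                 -> 3.73
```

Under linear scaling the drafter would need about 0.56 × 4/3 ≈ 0.75 s at draft=4. The measured 1.29 s, or 3.73 ms per draft token against 2.07 ms at draft=3, is the superlinear jump described above.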

The takeaway. On a 4090, with this model + this build, the curve has a hard corner at --spec-draft-n-max 3. Everything ≤ 3 is great. Everything ≥ 4 is worse than turning MTP off. On an RTX 6000 (which is what unsloth measured) the curve apparently rolls off more gently and the recommended setting is 2 — different hardware profile, different sweet spot.

A future post will sweep --parallel {1,2,4,8} and --ctx-size to see whether the corner moves under multi-slot load. For single-slot interactive use, draft=3 is the answer.

What didn't work


Next post: same recipe on the MoE variant (Qwen3.6-35B-A3B-MTP-GGUF) — does MTP help an MoE as much as it helps a dense, and does the draft-n-max corner move?