2026-06-01 — Dan Billings

RTX 5090 + NVFP4 + MTP + TurboQuant: 111 tok/s measured

Dan Billings — 2026-06-01

I was running at 35 tok/s. Technically the 4090 delivered 84 tok/s with MTP at --spec-draft-n-max 3 on a fresh context, but in practice Hermes Agent runs with a near-full 64k context window per slot. At that context size KV cache pressure is real and effective throughput under load dropped to around 35 tok/s.

The 5090 arrived. This post was originally written as a projection before the hardware landed — the architecture reasoning in the sections below holds up, but I've replaced the speculative benchmark table with real measurements and updated the sections that made predictions. Short version: 111.2 tok/s at n=3 on a single slot with the NVFP4-MTP model, and the MTP cliff that killed the 4090 at n=4 does not exist on Blackwell.

Why the 4090 plateaus at 84 tok/s

The 4090 post walks through the full MTP recipe. The short version: Qwen3.6-27B at UD-Q4_K_XL is ~17 GB of weights. At batch=1 (single-token decode), LLM throughput is almost purely bandwidth-bound — you're reading the weights off VRAM for every token. The 4090 has ~1 TB/s of memory bandwidth.

The rough arithmetic: 27B parameters × 0.5 bytes/param (Q4) = 13.5 GB weights read per decode token. At 1 TB/s, that's a theoretical ceiling around 74 tokens/s. Real hardware hits ~60% of peak bandwidth under this workload, which gives ~45 tok/s raw. MTP at n=3 multiplies that to 84 tok/s. That's the ceiling. The 4090 has hit it.

Two things can move that ceiling: more bandwidth, or smaller weights.

What the 5090 changes: bandwidth first

The RTX 5090 has 1.8 TB/s of memory bandwidth and 32 GB of GDDR7 VRAM. The bandwidth number is the important one. Everything else in the inference recipe stays the same — the model, the flags, the MTP approach — but the bandwidth multiplier is 1.8×.

Apply that to the baseline: 45 tok/s × 1.8 = ~81 tok/s raw decode before MTP. With MTP at (presumably) the same n=3 or better, you're looking at 150+ tok/s on the conservative end.

There's a separate compute story for prefill — the FP4 tensor cores on Blackwell do matrix multiplications in a new native format. That dominates at large batch (the Honcho deriver sending 8k-token context batches, long context ingestion, RAG prompts). Prefill on those workloads gets faster in a way that doesn't scale linearly with bandwidth. I don't have numbers for this yet, but it's real and it's additive.

NVFP4 TurboQuant: smaller weights, more VRAM headroom

The model I'm running now is unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL — 17 GB on disk. The NVFP4 TurboQuant variant is michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF (already on HuggingFace). At NVFP4, the weight footprint drops to ~14 GB.

Those 3 GB are what make 256k context possible. A single 256k slot at Q4_0 KV costs ~10 GB. On the 4090 with 17 GB of weights and 24 GB total, you have 7 GB left for KV — not enough. On the 5090 with 14 GB of NVFP4 weights and 32 GB total, 14 + 10 = 24 GB. Fits. The TurboQuant weight savings are the unlock, not the bandwidth.

The unsloth TurboQuant calibration applies to NVFP4 the same way it applies to UD-Q4_K_XL: it's a calibrated quantization pipeline that recovers quality lost in naive quants. Expect NVFP4 TurboQuant quality to be roughly comparable to Q4_K_XL on reasoning tasks. The TurboQuant calibration is the reason you don't just run straight 4-bit.

The MTP cliff question on Blackwell — answered

On the 4090, the MTP cliff is sharp: n=3 peaks at 84 tok/s and n=4 drops to 6 tok/s — 7× slower than no MTP. The 4090 post documents exactly where it comes from: the drafter time jumps from 0.56s to 1.29s for a single extra draft slot, strongly suggesting a CUDA kernel batch size threshold on that compute profile.

The cliff location is hardware-specific. Blackwell has different warp schedulers, different fast-path thresholds, different memory access patterns for small batches. The n=3 corner on a 4090 is not a law of physics; it's an artifact of Ada's CUDA implementation.

Measured on the 5090: no cliff. The full sweep, single slot, 65k context, NVFP4-MTP model, q4_0 KV cache, CUDA 13.3, SM_120 native:

`--spec-draft-n-max`	tok/s	acceptance
0 (none)	73.6	—
1	94.4	85%
2	102.6	68%
3 (peak)	111.2	58%
4	106.0	49%
5	96.9	40%
6	95.5	36%
7	89.4	31%
8	95.6	31%

n=3 is still the peak, but the rolloff is completely graceful. n=4 gives 106 tok/s — a 4.7% step down, not the 93% cliff the 4090 had. The curve flattens out around n=5-6 and stays there. The hypothesis was right: the cliff was Ada-specific. On Blackwell the drafter kernel apparently stays on its fast path well past the n=3 boundary.

VRAM accounting: 256k context, one slot

NVFP4 TurboQuant weights at ~14 GB + a single 256k slot at Q4_0 KV (~10 GB) = 24 GB. Fits on the 5090's 32 GB with room to spare. This configuration isn't possible on the 4090: with 17 GB of Q4_K_XL weights and 24 GB total, there's only 7 GB left for KV — not enough for 256k.

You get one slot. parallel=2 at 256k needs 34 GB and OOMs. The tradeoff is full native context vs. concurrency: run parallel=1 at 256k (Qwen3.6's full training context, compression never fires) or parallel=2 at 128k (two concurrent slots for Hermes + Honcho deriver). Same VRAM footprint either way.

Measured benchmarks

Config	4090 measured	5090 measured	gain
No MTP, raw decode	45.5 tok/s	73.6 tok/s	+62%
MTP n=3	84.4 tok/s	111.2 tok/s	+32%
MTP n=4	6.3 tok/s (cliff)	106.0 tok/s	no cliff

The bandwidth scaling projection (~81 tok/s baseline) was close to the mark. The MTP peak projection (~150 tok/s) was too optimistic — real-world bandwidth utilization doesn't scale as cleanly as the arithmetic suggests. 111 tok/s is still a 32% improvement over the 4090's 84 tok/s peak and 3.2× over the practical 35 tok/s under load. The no-cliff behaviour at n=4 was the real surprise.

The 35 tok/s baseline is the number that matters for daily use: it's what you get with a near-full 64k context window under load, not the single-slot benchmark number. KV cache pressure at that context size is the constraint, and it's real.

What 200 tok/s actually changes

This is not "faster is better" in the abstract. There are specific qualitative thresholds.

At ~45 tok/s (current raw), you can read the output as it streams. You wait for the thinking section to finish. You plan your next message while the model is still generating.

At ~84 tok/s (4090 single-slot peak), the output comes faster than you can comfortably read. The thinking section is still visible but feels like a blur rather than a loading bar.

At ~180-200 tok/s, the model is faster than the reading bottleneck. The practical effect is that responses feel instantaneous rather than streamed. You stop tracking the generation and just read the completed output. This is a meaningful qualitative shift — the interaction pattern changes from "watch the model think" to "read the result."

The other place it shows up is the Honcho deriver. The deriver sends 8k-token batches to the LLM to extract observations. At 35 tok/s, a deriver batch that produces 200 tokens of observations takes ~6 seconds. At 200 tok/s, same batch takes ~1 second. The memory system keeps up with the conversation instead of lagging behind it. This matters for the dream cycle too — more observations extracted per session means the 50-observation threshold for deep induction fires sooner.

Context compression also becomes nearly irrelevant: 128k per slot, Hermes threshold at 90%, means compression only fires after 118k tokens in a single slot. A full coding day's worth of conversation fits without any summarization lossy compression at all.

What the numbers mean in practice

At 111 tok/s on a single slot the output arrives faster than comfortable reading speed. The thinking section is a blur rather than a loading bar. At the multi-slot production config (parallel=2, 128k context) throughput drops from the single-slot peak but the qualitative shift holds.

The NVFP4 quality vs Q4_K_XL comparison remains unmeasured for this specific model — the TurboQuant calibration is designed to recover quality loss but the actual perplexity delta hasn't been run. Subjectively the outputs are indistinguishable in day-to-day use.

The playbook

The Ansible config is at src/main/scala/ansible/examples/Rtx5090Setup.scala in this repo. Key differences from the 4090 recipe:

cuda-toolkit-13-3, compiled with -DCMAKE_CUDA_ARCHITECTURES=120 (SM_120 native)
--ctx-size 131072 with --parallel 2 (two 64k slots; 256k single-slot is possible but cuts concurrency)
--spec-draft-n-max 3 (peak; n=4 is also fine at 106 tok/s — no cliff)
--cache-type-k q4_0 --cache-type-v q4_0 (q4_0 KV throughout; fits the 32 GB budget)
Model: michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF (~15.2 GiB on disk)
BLACKWELL_NATIVE_FP4 = 1 confirmed in server logs — the FP4 tensor core path is live

WSL2 note: add /usr/lib/wsl/lib to /etc/ld.so.conf.d/ and symlink nvidia-smi from there — the toolkit apt package doesn't do either.

Sweep conditions: single slot, 65k context, q4_0 KV, NVFP4-MTP model, CUDA 13.3, SM_120, one warmup then 256-token completion at temperature=0.

← All writings · Home