RTX 5090 + NVFP4 + MTP + TurboQuant: the case for 200 tok/s
Dan Billings — 2026-06-01
I'm running at 35 tok/s right now. Technically I'm running at 84 tok/s — that's what the 4090 delivers with MTP at --spec-draft-n-max 3 on a fresh context — but in practice, Hermes Agent runs with a near-full 64k context window per slot. At that context size, KV cache pressure is real and effective throughput under load drops to around 35 tok/s. That's not bad. But you notice it. The thinking section scrolls, you drift, you type something before the model finishes.
This is a speculative post. I don't have a 5090 yet. What I do have is a clear picture of why the 4090 plateaus where it does and a reasonable forecast of where a 5090 lands. The numbers in the benchmarks section are projections, not measurements. I'll label them clearly. The architecture reasoning is sound, and I think the conclusion holds: a 5090 running NVFP4 + MTP is a qualitative change, not just a numbers change, and the case for buying one is stronger than it looks at first.
Why the 4090 plateaus at 84 tok/s
The 4090 post walks through the full MTP recipe. The short version: Qwen3.6-27B at UD-Q4_K_XL is ~17 GB of weights. At batch=1 (single-token decode), LLM throughput is almost purely bandwidth-bound — you're reading the weights off VRAM for every token. The 4090 has ~1 TB/s of memory bandwidth.
The rough arithmetic: 27B parameters × 0.5 bytes/param (Q4) = 13.5 GB weights read per decode token. At 1 TB/s, that's a theoretical ceiling around 74 tokens/s. Real hardware hits ~60% of peak bandwidth under this workload, which gives ~45 tok/s raw. MTP at n=3 multiplies that to 84 tok/s. That's the ceiling. The 4090 has hit it.
Two things can move that ceiling: more bandwidth, or smaller weights.
What the 5090 changes: bandwidth first
The RTX 5090 has 1.8 TB/s of memory bandwidth and 32 GB of GDDR7 VRAM. The bandwidth number is the important one. Everything else in the inference recipe stays the same — the model, the flags, the MTP approach — but the bandwidth multiplier is 1.8×.
Apply that to the baseline: 45 tok/s × 1.8 = ~81 tok/s raw decode before MTP. With MTP at (presumably) the same n=3 or better, you're looking at 150+ tok/s on the conservative end.
There's a separate compute story for prefill — the FP4 tensor cores on Blackwell do matrix multiplications in a new native format. That dominates at large batch (the Honcho deriver sending 8k-token context batches, long context ingestion, RAG prompts). Prefill on those workloads gets faster in a way that doesn't scale linearly with bandwidth. I don't have numbers for this yet, but it's real and it's additive.
NVFP4 TurboQuant: smaller weights, more VRAM headroom
The model I'm running now is unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL — 17 GB on disk. The NVFP4 TurboQuant variant is michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF (already on HuggingFace). At NVFP4, the weight footprint drops to ~14 GB.
Those 3 GB are what make 256k context possible. A single 256k slot at Q4_0 KV costs ~10 GB. On the 4090 with 17 GB of weights and 24 GB total, you have 7 GB left for KV — not enough. On the 5090 with 14 GB of NVFP4 weights and 32 GB total, 14 + 10 = 24 GB. Fits. The TurboQuant weight savings are the unlock, not the bandwidth.
The unsloth TurboQuant calibration applies to NVFP4 the same way it applies to UD-Q4_K_XL: it's a calibrated quantization pipeline that recovers quality lost in naive quants. Expect NVFP4 TurboQuant quality to be roughly comparable to Q4_K_XL on reasoning tasks. The TurboQuant calibration is the reason you don't just run straight 4-bit.
The MTP cliff question on Blackwell
On the 4090, the MTP cliff is sharp: n=3 peaks at 84 tok/s and n=4 drops to 6 tok/s — 7× slower than no MTP. The 4090 post documents exactly where it comes from: the drafter time jumps from 0.56s to 1.29s for a single extra draft slot, strongly suggesting a CUDA kernel batch size threshold on this specific compute profile.
The cliff location is hardware-specific. Blackwell has a different compute profile — different warp schedulers, different fast-path thresholds, different memory access patterns for small batches. The n=3 corner on a 4090 is not a law of physics; it's an artifact of this generation's CUDA implementation.
My working hypothesis: on a 5090, the cliff moves. n=4 or n=5 may be viable. Finding the cliff is the first benchmark to run when one arrives. If the cliff moves to n=4, that's another ~10-15% speedup on top of the bandwidth and weight gains.
VRAM accounting: 256k context, one slot
NVFP4 TurboQuant weights at ~14 GB + a single 256k slot at Q4_0 KV (~10 GB) = 24 GB. Fits on the 5090's 32 GB with room to spare. This configuration isn't possible on the 4090: with 17 GB of Q4_K_XL weights and 24 GB total, there's only 7 GB left for KV — not enough for 256k.
You get one slot. parallel=2 at 256k needs 34 GB and OOMs. The tradeoff is full native context vs. concurrency: run parallel=1 at 256k (Qwen3.6's full training context, compression never fires) or parallel=2 at 128k (two concurrent slots for Hermes + Honcho deriver). Same VRAM footprint either way.
Projected benchmarks
These are projections from the 4090 numbers scaled by bandwidth ratio and weight savings. Not measurements. I'll update this table when I have the hardware.
| Config | 4090 measured | 5090 projected |
|---|---|---|
| No MTP, raw decode | 45 tok/s | ~81 tok/s |
| MTP n=3 | 84 tok/s | ~150 tok/s |
| MTP at optimal n (TBD) | 84 tok/s | ~180-200 tok/s |
The 5090 projection with MTP at optimal n represents roughly 5-6× over the current real-world 35 tok/s and about 2.4× over the 4090's 84 tok/s clean-context peak. If the MTP cliff moves to n=4 on Blackwell, add another 10-15% on top.
The 35 tok/s baseline is the number that matters for daily use: it's what you get with a near-full 64k context window under load, not the single-slot benchmark number. KV cache pressure at that context size is the constraint, and it's real.
What 200 tok/s actually changes
This is not "faster is better" in the abstract. There are specific qualitative thresholds.
At ~45 tok/s (current raw), you can read the output as it streams. You wait for the thinking section to finish. You plan your next message while the model is still generating.
At ~84 tok/s (4090 single-slot peak), the output comes faster than you can comfortably read. The thinking section is still visible but feels like a blur rather than a loading bar.
At ~180-200 tok/s, the model is faster than the reading bottleneck. The practical effect is that responses feel instantaneous rather than streamed. You stop tracking the generation and just read the completed output. This is a meaningful qualitative shift — the interaction pattern changes from "watch the model think" to "read the result."
The other place it shows up is the Honcho deriver. The deriver sends 8k-token batches to the LLM to extract observations. At 35 tok/s, a deriver batch that produces 200 tokens of observations takes ~6 seconds. At 200 tok/s, same batch takes ~1 second. The memory system keeps up with the conversation instead of lagging behind it. This matters for the dream cycle too — more observations extracted per session means the 50-observation threshold for deep induction fires sooner.
Context compression also becomes nearly irrelevant: 128k per slot, Hermes threshold at 90%, means compression only fires after 118k tokens in a single slot. A full coding day's worth of conversation fits without any summarization lossy compression at all.
What's speculative here
Everything in the benchmarks section is a projection. The bandwidth scaling is a reasonable first-order approximation; real hardware has cache effects, kernel dispatch overhead, and other factors that don't scale linearly. The MTP cliff location on Blackwell is genuinely unknown until someone runs the sweep. The NVFP4 quality comparison to Q4_K_XL is uncharted for this specific model — the TurboQuant calibration mitigates quality loss but the actual numbers aren't published.
The one thing I'm confident in: the bandwidth increase from 4090 → 5090 is real and large, and bandwidth is the bottleneck. The direction of the forecast is right even if the magnitude is approximate.
The playbook, when it arrives
The Ansible config for this is in this repo. It'll follow the same pattern as the 4090 setup — pinned llama.cpp commit, systemd unit with LD_LIBRARY_PATH adjusted for whatever CUDA toolkit Blackwell needs, model pulled from HuggingFace on first run. The flag changes from the 4090 recipe will be:
--ctx-size 262144(1 slot × 256k, or 2 slots × 128k — same VRAM)--parallel 1or--parallel 2depending on whether you want full context or concurrency--spec-draft-n-maxset to whatever the benchmark sweep finds optimal- Model:
michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF
Everything else is the same. Same MTP approach, same KV cache strategy, same systemd scaffolding.
When the hardware arrives, I'll run the actual sweep and update the benchmark table. The interesting question to resolve is where exactly the MTP cliff falls on Blackwell and whether NVFP4 shifts the quality baseline in any meaningful way.