my 4090 and I
Observations on local LLMs
2026-06-01 — Dan Billings
A speculative forecast for running Qwen3.6-27B on an RTX 5090 with NVFP4 TurboQuant quantization and MTP. The bandwidth math, VRAM accounting, and why 200 tok/s is a qualitative shift — not just a faster number.
2026-06-01 — Dan Billings
Hermes has been running with Honcho memory for weeks: the deriver is accumulating observations and the dream cycle is approaching. An explanation of what the deduction and induction passes actually do, surprisal sampling, the peer card, and what changes when dreaming finally fires.
2026-05-30 — Dan Billings
Giving Hermes Agent actual persistent memory via a self-hosted Honcho instance. Dialectical reasoning, VRAM contention, embedding dimension mismatches, and why DeepSeek v4 Pro's cheap tokens are the right engine for this kind of work.
2026-05-25 — Dan Billings
After llama.cpp refused to load the Gemma 4 drafter, I tried what Google's announcement actually said to use. vLLM serves it — 1.13× over baseline at n=2 on a 24 GB 4090. The interesting question is why the speedup is modest compared to Qwen 3.6's 1.85×: MoE blunts MTP.
2026-05-24 — Dan Billings
Google shipped Gemma 4 MTP for transformers / MLX / vLLM / SGLang / Ollama. llama.cpp isn't on the list. I tried anyway. It doesn't work yet — here's exactly where it breaks and what upstream would need to change.
2026-05-24 — Dan Billings
A reproducible walkthrough for getting llama.cpp MTP working on a single RTX 4090 with unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL. Every llama-server flag explained.