my 4090 and I

Observations on local LLMs

RTX 5090 + NVFP4 + MTP + TurboQuant: the case for 200 tok/s

A speculative forecast for running Qwen3.6-27B on an RTX 5090 with NVFP4, TurboQuant, and MTP. The bandwidth math, the VRAM accounting, and why 200 tok/s is a qualitative shift rather than just a faster number.


Honcho's dream cycle: how an AI memory system teaches itself

Hermes has been running with Honcho memory for weeks: the deriver is accumulating observations and the dream cycle is approaching. An explanation of what deduction and induction passes actually do, surprisal sampling, the peer card, and what changes when dreaming finally fires.


Honcho memory on Hermes Agent

Giving Hermes Agent actual persistent memory via a self-hosted Honcho instance. Dialectical reasoning, VRAM contention, embedding dimension mismatches, and why DeepSeek v4 Pro's cheap tokens are the right engine for this kind of work.


Gemma 4 mtp, vllm

After llama.cpp refused to load the Gemma 4 drafter, I tried what Google's announcement actually said to use. vLLM serves it — 1.13× over baseline at n=2 on a 24 GB 4090. The interesting question is why the speedup is modest compared to Qwen 3.6's 1.85×: MoE blunts MTP.


Gemma 4 mtp

Google shipped Gemma 4 MTP for transformers / MLX / vLLM / SGLang / Ollama. llama.cpp isn't on the list. I tried anyway. It doesn't work yet — here's exactly where it breaks and what upstream would need to change.


Qwen 3.6 mtp

A reproducible walkthrough for getting llama.cpp MTP working on a single RTX 4090 with unsloth/Qwen3.6-27B-MTP-GGUF at UD-Q4_K_XL. Every llama-server flag explained.